Getting started with Sycamore is fast and easy, and I'll show you how to launch a small demo in minutes. Sycamore is a conversational search and analytics platform for complex unstructured data, such as documents, presentations, transcripts, embedded tables, and internal knowledge repositories. It retrieves and synthesizes high-quality answers by bringing AI to data preparation, indexing, and retrieval. Sycamore makes it easy to prepare unstructured data for search and analytics, providing a Python toolkit for data cleaning, information extraction, enrichment, summarization, and generation of vector embeddings that encapsulate the semantics of your data. You can query your data using a RAG pipeline, hybrid search (via an OpenSearch-compatible API), and analytical functions. For more information, visit the documentation.
In this blog post, we will launch a Sycamore stack locally using Docker and pretend we are financial analysts interested in learning more about Amazon’s Q3 2023 earnings. We will download and ingest these financial reports into our stack and then ask some questions using conversational search. To simplify dependencies, we aren't going to enable table extraction, which requires Amazon Textract (and AWS credentials). However, we recommend using table extraction for asking questions about data like this, and Part 2 of this blog post will enable Textract and power tougher questions on this dataset.
Let’s get started!
Launch Sycamore using Docker
To launch Sycamore, you'll first need to install Docker. If you don't already have Docker installed, visit here. Second, you need an OpenAI key so Sycamore can use OpenAI's large language model (LLM) service for entity extraction and RAG data flows. Keep in mind that you will accrue costs for this usage, though they will likely be negligible. You can create an OpenAI account here, or if you already have one, you can retrieve your key here.
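A common way to supply the key is through an environment variable in the shell where you launch the stack; the launch instructions linked below cover exactly how Sycamore picks it up. A minimal sketch (the key value here is a placeholder, not a real key):

```shell
# Placeholder value for illustration; substitute the key from your OpenAI account.
export OPENAI_API_KEY="sk-your-key-here"

# Sanity-check that the variable is set before launching the stack.
if [ -z "$OPENAI_API_KEY" ]; then
  echo "OPENAI_API_KEY is not set" >&2
else
  echo "OPENAI_API_KEY is set"
fi
```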
Next, follow these instructions to launch a Sycamore stack using Docker to get started in seconds.
We also recommend adjusting the Docker memory setting to 6 GB and the Swap setting to 4 GB. You can make these adjustments in the “Resources” section of the “Settings” menu, which is accessed via the gear icon in the top right of the Docker Desktop UI.
You will know the Aryn stack is ready when you see log messages similar to:
No changes at [datetime] sleeping
Ingest the earnings report data
Next, let’s ingest our Amazon Q3 2023 earnings reports. You will use the Aryn crawlers to download and ingest the files. In a new terminal window, run:
docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf
docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf
In the terminal window with the original Docker Compose command, you’ll see log messages from Sycamore running a new data processing job. The stack will use the default data preparation script for segmentation and chunking, data extraction, and creating vector embeddings. You can find the default data preparation script here.
When the new files are prepared and loaded into Sycamore's vector and term-based indexes, you will see log messages similar to:
No changes at [datetime] sleeping
Ask questions and search
Now that the data is prepared and loaded, you can go to the Sycamore demo query UI for conversational search over it. Using your internet browser, visit http://localhost:3000 to access the demo query UI. Create a new conversation by entering a name in the text box in the "Conversations" panel and hitting enter.
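If the page doesn't load, you can confirm from the command line that the UI is reachable. A quick sketch (this assumes the stack from the earlier Docker Compose step is still running on its default port):

```shell
# Probe the demo UI; curl prints the HTTP status code ("000" means no connection yet).
code=$(curl -s -o /dev/null -w "%{http_code}" --max-time 5 http://localhost:3000 || true)
if [ "$code" = "200" ]; then
  echo "Demo UI is up at http://localhost:3000"
else
  echo "Demo UI not reachable yet (HTTP $code); check the compose logs"
fi
```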
You can now ask questions on this data. Let’s start by asking “How much impact did AWS have on the Q3 2023 Amazon earnings?” and hit enter. You’ll see Sycamore spring into action: question rewriting using generative AI, hybrid search results (semantic + keyword search) returned in the right panel, and a conversational answer to the question generated using generative AI and RAG in the middle panel.
You can follow up by asking "What AI related events happened with it?" and you will see how Sycamore uses conversational memory to understand prior context. It relates "it" to AWS (from the prior interaction) and rephrases the search query accordingly. You can click the citation links or the documents in the hybrid search results to jump to a highlighted section of the source documents.
Next, you could ask “What was the effect of Rivian on the Q3 2023 Amazon earnings?” and you'll see a similar set of data flows. You can also create other conversations in parallel, ask additional questions, or ingest more data by replacing the URL in the HTTP crawler commands you ran above.
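If you have several documents to ingest, you can loop over a list of URLs instead of running the crawler command by hand each time. A minimal sketch using the two report URLs from this post (the `echo` makes this a dry run that only prints each command; remove it to actually invoke the crawler against a running stack):

```shell
# URLs of the two reports ingested in this post; append your own to ingest more.
urls="https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf
https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf"

for url in $urls; do
  # Dry run: prints the command. Remove `echo` to run the crawler for real.
  echo docker compose run sycamore_crawler_http "$url"
done
```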
If you want to shut down and clean up your Sycamore stack:
docker compose down
docker compose run reset
This example uses the default Sycamore data preparation script to process public financial documents and make them available for conversational search. However, these documents contain many tables, and to ask detailed questions about those tables, we need to use table extraction in our preparation phase. Iterating on data preparation is easy with Sycamore, and you can check out the second part of this example, which enables table extraction for better answers on data in tables.
In this example, you launched a Sycamore stack using Docker, prepared and ingested Amazon’s Q3 2023 earnings reports, and ran conversational search over that dataset. If you have any feedback or questions, please email us at email@example.com
Or, join our Sycamore Slack group here: