  • Jon Fritz

Answer questions on tables with Sycamore's table extraction transform

Updated: Feb 14

When building a conversational search application, you need to consider how to get the highest quality answers on unstructured data. The downstream data flows in a search query (e.g. RAG) rely on retrieving the correct data from a set of indexes (Sycamore uses hybrid search, which combines semantic and term-based retrieval). Furthermore, the ability to retrieve the parts of your dataset most relevant to a query depends directly on how that data was prepared and enriched before being indexed.


Sycamore not only provides RAG pipelines, but also enables you to focus on getting your data ready for search. You can prepare and enrich complex unstructured data for search and analytics through advanced data segmentation, LLM-powered UDFs for data enrichment, performant data manipulation with Python, and vector embeddings using a variety of AI models.


In unstructured datasets, it’s common to find tables throughout the documents. To answer questions from the data in these tables with high quality, you often need to extract and process the tables during the preparation data flow. Luckily, Sycamore makes table extraction easy, and we’ll show you an example in this post using Amazon Textract as the underlying extractor for PDFs. Table extraction for HTML files is already enabled in the default script, and it does not require Amazon Textract.


AWS prerequisites for using Amazon Textract


This example will build from our previous blog post here, and you will configure and relaunch your Sycamore stack to use table extraction. We will assume you have already run through those instructions before starting this example. 


Sycamore has the ability to utilize a variety of libraries and web services during the data preparation process. In this example, we will use the table extraction transform for PDFs and configure it to use Amazon Textract as the mechanism for the extraction. Therefore, you will need AWS credentials that can access Textract in the us-east-1 region and an Amazon S3 bucket in us-east-1. Please note that you will be charged for AWS resources consumed, though it should be negligible. If you do not have an AWS account, sign up here.


First, install the AWS CLI here


Next, you can enable AWS SSO login with these instructions, or you can use other methods to configure the values for AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and if needed AWS_SESSION_TOKEN.

If using AWS SSO:


aws sso login --profile your-profile-name
eval "$(aws configure export-credentials --format env --profile your-profile-name)"
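
Before relaunching the stack, it can help to confirm that the credential variables named above are actually set in the current shell. The following is a minimal sketch, not part of Sycamore: the helper name `missing_aws_credentials` is hypothetical, and it only checks that the variables are present, not that the credentials are valid or can reach Textract.

```python
import os

# Variable names taken from the post. AWS_SESSION_TOKEN is only needed
# for temporary credentials, so it is not treated as required here.
REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_credentials(env=None):
    """Return the required AWS variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_aws_credentials()` in the same shell where you ran the commands above should return an empty list if the export succeeded.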

Next, create an Amazon S3 bucket for use with Textract. Make sure to set the region to us-east-1.


aws s3 mb s3://your-bucket-name --region us-east-1 --profile your-profile-name

Then, set the configuration for Textract's input and output S3 location:


export SYCAMORE_TEXTRACT_PREFIX=s3://your-bucket-name
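
A common mistake at this step is setting the variable to a bare bucket name without the s3:// scheme. As a rough sketch, the value can be sanity-checked with the Python standard library before relaunching. The function name `split_s3_prefix` is hypothetical and is not part of Sycamore; how Sycamore itself parses this value is not shown in this post.

```python
from urllib.parse import urlparse


def split_s3_prefix(prefix):
    """Split an s3://bucket/key-prefix string into (bucket, key_prefix).

    Hypothetical helper for illustration. Raises ValueError when the
    value does not look like an S3 URI.
    """
    parsed = urlparse(prefix)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"expected s3://bucket[/prefix], got {prefix!r}")
    # The key prefix is optional; strip the leading slash if present.
    return parsed.netloc, parsed.path.lstrip("/")
```

For example, `split_s3_prefix(os.environ["SYCAMORE_TEXTRACT_PREFIX"])` (with `import os`) would raise a `ValueError` if the value were set to `your-bucket-name` instead of `s3://your-bucket-name`.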

Relaunch the Sycamore stack with the new configuration

Take down and relaunch your Sycamore stack from the previous blog post if you have not already done so. This will also ensure the previous indexes you created are deleted. Run these commands in the folder where you downloaded the Docker compose files:


docker compose down
docker compose run reset

Now, you will relaunch your Sycamore stack:


docker compose up --pull=always

Sycamore will start back up. You will know the stack is ready when you see log messages similar to:


No changes at [datetime] sleeping
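
If you capture the stack's log output to a file or pipe, spotting this message can be automated. The sketch below assumes the log line contains text of the form shown above; the exact format, including any container-name prefix and the timestamp layout, is an assumption, and both function names are hypothetical.

```python
import re

# Assumed pattern, based only on the sample message in this post.
IDLE_PATTERN = re.compile(r"No changes at .+ sleeping")


def is_idle_line(line):
    """Return True if a log line looks like the idle message."""
    return IDLE_PATTERN.search(line) is not None


def first_idle_index(lines):
    """Return the index of the first idle message in lines, or None."""
    for index, line in enumerate(lines):
        if is_idle_line(line):
            return index
    return None
```

The same check applies later in this post, when the crawler finishes loading new files and the message appears again.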

Ingest the earnings report data


Next, let’s ingest the Amazon Q3 2023 earnings reports into the relaunched Sycamore stack using table extraction. Sycamore will automatically configure the default data preparation script to use table extraction for PDFs if you have configured AWS credentials. It adds this code to the default script:

table_extractor = TextractTableExtractor(
    region_name="us-east-1",
    s3_upload_root=os.environ["SYCAMORE_TEXTRACT_PREFIX"],
)

ctx.read.binary(
    paths,
    binary_format="pdf",
    filter_paths_by_extension=False,
).partition(
    partitioner=UnstructuredPdfPartitioner(),
    table_extractor=table_extractor,
)

Sycamore will now use Textract for table extraction during the preparation data flow. In a new terminal window, run:


docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf

docker compose run sycamore_crawler_http https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf
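
If you have more than a couple of documents to load, the two commands above can be wrapped in a small script. This is a sketch under the assumption that it runs from the folder containing the Docker compose files; the `sycamore_crawler_http` service name comes from the commands above, while the function names are hypothetical.

```python
import subprocess

# The two report URLs used in this post.
REPORT_URLS = [
    "https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/AMZN-Q3-2023-Earnings-Release.pdf",
    "https://s2.q4cdn.com/299287126/files/doc_financials/2023/q3/64aef25f-17ea-4c46-985a-814cb89f6182.pdf",
]


def crawler_command(url):
    """Build the docker compose command that crawls a single URL."""
    return ["docker", "compose", "run", "sycamore_crawler_http", url]


def ingest(urls, run=subprocess.run):
    """Run the HTTP crawler once per URL, stopping on the first failure.

    `run` is injectable so the command construction can be tested
    without starting Docker.
    """
    for url in urls:
        run(crawler_command(url), check=True)
```

Calling `ingest(REPORT_URLS)` runs the same two crawler commands in sequence.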

In the terminal window with the original docker compose command, you’ll see log messages from Sycamore running a new data processing job. When the new files are loaded, you will see log messages similar to:


No changes at [datetime] sleeping

Ask questions and search


Now that the newly prepared data is loaded, you will return to the Sycamore demo query UI to ask some questions on the tables in the 10-Q earnings report. Using your internet browser, visit http://localhost:3000 to access the demo query UI. Create a new conversation by entering a name in the text box in the "Conversations" panel and pressing Enter.


First, ask “What were the net sales for Amazon in Q3 2023?” and we’ll get back the answer and a citation to where the data was found. This data was in a table in the document, and would not have been retrieved without the table extraction step. Next, we may want to see how this was divided across business units. Let’s use the conversational features of Sycamore and ask “Can you break this up into business units?”



Sycamore returns the requested information, and this data was also taken from a table in the 10-Q document. Using table extraction for PDFs enabled Sycamore to retrieve information from the tables in the dataset and return high-quality answers.



Finally, we can ask “What was the biggest operating expense in Q3 2023?” and Sycamore will retrieve the correct value from the data and provide a citation.




You can ask other sample questions on this dataset. When you are finished, you can shut down and clean up your Sycamore stack:


docker compose down
docker compose run reset

You can also delete the S3 bucket you created.


Conclusion


This second part of the blog series shows how data preparation and enrichment can have a large impact on conversational search quality. In this example, we used Sycamore's table extraction transform for PDFs to easily extract tables from our dataset and make this data available for high-quality search.


For more examples of data preparation and enrichment with Sycamore, check out this tutorial on using Jupyter to write data preparation jobs.


If you have any feedback or questions, please email us at: info@aryn.ai 


Or, join our Sycamore Slack group here:

