By Jon Fritz and Ben Sowell
At Aryn, we spend time helping users build search and analytics applications over complex unstructured datasets — an extension of RAG that we call LUnA (LLM-powered unstructured analytics). We've found that the way you segment, enrich, and chunk complex, unstructured data is critical for generating high-quality answers. Sycamore, our open source document processing engine, is purpose-built for this. Instead of trying to process a document all at once, Sycamore first decomposes it into its constituent components, which can be of various types such as paragraphs, tables, and images. Then, it applies the best AI model to each component based on its type to extract and process it with high fidelity.
In this blog post, we're excited to show you how to use our new Aryn Partitioner to accomplish these ETL tasks. We're also excited to share that we've released a new open source, Apache v2.0-licensed AI model for high-fidelity document segmentation, the first step of partitioning. We then show you how to use component-specific AI models for high-quality table extraction, OCR, and figure summarization.
Sycamore DocSets and document partitioning
Sycamore gives users a scalable and robust abstraction for document processing: DocSets. A DocSet is similar to a DataFrame in Apache Spark, but instead of operating on structured records, you transform and manipulate collections of unstructured documents. DocSets are schema-free, so each document in a DocSet can have a different structure.
The DocSet represents each document as a tree of elements, where each element represents a chunk or component of the document. Elements can represent different types of components, such as text or images. Documents and elements also have additional associated metadata called properties. Because the elements are stored as a tree, Sycamore can process each element individually while retaining its context in the whole document. Supported transformations on DocSets range from simple formatting to complex AI-based techniques. And, with DocSets, scalability and fault-tolerance are built-in.
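To make this concrete, here is a minimal sketch of reading PDFs into a DocSet with the partitioner we introduce below and walking each document's element tree. The attribute names (elements, type, bbox, properties) follow Sycamore's document model, but treat the exact import paths and APIs as assumptions to verify against the documentation for your version:

import sycamore
from sycamore.transforms.partition import ArynPartitioner

context = sycamore.init()
ds = context.read.binary(paths=["/path/to/folder"], binary_format="pdf") \
       .partition(partitioner=ArynPartitioner())

# take(1) materializes one document; each document holds a tree of elements
for document in ds.take(1):
    for element in document.elements:
        print(element.type, element.bbox, element.properties)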
To process raw PDFs and bring them into DocSets, Sycamore must first segment the document and label each element, such as headings, tables, and figures. This process is called document segmentation, and is a critical step in processing unstructured data. Partitioner transforms can use a variety of techniques, from hardcoded heuristics to advanced AI models, to identify bounding boxes and label unstructured components.
With Sycamore, users can choose from multiple partitioners and their respective configurations. We initially included off-the-shelf open source partitioners as part of our stack. However, for real-world use cases, we quickly found that these partitioners lacked the fidelity and accuracy we needed to get high-quality results for RAG and unstructured analytics. So, we built our own partitioner powered by our new document segmentation AI model.
We are excited to introduce the Aryn Partitioner in the latest release. The first version focuses on PDFs, and it includes a newly trained object detection model that provides better accuracy in segmenting and labeling documents.
But most importantly, the Aryn Partitioner and associated AI models are 100% open source under the Apache v2.0 license. You don't need to sign up or pay for anything – it's free to use with Sycamore or in your own projects. We believe the data community should have access to the latest-and-greatest tech in open source. However, we hope you do join our Sycamore Slack group and share your feedback (or contribute!).
Using object detection for better segmentation
You've likely seen complex PDFs, ranging from technical manuals for equipment to scientific journals. Headings might have nested sub-headings, tables can have cells that span multiple rows or columns, and text might be broken into multiple columns and separated by images. These complicated layouts tend to confuse standard OCR and extraction techniques, and will leave you with a mess.
Many segmentation approaches use an AI model for object detection and labeling to try to solve this problem. Object detection models take an image as input and generate bounding boxes that identify specific parts of that image. The bounding boxes correspond to whatever the model was trained to identify, and some object detection models can also label each bounding box. Data preparation systems can then use this information to merge elements into right-sized chunks, enrich them with metadata, or run OCR. The quality of your semantic search or RAG application is therefore directly correlated with how well your partitioner segments your unstructured data, and ultimately with how well that data is prepared.
When experimenting with open source partitioners built on different object detection models, we found that they struggled to consistently segment and label complex PDF documents across various use cases. Furthermore, these partitioners didn't allow us to apply the best state-of-the-art AI model to each document component type to get the highest-quality extraction.
So, we embarked on a journey to create our own partitioner with a new object detection model at its core. First, we wanted to start with a model built with the latest AI architectures, and we chose the Deformable DEtection TRansformer (DETR) model. However, existing DETR models were not trained on enterprise data and business documents, and we realized that we’d need to train our own.
Enter DocLayNet – an open source, human-annotated document layout segmentation dataset containing tens of thousands of pages from a broad variety of document sources. We used this dataset to train our DETR model, and the result was much better object detection and labeling for enterprise documents. More information about the model is on our Hugging Face model page.
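As an illustration, here is a hedged sketch of running a DETR-style layout model directly with the Hugging Face transformers library. It follows the standard transformers object detection flow rather than Sycamore's internal implementation, and the checkpoint name is an assumption; see our Hugging Face model page for the actual model:

import torch
from PIL import Image
from transformers import AutoImageProcessor, DeformableDetrForObjectDetection

# Checkpoint name is illustrative; use the model from Aryn's Hugging Face page
checkpoint = "Aryn/deformable-detr-DocLayNet"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = DeformableDetrForObjectDetection.from_pretrained(checkpoint)

image = Image.open("page.png").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs to labeled bounding boxes above a confidence threshold
results = processor.post_process_object_detection(
    outputs, threshold=0.5, target_sizes=[image.size[::-1]]
)[0]
for score, label, box in zip(results["scores"], results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], f"{score:.2f}", box.tolist())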
Here is a comparison of the previous partitioner, which relies on a rules-based extraction method, and our new Aryn Partitioner (with the DETR model) segmenting and labeling the same document:
With the new DETR model, the Aryn Partitioner accurately draws the proper bounding boxes around each component and correctly labels each component's type. The other partitioner, on the other hand, over-segments most of the document and fails to identify the chart and table as distinct component types.
Our trained DETR model is at the heart of the Aryn Partitioner, and it enables a variety of downstream ETL. Within our partitioner, we also included high-quality table extraction, image processing, and OCR features in our initial release. We discuss and show examples of these below.
Easily extract your tables with high fidelity
When building the new Aryn Partitioner, we saw an opportunity to include high-quality table extraction. When our DETR model identifies and labels a table, the Aryn Partitioner uses the Table Transformer model to outline each cell in the table. We then use PDFMiner to extract the text from the page, and intersect each extracted text element with the cell outlines to populate the DocSet's table representation. With this representation, Sycamore stores the contents of each cell along with table properties, and lets you manipulate the table in subsequent transforms. Sycamore can even convert the table to different formats, like HTML, CSV, Pandas, and more!
For a quick demo, we ran a Sycamore job on the document shown above, which contains a table, a graph, and various headers and text blocks. We used this notebook, which you can also use to test the Aryn Partitioner on other documents. It's easy to configure partitioning and table extraction:
ds = context.read.binary(paths=["/path/to/folder"],\
binary_format="pdf")\
.partition(partitioner=ArynPartitioner\
(extract_table_structure=True))
Below, we visualize the output of the Aryn Partitioner. It correctly identifies and labels each part of the document, including the multi-colored table in the top right:
The partitioner uses PDFMiner to extract the text, then finds the intersection of the text and the cell bounding boxes to construct the table as an element in the DocSet.
In cases where a table has a cell that spans several rows or columns (or other complex formatting), the model will identify this and properly represent the table. Here is an example showing a table with a column header ("USD billion") that spans three columns, and three sub-headers ("Revenue by business," "By region," and "By Industry") that span four columns:
The Aryn Partitioner draws the correct bounding boxes across the cells in the table. We can explore the table’s representation to see how the partitioner extracted the cells and stored the table. We can also use the convert_to_html transform, and then visualize the HTML output for the table (an excerpt is below):
With the table extraction process, the Aryn Partitioner has information about every cell and header in the table. For instance, it can correctly place spanning headers like "Revenue by business."
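To see what this looks like in practice, here is a hedged sketch of pulling the extracted tables back out of the DocSet and converting them. The attribute and method names (element.table, to_html, to_csv) reflect Sycamore's table representation but may differ across versions, so treat this as illustrative:

# Walk the elements of one document and convert any extracted tables
for document in ds.take(1):
    for element in document.elements:
        if element.type and element.type.lower() == "table" and element.table:
            print(element.table.to_html())  # spanning cells become rowspan/colspan
            print(element.table.to_csv())   # flattened, spreadsheet-friendly view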
Identifying and processing images with multi-modal LLMs
Similar to tables, the Aryn Partitioner can identify images in your documents and process them in various ways. Using multi-modal LLMs, it can create text summaries and extract metadata from images. This enables you to load images with relevant metadata into your vector and keyword indexes, and to retrieve them during a search query. DocSets have a specific element type for images that handles metadata differently. For instance, when creating vector embeddings for an image, Sycamore can instead use the supplied text summary of the image to create the vectors.
You can choose to send images along with a prompt to a multi-modal LLM like GPT-4-turbo to create helpful metadata like summaries or classifications:
context = sycamore.init()
ds = context.read.binary(paths=paths, binary_format="pdf") \
       .partition(partitioner=ArynPartitioner(extract_images=True)) \
       .transform(SummarizeImages)
ds.show()
From the example document above, we can extract and process this image of a graph:
Using GPT-4V, the SummarizeImages transform gets this description of the image and adds it as metadata in the DocSet:
{
'is_graph': True,
'x-axis': 'Total Investment (in Billions of U.S. Dollars)',
'y-axis': 'Country',
'summary': "The bar graph displays the total investment in billions of U.S. dollars by various countries. The United States leads with a significant margin at approximately 47.36 billion dollars, followed by China with about 13.41 billion dollars, and the United Kingdom with around 4.37 billion dollars. Other countries like Israel, India, and South Korea also show investments ranging from 3 to 4 billion dollars. The graph clearly illustrates the disparity in investment amounts among the countries, with the United States investing more than three times the amount of the second highest, China."
}
The summarization gives additional context to the image in text, which is helpful for downstream search and analytics queries on this data.
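As a hedged sketch, you could read those generated summaries back out of the DocSet like this; the property name ("summary") matches the output shown above, but the exact element type string and property layout are assumptions to check against your Sycamore version:

# Print the LLM-generated summary attached to each image element
for document in ds.take(1):
    for element in document.elements:
        if element.type and element.type.lower() == "image":
            print(element.properties.get("summary"))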
Processing documents that require OCR
Many datasets consist of images of documents, such as scanned, signature-based documents. For document images, you need to include an OCR step in your processing pipeline to extract the text. An accurate segmentation model is important here, because the OCR models pull text from each labeled part of the image. Because it can accurately segment images of documents, the Aryn Partitioner's DETR model works well in combination with open source models like EasyOCR. The Aryn Partitioner can also run OCR on tables.
Below is a snippet from an example notebook using the OCR feature to extract text from a scanned PDF. Each element in the document was identified and segmented using the DETR model:
ds = context.read.binary(paths=[str(path)], binary_format="pdf")\ .partition(ArynPartitioner(use_ocr=True))\
.explode()
The output from the notebook is a DocSet that includes the extracted text as part of each element in the DocSet.
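For reference, EasyOCR can also be run standalone on a page image. This is a minimal sketch of the library's basic API, independent of Sycamore, which can be handy for sanity-checking OCR quality on your own scans (the file name is a placeholder):

import easyocr

# Build a reader for English; model weights are downloaded on first use
reader = easyocr.Reader(["en"])

# readtext returns (bounding_box, text, confidence) for each detected text span
for bbox, text, confidence in reader.readtext("scanned_page.png"):
    print(f"{confidence:.2f}\t{text}")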
Bringing it all together in a DocSet
Here’s what it looks like if you want to run all of these operations on the document in one go:
ds = context.read.binary(paths=[str(path)], binary_format="pdf")\ .partition(ArynPartitioner(extract_table_structure=True, use_ocr=True,\ extract_images=True))\
.transform(SummarizeImages)\
.explode()
Once you have the partitioned documents in a DocSet, each element carries its type (e.g. table), its relevant properties and metadata, and its context within the rest of the document. Using the show() function in a notebook, we can display the elements from the prior document. In the screenshot below, you can see some of the elements with bounding boxes, text, and other properties for headers, footers, and a table:
Now, you can take this DocSet and continue your ETL processing. For example, you can extract and add more metadata (see our blog post on schema extraction for an example), transform specific elements, assemble chunks, and choose what to create vector embeddings for (and how). To get high-quality answers from your RAG or unstructured analytics pipelines, these steps are often needed to add the additional context that yields better retrieval and more accurate answers.
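As one hedged example of that downstream processing, here is a sketch that extracts a document title with an LLM, spreads it onto each element, and creates vector embeddings. The class names come from Sycamore's transforms, but the exact signatures (and the prompt) are assumptions to verify against the documentation:

from sycamore.llms import OpenAI, OpenAIModels
from sycamore.transforms.embed import SentenceTransformerEmbedder
from sycamore.transforms.extract_entity import OpenAIEntityExtractor

llm = OpenAI(OpenAIModels.GPT_3_5_TURBO.value)
title_prompt = "Extract the title of the document from the following elements."

# Assuming `ds` is a freshly partitioned DocSet (i.e. before the explode()
# call shown earlier): extract a title per document, copy it onto each
# element, explode, and embed each element's text.
ds = ds.extract_entity(entity_extractor=OpenAIEntityExtractor(
           "title", llm=llm, prompt_template=title_prompt)) \
       .spread_properties(["title"]) \
       .explode() \
       .embed(embedder=SentenceTransformerEmbedder(
           model_name="sentence-transformers/all-MiniLM-L6-v2", batch_size=100))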
But, it all starts with high-fidelity partitioning.
Get started with the Aryn Partitioner
The new Aryn Partitioner is included in the latest containerized release and the Sycamore Python library, and it currently operates on PDFs. You can use the partitioner with a CPU or an NVIDIA GPU, though larger processing jobs will run faster on a GPU. Our DETR AI model that powers the partitioner is 100% open source with an Apache v2.0 license, and we encourage you to try it out in Sycamore (also open source) or use it in your own projects. And feel free to improve on it! You can fine-tune it for your own purposes, or contribute back so that we can make it better for everyone.
To learn more about the Aryn Partitioner's Deformable DETR model, visit Aryn's Hugging Face model page. To learn more about the Aryn Partitioner, visit the documentation. To get started with Sycamore, visit the getting started guide.
Join the Sycamore Slack: https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg