Wrangling your gnarly PDF documents for chunking and processing just got a lot easier! We’re excited to announce the launch of the Aryn Partitioning Service (APS), a serverless, GPU-powered API for segmenting and labeling PDF documents, doing OCR, extracting tables and images, and more. APS runs the Aryn Partitioner and its state-of-the-art, open source deep learning DETR AI model trained on 80k+ enterprise documents. This can lead to 6x more accurate data chunking and 2x improved recall on hybrid search or RAG when compared to off-the-shelf systems.
The Aryn Partitioning Service takes PDFs and returns the partitioned output in JSON. You can use it to partition documents and extract information directly in your code, or use it with Sycamore for additional processing. Sign up here for free and use the Aryn Playground to see how it segments and processes your own documents. Or, watch the Aryn Partitioning Service in action in this video.
Easy, high quality, GPU-powered document partitioning
When preparing complex, unstructured data for RAG applications, GenAI use cases, or general document processing workflows, the ability to partition and extract information from these documents is a critical step. In this stage, a document is broken down into its constituent parts, like paragraphs, tables, images, headers, and more, and then different techniques are used to extract and process those parts (e.g. OCR and table extraction).
In May, we released the Aryn Partitioner, which is used in Sycamore data processing scripts as an initial step to break apart PDFs into their constituent, labeled parts. It uses our purpose-built, modern AI model trained on DocLayNet, a collection of more than 80,000 labeled enterprise documents. This model offers 6x better precision (mAP) and 4x better recall (mAR) than alternatives. In real world applications, we also found that this model could handle complexity that others could not. (Don’t believe us? Try it with your own documents in the Aryn Playground.)
Although users could leverage the Aryn Partitioner for high-quality segmentation processing, there was one issue - to get the fastest performance, you needed to run it on NVIDIA GPU hardware. This isn’t unreasonable - modern AI models generally require GPUs - but this requirement didn’t make it easy for users to try the partitioner on laptops or easily use it on a standard Amazon EC2 instance type. Also, because the other data processing steps when using the Sycamore engine are CPU-heavy, the costly GPU is underutilized for a significant portion of the job.
To make it easy and more cost effective for users to leverage the Aryn Partitioner, we built the Aryn Partitioning Service (APS). It’s serverless and leverages NVIDIA GPU technology, meaning that it’s blazing fast and you don’t need to provision and manage your own GPU capacity. All you need is an API key for Aryn Cloud (sign up here for free), and you can get started in seconds.
You can use the Aryn Partitioning Service in your scripts to convert your PDFs into a variety of text formats. You can also use it with the Sycamore document processing engine when writing a job. However, the easiest way to test out the service is using our Playground. We’ll explore these below.
Test your PDFs in the Playground
You can drag and drop your PDF files into the Aryn Playground to visualize the segmentation and download the output. All you need is your Aryn Cloud API key (if you don’t have one, sign up here for free). Go to the Playground, select your PDF, optionally change the configuration values, and chunk it!
Depending on your document, you might need to “Re-chunk” with different options. You can change the Threshold to adjust the sensitivity of the document segmentation, or try “enable OCR” for documents where certain elements are not being identified. In some cases, even a PDF that isn’t a scanned image needs OCR for proper processing.
You can also download the JSON output of the API, or the processed PDF with labeled bounding boxes. The Playground limits the PDF to the first 25 pages, so if you’d like to experiment with larger documents, use the service directly with one of the options below.
Use partitioned output directly in custom app
It’s common for developers and data scientists to have applications or custom data workflows that need to do unstructured data processing. For example, your custom data preparation code for a GenAI application might require chunking PDFs or extracting tables for use with LLMs. These applications can call the Aryn Partitioning Service directly with the Aryn SDK or curl: they upload a PDF and receive the partitioned document with its extracted components. This gives your applications access to high-quality document partitioning, table extraction, and more via a single API request.
With the Aryn SDK
You can use the Aryn Python SDK to call the Aryn Partitioning Service from your scripts. To install it:
pip install aryn-sdk
Then, you can partition a PDF and use the partitioned output in your application. You can specify several options, like table and image extraction, and a list of these options is here.
from aryn_sdk.partition import partition_file

aryn_api_key = "YOUR-KEY"

# Partition the document and extract table structure
with open('/path/to/document.pdf', 'rb') as f:
    partitioned_file = partition_file(f, aryn_api_key, extract_table_structure=True)

# Take an element from the partitioned document
table_data = partitioned_file['elements'][5]
The output contains bounding boxes for each element, labels for type of element, and the extracted representation of each element (e.g. text for an element labeled as a paragraph) and a data structure down to the cell level for tables.
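As a rough illustration of working with that output, the sketch below groups elements by their label and collects the text of the paragraph-like ones. Only the `elements` key comes from the SDK example above. The per-element field names (`type`, `bbox`, `text_representation`), the label values, and the sample data are assumptions made for this example, so check them against the JSON the service returns for your own document.

```python
from collections import defaultdict

# Hypothetical sample shaped like a partitioned document. Only the
# "elements" key comes from the SDK example; the per-element field names
# ("type", "bbox", "text_representation") and all values are assumptions.
sample_output = {
    "elements": [
        {"type": "Title", "bbox": [0.1, 0.05, 0.9, 0.1], "text_representation": "Annual Report"},
        {"type": "Text", "bbox": [0.1, 0.12, 0.9, 0.3], "text_representation": "Revenue grew."},
        {"type": "Text", "bbox": [0.1, 0.32, 0.9, 0.5], "text_representation": "Costs fell."},
        {"type": "table", "bbox": [0.1, 0.55, 0.9, 0.8], "text_representation": "Q1 Q2\n10 12"},
    ]
}


def group_by_type(partitioned):
    """Map each element label to the list of elements that carry it."""
    groups = defaultdict(list)
    for element in partitioned.get("elements", []):
        groups[element.get("type", "unknown")].append(element)
    return dict(groups)


groups = group_by_type(sample_output)
for label, elements in groups.items():
    print(f"{label}: {len(elements)} element(s)")

# Collect the text of every paragraph-like element for downstream chunking.
paragraph_text = [e["text_representation"] for e in groups.get("Text", [])]
```

Grouping by label first makes it easy to route tables, images, and text to different downstream steps.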
If you are enabling OCR on PDFs with a large number of pages (more than 100), we recommend batching the pages across requests. You can specify a page range for your document using the selected_pages option:
# Process pages 25-30
'options={"selected_pages":[[25,30]]}'
# Process page 11
'options={"selected_pages":[11]}'
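One way to batch a long document is to compute the page ranges up front and send one request per range. The helper below is a minimal sketch of that idea. `page_batches` is a hypothetical name, not part of the Aryn SDK, and the sketch assumes that page numbers start at 1 and that a `[start, end]` range includes both endpoints, as the 25-30 example suggests.

```python
def page_batches(total_pages, batch_size):
    """Split pages 1..total_pages into inclusive [start, end] ranges.

    Hypothetical helper; assumes 1-based page numbers and inclusive ranges.
    """
    if total_pages < 1 or batch_size < 1:
        raise ValueError("total_pages and batch_size must be positive")
    batches = []
    for start in range(1, total_pages + 1, batch_size):
        end = min(start + batch_size - 1, total_pages)
        batches.append([start, end])
    return batches


# Each range becomes the selected_pages value for one request.
for batch in page_batches(230, 100):
    selected_pages = [batch]
    print(selected_pages)
```

For a 230-page document and a batch size of 100, this yields three requests covering pages 1-100, 101-200, and 201-230. The partitioned elements from each response can then be concatenated in order.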
There are other options you can choose, and a list of those options is here. Also, a notebook with an Aryn SDK example is here.
With curl
You can also use curl directly in your script. For example, a request that extracts tables and images looks like:
aryn_api_key="YOUR-KEY"
curl -s -N -D headers.txt "https://api.aryn.cloud/v1/document/partition" \
  -H "Authorization: Bearer $aryn_api_key" -F "pdf=@/path/to/document.pdf" \
  -F 'options={"threshold":0.4,"extract_table_structure":true,"extract_images":true}' \
  | tee output.json
You can also choose additional options, similar to the Aryn SDK. Also, a notebook with an example curl command is here.
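The `options` form field in the curl request is a JSON string, and hand-writing it inside shell quotes is easy to get wrong. If you assemble requests from a script, a sketch like the one below can generate the string for you. `build_options` is a hypothetical helper, not part of any Aryn tooling, and it only covers the option names that appear in this post.

```python
import json


def build_options(threshold=0.4, extract_table_structure=False,
                  extract_images=False, selected_pages=None):
    """Return the JSON string for the request's options form field.

    Hypothetical helper; it only knows the option names used in this post.
    """
    options = {
        "threshold": threshold,
        "extract_table_structure": extract_table_structure,
        "extract_images": extract_images,
    }
    # Only include a page selection when the caller asked for one.
    if selected_pages is not None:
        options["selected_pages"] = selected_pages
    # Compact separators match the style of the curl example.
    return json.dumps(options, separators=(",", ":"))


options_field = build_options(extract_table_structure=True, extract_images=True)
print(options_field)
```

The resulting string can be passed as the value of the `options` form field, alongside the `pdf` field, in whichever HTTP client your script uses.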
Use in a data processing job with Sycamore engine
By default, your Sycamore job will use the Aryn Partitioning Service in the Partition transform when selecting the Aryn Partitioner (instead of running it locally). Just specify your aryn_api_key:
.partition(partitioner=ArynPartitioner(aryn_api_key="YOUR-KEY"))
You can also specify other options for the Aryn Partitioning Service to extract tables, do OCR, and more. The list of those options is here.
Similar to the selected_pages option above, if you are enabling OCR on PDFs with a large number of pages (more than 100), we recommend using the pages_per_call option to specify the number of pages to run in each batch. This helps divide up the processing to more efficiently process the document across multiple requests.
Most of Aryn’s example notebooks use the Aryn Partitioning Service by default, and an example where you can visualize labeled bounding boxes in the notebook is here.
Get started today!
The Aryn Partitioning Service is finally here, and just an API call away from handling your gnarliest PDFs. You can use it via the Playground, directly with the Aryn SDK or curl, or in a Sycamore script. All you need is an API key (sign up here for free). We’d love to hear your feedback on the service or any feature requests you have for your workloads.
To learn more, visit the Aryn Partitioning Service documentation.
Email us: info@aryn.ai
Join the Sycamore Slack: https://join.slack.com/t/sycamore-ulj8912/shared_invite/zt-23sv0yhgy-MywV5dkVQ~F98Aoejo48Jg