You’ve used Aryn DocParse to break apart your document and extract tables, images, and more into a structured JSON format. But…now what? If you want to load this data into your vector database for semantic search or RAG, we’re excited to introduce you to Aryn DocPrep!
DocPrep is a UI wizard for building Python ETL pipelines that process documents and load them into a target vector database. It lets you run the generated code directly in a Google Colab environment or export it as a .ipynb file to run in a local Jupyter notebook, and it's available in both the Aryn Playground and the Aryn Console. You can also fully customize the generated pipeline code by editing it in a notebook. For instance, you can add additional transforms or use embedding models not available in the DocPrep UI.
DocPrep makes it easy to configure the different steps in your pipeline, like chunking strategy and embedding models, using a wizard. Let’s take a look.
Create a DocPrep pipeline
Let’s get to the DocPrep UI through the Aryn Playground by choosing DocPrep on the landing page:
Next, choose the documents you want to prepare and load into your vector database. You can choose documents stored in Amazon S3 buckets, documents stored locally, or files you want to upload to Google Colab. The output ETL pipeline will be configured to read from the location you choose. If you don't have a file in mind, you can select a sample NTSB report (in PDF format) from an S3 bucket.
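To give a sense of what this choice maps to, here is a rough sketch of how the read step typically appears in the generated Sycamore code. The bucket name, file names, and local paths below are placeholders, not values the wizard produces:

```python
import sycamore

ctx = sycamore.init()

# Reading from an S3 bucket (bucket and key are placeholders):
docset = ctx.read.binary(paths=["s3://my-bucket/ntsb-report.pdf"], binary_format="pdf")

# Or reading from a local path, e.g. a file uploaded to Colab storage:
docset = ctx.read.binary(paths=["/content/my-report.pdf"], binary_format="pdf")
```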
In this walkthrough, let's configure our ETL pipeline to use a document we will upload to Colab storage:
Then, we will select the data transforms, chunking, and vector embedding options for our pipeline. DocPrep provides default configurations for Aryn DocParse and the chunking strategy, but you can change them in the optional section if needed.
We will keep the default embedding model, OpenAI's text-embedding-3-small. To use this model, you need to provide an OpenAI API key in the environment where you run the pipeline. One advantage of using OpenAI to generate embeddings is that you don't need to wait a few extra minutes to download the dependencies required to run an embedding model locally. But if you don't mind the extra installation time, you can choose MiniLM-L6-v2 and run it locally:
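In the generated Sycamore code, this choice roughly corresponds to which embedder is passed to the embed step. The snippet below is a sketch under that assumption; the exact class names and arguments may differ slightly from what DocPrep emits:

```python
from sycamore.transforms.embed import OpenAIEmbedder, SentenceTransformerEmbedder

# Default: OpenAI's hosted model (requires OPENAI_API_KEY in the environment).
openai_embedder = OpenAIEmbedder(model_name="text-embedding-3-small")

# Alternative: run MiniLM locally (slower first run while dependencies download).
local_embedder = SentenceTransformerEmbedder(
    model_name="sentence-transformers/all-MiniLM-L6-v2"
)

# The pipeline then applies the chosen embedder, e.g.:
# docset = docset.embed(embedder=openai_embedder)
```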
Lastly, we will choose the target vector database and configure its connector. In this example, we'll choose DuckDB and run it in the notebook. This makes it easy to quickly test our pipeline without needing to create and configure an external database:
Now that we’ve selected our options, let’s click “Generate pipeline”. We now have our ETL pipeline code in a notebook, which we can inspect in the UI:
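The generated notebook is ordinary Sycamore code, so it is worth skimming before you run it. As a rough, hedged sketch (the paths, parameter names, and DuckDB writer arguments below are illustrative; your notebook shows the real ones), the pipeline follows this shape:

```python
import sycamore
from sycamore.transforms.partition import ArynPartitioner
from sycamore.transforms.embed import OpenAIEmbedder

ctx = sycamore.init()

docset = (
    ctx.read.binary(paths=["/content/my-report.pdf"], binary_format="pdf")
    # Partition with Aryn DocParse (tables, images, etc.), using your Aryn API key.
    .partition(partitioner=ArynPartitioner(extract_table_structure=True, extract_images=True))
    # Turn each element into its own document so it can be embedded and stored individually.
    .explode()
    # Embed each chunk with the model chosen in the wizard.
    .embed(embedder=OpenAIEmbedder(model_name="text-embedding-3-small"))
)

# Write to the target vector database; the writer and its arguments depend on
# the connector you picked (DuckDB in this walkthrough), so check the notebook for the exact call.
docset.write.duckdb(db_url="demo.db", table_name="demo_table", dimensions=1536)
```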
Next, you can run the generated pipeline in a Google Colab notebook or download it as a .ipynb file to run in a Jupyter notebook locally. We'll run it in Colab, so click that option and Colab will open in a new browser tab.
Run ETL pipeline in Colab
Our pipeline will use Aryn DocParse and OpenAI, so we will need to configure the API keys for each in our Colab environment. Click the key icon on the left to add these secrets, and then enable them for “Notebook access” using the slider on each row:
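If the pipeline code reads these keys from environment variables rather than directly from Colab secrets, you can bridge the two with a small cell like the one below. The secret names here are just the ones we used; match whatever names your notebook expects:

```python
import os
from google.colab import userdata

# Copy the Colab secrets into environment variables so the pipeline
# (the Aryn DocParse client and the OpenAI embedder) can pick them up.
os.environ["ARYN_API_KEY"] = userdata.get("ARYN_API_KEY")
os.environ["OPENAI_API_KEY"] = userdata.get("OPENAI_API_KEY")
```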
The pipeline code includes a cell that makes it easy to upload your local files to Colab so the pipeline can access them. Run this cell, and a “Choose Files” button will appear in the cell output. Click the button and choose the file(s) to upload:
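That upload cell is typically just the standard Colab file-upload helper; a minimal version looks like this:

```python
from google.colab import files

# Opens a "Choose Files" button in the cell output; uploaded files land in
# /content, the notebook's working directory.
uploaded = files.upload()
print(list(uploaded.keys()))
```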
Once your files are uploaded, you will also be able to see them in the file browser in the left nav.
Now you can run the rest of the cells in the notebook! You'll see a visual representation of the document partitioning as the output of one of the cells, too. The final cell of the notebook is a quick query to verify that the documents were loaded into DuckDB:
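That verification step is a plain DuckDB query. A minimal equivalent, assuming the database file and table name used earlier in this walkthrough (yours may differ), looks like this:

```python
import duckdb

# The database file and table name are whatever the pipeline wrote to;
# adjust these to match your notebook.
conn = duckdb.connect("demo.db")
print(conn.sql("SELECT count(*) FROM demo_table").fetchone())
conn.sql("SELECT * FROM demo_table LIMIT 5").show()
```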
Next steps
Congrats! You just created a DocPrep pipeline to load your vector database with high-quality output from Aryn DocParse. You can now choose to create a RAG pipeline or run semantic search on the data in your vector index. For instance, for DuckDB in the example above, you can find and use sample code for semantic search and RAG (with LangChain) at the end of this notebook.
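To give a flavor of what that sample code does, here is a hedged sketch of semantic search against the DuckDB table: embed the question with the same model used during ingestion, then rank rows by cosine similarity. The database path, table name, and column names (demo_table, embeddings, text_representation) are assumptions; use the ones from your notebook.

```python
import duckdb
from openai import OpenAI

# Embed the query with the same model used during ingestion.
client = OpenAI()  # uses OPENAI_API_KEY from the environment
question = "What factors contributed to the incident?"
query_embedding = client.embeddings.create(
    model="text-embedding-3-small", input=question
).data[0].embedding

# Rank stored chunks by cosine similarity to the query embedding.
# Assumes 'embeddings' is stored as a list-of-floats column; adjust to your schema.
conn = duckdb.connect("demo.db")
rows = conn.execute(
    """
    SELECT text_representation,
           list_cosine_similarity(embeddings, ?) AS score
    FROM demo_table
    ORDER BY score DESC
    LIMIT 5
    """,
    [query_embedding],
).fetchall()

for text, score in rows:
    print(f"{score:.3f}  {(text or '')[:100]}")
```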
Additionally, you can experiment with different settings in your ETL pipeline. You can select different options in the DocPrep wizard, or edit and add to the ETL pipeline code directly in your notebook. DocPrep generates code using the open source Sycamore library, which can be customized as you build your application.
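For example, because the notebook is plain Sycamore code, you can splice your own transforms into the chain before the embedding step. This is a sketch only, assuming a docset variable like the one in the generated pipeline:

```python
# Sketch: drop very short chunks before embedding
# (assumes `docset` is the DocSet built earlier in the generated pipeline).
docset = docset.filter(
    lambda doc: doc.text_representation is not None and len(doc.text_representation) > 50
)
```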
We’d love to hear what other options you’d like to configure in DocPrep! Drop us a note at info@aryn.ai or on Slack.