August 14, 2025

Reliably parsing millions of PDF pages with Aryn DocParse and BigQuery

Mehul A. Shah, CEO

Recently, one of our customers offering an AI-powered insurance risk platform wanted to release a new headlining feature: a RAG-based agent specialized for different geographies. Insurance is a highly localized business due to region-specific risk factors and regulations. They wanted the agent to answer natural language questions across millions of pages of insurance documents, mostly PDFs filled with complex tables and figures.

For the agent to be fast, accurate, and trustworthy, the data it relied on needed to be high quality. After trying a number of options, including those from the hyperscalers, our customer settled on Aryn to parse their PDFs because of its quality, cost-effectiveness, and flexibility.

To meet their deadlines, they needed to scale their document processing pipelines in a hurry. They are a GCP shop, and they build all of the pipelines that feed their agents on BigQuery. At first, the choice of BigQuery for document processing seemed puzzling, but after the implementation the reasons became clear: with a BigQuery and DocParse integration, the resulting pipelines are entirely serverless. They are easy to spin up, easy to scale up and down, easy to experiment with, and easy to maintain.

In this blog post, we describe the scaling challenges our customer ran into and how we helped them overcome them. With this integration of BigQuery and DocParse, they reliably parsed millions of pages in a few days. We have made the example code available for those of you who want to start tinkering right away.

The agent architecture

The customer implemented a standard RAG pipeline in GCP. They store documents in Google Cloud Storage (GCS), and a document processing pipeline built on BigQuery feeds their vector database. Their agent uses semantic search, rerankers, and an LLM to answer user questions. BigQuery makes it easy to experiment with chunking strategies and vector embedding models, and to test search quality.

Scaling the Aryn DocParse and BigQuery integration

Our customer’s initial approach was to use BigQuery user-defined functions (UDFs) to process each document. A Python UDF synchronously called the DocParse API using the Aryn SDK. They stored the parsed results (JSON) in a central table along with the URI of the original document and its hash. They then chunked and embedded the parsed results using additional SQL operations.
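
To make the starting point concrete, here is a minimal sketch of that synchronous step: the function body a Python UDF would wrap, which downloads a PDF from GCS and parses it with the Aryn SDK's partition_file call. The table shape, URI parsing, and function name are illustrative assumptions, not the customer's actual code.

```python
# Sketch of the original synchronous approach (illustrative, not the customer's code).
import io
import json

from google.cloud import storage
from aryn_sdk.partition import partition_file

def parse_document(gcs_uri: str, aryn_api_key: str) -> str:
    """Synchronously parse one document and return the DocParse JSON as a string."""
    bucket_name, blob_path = gcs_uri.removeprefix("gs://").split("/", 1)
    pdf_bytes = storage.Client().bucket(bucket_name).blob(blob_path).download_as_bytes()

    # Blocks until DocParse finishes the whole document -- the step that
    # silently timed out inside UDFs for very large PDFs.
    result = partition_file(io.BytesIO(pdf_bytes), aryn_api_key=aryn_api_key)
    return json.dumps(result)
```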

This simple synchronous approach worked well when testing at small scale with a few average-sized documents. However, once the customer started to scale up the workload, errors quickly arose that stalled the pipeline. We discovered that the UDFs were silently timing out and retrying large documents, resulting in wasted work. Requests also queued up, so the pipeline made little progress. Given their imminent deadline, they asked us for help.

BigQuery and DocParse two-phase pipeline

We helped scale their workload using two key techniques:

  1. Two-phase asynchronous processing. We split the parsing into two phases using DocParse’s asynchronous APIs. In the first phase, we send a parsing request for each document to DocParse and store the resulting asyncIDs in the central table. In the second phase, we use the asyncID to check the request status and fetch the parsed results. Splitting the work lets each UDF call complete quickly, eliminating the timeouts (see the first sketch after this list).
  2. Streaming results. We also discovered that we could not fetch the parsed JSON results for very large documents. The Aryn SDK was buffering the entire result in memory and exhausting it, leading to out-of-memory errors. We improved the fetching logic in the Aryn SDK to stream the results back incrementally inside the UDF. We then stored small parsed results in the BigQuery table and streamed the large parsed results into GCS (see the second sketch after this list).
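
Below is a minimal sketch of the two-phase pattern. Phase 1 submits each document and returns the asyncID to store in the central table; phase 2 polls for status and fetches the parsed results. The helper names (partition_file_async_submit / partition_file_async_result) and the response keys follow the shape of the Aryn SDK's asynchronous API but should be treated as assumptions; consult the SDK documentation for the exact interface.

```python
# Two-phase asynchronous sketch (function and key names are assumptions).
import io

from google.cloud import storage
from aryn_sdk.partition import partition_file_async_submit, partition_file_async_result

def submit_phase(gcs_uri: str, aryn_api_key: str) -> str:
    """Phase 1: kick off parsing and return the asyncID. Fast, so no UDF timeout."""
    bucket, path = gcs_uri.removeprefix("gs://").split("/", 1)
    pdf = storage.Client().bucket(bucket).blob(path).download_as_bytes()
    response = partition_file_async_submit(io.BytesIO(pdf), aryn_api_key=aryn_api_key)
    return response["task_id"]  # stored in the central table as the asyncID

def fetch_phase(async_id: str, aryn_api_key: str):
    """Phase 2: check status; return parsed results when ready, or None to poll again."""
    result = partition_file_async_result(async_id, aryn_api_key=aryn_api_key)
    # The key names ("task_status", "result") are assumptions -- check the SDK docs.
    if result.get("task_status") == "done":
        return result.get("result")  # parsed elements, ready to chunk and embed
    return None  # still pending; the next poll picks it up
```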

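The SDK-level streaming change itself is not reproduced here, but the size-based routing downstream can be sketched as follows: small parsed results go into the BigQuery row, while large ones are written to GCS and referenced by URI. The cutoff, bucket name, and column names are assumptions for illustration.

```python
# Sketch of size-based routing for parsed results (threshold and bucket are assumptions).
import json

from google.cloud import storage

MAX_INLINE_BYTES = 5 * 1024 * 1024          # assumed cutoff for storing JSON in-row
RESULTS_BUCKET = "example-parsed-results"   # hypothetical bucket name

def route_result(doc_hash: str, parsed: dict) -> dict:
    """Return the values to store in the central table for this document."""
    payload = json.dumps(parsed)
    if len(payload.encode("utf-8")) <= MAX_INLINE_BYTES:
        return {"parsed_json": payload, "result_uri": None}

    # Too big to keep in-row: write to GCS and store only a pointer.
    blob = storage.Client().bucket(RESULTS_BUCKET).blob(f"parsed/{doc_hash}.json")
    blob.upload_from_string(payload, content_type="application/json")
    return {"parsed_json": None, "result_uri": f"gs://{RESULTS_BUCKET}/{blob.name}"}
```
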
Takeaways

The BigQuery + DocParse pipeline above is entirely serverless, which keeps operations simple for developers. Our insurance-tech customer quickly scaled up and processed millions of pages over a few days without worrying about infrastructure, provisioning, or configuration details. Their PDFs ranged from a few pages to over 10,000 and contained complex tables and figures. DocParse’s accuracy was better than the alternatives, including those from the hyperscalers, resulting in a more reliable agent. In addition, with our improved solution, they were able to track incremental status during these multi-day runs, scale up and down quickly, and pick up from where they left off.

You can download our scripts from GitHub and use them to process your own document collections. The scripts use Aryn DocParse’s asynchronous APIs for parsing, and BigQuery stored procedures and UDFs for queueing and processing. They periodically report the number of documents that are finished, in progress, or failed. This tracking also made it easy to retry failed documents or remove them (e.g., when a PDF was corrupted). There are also a few helper scripts to extract GCS checksums so you can deduplicate documents with a SQL query.
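
To illustrate the checksum idea, here is a hedged sketch that lists the objects in a GCS bucket and records each object's stored MD5 hash, which you could then load into BigQuery and group on to find duplicates. The bucket name, output file, and CSV layout are assumptions, not the actual helper scripts.

```python
# Sketch: dump GCS object checksums to a CSV for later deduplication in BigQuery.
import csv

from google.cloud import storage

def dump_gcs_checksums(bucket_name: str, out_csv: str = "checksums.csv") -> None:
    client = storage.Client()
    with open(out_csv, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["gcs_uri", "md5_base64"])
        for blob in client.list_blobs(bucket_name):
            # GCS stores a base64-encoded MD5 for most objects; duplicate files share it.
            writer.writerow([f"gs://{bucket_name}/{blob.name}", blob.md5_hash])

# dump_gcs_checksums("example-insurance-docs")  # hypothetical bucket name
```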

If you’re using BigQuery to orchestrate PDF parsing workflows, we encourage you to check out these scripts to build reliable and scalable pipelines. You can get started today with Aryn DocParse’s free trial! If you have any questions or feedback, drop us an email (info@aryn.ai) or join our Slack.