Aryn

June 9, 2025

Vision OCR and new text extraction settings in DocParse

Karan Sampath, Engineer

We’re excited to share some updates to the DocParse API that make it even easier to integrate with your document processing applications.


Text Mode

We have added a new text_mode parameter with four options, which makes it easier to specify how you want DocParse to extract text from your document. Use this parameter instead of the now-deprecated use_ocr.

By default, DocParse uses an intelligent, automatic setting (specified in the API as inline_fallback_to_ocr) to decide on the optimal text extraction settings. If the document contains embedded text, DocParse extracts it using the inline option; otherwise, it uses the standard_ocr option (more on these below).
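The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the decision as stated in this post, not DocParse's implementation: the function name resolve_text_mode and the has_embedded_text flag are hypothetical, and only the four text_mode values come from the API.

```python
# The four text_mode values named in this post.
TEXT_MODES = ("inline_fallback_to_ocr", "inline", "standard_ocr", "vision_ocr")


def resolve_text_mode(has_embedded_text: bool,
                      text_mode: str = "inline_fallback_to_ocr") -> str:
    """Return the extraction mode that would effectively be used.

    Hypothetical helper: it mirrors the rule described in the post and
    does not call DocParse.
    """
    if text_mode not in TEXT_MODES:
        raise ValueError(f"unknown text_mode: {text_mode!r}")
    if text_mode == "inline_fallback_to_ocr":
        # Default: prefer embedded text, otherwise fall back to standard OCR.
        return "inline" if has_embedded_text else "standard_ocr"
    # Any explicit choice is used as given.
    return text_mode


print(resolve_text_mode(has_embedded_text=True))   # inline
print(resolve_text_mode(has_embedded_text=False))  # standard_ocr
```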

However, you can specify exactly how DocParse extracts text by using one of the other three options:


1. Embedded Text Extraction (inline): This option uses an improved fine-grained text extraction technique to retrieve the text embedded within the document. We show the quality improvements of Aryn's fine-grained approach in the examples below:


Generic method output:


This text was extracted using a general embedded text extraction technique, and the problems with the table cells are visible on inspection.


Aryn’s fine-grained method output:


Aryn's fine-grained text extraction approach identifies the tables correctly and captures each smaller cell properly.


2. Standard OCR (standard_ocr)

Aryn uses an optimized AI model for OCR, which is our standard selection for most documents. We discuss it in more depth in this blog post, and show an example of its 40x improvement in reducing character recognition errors (including symbols like ‘$’ and ‘%’).


3. Vision OCR (vision_ocr)

DocParse’s Pay As You Go (PAYG) customers can choose another OCR option using a Vision Language Model (VLM). For some documents, we have seen VLMs outperform our standard OCR model, and we recommend trying this option if needed.

Here’s an example table where the vision_ocr option outperforms standard_ocr:


Vision OCR Model (correct extraction):

vision_5 = "(acid α -glucosidase deficiency)."
vision_8 = "LOW in late forms of Pompe disease"


Standard OCR model:
ocr_5 = "(acid a -glucosidase deficiency)." (error: "a" is not the same as "α")
ocr_8 = "LOw in late forms of Pompe disease" (error: "LOw" is not fully capitalized)
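To make the two errors above easy to spot, here is a minimal sketch that compares the strings character by character. The helper char_differences is hypothetical and handles only equal-length strings; the four sample strings are copied from the example above.

```python
from typing import List, Tuple


def char_differences(expected: str, actual: str) -> List[Tuple[int, str, str]]:
    """List (index, expected_char, actual_char) wherever two strings differ.

    Hypothetical helper for illustration; it only handles equal-length strings.
    """
    if len(expected) != len(actual):
        raise ValueError("this simple check only handles equal-length strings")
    return [(i, e, a) for i, (e, a) in enumerate(zip(expected, actual)) if e != a]


vision_5 = "(acid α -glucosidase deficiency)."
ocr_5 = "(acid a -glucosidase deficiency)."
vision_8 = "LOW in late forms of Pompe disease"
ocr_8 = "LOw in late forms of Pompe disease"

print(char_differences(vision_5, ocr_5))  # [(6, 'α', 'a')]
print(char_differences(vision_8, ocr_8))  # [(2, 'W', 'w')]
```

A check like this is useful when evaluating the two OCR options against a small set of hand-verified strings from your own documents.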


Text Extraction Options

We’ve added a new text_extraction_options parameter, which enables further customization of the text extraction output. We will add more options over time. The first one is:


remove_line_breaks: Removes line breaks from text in an intelligent manner. For example, on this page:


Default Output:

In the past, lead in the ambient air \\n and in food contributed much more \\n to overall exposure of the general U.S. \\n population than it does now. Because \\n lead from these sources has been drasti-\\n cally reduced, blood-lead levels in U.S. \\n children and the general population are \\n lower than they were 1 or 2 decades ago \\n (EPA 2004b). However, despite the de-\\n creased use of lead in food canning, gas-\\n oline and paint, lead will remain in our \\n environment for many years. While it is \\n critical that recognized sources of lead \\n be minimized in a child’s environment, \\n


With remove_line_breaks enabled:

In the past, lead in the ambient air and in food contributed much more to overall exposure of the general U.S. population than it does now. Because lead from these sources has been drasti cally reduced, blood-lead levels in U.S. children and the general population are lower than they were 1 or 2 decades ago (EPA 2004b). However, despite the de creased use of lead in food canning, gas oline and paint, lead will remain in our environment for many years. While it is critical that recognized sources of lead be minimized in a child’s environment,
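As a rough sketch of how these settings might be assembled before calling DocParse, the snippet below builds an options dictionary. The helper build_docparse_options is hypothetical, and the nested shape of text_extraction_options (a dictionary with a boolean remove_line_breaks key) is an assumption; check the documentation for the exact request format.

```python
import json

# The four text_mode values named in this post.
TEXT_MODES = {"inline_fallback_to_ocr", "inline", "standard_ocr", "vision_ocr"}


def build_docparse_options(text_mode: str = "inline_fallback_to_ocr",
                           remove_line_breaks: bool = False) -> dict:
    """Assemble an options dictionary (hypothetical helper, no API call)."""
    if text_mode not in TEXT_MODES:
        raise ValueError(f"unknown text_mode: {text_mode!r}")
    options = {"text_mode": text_mode}
    if remove_line_breaks:
        # Assumed shape: a nested dict keyed by option name.
        options["text_extraction_options"] = {"remove_line_breaks": True}
    return options


print(json.dumps(build_docparse_options("vision_ocr", remove_line_breaks=True), indent=2))
```

Validating text_mode locally catches typos before a request is sent, which is the main reason to wrap the options in a small helper like this.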


You can learn more about these new options in our documentation. We’d also love to know which options work best for you, so drop us a note!
