By Vinayak Thapliyal

When RAG runs out of steam, use schema extraction and analytics with Sycamore

Unstructured data is usually free-form and schemaless. It ranges from PDFs, emails, and blog posts like this one to images and video files. While the types of unstructured data vary massively, you can often find loose structure within a dataset of unstructured documents. Examples include scanned tax forms (deduction tables), earnings reports (revenue statistics), and emails (sender and receiver names).


Sycamore is an LLM-powered search and analytics platform for unstructured data, built on OpenSearch. With Sycamore, you can easily implement RAG pipelines, but simple RAG has limitations. For example, RAG assumes that the answer lies in the top-ranked results, because those are what gets passed to the LLM for generation. To improve RAG quality, Sycamore lets you run analytics (e.g. filtering and aggregation) to ensure the top-ranked results contain the right information. Furthermore, you can run analytics independently to answer complex questions that RAG simply cannot address.


To use analytics on your data in Sycamore, you first need to extract structured metadata to serve as attributes for these operations. When ingesting data, Sycamore Data Prep enables you to easily and automatically detect patterns across unstructured documents, extract structured metadata, and use those fields for analytics. In this post, we walk you through how to use Sycamore Data Prep’s unsupervised schema extraction and population, and then use this metadata for analytics when querying your data. This blog post is a follow-on to our previous post on LUnA (LLM-powered Unstructured Analytics) and describes the data layer of Sycamore, which offers a 100% OpenSearch-compatible API.

 

When RAG is not enough


In this post, we’ll be using aviation incident reports from the National Transportation Safety Board (NTSB). The NTSB is a U.S. government agency that investigates civil transportation incidents. Each aviation incident is reported in a PDF that contains a mix of specific required fields and free-form textual analysis of the incident. An example report can be found here.


When building a conversational interface over this data, a traditional RAG approach can be quite limiting. RAG assumes the answer is contained within the top K results (to feed the LLM context window), and it places all of the burden on keywords and vector embeddings to return the right passages for an arbitrary query. Many times, that breaks down, leading a RAG pipeline to produce an incomplete or incorrect answer.


But, there is a way to deal with this! In this example, we’ll go beyond traditional RAG by using metadata, analytics operations, and hybrid search to return answers that are elusive to a simple RAG approach.

 

Answering questions where traditional RAG fails


With this in mind, let’s get more specific on what we can do with the NTSB dataset when using metadata in our queries.


  • Answer questions that need more information than can fit in the LLM context: An aggregation query is a simple example of this, e.g. “What types of planes were involved in incidents in California?” LLMs aren’t accurate with these aggregation-style questions, especially when all of the unique values aren’t in the top K results. This is a problem much better suited to analytics-style queries: extracting values for “location” and “aircraft type” is sufficient to answer the question using aggregations.


  • Answer questions that don’t retrieve the correct documents in the top K results: Many datasets contain a lot of similar-looking information, making it hard for even the best vector embedding models and hybrid search techniques to retrieve the required documents in the top K results. For instance, the NTSB dataset has hundreds of incidents in Washington over several years, and the burden is on the vector embedding model to separate these incidents by time. If your question is about a specific period, e.g. “Were there any incidents in the last three days of January 2023 in Washington?”, there is no guarantee your vector or hybrid search will retrieve the documents from that window. As more constraints are added to the question, the probability of errors when filtering only by vector search (or term searches) increases, and relevant documents get pushed outside the top K. Instead, with the fields “location” and “year” available as structured metadata, you can easily construct reliable filters to get the right documents by year.

As the above examples show, using analytics and filters with this metadata enables you to retrieve the right information for a query, which then gives better input into a RAG pipeline or other downstream operations. However, preparing data for this can be difficult to do, especially when dealing with unstructured, varying documents.


But if you're using Sycamore, it's a breeze! When ingesting data, Sycamore Data Prep lets you automatically detect and extract structured data from unstructured documents using LLMs. You can then use this metadata with filters and analytics in your search queries via the OpenSearch query language. In your data prep job, you only need to provide a class type to guide the extraction toward relevant fields; the rest is done automatically.


For the NTSB data, we will prompt Sycamore to infer a schema and populate it for each report, and then query these fields after ingestion using analytics operations in our search queries.

 

Let's build it


You can find a notebook with the full code for the Sycamore Data Prep job here. In the sections below, we will describe some key functions.


NTSB dataset: document format

The screenshot below is an example NTSB incident report. It contains some metadata (e.g. date, location, aircraft type), followed by an incident summary in free-form text.


 

 

Defining and extracting structured metadata


In this section, we will discuss a few parts of the job in more detail. Sycamore Data Prep uses a data structure called a DocSet, a distributed collection of documents bundled together for processing. A variety of transforms on DocSets help customers easily handle unstructured data and prepare it for search and analytics in Sycamore.


In this data prep job, we will use DocSet transformations that call LLMs to define the metadata schema for each Document and then extract the values for that schema.


extract_batch_schema accepts an entity class name (e.g. FlightAccidentReport) and attempts to extract from a document, as a schema, the field names that relate to the given class. Given a schema (either generated by extract_batch_schema or provided explicitly by a user), extract_properties then extracts values for the properties in that schema from a given Document.


Writing a data prep job with these transforms is straightforward. The following code ingests and partitions the Documents in the DocSet, then extracts metadata belonging to the “FlightAccidentReport” class. By default, the extract_batch_schema transform extracts only the 7 most important attributes of the class, but you can adjust this in the configuration. We also set num_of_elements to 10, which means only the first 10 elements (passages) of each Document are used for schema generation and property extraction.

For a DocSet named "docs", we run these transforms:


docs = docs.extract_batch_schema(
    schema_extractor=OpenAISchemaExtractor("FlightAccidentReport", llm=llm, num_of_elements=10)
).extract_properties(
    property_extractor=OpenAIPropertyExtractor(llm=llm, num_of_elements=10)
)

Next, we can explore the generated schema and extracted properties by showing the fields for a Document in the DocSet.

taken_docs = docs.take() 
# executes the DAG or plan shown above. Sycamore Data Prep uses lazy evaluation.

taken_docs[0].properties['_schema'] 
# shows the schema for the first doc.

It will return the schema:

{'type': 'object',
 'properties': {'location': {'type': 'string'},
  'dateAndTime': {'type': 'string'},
  'aircraft': {'type': 'string'},
  'definingEvent': {'type': 'string'},
  'probableCause': {'type': 'string'},
  'accidentNumber': {'type': 'string'},
  'aircraftDamage': {'type': 'string'}}} 

Next, we'll show the values for this schema for a Document:

taken_docs[0].properties['entity'] 
# entity is a top-level document property.

And it returns:

{'location': 'Brashear, Texas',
 'dateAndTime': 'January 15, 2023, 09:30 Local',
 'aircraft': 'CHEEK CARROLL HATZ/CHEEK',
 'definingEvent': 'Fuel related',
 'probableCause': 'A partial loss of engine power due to the formation of carburetor ice, which resulted in the inability to maintain altitude and the subsequent hard forced landing.',
 'accidentNumber': 'CEN23LA083',
 'aircraftDamage': 'Substantial'}

To populate these fields, Sycamore prompts the LLM with the schema and text from the Document, and the LLM extracts the information.


Formatting structured metadata


Sycamore Data Prep mixes LLM-powered functions with arbitrary Turing-complete Python code, giving you the best of both worlds when preparing and enriching data. In this example, we have to format the output of the dateAndTime field. The dateAndTime field has been inferred as a string, and there are cases in our dataset where the time component is missing (e.g. dateAndTime = “January 20, 2022”). In the job, we write a simple function called convert_timestamp and run it over our DocSet to convert the field to a Python date object.
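The notebook's implementation may differ, but a minimal sketch of such a conversion function could look like the following (the exact parsing logic here is our assumption, not the notebook's code):

```python
from datetime import datetime

def convert_timestamp(raw: str) -> str:
    """Normalize an extracted dateAndTime string to an ISO date (YYYY-MM-DD).

    Handles both "January 15, 2023, 09:30 Local" and bare dates like
    "January 20, 2022" where the time component is missing.
    """
    # Keep only the "Month Day, Year" prefix, dropping any time component.
    date_part = ",".join(raw.split(",")[:2]).strip()
    return datetime.strptime(date_part, "%B %d, %Y").date().isoformat()
```

In the job, a function like this would be applied to each Document's extracted entity with a DocSet map-style transform.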


You can find the rest of the Sycamore Data Prep code in the notebook, including the steps to partition the documents, create vector embeddings, and load the Sycamore indexes. After you run the data prep job and load the data, you can run some queries.


Adding analytics to search queries


With our enriched dataset, we can now augment our search queries and RAG pipelines with filters and aggregations that use the structured metadata.  


Using aggregations


To go back to the example earlier in the post, relying on the usual top K results would not correctly answer “What types of planes were involved in incidents in California?” This is because LLMs do not reliably aggregate information, and even if they did, you cannot fit the entire set of matching documents in an LLM's context window.


Instead, we can use Sycamore's analytics operations on the dataset. We will use a filter on the metadata field "location" to select for incidents in California. Next, we will use an aggregation function on the metadata field “aircraftType” to find the unique plane types. Sycamore uses the rich OpenSearch query language to express queries, and information on this syntax is found here.


To query your Sycamore stack using the OpenSearch query language, you can post the search request below to localhost:9200/[index-name]/_search. If you ran the notebook as-is, your index name is ntsb.


{
  "size": 0,
  "query": {
    "bool": {
      "must": [
        {
          "match_all": {}
        }
      ],
      "filter": [
        {
          "match_phrase": {
            "properties.entity.location": "California"
          }
        }
      ]
    }
  },
  "aggs": {
    "unique_aircraft_types": {
      "terms": {
        "field": "properties.entity.aircraftType.keyword"
      }
    }
  }
}

This query will return a list of unique airplane types:


{...
    "aggregations": {
        "unique_aircraft_types": {
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": [
                {
                    "key": "Piper PA-28-180",
                    "doc_count": 9
                },
                {
                    "key": "Cessna 195A",
                    "doc_count": 8
                },
                {
                    "key": "Cessna 172",
                    "doc_count": 7
                },
                {
                    "key": "Cessna 180K",
                    "doc_count": 7
                }
            ]
        }
    }
}

The doc_count field refers to the number of document chunks (passages) from incident reports that list that airplane type, not the number of incident reports. An incident report is segmented into chunks during the data prep job.
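If you are consuming this response in Python, pulling the airplane types out of the terms buckets is straightforward. A small sketch against an abbreviated response of the shape shown above (the variable names are ours):

```python
# Abbreviated aggregation response, shaped like the output shown above.
response = {
    "aggregations": {
        "unique_aircraft_types": {
            "buckets": [
                {"key": "Piper PA-28-180", "doc_count": 9},
                {"key": "Cessna 195A", "doc_count": 8},
            ]
        }
    }
}

# Collect the unique airplane types from the terms buckets.
buckets = response["aggregations"]["unique_aircraft_types"]["buckets"]
aircraft_types = [bucket["key"] for bucket in buckets]
```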


If you want to find the number of unique incident reports for each airplane type, you can add a nested aggregation inside the terms aggregation for airplane type that counts unique incident reports via the parent_id field:


"aggs": {
            "unique_incidents": {
            "cardinality": {
                "field": "parent_id.keyword"
                }
            }
        }
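Putting the pieces together, the cardinality aggregation nests inside the terms aggregation from the earlier query like so (a sketch; the field names match the previous examples):

```json
"aggs": {
    "unique_aircraft_types": {
        "terms": {
            "field": "properties.entity.aircraftType.keyword"
        },
        "aggs": {
            "unique_incidents": {
                "cardinality": {
                    "field": "parent_id.keyword"
                }
            }
        }
    }
}
```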


Using filters


When asking the question “Were there any incidents in the last three days of January 2023 in Washington?”, we need to ensure all of the relevant documents from Washington in that time window are retrieved in the top K results. With Sycamore, we can use filters to ensure we are only sending relevant documents to our RAG pipeline.


Sycamore uses OpenSearch’s Search Pipeline feature set to define RAG pipelines, and you can visit here for more information on the syntax. A RAG query includes a hybrid search query, broken into vector (neural) and keyword searches, and generative_qa parameters that configure the generative step with an LLM (in this example, GPT-4).


We will use Sycamore's default RAG pipeline. Use the index name from the prior example and provide the embedding model ID. Once you have this info, run this search pipeline query:
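As a sketch, a hybrid search request with RAG parameters against the ntsb index might look like the following. The pipeline name, the embedding field name, k, and the model ID placeholder are our assumptions based on OpenSearch's hybrid search and RAG syntax; the filter clauses mirror the excerpt discussed below:

```json
POST /ntsb/_search?search_pipeline=hybrid_rag_pipeline
{
  "query": {
    "hybrid": {
      "queries": [
        {
          "bool": {
            "must": [
              { "match": { "text_representation": "incidents in Washington in January 2023" } }
            ],
            "filter": [
              { "match": { "properties.entity.location": "Washington" } },
              { "range": { "properties.entity.day": { "gte": "2023-01-29", "lte": "2023-01-31" } } }
            ]
          }
        },
        {
          "neural": {
            "embedding": {
              "query_text": "incidents in Washington in January 2023",
              "model_id": "<your-embedding-model-id>",
              "k": 100,
              "filter": {
                "bool": {
                  "filter": [
                    { "match": { "properties.entity.location": "Washington" } },
                    { "range": { "properties.entity.day": { "gte": "2023-01-29", "lte": "2023-01-31" } } }
                  ]
                }
              }
            }
          }
        }
      ]
    }
  },
  "ext": {
    "generative_qa_parameters": {
      "llm_model": "gpt-4",
      "llm_question": "Were there any incidents in the last three days of January 2023 in Washington?"
    }
  }
}
```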



 

In the above query, a filter is applied to both the vector (neural) and keyword sections of the hybrid search query, filtering on location (Washington) and the date range. This ensures that only documents with these attributes are retrieved and used in the RAG pipeline. Here is the excerpt:


"filter": {
    "bool": {
        "filter": [
            {
                "match": {
                    "properties.entity.location": "Washington"
                }
            },
            {
                "range": {
                    "properties.entity.day": {
                        "gte": "2023-01-29",
                        "lte": "2023-01-31"
                    }
                }
            }
        ]
        }
    }
}

Run the above RAG pipeline with and without this filter clause. Without the clause, RAG cannot generate the correct answer because hybrid search fails to retrieve the incident in the top K results. With the filters, we get the correct answer:


"ext": {
        "retrieval_augmented_generation": {
            "answer": "Yes, there was an incident on January 29, 2023, in Dallesport, Washington. An amphibious float-equipped airplane, CUB CRAFTERS INC CC18-180, collided with a channel marker during a step taxi in preparation for takeoff on the water. The left wing and left side float separated from the fuselage, resulting in substantial damage. The National Transportation Safety Board determined the probable cause to be the flight instructor's failure to see and avoid the channel marker [3]."
        }
    }

Conclusion


In this post, we used Sycamore to extract and populate schemas for our unstructured documents. We then loaded this data into our indexes and used queries with aggregations and filters to retrieve correct and comprehensive results directly. This approach solves a core problem of simple RAG pipelines, where a possibly incomplete or incorrect set of top K results gets passed to the LLM to generate an answer.


By using structured metadata, you can use analytics alongside search to ask different and more complex questions on your data. If you have any questions on using structured metadata in Sycamore, join our Slack and get in touch!


