Large language models (LLMs) are all the rage. The arms race to build the best generative AI models has heated up, with an ever-increasing zoo of models showing steady improvements. For example, as of this post, HuggingFace has over 68K text generation models, and multi-modal models for processing images and video are emerging. LLMs have caught the imagination of the industry at large, and enterprises are experimenting with innumerable uses for them, such as code co-pilots, application assistants, and chatbots.
Our roots are in databases, so we naturally view LLMs through that lens. They ostensibly challenge the very nature and purpose of database systems — to accurately answer questions from data. These models are stochastic, and they generate answers from the massive data on which they were trained. But using these models on their own as question-answering oracles will not work for enterprise use cases. Training them on private data is prohibitive for most companies. More importantly, LLMs are dream machines. They hallucinate by design, which can be a liability. In many settings, e.g. financial services, healthcare, and government intelligence, users want accurate and explainable answers so that they can trust those answers and act on them. To address this, developers are employing retrieval-augmented generation (RAG) to extend LLMs with external data and ground their answers in order to limit hallucinations.
In this post, we argue that current RAG approaches are insufficient and brittle, and that we need a more flexible approach inspired by the tenets that made relational databases successful. RAG is a stylized method for search-oriented questions, and does not use LLMs to their full potential. LLMs are uncannily effective tools for analyzing unstructured data. They can extract metadata and structure, make qualitative assessments, summarize, synthesize, do simple reasoning and planning, and use external tools. Relational systems were the first to allow users to easily specify queries in a succinct high-level language (SQL) and automatically determine a plan to compute the answer from structured data. Similarly, we need systems that allow users to ask free-form questions in natural language and automatically determine and orchestrate a plan to compute an answer from unstructured data using LLMs. We call this LLM-powered unstructured analytics (LUnA), and outline an approach that we are taking in our open source search and analytics platform, Sycamore, to build it.
RAG is fundamentally limited
The simplicity of the original RAG approach is alluring [1]. Developers first segment their data into smaller passages or chunks and encode those chunks as static high-dimensional vectors using embedding models. The embeddings aim to capture semantic meaning and are typically managed in a vector index or database. When a question arrives, simple RAG computes the embedding vector for the question and finds the nearest indexed chunks using a similarity metric like cosine similarity. Then, it passes the most relevant chunks into the LLM’s context with an appropriate prompt to ground its answers. This has the added benefit that references to chunks in the context are useful for explainability.
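To make this concrete, here is a minimal sketch of simple RAG in Python. The `embed_texts` and `llm_complete` helpers are hypothetical stand-ins for an embedding model and an LLM API (they are not from any particular library), and the fixed-size chunking is deliberately naive.

```python
import numpy as np

def embed_texts(texts: list[str]) -> np.ndarray:
    """Hypothetical embedding model: returns one vector per input text."""
    ...

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call: returns a completion for the prompt."""
    ...

def build_index(documents: list[str], chunk_size: int = 500) -> tuple[list[str], np.ndarray]:
    # Naive fixed-size chunking; real systems segment far more carefully.
    chunks = [doc[i:i + chunk_size] for doc in documents
              for i in range(0, len(doc), chunk_size)]
    vectors = embed_texts(chunks)
    # Normalize so a dot product equals cosine similarity.
    vectors = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    return chunks, vectors

def answer(question: str, chunks: list[str], vectors: np.ndarray, k: int = 5) -> str:
    q = embed_texts([question])[0]
    q = q / np.linalg.norm(q)
    top_k = np.argsort(vectors @ q)[::-1][:k]        # nearest chunks by cosine similarity
    context = "\n\n".join(chunks[i] for i in top_k)  # grounding context
    prompt = ("Answer the question using only the context below.\n"
              f"Context:\n{context}\n\nQuestion: {question}")
    return llm_complete(prompt)
```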
While RAG works for small and well-behaved datasets, the community now realizes that this simple RAG approach does not scale. RAG accuracy degrades quickly as one (a) asks more complex questions, (b) adds more data, or (c) works with more complex data. RAG has two main limitations:
Limited context: LLM context sizes are limited (e.g. GPT-4 currently supports 8K tokens), so RAG can only provide answers that are contained in the top-k retrieved chunks. This works when, for example, users ask factual questions whose answers lie in a few chunks, but it fails when the answer requires synthesizing across a large collection. For example, users often ask analytics-style questions like “List all the locations in which there were aviation accidents involving snow,” or “List the top 10 companies in the construction engineering market by revenue and growth rate.” These “sweep and harvest” questions require sweeping through a corpus, e.g. NTSB reports or financial research reports, harvesting information from those documents, and synthesizing a result. RAG cannot handle these because the entire corpus simply cannot fit into the context. There is ongoing work to scale LLM contexts, such as Gemini 1.5, which will soon support 10M tokens. However, previous studies show that LLMs with extremely long contexts cannot “attend” to everything in the context [2]. While the Gemini model seems to do better than those studies suggest [3], the jury is still out. In addition, with large contexts it is harder to determine the provenance of a result, making explainability more difficult. Finally, data sizes are growing faster than the rate at which we can increase context sizes.
Embedding fidelity: RAG puts a tremendous burden on embedding models. While they are trained on a variety of questions and text, they simply cannot anticipate every type of question or data. We find that as you add more data, accuracy starts to deteriorate because it becomes harder for embeddings to discriminate among chunks for any given query. Moreover, analytics-style questions often involve time, e.g. “Which companies’ CEOs changed last year,” hierarchy, or categories, e.g. “… in the AI sector.” Vector embeddings do not precisely capture time, hierarchy, or categories, so narrowing on time ranges or faceting is not possible. Finally, current embedding models do not work well for complex documents with tables, graphs, or infographics. There are often many possible interpretations of such data, and conversion to text loses fidelity.
Many variations of the RAG architecture have been proposed to address these problems [4], each of which needs to be manually constructed and customized for the dataset and application workload. These solutions are tedious to build, brittle, and oftentimes still cannot achieve the needed quality. Even when developers get them working on a narrow workload, they complain that they have to retool their solution when document types with new features are added, e.g. customer contracts vs. employment agreements, when new types of queries are asked, or as data grows. This is reminiscent of hierarchical and network databases prior to the relational database era — developers had to retool data layouts and access paths for every application and as workloads evolved.
LLM-powered unstructured analytics (LUnA): the challenge and an approach
A hallmark of relational databases is declarative query processing, which makes it easier to build and future-proof applications. Users specify queries in a concise high-level language like SQL, and the system automatically compiles a strategy (i.e. a plan) to compute the answer. The queries are declarative: they specify what the user wants, and the database determines the steps, the how, to execute them. Since applications need not worry about the low-level details of the how, it is easier for them to adapt to changing workloads and to scale.
We posit that with LLMs, we now have the technology to borrow this idea and apply it to questions in natural language (such as English) on unstructured data such as PDFs, HTML, presentations, images, and more. We call this LLM-powered unstructured analytics, or LUnA for short. Such a system needs to answer a variety of queries: simple facts, multi-hop deductions across several facts, comparisons, enumerations, aggregations, and summarizations, to name a few. To be clear, our goals are not entirely new; they overlap with those of the data integration and knowledge-graph communities. LLMs, however, offer a new perspective and new opportunities. In this section, we sketch an approach to developing such a system.
As the name LUnA suggests, we aspire to support analyses that involve a mixture of search and analytics-style processing. We roughly see three main classes of plans. The first is “RAG-style” plans, which retrieve a limited amount of data from indexes. The second is “sweep and harvest” plans, which run LLM-based dataflow pipelines over a large amount of data for ad-hoc analyses that cannot be served by indexes (see the sketch below). The third is “data integration” plans, which combine data from multiple sources. For example, one may ask “Summarize expenditure plans for all companies that made $1B+ acquisitions in the past three years,” which requires combining earnings calls with a public database of acquisitions. We also expect that we will need to support complex compositions of these plan types.
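As a rough illustration of the “sweep and harvest” pattern, the sketch below maps an LLM extraction over every document and then aggregates the extracted records, using the NTSB question from earlier. The `llm_extract_json` helper is a hypothetical structured-output LLM call, not any particular library’s API.

```python
def llm_extract_json(prompt: str) -> dict:
    """Hypothetical structured-output LLM call that returns parsed JSON."""
    ...

def snowy_accident_locations(reports: list[str]) -> list[str]:
    # Sweep: extract a small structured record from every report in the corpus.
    records = [
        llm_extract_json(
            "From this NTSB accident report, return JSON with fields "
            '"location" (string) and "weather" (list of strings):\n' + report
        )
        for report in reports
    ]
    # Harvest: filter and aggregate the extracted records into the answer.
    locations = {
        r["location"]
        for r in records
        if any("snow" in w.lower() for w in r.get("weather", []))
    }
    return sorted(locations)
```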
To tackle this challenge, we envision a compound AI system [5] that uses LLMs to orchestrate dataflows and AI kernels of all kinds that operate on unstructured data. Like most data processing systems, we see this system composed of a data layer and a query layer:
Data Layer. The unstructured world is much messier than the structured one, so the data layer includes an ingestion stage that helps corral the mess. Our experience is that a rich and flexible ingestion stage makes it easier to achieve good answer quality. The ingestion stage includes robust dataflow processing frameworks for extracting structure and meaning from source data. It uses LLMs and other specialized AI models to segment documents, enrich them, compute embeddings, and extract metadata and relationships. The data layer also includes a variety of specialized indexes, such as a keyword index, metadata index, vector index, and graph index, to store the extracted information and efficiently retrieve it for analyses.
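As a rough sketch of the shape of such an ingestion pipeline (this is an illustration, not the Sycamore API), the code below segments raw documents, enriches chunks with extracted metadata, computes embeddings, and writes to several specialized indexes. The `segment`, `extract_metadata`, and `embed_chunk` stages and the index objects are hypothetical placeholders.

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)   # e.g. dates, entities, section titles
    embedding: list[float] | None = None

# Hypothetical stages; real pipelines would back these with layout-aware
# parsers, LLMs or other AI models, and an embedding model.
def segment(raw_doc: bytes) -> list[Chunk]: ...
def extract_metadata(chunk: Chunk) -> Chunk: ...   # LLM-based enrichment
def embed_chunk(chunk: Chunk) -> Chunk: ...

def ingest(raw_docs: list[bytes], keyword_ix, metadata_ix, vector_ix) -> None:
    for raw in raw_docs:
        for chunk in segment(raw):
            chunk = embed_chunk(extract_metadata(chunk))
            # Write each chunk to all of the specialized indexes so the
            # query layer can pick whichever retrieval path fits the plan.
            keyword_ix.add(chunk.text)
            metadata_ix.add(chunk.metadata)
            vector_ix.add(chunk.embedding, chunk)
```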
Query Layer. The query layer includes a planning component that turns a natural language query into a dataflow of operations. In our case, we do not have the equivalent of relational algebra and a query optimizer, the workhorses that accurately and efficiently evaluate a SQL query. Instead, we hope to rely on LLMs. LLMs have shown the ability to do simple planning, e.g. chain-of-thought reasoning, and classification [6]. In our own tests, we’ve seen that they can analyze English queries and identify metadata constraints and time ranges to use as filters. With appropriate prompting and tuning, LLMs could select from a few hand-crafted plans or generate valid plans on the fly with human-readable explanations. As with self-driving cars, we suspect humans will initially want to be in the loop to help build trust. In the long term, though, we believe automated LLM-based planning will be the norm. The query layer also includes an execution component that leverages LLMs, indexes, and dataflow-style processing. In addition to traditional dataflow operators like filtering, projection, and aggregation, we include a variety of operators useful in the unstructured world, such as summarization, semantic retrieval, semantic filtering, re-ranking, question-answering, and synthesis. Although we are not limited to LLMs, we can implement many of these operators simply with appropriate prompts to LLMs.
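One simple way to picture LLM-based planning: prompt the model to emit the plan as a JSON dataflow over a fixed vocabulary of operators, then validate the plan before executing anything. This is only a sketch under those assumptions; the operator names and the `llm_complete` helper are hypothetical.

```python
import json

OPERATORS = {"semantic_retrieve", "metadata_filter", "llm_filter",
             "summarize", "aggregate", "rerank"}

def llm_complete(prompt: str) -> str:
    """Hypothetical LLM call that returns the plan as JSON text."""
    ...

def plan_query(question: str) -> list[dict]:
    # Ask the LLM for a JSON list of steps, e.g. for the NTSB question above:
    # [{"op": "llm_filter", "args": {"condition": "accident involved snow"}},
    #  {"op": "aggregate", "args": {"group_by": "location"}}]
    prompt = (
        "Turn this question into a JSON list of dataflow steps, where each "
        f"step is an object with an 'op' field (one of {sorted(OPERATORS)}) "
        "and an 'args' object.\n"
        f"Question: {question}"
    )
    plan = json.loads(llm_complete(prompt))
    # Validate before execution: reject plans with hallucinated operators.
    for step in plan:
        if step.get("op") not in OPERATORS:
            raise ValueError(f"unknown operator in plan: {step.get('op')}")
    return plan
```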
Since LLMs inherently hallucinate, one important open challenge is how to ensure the reliability and accuracy of these plans and their output. We expect that we will need to employ a number of techniques at various layers to identify, contain, and guard against errors. For the outputs of individual LLM calls, we can use redundancy, e.g. voting. For query outputs, we can include lineage to help trace how an output was generated and build trust in the answer. For embeddings and chunk retrieval, we could assess relevance based on explicit end-user feedback. We also envision automated end-to-end pipeline tuning, similar to DSPy [7], using sample questions and answers either provided a priori or gathered as feedback from trials. Finally, where automation fails, we expect to include humans in the loop. For example, for query planning, one option is to provide an intuitive UI for humans to validate, correct, or iteratively construct a plan with LLM assistance.
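For instance, redundancy at the level of individual LLM calls can be as simple as sampling the same extraction several times and keeping the majority answer, escalating to a human when agreement is low. The `llm_extract_label` helper below is again a hypothetical stand-in.

```python
from collections import Counter

def llm_extract_label(prompt: str) -> str:
    """Hypothetical LLM call that returns a short categorical answer."""
    ...

def vote(prompt: str, n: int = 5, min_agreement: float = 0.6) -> str | None:
    # Ask the same question n times and keep the majority answer; if the
    # samples disagree too much, return None so a human can review instead.
    answers = Counter(llm_extract_label(prompt) for _ in range(n))
    best, count = answers.most_common(1)[0]
    return best if count / n >= min_agreement else None
```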
The AI community will continue to scale LLMs and their contexts, but that is only part of the story for enterprise use cases. LLMs and modern AI offer remarkable capabilities and as-yet unexplored versatility for analyzing unstructured data. We argue that enterprises need LUnA: a declarative, LLM-powered abstraction inspired by relational databases that allows us to build flexible and robust applications on unstructured data. We have embarked on this exploration with our open source search and analytics platform, Sycamore, and encourage you to join us on this journey.
[1] Lewis, Patrick et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. arXiv:2005.11401v4. 21 Apr 2021.
[2] Liu, Nelson F. et al. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172v3. 20 Nov 2023.
[3] Gemini Team. Google. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. Gemini 1.5 Pro technical report. 15 Feb 2024.
[4] Gao, Yunfan et al. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv:2312.10997v4. 5 Jan 2024.
[5] Zaharia, M. et al. The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/compound-ai-systems/. 18 Feb 2024.
[6] Wei, J. et al. Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903v6. 10 Jan 2023.
[7] Khattab, O. et al. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714v1. 5 Oct 2023.
I thank Ben Sowell and Jon Fritz for feedback on drafts of this post.