
Near-Duplicate Detection in Sycamore: What Is It Good For?

Alex Meyer

Recently, we added near-duplicate-detection (NDD) support to Sycamore. Back in the late 1990s, when web search engines were vying for market dominance, NDD was an important technique for improving relevance. Since Andrei Broder published his landmark paper in 2000, NDD has found its way into many aspects of computing. As the world turns toward generative AI, we believe that NDD continues to deliver value.


One application that NDD can improve is retrieval-augmented generation (RAG). This refers to the use of a large language model (LLM) to answer questions beyond the scope of what the LLM was trained upon. RAG starts by retrieving documents from a traditional search engine. Then, it sends the query, along with a limited number of top-scoring documents as context, to an LLM. The result is an answer synthesized by the LLM from the documents. Because the effective context size is limited, it’s important for us to fill it with as rich a set of documents as possible. This is where near-duplicate removal is helpful.


In order to show how NDD improves retrieval and RAG in the real world, we chose a readily available dataset from data.gov. It contains marketing agreements between credit card companies and colleges for branded cards from 2011 through 2019. As one might imagine, many of these documents are quite similar, both across years and across cards.




Generally, NDD consists of two phases: tagging each document with a “sketch”, and eliminating or grouping documents based on sketch comparisons. These days, many search systems, including Sycamore, cut documents into smaller chunks to improve relevance. One reason is that individual topics can get lost in longer documents. Another is that vector embedding models tend to operate on a limited number of tokens. In RAG, longer documents can overload an LLM’s context size. We’ll use the words “document” and “chunk” interchangeably.


In Sycamore, we use a transform called Sketcher to generate a sketch for each document. The current implementation uses hashing to generate a set of numbers called a “shingle”. The shingle is an attribute of the document and can be stored in OpenSearch during ingestion.
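
To make this concrete, here is a minimal sketch of how such a shingle can be computed. It illustrates the general technique (hash a sliding window of the text and keep the smallest values); it is not Sycamore's exact implementation, and the window and sketch sizes are arbitrary:

    import hashlib

    def make_shingle(text: str, window: int = 17, size: int = 16) -> list[int]:
        # Normalize whitespace and case so trivial formatting
        # differences don't change the sketch.
        data = " ".join(text.split()).lower().encode("utf-8")
        # Hash every window-byte substring and keep the smallest
        # hash values as the document's sketch.
        hashes = {
            int.from_bytes(hashlib.sha256(data[i:i + window]).digest()[:8], "big")
            for i in range(max(1, len(data) - window + 1))
        }
        return sorted(hashes)[:size]

Because near-identical documents share almost all of their substrings, they end up with almost identical sketches, which is what the comparisons described below rely on.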


While it’s possible to drop near-duplicate documents at ingestion time, doing so hurts recall and relevance. We can sidestep these problems by eliminating (or collapsing) near-duplicates at query-time instead. To illustrate, consider the two documents below, represented as keywords, and assume they meet the threshold to be considered duplicates. If we use an ingestion-time transform like SketchUniquify to de-duplicate the incoming batch of documents, one document will win and one will lose. Let’s assume Amex wins by virtue of alphabetical order.




(Figure: Venn diagram of documents with words in common)


Now, the index contains the one document “Amex penalty fee interest rate arbitration”. If we try to query for “Visa interest rate”, there are zero documents matching the query. This is poor relevance, and entirely self-inflicted, as we had the matching document and dropped it.


It’s better to de-duplicate the result set at query-time. In that case, both documents remain in the index: for the query above, retrieval would have yielded only the Visa document, and no NDD would have come into play at all.
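
For cases where near-duplicates do land in the same result set, here is a toy illustration of the comparison itself, with keyword sets standing in for hashed shingles (the Visa document’s wording here is hypothetical):

    amex = {"amex", "penalty", "fee", "interest", "rate", "arbitration"}
    visa = {"visa", "penalty", "fee", "interest", "rate", "arbitration"}  # hypothetical

    similarity = len(amex & visa) / len(amex | visa)  # Jaccard: 5/7, about 0.71
    distance = 1 - similarity                         # about 0.29

That distance of roughly 0.29 is below the 0.4 distance threshold used later in this post, so the pair would be treated as duplicates.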


A Duplicative Example


Before we get into RAG, let’s take a brief detour to illustrate the value of NDD in pure text retrieval. We used Sycamore to ingest the credit card PDF files into OpenSearch. In total, there were 1911 files from which we could extract meaningful text without resorting to optical character recognition (OCR). Our ingestion used the thenlper/gte-small embedding model for vector indexing, and we used the BAAI/bge-reranker-base model for query reranking.
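
For reference, the ingestion pipeline looked roughly like the sketch below. It is written against our recollection of Sycamore's DocSet API; transform names, arguments, and connection settings are assumptions that may differ across versions:

    import sycamore
    from sycamore.transforms.embed import SentenceTransformerEmbedder
    from sycamore.transforms.partition import UnstructuredPdfPartitioner

    os_client_args = {"hosts": [{"host": "localhost", "port": 9200}]}  # placeholder

    ctx = sycamore.init()
    (
        ctx.read.binary(paths="cardagreements/", binary_format="pdf")
        .partition(partitioner=UnstructuredPdfPartitioner())  # cut PDFs into chunks
        .explode()                                            # one record per chunk
        .sketch(window=17)                                    # attach NDD shingles
        .embed(embedder=SentenceTransformerEmbedder(model_name="thenlper/gte-small"))
        .write.opensearch(os_client_args=os_client_args, index_name="cardagreements")
    )

The reranking model is applied at query-time in the search pipeline rather than at ingestion, so it does not appear here.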


Then, we decided to research the various types of arbitration language these documents contained. We asked OpenSearch to “Summarize the rules for summary judgment and penalties in arbitration.” Our top ten results looked like the following:


"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

"The Parties shall cooperate with each other in causing the arbitration to be held in\n(c)

"The Parties shall cooperate with each other in causing the arbitration to be held in\n(c)

"The Parties shall cooperate with each other in causing the arbitration to be held in\n(c)

"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

"(1) Section 14, in no event will either party be responsible to the other for any incidental

"(1) Section 14, in no event will either party be responsible to the other for any incidental

" Except as otherwise provided in this Section, the arbitration shall be Pursuant to the Code

" Except as otherwise provided in this Section, the arbitration shall be Pursuant to the Code


Results like these are problematic, as many of the lines are the same. At first glance, it appears we really got only 4 useful chunks even though we wanted 10. Getting 10 different chunks may involve manually sifting through many more results. Further, the task of de-duplicating the result set by eye is error-prone and biased toward the front of each chunk. Eventually our eyes will glaze over.


NDD to the Rescue


For this demonstration, we used a cool piece of technology developed at Aryn. It’s called a remote search processor and it allows one to implement search pipeline logic in Python outside OpenSearch. We wrote a bit of code to drop near-duplicates from a result set as part of query processing. For any pair of near-duplicate documents, the higher-scoring document survives.
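
The heart of the processor is simple. Below is a simplified sketch of the logic, assuming each hit carries the shingle stored at ingestion and that hits arrive sorted by descending score; the real processor is more involved:

    def shingle_distance(a: set[int], b: set[int]) -> float:
        # Distance is 1 minus the Jaccard similarity of the sketches.
        union = a | b
        if not union:
            return 1.0
        return 1.0 - len(a & b) / len(union)

    def dedup_response(hits: list[dict], threshold: float = 0.4) -> list[dict]:
        # Hits are sorted by descending score, so the first member of
        # each near-duplicate group survives and later ones are dropped.
        kept = []
        for hit in hits:
            sketch = set(hit["shingle"])
            if all(shingle_distance(sketch, set(k["shingle"])) > threshold
                   for k in kept):
                kept.append(hit)
        return kept

Under this definition, any pair of hits within the distance threshold is treated as duplicates, which is why distance thresholds closer to 1.0 drop more documents.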


Using our fancy pipeline and the same question as above, we got the following results:


"The Parties shall cooperate with each other in causing the arbitration to be held in\n(c)

" Except as otherwise provided in this Section, the arbitration shall be Pursuant to the Code

"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

" The Arbitrator or Arbitration Panel is 3pecifically authorized in proceeding pursuant to

"If the parties are unable to resolve any Dispute as contemplated above, such Dispute shall

"Any award rendered by the arbitrator or Arbitration Panel will be final, conclusive and

"At the time of granting or denying a motion of summary judgment as provided for in (e)

"If the amount of the controversy set forth in either the claim or counterclaim is equal

"Agreement (including any breach, termination, validity or enforceability of any provision

"Should an arbitrator refuse or be unable to proceed with arbiLration proceedings as called


What we see here is 9 distinct chunks, which is much better than 4. Looking carefully, the one remaining duplicate pair differs only by a spurious OCR error. There could also be chunking boundary differences. These are outside the scope of NDD to solve. That said, the aggressiveness of NDD can be tuned at query-time.
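
One practical way to tune it is to sweep the threshold over a single retrieved result set and watch the drop rate, reusing the dedup_response sketch from above (with hits being the retrieved chunks):

    for threshold in (0.15, 0.4, 0.7):
        kept = dedup_response(hits, threshold=threshold)
        dropped = 1 - len(kept) / len(hits)
        print(f"threshold={threshold}: dropped {dropped:.0%} of {len(hits)} chunks")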


For the above example, the remote "dedup-response" processor was configured with a distance "threshold" of 0.4, which dropped 73% of the 100 chunks that we retrieved. Thresholds closer to 1.0 drop more documents, but the rate of false positives increases and the likelihood of retrieving sufficient relevant documents decreases. In the example below, a distance threshold of 0.15 dropped 60% of chunks.


"The Parties shall cooperate with each other in causing the arbitration to be held in\n(c)

" Except as otherwise provided in this Section, the arbitration shall be Pursuant to the Code

"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

" The Arbitrator or Arbitration Panel is 3pecifically authorized in proceeding pursuant to

"The arbitration hearing shall be held in such neutral location as the parties may mutually

"Should an arbitrator refuse or be unable to proceed with arbitration proceedings as called

"The arbitration hearini! shall be held in such neutral location as the parties may MUILIally

"(b) To initiate arbitration, the Disputing Party shall notify the other party in writing (the

"If the parties are unable to resolve any Dispute as contemplated above, such Dispute shall

"Any award rendered by the arbitrator or Arbitration Panel will be final, conclusive and


Of these top-10 chunks, 8 are unique. Compared to the previous result, these results are missing the chunk that starts with “Agreement”.


Consequences for RAG


As mentioned above, RAG uses LLMs to generate answers from provided documents. One of the challenges with RAG is the limited effective “context”: how much data can be given to the LLM to generate an answer. If the context is 10 documents, having 9 unique documents will clearly lead to a more comprehensive answer than 4 unique documents. For this reason, NDD is a valuable complement to RAG.
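
As a sketch of the overall shape of such a pipeline (the prompt wording and client usage are illustrative, not the exact pipeline we ran):

    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    def rag_answer(question: str, chunks: list[str], context_size: int = 5) -> str:
        # Number the top chunks so the model can cite them like [1].
        context = "\n\n".join(
            f"[{i + 1}] {chunk}" for i, chunk in enumerate(chunks[:context_size])
        )
        prompt = (
            "Answer the question using only the numbered context below, "
            "citing chunks like [1].\n\n"
            f"Context:\n{context}\n\nQuestion: {question}"
        )
        response = client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content

The only NDD-specific difference is upstream: whether the chunks list was de-duplicated before the top context_size entries were taken.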


To show this benefit, we ran some RAG queries on the college credit card data. The first question is the same one we used above: “Summarize the rules for summary judgment and penalties in arbitration.” We used a basic RAG pipeline with GPT-4 and a context size of 5 chunks, so the data fits in the LLM’s context window. Without NDD, the RAG pipeline yields this answer:


"In arbitration, the arbitrator or Arbitration Panel grants or denies a motion of summary judgment within fifteen days after the arbitration hearing concludes. They prepare and distribute a written document detailing their findings of facts and conclusions of law, including reasons for granting or denying any award [1]. The arbitrator or Arbitration Panel can issue monetary sanctions against a party if they unreasonably delay the proceeding [1]. The arbitrator has no authority to award punitive damages [3]. Any award rendered by the arbitrator or Arbitration Panel is final and binding [1]."

This answer misses many details needed to answer the question. It does mention that the arbitrator produces a written document with findings of fact and conclusions of law, a fifteen-day window, a penalty for delaying the proceeding, and that awards are final and binding. But it never gets deeper into the rules of the actual arbitration process.


Enabling NDD in the same RAG pipeline adds 3 unique documents to the context (by removing 3 duplicates), giving the LLM more information from which to generate an answer. Here is the result:


"The rules of arbitration include: cooperation between parties to make the process efficient and expeditious [1]; initiation of arbitration by notifying the other party in writing, describing the dispute, stating the claim amount, and specifying the requested relief [2]; replacement of an arbitrator who refuses or is unable to proceed [3]; the arbitrator or panel's responsibility to schedule discovery and procedural steps promptly [3]; the finality of any award rendered by the arbitrator or panel [3]; each party bearing a pro rata share of all fees, costs, and expenses of the arbitrators [3]; and the application of the Federal Rules of Evidence to the arbitration hearing [4]. The arbitration process is subject to the confidentiality provisions of the agreement [1][4]."

Not only is this a more comprehensive answer, but it reads better as a summary. Beyond the finality of the award, it explains the goal of the arbitration, the details needed to initiate it, the costs shared by each party, the application of the Federal Rules of Evidence, and a confidentiality provision. Overall, the second answer includes more rules and a better description of the arbitration process.


Another example using the same dataset is the question “Summarize the highest and lowest interest rates, penalties, and fees.” Without NDD, the answer looks like this, and appears to describe one specific credit card:


"The highest and lowest interest rates for purchases and cash advances range from 9.99% to 17.99% when you open your account, then vary with the market based on the Prime Rate plus a margin [1]. The minimum monthly payment is $25.00 [1]. Penalties include a late payment fee of $10.00 or 10% of the outstanding balance, whichever is less, and a return check fee of $4.50 [1]. ATM transactions incur a 50-cent charge for each deposit or withdrawal exceeding 15 a month, a 25-cent charge on each balance inquiry, and $4 for each ATM deposit adjustment [2]. International transactions may incur a 1% fee [2]."

With NDD, after dropping 77% of the chunks, we get the answer below, which answers the actual question more succinctly and also includes the $50 emergency replacement fee that is missing above.


"The highest interest rate is 17.99% and the lowest is 9.99% [1]. The highest penalty fee is $50 for an emergency replacement card, while the lowest is $4.50 for a return check fee [1]. The highest fee is a 1% foreign transaction fee [3], and the lowest is a 25-cent charge for each balance inquiry [2]."

Conclusion


Query-time near-duplicate detection can make query results simultaneously more informative, useful, and pleasing to users. In the first examples above, the number of good results in the top 10 more than doubled. The later examples show that RAG applications benefit from more good results. The NDD process consists of ingestion-time and query-time processing, and both can be tuned in various ways if the defaults aren’t suitable for a particular dataset. The processing overhead involved is not large, especially considering the use of embeddings and vector search in a typical search stack.



We have provided an NDD Jupyter notebook, available from within the Sycamore Docker containers, with instructions and code that walk through an NDD example inspired by this blog post.

