One of the projects I have built is a long-standing retrieval-augmented generation (RAG) application. Documents are saved in a database, chunked into pieces of text small enough for a large language model (LLM) to handle, and turned into numerical representations (vectors).
At some point later, a user asks a question, which is also turned into a numerical representation. We compare the numbers and do some math to identify the top k (3, in my case) chunks of text that match the question. Those chunks are fed into an LLM, and we get an answer grounded in the uploaded documents.
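Here is a minimal sketch of that retrieval step, assuming pre-computed embedding vectors and plain numpy; the vector dimensions, the random placeholder vectors, and k=3 are illustrative only:

```python
# Minimal sketch of the retrieval step: cosine similarity between a question
# vector and stored chunk vectors, keeping the top-k matches.
# The random vectors below stand in for real embeddings so the example runs.
import numpy as np

rng = np.random.default_rng(0)
chunk_vectors = rng.normal(size=(100, 1536))   # one vector per stored chunk (placeholder)
question_vector = rng.normal(size=1536)        # the user's question, embedded (placeholder)

def top_k_chunks(question_vec: np.ndarray, chunk_vecs: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k chunks most similar to the question."""
    # Cosine similarity = dot product of L2-normalized vectors.
    q = question_vec / np.linalg.norm(question_vec)
    c = chunk_vecs / np.linalg.norm(chunk_vecs, axis=1, keepdims=True)
    scores = c @ q                       # similarity of each chunk to the question
    return np.argsort(scores)[::-1][:k]  # highest-scoring chunks first

print(top_k_chunks(question_vector, chunk_vectors))
```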
RAG implementations are notoriously unscientific, more art than science. How do you transform a PDF into text? What about tabular data? What if there are pictures in the document and they are important? How long should the chunks be? How many chunks? Should you use cosine similarity for the math, or something else? And a million other questions where the correct answer is always "it depends".
I am, and you should be, highly suspicious of generic "one size fits all" RAG solutions. The really good ones are customized for the use case, the user types, and the documents.
But that's all irrelevant now. I am moving the project to a long-context approach instead, because I think no one should be building RAG products in 2024 - and I will explain why below.
RAG is a workaround
In 2021/2022 - when I first started building on LLMs - we had GPT-3.5 and a ~4,000-token context window. LLMs without guardrails hallucinated at an unacceptable rate for anything accuracy-critical.
Adding a knowledge base solved hallucination to a large extent, but you were still limited to roughly 4,000 tokens. Many real-world problems involve documents far larger than the model's context, so RAG became the standard solution.
Break large documents into smaller pieces and use semantic search to retrieve only the relevant parts instead of the whole document. A clever solution to the problems of context length and hallucination.
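The "break it into pieces" step usually looks something like the rough sketch below, assuming naive fixed-size character chunks with overlap; real pipelines often split by sentences, headings, or token counts, and the sizes here are illustrative, not recommendations:

```python
# Naive fixed-size chunking with overlap, so context isn't cut mid-thought
# at chunk boundaries. Sizes are illustrative placeholders.
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into overlapping, fixed-size character chunks."""
    step = chunk_size - overlap
    return [text[start:start + chunk_size] for start in range(0, len(text), step)]
```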
Now we have neither of these problems. Gemini Pro handles 1M+ tokens with no issues, and LLMs hallucinate a lot less. Some have seen degradation in GPT-4 after 16k tokens, but that is more of a GPT-4 limitation than a limitation of LLMs in general.
This is my favorite formal eval (link) of the "real" context size of LLMs, and it clearly shows that context degradation is model-specific rather than an inherent LLM limitation.
In summary - we used RAG because we had a problem that no longer exists.
RAG is less performant than long-context
As of late July 2024, we have pretty solid evidence and evals showing that long context beats most RAG implementations. I like this paper because it proves the point, even though it still tries to tell you to use some RAG.
If you care about performance, you absolutely should remove RAG from your stack.
KISS (Keep it simple) always wins
Some people argue that inference cost is a factor and that you should still use RAG for some queries (as the paper above suggests). Nope. Bad idea.
LLM inference costs are coming down rapidly.
The development cost (AKA all the engineers you need to hire to maintain your bespoke RAG solution) is a lot higher than the API costs.
Large RAG applications are very difficult to maintain and iterate on, because they are fragile and small changes can have outsized effects.
Long-context applications, on the other hand, are very simple. You only worry about getting the text out of the documents correctly; everything else is an API call. No chunking, no vectors, no math. Easy to maintain, easy to iterate on, and as simple as it gets.
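Here is a minimal sketch of what that looks like, assuming the google-generativeai Python SDK; the model name, placeholder API key, and prompt wording are my own illustrative choices, not a prescription:

```python
# Long-context approach: extract the text, stuff it all into one prompt,
# and ask the question. No chunking, no vectors, no similarity math.
import google.generativeai as genai

genai.configure(api_key="YOUR_API_KEY")          # placeholder key
model = genai.GenerativeModel("gemini-1.5-pro")  # illustrative model choice

def answer_from_documents(documents: list[str], question: str) -> str:
    """Send the full text of every document, plus the question, in a single request."""
    prompt = (
        "Answer the question using only the documents below.\n\n"
        + "\n\n---\n\n".join(documents)
        + f"\n\nQuestion: {question}"
    )
    return model.generate_content(prompt).text
```

That is the entire pipeline: extract text, concatenate, ask.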
The unpopular opinion
Simplicity is great, but it doesn't sell. So there will be a lot of noise and disagreement from folks invested in the RAG ecosystem. I get it. If you just raised a bunch of money for a vector DB, you aren't a big fan of this development.
I have run a large RAG application in production for the last 2.5 years. My skill set and experience in that corner of development are now essentially obsolete. But that is the nature of AI engineering and being on the bleeding edge.