Goodbye RAG?

by Pasi Karhu, AI4Value CTO

What is RAG? Why is it currently so important, and why might it soon not be anymore?

Because of large language models’ limited context windows, a technique called RAG (Retrieval-Augmented Generation) has been developed. If you have a very long document or many shorter documents, you currently cannot include them all in a single request to an LLM because of its context length limit. For GPT-4, for example, the limit is 128 thousand tokens, which is roughly 100 thousand words of English text, about the length of an average novel.

RAG systems overcome this limit by splitting larger text collections into smaller chunks and giving each chunk a mathematical representation, a vector, that preserves its linguistic meaning. The vector is no more complicated than the ones we all learned about in school, although it is much longer than the usual two-dimensional vectors handled there.
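The chunking step can be sketched in a few lines of plain Python. This is only an illustration with made-up parameters (chunk size, overlap); real systems typically split on sentence or paragraph boundaries and then pass each chunk to an embedding model (not shown here) to obtain the vector.

```python
def chunk_text(text, chunk_size=200, overlap=50):
    """Split text into overlapping chunks of roughly chunk_size words.

    Overlap between neighbouring chunks helps avoid cutting a relevant
    passage in half at a chunk boundary.
    """
    words = text.split()
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
    return chunks

# Stand-in for a long document: 500 repeated words.
doc = ("word " * 500).strip()
print(len(chunk_text(doc)))  # three overlapping 200-word windows cover 500 words
```

Each resulting chunk would then be embedded into a vector and stored, typically in a vector database, for the retrieval step described next.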

Some of you may remember how to calculate the angle between two vectors. A question posed to a RAG system is also turned into a vector, and with that very same elementary angle calculation the closest document chunk vectors (those with the smallest angle) are found. The original text content of those chunks is then used to answer the question. This can and does get quite tricky, though, when a question requires reasoning and aggregating information from several smaller chunks. A lot of effort is going into developing different kinds of RAG retrieval and aggregation strategies.
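The angle calculation above is exactly the school formula cos θ = (a·b) / (|a| |b|): the smaller the angle, the larger the cosine. A minimal sketch with toy three-dimensional vectors (real embeddings have hundreds or thousands of dimensions, and the vector values here are invented for illustration):

```python
import math

def cosine_similarity(a, b):
    """cos of the angle between two vectors: (a.b) / (|a| |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, chunk_vecs, top_k=2):
    """Return indices of the top_k chunks with the smallest angle
    (i.e. largest cosine similarity) to the query vector."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:top_k]

# Toy "embeddings" for three document chunks.
chunk_vecs = [
    [1.0, 0.1, 0.0],   # chunk 0: points almost along the x-axis
    [0.0, 1.0, 0.2],   # chunk 1: nearly orthogonal to the query
    [0.9, 0.2, 0.1],   # chunk 2: also close to the x-axis
]
query_vec = [1.0, 0.0, 0.0]
print(retrieve(query_vec, chunk_vecs))  # -> [0, 2]
```

In a real RAG system the text of the retrieved chunks, not the vectors themselves, is then inserted into the LLM prompt alongside the user's question.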

It would be so much easier if we could give all the required documents to the LLM at once. Well, this is where Google seems to have finally surpassed OpenAI by a huge margin. They just introduced their next-generation multimodal model Gemini 1.5 Pro, which can swallow a monstrous 10 million tokens (roughly 700 books) in one single bite and answer questions about that data with remarkable accuracy. Being multimodal, it can also process long videos in one go.

Currently only selected early testers have access to Google’s new model, and there is no pricing information. Even though it is tempting to abandon RAG development based on this kind of news, the cost of the LLM calls and the document upload overhead of processing millions of tokens every time you want to query your data may be prohibitive for all but the most demanding needs. So, at least for now, RAG is not quite dead yet.

A layman-friendly presentation on Google Gemini 1.5:

A more technical lab report: