Towards advances on long-context LLMs and RAG
Newly published LLMs have larger and larger context windows, e.g., GPT-4 Turbo, a model with a 128k input context window. Naturally, this raises the question: is RAG still necessary? Most small-data use cases can fit within a 1M-10M token context window, and tokens will get cheaper and faster to process over time. There is ongoing debate over this topic.
Pain points resolved:
In fact, even 10M tokens is still not enough for large document corpora. For example, 1M tokens is only around seven Uber SEC 10-K filings, while many enterprise knowledge corpora are gigabytes or terabytes in size.
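To make the scale concrete, here is a rough back-of-the-envelope sketch, assuming the common heuristic of ~4 characters per token; the corpus sizes and the ratio are illustrative assumptions, not measurements.

```python
# Rough estimate of how many tokens a plain-text corpus contains,
# assuming ~4 characters per token (a common heuristic for English text).
CHARS_PER_TOKEN = 4  # heuristic; the real ratio depends on the tokenizer and language

def estimate_tokens(corpus_size_bytes: int) -> int:
    """Approximate token count for a plain-text corpus of the given size."""
    return corpus_size_bytes // CHARS_PER_TOKEN

for label, size_bytes in [
    ("100 MB corpus", 100 * 1024**2),
    ("1 GB corpus", 1 * 1024**3),
    ("1 TB corpus", 1 * 1024**4),
]:
    tokens = estimate_tokens(size_bytes)
    print(f"{label}: ~{tokens / 1e6:.0f}M tokens "
          f"({tokens / 10_000_000:.1f}x a 10M-token context window)")
```

Under these assumptions, a 1 GB plain-text corpus already works out to roughly 270M tokens, about 27x a 10M-token window, and a terabyte-scale corpus is thousands of times larger than that.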
A recent study found that LLMs perform better when the relevant information is located at the beginning or end of the input context.
However, when the relevant information sits in the middle of a long context, performance degrades considerably. This holds even for models specifically designed for long contexts.
Extended-context models are not necessarily better at using input context.
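One common mitigation, sketched below, is to reorder retrieved chunks so the most relevant ones sit at the beginning and end of the prompt and the least relevant ones fall in the middle. This assumes the retriever returns relevance scores; the Chunk class and scores here are hypothetical, not a specific library's API.

```python
# Illustrative mitigation for the "lost in the middle" effect: alternate
# high-scoring chunks between the front and back of the prompt, so the
# least relevant chunks end up in the middle where recall is weakest.
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    score: float  # relevance score from the retriever (higher = more relevant)

def reorder_for_long_context(chunks: list[Chunk]) -> list[Chunk]:
    """Place the highest-scoring chunks at the start and end of the list."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    front: list[Chunk] = []
    back: list[Chunk] = []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]  # least relevant chunks land in the middle

chunks = [Chunk(f"chunk {i}", score=s) for i, s in enumerate([0.9, 0.2, 0.7, 0.4, 0.8])]
ordered = reorder_for_long_context(chunks)
print([c.score for c in ordered])  # [0.9, 0.7, 0.2, 0.4, 0.8]
```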
A recent work has compared the performance-per-token of ChatGPT-4 under two different scenarios:
To conclude, the experiments show that stuffing the context window isn’t the optimal way to provide information to LLMs. Specifically, they show that:
Embedding models are lagging behind in context length. Most embedding models are BERT/Transformer-based and typically have short context lengths (e.g., 512 tokens). That’s only about two pages of text. So far the largest context window for embeddings is 32k, from M2-BERT. This means that even if the chunks used for synthesis with long-context LLMs can be big, any text chunks used for retrieval still need to be a lot smaller.
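A minimal sketch of this decoupling, sometimes called small-to-big retrieval, is shown below. Small chunks that fit the embedding model's window are embedded for retrieval, and each one points back to a larger parent chunk that is handed to the long-context LLM for synthesis. The chunk sizes, whitespace tokenization, and helper names are simplifying assumptions, not a specific library's API.

```python
# Small-to-big retrieval sketch: embed small chunks that fit the embedding
# model's short context window, but keep a pointer from each small chunk to
# its larger parent chunk, which is what the long-context LLM sees at
# synthesis time.
from dataclasses import dataclass

SMALL_CHUNK_TOKENS = 256    # must stay under the embedding model's limit (~512)
PARENT_CHUNK_TOKENS = 2048  # can be much larger for a long-context LLM

@dataclass
class SmallChunk:
    text: str
    parent_id: int  # index of the larger chunk this piece was carved from

def split(tokens: list[str], size: int) -> list[list[str]]:
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]

def build_index(document: str) -> tuple[list[str], list[SmallChunk]]:
    tokens = document.split()  # stand-in for a real tokenizer
    parents = [" ".join(t) for t in split(tokens, PARENT_CHUNK_TOKENS)]
    small: list[SmallChunk] = []
    for pid, parent in enumerate(parents):
        for piece in split(parent.split(), SMALL_CHUNK_TOKENS):
            small.append(SmallChunk(" ".join(piece), parent_id=pid))
    return parents, small

# At query time: embed and retrieve over the *small* chunks, then hand the
# corresponding *parent* chunks to the LLM for synthesis.
parents, small_chunks = build_index("word " * 10_000)
top_small = small_chunks[3]                 # pretend this was the best vector match
context_for_llm = parents[top_small.parent_id]
```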
Lost in the Middle, a paper published in 2023, found that models with long context windows do not robustly retrieve information that is buried in the middle of a long prompt. The bigger the context window, the greater the loss of mid-prompt context.