Steve Shirkey’s Post

Steve Shirkey

Director, Azure AI (ANZ, ASEAN, Korea) at Microsoft

Huge release by Daekeun Kim and Hyo (HK) Choi from our AI GBB team in Korea, sharing Korean-language benchmarks for our latest Azure OpenAI model, gpt-4o-mini. Like gpt-4o, the model excels on Korean-language tasks; in this case, mini completely eclipses gpt-35-turbo and nearly matches gpt-4-turbo. The benchmarks and code are open source and extensible beyond our Azure models, so we encourage our customers and the larger community to try it out and even fork/share your own findings. Please let me know in the comments what other Asian languages beyond Korean you may be looking to evaluate with your language models.

Daekeun Kim

AI Global Black Belt @ Microsoft | ex-AWS

As different LLM/SLM models continue to emerge, many customers want to quickly see how an LLM/SLM performs on basic evaluation datasets. We have implemented and released benchmarking code that measures how accurately an LLM solves the multiple-choice questions in the CLIcK (Cultural and Linguistic Intelligence in Korean) and HAE_RAE_BENCH 1.0 datasets. The code is based on the implementation at https://lnkd.in/gnVsNtq9, with many modifications and additions to make it suitable for benchmarking.

Together with Hyo (HK) Choi, I benchmarked four models: GPT-4o-mini (2024-07-18), GPT-4o (2024-05-13), GPT-4-turbo (2024-04-09), and GPT-3.5-turbo (2023-06-13). The results show that GPT-4o-mini's performance is very impressive: it outperforms GPT-3.5-turbo on all metrics and approaches GPT-4-turbo on some. You can also benchmark custom models, including Hugging Face models, so feel free to compare other models using these metrics as a baseline.

Code: https://lnkd.in/geCXW2Nz

#azureopenai #gpt4omini #gpt4o
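The core of a multiple-choice benchmark like this is simple: format each item as a lettered question, ask the model for a letter, and score accuracy against the gold answer. A minimal sketch is below; the `Item` fields and the `ask_model` stub are illustrative assumptions, not the actual schema or API of the linked repository.

```python
# Minimal sketch of multiple-choice LLM benchmarking.
# Field names and the ask_model callable are hypothetical, not the
# repository's real interface; swap in an Azure OpenAI or Hugging Face call.
from dataclasses import dataclass

@dataclass
class Item:
    question: str
    choices: list[str]  # answer options, in order
    answer: str         # gold answer letter, e.g. "A"

def format_prompt(item: Item) -> str:
    """Render one item as a lettered multiple-choice prompt."""
    letters = "ABCDE"
    lines = [item.question]
    lines += [f"({letters[i]}) {c}" for i, c in enumerate(item.choices)]
    lines.append("Answer with a single letter.")
    return "\n".join(lines)

def evaluate(items: list[Item], ask_model) -> float:
    """Return accuracy: fraction of items where the model's predicted
    letter matches the gold answer. ask_model(prompt) -> str."""
    correct = 0
    for item in items:
        reply = ask_model(format_prompt(item))
        # Take the first alphabetic character as the predicted letter.
        pred = next((ch for ch in reply.upper() if ch.isalpha()), "")
        correct += int(pred == item.answer)
    return correct / len(items) if items else 0.0

# Smoke test with a dummy model that always answers "A".
sample = [
    Item("1 + 1 = ?", ["2", "3"], "A"),
    Item("Capital of Korea?", ["Busan", "Seoul"], "B"),
]
print(evaluate(sample, lambda prompt: "A"))  # 0.5
```

In practice the answer-extraction step is the fragile part: models often reply with "(B)" or a full sentence, so a benchmark harness typically parses the first letter or constrains the output format in the prompt, as sketched here.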

GitHub - daekeun-ml/evaluate-llm-on-korean-dataset: Performs benchmarking on two Korean datasets with minimal time and effort.


Dipen Mehta

Business focused technology executive passionate about technology driving customer value

3mo

Thai!

Myles Hosford

Zero to X Security & Technology Leader

3mo

Welsh 🏴󠁧󠁢󠁷󠁬󠁳󠁿

Kevin (SangWoo) Kim

Head of Data Biz Div. at SOCAR

3mo

Thank you to your team for the great contribution to the community. I'd like to note that our internal benchmarking for a specific application shows significant inconsistencies between gpt-4o and gpt-4o-mini. The mini model should be used with caution for applications that need strong reasoning ability.

