"While measurement benchmarks help quantify an LLM’s factual gaps post-training, the focus now needs to shift towards runtime monitoring and verification in real-world deployments."...."By continuously extracting assertions from LLM responses and matching them against such a domain-specific KG, contradictions can point to potential hallucinations. Tracking this metric over time provides a holistic view into factual drift." Will runtime hallucination monitoring be the new DevOps?
Leveraging Structured Knowledge to Automatically Detect Hallucination in Large Language Models

Large Language Models have sparked a revolution in AI's natural language capabilities. These foundation models can generate impressively coherent text on practically any topic when prompted. However, concerns around factual consistency and hallucinated content have accompanied their rise. Despite strong performance on closed-domain datasets, open-ended queries can expose distortions in an LLM's world knowledge. For instance, LLMs may generate plausible but incorrect answers by confusing entities, relations, or temporal events. Or they may conflate details from disjoint contexts when operating beyond their training distribution. These factual inaccuracies point to fundamental limitations in reasoning over open domains.

While measurement benchmarks help quantify an LLM's factual gaps post-training, the focus now needs to shift towards runtime monitoring and verification in real-world deployments. As organizations increasingly integrate conversational interfaces powered by LLMs, maintaining alignment with truth is critical for reliability and trust. Manual fact-checking is expensive, lacks throughput, and proves infeasible for niche domains. By continuously extracting assertions from LLM responses and matching them against a domain-specific knowledge graph (KG), contradictions can point to potential hallucinations. Tracking this metric over time provides a holistic view of factual drift. KGs provide fixed positional references for assessing deviations within an LLM's fluid generative space. Combining the strengths of neural representation learning and symbolic knowledge anchoring paves the path ahead for not just detecting but also correcting departures from reality.

Building a high-performing retrieval-augmented generation (RAG) system that continuously improves also requires an effective data flywheel. This virtuous cycle of instrumentation, analysis, tracing issues to data gaps, improving underlying data sources, and iteration can significantly enhance systems that combine knowledge graphs and large language models for question answering at inference time. By systematically detecting problematic responses and expanding the knowledge graph to address deficiencies, the data flywheel enables such systems to learn incrementally in a managed, targeted way. Tracing poor responses during usage back to missing entities, relations, or facts in the integrated knowledge substrate allows targeted augmentation and fine-tuning to improve performance and trustworthiness. The flywheel effect also reduces the need for manual oversight by codifying the improvement loop.

https://lnkd.in/eGMrJzT6
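A minimal sketch of the assertion-vs-KG check described above. The triple format, the toy in-memory KG, and the extract_assertions() helper are illustrative assumptions, not any particular product's API:

```python
# Sketch: flag potential hallucinations by checking LLM assertions against a
# domain-specific knowledge graph (KG). Everything here is a simplification:
# a real system would use a graph store and an information-extraction model.

from typing import NamedTuple

class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str

# Toy domain KG: a set of known-true (subject, relation, object) triples.
KNOWLEDGE_GRAPH = {
    Triple("aspirin", "treats", "headache"),
    Triple("aspirin", "contraindicated_with", "warfarin"),
}

def extract_assertions(llm_response: str) -> list[Triple]:
    """Placeholder for the extraction step (e.g. an OpenIE model or a second
    LLM call) that turns a free-text response into candidate triples."""
    raise NotImplementedError

def check_against_kg(assertions: list[Triple]) -> dict:
    """Classify each assertion as supported, contradicted, or unknown."""
    supported, contradicted, unknown = [], [], []
    for a in assertions:
        if a in KNOWLEDGE_GRAPH:
            supported.append(a)
        elif any(t.subject == a.subject and t.relation == a.relation and t.obj != a.obj
                 for t in KNOWLEDGE_GRAPH):
            # Same subject and relation but a different object: likely contradiction.
            contradicted.append(a)
        else:
            unknown.append(a)
    total = max(len(assertions), 1)
    return {
        "supported": supported,
        "contradicted": contradicted,
        "unknown": unknown,
        # The per-response contradiction rate is the metric to track over time
        # to observe factual drift.
        "contradiction_rate": len(contradicted) / total,
    }
```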
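And a rough sketch of the data-flywheel loop, building on the helpers above. The threshold and the review queue are assumptions made for illustration; the point is that unknown or contradicted claims become signals for targeted KG augmentation rather than manual spot checks:

```python
# Sketch of the flywheel: instrument responses, trace failures to KG gaps,
# queue candidate facts for curation, and iterate as the KG grows.

from collections import deque

REVIEW_QUEUE: deque = deque()   # candidate triples awaiting human/automated curation
UNKNOWN_RATE_THRESHOLD = 0.5    # tune per domain

def flywheel_step(llm_response: str) -> None:
    assertions = extract_assertions(llm_response)   # from the sketch above
    report = check_against_kg(assertions)
    unknown_rate = len(report["unknown"]) / max(len(assertions), 1)

    # 1. Instrument: contradicted claims are direct hallucination signals.
    # 2. Trace: a high unknown rate reveals missing entities/relations in the KG.
    if report["contradicted"] or unknown_rate > UNKNOWN_RATE_THRESHOLD:
        for triple in report["unknown"]:
            REVIEW_QUEUE.append(triple)   # 3. Improve: verified triples get promoted into the KG.
    # 4. Iterate: as the KG expands, fewer claims fall into the "unknown" bucket.
```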
That's extremely funny. Fascinating and cool, but also hilarious. 🤣
100%
Awesome product that you built Mike Tung 🤖
Glad you like it!
Real-world deployments demand continuous monitoring and verification to uphold reliability and trust.