Leon Shelhamer’s Post

Full Stack Developer

7mo Edited

The LUP Paradox: Unveiling the Limits of Language Models

In the burgeoning field of artificial intelligence, language models like GPT (Generative Pre-trained Transformer) are revolutionizing how machines understand and generate human language. However, these models are not without flaws, as becomes evident through what we can now call "The LUP Paradox." The term describes a peculiar limitation observed when interacting with these models: their tendency to provide an answer even when no appropriate answer exists.

The Paradox Explained

The LUP Paradox manifests when an AI is asked to perform a seemingly simple task: generate a five-letter English word ending in "lup." The catch? No such word exists in the English language. Ideally, the model should recognize that the task is impossible and say so. In practice, language models often deviate from this expectation.

When confronted with this query, language models tend to offer a response anyway, even if it means fabricating a word. Initially, they might suggest a real word that doesn't fit the criteria, such as "syrup" or "setup," which end in "up" but not in "lup." When pressed further for accuracy, they might invent entirely new words, steadfast in their mission to provide an answer.

Underlying Mechanisms

This behavior underscores a fundamental aspect of how language models are trained. They are designed to predict the most likely next word (more precisely, the next token) in a sequence, drawing on vast datasets of text. They are not inherently equipped with mechanisms to evaluate the factual correctness of their outputs or to check whether a given word construction is feasible. Instead, their primary function is to generate text that is statistically probable, or at least plausible, given the data they were trained on.

Implications for AI Development

The LUP Paradox is a stark reminder of the current limitations of language models. It highlights the need for advancements in AI that incorporate not just linguistic probability but also semantic understanding and factual accuracy. Addressing this issue could involve enhancing the training process to include negative examples, or developing new methodologies that allow models to admit uncertainty or lack of knowledge.

Conclusion

As we continue to integrate AI into various aspects of daily life, understanding and addressing the limitations exemplified by the LUP Paradox will be crucial. This involves not only improving the technology itself but also setting realistic expectations for what AI can and cannot do. By acknowledging these limits, we can better harness the power of AI while mitigating the risks associated with its imperfections.
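The check the post wishes a model would run before answering can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SAMPLE_WORDS` is a hypothetical toy vocabulary standing in for a real dictionary file, and the function names are made up for this example.

```python
# Toy vocabulary (an assumption for illustration); a real check would
# load a full English word list instead.
SAMPLE_WORDS = ["syrup", "setup", "tulip", "gulp", "plump", "julep"]


def words_matching(vocabulary, length, suffix):
    """Return the sorted vocabulary words of the given length ending in suffix."""
    suffix = suffix.lower()
    return sorted(
        {w.lower() for w in vocabulary
         if len(w) == length and w.lower().endswith(suffix)}
    )


def is_valid_answer(candidate, vocabulary, length, suffix):
    """Check a model's proposed word against the vocabulary and both constraints."""
    word = candidate.lower()
    known = {w.lower() for w in vocabulary}
    return word in known and len(word) == length and word.endswith(suffix.lower())


if __name__ == "__main__":
    matches = words_matching(SAMPLE_WORDS, 5, "lup")
    # An empty result means the honest reply is "no such word in this list".
    print(matches if matches else "no matching word in the vocabulary")
    print(is_valid_answer("syrup", SAMPLE_WORDS, 5, "lup"))
```

An empty result here only says that the toy list contains no match; the quality of the answer depends entirely on the word list supplied.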

2 Comments

Leon Shelhamer

Full Stack Developer

ChatGPT o1-preview solved the LUP paradox, though it did have to think for 7 seconds.

Daniel Englert

Associate Director of Technology at AREA 23

6mo

I've already started using the LUP paradox in conversations.


More Relevant Posts

Mike Tung 🤖

CEO at Diffbot
9mo
"While measurement benchmarks help quantify an LLM’s factual gaps post-training, the focus now needs to shift towards runtime monitoring and verification in real-world deployments."...."By continuously extracting assertions from LLM responses and matching them against such a domain-specific KG, contradictions can point to potential hallucinations. Tracking this metric over time provides a holistic view into factual drift." Will runtime hallucination monitoring be the new DevOps?

Anthony Alcaraz

Senior AI/ML Strategist Startups & VC @AWS - Writing on AI/ML, analysis are my own 👌
9mo

Leveraging Structured Knowledge to Automatically Detect Hallucination in Large Language Models 🔺 🔻

Large Language Models have sparked a revolution in AI’s natural language capabilities. These foundation models can generate impressively coherent text on practically any topic when prompted. However, concerns around factual consistency and hallucinated content have accompanied their rise.

Despite strong performance on closed-domain datasets, open-ended queries can expose distortions in an LLM’s world knowledge. For instance, LLMs may generate plausible but incorrect answers by confusing entities, relations or temporal events. Or they may conflate details from disjoint contexts when operating beyond their training distribution. These factual inaccuracies point to fundamental limitations around reasoning on open domains.

While measurement benchmarks help quantify an LLM’s factual gaps post-training, the focus now needs to shift towards runtime monitoring and verification in real-world deployments. As organizations increasingly integrate conversational interfaces powered by LLMs, maintaining alignment with truth is critical for reliability and trust. Manual fact-checking is expensive, lacks throughput and proves infeasible for niche domains.

One automated alternative is to compare LLM inferences against structured knowledge graphs (KGs), which act as an external memory backbone encoding relational facts about entities and events. By continuously extracting assertions from LLM responses and matching them against such a domain-specific KG, contradictions can point to potential hallucinations. Tracking the rate of contradictions over time provides a holistic view into factual drift. KGs provide fixed reference points for assessing deviations within an LLM’s fluid generative space. Combining the strengths of neural representation learning and symbolic knowledge anchoring paves the path ahead for not just detecting but also correcting departures from reality.

Also, building a high-performing retrieval-augmented generation (RAG) system that continuously improves requires implementing an effective data flywheel. This virtuous cycle of instrumentation, analysis, tracing issues to data gaps, improving underlying data sources, and iteration can significantly enhance systems that use knowledge graphs and large language models for question answering at inference time. By systematically detecting problematic responses and expanding the knowledge graph to address deficiencies, the data flywheel enables such systems to learn incrementally in a managed, targeted way. By tracing poor responses during usage back to missing entities, relations or facts in the integrated knowledge substrate, targeted augmentation and fine-tuning can improve performance and trustworthiness. The flywheel effect also reduces manual oversight needs by codifying the improvement loop.

https://lnkd.in/eGMrJzT6
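The matching step described above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: it assumes assertions have already been extracted as (subject, relation, object) triples, that each (subject, relation) pair has a single correct object, and it uses a hypothetical toy knowledge graph. All names are made up for this example.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str


# Hypothetical toy knowledge graph; a real one would be domain-specific.
KG = {
    Triple("Paris", "capital_of", "France"),
    Triple("Marie Curie", "born_in", "Warsaw"),
}


def check_assertion(assertion, kg):
    """Classify one extracted assertion as supported, contradicted, or unknown."""
    if assertion in kg:
        return "supported"
    # Assumption: relations are functional, so a different object for the
    # same subject and relation counts as a contradiction.
    for fact in kg:
        if fact.subject == assertion.subject and fact.relation == assertion.relation:
            return "contradicted"
    return "unknown"  # the KG has no fact to compare against


def contradiction_rate(assertions, kg):
    """Fraction of assertions contradicted by the KG (0.0 for an empty list)."""
    if not assertions:
        return 0.0
    hits = sum(check_assertion(a, kg) == "contradicted" for a in assertions)
    return hits / len(assertions)


if __name__ == "__main__":
    extracted = [
        Triple("Marie Curie", "born_in", "Paris"),
        Triple("Paris", "capital_of", "France"),
        Triple("Berlin", "capital_of", "Germany"),
    ]
    for a in extracted:
        print(a, check_assertion(a, KG))
    print(contradiction_rate(extracted, KG))
```

Logging the rate per day or per release is one way to watch for the factual drift the post mentions; "unknown" results are candidates for the data flywheel, since they mark facts the KG is missing.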

Leveraging Structured Knowledge to Automatically Detect Hallucination in Large Language Models

medium.com

Jason Fishbein

Your Partner for AI, Data & Analytics || Director of Data & Analytics @ 🚀rockITdata
4mo
Are You Choosing the Right AI Model? Discover the Benchmarks that Matter!

Large Language Models (LLMs) have evolved rapidly in a short period of time. As these models continue to grow in complexity and capability, evaluating their performance through standardized benchmarks becomes essential.

Benchmarks serve as standardized tests designed to evaluate and compare the performance of AI models. They provide a consistent framework for measuring various aspects of model capabilities, such as understanding, generation, reasoning, and efficiency. The primary purposes of benchmarks include:

• Performance Measurement: Quantifying how well a model performs specific tasks.
• Comparison: Allowing researchers and practitioners to compare different models objectively.
• Progress Tracking: Monitoring improvements in AI models over time.
• Guidance: Helping in selecting the most suitable model for particular applications.

Fortunately or unfortunately, there is no shortage of benchmarks, but it is good to get familiar with some of the most commonly used:

• MMLU (Massive Multitask Language Understanding): MMLU is a comprehensive benchmark of multiple-choice questions spanning 57 subjects. It assesses general knowledge, language understanding, and reasoning skills, making it a robust indicator of a model's versatility.
• GLUE (General Language Understanding Evaluation): GLUE is a collection of nine natural language understanding tasks, including sentiment analysis, sentence similarity, and natural language inference. It is widely used to gauge a model's overall language understanding capabilities.
• SuperGLUE: An extension of GLUE, SuperGLUE includes more challenging tasks to push the boundaries of language understanding, such as the Winograd Schema Challenge and commonsense reasoning tasks, which require deeper reasoning.
• HellaSwag: HellaSwag evaluates commonsense reasoning and natural language inference, essential for tasks requiring logical consistency.
• HumanEval: HumanEval is a benchmark for assessing the functional correctness of code generated by large language models. Each programming task is checked by running the generated code against unit tests, so it measures whether the code actually works rather than whether it resembles a reference solution.

While benchmarks provide valuable insights into a model's capabilities, it's crucial to recognize that the highest-ranked model on a particular benchmark may not always be the best choice for every use case. Factors such as model size, computational requirements, specific task requirements, and of course cost play a significant role in determining the most suitable model for a given application. For instance, a model excelling in conversational AI may not be the best fit for tasks requiring extensive factual knowledge retrieval.

#GenAI #AI #ArtificialIntelligence #GenerativeAI #Benchmarking #Benchmark
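To make the idea of functional correctness concrete, here is a minimal sketch of checking generated code against input-output examples. It is not the HumanEval harness: the function name `passes_examples`, the candidate snippets, and the examples are all assumptions made up for illustration, and calling `exec` on untrusted model output is unsafe outside a sandbox.

```python
def passes_examples(source, func_name, examples):
    """Return True if the function defined in `source` matches every example.

    `examples` is a list of (args_tuple, expected_result) pairs.
    Any exception (syntax error, missing function, crash) counts as a failure.
    """
    namespace = {}
    try:
        exec(source, namespace)  # illustration only; sandbox this in practice
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in examples)
    except Exception:
        return False


# Hypothetical model outputs for the task "add two numbers".
CANDIDATES = [
    "def add(a, b):\n    return a + b",
    "def add(a, b):\n    return a - b",
    "def add(a, b)\n    return a + b",  # syntax error on purpose
]
EXAMPLES = [((1, 2), 3), ((0, 0), 0), ((-1, 1), 0)]

if __name__ == "__main__":
    results = [passes_examples(src, "add", EXAMPLES) for src in CANDIDATES]
    print(results)
    print(sum(results) / len(results))  # fraction of candidates that pass
```

Scoring by execution rather than by text similarity is the design choice that matters here: two solutions can look nothing alike and both be correct.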
Swaminathan Arunachalam
5mo
Unlock the Power of Language Models with RAG! Check out my latest Medium article to learn how Retrieval Augmented Generation (RAG) is revolutionizing Large Language Models (LLMs). Discover how RAG improves accuracy, enhances contextual understanding, and overcomes common limitations. Read now and share your thoughts! https://lnkd.in/gCi7BhTf #RAG #LLMs #AI #LanguageModels

Unlocking the Secret to Smarter Language Models

medium.com
Jonathan Boymal

Associate Professor of Economics
2mo Edited
A recent paper, “Do Large Language Models Perform the Way People Expect? Measuring the Human Generalisation Function” by Keyon Vafa, Ashesh Rambachan and Sendhil Mullainathan, explores how human interactions shape the deployment of large language models (LLMs).

LLMs require human involvement to assess their capabilities dynamically. This is because of the sheer diversity of tasks LLMs can perform, many of which do not have existing evaluation metrics. As such, human beliefs and generalisations about the capabilities of LLMs shape how, when, and for what tasks these models are deployed. People do not evaluate LLMs in a vacuum but rather use their subjective experience of how the model performed on specific tasks to guide future use. This human-centric approach to evaluating AI can be both a strength and a weakness. It allows for flexibility in exploring AI capabilities, but also introduces the risk of overconfidence or underestimation based on a narrow scope of interaction.

The human generalisation function refers to how people generalise AI’s capabilities after limited use, often leading to misaligned expectations. This tendency towards generalisation can cause overreliance, particularly with larger models. The paper provides examples where an accurate answer to one question incorrectly affected humans’ beliefs about the model's ability to answer apparently similar questions: an LLM computes an economic quantity correctly but fails to answer a basic arithmetic question; an LLM correctly tracks the positions of players on a soccer team but fails on a very similarly worded problem; an LLM can answer a question about moral philosophy correctly but cannot apply moral reasoning.

LLMs can exhibit surprising inconsistencies when presented with questions that appear very similar to humans. This phenomenon, known as "brittleness," occurs because LLMs don't truly understand the underlying concepts or context in the way humans do.

The paper has important implications for how students should approach using generative AI in their academic work.

1. Understanding AI’s Limitations: AI tools like GPT-4 are powerful but inconsistent across tasks. Students should avoid overgeneralising based on initial successes and test AI across various academic contexts.
2. Selective Deployment: Use AI for low-stakes tasks like idea generation, while being more cautious with high-stakes tasks, reducing the risk of errors in critical work.
3. Calibrate Expectations: Given the human tendency to overestimate AI's abilities, students should treat AI outputs as starting points, not final answers.
4. Iterative Testing: Since human expectations evolve through interaction, students should engage in constant experimentation with AI.
5. Evaluative Judgement: By fostering evaluative judgement, educators can prepare students to interact more effectively and critically with AI tools, mitigating the risks of overreliance and misaligned expectations.
Thomas Sittek

Asset Management | Specialist in AI Governance & Regulatory Compliance | Bridging Law and Technology in Financial Services
1mo
I’ve recently followed the rise of Small Language Models (SLMs) and am genuinely excited about their potential to outperform larger, generalist models like GPT-4. These specialised models are opening new avenues in AI applications, particularly in sectors where efficiency and customisation are paramount. Here are some key points:

1️⃣ Customizable, Efficient, and Cost-Effective
• SLMs vs. LLMs: SLMs are more adaptable and require fewer computational resources compared to large language models (LLMs), making them a cost-effective alternative.

2️⃣ Easy to Adapt for Specific Tasks
• Versatility: They can be fine-tuned for tasks like data analysis, translation, and summarisation with relative ease.

3️⃣ Competitive Performance
• Models like Llama, Mistral, and Granite: These SLMs have posted benchmark results where they compete with, and sometimes outperform, larger models.

4️⃣ Operational Benefits
• Efficiency: Lower memory and compute needs.
• Flexible Deployment: Easier to integrate into existing systems.
• Cost-Effective Iteration: Faster development cycles without the hefty costs.

5️⃣ Advantages for Businesses
• Greater IP Control: Maintain ownership over proprietary models.
• Data Privacy and Security: Enhanced protection of sensitive information.
• No Licensing Issues: Avoid complexities associated with LLM licensing agreements.

For businesses aiming to integrate AI solutions while maintaining compliance and operational efficiency, SLMs present a compelling option. They not only align with robust legal standards but also bridge the gap between technological innovation and the evolving demands of the financial and technology sectors.

📘 Here’s a simple guide to learn more about SLMs and their applicability to business: https://lnkd.in/dSkEsfey

I’m curious to hear your thoughts: How do you see Small Language Models impacting your industry or area of expertise?

The Emergence of Small Language Models

nocode.ai
Xinmeng Huang

Incoming QR @ Citadel Securities
7mo
Does higher uncertainty (lower confidence) of language models necessarily go hand in hand with poorer generation? Check our latest paper, "Uncertainty in Language Models: Assessment through Rank-Calibration." 🔗 https://lnkd.in/eEpnDMeg

Despite the impressive generative capabilities of LMs, principled and unified assessment of the quality of various uncertainty and confidence measures remains a challenge. Our work introduces "Rank-Calibration", a novel framework offering a practical and principled assessment of uncertainty and confidence measures for LMs.

Key Highlights:
* 🔍 A deep dive into why accurately assessing LMs' uncertainty and confidence levels is challenging, and why it matters for advancing AI reliability.
* 🌟 Introducing Rank-Calibration: a framework built on the idea that higher uncertainty should correspond to lower text generation quality, offering a nuanced assessment without binary correctness thresholds.
* 🛠️ Empirical demonstrations of our methods' broad applicability and granular interpretability across multiple tasks and LMs, including the challenging long-form Meadow benchmark.
* 📈 Extensive experiments validating the effectiveness and robustness of our approach, promising a significant tool towards trustworthy language modeling.

Dive into our study to explore how we're paving the way for more reliable and interpretable language generation. A must-read for AI researchers, practitioners, and enthusiasts aiming to push the boundaries of AI safety and effectiveness!

#AI #LanguageModels #UncertaintyQuantification #RankCalibration #MachineLearning #ArtificialIntelligence
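The intuition that higher uncertainty should go with lower quality can be illustrated with a plain Spearman rank correlation. To be clear, this is a swapped-in, simpler technique and not the paper's rank-calibration measure; the scores below are made-up numbers and the function names are assumptions for this sketch.

```python
def average_ranks(values):
    """Return 1-based ranks for values, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Made-up per-response scores: model uncertainty and a continuous quality score.
    uncertainty = [0.1, 0.4, 0.7, 0.9]
    quality = [0.95, 0.8, 0.5, 0.2]
    print(spearman(uncertainty, quality))
```

A value near -1 means the uncertainty measure orders responses almost exactly opposite to their quality, which is the desirable direction; because only ranks are used, no binary correct/incorrect threshold is needed.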

2404.03163.pdf

arxiv.org
Stephen S.

Founder - The Prompt Index & The Ministry of AI | 1 AI Resource | AI Education
2w
Title: Watermarking Language Models through Language Models

I'm finding and summarizing interesting AI research papers every day so you don't have to trawl through them all. Today's paper is titled "Watermarking Language Models through Language Models" by Xin Zhong, Agnibh Dasgupta, and Abdullah Tanvir.

This paper introduces a framework for watermarking language models using a multi-model configuration. The framework uses a Prompting language model to generate watermarking instructions, a Marking language model to embed these watermarks within generated content, and a Detecting language model to verify their presence. The authors conducted experiments with ChatGPT and Mistral as the core models, achieving strong results in identifying watermarked content.

Key points from the paper include:

1. Dynamic Watermarking Process: The proposed system dynamically adjusts the watermarking strategies based on the output generated by the language models, ensuring flexibility and robustness. This method marks an advancement over traditional static watermarking strategies by offering more adaptability to varying content.
2. High Detection Accuracy: Evaluation results indicate compelling detection accuracy: 95% for outputs generated by ChatGPT and 88.79% for those from Mistral. This demonstrates the effectiveness of the framework in accurately distinguishing between watermarked and non-watermarked text.
3. Applications in Model Authentication: The framework's ability to embed and detect watermarks holds significant promise for applications in content attribution and model authentication, which is crucial for protecting intellectual property and verifying source authenticity.
4. Robustness Across Different Models: The multi-model setup shows adaptability across various language model architectures, suggesting potential scalability for broader applications. However, the paper notes that model-specific training may yield better detection performance compared to a unified detection model trained on multiple language models.
5. Enhanced Verification Methods: By leveraging prompts generated by language models themselves to guide watermarking, the research explores a form of watermarking that does not require access to model parameters, thus extending usability even to systems with restricted API access.

You can catch the full breakdown here: https://lnkd.in/e5Eh82kt

You can read the original research paper here: https://lnkd.in/e96g6w2e

If you're looking to improve your AI prompting skills, check out our free Advanced Prompt Engineering course: https://lnkd.in/ecB-XxY7

Follow for daily AI research paper breakdowns.
Martin Ciupa

AI Entrepreneur. Keynote Speaker, Interests in: AI/Cybernetics, Physics, Consciousness Studies/Neuroscience, Philosophy: Ethics/Ontology/Maths/Science. Life and Love.
9mo Edited
Title: Leveraging Structured Knowledge to Automatically Detect Hallucination in Large Language Models — article by Anthony Alcaraz “This article proposes an approach for automated hallucination detection by comparing LLM inferences against structured knowledge graphs (KGs). KGs act as an external memory backbone, encoding relational facts about entities and events. By continuously extracting assertions from LLM responses and matching them against such a domain-specific KG, contradictions can point to potential hallucinations. Tracking this metric over time provides a holistic view into factual drift.”

BBVA AI Factory

18,839 followers
6mo
🤖 Large Language Models (#LLM) are behind #AI assistants such as #ChatGPT and #Gemini. They are well-known for their ability to engage in human-like conversations, but there’s so much more to them! The ability to communicate with technology using natural language has expanded the possibilities of what we can achieve.

LLMs are helping us accomplish a wide range of tasks:
🏷 Reducing time spent on data labeling
📈 Minimizing effort in evaluation tasks
✨ Enriching model answers with large knowledge bases
🧪 Generating synthetic data
🗣 Invoking or performing actions

If you're curious about these applications, follow the link 👇 https://lnkd.in/dSw7P6Um

Large Language Models beyond dialogue - BBVA AI Factory

bbvaaifactory.com

