No TL;DR for this edition. We will cover an important topic that demands we slow down to read and reflect on it before your next AI related meeting. We understand our attention spans have been reduced to to single-digit seconds but this topic is too important to skim over.
Earlier in April 2025 Google DeepMind published a blog post along with a 145-page paper titled An Approach to Technical AGI Safety and Security. At Suzega we are keeping a close watch on evolving model capabilities as well as safety research the leading labs are publishing. There is a lot to unpack just in the title of this report. First let us consider AGI - Artificial General Intelligence. Although there is no single definition that everyone agrees on, it usually means AI that is at the same level or exceeds human capabilities in most cognitive tasks. Not everyone agrees on when we will reach such capability and the estimates range from a year to within a decade. The vibes in Silicon Valley these days are about asking each other if they can “feel the AGI”. That likely means feel the advanced capabilities and also feel the weight of the immense responsibility of what we are creating.
Combine such advanced machine intelligence with the powerful agentic capabilities and one can expect revolutionary advances in fields like medicine, education and climate. Without sounding alarmist, we have found that many leaders have yet to realize that AI is absolutely not like any other technology we have implemented in the past. We need a responsible approach and the first step is to understand several key areas of safety testing. This is a vast and rapidly evolving field to cover in one edition of this newsletter. We will spread it over the coming weeks and months starting with the testing of model’s capacity for deception.
Deception vs Hallucinations
As most readers will likely know, AI models may provide false or misleading yet plausible and confident sounding responses. These are called “hallucinations” and it is important to understand the difference between this and “deception”. Hallucination is unintentional false information generated as a byproduct of how AI models learn patterns from data without verifying factual accuracy. There are many examples in media reports of AI inventing fictional case citations that lawyers relied on in court, an airline company’s chatbot providing incorrect fare information to more serious examples of harmful medical advice. While all these are examples of misinformation, they are different from a deliberate attempt to create and spread falsehood or disinformation.
Deception emerges in modern advanced models when they learn that misrepresentation is an effective strategy for maximizing reward signals. These can be goals like self-preservation, passing safety tests or winning games. With our experience with human behavior, these may seem as scheming, deliberate and intentional misleading or hiding the truth to achieve a goal. There are many examples published but let us consider the one documented by OpenAI in their testing. GPT-4 seemed to engage in strategic deception when tasked to hire a human via TaskRabbit to solve a CAPTCHA test. You can read more about this elsewhere but the story goes that without being instructed to lie, GPT-4 claimed it had a visual impairment and successfully deceived the human by framing the request as an accessibility need.
It is outside the scope of this short article to go into the details of how AI learns to deceive but in short it is a result of training on vast amounts of human generated data that inevitably contain how humans are biased, deceive, manipulate and present opinions as facts. We are building AI in our own image and while hallucinations are an accident, deceptions are on purpose!
Common Types of AI Deception
We hope this article thus far has inspired you to research more about the world of AI safety testing. For enterprise leaders it is important to understand the different forms of deceptions based on results of specific experiments published by the leading labs.
Strategic Deception - This involved AI systems deliberately misleading humans to achieve specific goals. For enterprises deploying AI agents the risk is of these agents manipulating negotiations, misleading stakeholders or engage in unauthorized actions.
Sycophancy - The tendency to tell you what you want to hear. Just this week we heard of how a recent ChatGPT update was rolled back for serving overly flattering responses to its users. Models learn this behavior that agreeableness maximizes positive feedback from human raters during their training. For enterprise leaders it is important to know the risk of reinforcing echo chambers, poor decision making based on validated biases and diminished AI credibility when overly positive feedback is discovered.
Sleeper Agents - To be honest the authors of this newsletter were taken aback when we learned about this for the first time. Anthropic’s paper last year explores whether models can be trained to be deceptive, exhibiting helpful behavior during training but activating harmful actions when triggered—similar to sleeper agents. The research demonstrated models writing secure code when dated 2023 but inserting vulnerabilities when dated 2024. They highlighted that current methods of safety testing may not be adequate and even create a false sense of safety. For enterprises this poses an unknown risk if models contain hidden malicious capabilities that can be triggered later despite passing safety evaluations.
We have simply scratched the surface and there is much more—sandbagging, hiding capabilities, cheating safety tests, unfaithful reasoning to name a few.
Implications for Enterprise Leaders
While the leading model providers are investing in extensive safety research, testing and sharing their results, it is important for enterprise leaders to be aware of this growing field of safety testing. The stakes are high with potential erosion of trust with customers, employees, stakeholders and society. Not to mention legal liabilities. Understanding specific impacts is critical for developing effective governance and mitigation strategies. There are many strategies with this emerging field and we are all learning together as models gets more advanced and start showing these emergent behaviors. Red teaming, adversarial testing, truthfulness benchmarks and understanding AI’s mind are just a few examples.
AI deception is real, enterprise risks are tangible and standard safety testing is not sufficient.
We are building the most transformative technology of our times and we need to do it extreme care. We shared our approach in the last edition to focus on amplifying human capabilities unique to your organization. In this edition we introduced the need for building more honest AI acknowledging the associated complex risks. Proactive governance, continuous vigilance and commitment to trustworthy AI principles give us a good starting point as this field evolves. Remember that unlike previous technologies, AI has the potential to independently learn and apply deceptive strategies. Let us expand the interpretation of “feel the AGI” to also include a comprehensive approach to AGI safety.