AI safety


AI safety is an interdisciplinary field focused on preventing accidents, misuse, or other harmful consequences arising from artificial intelligence systems. It encompasses AI alignment, monitoring AI systems for risks, and enhancing their robustness. The field is particularly concerned with existential risks posed by advanced AI models.
Beyond technical research, AI safety involves developing norms and policies that promote safety. It gained significant popularity in 2023, with rapid progress in generative AI and public concerns voiced by researchers and CEOs about potential dangers. During the 2023 AI Safety Summit, the United States and the United Kingdom both established their own AI Safety Institute. However, researchers have expressed concern that AI safety measures are not keeping pace with the rapid development of AI capabilities.

Motivations

Scholars discuss current risks from critical systems failures, bias, and AI-enabled surveillance, as well as emerging risks like technological unemployment, digital manipulation, weaponization, AI-enabled cyberattacks and bioterrorism. They also discuss speculative risks from losing control of future artificial general intelligence agents, or from AI enabling perpetually stable dictatorships.

Existential safety

Some have criticized concerns about AGI, such as Andrew Ng who compared them in 2015 to "worrying about overpopulation on Mars when we have not even set foot on the planet yet". Stuart J. Russell on the other side urges caution, arguing that "it is better to anticipate human ingenuity than to underestimate it".
AI researchers have widely differing opinions about the severity and primary sources of risk posed by AI technology – though surveys suggest that experts take high consequence risks seriously. In two surveys of AI researchers, the median respondent was optimistic about AI overall, but placed a 5% probability on an "extremely bad " outcome of advanced AI. In a 2022 survey of the natural language processing community, 37% agreed or weakly agreed that it is plausible that AI decisions could lead to a catastrophe that is "at least as bad as an all-out nuclear war".

History

Risks from AI began to be seriously discussed at the start of the computer age:
In 1988 Blay Whitby published a book outlining the need for AI to be developed along ethical and socially responsible lines.
From 2008 to 2009, the Association for the Advancement of Artificial Intelligence commissioned a study to explore and address potential long-term societal influences of AI research and development. The panel was generally skeptical of the radical views expressed by science-fiction authors but agreed that "additional research would be valuable on methods for understanding and verifying the range of behaviors of complex computational systems to minimize unexpected outcomes".
In 2011, Roman Yampolskiy introduced the term "AI safety engineering" at the Philosophy and Theory of Artificial Intelligence conference, listing prior failures of AI systems and arguing that "the frequency and seriousness of such events will steadily increase as AIs become more capable".
In 2014, philosopher Nick Bostrom published the book Superintelligence: Paths, Dangers, Strategies. He has the opinion that the rise of AGI has the potential to create various societal issues, ranging from the displacement of the workforce by AI, manipulation of political and military structures, to even the possibility of human extinction. His argument that future advanced systems may pose a threat to human existence prompted Elon Musk, Bill Gates, and Stephen Hawking to voice similar concerns.
In 2015, dozens of artificial intelligence experts signed an open letter on artificial intelligence calling for research on the societal impacts of AI and outlining concrete directions. To date, the letter has been signed by over 8000 people including Yann LeCun, Shane Legg, Yoshua Bengio, and Stuart Russell.
In the same year, a group of academics led by professor Stuart J. Russell founded the Center for Human-Compatible AI at the University of California Berkeley and the Future of Life Institute awarded $6.5 million in grants for research aimed at "ensuring artificial intelligence remains safe, ethical and beneficial".
In 2016, the White House Office of Science and Technology Policy and Carnegie Mellon University announced The Public Workshop on Safety and Control for Artificial Intelligence, which was one of a sequence of four White House workshops aimed at investigating "the advantages and drawbacks" of AI. In the same year, Concrete Problems in AI Safety – one of the first and most influential technical AI Safety agendas – was published.
In 2017, the Future of Life Institute sponsored the Asilomar Conference on Beneficial AI, where more than 100 thought leaders formulated principles for beneficial AI including "Race Avoidance: Teams developing AI systems should actively cooperate to avoid corner-cutting on safety standards".
In 2018, the DeepMind Safety team outlined AI safety problems in specification, robustness, and assurance. The following year, researchers organized a workshop at ICLR that focused on these problem areas.
In 2021, Unsolved Problems in ML Safety was published, outlining research directions in robustness, monitoring, alignment, and systemic safety.
In 2023, Rishi Sunak said he wants the United Kingdom to be the "geographical home of global AI safety regulation" and to host the first global summit on AI safety. The AI safety summit took place in November 2023, and focused on the risks of misuse and loss of control associated with frontier AI models. During the summit the intention to create the International Scientific Report on the Safety of Advanced AI was announced.
In 2024, The US and UK forged a new partnership on the science of AI safety. The MoU was signed on 1 April 2024 by US commerce secretary Gina Raimondo and UK technology secretary Michelle Donelan to jointly develop advanced AI model testing, following commitments announced at an AI Safety Summit in Bletchley Park in November.
In 2025, an international team of 96 experts chaired by Yoshua Bengio published the first International AI Safety Report. The report, commissioned by 30 nations and the United Nations, represents the first global scientific review of potential risks associated with advanced artificial intelligence. It details potential threats stemming from misuse, malfunction, and societal disruption, with the objective of informing policy through evidence-based findings, without providing specific recommendations.

Research focus

AI safety research areas include robustness, monitoring, and alignment.

Robustness

Adversarial robustness

AI systems are often vulnerable to adversarial examples or "inputs to machine learning models that an attacker has intentionally designed to cause the model to make a mistake". For example, in 2013, Szegedy et al. discovered that adding specific imperceptible perturbations to an image could cause it to be misclassified with high confidence. This continues to be an issue with neural networks, though in recent work the perturbations are generally large enough to be perceptible.
The image on the right is predicted to be an ostrich after the perturbation is applied. is a correctly predicted sample, perturbation applied magnified by 10x, adversarial example.
Adversarial robustness is often associated with security. Researchers demonstrated that an audio signal could be imperceptibly modified so that speech-to-text systems transcribe it to any message the attacker chooses. Network intrusion and malware detection systems also must be adversarially robust since attackers may design their attacks to fool detectors.
Models that represent objectives must also be adversarially robust. For example, a reward model might estimate how helpful a text response is and a language model might be trained to maximize this score. Researchers have shown that if a language model is trained for long enough, it will leverage the vulnerabilities of the reward model to achieve a better score and perform worse on the intended task. This issue can be addressed by improving the adversarial robustness of the reward model. More generally, any AI system used to evaluate another AI system must be adversarially robust. This could include monitoring tools, since they could also potentially be tampered with to produce a higher reward.
Large language models can be vulnerable to prompt injection and model stealing, and may be used to generate misinformation. Prompt injection involves embedding instructions into prompts in order to bypass safety measures.

Monitoring

Estimating uncertainty

It is often important for human operators to gauge how much they should trust an AI system, especially in high-stakes settings such as medical diagnosis. ML models generally express confidence by outputting probabilities; however, they are often overconfident, especially in situations that differ from those that they were trained to handle. Calibration research aims to make model probabilities correspond as closely as possible to the true proportion that the model is correct.
Similarly, anomaly detection or out-of-distribution detection aims to identify when an AI system is in an unusual situation. For example, if a sensor on an autonomous vehicle is malfunctioning, or it encounters challenging terrain, it should alert the driver to take control or pull over. Anomaly detection has been implemented by simply training a classifier to distinguish anomalous and non-anomalous inputs, though a range of additional techniques are in use.

Detecting malicious use

Scholars and government agencies have expressed concerns that AI systems could be used to help malicious actors to build weapons, manipulate public opinion, or automate cyber attacks. These worries are a practical concern for companies like OpenAI which host powerful AI tools online. In order to prevent misuse, OpenAI has built detection systems that flag or restrict users based on their activity.