Explainable artificial intelligence


Within artificial intelligence, explainable AI (XAI), generally overlapping with interpretable AI or explainable machine learning, is a field of research that explores methods giving humans intellectual oversight over AI algorithms. The main focus is on the reasoning behind the decisions or predictions made by AI algorithms, to make them more understandable and transparent. This addresses users' need to assess the safety of such systems and to scrutinize their automated decision-making. XAI counters the "black box" tendency of machine learning, where even the AI's designers cannot explain why it arrived at a specific decision.
XAI hopes to help users of AI-powered systems perform more effectively by improving their understanding of how those systems reason. XAI may be an implementation of the social right to explanation. Even if there is no such legal right or regulatory requirement, XAI can improve the user experience of a product or service by helping end users trust that the AI is making good decisions. XAI aims to explain what has been done, what is being done, and what will be done next, and to unveil which information these actions are based on. This makes it possible to confirm existing knowledge, challenge existing knowledge, and generate new assumptions.

Background

Algorithms used in AI can be categorized as white-box or black-box. White-box models provide results that are understandable to experts in the domain. Black-box models, on the other hand, are extremely hard to explain and may not be understood even by domain experts. XAI algorithms follow the three principles of transparency, interpretability, and explainability.
  • A model is transparent "if the processes that extract model parameters from training data and generate labels from testing data can be described and motivated by the approach designer."
  • Interpretability describes the possibility of comprehending the ML model and presenting the underlying basis for decision-making in a way that is understandable to humans.
  • Explainability is a concept that is recognized as important, but a consensus definition is not yet available; one possibility is "the collection of features of the interpretable domain that have contributed, for a given example, to producing a decision".
In summary, Interpretability refers to the user's ability to understand model outputs, while Model Transparency includes Simulatability, Decomposability, and Algorithmic Transparency. Model Functionality focuses on textual descriptions, visualization, and local explanations, which clarify specific outputs or instances rather than entire models. All these concepts aim to enhance the comprehensibility and usability of AI systems.
If algorithms fulfill these principles, they provide a basis for justifying decisions, tracking them and thereby verifying them, improving the algorithms, and exploring new facts.
Sometimes it is also possible to achieve a high-accuracy result with white-box ML algorithms. These algorithms have an interpretable structure that can be used to explain predictions. Concept Bottleneck Models, which use concept-level abstractions to explain model reasoning, are examples of this and can be applied in both image and text prediction tasks. This is especially important in domains like medicine, defense, finance, and law, where it is crucial to understand decisions and build trust in the algorithms. Many researchers argue that, at least for supervised machine learning, the way forward is symbolic regression, where the algorithm searches the space of mathematical expressions to find the model that best fits a given dataset.
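As a minimal sketch of the symbolic-regression idea, the example below (with invented data and a hand-picked set of candidate transforms) scores every expression of the form f(x0) + g(x1) and keeps the best-fitting one; practical systems such as genetic programming search far larger expression spaces.

```python
# A minimal sketch of symbolic regression: exhaustively score a small space of
# candidate expressions and keep the one that best fits the data. Real systems
# search much larger spaces, e.g. with genetic programming.
import itertools
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(200, 2))
y = X[:, 0] ** 2 + np.sin(X[:, 1])   # hidden ground-truth relationship

# Candidate building blocks: unary transforms applied to each input variable.
unary = {
    "identity": lambda v: v,
    "square": lambda v: v ** 2,
    "sin": np.sin,
    "exp": np.exp,
}

best_expr, best_err = None, np.inf
# Score every expression of the form f(x0) + g(x1) by mean squared error.
for (n0, f0), (n1, f1) in itertools.product(unary.items(), repeat=2):
    pred = f0(X[:, 0]) + f1(X[:, 1])
    err = np.mean((pred - y) ** 2)
    if err < best_err:
        best_expr, best_err = f"{n0}(x0) + {n1}(x1)", err

print(best_expr, best_err)   # recovers "square(x0) + sin(x1)", a human-readable model
```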
AI systems optimize behavior to satisfy a mathematically specified goal system chosen by the system designers, such as the command "maximize the accuracy of assessing how positive film reviews are in the test dataset." The AI may learn useful general rules from the test set, such as "reviews containing the word 'horrible' are likely to be negative." However, it may also learn inappropriate rules, such as "reviews containing 'Daniel Day-Lewis' are usually positive"; such rules may be undesirable if they are likely to fail to generalize outside the training set, or if people consider the rule to be "cheating" or "unfair." A human can audit rules in an XAI to get an idea of how likely the system is to generalize to future real-world data outside the test set.
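For instance, when the model is a white-box bag-of-words classifier, the learned "rules" can be read off as per-word weights; the sketch below (with an invented four-review dataset) lets a reviewer check whether sentiment words, rather than incidental terms such as an actor's name, carry the weight.

```python
# A minimal sketch of auditing a white-box text classifier: fit a bag-of-words
# logistic regression on a toy review dataset and inspect the per-word weights
# it has learned. The reviews and labels below are illustrative only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

reviews = [
    "a horrible boring film",
    "horrible acting and a dull plot",
    "a wonderful performance by daniel day-lewis",
    "wonderful, moving and beautifully shot",
]
labels = [0, 0, 1, 1]   # 0 = negative, 1 = positive

vec = CountVectorizer()
X = vec.fit_transform(reviews)
clf = LogisticRegression().fit(X, labels)

# Each coefficient is the learned weight of one word; a large weight on an
# actor's name rather than on sentiment words would be a sign of "cheating".
for word, weight in sorted(zip(vec.get_feature_names_out(), clf.coef_[0]),
                           key=lambda pair: pair[1]):
    print(f"{word:>15s}  {weight:+.2f}")
```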

Goals

Cooperation between agents – in this case, algorithms and humans – depends on trust. If humans are to accept algorithmic prescriptions, they need to trust them. Formal criteria for trust are still incomplete, which makes them hard to optimize for directly; transparency, interpretability, and explainability are intermediate goals on the road to these more comprehensive trust criteria. This is particularly relevant in medicine, especially with clinical decision support systems, in which medical professionals should be able to understand how and why a machine-based decision was made in order to trust the decision and augment their decision-making process.
AI systems sometimes learn undesirable tricks that do an optimal job of satisfying explicit pre-programmed goals on the training data but do not reflect the more nuanced implicit desires of the human system designers or the full complexity of the domain data. For example, a 2017 system tasked with image recognition learned to "cheat" by looking for a copyright tag that happened to be associated with horse pictures rather than learning how to tell if a horse was actually pictured. In another 2017 system, a supervised learning AI tasked with grasping items in a virtual world learned to cheat by placing its manipulator between the object and the viewer in a way such that it falsely appeared to be grasping the object.
One transparency project, the DARPA XAI program, aims to produce "glass box" models that are explainable to a "human-in-the-loop" without greatly sacrificing AI performance. Human users of such a system can understand the AI's cognition and can determine whether to trust the AI. Other applications of XAI are knowledge extraction from black-box models and model comparisons. In the context of monitoring systems for ethical and socio-legal compliance, the term "glass box" is commonly used to refer to tools that track the inputs and outputs of the system in question, and provide value-based explanations for their behavior. These tools aim to ensure that the system operates in accordance with ethical and legal standards, and that its decision-making processes are transparent and accountable. The term "glass box" is often used in contrast to "black box" systems, which lack transparency and can be more difficult to monitor and regulate.
"Glass box" has also been used to name a voice assistant that produces counterfactual statements as explanations.

Explainability and interpretability techniques

There is a subtle difference between the terms explainability and interpretability in the context of AI.
  • Interpretability: "level of understanding how the underlying technology works" (ISO/IEC TR 29119-11:2020, 3.1.42)
  • Explainability: "level of understanding how the AI-based system... came up with a given result" (ISO/IEC TR 29119-11:2020, 3.1.31)

Some explainability techniques don't involve understanding how the model works, and may work across various AI systems. Treating the model as a black box and analyzing how marginal changes to the inputs affect the result sometimes provides a sufficient explanation.
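A minimal sketch of this perturbation-based view is given below; the gradient-boosted model and synthetic data stand in for any system that exposes only a prediction function.

```python
# A minimal sketch of black-box, perturbation-based explanation: nudge each input
# feature of one example and report how much the model's output moves. The model
# below is treated purely as an opaque predict() function.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 3 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = GradientBoostingRegressor().fit(X, y)

x = X[0].copy()
base = model.predict(x.reshape(1, -1))[0]
for i in range(len(x)):
    nudged = x.copy()
    nudged[i] += 0.1                                   # small marginal change
    delta = model.predict(nudged.reshape(1, -1))[0] - base
    print(f"feature {i}: output changes by {delta:+.3f}")
```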

Explainability

Explainability is useful for ensuring that AI models are not making decisions based on irrelevant or otherwise unfair criteria. For classification and regression models, several popular techniques exist:
  • Partial dependence plots show the marginal effect of an input feature on the predicted outcome (illustrated in the sketch after this list).
  • SHAP enables visualization of the contribution of each input feature to the output. It works by calculating Shapley values, which measure the average marginal contribution of a feature across all possible combinations of features (a brute-force version is sketched after this list).
  • Feature importance estimates how important a feature is for the model. It is usually done using permutation importance, which measures the performance decrease when the feature value is randomly shuffled across all samples (see the sketch after this list).
  • LIME approximates locally a model's outputs with a simpler, interpretable model.
  • Multitask learning provides a large number of outputs in addition to the target classification. These other outputs can help developers deduce what the network has learned.
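The sketch below illustrates two of the techniques above, partial dependence and permutation importance, implemented from scratch for any model exposing a scikit-learn-style predict method; the data and random-forest model are synthetic stand-ins.

```python
# Minimal from-scratch versions of partial dependence and permutation importance.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
y = np.sin(X[:, 0]) + 2 * X[:, 1] + rng.normal(scale=0.1, size=1000)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)

def partial_dependence(model, X, feature, grid):
    """Average prediction when `feature` is clamped to each value in `grid`."""
    averages = []
    for value in grid:
        X_clamped = X.copy()
        X_clamped[:, feature] = value
        averages.append(model.predict(X_clamped).mean())
    return np.array(averages)

def permutation_importance(model, X, y, feature, n_repeats=5):
    """Average drop in R^2 when `feature` is shuffled across all samples."""
    baseline = r2_score(y, model.predict(X))
    drops = []
    for _ in range(n_repeats):
        X_perm = X.copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])
        drops.append(baseline - r2_score(y, model.predict(X_perm)))
    return float(np.mean(drops))

print(partial_dependence(model, X, 0, np.linspace(-2, 2, 5)))    # marginal effect of x0
print([round(permutation_importance(model, X, y, f), 3) for f in range(3)])
```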
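Shapley values can likewise be computed by brute force when there are only a few features; the sketch below replaces "unknown" features with their dataset means, a common simplification, whereas the SHAP library uses faster and more careful estimators.

```python
# A brute-force sketch of Shapley values for one prediction. Features outside a
# coalition are replaced by their dataset means; contributions sum to
# f(x) - f(background).
from itertools import combinations
from math import factorial
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
y = 4 * X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.1, size=500)
model = LinearRegression().fit(X, y)

x = X[0]                      # the instance being explained
background = X.mean(axis=0)   # stand-in values for "unknown" features
n = len(x)

def value(subset):
    """Model output when only the features in `subset` take their true values."""
    z = background.copy()
    z[list(subset)] = x[list(subset)]
    return model.predict(z.reshape(1, -1))[0]

shapley = np.zeros(n)
for i in range(n):
    others = [j for j in range(n) if j != i]
    for size in range(n):
        for subset in combinations(others, size):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            shapley[i] += weight * (value(subset + (i,)) - value(subset))

print(shapley)   # per-feature contributions to this particular prediction
```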
For images, saliency maps highlight the parts of an image that most influenced the result.
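A minimal sketch of a gradient-based saliency map is shown below; the small untrained CNN is only a placeholder for a trained image classifier, which would be needed for the map to be meaningful.

```python
# A minimal sketch of a gradient saliency map: the absolute gradient of one class
# score with respect to each input pixel. The untrained CNN below is a stand-in
# for a trained image classifier.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(8, 10),
)
model.eval()

image = torch.rand(1, 3, 64, 64, requires_grad=True)   # placeholder input image
score = model(image)[0, 3]                              # score of class 3
score.backward()

# Saliency: maximum absolute gradient over colour channels, one value per pixel.
saliency = image.grad.abs().max(dim=1).values.squeeze(0)
print(saliency.shape)   # torch.Size([64, 64]); larger values = more influential pixels
```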
Expert systems, or knowledge-based systems, are software systems built with the help of domain experts. Their domain knowledge is encoded in a knowledge base, usually as production rules, which the user can query. Because reasoning proceeds through explicit rules, an expert system can accompany its answers with an explanation of the reasoning or problem-solving activity that produced them.
However, these techniques are not very suitable for language models like generative pretrained transformers. Since these models generate language, they can provide explanations themselves, but those explanations may not be reliable. Other techniques include attention analysis, probing methods, causal tracing, and circuit discovery. Explainability research in this area overlaps significantly with interpretability and alignment research.

Interpretability

Scholars sometimes use the term "mechanistic interpretability" to refer to the process of reverse-engineering artificial neural networks to understand their internal decision-making mechanisms and components, similar to how one might analyze a complex machine or computer program.
Studying the interpretability of the most advanced foundation models often involves searching for an automated way to identify "features" in generative pretrained transformers. In a neural network, a feature is a pattern of neuron activations that corresponds to a concept. A compute-intensive technique called "dictionary learning" makes it possible to identify features to some degree. Enhancing the ability to identify and edit features is expected to significantly improve the safety of frontier AI models.
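One common way to perform such dictionary learning is to train a sparse autoencoder on collected activations; the sketch below uses random vectors in place of real transformer activations and a simple L1 penalty to encourage sparsity.

```python
# A minimal sketch of dictionary learning for features via a sparse autoencoder.
# Random vectors stand in for activations collected from a real model.
import torch
import torch.nn as nn

d_model, d_dict = 64, 256                     # activation width, dictionary size
activations = torch.randn(10_000, d_model)    # placeholder for collected activations

encoder = nn.Linear(d_model, d_dict)
decoder = nn.Linear(d_dict, d_model)
opt = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-3)

for step in range(200):
    batch = activations[torch.randint(0, len(activations), (256,))]
    code = torch.relu(encoder(batch))         # sparse, non-negative feature code
    recon = decoder(code)
    # Reconstruction error plus an L1 penalty that encourages sparse codes.
    loss = ((recon - batch) ** 2).mean() + 1e-3 * code.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()

# Each column of decoder.weight is a candidate "feature" direction in activation space.
print(decoder.weight.shape)   # torch.Size([64, 256])
```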
For convolutional neural networks, DeepDream can generate images that strongly activate a particular neuron, providing a visual hint about what the neuron is trained to identify.
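The underlying mechanism, activation maximization, can be sketched as gradient ascent on the input image; the untrained CNN below is a placeholder, whereas DeepDream-style visualizations use trained networks.

```python
# A minimal sketch of activation maximization, the mechanism behind DeepDream:
# gradient ascent on the input image to increase one channel's mean activation.
# The untrained CNN is a stand-in for a trained network.
import torch
import torch.nn as nn

conv = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
)
image = torch.rand(1, 3, 64, 64, requires_grad=True)   # start from noise
opt = torch.optim.Adam([image], lr=0.05)

for step in range(100):
    activation = conv(image)[0, 7].mean()   # mean activation of channel 7
    loss = -activation                      # minimizing -activation = gradient ascent
    opt.zero_grad()
    loss.backward()
    opt.step()

# `image` now contains a pattern that strongly excites the chosen channel.
print(float(conv(image)[0, 7].mean()))
```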