Tutorial

Foundations of Interpretable Deep Learning

Abstract

As notoriously opaque deep neural networks (DNNs) become commonplace in powerful Artificial Intelligence (AI) systems, Interpretable Deep Learning (IDL) has emerged as a promising direction for designing interpretable-by-construction neural architectures. At their core, IDL models learn a latent space in which some representations are aligned with high-level units of information, or concepts, that domain experts are familiar with (e.g., “striped texture” or “round object”). By introducing inductive biases that encourage predictions to be made from these interpretable representations, IDL models enable the construction of expressive yet highly transparent architectures that can be vetted, analysed, and intervened on.
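To make this principle concrete, the snippet below is a minimal, illustrative PyTorch sketch of a concept-bottleneck-style architecture, one of the design patterns the tutorial covers. It is not the implementation of any specific model discussed in the session; the class name, layer sizes, and intervention interface are assumptions chosen for brevity. The key idea it shows is that the label is predicted only from concept activations, which an expert can inspect or overwrite at test time.

```python
import torch
import torch.nn as nn


class ConceptBottleneckModel(nn.Module):
    """Illustrative concept-bottleneck-style IDL model: inputs are first mapped
    to predictions of human-interpretable concepts, and the final label is
    predicted only from those concept activations."""

    def __init__(self, input_dim: int, n_concepts: int, n_classes: int):
        super().__init__()
        # Backbone mapping raw inputs to concept logits (e.g., "striped texture").
        self.concept_encoder = nn.Sequential(
            nn.Linear(input_dim, 128),
            nn.ReLU(),
            nn.Linear(128, n_concepts),
        )
        # Label head that only sees the (interpretable) concept activations.
        self.label_predictor = nn.Linear(n_concepts, n_classes)

    def forward(self, x, concept_interventions=None):
        concept_probs = torch.sigmoid(self.concept_encoder(x))
        if concept_interventions is not None:
            # A domain expert can overwrite selected concept predictions,
            # e.g., to correct a mispredicted concept and observe the effect
            # on the downstream label. (Hypothetical interface for this sketch.)
            mask, values = concept_interventions  # bool mask, replacement values
            concept_probs = torch.where(mask, values, concept_probs)
        return self.label_predictor(concept_probs), concept_probs


# Toy usage: 32 samples, 64 input features, 10 concepts, 5 classes.
model = ConceptBottleneckModel(input_dim=64, n_concepts=10, n_classes=5)
x = torch.randn(32, 64)
logits, concepts = model(x)
```

Because the label head sees nothing but the concept vector, every prediction can be traced back to, and corrected through, concepts the expert understands.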

This tutorial aims to capitalise on the surge of interest in IDL by exposing AI researchers and engineers to the foundations needed to understand the general principles behind existing IDL models. In doing so, we aim to equip attendees with the knowledge to navigate the current state of this extensive body of literature and to build upon it in their own research. Specifically, this tutorial will provide an overview of (1) core interpretability principles, (2) seminal works in the field, and (3) recent directions in IDL. The tutorial will include hands-on demonstrations throughout the session and will conclude with a discussion of the key open questions in the field.