Francesco Fabiano, Marianna B. Ganapini, et al.
Communications of the ACM
AI systems that learn through reward feedback about the actions they take are deployed in domains that have a significant impact on our daily lives. However, in many cases the online rewards should not be the only guiding criteria, as there are additional constraints and/or priorities imposed by regulations, values, preferences, or ethical principles. We detail a novel online agent that learns a set of behavioral constraints by observation and uses these learned constraints when making decisions in an online setting, while still being reactive to reward feedback. We propose a novel extension to the contextual multi-armed bandit setting and provide a new algorithm called Behavior Constrained Thompson Sampling (BCTS) that allows for online learning while obeying exogenous constraints. Our agent learns a constrained policy that implements observed behavioral constraints demonstrated by a teacher agent, and uses this constrained policy to guide its online exploration and exploitation. We characterize the upper bound on the expected regret of BCTS and provide a case study with real-world data in two application domains. Our experiments show that the designed agent is able to act within the set of behavior constraints without significantly degrading its overall reward performance.
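The two-phase idea in the abstract (first learn constraints from a teacher, then use them to guide online Thompson sampling) can be illustrated with a small sketch. This is not the BCTS algorithm from the paper: it assumes discrete contexts, binary rewards, and independent Beta posteriors per (context, arm) pair, and the `blend` weight that mixes the two posterior samples is a hypothetical choice made here for illustration. The class and method names are likewise invented for this example.

```python
import random
from collections import defaultdict


class BetaTable:
    """Independent Beta(wins + 1, losses + 1) posterior per (context, arm)."""

    def __init__(self):
        self.wins = defaultdict(int)
        self.losses = defaultdict(int)

    def update(self, context, arm, outcome):
        # Record one binary observation for this (context, arm) pair.
        if outcome:
            self.wins[(context, arm)] += 1
        else:
            self.losses[(context, arm)] += 1

    def sample(self, context, arm, rng):
        # Draw one plausible success probability from the posterior.
        key = (context, arm)
        return rng.betavariate(self.wins[key] + 1, self.losses[key] + 1)


class ConstrainedThompsonAgent:
    """Illustrative agent: Thompson sampling guided by teacher-derived constraints."""

    def __init__(self, n_arms, blend=0.5, seed=0):
        self.n_arms = n_arms
        self.blend = blend  # hypothetical weight on the learned constraints
        self.rng = random.Random(seed)
        self.constraint = BetaTable()  # learned from teacher demonstrations
        self.reward = BetaTable()      # learned from online reward feedback

    def observe_teacher(self, context, arm, approved):
        # Phase 1: the teacher shows whether an action respects the constraints.
        self.constraint.update(context, arm, approved)

    def choose(self, context):
        # Phase 2: sample both posteriors and pick the arm with the best mix.
        def score(arm):
            c = self.constraint.sample(context, arm, self.rng)
            r = self.reward.sample(context, arm, self.rng)
            return self.blend * c + (1 - self.blend) * r

        return max(range(self.n_arms), key=score)

    def observe_reward(self, context, arm, reward):
        # Online feedback updates only the reward posterior.
        self.reward.update(context, arm, reward)
```

In this sketch, `blend=1.0` makes the agent follow the teacher-derived constraints alone, `blend=0.0` reduces it to plain Thompson sampling on the online reward, and intermediate values trade one off against the other.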
Baihan Lin, Guillermo Cecchi, et al.
HAI-GEN+user2agent 2023
Baihan Lin, Djallel Bouneffouf, et al.
IJCNN 2022
Francesca Rossi, Nicholas Mattei
AAAI 2019