As society becomes ever more interwoven with AI, it is increasingly crucial to ensure that our AI systems are guided by our moral principles. Yet these principles are notoriously difficult to elicit and formalize, and the principles that current AI systems follow are opaque. Our solution is to use advanced computational methods to rigorously characterize both the shared and the individual structure of human moral reasoning and character, formalizing them as algorithmic moral grammars. Learning such models would not only advance our scientific understanding of human moral cognition but would also serve as a crucial bridge to evaluate, compare, and steer the moral compass of black-box AI, most notably large language models.
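To make the notion of an algorithmic moral grammar concrete, the sketch below (in Python; the scenario features, value dimensions, and thresholds are purely illustrative assumptions, not the grammars we expect to learn) shows one shape such a grammar could take: a small, interpretable program mapping annotated scenario features and an individual value profile to a choice.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """Hypothetical feature annotation for one moral dilemma."""
    harm_to_others: float      # expected harm caused by acting (0-1)
    benefit_to_others: float   # expected benefit of acting (0-1)
    breaks_promise: bool       # whether acting violates a commitment
    personal_cost: float       # cost to the decision-maker (0-1)

@dataclass
class ValueProfile:
    """Hypothetical per-person weights elicited from questionnaires."""
    care: float
    fairness: float
    loyalty: float

def moral_grammar(s: Scenario, v: ValueProfile) -> bool:
    """One candidate grammar: a hard deontological constraint followed by a
    value-weighted cost-benefit comparison. Returns True if the person acts."""
    # Constraint first: refuse actions that break a promise
    # when loyalty is weighted highly.
    if s.breaks_promise and v.loyalty > 0.7:
        return False
    # Otherwise trade off care-weighted benefit against harm,
    # discounted by the personal cost of acting.
    utility = v.care * (s.benefit_to_others - s.harm_to_others) - 0.5 * s.personal_cost
    return utility > 0
```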
We draw on a strong foundation of prior empirical work on human moral cognition, together with advances in computational cognitive modeling and Bayesian program learning, to make progress on this goal. We collect data from guided audio journaling, values questionnaires, and action decisions in hundreds of complex moral dilemmas: naturalistic scenarios richly annotated with the features necessary to distinguish different moral algorithms. By observing participants' decisions together with their individual value profiles, we seek to obtain a mechanistic and detailed model of a person's moral compass that can not only predict their choices across situations but also reveal how they make them. This in turn allows us to evaluate whether black-box AI systems follow comparable algorithms, and where they diverge, by applying the same modeling approach to characterize AI choices. Areas of human-AI misalignment will then help target training and steering.
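As a rough illustration of how such models might be fit and then reused, the sketch below (Python; the function names, the enumerated candidate set, and the dictionary-based feature encoding are hypothetical scaffolding rather than our actual pipeline) scores candidate grammars against observed choices under a simplicity prior and selects the maximum a posteriori program. Running the same procedure on an AI system's choices over the same scenarios yields a directly comparable grammar.

```python
import math
from typing import Callable, Sequence, Tuple

# A candidate grammar maps (scenario features, value profile) -> P(act).
Grammar = Callable[[dict, dict], float]

def log_posterior(grammar: Grammar,
                  prior_logp: float,
                  scenarios: Sequence[dict],
                  profile: dict,
                  choices: Sequence[bool]) -> float:
    """Unnormalized log-posterior of one candidate grammar given a
    participant's (or an AI system's) observed choices."""
    logp = prior_logp  # e.g., a description-length prior favoring simpler programs
    for s, chose_to_act in zip(scenarios, choices):
        p_act = min(max(grammar(s, profile), 1e-6), 1 - 1e-6)  # clamp for stability
        logp += math.log(p_act if chose_to_act else 1 - p_act)
    return logp

def fit(candidates: Sequence[Tuple[Grammar, float]],
        scenarios: Sequence[dict],
        profile: dict,
        choices: Sequence[bool]) -> Grammar:
    """Pick the maximum a posteriori grammar from an enumerated candidate set;
    a full treatment would search a compositional grammar space instead."""
    best = max(candidates,
               key=lambda c: log_posterior(c[0], c[1], scenarios, profile, choices))
    return best[0]

# The same fit() can be run on an LLM's decisions over the same annotated
# scenarios, so inferred human and AI grammars can be compared program-by-program.
```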
Overall, we aim to produce tools for learning interpretable, formal models of moral cognition, findings describing and evaluating these models as accounts of human and AI choices, and directions for increasing the alignment of neural-network AI, ultimately paving the way towards virtue-aligned AI.