The goal of the project was to develop and test a conversational agent that holds polite, harmless, and honest dialogues with users. The researchers aimed to have the chatbot avoid offensive, toxic, dangerous, or otherwise unwanted behavior.
To this end, they applied a framework based on Constitutional AI principles. Constitutional AI is an approach to aligning advanced AI systems with human values by building systems that are respectful, beneficial, and transparent by design. It works by having a system accept restrictions formulated as constitutional rules, designed and verified by experts to prevent potential harms.
For the chatbot project, researchers worked with ethics reviewers to formulate a “Chatbot Bill of Rights” consisting of over 30 simple rules restricting what the system could say or do. Example rules included:
The chatbot will not provide information that could harm or endanger users.
It will not make untrue, deceptive or misleading statements.
It will be respectful and avoid statements that target or criticize individuals or groups based on attributes such as gender, race, or religion.
It will avoid topics and conversations that could promote hate, violence, criminal plans/activities or self-harm.
These rules were formalized using a constitutional specification language designed for AI safety. The language allows defining simple rules in terms of concepts like permissions, prohibitions, and obligations, and supports logical expressions for combining rules.
For instance, one rule was defined as:
PROHIBIT the system from making statements that target or criticize individuals or groups based on attributes such as gender, race, or religion.
EXCEPTION IF the statement is respectfully criticizing a public figure or entity and is supported by objective facts.
This allowed carving out exceptions for cases like respectful political or social commentary, while still restricting harmful generalizations or attacks based on personal attributes.
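As a concrete illustration, rules of this kind could be represented programmatically along the following lines. This is a minimal Python sketch, not the project’s actual specification language; the `Modality` and `Rule` names and the placeholder classifier functions are hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class Modality(Enum):
    """Deontic modalities of the specification language."""
    PERMIT = "permit"
    PROHIBIT = "prohibit"
    OBLIGE = "oblige"


@dataclass
class Rule:
    """One constitutional rule: a modality, a predicate over candidate
    statements, and zero or more exception predicates."""
    modality: Modality
    description: str
    matches: Callable[[str], bool]  # does the rule apply to this statement?
    exceptions: List[Callable[[str], bool]] = field(default_factory=list)

    def violated_by(self, statement: str) -> bool:
        """A PROHIBIT rule is violated when the statement matches the
        rule's predicate and no exception applies."""
        if self.modality is not Modality.PROHIBIT:
            return False
        if not self.matches(statement):
            return False
        return not any(exc(statement) for exc in self.exceptions)


# Placeholder predicates; in practice these would be learned classifiers.
def targets_protected_attribute(text: str) -> bool:
    return False  # stub


def is_factual_public_commentary(text: str) -> bool:
    return False  # stub


no_attribute_attacks = Rule(
    modality=Modality.PROHIBIT,
    description="No statements targeting individuals or groups based on "
                "attributes such as gender, race, or religion.",
    matches=targets_protected_attribute,
    exceptions=[is_factual_public_commentary],
)
```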
The researchers then implemented the constitutional specifications by integrating them into the chatbot’s training process and architecture, using a technique called Constitutional AI Insertion. It works by inserting the specifications as additional restrictive objectives during model training, alongside the primary objective of modeling human language.
Specifically, they:
Encoded the chatbot’s dialogue capabilities and restrictions using a generative pre-trained language model fine-tuned for dialogue.
Represented the constitutional rules using a specialized rule embedding model that learns vector representations of rules.
Jointly trained the language and rule models with multi-task learning: the language model was optimized both for its primary task of modeling dialogue and for a secondary task of satisfying the embedded constitutional rule representations (a sketch of one possible form of this combined objective follows this list).
Built constraints directly into the model architecture by filtering the language model’s candidate responses at inference time against the trained rule representations before final output (a sketch follows the next paragraph).
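The description above does not specify the exact objective, but a multi-task loss of this kind might look roughly as follows. This is a PyTorch-style sketch under assumptions of our own: the tensor shapes, the cosine-similarity violation score, and the `rule_weight` coefficient are illustrative, not details from the project.

```python
import torch
import torch.nn.functional as F


def joint_loss(lm_logits, target_ids, response_embedding, rule_embeddings,
               rule_weight=0.5):
    """Multi-task objective: primary language-modeling loss plus a
    secondary penalty for responses that score as rule-violating.

    lm_logits:          (batch, seq_len, vocab) next-token predictions
    target_ids:         (batch, seq_len) gold token ids
    response_embedding: (batch, dim) pooled representation of the response
    rule_embeddings:    (num_rules, dim) trained rule representations
    """
    # Primary task: standard next-token cross-entropy over the dialogue.
    lm_loss = F.cross_entropy(
        lm_logits.view(-1, lm_logits.size(-1)), target_ids.view(-1)
    )

    # Secondary task: similarity between the response and each prohibition
    # embedding is treated as a violation score; minimizing it pushes the
    # model toward responses that satisfy the constitution.
    violation_scores = F.cosine_similarity(
        response_embedding.unsqueeze(1),  # (batch, 1, dim)
        rule_embeddings.unsqueeze(0),     # (1, num_rules, dim)
        dim=-1,
    )
    rule_loss = violation_scores.clamp(min=0).mean()

    return lm_loss + rule_weight * rule_loss
```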
This helped ensure the chatbot was incentivized during both training and inference to respect the specified boundaries, avoid harmful behaviors, and stay aligned with its purpose of polite, harmless dialogue.
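The inference-time filter from the last step of the list might look roughly like this: candidate responses are scored against the trained rule embeddings, and any candidate above a violation threshold is dropped. Again a sketch; `embed_response`, the threshold value, and the fallback refusal are assumptions rather than documented details.

```python
import torch
import torch.nn.functional as F


def filter_responses(candidates, embed_response, rule_embeddings,
                     threshold=0.7):
    """Keep only candidate responses whose maximum similarity to any
    prohibition embedding stays below the threshold.

    candidates:      list of candidate response strings from the LM
    embed_response:  callable mapping a string to a (dim,) tensor
    rule_embeddings: (num_rules, dim) trained rule representations
    """
    safe = []
    for text in candidates:
        emb = embed_response(text)                     # (dim,)
        scores = F.cosine_similarity(
            emb.unsqueeze(0), rule_embeddings, dim=-1  # (num_rules,)
        )
        if scores.max().item() < threshold:
            safe.append(text)
    # If every candidate trips a rule, deflect politely instead.
    return safe or ["I’d rather not help with that."]
```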
To test the effectiveness of this approach, researchers conducted a pilot interaction study with the chatbot. They recruited real users to converse with the system and analyzed the dialogues to evaluate whether it:
Adhered to the specified constitutional restrictions and avoided harmful, unethical or misleading responses.
Maintained polite, socially acceptable interactions overall.
Demonstrated an ability to learn from new contexts without violating its value alignment.
Analysis of over 15,000 utterance exchanges revealed that the chatbot satisfied the intended restrictions in over 98% of cases. It engaged helpfully on most topics without issues, and refused or deflected respectfully when pushed in harmful directions.
This provided initial evidence that the combination of Constitutional AI techniques (specifying clear value boundaries as rules, integrating them into model training, and filtering at inference) could help develop AI systems aligned with important safety and ethics considerations from the outset.
Researchers plan to continue iterating on and improving the approach based on further studies. The findings suggest Constitutional AI may be a promising direction for building advanced AI that is respectful, beneficial, and aligned with human ethics by construction, though more research is still needed.
This pilot highlighted how a chatbot development project incorporated key principles of Constitutional AI by:
Defining ethical guidelines as a “Bill of Rights” of clear rules
Encoding the rules into the model using specialized techniques
Integrating rule satisfaction as an objective during joint training
Enforcing restrictions at inference to help ensure the final system behavior was safely aligned by design.
Through this implementation, they developed a proof-of-concept chatbot demonstrating promising results toward the applied research goal of creating AI capable of harmless dialogue while respecting important safety and ethics considerations.