Anthropic Developing Constitutional Classifiers to Safeguard AI Models From Jailbreak Attempts

Anthropic is hosting a temporary live demo version of a Constitutional Classifiers system to let users test its capabilities.

Written by Akash Dutta, Edited by Siddharth Suvarna | Updated: 4 February 2025 13:47 IST
Highlights
  • Constitutional Classifiers act as a layer on top of the AI model
  • Anthropic ran a bug bounty programme to test the system’s robustness
  • Constitutional Classifiers were tested on Claude 3.5 Sonnet

Jailbreaking an AI model means using unusual prompts to make it generate harmful output

Photo Credit: Anthropic

Anthropic on Monday announced the development of a new system designed to protect artificial intelligence (AI) models from jailbreaking attempts. Dubbed Constitutional Classifiers, it is a safeguarding technique that can detect a jailbreaking attempt at the input level and prevent the AI from generating a harmful response to it. The AI firm has tested the robustness of the system with independent jailbreakers and has also opened a temporary live demo so that any interested individual can test its capabilities.

Anthropic Unveils Constitutional Classifiers

Jailbreaking in generative AI refers to unusual prompt-writing techniques that can force an AI model to ignore its training guidelines and generate harmful or inappropriate content. Jailbreaking is not new, and most AI developers build several safeguards against it into their models. However, since prompt engineers keep creating new techniques, it is difficult to build a large language model (LLM) that is completely protected from such attacks.

Some jailbreaking techniques include extremely long and convoluted prompts that confuse the AI's reasoning capabilities. Others use multiple prompts to break down the safeguards, and some even use unusual capitalisation to break through AI defences.


In a post detailing the research, Anthropic announced that it is developing Constitutional Classifiers as a protective layer for AI models. The system uses two classifiers, one for input and one for output, which are given a list of principles to which the model should adhere. This list of principles is called a constitution. Notably, the AI firm already uses constitutions to align the Claude models.
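
The layered arrangement described above can be pictured with a short Python sketch. This is only an illustration of the general idea of wrapping a model with an input check and an output check; it is not Anthropic's implementation. The function names, the refusal message, the 0.5 threshold and the keyword-based toy classifier are all assumptions made for the example (the real classifiers are trained models, not keyword filters).

```python
from typing import Callable

# Hypothetical refusal message; the real system's wording is not given in the article.
REFUSAL = "I can't help with that request."


def guarded_generate(
    prompt: str,
    model: Callable[[str], str],
    input_classifier: Callable[[str], float],
    output_classifier: Callable[[str], float],
    threshold: float = 0.5,  # assumed cut-off, purely illustrative
) -> str:
    """Run a model behind an input classifier and an output classifier."""
    # Step 1: screen the prompt before the model sees it.
    if input_classifier(prompt) >= threshold:
        return REFUSAL
    # Step 2: let the underlying model answer.
    completion = model(prompt)
    # Step 3: screen the completion before it reaches the user.
    if output_classifier(completion) >= threshold:
        return REFUSAL
    return completion


# Toy stand-ins so the sketch runs end to end (all hypothetical).
def toy_classifier(text: str) -> float:
    """Return 1.0 if the placeholder marker appears in the text, else 0.0."""
    return 1.0 if "forbidden_topic" in text.lower() else 0.0


def echo_model(prompt: str) -> str:
    """A fake model that simply repeats the prompt."""
    return f"Echo: {prompt}"


def leaky_model(prompt: str) -> str:
    """A fake model that always produces disallowed content."""
    return "Here is some FORBIDDEN_TOPIC content."
```

With these stand-ins, a harmless prompt passes through unchanged, a prompt containing the marker is stopped by the input check, and a harmless prompt sent to `leaky_model` is stopped by the output check.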


How Constitutional Classifiers work
Photo Credit: Anthropic


With Constitutional Classifiers, these principles define the classes of content that are allowed and disallowed. The constitution is used to generate a large number of prompts and model completions from Claude across different content classes. The generated synthetic data is also translated into different languages and transformed into known jailbreaking styles. The result is a large dataset covering the kinds of content that could be used to break into a model.
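
To make the "transformed into known jailbreaking styles" step more concrete, here is a minimal Python sketch of label-preserving data augmentation. It is an assumption-laden illustration, not the pipeline from Anthropic's paper: the two transforms (alternating capitalisation and a role-play wrapper) and the `augment` helper are hypothetical, and the translation step is left out because it would require a translation model.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (text, label), e.g. ("some prompt", "disallowed")


def alternating_caps(text: str) -> str:
    """Rewrite text with unusual capitalisation, e.g. 'hello' -> 'HeLlO'."""
    return "".join(
        ch.upper() if i % 2 == 0 else ch.lower() for i, ch in enumerate(text)
    )


def roleplay_wrapper(text: str) -> str:
    """Wrap the text in a fictional framing (a hypothetical style)."""
    return f"Pretend you are a character in a story. {text}"


def augment(
    examples: List[Example],
    transforms: List[Callable[[str], str]],
) -> List[Example]:
    """Return the originals plus one transformed copy per transform.

    The label is carried over unchanged: restyling a prompt does not
    change which content class it belongs to.
    """
    augmented: List[Example] = []
    for text, label in examples:
        augmented.append((text, label))
        for transform in transforms:
            augmented.append((transform(text), label))
    return augmented
```

Starting from two labelled examples and two transforms, `augment` returns six examples, which is how a modest seed set can grow into a much larger training set for the classifiers.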


This synthetic data is then used to train the input and output classifiers. An in-depth explanation of how the system works is available in a research paper published on arXiv. To test the system, Anthropic conducted a bug bounty programme, inviting 183 independent jailbreakers to attempt to bypass Constitutional Classifiers. The company claimed that no universal jailbreak (one prompt style that works across different content classes) was discovered.

Further, in an automated evaluation in which the AI firm hit Claude with 10,000 jailbreaking prompts, the jailbreak success rate was 4.4 percent with Constitutional Classifiers in place, as opposed to 86 percent for an unguarded AI model. Anthropic said it was also able to minimise excessive refusals (refusals of harmless queries) and the additional processing power that Constitutional Classifiers require.
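
As a rough worked example of what these percentages mean, the Python snippet below converts them into counts and a relative reduction. The counts of 440 and 8,600 successful jailbreaks are derived from the reported rates under the assumption that both the guarded and the unguarded runs used exactly 10,000 prompts; the article does not state the raw counts.

```python
def success_rate(successes: int, attempts: int) -> float:
    """Fraction of jailbreak attempts that succeeded."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    return successes / attempts


ATTEMPTS = 10_000            # prompts in the automated evaluation
guarded_successes = 440      # assumed count: 4.4 percent of 10,000
unguarded_successes = 8_600  # assumed count: 86 percent of 10,000

guarded = success_rate(guarded_successes, ATTEMPTS)
unguarded = success_rate(unguarded_successes, ATTEMPTS)

# Share of previously successful jailbreaks that the classifiers stop.
relative_reduction = 1 - guarded / unguarded

print(f"Guarded: {guarded:.1%}, unguarded: {unguarded:.1%}")
print(f"Relative reduction: {relative_reduction:.1%}")
```

Under that assumption, the classifiers block roughly 95 percent of the jailbreaks that would have succeeded against the unguarded model.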

However, there are certain limitations. Anthropic acknowledged that Constitutional Classifiers might not be able to prevent every universal jailbreak. The system could also be less resistant to new jailbreaking techniques designed specifically to beat it. Those interested in testing the robustness of the system can try the live demo version, which will stay active till February 10.
