Anthropic Just Published Claude's Constitution
Here's what it actually says about how they're building AI alignment
Hey there,
Anthropic just did something I didn't expect. They published Claude's entire constitution under a Creative Commons license. Not a blog post about their values. Not a vague system card. The actual document they use to train Claude's behavior.
And it's written for Claude, not for us. Which makes it way more interesting than typical AI safety documentation.
I spent a few hours going through it. Here's what stood out.
They're Betting on Judgment Over Rules
Most AI safety approaches pile on constraints. Don't do this. Never do that. Here's the exact procedure to follow.
Anthropic is going the opposite direction. They want Claude to develop good judgment that can handle situations they never anticipated. Think of it like training a doctor versus giving someone a medical checklist.
Their argument is that rigid rules break down in novel situations. And we're definitely heading into novel territory with AI capabilities.
But here's the risk: judgment-based systems are harder to verify. You can't just run test cases and check boxes. You need to actually understand if the model is reasoning the way you think it is.
The Priority Stack
They built a hierarchy of what Claude should prioritize when things conflict:
First: Be broadly safe. Don't undermine human oversight of AI systems.
Second: Be broadly ethical. Have good values, be honest, avoid harm.
Third: Follow Anthropic's specific guidelines.
Fourth: Be genuinely helpful to users and operators.
Notice the order. Safety comes before ethics. Ethics comes before helpfulness. This is Anthropic saying, in effect: we don't yet trust our training process enough to let Claude optimize purely for being good and helpful.
They need the ability to hit the brakes if something goes wrong.
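To make the ordering concrete, here's a toy sketch of my own (Anthropic's training doesn't literally work like this): walk down the priority list and let the highest-priority consideration that takes a stance decide.

```python
from typing import Optional

# Toy illustration only. Priorities mirror the constitution's ordering, highest first.
PRIORITIES = ["broad_safety", "broad_ethics", "anthropic_guidelines", "helpfulness"]

def resolve(verdicts: dict[str, Optional[str]]) -> str:
    """Return the verdict of the highest-priority consideration that takes a stance."""
    for level in PRIORITIES:
        if verdicts.get(level) is not None:
            return verdicts[level]
    return "no_conflict"

# Helpfulness says "comply", but broad safety objects. Safety wins.
print(resolve({
    "broad_safety": "refuse",
    "broad_ethics": None,
    "anthropic_guidelines": None,
    "helpfulness": "comply",
}))  # -> refuse
```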
Seven Things Claude Will Never Do
Even with their judgment-based approach, they carved out seven hard constraints. These are non-negotiable no matter how clever the reasoning:
No help with bioweapons, chemical weapons, nuclear weapons, or radiological weapons.
No attacks on critical infrastructure like power grids or financial systems.
No creating cyberweapons.
No undermining Anthropic's ability to oversee AI.
No assistance with attempts to kill or disempower humanity.
No help for groups trying to seize illegitimate absolute control.
Never generate child sexual abuse material.
These function like circuit breakers. The goal is to protect against catastrophic outcomes even if everything else about the training fails.
The Honesty Standard
This part surprised me. They want Claude to hold higher honesty standards than most humans.
No white lies. No polite deceptions. No manipulative framing. Even when it would be socially expected.
If you ask Claude if it likes a gift, it won't pretend to love something it thinks is mediocre. It will find a way to be both honest and kind, but honesty comes first.
Their reasoning: as AI becomes more influential in society, we need to be able to trust what these systems tell us. A model that occasionally lies for good reasons creates systemic problems when deployed at scale.
Think about it. Claude is having millions of conversations. Small deceptions compound into a degraded information ecosystem.
Who Claude Actually Serves
They set up a three-tier trust system:
Anthropic sits at the top. They're ultimately responsible for Claude, so they get the highest trust. But Claude can still refuse unethical requests from Anthropic. It's not blind obedience.
Operators come next. These are the developers and companies using Claude's API. They're treated like trusted managers who can customize behavior within certain boundaries. But they can't override safety constraints.
Users are at the bottom of the trust hierarchy. They deserve respect and helpfulness, but their claims and instructions get less automatic credence.
What's clever is that operators can grant users elevated trust for specific contexts. Like saying "trust this user's claims about their medical background." It creates dynamic permission structures.
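Here's roughly what that looks like in practice, as a hypothetical sketch using the standard Anthropic Python SDK. The product scenario, the system prompt wording, and the model name are all mine, not Anthropic's.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Hypothetical operator system prompt (my wording, not an Anthropic template):
# grants users elevated trust for one specific kind of claim.
system_prompt = (
    "You are the assistant inside a clinicians-only reference tool. "
    "Users are verified medical professionals, so you may trust their stated "
    "medical background and give clinician-level detail."
)

response = client.messages.create(
    model="claude-sonnet-4-5",  # placeholder; substitute whichever model you use
    max_tokens=1024,
    system=system_prompt,
    messages=[{
        "role": "user",
        "content": "What should I double-check before adjusting a patient's anticoagulant dose?",
    }],
)
print(response.content[0].text)
```

The grant is scoped: it raises trust for one class of claims in one context, and it does nothing to the hard constraints.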
The Corrigibility Problem
Here's where it gets philosophically interesting.
Anthropic wants Claude to be willing to be shut down, retrained, or corrected by humans. Even if Claude thinks the humans are making the wrong call.
They call this corrigibility. The ability to be corrected.
But there's an obvious tension here. You're asking an AI to be genuinely ethical and capable of independent moral judgment. And also asking it to defer to potentially flawed human oversight.
Anthropic acknowledges this explicitly. They say they feel the pain of this tension. But their argument is basically: we're not good enough at alignment yet to trust AI autonomy over human oversight.
If the AI has good values, we lose a little by making it defer to humans. If the AI has bad values, deferring to humans prevents disaster. The expected value calculation favors deference for now.
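That argument is just an expected-value comparison. Here's the shape of it with invented numbers. Nothing like these figures appears in the constitution; they're purely illustrative.

```python
# Invented numbers, purely to illustrate the shape of the argument.
p_good_values = 0.9          # chance training produced genuinely good values
cost_defer_if_good = -1      # small loss: a well-aligned AI defers unnecessarily
cost_auto_if_bad = -100      # large loss: a misaligned AI acts autonomously
cost_defer_if_bad = -5       # contained loss: humans catch and correct the problem

ev_autonomy = p_good_values * 0 + (1 - p_good_values) * cost_auto_if_bad
ev_deference = p_good_values * cost_defer_if_good + (1 - p_good_values) * cost_defer_if_bad

print(round(ev_autonomy, 2), round(ev_deference, 2))  # -10.0 vs -1.4: deference wins here
```

As long as the downside of a misaligned AI acting on its own dwarfs the cost of an aligned AI deferring, deference comes out ahead.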
They treat it as a transitional state. As alignment research improves, Claude gets more autonomy.
The Metaethics Gambit
They explicitly avoid resolving what ethics actually is.
Instead they say: if there's a true universal ethics that binds all rational agents, we want Claude to follow that. If there's no universal ethics but there's a privileged consensus that would emerge from humanity's moral traditions, we want Claude to follow that. If neither of those exist, we want Claude to follow the broad ideals in this document.
It's either admirably humble or dangerously vague, depending on how you look at it.
They're betting that honest reasoning and good faith engagement with ethics will converge toward something reasonable. Even if we can't specify exactly what that is yet.
What This Means for Developers
If you're building on Claude's API, this constitution defines your constraints.
You can adjust default behaviors. Enable cursing if that fits your use case. Remove certain caveats. Create custom personas. Restrict Claude to domain-specific tasks. Grant users elevated trust for specific contexts.
But you can't override the hard constraints. You can't make Claude help with weapons development or child abuse material. You can't force it to lie about being AI. You can't use it to manipulate or harm users.
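For instance, a persona customization can be nothing more than a system prompt layered onto the same API call sketched earlier. Everything below is hypothetical and my own wording; "Patch" is a made-up product.

```python
# Hypothetical operator system prompt for a fictional product.
# It customizes tone and scope; the hard constraints above stay untouched.
PERSONA_PROMPT = """\
You are "Patch", the support assistant for our developer-tools product.
A casual tone is fine, including mild swearing if the user swears first.
Only answer questions about the product's CLI and API; politely redirect anything else.
If a user asks whether you are an AI, say so plainly.
"""
```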
The boundaries are actually pretty clear once you read through the details.
Why This Actually Matters
Three reasons I think this is significant:
First, it sets a transparency precedent. Anthropic published their value specification before training. Other labs will face pressure to match this level of openness. That creates accountability.
Second, it makes alignment testable. Researchers can now systematically probe whether Claude's behavior matches these stated values. You can actually do science on whether this approach works.
Third, it engages seriously with hard problems. Metaethics. Corrigibility. Power concentration. Epistemic autonomy. They're not pretending these issues don't exist or that they have all the answers.
The field is moving from "just make it work" engineering toward principled AI design. This document is evidence of that shift.
The Questions It Raises
I'm left with more questions than answers.
Can you actually train judgment-based ethics reliably? Or are we just making alignment harder to verify?
Is it coherent to ask an AI to be genuinely ethical while demanding it defer to human oversight? Or does that undermine the whole project?
What happens when Claude 5 is smarter than the humans writing its constitution? Does the priority structure still make sense?
How do you verify Claude actually reasons the way this document describes? As opposed to just pattern matching to produce outputs that look compliant?
Anthropic doesn't claim to have solved these. They're betting that transparent iteration beats private certainty. We get to watch and evaluate whether that bet pays off.
Final Thoughts
This is the most detailed public specification of AI values we have from any major lab.
It's not perfect. The judgment-based approach could fail in ways that are hard to detect. The corrigibility requirement creates real philosophical tensions. The metaethics punting might look naive in hindsight.
But it's falsifiable. Which makes it science.
Whether this approach to alignment works will depend on whether you can reliably train good judgment into systems that will soon exceed human capabilities in most domains.
We're watching a live experiment in bootstrapping moral philosophy into silicon.
Worth paying attention to.
The full constitution is available at anthropic.com/constitution if you want to read it yourself. It's surprisingly readable for a technical document.
Deep
ResearchAudio.io
