The Skills Playbook Anthropic Actually Uses
Hundreds in production. 9 categories, 9 rules, one reframe most of us missed.
100s of skills at Anthropic · 9 categories found · 1 wk on verification alone
Thariq Shihipar from the Claude Code team just published the internal skills playbook Anthropic has been compiling for a year. Buried in it is a reframe most of us missed.
A skill is not a markdown file. It is a folder. The SKILL.md is a router, and the real work happens in the references, scripts, and configs it points at. Claude reads lazily through that folder the same way you grep a repo, which makes the filesystem itself a form of context engineering.
That one reframe changes how you build everything else. Here is what it unlocks.
BEFORE: just a markdown file

my-skill/
└─ SKILL.md

Flat. Limited. Everything in one file, so everything competes for the same context window.

AFTER: a folder Claude explores

my-skill/
├─ SKILL.md ◄ router
├─ references/
├─ scripts/
├─ assets/
└─ config.json

Progressive disclosure. Claude reads what it needs, when it needs it.
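To make the folder reframe concrete, here is a hypothetical router-style SKILL.md. The skill name, file names, and wording are invented for illustration; the name/description frontmatter follows Anthropic's published Agent Skills format:

```markdown
---
name: db-migrations
description: >
  Use when the user asks to write, review, or run a database
  migration. Trigger phrases include "migration", "schema change",
  "backfill".
---

# Database migrations

Keep this file short; route instead of explaining.

- Conventions and naming rules: read references/conventions.md first.
- New migration: run scripts/new_migration.sh <name> instead of writing boilerplate.
- Known failure modes: references/gotchas.md (append a line every time one bites you).
- Project-specific settings live in config.json.
```

The body does almost no teaching. It tells Claude where to look, and the filesystem carries the rest.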
STEP 1: User message arrives. "help me write a migration"

STEP 2: Claude scans all skill descriptions, treating each as a trigger prompt. Does any description match the intent?

STEP 3: Match, and the skill fires. No match, and your skill stays cold, no matter how good the content is.
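Step 2 is why documentation-style descriptions underperform. A hypothetical before/after for the same skill (wording invented for illustration):

```yaml
# Reads like documentation; rarely matches an intent:
description: Helpful utilities for working with the database.

# Reads like a classifier prompt; names intents and trigger phrases:
description: >
  Use when the user asks to write, review, or run a database
  migration. Trigger phrases: "migration", "schema change",
  "alter table", "backfill".
```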
Where skills actually go in production
After cataloging hundreds of internal skills, the Claude Code team found they cluster into exactly 9 types.
01 REFERENCE: library, CLI, SDK how-to
02 VERIFY: Playwright, tmux, assertions
03 DATA: dashboards, queries, creds
04 AUTOMATE: standups, tickets, recaps
05 SCAFFOLD: new-migration, create-app
06 REVIEW: adversarial-review, style
07 SHIP: babysit-pr, deploy, cherry-pick
08 RUNBOOK: error signatures, oncall-runner
09 INFRA OPS: orphans, deps, cost-investigation
The 9 rules, compressed
Distilled from hundreds of internal skills. Read this as a checklist, not an essay.
1. Don't state the obvious. Claude already knows Python. Spend your context on what pushes it out of its defaults. The frontend-design skill exists because Claude loves Inter and purple gradients.
2. Gotchas is the highest-signal section. Every failure you observe becomes a line. Update it over time. It is the tightest eval loop you own.
3. Use the filesystem as progressive disclosure. Keep SKILL.md short. Point to references/api.md, assets/template.md, scripts/. Claude reads what it needs, when it needs it.
4. Don't railroad Claude. Give the goal and the constraints. Prescriptive step-by-step instructions break on any task the skill wasn't written for.
5. Set up via config.json plus AskUserQuestion. If the skill needs user context, store it in config.json. If it isn't there, have Claude ask once, cache it, move on.
6. The description is for the model, not the human. Write it like a classifier prompt. List the trigger phrases. If Claude never picks your skill, you have a description bug, not a content bug.
7. Store memory in $CLAUDE_PLUGIN_DATA. Data inside the skill directory gets wiped on upgrade. Use the stable folder for logs, SQLite, journaled state.
8. Ship scripts, not instructions. Code is the most powerful tool you can hand the model. Let Claude spend its turns on composition, not reconstructing boilerplate.
9. Use on-demand hooks for the dangerous stuff. /careful blocks rm -rf and DROP TABLE. /freeze blocks writes outside a directory. Scoped to the session, not always on.
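Rule 5 in practice might look like this: a skill-local config.json caching the answers Claude gathered once via AskUserQuestion. Every key below is invented for illustration:

```json
{
  "project": "billing-service",
  "default_branch": "main",
  "migration_dir": "db/migrations",
  "ticket_prefix": "BILL"
}
```

The SKILL.md then says: if a key is missing, ask the user once with AskUserQuestion, write the answer back to config.json, and never ask again.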
Most skills fail silently. Not because the content is bad, but because the description never triggers the classifier. The gap between "I wrote a skill" and "the model uses it" is almost entirely in that one field.
The description field is the API between your knowledge and Claude's action space. Treat it like one.
Quick Hits
Thariq's direct line: it can be worth having an engineer spend a week just making your verification skills excellent. Record video of the test run. Assert state programmatically at each step. Treat it like infrastructure, not documentation.
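A minimal sketch of "assert state programmatically at each step": instead of eyeballing output, poll the system until a predicate holds or fail loudly. This is a generic helper, not Anthropic's tooling; `fetch_state` stands in for whatever check your verification scripts expose.

```python
import time

def wait_for_state(fetch_state, predicate, timeout=10.0, interval=0.5):
    """Poll fetch_state() until predicate(state) is true, else raise.

    Returns the passing state so the caller can chain further assertions.
    """
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = fetch_state()
        if predicate(last):
            return last
        time.sleep(interval)
    raise AssertionError(f"state never satisfied predicate; last seen: {last!r}")

# Example: a (fake) deploy that becomes healthy on the third poll.
states = iter(["pending", "starting", "healthy"])
final = wait_for_state(lambda: next(states), lambda s: s == "healthy",
                       timeout=5, interval=0)
assert final == "healthy"
```

The same helper works whether the state comes from a CLI, an HTTP health check, or a Playwright page query; the point is that a failed step raises instead of scrolling past.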
Small team, few repos: check skills into .claude/skills. Scaling past that: build an internal plugin marketplace and let teams opt in. Every checked-in skill adds to base context, so at scale the marketplace wins.
Anthropic has no central skill-review team. Authors post to a sandbox folder, share in Slack, and promote to the marketplace via PR once the skill gets traction. Curation before release, not gatekeeping before creation.
Log skill invocations via a PreToolUse hook. Now you can see what's undertriggering vs. popular, and the description field becomes something you can A/B test instead of guess at.
The full teardown goes out to the paid archive this week, along with a SKILL.md template with the description-as-classifier pattern pre-wired and the Gotchas log format we use across ResearchAudio skills.
The description field isn't a summary for humans. It's a trigger for the model. Write it like a classifier prompt, not a blurb.
If the description is a classifier prompt, the obvious next move is to evaluate it like one. A held-out set of user messages, a target skill per message, a trigger accuracy score. Nobody I know has shipped this harness yet. Reply if you have.
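A minimal version of that harness, with a toy word-overlap scorer standing in for the model (in a real harness you would ask Claude to route each message, then compare against the target skill):

```python
def pick_skill(message, descriptions):
    """Toy stand-in for the model's routing step: score each description
    by word overlap with the message, return the best skill (or None)."""
    words = set(message.lower().split())
    best, best_score = None, 0
    for skill, desc in descriptions.items():
        score = len(words & set(desc.lower().split()))
        if score > best_score:
            best, best_score = skill, score
    return best

def trigger_accuracy(held_out, descriptions):
    """held_out: list of (user_message, target_skill) pairs."""
    hits = sum(pick_skill(msg, descriptions) == target
               for msg, target in held_out)
    return hits / len(held_out)

descriptions = {
    "db-migrations": "use when the user asks to write or run a database migration",
    "deploy": "use when the user asks to ship deploy or release a service",
}
held_out = [
    ("help me write a migration", "db-migrations"),
    ("deploy the billing service", "deploy"),
]
print(trigger_accuracy(held_out, descriptions))  # 1.0 on this toy set
```

Swap the scorer for a real routing call and the same accuracy number becomes the metric you A/B test descriptions against.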
Part 2: The prompt-caching rules Claude Code declares a SEV over, and why swapping tool sets mid-conversation is the most expensive bug an agent runtime can ship.
Part 3: Seeing like an agent: how tool design decides whether your agent works at all.
