The Skills Playbook Anthropic Actually Uses
Hundreds in production. 9 categories, 9 rules, one reframe most of us missed.
100s of skills at Anthropic · 9 categories found · 1 wk on verification alone
Thariq Shihipar from the Claude Code team just published the internal skills playbook Anthropic has been compiling for a year. Buried in it is a reframe most of us missed.
A skill is not a markdown file. It is a folder. The SKILL.md is a router, and the real work happens in the references, scripts, and configs it points at. Claude reads lazily through that folder the same way you grep a repo, which makes the filesystem itself a form of context engineering.
That one reframe changes how you build everything else. Here is what it unlocks.
BEFORE: just a markdown file

my-skill/
└─ SKILL.md

Flat. Limited. Everything in one file, so everything competes for the same context window.

AFTER: a folder Claude explores

my-skill/
├─ SKILL.md ◄ router
├─ references/
├─ scripts/
├─ assets/
└─ config.json

Progressive disclosure. Claude reads what it needs, when it needs it.
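To make the folder reframe concrete, here is a hypothetical router-style SKILL.md. The skill name, file names, and wording are invented for illustration; the name/description frontmatter follows Anthropic's published Agent Skills format:

```markdown
---
name: db-migrations
description: >
  Use when the user asks to write, review, or run a database
  migration. Trigger phrases include "migration", "schema change",
  "backfill".
---

# Database migrations

Keep this file short; route instead of explaining.

- Conventions and naming rules: read references/conventions.md first.
- New migration: run scripts/new_migration.sh <name> instead of writing boilerplate.
- Known failure modes: references/gotchas.md (append a line every time one bites you).
- Project-specific settings live in config.json.
```

The body does almost no teaching. It tells Claude where to look, and the filesystem carries the rest.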
STEP 1: User message arrives. "help me write a migration"

STEP 2: Claude scans all skill descriptions, treating each as a trigger prompt. Does any description match the intent?

STEP 3: Match, and the skill fires. No match, and your skill stays cold, no matter how good the content is.
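Step 2 is why documentation-style descriptions underperform. A hypothetical before/after for the same skill (wording invented for illustration):

```yaml
# Reads like documentation; rarely matches an intent:
description: Helpful utilities for working with the database.

# Reads like a classifier prompt; names intents and trigger phrases:
description: >
  Use when the user asks to write, review, or run a database
  migration. Trigger phrases: "migration", "schema change",
  "alter table", "backfill".
```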
Where skills actually go in production
After cataloging hundreds of internal skills, the Claude Code team found they cluster into exactly 9 types.
01 REFERENCE: library, CLI, SDK how-to
02 VERIFY: Playwright, tmux, assertions
03 DATA: dashboards, queries, creds
04 AUTOMATE: standups, tickets, recaps
05 SCAFFOLD: new-migration, create-app
06 REVIEW: adversarial-review, style
07 SHIP: babysit-pr, deploy, cherry-pick
08 RUNBOOK: error signatures, oncall-runner
09 INFRA OPS: orphans, deps, cost-investigation
The 9 rules, compressed
Distilled from hundreds of internal skills. Read this as a checklist, not an essay.
1. Don't state the obvious. Claude already knows Python. Spend your context on what pushes it out of its defaults. The frontend-design skill exists because Claude loves Inter and purple gradients.
2. Gotchas is the highest-signal section. Every failure you observe becomes a line. Update it over time. It is the tightest eval loop you own.
3. Use the filesystem as progressive disclosure. Keep SKILL.md short. Point to references/api.md, assets/template.md, scripts/. Claude reads what it needs, when it needs it.
4. Don't railroad Claude. Give the goal and the constraints. Prescriptive step-by-step instructions break on any task the skill wasn't written for.
5. Set up via config.json plus AskUserQuestion. If the skill needs user context, store it in config.json. If it isn't there, have Claude ask once, cache it, move on.
6. The description is for the model, not the human. Write it like a classifier prompt. List the trigger phrases. If Claude never picks your skill, you have a description bug, not a content bug.
7. Store memory in $CLAUDE_PLUGIN_DATA. Data inside the skill directory gets wiped on upgrade. Use the stable folder for logs, SQLite, journaled state.
8. Ship scripts, not instructions. Code is the most powerful tool you can hand the model. Let Claude spend its turns on composition, not reconstructing boilerplate.
9. Use on-demand hooks for the dangerous stuff. /careful blocks rm -rf and DROP TABLE. /freeze blocks writes outside a directory. Scoped to the session, not always on.
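Rule 5 in practice might look like this: a skill-local config.json caching the answers Claude gathered once via AskUserQuestion. Every key below is invented for illustration:

```json
{
  "project": "billing-service",
  "default_branch": "main",
  "migration_dir": "db/migrations",
  "ticket_prefix": "BILL"
}
```

The SKILL.md then says: if a key is missing, ask the user once with AskUserQuestion, write the answer back to config.json, and never ask again.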
Most skills fail silently. Not because the content is bad, but because the description never triggers the classifier. The gap between "I wrote a skill" and "the model uses it" is almost entirely in that one field.
The description field is the API between your knowledge and Claude's action space. Treat it like one.
Quick Hits
Thariq's direct line: it can be worth having an engineer spend a week just making your verification skills excellent. Record video of the test run. Assert state programmatically at each step. Treat it like infrastructure, not documentation.
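A minimal sketch of "assert state programmatically at each step": instead of eyeballing output, poll the system until a predicate holds or fail loudly. This is a generic helper, not Anthropic's tooling; `fetch_state` stands in for whatever check your verification scripts expose.

```python
import time

def wait_for_state(fetch_state, predicate, timeout=10.0, interval=0.5):
    """Poll fetch_state() until predicate(state) is true, else raise.

    Returns the passing state so the caller can chain further assertions.
    """
    deadline = time.monotonic() + timeout
    last = None
    while time.monotonic() < deadline:
        last = fetch_state()
        if predicate(last):
            return last
        time.sleep(interval)
    raise AssertionError(f"state never satisfied predicate; last seen: {last!r}")

# Example: a (fake) deploy that becomes healthy on the third poll.
states = iter(["pending", "starting", "healthy"])
final = wait_for_state(lambda: next(states), lambda s: s == "healthy",
                       timeout=5, interval=0)
assert final == "healthy"
```

The same helper works whether the state comes from a CLI, an HTTP health check, or a Playwright page query; the point is that a failed step raises instead of scrolling past.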
Small team, few repos: check skills into .claude/skills. Scaling past that: build an internal plugin marketplace and let teams opt in. Every checked-in skill adds to base context, so at scale the marketplace wins.
Anthropic has no central skill-review team. Authors post to a sandbox folder, share in Slack, and promote to the marketplace via PR once the skill gets traction. Curation before release, not gatekeeping before creation.
Log skill invocations via a PreToolUse hook. Now you can see what's undertriggering vs. popular, and the description field becomes something you can A/B test instead of guess at.
The full teardown goes out to the paid archive this week, along with a SKILL.md template with the description-as-classifier pattern pre-wired and the Gotchas log format we use across ResearchAudio skills.
The description field isn't a summary for humans. It's a trigger for the model. Write it like a classifier prompt, not a blurb.
If the description is a classifier prompt, the obvious next move is to evaluate it like one. A held-out set of user messages, a target skill per message, a trigger accuracy score. Nobody I know has shipped this harness yet. Reply if you have.
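A minimal version of that harness, with a toy word-overlap scorer standing in for the model (in a real harness you would ask Claude to route each message, then compare against the target skill):

```python
def pick_skill(message, descriptions):
    """Toy stand-in for the model's routing step: score each description
    by word overlap with the message, return the best skill (or None)."""
    words = set(message.lower().split())
    best, best_score = None, 0
    for skill, desc in descriptions.items():
        score = len(words & set(desc.lower().split()))
        if score > best_score:
            best, best_score = skill, score
    return best

def trigger_accuracy(held_out, descriptions):
    """held_out: list of (user_message, target_skill) pairs."""
    hits = sum(pick_skill(msg, descriptions) == target
               for msg, target in held_out)
    return hits / len(held_out)

descriptions = {
    "db-migrations": "use when the user asks to write or run a database migration",
    "deploy": "use when the user asks to ship deploy or release a service",
}
held_out = [
    ("help me write a migration", "db-migrations"),
    ("deploy the billing service", "deploy"),
]
print(trigger_accuracy(held_out, descriptions))  # 1.0 on this toy set
```

Swap the scorer for a real routing call and the same accuracy number becomes the metric you A/B test descriptions against.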
Part 2: The prompt-caching rules Claude Code declares a SEV over, and why swapping tool sets mid-conversation is the most expensive bug an agent runtime can ship.
Part 3: Seeing like an agent: how tool design decides whether your agent works at all.
