
The Anatomy of a Claude Skill

A Claude skill is a folder containing a SKILL.md file and optional supporting resources. When invoked, it injects domain-specific instructions into Claude's conversation context. A skill is not executable code but a prompt template that modifies how Claude processes subsequent requests by changing the conversation context and, optionally, the execution context (tool permissions, model selection). This post covers the mechanical structure of SKILL.md, concrete examples at each quality tier, and the dimensions that separate functional skills from effective ones. The analysis draws on Anthropic's official skill-authoring documentation, the SkillsBench research paper (Feb 2026, 7,308 evaluated trajectories), and patterns validated by practitioners shipping skills in production.

Try my skill-audit.md document to evaluate the current state of your skills.


How Skill Selection Works

Claude receives a Skill meta-tool whose description contains the name and description from every installed skill's YAML frontmatter. Selection happens entirely through LLM reasoning with no algorithmic routing, keyword matching, or intent classification at the code level. Claude reads the descriptions and decides which skill matches the user's intent.

This means the description field functions as routing logic. If it's vague, Claude either won't find the skill or will trigger the wrong one when selecting from many (100+) candidates.
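Conceptually, the routing surface Claude reasons over is just the concatenated frontmatter of every installed skill. A minimal sketch of how that surface could be assembled — the directory layout and the naive frontmatter parsing are illustrative assumptions, not Anthropic's implementation:

```python
from pathlib import Path

def frontmatter_fields(skill_md: str) -> dict:
    """Naively parse name/description out of YAML frontmatter.
    Illustration only -- not Anthropic's parser."""
    fields, key, in_fm = {}, None, False
    for line in skill_md.splitlines():
        if line.strip() == "---":
            if in_fm:
                break          # closing delimiter: end of frontmatter
            in_fm = True
            continue
        if not in_fm:
            continue
        if ":" in line and not line.startswith(" "):
            key, _, value = line.partition(":")
            key, value = key.strip(), value.strip().strip('"')
            fields[key] = "" if value in (">", "|") else value
        elif key and line.startswith(" "):   # folded-scalar continuation line
            fields[key] = (fields[key] + " " + line.strip()).strip()
    return fields

def routing_surface(skills_dir: Path) -> str:
    """Concatenate every skill's name + description: the only text
    Claude sees when deciding which skill matches the request."""
    entries = []
    for skill_md in sorted(skills_dir.glob("*/SKILL.md")):
        fm = frontmatter_fields(skill_md.read_text())
        entries.append(f"- {fm.get('name', '?')}: {fm.get('description', '')}")
    return "\n".join(entries)
```

Seen this way, a vague description is a routing bug: it degrades the one string the selection step has to work with.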

Structure of SKILL.md

The file has two parts: YAML frontmatter (configuration) and markdown body (instructions).

Frontmatter

```yaml
---
name: processing-pdfs          # Required. Lowercase, hyphens, numbers. Max 64 chars.
description: >                  # Required. Max 1024 chars.
  Extracts text and tables from PDF files, fills forms, merges
  documents. Use when working with PDF files or when the user
  mentions PDFs, forms, or document extraction.
allowed-tools: "Read,Write,Bash" # Optional. Scopes tool permissions.
model: "claude-opus-4-20250514"  # Optional. Override model.
---
```

name — Lowercase letters, numbers, hyphens only. Community convention favors gerund form (processing-pdfs, analyzing-spreadsheets) over noun phrases or vague labels (helper, utils).

description — Must be third person ("Processes Excel files..." not "I can help you..."). Inconsistent point-of-view causes discovery failures because the description is injected directly into the system prompt. Should include what the skill does, when to trigger it, and key vocabulary users would actually say.

allowed-tools — Scopes which tools the skill can access without user approval. Supports wildcards (Bash(git:*) restricts to git subcommands). Should contain only what the skill actually needs.
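These constraints are mechanical enough to lint. A minimal sketch checking the two required fields against the limits stated above (the function name and the crude first-person heuristic are my own):

```python
import re

# name: lowercase letters, numbers, hyphens; no leading/trailing hyphen.
NAME_RE = re.compile(r"^[a-z0-9]+(-[a-z0-9]+)*$")

def lint_frontmatter(name: str, description: str) -> list[str]:
    """Return a list of violations of the documented frontmatter rules."""
    problems = []
    if not NAME_RE.match(name):
        problems.append("name: lowercase letters, numbers, and hyphens only")
    if len(name) > 64:
        problems.append("name: exceeds 64 characters")
    if not description.strip():
        problems.append("description: required")
    if len(description) > 1024:
        problems.append("description: exceeds 1024 characters")
    # Rough check for first-person phrasing ("I can help you...").
    if description.lstrip().lower().startswith(("i ", "i'")):
        problems.append("description: use third person, not first")
    return problems
```

Running this in CI over every SKILL.md catches the cheap failures before they become discovery failures.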

Markdown Body

The body is the prompt Claude receives when the skill activates. It shares the context window with the system prompt, conversation history, other skills' metadata, and the user's request. Anthropic recommends keeping it under 500 lines and pushing detailed content to reference files.

Directory Structure

```
my-skill/
├── SKILL.md              # Entry point (loaded when skill activates)
├── references/           # Text loaded INTO context via Read tool (costs tokens)
├── scripts/              # Code executed via Bash (only output costs tokens)
├── assets/               # Files referenced by path (zero token cost until used)
└── LICENSE.txt
```

The distinction between references/ and scripts/ matters for context management. A 10KB markdown file in references/ consumes context tokens when loaded. A 10KB Python script in scripts/ does not because Claude executes it and only the output enters context.
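A hypothetical scripts/ file makes the pattern concrete: the script can be arbitrarily long and do arbitrary work, but only what it prints enters Claude's context. (The filename and the specific summary are illustrative.)

```python
#!/usr/bin/env python3
"""Hypothetical scripts/summarize_file.py: the heavy lifting stays out of
context; Claude pays tokens only for the one line this prints."""
import sys
from pathlib import Path

def summarize(path: Path) -> str:
    # Reading and processing the whole file is "free" from a
    # context-window perspective -- none of it is returned verbatim.
    text = path.read_text(errors="replace")
    lines = text.splitlines()
    return f"{path.name}: {len(lines)} lines, {len(text)} chars"

if __name__ == "__main__" and len(sys.argv) > 1:
    print(summarize(Path(sys.argv[1])))
```

The equivalent content dropped into references/ would cost its full size in tokens every time it was read.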


Quality Tiers

Bad

```yaml
---
name: helper
description: "Helps with documents"
---
```

```markdown
# Document Helper

PDF (Portable Document Format) files are a common file format
that contains text, images, and other content. To extract text
from a PDF, you'll need to use a library...

## What is a PDF?
PDFs were created by Adobe in the 1990s...
```

Failures: Vague name provides no differentiation. Description has no trigger conditions or key terms and zero routing signals. Body explains concepts Claude already has in its weights. No workflow, no validation, no examples, no error handling.

Good

```yaml
---
name: processing-pdfs
description: "Extracts text and tables from PDF files, fills forms,
  merges documents. Use when working with PDF files or when the user
  mentions PDFs, forms, or document extraction."
allowed-tools: "Read,Write,Bash"
---
```

````markdown
# PDF Processing

## Quick Start
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

## Form Filling
See FORMS.md for the complete guide.

## Common Issues
- Scanned PDFs need OCR first — use pytesseract
- Encrypted PDFs — try pikepdf for decryption
````
Strengths: Specific name, gerund form. Description includes trigger conditions. Body is concise and assumes Claude knows what pdfplumber is. Uses progressive disclosure via FORMS.md link. Addresses common failure modes.

Missing: No conditional routing between PDF types. No validation loop. No checklists. No input/output examples. No structured error mapping.

Great

```yaml
---
name: processing-pdfs
description: "Extracts text and tables from PDF files, fills forms,
  merges documents. Use when working with PDF files, forms, or
  document extraction. Handles scanned (OCR), encrypted, and
  multi-page documents."
allowed-tools: "Read,Write,Bash"
---
```

````markdown
# PDF Processing

## Determine Approach
1. Text-based → "Text Extraction" below
2. Scanned/image-based → "OCR Pipeline" below
3. Form to fill → See FORMS.md
4. Encrypted → Decrypt first: `python {baseDir}/scripts/decrypt.py input.pdf`

## Text Extraction
```python
import pdfplumber
with pdfplumber.open("file.pdf") as pdf:
    text = pdf.pages[0].extract_text()
```

## OCR Pipeline
Run: `python {baseDir}/scripts/ocr_extract.py input.pdf output.txt`

## Validation Loop
After ANY extraction:
1. Run: `python {baseDir}/scripts/validate.py output.txt`
2. If validation fails → fix the issue → run validation again
3. Only proceed when validation passes

## Task Checklist
- [ ] Identify PDF type (text/scanned/form)
- [ ] Check for encryption
- [ ] Extract content
- [ ] Validate output
- [ ] Deliver result

## Examples
Input: "Extract the table from page 3 of quarterly-report.pdf"
Output:
| Quarter | Revenue | Growth |
|---------|---------|--------|
| Q1      | $2.1M   | 12%    |

## Error Handling
| Symptom           | Cause       | Fix                      |
|-------------------|-------------|--------------------------|
| No text found     | Scanned PDF | Switch to OCR pipeline   |
| Encoding error    | Non-UTF8    | Detect with chardet      |
| Permission denied | Encrypted   | Run decrypt script first |
````

What this adds over Good: Conditional routing as the first step. A validation feedback loop (run → fix → repeat). A copy-paste checklist for progress tracking. Concrete input/output examples. An error table mapping symptoms → causes → fixes. All paths use {baseDir} for portability. Tool permissions scoped to what's needed.


The 8 Differentiators

1. Description as Routing Logic

Low Tier: states what the skill does. Top Tier: states what it does + when to trigger + includes key vocabulary for discovery. The description must work as a selection mechanism across many (100+) candidates.

2. Token Economy

Low Tier: explains everything inline. Top Tier: progressive disclosure across three loading stages. At discovery, only the frontmatter is loaded (~30 tokens). At activation, SKILL.md loads (~500 tokens). During execution, reference files load on demand. A skill with 10 reference files can cost 30 tokens in discovery and only load what's relevant during execution.
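The arithmetic is stark even with rough numbers. Using the illustrative figures above (all token counts are estimates, not measurements):

```python
# Rough, illustrative token costs from the staged-loading model above.
DISCOVERY = 30        # frontmatter only, always in context
ACTIVATION = 500      # SKILL.md body, loaded on trigger
REFERENCE = 2_000     # a typical reference file, loaded on demand

# Kitchen-sink skill: all ten reference files' worth of content inlined,
# so every activation pays for everything.
inline_cost = DISCOVERY + ACTIVATION + 10 * REFERENCE

# Progressive disclosure: ten reference files exist on disk, but a
# typical task only needs one of them loaded.
staged_cost = DISCOVERY + ACTIVATION + 1 * REFERENCE

print(inline_cost, staged_cost)
```

Under these assumptions the staged skill costs roughly an eighth of the inlined one per activation, and idle skills cost only their frontmatter.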

3. Workflow Structure

Low Tier: unstructured information. Top Tier: sequential steps with decision points, copy-paste checklists (- [ ] items), and explicit exit criteria.

4. Validation Loops

Low Tier: no validation, or vague "check your work." Top Tier: run validator → fix errors → validate again → proceed only when passing. For destructive operations, the plan-validate-execute pattern adds an intermediate step: Claude creates a structured plan file, validates it with a script, then executes only after validation passes.
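The loop only works if the validator gives an unambiguous pass/fail signal. A minimal sketch of what a {baseDir}/scripts/validate.py could look like — the specific checks (non-empty output, no replacement characters) are illustrative assumptions:

```python
#!/usr/bin/env python3
"""Illustrative validate.py: exit 0 on pass, non-zero on failure, so the
run -> fix -> re-run loop terminates on an explicit signal."""
import sys
from pathlib import Path

def validate(path: Path) -> list[str]:
    """Return a list of problems; an empty list means the output passes."""
    if not path.exists():
        return [f"{path}: output file missing"]
    text = path.read_text(errors="replace")
    problems = []
    if not text.strip():
        problems.append(f"{path}: extraction produced no text")
    if "\ufffd" in text:
        problems.append(f"{path}: replacement characters suggest an encoding error")
    return problems

if __name__ == "__main__" and len(sys.argv) > 1:
    problems = validate(Path(sys.argv[1]))
    for problem in problems:
        print(problem, file=sys.stderr)
    sys.exit(1 if problems else 0)
```

Because failure is an exit code rather than prose, the skill can instruct Claude to loop on it mechanically: rerun until the script exits 0.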

5. Error Handling

Low Tier: silent or absent. Top Tier: structured tables mapping symptom → cause → fix. This is where procedural knowledge gets encoded — the pattern recognition that experienced practitioners apply automatically but that Claude lacks by default.

6. Concrete Examples

Low Tier: none or vague descriptions. Top Tier: input/output pairs covering typical and edge cases.

7. Freedom Calibration

Low Tier: uniform specificity. Top Tier: calibrated to task fragility. High freedom (general guidance) for context-dependent decisions like code review. Low freedom (exact commands, no modification) for fragile operations like database migrations or deployments.

8. Eval-Driven Development

Low Tier: untested. Top Tier: RED-GREEN-REFACTOR cycle applied to skill authoring. Run Claude on representative tasks without the skill and document failures (RED). Write the skill to address those failures (GREEN). Trim tokens and tighten (REFACTOR). If the skill wasn't written to address observed failures, there's no evidence it teaches the right thing.
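The RED-GREEN comparison can be reduced to a small harness. In this sketch, `run_agent` is a placeholder for however you invoke Claude (CLI, API) and `passes` is a task-specific grader — both are assumptions, not an existing tool:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalTask:
    prompt: str
    passes: Callable[[str], bool]   # grades the agent's final output

def score(tasks: list[EvalTask], run_agent: Callable[[str], str]) -> float:
    """Fraction of tasks passed. Run once without the skill installed (RED),
    then with it (GREEN), and compare the two scores."""
    results = [task.passes(run_agent(task.prompt)) for task in tasks]
    return sum(results) / len(results)
```

The skill earns its tokens only if the with-skill score is reliably higher than the without-skill score on the same task set; the REFACTOR step then trims tokens while keeping that gap.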


Empirical Validation

The SkillsBench paper (Feb 2026) evaluated skill efficacy across 84 tasks, 11 domains, and 7,308 agent trajectories using Claude Code, Gemini CLI, and Codex CLI under three conditions: no skills, curated skills, and self-generated skills (model writes its own).

Key findings:

The second finding is critical: it suggests the improvement comes from the quality of authoring, not merely from having more context available.


Anti-Patterns

Kitchen Sink: Everything crammed into SKILL.md. Context overflow causes Claude to ignore buried instructions.

Encyclopedia: Explaining concepts already in Claude's weights. Every token spent on background is a token not spent on validation steps or error tables.

Force-Load Trap: Using @ syntax to immediately load files, consuming large token budgets before relevance is established. Use Read tool references for on-demand loading.

Voodoo Constants: Configuration values with no justification. TIMEOUT = 30 without explaining why 30.

Deeply Nested References: SKILL.md → A.md → B.md → actual content. Claude may partially read nested files using head -100 previews. Keep references one level deep from SKILL.md.

Time-Sensitive Content: Instructions referencing specific dates ("before August 2025, use the old API"). Use "Current method" and "Legacy" sections instead.


Summary

The structural difference between a bad skill and a great one reduces to: route precisely (description-as-routing-logic), load minimally (progressive disclosure), guide explicitly (conditional routing + workflows + checklists), validate continuously (feedback loops), encode judgment (error tables + examples), and verify empirically (eval-driven development).

Use my skill-audit.md document to evaluate the current state of your skills.


Sources:

#agents #context-engineering #llms #skills