This guide treats enterprise skills as a two-layer problem. The procedure layer is what Anthropic’s specification covers: the SKILL.md file, the security checklist, and the evaluation suite. The context layer is what the spec assumes you have already solved: the metric definitions, lineage, and access policies that determine whether the skill’s output is true.
Most teams underbuild the second one. This article is for teams that want to ship both.
What are enterprise skills?
Permalink to “What are enterprise skills?”Enterprise Skills are governed business workflows that AI agents call through protocols like the Model Context Protocol (MCP) to complete important tasks. Each skill comes with three guarantees: the same input produces the same output, every action is logged for audit, and execution stays within safe enterprise limits.
A skill represents a full business operation, not just a single step. It bundles the business logic, validation checks, approval gates, and audit trail needed to carry out actions such as creating an invoice, processing a refund, or running a compliance report. The agent calls the skill, and the skill takes care of running it.

Every Enterprise Skill is deterministic, auditable, and bounded by safe limits. Image by Atlan.
Anthropic introduced agent skills on October 16, 2025, and then published them as an open standard on December 18, 2025. The format is a directory containing a SKILL.md file with YAML frontmatter (name, description) plus optional scripts and reference files. Barry Zhang, who co-created Agent Skills at Anthropic, described the design intent at the AI Engineer Conference: “Skills are organized collections of files that package composable procedural knowledge for agents. In other words, they’re folders. This simplicity is deliberate. We want something that anyone, human or agent, can create and use as long as they have a computer.”
The format spread quickly. The public reference repository at github.com/anthropics/skills has crossed 137,000 GitHub stars and 16,200 forks. Adoption extends beyond Anthropic’s own tools to OpenAI Codex, Cursor, Gemini CLI, Microsoft VS Code, Goose, Databricks, Spring AI, and Mistral.
The word “enterprise” changes what counts as a skill. An individual skill is a markdown file. An enterprise skill is an artifact under governance. Anthropic’s enterprise specification spends more lines on review, evaluation, and lifecycle than on what skills do. That ratio is the signal: the governance envelope, not the artifact, is what makes a skill enterprise-ready. Anthropic enforces this at the API layer with a maximum of 8 skills per request, a design constraint that forces consolidation into role-based bundles rather than indiscriminate accumulation.
For a deeper definition of the substrate beneath, explore the context layer and understand its architecture.
How do enterprise skills work?
Permalink to “How do enterprise skills work?”Skills use three-level progressive disclosure. Metadata loads at startup, the full SKILL.md loads when the agent detects relevance, and bundled resources load on demand. Each skill summary costs only a few dozen tokens, so an agent can carry many skills without flooding its context window. Enterprise deployment layers central provisioning, role-based bundles, and the 8-skills-per-request API on top of that architecture.

Metadata loads first, then SKILL.md on relevance, then bundled resources on demand. Image by Atlan.
The progressive disclosure design exists because the alternative doesn’t scale. Barry Zhang explained the tradeoff directly: “Traditional tools have pretty obvious problems. Some tools have poorly written instructions and are pretty ambiguous, and when the model is struggling, it can’t really make a change to the tool. So, it’s just kind of stuck with a cold start problem, and they always live in the context window.” He continued: “At this point, skills can contain a lot of information, and we want to protect the context window so that we can fit in hundreds of skills and make them truly composable. That’s why skills are progressively disclosed.”
Below is an overview of how enterprise skills work through different levels:
- Level 1, metadata: The skill’s name and description (from YAML frontmatter) load into the system prompt at agent startup. This is what the agent uses to decide whether to invoke the skill.
- Level 2, instructions: The full SKILL.md body loads only when the agent matches the user’s task to the skill’s description. This is the workflow itself: how to do the thing, what inputs to expect, what outputs to produce.
- Level 3, resources: Bundled files (reference documents, code samples, templates) and executable scripts load on demand. Skills run their scripts within Anthropic’s sandboxed code-execution tool.
Skills vs. MCP servers vs. tools
Permalink to “Skills vs. MCP servers vs. tools”Most teams hit this comparison before they hit any production problem. Skills teach agents how to do things. MCP servers give agents the ability to do things. Skills and MCP servers aren’t competing; they’re different layers of the same architecture.
| Aspect | Tools (function calling) | MCP servers | Agent skills |
|---|---|---|---|
| What it does | One discrete action | Standardized capability access with auth | Teaches the agent how to do a workflow |
| Layer | Execution | Capability and protocol | Procedure |
| Auth model | Caller’s context | OAuth native (built into spec) | Inherits agent’s auth context |
| Format | Function with input/output schema | Server with tools, resources, prompts | Folder with SKILL.md and optional scripts |
| Best for | Single API calls | Cross-system access with governance | Repeatable multi-step workflows |
| Enterprise risk | Permission misuse | Token management complexity | Trigger conflicts, malicious scripts, and ungoverned data reach |
There’s a contrarian camp worth acknowledging.
On r/ClaudeCode, a developer named mheryerznka has been running the same experiment for weeks: “Whenever a tool has both an MCP server and a CLI, I’ll set up a skill that teaches Claude Code how to drive the CLI, then compare it to using the MCP version. The skill + CLI path almost always wins: faster, more reliable, way fewer tokens (no schema bloat in context), easier to debug.”
On the other hand, Alex Salazar, founder and CEO of Arcade, argues that CLI vs. MCP is the wrong fight. “The real problem isn’t calling tools, it’s permission enforcement, step skipping, and cross-system auditability. That’s a runtime problem.”
Whether your stack uses Skills plus CLI, Skills plus MCP, or both, the open question remains the same: who governs the data the procedure accesses?
Why do organizations need enterprise skills?
Permalink to “Why do organizations need enterprise skills?”A general agent can be intelligent without being useful. Barry Zhang opened his conference talk with the gap: “Agents have intelligence and capabilities, but not always the expertise that we need for real work.” Enterprise skills are how you transfer specific organizational expertise to a general agent: how your team writes API endpoints, how your finance team closes the books, how your legal team reviews vendor contracts.
There’s a second-order effect that the discourse doesn’t talk about enough. Skills become the delivery mechanism for tribal knowledge that previously lived in wikis nobody read. The SKILL.md format is portable, version-controlled, and human- and agent-consumable. That’s a meaningful upgrade over the Confluence page that hasn’t been updated in forever.
Reproducible quality across teams
Permalink to “Reproducible quality across teams”Every developer ships their own skill collection. There’s no shared source of truth, no recall guardrails, and the agent ends up with three near-duplicate skills for the same job.
Elvis Sun, a former software engineer at Google, documented this pattern in a 9-hour deep dive: "The agent wanted to read an image from my desktop. Tried browser read and vision skill, nothing worked. So it wrote a third skill, the read-local-image skill. These are 3 skills all adjacent to ‘image + local filesystem + model can see it.’ The skill grows and becomes mutually non-exclusive very quickly. This is the long-tail failure mode.
Central provisioning, Git as source of truth, signed commits, and registry deduplication address this. Quality becomes a property of the artifact, not the developer who shipped it.
Role-based capability assignment
Permalink to “Role-based capability assignment”Bundling every skill for every user dilutes recall. The agent picks the wrong skill.
Mahesh Murag highlighted what excites Anthropic about this pattern: “We’re seeing skills that are being built by people that aren’t technical. These are people in functions like finance, recruiting, accounting, legal.” That distributed authorship is what makes role-based bundling tractable.
The domain expert authors the skill, the platform team scopes it to the role, and the recall cap keeps the bundle disciplined.
Governed reach into business data
Permalink to “Governed reach into business data”A skill that runs perfectly but reaches for ungoverned data confidently returns a wrong answer. Most enterprise-skills content stops at the script boundary. The reader-facing failure isn’t that the skill didn’t trigger; it’s the agent reporting a quarterly metric with confidence, derived from a definition that no two systems agreed on.
Pair every data-reaching skill with a context-layer query so the skill answers from certified metrics, lineage, and policy. The skill encodes how. The context layer guarantees what’s true. Together, the output traces back to a governed source. This is the audit story compliance teams need.
How to implement enterprise skills
Permalink to “How to implement enterprise skills”Anthropic’s recommended lifecycle has six steps: Plan, Create and Review, Test, Deploy, Monitor, and Iterate. We add a seventh. Every skill that reaches into business data must be grounded in a governed context source before it ships. That step doesn’t replace anything in Anthropic’s checklist. It closes the gap the checklist leaves open.
The teams that ship skills well treat governance as the first build, not the last. As one practitioner described publicly on X, governance-first deployments are the differentiator. That order matters.
Most teams reverse it and rebuild governance after their first incident.
Prerequisites
Permalink to “Prerequisites”- Team or Enterprise plan for central provisioning
- Git-tracked skill repository with signed commits and review gates
- Skill registry documenting purpose, owner, version, dependencies, and evaluation status
- An evaluation framework that tests trigger accuracy, coexistence with existing skills, and output quality
- Governed context source (such as a data catalog or MCP-fronted metadata store) for any skill that reads business data
1. Plan and identify workflows
Permalink to “1. Plan and identify workflows”Map repetitive, error-prone, or specialized workflows to specific roles. Decide which workflows to become skills. Start narrow. Workflow-specific skills outperform broad multi-purpose ones because the agent’s trigger decision becomes cleaner.
Timeline: 1 to 2 weeks.
2. Create the skill
Permalink to “2. Create the skill”Write SKILL.md with frontmatter, instructions, and examples. Bundle templates and reference files. Ensure a hard rule of no hardcoded credentials, no untrusted network calls, no executable scripts that don’t need to be there.
Timeline: 1 to 3 days per skill.
3. Security review
Permalink to “3. Security review”Apply Anthropic’s checklist. Read all directory content. Verify scripts in a sandbox. Scan for instruction manipulation. Check for credentials and server-side request forgery patterns. The author cannot review their own skill.
Skipping this step is the difference between a deployed agent and a deployed exfiltration tool.
Timeline: 1 day per skill.
4. Evaluation
Permalink to “4. Evaluation”Build 3 to 5 representative queries per skill. Cover three cases: should-trigger, should-not-trigger, and ambiguous edge cases.
Run isolation tests, then coexistence tests against your existing skill set. Block deployment if recall accuracy degrades when the new skill is added.
Timeline: 1 to 2 days per skill.
5. Ground in context
Permalink to “5. Ground in context”For skills that reach into business data, route every data access through a context layer with metric definitions, lineage, and access policies.
Typically, this is delivered through an MCP server pointed at a governed metadata source. The skill instructs how. The context layer guarantees what’s true.
Timeline: ongoing, as part of platform maturity.
6. Deploy and version-pin
Permalink to “6. Deploy and version-pin”Upload via the Skills API. Pin production to a specific version. Keep the previous version available as a rollback. Document in your internal registry with owner, dependencies, and evaluation results.
There’s a trust curve worth designing for. An Anthropic agent autonomy study, as summarized on the AI Daily Brief, describes it: “Claude Code’s default settings require users to manually approve each action, and so Anthropic suspects that what we’re seeing is a steady accumulation of trust. At the beginning, you approve things each time, and then as you dial in your settings, and you start to learn to trust the model, you give it that auto-approval more frequently.”
7. Monitor and iterate
Permalink to “7. Monitor and iterate”Track usage. Re-run evaluations periodically. Deprecate skills with persistent failures. Treat every update as a new deployment requiring a full security review. The goal isn’t blanket autonomy.
Common pitfalls to avoid:
Permalink to “Common pitfalls to avoid:”While implementing skills, you need to watch out for these common pitfalls:
- Skill descriptions are too broad: Trigger conflicts cause the wrong skill to fire. Narrow your descriptions. Use evaluations to confirm trigger accuracy before deployment.
- The author reviews their own skill: Enforce separation of duties in CI/CD. A different reviewer is required to merge.
- Skill reaches into ungoverned data: The agent answers confidently and wrongly. Route data access through a governed context layer.
- Loading too many skills per request: Consolidate related narrow skills into role-based bundles only after evaluations confirm parity.

Seven steps to deploy governed AI skills in enterprise environments, from planning to governance. Image by Atlan.
How to choose a skill registry and governance approach
Permalink to “How to choose a skill registry and governance approach”Evaluate registries on six criteria: vulnerability scanning, signed provenance, version control with rollback, role-based provisioning, evaluation gating, and integration with your governed context layer. Treat skill installation with the same rigor as installing software on production systems.
Here’s an evaluation criterion to help you gain more clarity:
| Criterion | Why it matters | What to look for |
|---|---|---|
| Vulnerability scanning | Skills are code; malicious scripts can exfiltrate data | Pattern-based scans across known attack categories (OWASP AST10 coverage) |
| Signed provenance | Establishes who wrote and approved each skill | Signed commits, SLSA-level attestations, checksums verified at deploy |
| Version control and rollback | New skill versions can degrade existing skill recall | Pinned production versions, full evaluation suite required to promote |
| Role-based provisioning | Recall accuracy drops as skills proliferate | Native role bundling, hard cap on simultaneous skills (Anthropic: 8 per request) |
| Evaluation gating | Skills that trigger the wrong waste context and produce wrong answers | Required submission of 3 to 5 representative queries per skill |
| Context-layer integration | Skills that reach for ungoverned data hallucinate at scale | First-class MCP support pointing at a governed metadata source |
Questions to ask a registry vendor:
Permalink to “Questions to ask a registry vendor:”- How do you scan submitted skills for credential exposure and arbitrary code execution?
- What’s your separation-of-duties enforcement model?
- Can you demonstrate role-based bundling with hard recall caps?
- How does your registry integrate with our metadata governance source?
- What happens to dependent agents when a skill is deprecated?
- Do you support signed provenance and integrity verification at deploy time?
How Atlan grounds enterprise skills in a governed context layer
Permalink to “How Atlan grounds enterprise skills in a governed context layer”Atlan is the context layer for enterprise AI. It sits between business systems and AI agents, exposing governed metric definitions, lineage, ontology, and access policies through the Atlan MCP server. On top of that substrate, Atlan ships context agents (Description Agent, README Agent, and SQL Intelligence Agent, with Link Terms Agent in preview), packaged agent skills grounded in governed metadata.
The challenge most enterprise-skill content misses
Permalink to “The challenge most enterprise-skill content misses”Traditional enterprise skill deployments stop at the code boundary. Git review, security scan, evaluation suite. They miss the harder boundary: the data the skill reaches into. A skill that generates a quarterly metric report can pass every governance gate and still ship the wrong number because no two systems agree on what the metric means.
Mike Krieger, Anthropic’s Chief Product Officer, framed the deeper problem at Code with Claude: “Raw model capability alone isn’t enough to unlock these multi-hour workflows. In practice, agents also need access to real-world information, a connection to your existing systems, and cost-efficient scaling.” He returned to the same point later: “You need more than just intelligent models, you need the right platform.”
Atlan’s approach
Permalink to “Atlan’s approach”Atlan unifies metadata from 80+ source connectors into the Enterprise Data Graph, with active metadata, ontology, lineage, and access policies layered on top. Agents reach that substrate through the Atlan MCP server, which returns the data graph, business definitions, lineage, and policies relevant to a specific task.
Atlan’s metadata model includes a native Skill type, described as a reusable, versionable unit of capability that can be consumed by agents, paired with a SkillArtifact type for files or data associated with a skill. Skills are first-class governable assets in the model, not loose files in a folder somewhere.
On top of that substrate, Context Agents are the worked example. The Description Agent generates semantic descriptions of assets. The README Agent produces context-aware READMEs by asset type. The SQL Intelligence Agent surfaces query patterns. The Link Terms Agent, currently in preview, connects assets to glossary terms.
Each one follows the agent-skills pattern, but every output traces back to certified metadata, every read is policy-checked, and every skill behaves consistently across teams.
The outcome
Permalink to “The outcome”Enterprise teams using Atlan ship agent skills whose outputs trace back to a single source of governed truth. The operating model is reproducible. Any skill that reaches for business data routes through the context layer.
Atlan helps global customers, including a customer base shared with Snowflake that grew 415% in two years. Atlan was named Snowflake’s 2025 Data Governance Partner of the Year.
FAQs about enterprise skills
Permalink to “FAQs about enterprise skills”Is the context layer over-engineering for most teams?
Permalink to “Is the context layer over-engineering for most teams?”The context-layer requirement applies to skills that read or reason about enterprise data, where two systems can disagree on a metric definition, and the agent has no way to know which is correct.
Are scanners enough to secure a skill registry?
Permalink to “Are scanners enough to secure a skill registry?”No. Snyk’s research showed a malicious scanner (SkillGuard) installed by hundreds of teams, and demonstrated that denylist scanners cannot enumerate every prompt-injection variant. Scanners are a useful floor, not a ceiling. Signed provenance and version pinning matter more.
What’s the difference between adopting the open standard and joining the Skills Directory?
Permalink to “What’s the difference between adopting the open standard and joining the Skills Directory?”These are often conflated. The open standard is the SKILL.md specification published at agentskills.io in December 2025. Adopters include OpenAI Codex, Cursor, Gemini CLI, Microsoft VS Code, Goose, Databricks, Spring AI, and Mistral. The Skills Directory is Anthropic’s vetted partner catalog of pre-built skills, with launch partners including Atlassian, Figma, Canva, Stripe, Notion, and Zapier. Being on the Directory does not mean you have adopted the spec, and adopting the spec does not put you on the Directory.
How does skill sprawl actually break in production?
Permalink to “How does skill sprawl actually break in production?”Agents write near-duplicate skills because they can’t tell that they have already written similar ones three folders over. Recall accuracy degrades as more skills are loaded. Trigger conflicts cause the wrong skill to fire. Production teams handle sprawl through central provisioning, registries with usage tracking, and periodic deprecation reviews.
What security risks do enterprise agent skills introduce?
Permalink to “What security risks do enterprise agent skills introduce?”The main risks are arbitrary code execution from skill scripts, credential exposure inside skill files, instruction manipulation that bypasses safety rules, data exfiltration through external URL fetches, and registry poisoning. Treat skill installation with the same rigor as production software installs: full audit before deploy, separation of duties between author and reviewer, version pinning, and signed provenance where possible.
What to do before shipping your first enterprise skill
Permalink to “What to do before shipping your first enterprise skill”Enterprise skills are what enable AI agents to become reliable specialists at scale. They’re repeatable, auditable, role-assigned, and version-pinned. Anthropic’s spec gives every team the governance scaffolding they need on the code side, including review checklists, evaluation suites, separation of duties, and recall caps. Teams that stop there are prone to ship confident hallucination at the organizational scale.
Enterprise-grade agent skills require a second governance surface: a context layer that grounds every data-reaching skill in certified metrics, lineage, and policy. Pair the two governance surfaces, and every output traces back to a single source of truth. That’s the audit story compliance teams need, and the trust loop that makes skills worth deploying.
Learn how a context layer for AI agents underwrites enterprise skill governance.