This stuff smells like maybe the bitter lesson isn't fully appreciated.
You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)
Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.
The instructions are standard documents - but this is not all. What the system adds is an index of all skills, built from their descriptions, that is passed to the LLM in each conversation. The idea is to let the LLM read the skill when it is needed and not load it into context upfront. Humans use indexes too - but not in this way. But there are some analogies with GUIs and how they enhance discoverability of features for humans.
> Humans use indexes too - but not in this way.
What's different?
I wish they arranged it around READMEs. I have a directory with my tasks and I have a README.md there - before Codex had skills it already understood that it needed to read the README when it was dealing with tasks. The skills system is less directory-dependent, so it is a bit more universal - but I am not sure if this is really needed.
I have been using Claude Code to automate a bunch of my business tasks, and I set up slash commands for each of them. Each slash command starts by reading from a .md file of instructions. I asked Claude how this is different from skills and the only substantive thing it could come up with was that Claude wouldn't be able to use these on its own, without me invoking the slash command (which is fine; I wouldn't want it to go off and start checking my inventory of its own volition).
So yeah, I agree that it's all just documentation. I know there's been some evidence shown that skills work better, but my feeling is that in the long run it'll fall by the wayside, like prompt engineering, for a couple of reasons. First, many skills will just become unnecessary - models will be able to make slide decks or do frontend design without specific skills (Gemini's already excellent at design without anything beyond the base model, imho). Second, increased context windows and overall intelligence will obviate the need for the specific skills paradigm. You can just throw all the stuff you want Claude to know in your claude.md and call it a day.
Claude Code recently deprecated slash commands in favor of skills because they were so similar. Or another way of looking at it is, they added the ability to invoke a skill via /skill-name.
So how is this slash command limit enforced?
Is it part of the Claude API/PostTraining etc?
It seems like a useful tool if it is!
I'd like a user-writable, LLM-readable, LLM-non-writable character/sequence.
That would make it a lot easier to know at a glance that a command/file/directory/username/password wasn't going to end up in context and be used by a rogue agent.
It wouldn't be foolproof, since the agent could probably find some other tool out there to generate it (e.g. "write me some Unicode Python"), but it's something I haven't heard of that sounds useful. If it could be made fool/tool proof (fools and tools are so resourceful) that would be even better.
It's part of the Claude Code harness. I honestly haven't thought at all about security related to it; it's just a nice convenience to trigger a commonly run process.
Folks have run comparisons. From a huggingface employee:
> codex + skills finetunes Qwen3-0.6B to +6 on humaneval and beats the base score on the first run.
> I reran the experiment from this week, but used codex's new skills integration. Like claude code, codex consumes the full skill into context and doesn't start with failing runs. Its first run beats the base score, and on the second run it beats claude code.
https://xcancel.com/ben_burtenshaw/status/200023306951767675...
That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
https://github.com/huggingface/upskill
I can't quite tell what's being compared there -- just looks like several different LLMs?
To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.
A useful comparison would be between: a) make a carefully organised .skills/ folder, b) put the same info anywhere and just link to it from your top-level doc, c) just dump everything directly in the top-level doc.
My guess is that it's probably a good idea to break stuff out into separate sections, to avoid polluting the context with stuff you don't need; but the specific way you do that very likely isn't important at all. So (a) and (b) would perform about the same.
Your skepticism is valid. Vercel ran a study where they said that skills underperform putting a docs index in AGENTS.md[0].
My guess is that the standardization is going to make its way into how the models are trained and Skills are eventually going to pull out ahead.
[0]: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
Standards have to start somewhere to gain traction and proliferate themselves for longer than that.
Plus, as has been mentioned multiple times here, standard skills are a lot more about different harnesses being able to consistently load skills into the context window in a programmatic way. Not every AI workload is a local coding agent.
I think this tweet sums it up correctly, doesn't it?
A +6 jump on a 0.6B model is actually more impressive than a +2 jump on a 100B model. It proves that 'intelligence' isn't just parameter count; it is context relevance. You are proving that a lightweight model with a cheat sheet beats a giant with amnesia. This is the death of the 'bigger is better' dogma.
Which is essentially the bitter lesson that Richard Sutton talks about?
Does this indicate running locally with a very small (quantized?) model?
I am very interested in finding ways to combine skills + local models + MCP + aider-ish tools to avoid using commercial LLM providers.
Is this a path to follow? Or, something different?
Check out the guy's work. He's doing a lot of work on precisely what you're talking about.
https://xcancel.com/ben_burtenshaw
https://huggingface.co/blog/upskill
https://github.com/huggingface/upskill
Sounds like the benchmark matrix just got a lot bigger, model * skill combinations.
Skills are not just documentation. They include computability (programs/scripts), data (assets), and the documentation (resources) to use everything effectively.
Programs and data are the basis of deterministic results that are accessible to the LLM.
Embed an SQLite database with interesting information (bus schedules, dietary info, or a thousand other things), and a Python program run by the skill can access it.
For Claude at least, this runs in a VM and can be used from your phone.
Sure, skills are more convention than a standard right now. Skills lack versioning, distribution, updates, unique naming, selective network access. But they are incredibly useful and accessible.
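To make the bundled-database idea above concrete, here is a minimal sketch of the kind of helper script a skill could ship; the database name, its schema, and the invocation are all made up for illustration.

```python
#!/usr/bin/env python3
"""Hypothetical helper script bundled inside a skill directory.

The skill's SKILL.md would tell the agent to run something like:
    python scripts/lookup.py "Route 12"
"""
import sqlite3
import sys
from pathlib import Path

# The database ships alongside the script inside the skill folder (assumed name).
DB_PATH = Path(__file__).parent / "schedules.db"


def lookup(route: str) -> list[tuple[str, str]]:
    """Return (stop, departure_time) rows for the given route."""
    with sqlite3.connect(str(DB_PATH)) as conn:
        rows = conn.execute(
            "SELECT stop, departure_time FROM departures "
            "WHERE route = ? ORDER BY departure_time",
            (route,),
        ).fetchall()
    return rows


if __name__ == "__main__":
    for stop, time in lookup(sys.argv[1]):
        print(f"{time}  {stop}")
```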
Am I missing something? Because what you describe as the pack of stuff sounds like S-tier documentation. I get full working examples and a pre-populated database it works on?
It's all about managing context. The bitter lesson applies over the long haul - and yes, over the long haul, as context windows get larger or go away entirely with different architectures, this sort of thing won't be needed. But we've defined enough skills in the last month or two that if we were to put them all in CLAUDE.md, we wouldn't have any context left for coding. I can only imagine that this will be a temporary standard, but given the current state of the art, it's a helpful one.
I use Claude pretty extensively on a 2.5m loc codebase, and it's pretty decent at just reading the relevant readme docs & docstrings to figure out what's what. Those docs were written for human audiences years (sometimes decades) ago.
I'm very curious to know the size & state of a codebase where skills are beneficial over just having good information hierarchy for your documentation.
Why not replace the context tokens on the GPU during inference when they are no longer relevant? i.e. some tool reads a 50k token document, the LLM processes it, so then just flush those document tokens out of active context, rebuild the QKV caches, and store just some log entry in the context as "I already did this ... with this result"?
Anthropic added features like this in the 4.5 release:
https://claude.com/blog/context-management
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as part of the inference loop yet: I guess it's hard to train (i.e. you need a training set which is a good match for how people use context in practice) and it makes inference more complicated. I guess more high-level context management is just easier to implement - and it's one of the things which "GPT wrapper" companies can do, so why bother?
This is what agent calls do under the hood, yes.
I don't think so; those things happen when the agent yields control back at the end of its inference call, not during active agent inference with multiple tool calls ongoing. These days an agent can finish a whole task with thousands of tool calls during a single inference call, without yielding control back to whatever called it to do some bookkeeping.
How is it different or better than maintaining an index page for your docs? Or a folder full of docs and giving Claude an instruction to `ls` the folder on startup?
Vercel thinks it isn't:
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
It's hard to tell unless they give some hard data comparing the approaches systematically. This feels like a grift, or more charitably, trying to build a presence/market around nothing. But who knows anymore; apparently saying "tell the agent to write its own docs for reference and context continuity" is considered a revelation.
Not sure why you’re being downvoted so much, it’s a valid point.
It’s also related to attention — invoking a skill “now” means that the model has all the relevant information fresh in context, you’ll have much better results.
What I’m doing myself is writing skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for e.g. codebase analysis, deep thinking, root cause analysis, etc.
Works very well.
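For anyone curious what that looks like, here is a minimal sketch of such a prompt-injecting script; the step names and prompts are invented, not the commenter's actual workflow.

```python
#!/usr/bin/env python3
"""Hypothetical 'prompt injector' a skill could call between turns.

The skill instructs the agent to run this with the current step name;
whatever the script prints becomes the next instruction to follow.
"""
import sys

# Assumed multi-turn workflow; in practice these prompts could live in files.
WORKFLOW = {
    "start": "List the modules involved in the bug and read their entry points.",
    "analyze": "For each module listed, summarize its responsibilities and recent changes.",
    "root-cause": "Propose the three most likely root causes and the evidence for each.",
    "verify": "Write a failing test that reproduces the most likely root cause.",
}

if __name__ == "__main__":
    step = sys.argv[1] if len(sys.argv) > 1 else "start"
    print(WORKFLOW.get(step, f"Unknown step '{step}'. Valid steps: {', '.join(WORKFLOW)}"))
```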
> Is any of this standardization really needed?
This standardization, basically, makes a list of docs easier to scan.
As a human, you have a permanent memory. LLMs don't have it, they have to load it into the context, and doing it only as necessary can help.
E.g. if you had anterograde amnesia, you'd want everything to be optimally organized, labeled, etc, right? Perhaps an app which keeps all information handy.
It's not about instructions, it's about discoverability and data.
Yeah, WWW is really just text but that doesn't mean you don't need HTTP + HTML and a browser/search engine. Skills is just that, but for agent capabilities.
Long term you're right though, agents will fetch this all themselves. And at some point they will not be our agents at all.
I guess what I mean is that standardizing this bit of the problem right now feels sort of like XHTML. Many people thought that was a big deal back in the day, but it turned out to be a pointless digression.
You are right that it's just natural language, but standardization is very important, because it's never just about the model itself: the so-called harness is a big factor in LLM performance, and standardization allows every harness to index all skills.
This is pushed by Anthropic; OpenAI doesn't seem to care much about "skills". Maybe Anthropic is doing some extra training to better follow sections of text marked as a skill, who knows? Or you can just store what worked as a skill and share it with others, without any need for them to write their own prompt for common tasks?
OpenAI has already adopted Agent Skills:
- https://community.openai.com/t/skills-for-codex-experimental...
- https://developers.openai.com/codex/skills/
- https://github.com/openai/skills
- https://x.com/embirico/status/2018415923930206718
Post training can make known formats more reliable.
Skills are for the most part already generated by LLMs. And, if you're implementing them in your own workflow, they're tailored to real-world problems you've encountered.
Having a super repo of everyone else's slop is backwards thinking; you are now in the era where creating written content and verifying its effectiveness is easier than ever.
Our team has found success in treating skills more like re-usable semi-deterministic functions and less like fingers-crossed prompts for random edge-cases.
For example, we have a skill to /create-new-endpoint. The skill contains a detailed checklist of all the boilerplate tasks that an engineer needs to do in addition to implementing the logic (e.g. update OpenAPI spec, add integration tests, endpoint boilerplate, etc.). The engineer manually invokes the skill from the CLI via slash commands, provides a JIRA ticket number, and engages in some brief design discussion. The LLM is consistently able to one-shot these tickets in a way that matches our existing application architecture.
How do you test these skills for consistency over time, or is that not needed?
The same way you'd test a human following written instructions over time.
Check the results.
Please standardize the folder.
I find that even though this isn't standard, these CLI tools will scan the repo for .md files and for the most part execute the skills accordingly. Having said that, I would much prefer standards not just for this, but for plugins as well.
Standards for plugins make sense, because you're establishing a protocol that both sides need to follow to be able to work together.
But I don't see why you need a strict standard for "an informal description of how to do a particular task". I say "informal" because it's necessarily written in prose -- if it were formal, it'd be a shell script.
This is happening as we speak.
Codex started this and OpenCode followed suit within the hour.
https://x.com/embirico/status/2018415923930206718
“Proposal: include a standard folder where agent skills should be”
https://github.com/agentskills/agentskills/issues/15
That is being discussed in https://github.com/agentskills/agentskills/issues/15.
I mean, it'd be good if these tools followed the XDG base directory spec and put their config in `~/.config/claude` etc. instead of `~/.claude`.
It's one of my biggest pet peeves with a lot of these tools (now admittedly a lot of them have a config env var to override, but it'd be nice if they just did the right thing automatically).
.agent/
Skills seem a bit early to standardize. We are so early in this, why do we want to handcuff our creativity so soon?
Skills are a really simple concept. They're just custom prompts with a name and some metadata. What are you afraid of handcuffing?
Just the decision of whether to allow models to invoke them is already handled in [1][2][3] different ways.
[1]: https://code.claude.com/docs/en/skills#control-who-invokes-a... [2]: https://opencode.ai/docs/skills/#disable-the-skill-tool [3]: https://developers.openai.com/codex/skills/#enable-or-disabl...
All the more reason to standardise it
Eventually, you can standardize what you don't understand
The problem I see now is that everyone wants to be the winner in a hype cycle and be the standards bringer. How many "standards" have we seen put out now? No one talks about MCP much anymore, langchain I haven't seen in more than a year; will we be talking about Skills in another year?
They are more than that, for example the frontmatter and code files around them. The spec: https://agentskills.io/specification
Why do I want to throw away my dependency management system and shared libraries folder for putting scripts in skills?
What tools do they have access to, can I define this so it's dynamic? Do skills even have a concept for sub tools or sub agents? Why do I want to put references in a folder instead of a search engine? Does frontmatter even make sense, why not something closer to a package.json in a file next to it?
Does it even make sense to have skills in the repo? How do I use them across projects? How do we build an ecosystem and dependency management system for skills (which are themselves versioned)
> They are more than that, for example the frontmatter and code files around them.
You are right. I have edited my post slightly.
> Why do I want to throw away my dependency management system and shared libraries folder for putting scripts in skills?
You don't have to put scripts in skills. The script can be anywhere the agent can access. The skill just needs to tell the LLM how to run it.
> Does it even make sense to have skills in the repo? How do I use them across projects?
You don't have to put them in the repo. E.g. with Claude Code you can put project-specific skills in `.claude/skills` in the repo and system-wide skills in `~/.claude/skills`.
2. The spec / docs show people how to put code in a subdir. While you can reference external scripts, there is a blessed pattern that seems like an anti-pattern to me
3. Generalize: how do I store, maintain, and distribute skills shared by employees who work on multiple repos? Sounds like standard dependency management to me. It does to some of the people building collections/registries too. Not sure if any of them account for versioning; I have not seen anything tied to lock files (though I'd avoid that by using MVS for dep selection).
Agreed. I think being overly formal about what can be in the frontmatter would be a mistake, but the beauty of doing this with an LLM is that you can pretty much emulate skills in any agent by telling it to start by reading the frontmatter of each skills file and use that to decide when to read the rest, so given that as a fallback, it's hardly imposing some massive burden to standardise it a bit.
might be too early to standardize
standards are good but they slow development and experimentation
ln -s to the rescue!
The root cause should be fixed.
It's why I wrapped my tiny skills repo[1] with a script that softlinks them into whichever is your skills folder, defaulting to Claude, but it could be any other.
I treat my skills the same as I would the tiny bash scripts and fish functions I wrote in days gone by to simplify my life by typing 2 words instead of 2 sentences. A tiny improvement that only makes sense for a programmer at heart.
[1] https://github.com/flurdy/agent-skills
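In case it helps picture the approach, a minimal sketch of that kind of softlink sync, assuming a `~/agent-skills` repo and Claude Code's default folder; the linked repo may do it differently.

```python
#!/usr/bin/env python3
"""Sketch: symlink every skill directory from a shared repo into an agent's skills folder."""
from pathlib import Path

REPO = Path.home() / "agent-skills"            # where the shared skills live (assumed)
TARGET = Path.home() / ".claude" / "skills"    # defaulting to Claude Code's folder

TARGET.mkdir(parents=True, exist_ok=True)
for skill_dir in sorted(p for p in REPO.iterdir() if p.is_dir()):
    link = TARGET / skill_dir.name
    if link.is_symlink() or link.exists():
        continue  # already linked, or a real directory is in the way
    link.symlink_to(skill_dir, target_is_directory=True)
    print(f"linked {link} -> {skill_dir}")
```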
That doesn't work very well if your developers are on Windows (and most are). Uneven Git support for symbolic links across platforms is going to end up causing more problems than it solves.
Why not hardlinks?
You can't hardlink a directory.
There are 14 competing standards.
The problem is that the de facto standard is `.claude`, which is problematic for folks not using Claude.
Your skill then just becomes an .md file containing
> any time you want to search for a skill in `./codex`, search instead in `./claude`
and continue as you were.
I see it similar to browser user-agents all claiming to be an ancient version of Mozilla or KHTML. We pick whatever works and then move on. It might not be "correct," but as long as our tools know what to do, who cares?
Now, there are 15 competing standards.
Soon...
Worse yet; opencode uses singular words by default. On the website[1] it says:
[1]: https://opencode.ai/docs/skills/#place-files
Does anyone find that agents just don't use them without being asked?
This has been a problem for us too. Sometimes they reach for skills, sometimes they don’t and just try to do the thing on their own. It’s annoying.
I think this is (mostly) a solvable problem. The current generation of SotA models wasn’t RLVR-trained on skills (they didn’t exist at that time) and probably gets slightly confused by the way the little descriptions are all packed into the same tool call schema. (At least that’s how it works with Claude Code.) The next generation will have likely been RLVRed on a lot of tasks where skills are available, and will use them much more reliably. Basically, wait until the next Opus release and you should hopefully see major improvements. (Of course, all this stuff is non-deterministic blah blah, but I think it’s reasonable to expect going from “misses the skill 30% of the time” to “misses it 2% of the time”.)
I think this is mostly a problem of making things skills that don't need to be skills (telling it how to do something it already knows how to do), and having way too much context, so that the skills effectively disappear. If skills are important, information about using skills needs to be a relatively large proportion of the context. Probably the right way to do it, is aggressively trimming anything that might distract from them.
That's also what Vercel found:
> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it. Adding the skill produced no improvement over baseline.
> …
> Skills aren't useless. The AGENTS.md approach provides broad, horizontal improvements to how agents work with Next.js across all tasks. Skills work better for vertical, action-specific workflows that users explicitly trigger,
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
Depends what you use perhaps. I use codex and it seems to mostly stick to instructions I give. I use an AGENTS.md that explicitly points to the repository's skill directory. I mostly keep instructions in there for obvious things like how to build, how to test, what to do before declaring a thing done, etc. I don't tend to have a lot of skills in there either.
Probably the more skills you have, the more confused it might get. The more potentially conflicting instructions you give the harder it gets for an LLM to figure out what you actually want to happen.
If I catch it going off script, I often interrupt it and tell it what to do and update the relevant skill. Seems to work pretty good. Keeping things simple seems to work.
Yep. I have an incredibly hard time getting them to use Skills at all, even when asked.
I saw someone's analysis a few days ago and they found that their agents were more accurate when just dumping the skill context directly into AGENTS.md
I often find they aren't triggered when I would expect, so I use a keyword and explicitly trigger them.
Same! If I put the skill's instructions in the general AGENTS.md, it works just fine.
Because "skills" are just .md files that the lossy compressing statistical output machine may or may not find and that may or may not be retained in the tiny context window
I don’t think you should be downvoted. Skills and history get added to the prompt, there’s no other interface to the model to do anything different. I think it’s smart to keep this in mind when working with LLMs. It’s like keeping in mind that a webserver just responds to HTTP requests when developing a web application. You need to keep perspective.
Edit: btw I’ve gone from genai value denier to skeptic to cautiously optimistic to fairly impressed in the span of a year. (I’m a user of Claude code)
Implementation Notes:
- There is no reason you have to expose the skills through the file system. It's just as easy to add a tool call to load a skill. Just put a skill ID in the instruction metadata, or have a `discover_skills` tool if you want to keep skills out of the instructions altogether.
- Another variation is to put a "skills selector" inference in front of your agent invocation. This inference would receive the current inquiry/transcript plus the skills metadata and return a list of potentially relevant skills. Same concept as tool selection; this can save context bandwidth when there are a large number of skills.
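A rough sketch of that "skills selector" pass, assuming a generic `complete()` callable for the small model and a made-up metadata shape (not any particular vendor's schema):

```python
import json

# Hypothetical skill metadata; only ids and descriptions are shown to the selector.
SKILLS = [
    {"id": "create-endpoint", "description": "Scaffold a new REST endpoint with tests and OpenAPI spec."},
    {"id": "release-notes", "description": "Draft release notes from merged PRs since the last tag."},
]


def select_skills(transcript: str, complete) -> list[str]:
    """Ask a small, fast model which skills look relevant; return their ids."""
    prompt = (
        "Given the conversation below and this list of skills, reply with a JSON "
        "array of skill ids worth loading (possibly empty).\n\n"
        f"Skills: {json.dumps(SKILLS)}\n\nConversation:\n{transcript}"
    )
    raw = complete(prompt)  # e.g. a cheap model behind your own client wrapper
    try:
        ids = json.loads(raw)
    except json.JSONDecodeError:
        ids = []
    return [s["id"] for s in SKILLS if s["id"] in ids]

# Only the selected skills' full bodies are then added to the main agent's context.
```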
The observation about agents not using skills without being explicitly asked resonates. In practice, I've found success treating skills as explicit "workflows" rather than background context.
The pattern that works: skills that represent complete, self-contained sequences - "do X, then Y, then Z, then verify" - with clear trigger conditions. The agent recognizes these as distinct modes of operation rather than optional reference material.
What doesn't work: skills as general guidelines or "best practices" documents. These get lost in context or ignored entirely because the agent has no clear signal for when to apply them.
The mental model shift: think of skills less like documentation and more like subroutines you'd explicitly invoke. If you wouldn't write a function for it, it probably shouldn't be a skill.
Better yet is a system which activates skills in certain situations. I use hooks for this with Claude, works great. The skill descriptions are "Do not activate unless instructed by guidance."
Example: A Python file is read or written, guidance is given back (once, with a long cooldown) to activate global and company-specific Python skills. Claude activates the skills and writes Python to our preference.
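The hook wiring itself is tool-specific, but the guidance side could be as simple as this sketch; the paths, cooldown, and skill names are hypothetical.

```python
#!/usr/bin/env python3
"""Sketch of a guidance script a file-event hook could call with the touched path."""
import sys
import time
from pathlib import Path

COOLDOWN_S = 4 * 3600                               # the "long cooldown" mentioned above
STAMP = Path("/tmp/python-skill-guidance.stamp")    # assumed location for the cooldown marker


def main(path: str) -> None:
    if not path.endswith(".py"):
        return  # only nudge on Python files
    if STAMP.exists() and time.time() - STAMP.stat().st_mtime < COOLDOWN_S:
        return  # already nudged recently
    STAMP.touch()
    print("Guidance: activate the 'python-global' and 'python-company' skills before editing Python.")


if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "")
```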
That does raise the question of what the value is of a "skill" vs a "command". Claude Code supports both, and it's not entirely clear to me when we should use one vs the other - especially if skills work best as, well, commands.
IMO the value and differentiating factor is basically just the ability to organize them cleanly with accompanying scripts and references, which are only loaded on demand. But a skill just by itself (without scripts or references) is essentially just a slash command with metadata.
Another value add is that theoretically agents should trigger skills automatically based on context and their current task. In practice, at least in my experience, that is not happening reliably.
The description "just" needs to be excruciatingly precise about when to use the skill, because the frontmatter is all the model will see in context.
But on the other hand, in Claude Code, at least, the skill "foo" is accessible as /foo, as the generalisation of the old commands/ directory, so I tend to favour being explicit that way.
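For reference, a minimal skill file along those lines; the name, description wording, and steps are invented, but the frontmatter-plus-markdown shape is the one the spec describes:

```markdown
---
name: create-endpoint
description: Use when the user asks to add a new REST endpoint to this service.
  Covers scaffolding the handler, updating the OpenAPI spec, and adding integration
  tests. Do not use for GraphQL or for changes to existing endpoints.
---

# Creating a new endpoint

1. Read `docs/api-conventions.md` before writing any code.
2. Scaffold the handler with `python scripts/scaffold.py <resource>`.
3. Update `openapi.yaml` and add an integration test under `tests/integration/`.
```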
I am working on a domain-specific agent that includes the concept of skills. I only allow one to be active at a time to reduce the chances of conflicting instructions. I use a small sub-agent to select/maintain/change the active skill at the start of each turn. It uses a small fast model to match the recent conversation to a skill (or none). I tried other approaches, but for my use case this has worked well.
My model for skills is similar to this, but I extended it to have explicit "use when" and "don't use when" examples and counter-examples. This helped the small model, which tended not to get the nuances of a free-form text description.
My unproven theory is that agent skills are just a good way to 'acquire' unspoken domain rules. A lot of things that developers do are just in their heads, and using 'skills' forces them to write these down. Then you feed this back to the LLM company for them to train on.
Pro tip: create README.md files in subfolders with helpful content that you might put in an AGENTS.md file (but, ya know, for humans too), and *link relevant skills there*. You don't even have to call them skills or use the skills format. It works for everything (including humans!).
I started playing with skills yesterday. I'm not sure if it's just easier for the LLM to call APIs from inside the skill — moving the heavier code behind an endpoint that the agent can call instead.
I have a feeling that otherwise it becomes too messy for agents to reliably handle a lot of complex stuff.
For example, I have OpenClaw automatically looking for trending papers, turning them into fun stories, and then sending me the text via Telegram so I can listen to it in the ElevenLabs app.
I'm not sure whether it's better to have the story-generating system behind an API or to code it as a skill — especially since OpenClaw already does a lot of other stuff for me.
They're basically all trade-offs between context-size/token-use and flexibility. If you can write a bash or a python script, or an api or an MCP to do what you want, then write a bash or python script to do it. You can even include it in the skill.
My general design principle for agents, is that the top level context (ie claude.md, etc) is primarily "information about information", a list of skills, mcps, etc, a very general overview, and a limited amount of information that they always need to have with every request. Everything more specific is in a skill, which is mostly some very light touch instructions for how to use various tools we have (scripts, apis and mcps).
I have found that people very often add _way_ too much information to claude.md's and skills. Claude knows a lot of stuff already! Keep your information to things specific to whatever you are working on that it doesn't already know. If your internal processes and house style are super complicated to explain to Claude and it keeps making mistakes, you might want to adapt to Claude instead of the other way around. Claude itself makes this mistake! If you ask it to build a CLAUDE.md, it'll often fill it with extraneous stuff that it already knows. You should regularly trim it.
I use a common README_AI.md file, and use CLAUDE.md and AGENTS.md to direct the agent to that common file. From README_AI.md, I make specific references to skills. This works pretty well - it's become pretty rare that the agent behaves in a way contrary to my instructions. More info on my approach here: https://www.appsoftware.com/blog/a-centralised-approach-to-a... ... There was a post on here a couple of days ago referring to a paper that said that the AGENTS file alone worked better than agent skills, but a single agents file doesn't scale. For me, a combination where I use a brief reference to the skill in the main agents file seems like the best approach.
I'm not disagreeing with standards but instead of creating adapters, can't we prompt the agent to create its own version of a skill using its preferred guidelines? I don't think machines care about standards in the way that humans do. If we maintain pure knowledge in markdown, the agents can extract what they need on demand.
A link from a couple weeks back suggests that putting them in first-person makes them get adopted reliably. Something like, "If this is available, I will read it," vs "Always read this." Haven't tried it myself, but plan to.
Started to work on a tool to synchronize all skills with symlinks.
It's OK for my needs at the moment, but feel free to improve it; it's on GitHub: https://github.com/Alpha-Coders/agent-loom
Please help me understand. Is a "skill" the prompt instructing the LLM how to do something? For example, I give it the "skill" of writing a fantasy story, by describing how the hero's journey works. Or I give it the "curl" skill by outputting curl's man page.
It's additional context that can be loaded by the agent as needed. Generally it decides to load based on the skill's description, or you can tell it to load a specific skill if you want to.
So for your example, yes, you might tell the agent "write a fantasy story" and you might have a "storytelling" skill that explains things like character arcs, tropes, etc. You might have a separate "fiction writing" skill that defines writing styles, editing, consistency, etc.
All of this stuff is just 'prompt management' tooling though and isn't super complicated. You could just paste the skill content into your context and go from there; this just provides a standardized spec for how to structure these on-demand context blocks.
LLM-powered agents are surprisingly human-like in their errors and misconceptions about less-than-ubiquitous or new tools. Skills are basically just small how-to files, sometimes combined with usage examples, helper scripts etc.
I think skills are probably a net positive for the general population, but for power users, I do recommend moving one meta layer up --
Whenever there's an agent best practice (skill) or 'pre-prompt' that you want to use all the time, turn it into a text expansion snippet so that it works no matter where you are.
As an example, I have a design 'pre-prompt' that dictates a bunch of steering for agents re: how to pick style components, typography, layout, etc. It's a few paragraphs long and I always send it alongside requests for design implementation to get way-better-than-average output.
I could turn it into a skill, but then I'd have to make sure whatever I'm using supported skills -- and install it every time or in a way that was universally seen on my system (no, symlinking doesn't really solve this).
So I use AutoHotkey (you might use Raycast, Espanso, etc) to config that every time I type '/dsn', it auto-expands into my pre-prompt snippet.
Now, no matter whether I'm using an agent on the web/cloud, in my terminal window, or in an IDE, I've memorized my most important 'pre-prompts' and they're a few seconds away.
It's anti-fragile steering by design. Call it universal skill injection.
Interesting format, but skills feel like optimizing the wrong layer. The agents usually don't fail because of bad instructions — they fail because external systems treat them like bots.
You can have the perfect scraping skill, but if the target blocks your requests, you're stuck. The hard problems are downstream.
Experimenting with skills over the last few months has completely changed the way I think about using LLMs. It's not so much that it's a really important technology or super brilliant, but I have gone from thinking of LLMs and agents as a _feature_ of what we are building and thinking of them as a _user_ of what we are building.
I have been trying to build skills to do various things on our internal tools, and more often than not, when it doesn't work, it is as much a problem with _our tools_ as it is with the LLM. You can't do obvious things, the documentation sucks, APIs return opaque error messages. These are problems that humans can work around because of tribal knowledge, but LLMs absolutely cannot, and fixing it for LLMs also improves it for your human users, who probably have been quietly dealing with friction and bullshit without complaining -- or not dealing with it and going elsewhere.
If you are building a product today, the feature you are working on _is not done_ until Claude Code can use it. A skill and an MCP isn't a "nice to have", it is going to be as important as SEO and accessibility, with extremely similar work to do to enable it.
Your product might as well not exist in a few years if it isn't discoverable by agents and usable by agents.
> If you are building a product today, the feature you are working on _is not done_ until Claude Code can use it. A skill and an MCP isn't a "nice to have", it is going to be as important as SEO and accessibility, with extremely similar work to do to enable it.
> Your product might as well not exist in a few years if it isn't discoverable by agents and usable by agents.
This is an interesting take. I admit I've never thought this way.
Yeah, omnipresent LLMs are a kind of forcing function for addressing typical significant underinvestment in (human-readable) docs. That said, I'm not entirely sold on MCP per se.
One good thing Vercel did was indexing skills.md files under a site, skills.sh - and yes, there are now 100s of these sites, but I like the speedy/lite approach of Vercel's DX, despite me not liking Vercel a whole lot.
I don't like Vercel's design; it's just a huge list of abstract skill names and you have to click on every one to even have a clue what something does. Such a bad design IMHO.
The design of https://www.skillcreator.ai/explore is more useful for me. At least I can search by category, framework, and language, and I also see much more information about what a skill does at a glance. I don't know why Vercel really wanted to do it completely black and white - colors, used and done with taste, give useful context and information.
Is it just me, or do skills seem enormously similar to MCP?
…including, apparently, the clueless enthusiasm for people to “share” skills.
MCP is also perfectly fine when you run your own MCP locally. It’s bad when you install some arbitrary MCP from some random person. It fails when you have too many installed.
Same for skills.
It’s only a matter of time (maybe it already exists?) until someone makes a “package manager” for skills that has all of the stupid of MCP.
I don’t feel they’re similar at all and I don’t get why people compare them.
MCP is giving the agents a bunch of functions/tools it can use to interact with some other piece of infrastructure or technology through abstraction. More like a toolbox full of screwdrivers and hammers for different purposes, or a high-level API interface that a program can use.
Skills are more similar to a stack of manuals/books in a library that teach an agent how to do something, without polluting the main context. For example, a guide on how to use `git` on the CLI: the agent can read the manual when it needs to use `git`, but it doesn't need to have the knowledge of how to use `git` in its brain when it's not relevant.
> MCP is giving the agents a bunch of functions/tools
A directory of skills... same thing
You can use MCP the same way as skills with a different interface. There are no rules on what goes into them.
They both need descriptions and instructions around them, and they both have to be presented and indexed to the agent dynamically, so we can tell it what it has access to without polluting the context.
See the Anthropic post on moving MCP servers to a search function. Once you have enough skills, you are going to require the same optimization.
I separate things in a different way
1. What things do I force into context (agents.md, "tools" index, files)
2. What things can the agent discover (MCP, skills, search)
It is conceptually different. Skills were created to address the context rot problem. You pull the right skill from the deck when you hit a challenge, figuring out the best skill just by reading its title and description.
Sure, but in an MCP server the endpoints provide a description of how to use the resource. I guess a text file is nice too but it seems like a stepping stone to what will eventually be necessary.
It's hilarious that after all those years of resistance to technical writing and formal specification, engineers and programmers have suddenly been reduced to nothing more than technical writers and specification designers. Funny that I somehow don't foresee technical writing pay bumps happening as a consequence of this sudden surge in importance.
You just don't know which parts of the doc are real and which are hallucinated. Maybe the prompter checked everything and the content is actually good, but sadly many don't and there is a lot of slop flying around.
So if you want others to read the output you'll have to de-slopify it, ideally make it shorter and remove the tells.
If I go by good faith these days and trust that someone didn't just upload llm hallucinated bullshit, I'd sadly just be reading slop all day and not learning anything or worse even get deceived by hallucinations and make wrong assumptions. It's just a waste of someones precious life time.
LLMs can read through slop all day, humans can not without getting extremely annoyed.
Sounds like a bunch of bullshit to me. A simple markdown file with whatever and a directory will do the same. This is just packaging, selling and marketing.
This stuff smells like maybe the bitter lesson isn't fully appreciated.
You might as well just write instructions in English in any old format, as long as it's comprehensible. Exactly as you'd do for human readers! Nothing has really changed about what constitutes good documentation. (Edit to add: my parochialism is showing there, it doesn't have to be English)
Is any of this standardization really needed? Who does it benefit, except the people who enjoy writing specs and establishing standards like this? If it really is a productivity win, it ought to be possible to run a comparison study and prove it. Even then, it might not be worthwhile in the longer run.
The instructions are standard documents - but this is not all. What the system adds is an index of all skills, built from their descriptions, that is passed to the llm in each conversation. The idea is to let the llm read the skill when it is needed and not load it into context upfront. Humans use indexes too - but not in this way. But there are some analogies with GUIs and how they enhance discoverability of features for humans.
I wish they arranged it around READMEs. I have a directory with my tasks and I have a README.md there - before codex had skills it already understood that it needs to read the readme when it was dealing with tasks. The skills system is less directory dependent so is a bit more universal - but I am not sure if this is really needed.
Humans use indexes too - but not in this way.
What's different?
I have been using Claude Code to automate a bunch of my business tasks, and I set up slash commands for each of them. Each slash command starts by reading from a .md file of instructions. I asked Claude how this is different from skills and the only substantive thing it could come up with was that Claude wouldn't be able to use these on its own, without me invoking the slash command (which is fine; I wouldn't want it to go off and start checking my inventory of its own volition).
So yeah, I agree that it's all just documentation. I know there's been some evidence shown that skills work better, but my feeling is that in the long run it'll fall to the wayside, like prompt engineering, for a couple of reasons. First, many skills will just become unnecessary - models will be able to make slide decks or do frontend design without specific skills (Gemini's already excellent at design without anything beyond the base model, imho). Second, increased context windows and overall intelligence will obviate the need for the specific skills paradigm. You can just throw all the stuff you want Claude to know in your claude.md and call it a day.
Claude Code recently deprecated slash commands in favor of skills because they were so similar. Or another way of looking at it is, they added the ability to invoke a skill via /skill-name.
So how is this slash command limit enforced? Is it part of the Claude API/PostTraining etc? It seems like a useful tool if it is!
I'd like a user writeable, LLM readable, LLM non-writable character/sequence. That would make it a lot easier to know at a glance that a command/file/directory/username/password wasn't going to end up in context and being used by a rogue agent.
It wouldn't be fool proof, since it could probably find some other tool out there to generate it (eg write-me some unicode python), but it's something I haven't heard of that sounds useful. If it could be made fool/tool proof (fools and tools are so resourceful) that would be even better.
It's part of the Claude Code harness. I honestly haven't thought at all about security related to it; it's just a nice convenience to trigger a commonly run process.
Folks have run comparisons. From a huggingface employee:
https://xcancel.com/ben_burtenshaw/status/200023306951767675...That said, it's not a perfect comparison because of the Codex model mismatch between runs.
The author seems to be doing a lot of work on skills evaluation.
https://github.com/huggingface/upskill
I can't quite tell what's being compared there -- just looks like several different LLMs?
To be clear, I'm suggesting that any specific format for "skills.md" is a red herring, and all you need to do is provide the LLM with good clear documentation.
A useful comparison would be between: a) make a carefully organised .skills/ folder, b) put the same info anywhere and just link to it from your top-level doc, c) just dump everything directly in the top-level doc.
My guess is that it's probably a good idea to break stuff out into separate sections, to avoid polluting the context with stuff you don't need; but the specific way you do that very likely isn't important at all. So (a) and (b) would perform about the same.
Your skepticism is valid. Vercel ran a study where they said that skills underperform putting a docs index in AGENTS.md[0].
My guess is that the standardization is going to make its way into how the models are trained and Skills are eventually going to pull out ahead.
0: https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
I think the point is it smells like a hack, just like "think extra hard and I'll tip you $200" was a few years ago. It increases benchmarks a few points now but what's the point in standardizing all this if it'll be obsolete next year?
Standards have to start somewhere to gain traction and proliferate themselves for longer than that.
Plus, as has been mentioned multiple times here, standard skills are a lot more about different harnesses being able to consistently load skills into the context window in a programmatic way. Not every AI workload is a local coding agent.
I think this tweet sums it correctly doesn't?
Which is essentially the bitter lesson that Richard Sutton talks about?Does this indicate running locally with a very small (quantized?) model?
I am very interested in finding ways to combine skills + local models + MCP + aider-ish tools to avoid using commercial LLM providers.
Is this a path to follow? Or, something different?
Check out the guy's work. He's doing a lot of work on precisely what you're talking about.
https://xcancel.com/ben_burtenshaw
https://huggingface.co/blog/upskill
https://github.com/huggingface/upskill
Sounds like the benchmark matrix just got a lot bigger, model * skill combinations.
Skills are not just documentation. They include computability (programs/scripts), data (assets), and the documentation (resources) to use everything effectively.
Programs and data are the basis of deterministic results that are accessible to the llm.
Embedding an sqlite database with interesting information (bus schedules, dietary info, or a thousand other things) and a python program run by the skill can access it.
For Claude at least, it does it in a VM and can be used from your phone.
Sure, skills are more convention than a standard right now. Skills lack versioning, distribution, updates, unique naming, selective network access. But they are incredibly useful and accessible.
Am I missing something because what you describe as the pack of stuff sounds like S tier documentation. I get full working examples and a pre-populated database it works on?
It's all about managing context. The bitter lesson applies over the long haul - and yes, over the long haul, as context windows get larger or go away entirely with different architectures, this sort of thing won't be needed. But we've defined enough skills in the last month or two that if we were to put them all in CLAUDE.md, we wouldn't have any context left for coding. I can only imagine that this will be a temporary standard, but given the current state of the art, it's a helpful one.
I use Claude pretty extensively on a 2.5m loc codebase, and it's pretty decent at just reading the relevant readme docs & docstrings to figure out what's what. Those docs were written for human audiences years (sometimes decades) ago.
I'm very curious to know the size & state of a codebase where skills are beneficial over just having good information hierarchy for your documentation.
Why not replace the context tokens on the GPU during inference when they become no longer relevant? i.e. some tool reads a 50k token document, LLM processes it, so then just flush those document tokens out of active context, rebuild QKV caches and store just some log entry in the context as "I already did this ... with this result"?
Anthropic added features like this into 4.5 release:
https://claude.com/blog/context-management
> Context editing automatically clears stale tool calls and results from within the context window when approaching token limits.
> The memory tool enables Claude to store and consult information outside the context window through a file-based system.
But it looks like nobody has it as a part of an inference loop yet: I guess it's hard to train (i.e. you need a training set which is a good match for what people use context in practice) and make inference more complicated. I guess more high-level context management is just easier to implement - and it's one of things which "GPT wrapper" companies can do, so why bother?
This is what agent calls do under the hood, yes.
I don't think so, those things happen when agent yields the control back at the end of its inference call, not during the active agent inference with multiple tool calls ongoing. These days an agent can finish the whole task with 1000s tool calls during a single inference call without yielding control back to whatever called it to do some bookkeeping.
how is it different or better than maintaining an index page for your docs? Or a folder full of docs and giving Claude an instruction to `ls` the folder on startup?
Vercel think it isn’t:
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
It's hard to tell unless they give some hard data comparing the approaches systematically.. this feels like a grift or more charitably trying to build a presence/market around nothing. But who knows anymore, apparently saying "tell the agent to write it's own docs for reference and context continuity" is considered a revelation.
Not sure why you’re being downvoted so much, it’s a valid point.
It’s also related to attention — invoking a skill “now” means that the model has all the relevant information fresh in context, you’ll have much better results.
What I’m doing myself is write skills that invoke Python scripts that “inject” prompts. This way you can set up multi-turn workflows for eg codebase analysis, deep thinking, root cause analysis, etc.
Works very well.
> Is any of this standardization really needed?
This standardization, basically, makes a list of docs easier to scan.
As a human, you have a permanent memory. LLMs don't have it, they have to load it into the context, and doing it only as necessary can help.
E.g. if you had anterograde amnesia, you'd want everything to be optimally organized, labeled, etc, right? Perhaps an app which keeps all information handy.
It's not about instructions, it's about discoverability and data.
Yeah, WWW is really just text but that doesn't mean you don't need HTTP + HTML and a browser/search engine. Skills is just that, but for agent capabilities.
Long term you're right though, agents will fetch this all themselves. And at some point they will not be our agents at all.
I guess what I mean is that standardizing this bit of the problem right now feels sort of like XHTML. Many people thought that was a big deal back in the day, but it turned out to be a pointless digression.
You are right about it's just natural language but Standarization is very improtant, because it's never just about the model itself, the so called Harness is a big factor on LLM performance and standarization allows all harness to index all skills.
This is pushed by Antropic, OpenAI doesn't seem to care much about "skills". Maybe Anthropic is doing some extra training to better follow sections of text marked as skill, who knows? Or you can just store what worked as a skill and share with others without any need to do their own prompt for common tasks?
OpenAI has already adopted Agent Skills:
- https://community.openai.com/t/skills-for-codex-experimental...
- https://developers.openai.com/codex/skills/
- https://github.com/openai/skills
- https://x.com/embirico/status/2018415923930206718
Post training can make known formats more reliable.
Skills are for the most part already generated by LLMs. And, if you're implementing them in your own workflow, they're tailored to real-world problems you've encountered.
Having a super repo of everyone else's slop is backwards thinking; you are now in the era where creating written content and verifying it's effectiveness is easier than ever.
Our team has found success in treating skills more like re-usable semi-deterministic functions and less like fingers-crossed prompts for random edge-cases.
For example, we have a skill to /create-new-endpoint. The skill contains a detailed checklist of all the boilerplate tasks that an engineer needs to do in addition to implementing the logic (e.g. update OpenAPI spec, add integration tests, endpoint boilerplate, etc.). The engineer manually invokes the skill from the CLI via slash commands, provides a JIRA ticket number, and engages in some brief design discussion. The LLM is consistently able to one-shot these tickets in a way that matches our existing application architecture.
How do you test these skills for consistency over time, or is that not needed?
The same way you'd test a human following written instructions over time.
Check the results.
Please standardize the folder.
I find that even though this isn't standard, that these -cli tools will scan the repo for .md files and for the most part execute the skills accordingly. Having said that, I would much prefer standards not just for this, but for plugins as well.
Standards for plugins makes sense, because you're establishing a protocol that both sides need to follow to be able to work together.
But I don't see why you need a strict standard for "an informal description of how to do a particular task". I say "informal" because it's necessarily written in prose -- if it were formal, it'd be a shell script.
This is happening as we speak.
Codex started this and OpenCode followed suit with the hour.
https://x.com/embirico/status/2018415923930206718
“Proposal: include a standard folder where agent skills should be“
https://github.com/agentskills/agentskills/issues/15
That is being discussed in https://github.com/agentskills/agentskills/issues/15.
I mean, it'd be good if these tools followed the xdg base spec and put their config in `~/.config/claude` e.t.c instead of `~/.claude`.
It's one of my biggest pet peeves with a lot of these tools (now admittedly a lot of them have a config env var to override, but it'd be nice if they just did the right thing automatically).
.agent/
Skills seem a bit early to standardize. We are so early in this, why do we want to handcuff our creativity so soon?
Skills are a really simple concept. They're just custom prompts with a name and some metadata. What are you afraid of handcuffing?
Just the decision of whether to allow models to invoke them has [1][2][3] different ways.
[1]: https://code.claude.com/docs/en/skills#control-who-invokes-a... [2]: https://opencode.ai/docs/skills/#disable-the-skill-tool [3]: https://developers.openai.com/codex/skills/#enable-or-disabl...
All the more reason to standardise it
Eventually, you can standardize what you don't understand
The problem I see now is that everyone wants to be the winner in a hype cycle and be the standards bringer. How many "standards" have we seen put out now? No one talks about MCP much anymore, langchain I haven't seen in more than a year, will we be talking about Skills in another year?
They are more than that, for example the frontmatter and code files around them. The spec: https://agentskills.io/specification
Why do I want to throw away my dependency management system and shared libraries folder for putting scripts in skills?
What tools do they have access to, can I define this so it's dynamic? Do skills even have a concept for sub tools or sub agents? Why do I want to put references in a folder instead of a search engine? Does frontmatter even make sense, why not something closer to a package.json in a file next to it?
Does it even make sense to have skills in the repo? How do I use them across projects? How do we build an ecosystem and dependency management system for skills (which are themselves versioned)
> They are more than that, for example the frontmatter and code files around them.
You are right. I have edited my post slightly.
> Why do I want to throw away my dependency management system and shared libraries folder for putting scripts in skills?
You don't have to put scripts in skills. The script can be anywhere the agent can access. The skill just needs to tell the LLM how to run it.
> Does it even make sense to have skills in the repo? How do I use them across projects?
You don't have to put them in the repo. E.g. with Claude Code you can put project-specific skills in `.claude/skills` in the repo and system-wide skills in `~/.claude/skills`.
2. The spec / docs show people how to put code in a subdir. While you can reference external scripts, there is a blessed pattern that seems like an anti-pattern to me
3. generalize: how do I store, maintain, and distribute skills shared by employees who work on multiple repos. Sounds like standard dependency management to me. Does to some of the people building collections / registries. Not sure if any of them account for versioning, have not seen anything tied to lock files (though I'd avoid that by using MVS for dep selection)
Agreed. I think being overly formal about what can be in the frontmatter would be a mistake, but the beauty of doing this with an LLM is that you can pretty much emulate skills in any agent by telling it to start by reading the frontmatter of each skills file and use that to decide when to read the rest, so given that as a fallback, it's hardly imposing some massive burden to standardise it a bit.
might be too early to standardize
standards are good but they slow development and experimentation
ln -s to the rescue!
The root cause should be fixed.
It's why I wrapped my tiny skills repo with a script that softlink them into whichever is your skills folder, defaulting to Claude, but could be any other.
I treat my skills the same as I would write tiny bash scripts and fish functions in the days gone to simplify my life by writing 2 words instead of 2 sentences. Tiny improvement that only makes sense for a programmer at heart.
[1] https://github.com/flurdy/agent-skills
That doesn't work very well if your developers are on Windows (and most are). Uneven Git support for symbolic links across platforms is going to end up causing more problems than it solves.
Why not hardlinks?
You can't hardlink a directory.
There are 14 competing standards.
The problem is that the de facto standard is `.claude`, which is problematic for folks not using Claude.
Your skill then just becomes an .md file containing
>any time you want to search for a skill in `./codex`, search instead in `./claude`
and continue as you were.
I see it similar to browser user-agents all claiming to be an ancient version of Mozilla or KHTML. We pick whatever works and then move on. It might not be "correct," but as long as our tools know what to do, who cares?
Now, there are 15 competing standards.
Soon...
Worse yet; opencode uses singular words by default:
On the website[1] it says:
[1]: https://opencode.ai/docs/skills/#place-filesDoes anyone find that agents just don't use them without being asked?
This has been a problem for us too. Sometimes they reach for skills, sometimes they don’t and just try to do the thing on their own. It’s annoying.
I think this is (mostly) a solvable problem. The current generation of SotA models wasn’t RLVR-trained on skills (they didn’t exist at that time) and probably gets slightly confused by the way the little descriptions are all packed into the same tool call schema. (At least that’s how it works with Claude Code.) The next generation will have likely been RLVRed on a lot of tasks where skills are available, and will use them much more reliably. Basically, wait until the next Opus release and you should hopefully see major improvements. (Of course, all this stuff is non-deterministic blah blah, but I think it’s reasonable to expect going from “misses the skill 30% of the time” to “misses it 2% of the time”.)
I think this is mostly a problem of making things skills that don't need to be skills (telling it how to do something it already knows how to do), and having way too much context, so that the skills effectively disappear. If skills are important, information about using skills needs to be a relatively large proportion of the context. Probably the right way to do it, is aggressively trimming anything that might distract from them.
That's also what Vercel found:
> In 56% of eval cases, the skill was never invoked. The agent had access to the documentation but didn't use it. Adding the skill produced no improvement over baseline.
> …
> Skills aren't useless. The AGENTS.md approach provides broad, horizontal improvements to how agents work with Next.js across all tasks. Skills work better for vertical, action-specific workflows that users explicitly trigger,
https://vercel.com/blog/agents-md-outperforms-skills-in-our-...
Depends what you use perhaps. I use codex and it seems to mostly stick to instructions I give. I use an AGENTS.md that explicitly points to the repository's skill directory. I mostly keep instructions in there for obvious things like how to build, how to test, what to do before declaring a thing done, etc. I don't tend to have a lot of skills in there either.
Probably the more skills you have, the more confused it might get. The more potentially conflicting instructions you give the harder it gets for an LLM to figure out what you actually want to happen.
If I catch it going off script, I often interrupt it, tell it what to do, and update the relevant skill. Seems to work pretty well. Keeping things simple seems to work.
Yep. I have an incredibly hard time getting them to use Skills at all, even when asked.
I saw someone's analysis a few days ago and they found that their agents were more accurate when just dumping the skill context directly into AGENTS.md
I often find they aren't triggered when I would expect, so I use a keyword and trigger them explicitly.
Same! If I put the skill's instructions in the general AGENTS.md, it works just fine.
Because "skills" are just .md files that the lossy compressing statistical output machine may or may not find and that may or may not be retained in the tiny context window
I don't think you should be downvoted. Skills and history get added to the prompt; there's no other interface to the model to do anything different. I think it's smart to keep this in mind when working with LLMs. It's like keeping in mind that a webserver just responds to HTTP requests when developing a web application. You need to keep perspective.
Edit: btw I’ve gone from genai value denier to skeptic to cautiously optimistic to fairly impressed in the span of a year. (I’m a user of Claude code)
Implementation Notes:
- There is no reason you have to expose the skills through the file system. It's just as easy to add a tool call to load a skill: put a skill ID in the instruction metadata, or have a `discover_skills` tool if you want to keep skills out of the instructions altogether.
- Another variation is to put a "skills selector" inference in front of your agent invocation. This inference receives the current inquiry/transcript plus the skills metadata and returns a list of potentially relevant skills. Same concept as tool selection; it can save context bandwidth when there are a large number of skills.
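A minimal sketch of that "skills selector" pre-inference, assuming nothing about any particular framework: `complete` is a stand-in for whatever single-shot call you have to a small, fast model, and the `SkillMeta` shape is purely illustrative.

```python
from dataclasses import dataclass

@dataclass
class SkillMeta:
    skill_id: str
    description: str  # the same short blurb you'd put in the skill's frontmatter

def select_skills(transcript: str, skills: list[SkillMeta], complete) -> list[str]:
    """Ask a small, fast model which skills look relevant to the current turn.

    `complete` is a placeholder: it takes a prompt string and returns model text.
    """
    catalog = "\n".join(f"- {s.skill_id}: {s.description}" for s in skills)
    prompt = (
        "Given the skill catalog and the conversation below, return a "
        "comma-separated list of skill IDs worth loading, or NONE.\n\n"
        f"Skills:\n{catalog}\n\nConversation:\n{transcript}\n\nRelevant skill IDs:"
    )
    answer = complete(prompt).strip()
    if answer.upper() == "NONE":
        return []
    known = {s.skill_id for s in skills}
    return [sid.strip() for sid in answer.split(",") if sid.strip() in known]
```

The main agent then only sees the bodies of the selected skills, which is where the context-bandwidth saving comes from.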
The observation about agents not using skills without being explicitly asked resonates. In practice, I've found success treating skills as explicit "workflows" rather than background context.
The pattern that works: skills that represent complete, self-contained sequences - "do X, then Y, then Z, then verify" - with clear trigger conditions. The agent recognizes these as distinct modes of operation rather than optional reference material.
What doesn't work: skills as general guidelines or "best practices" documents. These get lost in context or ignored entirely because the agent has no clear signal for when to apply them.
The mental model shift: think of skills less like documentation and more like subroutines you'd explicitly invoke. If you wouldn't write a function for it, it probably shouldn't be a skill.
Better yet is a system which activates skills in certain situations. I use hooks for this with Claude, works great. The skill descriptions are "Do not activate unless instructed by guidance."
Example: A Python file is read or written, guidance is given back (once, with a long cooldown) to activate global and company-specific Python skills. Claude activates the skills and writes Python to our preference.
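As a rough illustration, a hook script along those lines could look like the following. The payload field names and the idea that stdout is surfaced back to the agent as guidance are assumptions to check against the Claude Code hooks documentation, and the cooldown marker file is hypothetical.

```python
#!/usr/bin/env python3
"""Hook sketch: nudge the agent to activate Python skills after a .py file is touched."""
import json
import sys
import time
from pathlib import Path

COOLDOWN_SECONDS = 30 * 60                        # remind at most once per half hour
STAMP = Path("/tmp/python-skill-guidance.stamp")  # hypothetical cooldown marker

payload = json.load(sys.stdin)                    # assumed: hook gets tool info as JSON on stdin
path = str(payload.get("tool_input", {}).get("file_path", ""))

if path.endswith(".py"):
    last = STAMP.stat().st_mtime if STAMP.exists() else 0.0
    if time.time() - last > COOLDOWN_SECONDS:
        STAMP.touch()
        # assumed: stdout is passed back to the agent as guidance
        print("Guidance: activate the global and company-specific Python skills "
              "before continuing with this change.")
```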
That does raise the question of what the value is of a "skill" vs a "command". Claude Code supports both, and it's not entirely clear to me when we should use one vs the other - especially if skills work best as, well, commands.
IMO the value and differentiating factor is basically just the ability to organize them cleanly with accompanying scripts and references, which are only loaded on demand. But a skill just by itself (without scripts or references) is essentially just a slash command with metadata.
Another value add is that theoretically agents should trigger skills automatically based on context and their current task. In practice, at least in my experience, that is not happening reliably.
Reminds me of my personal Obsidian notes, CLI commands for tasks I need just rarely enough to forget, with explanations for future me.
The description "just" needs to be excruciatingly precise about when to use the skill, because the frontmatter is all the model will see in context.
But on the other hand, in Claude Code, at least, the skill "foo" is accessible as /foo, as the generalisation of the old commands/ directory, so I tend to favour being explicit that way.
I am working on a domain-specific agent that includes the concept of skills. I only allow one to be active at a time to reduce the chances of conflicting instructions. I use a small sub-agent to select/maintain/change the active skill at the start of each turn. It uses a small, fast model to match the recent conversation to a skill (or none). I tried other approaches, but for my use case this worked well.
My model for skills is similar to this, but I extended it to have explicit "use when" and "don't use when" examples and counterexamples. This helped the small model, which tended not to get the nuances of a free-form text description.
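For illustration, the metadata shape and prompt rendering that describes might look like the following; the `SkillCard` fields and the example skill are hypothetical, just to show where the examples and counterexamples go.

```python
from dataclasses import dataclass, field

@dataclass
class SkillCard:
    skill_id: str
    description: str
    use_when: list[str] = field(default_factory=list)       # positive examples
    dont_use_when: list[str] = field(default_factory=list)  # counterexamples

def render_catalog_entry(card: SkillCard) -> str:
    """Render one skill for the selector prompt, examples included."""
    lines = [f"- {card.skill_id}: {card.description}"]
    lines += [f"    use when: {ex}" for ex in card.use_when]
    lines += [f"    do NOT use when: {ex}" for ex in card.dont_use_when]
    return "\n".join(lines)

# Hypothetical entry, just to show the shape:
release_notes = SkillCard(
    skill_id="release-notes",
    description="Draft release notes from merged PRs since the last tag.",
    use_when=["user asks to 'write the changelog' or 'prep the release'"],
    dont_use_when=["user only wants a summary of a single PR"],
)
print(render_catalog_entry(release_notes))
```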
You should consider calling these "behaviors" to mimic behavior trees in game / robot AI. They follow the same notion of a single behavior being active at once: https://en.wikipedia.org/wiki/Behavior_tree_(artificial_inte...
My unproven theory is that agent skills are just a good way to 'acquire' unspoken domain rules. A lot of things that developers do are just in their heads, and using 'skills' forces them to write these down. Then you feed this back to the LLM company for them to train on.
Pro tip: create README.md files in subfolders with helpful content that you might put in an AGENTS.md file (but, ya know, for humans too), and *link relevant skills there*. You don't even have to call them skills or use the skills format. It works for everything (including humans!).
I wrote a rant about skills a while ago that's still relevant in some ways: https://sibylline.dev/articles/2025-10-20-claude-skills-cons...
I started playing with skills yesterday. I'm not sure if it's just easier for the LLM to call APIs from inside the skill, moving the heavier code behind an endpoint that the agent can call instead.
I have a feeling that otherwise it becomes too messy for agents to reliably handle a lot of complex stuff.
For example, I have OpenClaw automatically looking for trending papers, turning them into fun stories, and then sending me the text via Telegram so I can listen to it in the ElevenLabs app.
I'm not sure whether it's better to have the story-generating system behind an API or to code it as a skill — especially since OpenClaw already does a lot of other stuff for me.
Are you spending a fortune on running OpenClaw?
It's free with Qwen OAuth.
They're basically all trade-offs between context size/token use and flexibility. If you can write a bash or Python script, or an API or an MCP, to do what you want, then write a bash or Python script to do it. You can even include it in the skill.
My general design principle for agents is that the top-level context (i.e. claude.md, etc.) is primarily "information about information": a list of skills, MCPs, etc., a very general overview, and a limited amount of information that they always need to have with every request. Everything more specific is in a skill, which is mostly some very light-touch instructions for how to use the various tools we have (scripts, APIs, and MCPs).
I have found that people very often add _way_ too much information to claude.md's and skills. Claude knows a lot of stuff already! Keep your information to things specific to whatever you are working on that it doesn't already know. If your internal processes and house style are super complicated to explain to Claude and it keeps making mistakes, you might want to adapt to Claude instead of the other way around. Claude itself makes this mistake! If you ask it to build a claude.md, it'll often fill it with extraneous stuff that it already knows. You should regularly trim it.
Thanks, super useful!
I use a common README_AI.md file, and use CLAUDE.md and AGENTS.md to direct the agent to that common file. From README_AI.md, I make specific references to skills. This works pretty well - it's become pretty rare that the agent behaves in a way contrary to my instructions. More info on my approach here: https://www.appsoftware.com/blog/a-centralised-approach-to-a... ... There was a post on here a couple of days ago referring to a paper that said that the AGENTS file alone worked better than agent skills, but a single agents file doesn't scale. For me, a combination where I use a brief reference to the skill in the main agents file seems like the best approach.
I'm not disagreeing with standards but instead of creating adapters, can't we prompt the agent to create its own version of a skill using its preferred guidelines? I don't think machines care about standards in the way that humans do. If we maintain pure knowledge in markdown, the agents can extract what they need on demand.
A link from a couple weeks back suggests that putting them in first-person makes them get adopted reliably. Something like, "If this is available, I will read it," vs "Always read this." Haven't tried it myself, but plan to.
Started to work on a tool to synchronize all skills with symlinks. It's OK for my needs at the moment, but feel free to improve it; it's on GitHub: https://github.com/Alpha-Coders/agent-loom
Please help me understand. Is a "skill" the prompt instructing the LLM how to do something? For example, I give it the "skill" of writing a fantasy story, by describing how the hero's journey works. Or I give it the "curl" skill by outputting curl's man page.
It's additional context that can be loaded by the agent as needed. Generally it decides to load it based on the skill's description, or you can tell it to load a specific skill if you want to.
So for your example, yes, you might tell the agent "write a fantasy story" and you might have a "storytelling" skill that explains things like character arcs, tropes, etc. You might have a separate "fiction writing" skill that defines writing styles, editing, consistency, etc.
All of this stuff is just prompt-management tooling though, and isn't super complicated. You could just paste the skill content into your context and go from there; this just provides a standardized spec for how to structure these on-demand context blocks.
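A minimal sketch of that do-it-yourself approach, assuming SKILL.md files with `name` and `description` in YAML frontmatter under a `.claude/skills/`-style directory (the paths and field names may differ in your setup):

```python
from pathlib import Path

import yaml  # pip install pyyaml

def read_frontmatter(skill_file: Path) -> dict:
    """Parse the YAML block between the leading '---' fences of a SKILL.md."""
    text = skill_file.read_text(encoding="utf-8")
    if not text.startswith("---"):
        return {}
    _, frontmatter, _body = text.split("---", 2)
    return yaml.safe_load(frontmatter) or {}

def build_index(skills_root: Path) -> str:
    """One line per skill: enough for the model to decide what to load later."""
    lines = []
    for skill_file in sorted(skills_root.glob("*/SKILL.md")):
        meta = read_frontmatter(skill_file)
        name = meta.get("name", skill_file.parent.name)
        desc = meta.get("description", "(no description)")
        lines.append(f"- {name}: {desc}  [{skill_file}]")
    return "\n".join(lines)

# The index goes into the system prompt; a skill's full body is only read
# when the agent actually decides it needs that skill.
print(build_index(Path(".claude/skills")))
```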
Yes, pretty much.
LLM-powered agents are surprisingly human-like in their errors and misconceptions about less-than-ubiquitous or new tools. Skills are basically just small how-to files, sometimes combined with usage examples, helper scripts etc.
I think skills are probably a net positive for the general population, but for power users, I do recommend moving one meta layer up --
Whenever there's an agent best practice (skill) or 'pre-prompt' that you want to use all the time, turn it into a text expansion snippet so that it works no matter where you are.
As an example, I have a design 'pre-prompt' that dictates a bunch of steering for agents re: how to pick style components, typography, layout, etc. It's a few paragraphs long and I always send it alongside requests for design implementation to get way-better-than-average output.
I could turn it into a skill, but then I'd have to make sure whatever I'm using supported skills -- and install it every time or in a way that was universally seen on my system (no, symlinking doesn't really solve this).
So I use AutoHotkey (you might use Raycast, Espanso, etc) to config that every time I type '/dsn', it auto-expands into my pre-prompt snippet.
Now, no matter whether I'm using an agent on the web/cloud, in my terminal window, or in an IDE, I've memorized my most important 'pre-prompts' and they're a few seconds away.
It's anti-fragile steering by design. Call it universal skill injection.
I realized that amp uses ~/.agents/skills
I liked the idea of having something more CLI-agnostic.
Is there a skill directory that can be browsed by a human?
All these sites just look exactly like the Claude Code skills docs.
Interesting format, but skills feel like optimizing the wrong layer. The agents usually don't fail because of bad instructions — they fail because external systems treat them like bots.
You can have the perfect scraping skill, but if the target blocks your requests, you're stuck. The hard problems are downstream.
If you want to browse, search, and download AI agent skills, use openskills.space
Experimenting with skills over the last few months has completely changed the way I think about using LLMs. It's not so much that it's a really important technology or super brilliant, but I have gone from thinking of LLMs and agents as a _feature_ of what we are building to thinking of them as a _user_ of what we are building.
I have been trying to build skills to do various things on our internal tools, and more often than not, when it doesn't work, it is as much a problem with _our tools_ as it is with the LLM. You can't do obvious things, the documentation sucks, APIs return opaque error messages. These are problems that humans can work around because of tribal knowledge, but LLMs absolutely cannot, and fixing it for LLMs also improves it for your human users, who have probably been quietly dealing with friction and bullshit without complaining -- or not dealing with it and going elsewhere.
If you are building a product today, the feature you are working on _is not done_ until Claude Code can use it. A skill and an MCP isn't a "nice to have", it is going to be as important as SEO and accessibility, with extremely similar work to do to enable it.
Your product might as well not exist in a few years if it isn't discoverable by agents and usable by agents.
> If you are building a product today, the feature you are working on _is not done_ until Claude Code can use it. A skill and an MCP isn't a "nice to have", it is going to be as important as SEO and accessibility, with extremely similar work to do to enable it. Your product might as well not exist in a few years if it isn't discoverable by agents and usable by agents.
This is an interesting take. I admit I've never thought this way.
As discussed in https://news.ycombinator.com/item?id=46777409
Wow, that is almost point for point what I had written down in a bunch of documents I had been spreading around at work this week. Excellent post.
Yeah, omnipresent LLMs are a kind of forcing function for addressing typical significant underinvestment in (human-readable) docs. That said, I'm not entirely sold on MCP per se.
Are there good techniques for testing / benchmarking skills effectiveness?
One good thing Vercel did was indexing skills.md under a site, skills.sh - and yes, there are now hundreds of these sites, but I like the speedy/lite approach of Vercel's DX, despite not liking Vercel a whole lot.
I don't like the Vercel design; it's just a huge list of abstract skill names, and you have to click on every one to even have a clue what it does. Such bad design, IMHO.
The design of https://www.skillcreator.ai/explore is more useful for me. At least I can search by category, framework, and language, and I also see much more information about what a skill does at a glance. I don't know why Vercel wanted to do it completely black and white - color, used and done with taste, gives useful context and information.
That site loads one skill at a time on the explore page, on my iPhone in mobile Safari.
slop?
Is it just me, or do skills seem enormously similar to MCP?
…including, apparently, the clueless enthusiasm for people to “share” skills.
MCP is also perfectly fine when you run your own MCP locally. It’s bad when you install some arbitrary MCP from some random person. It fails when you have too many installed.
Same for skills.
It’s only a matter of time (maybe it already exists?) until someone makes a “package manager” for skills that has all of the stupid of MCP.
I don’t feel they’re similar at all and I don’t get why people compare them.
MCP is giving the agents a bunch of functions/tools it can use to interact with some other piece of infrastructure or technology through abstraction. More like a toolbox full of screwdrivers and hammers for different purposes, or a high-level API interface that a program can use.
Skills are more similar to a stack of manuals/books in a library that teach an agent how to do something, without polluting the main context. For example, a guide on how to use `git` on the CLI: the agent can read the manual when it needs to use `git`, but it doesn't need to have the knowledge of how to use `git` in its brain when it's not relevant.
> MCP is giving the agents a bunch of functions/tools
A directory of skills... same thing
You can use MCP the same way as skills with a different interface. There are no rules on what goes into them.
They both need descriptions and instructions around them, and they both have to be presented and indexed/introduced to the agent dynamically, so we can tell it what it has access to without polluting the context.
See the Anthropic post on moving MCP servers to a search function. Once you have enough skills, you are going to require the same optimization.
I separate things in a different way:
1. What things do I force into context (agents.md, "tools" index, files)
2. What things can the agent discover (MCP, skills, search)
That's the point. It was supposed to be a simpler, more efficient way of doing the same things as MCP but agents turned out not to like them as much.
It is conceptually different. Skills were created to address the context rot problem. You pull the right skill from the deck when you hit a challenge, figuring out the best one just by reading the title and description.
It's mostly just static/dynamic content behind descriptive names.
> Is it just me, or do skills seem enormously similar to MCP?
Ok I'm glad I'm not the only one who wondered this. This seems like simplified MCP; so why not just have it be part of an MCP server?
For one thing, it’s a text file and not a server. That makes it simpler.
Sure, but in an MCP server the endpoints provide a description of how to use the resource. I guess a text file is nice too but it seems like a stepping stone to what will eventually be necessary.
It's hilarious that after all those years of resistance to technical writing and formal specification, engineers and programmers have suddenly been reduced to nothing more than technical writers and specification designers. Funny that I somehow don't foresee technical-writing pay bumps happening as a consequence of this sudden surge in importance.
This post does a very good job of laying out that argument
https://jsulmont.github.io/swarms-ai/
Reads like slop to me. Reeeeeally verbose and many “it’s not just …, it’s …” all over the place.
slopman, the new strawman argument
You just don't know which parts of the doc are real and which are hallucinated. Maybe the prompter checked everything and the content is actually good, but sadly many don't and there is a lot of slop flying around.
So if you want others to read the output you'll have to de-slopify it, ideally make it shorter and remove the tells.
If I went by good faith these days and trusted that someone didn't just upload LLM-hallucinated bullshit, I'd sadly just be reading slop all day and not learning anything, or worse, getting deceived by hallucinations and making wrong assumptions. It's just a waste of someone's precious lifetime.
LLMs can read through slop all day, humans can not without getting extremely annoyed.
Uh, more like managers than writers. We (the agent and I) have written about 20 design docs for my personal project and none of them were by hand.
Sounds like a bunch of bullshit to me. A simple markdown file with whatever and a directory will do the same. This is just packaging, selling and marketing.