llms.txt Is Metadata for Models

Robots.txt tells crawlers where they can't go. Sitemaps tell them where they should go. But neither file tells AI models what your site is actually about. That's what llms.txt does — it's metadata specifically designed for language models to read and understand.

It's not a standard from a committee. It began as a 2024 proposal from Jeremy Howard of Answer.AI and spread as a convention because it solved a real problem: how do you tell an AI what your site is for?

What llms.txt Contains

An llms.txt file is markdown-formatted metadata about your site. It includes your site name, a brief description, key pages, content usage policy, and preferred citation format. All written in plain language that both humans and AI models can understand.
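A minimal file along these lines might look like the following sketch (the site, URLs, and policies are invented for illustration):

```markdown
# Example Health Library

> Consumer medical information written and reviewed by licensed physicians.

## Key Pages
- [Condition Guides](https://example.com/conditions): plain-language overviews of common conditions
- [Drug Index](https://example.com/drugs): dosage and interaction reference
- [Symptom Checker](https://example.com/symptoms): triage guidance, not a diagnosis

## Content Usage Policy
Content may be used for inference and citation. Do not use it for model training.

## Preferred Citation
Cite as "Example Health Library" and link to the specific article, not the homepage.
```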

The format is deliberately simple. No complex schemas, no XML, no JSON. Just markdown with clear section headings. This makes it easy to write, easy to read, and easy for models to parse.
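As a sketch of how simple that parsing can be, here is a minimal, hypothetical parser (not any official library) that splits an llms.txt file into sections keyed by their headings:

```python
def parse_llms_txt(text: str) -> dict:
    """Split markdown-style llms.txt content into sections keyed by heading.

    A minimal sketch: a real parser might also handle links, blockquotes,
    and heading levels. Here every heading simply starts a new section.
    """
    sections = {"_preamble": []}  # text before the first heading
    current = "_preamble"
    for line in text.splitlines():
        if line.startswith("#"):
            current = line.lstrip("#").strip()
            sections[current] = []
        else:
            sections[current].append(line)
    # Join each section's lines back into a single trimmed string
    return {name: "\n".join(lines).strip() for name, lines in sections.items()}


example = """# Example Site

## Content Usage Policy
Inference allowed; training not allowed.
"""
parsed = parse_llms_txt(example)
```

Because the format is just headings and text, even this toy version recovers the structure a model needs.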

The content usage policy is the most important section. This is where you explicitly state whether AI models can use your content for training, for inference, or not at all. It's not legally binding, but it's a clear signal.

Why Models Check For It

When an AI model encounters your site, it needs context. What kind of site is this? Is it authoritative? What topics does it cover? Without llms.txt, the model has to infer this from the content itself, which is error-prone.

With llms.txt, the model gets explicit metadata. "This is a medical information site. Our key authoritative pages are X, Y, Z. We allow citation but not training." That context improves the model's ability to cite your content accurately.

Some AI companies and crawlers have started checking for llms.txt files, and it may become a routine part of the crawling process — check robots.txt for access rules, check llms.txt for metadata.

llms.txt is what robots.txt should have been: a way to communicate intent, not just restrictions.

The Citation Advantage

Early adopters report that sites with llms.txt files get cited more often in AI-generated content. This isn't guaranteed, and the evidence so far is anecdotal rather than rigorous, but the logic is plausible: when a model has clear metadata about a site, it's more confident citing that site as a source.

This matters in the age of AI search. When Perplexity or ChatGPT generates an answer, the sources it cites get visibility. If your site is consistently cited, that builds brand recognition even if users don't click through.

The preferred citation format section lets you control how you're cited. Do you want a link to the homepage or to specific articles? Do you want your brand name or your domain name? llms.txt lets you specify this.

The Training Policy Signal

The content usage policy in llms.txt is where you state your position on AI training. "This content may be used for inference but not for model training" is a common policy.

Like robots.txt, this isn't legally enforceable. A crawler that ignores robots.txt is breaking a widely accepted norm. One that ignores llms.txt is also breaking a norm, just a newer, less established one.

Still, it's a signal. Responsible AI companies want to respect content creators' wishes. llms.txt gives them a clear way to know what those wishes are.

How to Write One

Start with your site's core purpose in one sentence. Then list 3-5 key pages that represent your best content. Add a content usage policy — be explicit about training vs. inference. Include a preferred citation format.

Keep it under 500 words. The file should be scannable by both humans and models. Avoid jargon. Write in plain language.

The file should live at yourdomain.com/llms.txt, just like robots.txt lives at yourdomain.com/robots.txt. Plain text or markdown format, not HTML.
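These guidelines can be checked mechanically. The sketch below assumes the section names used in this article ("Content Usage Policy", "Preferred Citation"); they aren't mandated by any spec:

```python
# Assumed section names, based on this article's recommendations, not a spec
REQUIRED_SECTIONS = ("Content Usage Policy", "Preferred Citation")
MAX_WORDS = 500  # the "keep it scannable" budget suggested above


def check_llms_txt(text: str) -> list:
    """Return a list of human-readable problems; an empty list means it passes."""
    problems = []
    words = len(text.split())
    if words > MAX_WORDS:
        problems.append(f"too long: {words} words (limit {MAX_WORDS})")
    for section in REQUIRED_SECTIONS:
        if section not in text:
            problems.append(f"missing section: {section}")
    return problems


issues = check_llms_txt("# My Site\n\n## Content Usage Policy\nInference only.\n")
```

A check like this fits naturally in a CI step or a pre-deploy hook so the file never drifts out of shape.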

The Adoption Curve

llms.txt started as a grassroots proposal. Developers who wanted better AI citations began adding the file, AI researchers and tool builders noticed, and a growing number of platforms now recommend it as a best practice.

It's not universal yet; many sites don't have one. But early adopters report gains in citation rates and AI visibility.

As AI search becomes more prevalent, llms.txt will likely become as standard as robots.txt. It's the metadata layer that the AI-first web needs.

Beyond Basic Metadata

Some sites are experimenting with extended llms.txt formats: structured data about authors, publication dates, and content categories; API endpoints for real-time data; contact information for licensing inquiries.

These extensions aren't standardized yet, but they show the potential. llms.txt could evolve from simple metadata to a comprehensive machine-readable profile of your site.
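Purely as illustration, an extended file might add sections like these (none of these fields are standardized, and the names are invented):

```markdown
## Authors
- Dr. Jane Example, MD (medical reviewer)

## Content Categories
medical-reference, patient-education

## API
Real-time drug interaction data: https://example.com/api/v1/interactions

## Licensing
licensing@example.com
```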

For now, the basic format is enough. Site name, description, key pages, usage policy, citation format. That's all you need to give AI models the context they need to understand and cite your content correctly.

Generate a custom llms.txt file for your site with LLM Utils llms.txt Generator — optimized for AI citations and model training policies.