How to block AI bots with llms.txt

January 16, 2024
10 min read
llms.txt Team

Some site owners prefer to limit how AI crawlers access and process their content. This guide explains what llms.txt can and cannot do, how to use it alongside robots.txt, and when to rely on server‑level or firewall controls. You will find copy‑paste examples plus a practical workflow to combine clarity and enforcement.

If you need a ready‑to‑publish file, you can generate a complete llms.txt in seconds with our tool. Try it here: Generate llms.txt.

What llms.txt can and cannot do

  • llms.txt is advisory guidance for AI crawlers. It explains your site, priorities, and suggested focus areas.
  • It does not enforce blocking. Non‑compliant bots can ignore it.
  • For explicit allow/deny directives that compliant crawlers honor, use robots.txt or meta robots; for actual enforcement, use server or firewall rules.
  • Best practice: publish llms.txt for clarity and pair it with robots.txt plus server‑side protections.

The recommended layered approach

Layer 1: llms.txt for AI guidance

  • Explain what is important and what to avoid in human‑readable form.
  • Reduce accidental indexing of low‑value or sensitive sections.
  • Document your intent and content structure for AI systems.

Layer 2: robots.txt for authoritative rules

  • Disallow specific paths for compliant crawlers and reference your sitemap.
  • List known AI user agents when providers support opt‑out directives.

Layer 3: Server or WAF enforcement

  • Block specific User‑Agents and IP ranges using your web server or CDN/WAF.
  • Require authentication or tokens for private content and APIs.
  • Apply rate limits to suspected automated traffic.

Example llms.txt with Disallow guidance

# Example Site
> Public marketing site with docs and a customer portal. The portal requires authentication.

## Contact
- Email: security@example.com
- Website: https://example.com

## Pages
### Home
URL: https://example.com/
Overview of services and customer stories.

### Docs
URL: https://example.com/docs
Public product documentation and API guides.

### Blog
URL: https://example.com/blog
Tutorials, announcements, and best practices.

## Crawling Rules
Disallow: /portal
Disallow: /admin
Disallow: /internal
Disallow: /search

Need a starter file? Generate llms.txt now and customize it to your policy.

robots.txt examples for stricter controls

These directives apply to many search/SEO crawlers and well‑behaved bots:

User-agent: *
Disallow: /admin
Disallow: /internal
Disallow: /portal
Allow: /
Sitemap: https://example.com/sitemap.xml

Block known AI user agents (example)

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

Keep your list current; AI providers periodically publish new opt‑out user agents and IP ranges. Note that Google‑Extended is a robots.txt product token rather than a crawl User‑Agent, so it is honored here but will never match a server‑level User‑Agent filter. Document your policy in llms.txt and robots.txt for clarity.

Server‑level blocking (enforcement)

Nginx example (block by User‑Agent)

# Place this map in the http context: it flags any request whose
# User-Agent matches a known AI crawler
map $http_user_agent $block_ai {
  default 0;
  ~*(GPTBot|CCBot|ClaudeBot) 1;
}

server {
  listen 443 ssl;
  server_name example.com;

  # Return 403 Forbidden to flagged crawlers before proxying
  if ($block_ai) { return 403; }

  location / {
    proxy_pass http://app_upstream;
  }
}

Apache example (.htaccess)

# Flag known AI crawlers by User-Agent
SetEnvIfNoCase User-Agent "GPTBot|CCBot|ClaudeBot" bad_bot

# Deny flagged requests (Apache 2.2 directives; Apache 2.4 needs mod_access_compat)
Order Allow,Deny
Allow from all
Deny from env=bad_bot
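
The directives above are the older Apache 2.2 access-control syntax. On Apache 2.4 without mod_access_compat, the same policy can be expressed with the newer authorization directives (a minimal sketch using the same bot list):

SetEnvIfNoCase User-Agent "GPTBot|CCBot|ClaudeBot" bad_bot
<RequireAll>
  Require all granted
  Require not env bad_bot
</RequireAll>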

Cloudflare WAF concept

  • Field: User Agent
  • Operator: contains
  • Value: GPTBot (repeat for each target)
  • Action: Block or JS Challenge
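
In the Cloudflare dashboard this becomes a single custom WAF rule rather than one rule per bot. A sketch of the rule expression (Cloudflare's Rules language; extend the list to match your policy, and set the action to Block or JS Challenge as noted above):

(http.user_agent contains "GPTBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "ClaudeBot")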

Protecting authenticated and private content

  • Require authentication for private routes; do not rely on robots.txt alone.
  • Use signed URLs, token checks, or session‑based access for sensitive endpoints.
  • Avoid exposing staging/admin paths publicly; keep them restricted.
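
For the portal itself, the web server can refuse unauthenticated requests regardless of user agent. A minimal Nginx sketch using the auth_request module (requires ngx_http_auth_request_module; the /auth/check route is a hypothetical application endpoint that returns 200 for logged-in sessions and 401 otherwise):

location /portal/ {
  auth_request /internal/auth;                 # allow only if the subrequest returns 2xx
  proxy_pass http://app_upstream;
}

location = /internal/auth {
  internal;                                    # not reachable directly from clients
  proxy_pass http://app_upstream/auth/check;   # hypothetical auth-check route in your app
  proxy_pass_request_body off;
  proxy_set_header Content-Length "";
  proxy_set_header X-Original-URI $request_uri;
}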

Handling AI crawlers for APIs and feeds

  • Add X‑Robots‑Tag: noindex to responses that should not be indexed.
  • Rate‑limit or require API keys for high‑value endpoints.
  • Publish an llms.txt statement clarifying which API endpoints, if any, AI tools may access.
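
The header and rate limit from the first two points can be set directly in Nginx. A minimal sketch (the /api/ path, zone name, and rate are assumptions to tune for your traffic):

# In the http context: track clients by IP, allow roughly 10 requests per second
limit_req_zone $binary_remote_addr zone=api_limit:10m rate=10r/s;

# Inside the server block
location /api/ {
  add_header X-Robots-Tag "noindex, nofollow" always;   # keep API responses out of indexes
  limit_req zone=api_limit burst=20 nodelay;            # absorb small bursts, reject the excess with 503
  proxy_pass http://app_upstream;
}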

Monitoring and maintenance

  • Monitor server logs for AI bot user agents and IPs.
  • Update robots.txt and WAF rules regularly.
  • Test that public pages remain crawlable while private areas return 403/401.
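
Two quick checks cover most of this. A sketch assuming an Nginx access log in the default location and the bot names used above:

# Which AI crawler user agents are hitting the site, and from how many IPs?
grep -iE "GPTBot|CCBot|ClaudeBot" /var/log/nginx/access.log \
  | awk '{print $1}' | sort | uniq -c | sort -rn | head

# Spot-check enforcement: a spoofed AI User-Agent should receive 403,
# while a normal request to a public page should still receive 200
curl -sI -A "GPTBot" https://example.com/ | head -n 1
curl -sI https://example.com/ | head -n 1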

Putting it all together (workflow)

  1. Generate llms.txt guidance → /#generator
  2. Update robots.txt with Disallow directives and sitemap references
  3. Add server/WAF rules for specific AI bots
  4. Verify with logs and external crawlers
  5. Review monthly; adjust with new bot signatures