How to block AI bots with llms.txt
January 16, 2024
10 min read
llms.txt Team
Some site owners prefer to limit how AI crawlers access and process their content. This guide explains what llms.txt can and cannot do, how to use it alongside robots.txt, and when to rely on server‑level or firewall controls. You will find copy‑paste examples plus a practical workflow to combine clarity and enforcement.
If you need a ready‑to‑publish file, you can generate a complete llms.txt in seconds with our tool. Try it here: Generate llms.txt.
What llms.txt can and cannot do
- llms.txt is advisory guidance for AI crawlers. It explains your site, priorities, and suggested focus areas.
- It does not enforce blocking. Non‑compliant bots can ignore it.
- For authoritative directives (allow/deny), use robots.txt, meta robots (example below), or server/firewall rules.
- Best practice: publish llms.txt for clarity and pair it with robots.txt plus server‑side protections.
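The meta robots option mentioned above can be as simple as a page-level tag. The snippet below is only a sketch using the standard noindex/nofollow values; robots.txt and server rules are covered later in this guide.

<!-- In the <head> of a page that should not be indexed by compliant crawlers -->
<meta name="robots" content="noindex, nofollow">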
The recommended layered approach
Layer 1: llms.txt for AI guidance
- Explain what is important and what to avoid in human‑readable form.
- Reduce accidental indexing of low‑value or sensitive sections.
- Document your intent and content structure for AI systems.
Layer 2: robots.txt for authoritative rules
- Disallow specific paths for compliant crawlers and reference your sitemap.
- List known AI user agents when providers support opt‑out directives.
Layer 3: Server or WAF enforcement
- Block specific User‑Agents and IP ranges using your web server or CDN/WAF.
- Require authentication or tokens for private content and APIs.
- Apply rate limits to suspected automated traffic.
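Rate limiting is the one Layer 3 control not shown in the server examples later on. Here is a minimal nginx sketch; the ai_limit zone name and the 10 requests/minute threshold are placeholders, and app_upstream matches the upstream used in the Nginx example below.

# In the http block: a shared zone keyed by client IP.
# The 10 requests/minute rate is an example threshold, not a recommendation.
limit_req_zone $binary_remote_addr zone=ai_limit:10m rate=10r/m;

server {
    listen 443 ssl;
    server_name example.com;

    location / {
        # Allow short bursts, then answer excess requests with 429 instead of queueing them.
        limit_req zone=ai_limit burst=20 nodelay;
        limit_req_status 429;
        proxy_pass http://app_upstream;
    }
}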
Example llms.txt with Disallow guidance
# Example Site

> Public marketing site with docs and a customer portal. The portal requires authentication.

## Contact

- Email: security@example.com
- Website: https://example.com

## Pages

### Home
URL: https://example.com/
Overview of services and customer stories.

### Docs
URL: https://example.com/docs
Public product documentation and API guides.

### Blog
URL: https://example.com/blog
Tutorials, announcements, and best practices.

## Crawling Rules

Disallow: /portal
Disallow: /admin
Disallow: /internal
Disallow: /search
Need a starter file? Generate llms.txt now and customize it to your policy.
robots.txt examples for stricter controls
These directives apply to many search/SEO crawlers and well‑behaved bots:
User-agent: *
Disallow: /admin
Disallow: /internal
Disallow: /portal
Allow: /

Sitemap: https://example.com/sitemap.xml
Block known AI user agents (example)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /
Keep your list current. Some AI providers publish opt‑out user agents or IPs. Document your policy in llms.txt and robots.txt for clarity.
Server‑level blocking (enforcement)
Nginx example (block by User‑Agent)
map $http_user_agent $block_ai {
    default 0;
    ~*(GPTBot|CCBot|ClaudeBot|Google-Extended) 1;
}

server {
    listen 443 ssl;
    server_name example.com;

    if ($block_ai) {
        return 403;
    }

    location / {
        proxy_pass http://app_upstream;
    }
}
Apache example (.htaccess)
# SetEnvIfNoCase requires mod_setenvif; Order/Allow/Deny requires mod_access_compat on Apache 2.4.
SetEnvIfNoCase User-Agent "GPTBot|CCBot|ClaudeBot|Google-Extended" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
Cloudflare WAF concept
- Field: User Agent
- Operator: contains
- Value: GPTBot (repeat for each target)
- Action: Block or JS Challenge
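If you prefer a single custom rule instead of repeating the condition per bot, a roughly equivalent filter can be written in Cloudflare's rule expression editor. The snippet below is a sketch; keep the user-agent list aligned with your own policy.

(http.user_agent contains "GPTBot") or
(http.user_agent contains "CCBot") or
(http.user_agent contains "ClaudeBot") or
(http.user_agent contains "Google-Extended")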
Protecting authenticated and private content
- Require authentication for private routes (nginx sketch after this list); do not rely on robots.txt alone.
- Use signed URLs, token checks, or session‑based access for sensitive endpoints.
- Avoid exposing staging/admin paths publicly; keep them restricted.
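A minimal nginx sketch of the first and last points above. HTTP Basic Auth, the .htpasswd path, and the 203.0.113.0/24 range are stand-ins; most portals will use application sessions or SSO instead.

location /portal/ {
    # Enforcement lives here, not in robots.txt: unauthenticated requests get 401.
    auth_basic "Customer portal";
    auth_basic_user_file /etc/nginx/.htpasswd;  # hypothetical credentials file
    proxy_pass http://app_upstream;
}

location /admin/ {
    # Keep admin/staging paths off the public internet; allow only a trusted range.
    allow 203.0.113.0/24;
    deny all;
    proxy_pass http://app_upstream;
}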
Handling AI crawlers for APIs and feeds
- Add an X‑Robots‑Tag: noindex header to responses that should not be indexed (sketch below).
- Rate‑limit or require API keys for high‑value endpoints.
- Publish an llms.txt statement clarifying which API endpoints, if any, AI tools may access.
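A minimal nginx sketch of the header approach; the /api/ path is a placeholder for your own endpoints.

location /api/ {
    # Ask compliant crawlers and indexers not to index API responses.
    add_header X-Robots-Tag "noindex" always;
    proxy_pass http://app_upstream;
}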
Monitoring and maintenance
- Monitor server logs for AI bot user agents and IPs (example commands below).
- Update robots.txt and WAF rules regularly.
- Test that public pages remain crawlable while private areas return 403/401.
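Two quick checks, assuming nginx-style access logs and the user agents listed earlier; adjust paths and hostnames to your setup.

# Count recent requests from known AI crawlers in the access log.
grep -icE 'gptbot|ccbot|claudebot|google-extended' /var/log/nginx/access.log

# Spot-check enforcement: a blocked user agent should get 403, a normal browser string 200.
curl -s -o /dev/null -w '%{http_code}\n' -A 'GPTBot' https://example.com/docs
curl -s -o /dev/null -w '%{http_code}\n' -A 'Mozilla/5.0' https://example.com/docs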
Putting it all together (workflow)
- Generate llms.txt guidance with the generator (/#generator)
- Update robots.txt with Disallow directives and sitemap references
- Add server/WAF rules for specific AI bots
- Verify with logs and external crawlers
- Review monthly; adjust with new bot signatures