October 2024

Hey!

Do you have a website? A personal one or perhaps something more serious?

Whatever the case, if you don’t want AI companies training on your website’s contents, add the following to your robots.txt file:

User-agent: *
Allow: /

User-agent: anthropic-ai
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: PiplBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: cohere-ai
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: Omgilibot
Disallow: /

User-agent: Omgili
Disallow: /


There are of course more, and even the crawlers you list aren’t forced to cooperate (robots.txt is a request, not an enforcement mechanism), but this should get the biggest AI companies to leave your site alone.
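If you want to sanity-check rules like these before deploying them, Python’s standard library ships a robots.txt parser. Here’s a minimal sketch using urllib.robotparser with a shortened copy of the rules above; “SomeOtherBot” is a made-up name standing in for any crawler not on the list:

```python
from urllib import robotparser

# A shortened copy of the rules above, kept as a string for testing.
rules = """\
User-agent: *
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# A crawler that only matches the "*" group may fetch everything.
print(parser.can_fetch("SomeOtherBot", "/index.html"))  # True

# The named AI crawlers match their own group and are blocked site-wide.
print(parser.can_fetch("GPTBot", "/index.html"))  # False
print(parser.can_fetch("CCBot", "/anything"))     # False
```

To check a live site instead of a pasted string, use parser.set_url("https://example.com/robots.txt") followed by parser.read().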

Important note: The first two lines declare that any bot not on the list is allowed to access everything on the site. If you don’t want this, add “Disallow:” lines after them, listing the relative paths you don’t want any bots, including Google Search, to access. For example:

User-agent: *
Allow: /
Disallow: /super-secret-pages/secret.html


If that were in the robots.txt of example.com, it would tell all bots not to access

https://example.com/super-secret-pages/secret.html
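You can verify a path-level Disallow the same way. One portability note, since parsers differ: crawlers following RFC 9309 (including Google’s) apply the longest matching rule regardless of order, but Python’s urllib.robotparser applies the first matching rule, so in this sketch the Disallow line is placed before the blanket Allow:

```python
from urllib import robotparser

# Same rules as the example, with Disallow listed first so that
# first-match parsers like urllib.robotparser also honor it.
rules = """\
User-agent: *
Disallow: /super-secret-pages/secret.html
Allow: /
"""

parser = robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# The secret page is off-limits to every bot...
print(parser.can_fetch("AnyBot", "/super-secret-pages/secret.html"))  # False
# ...while the rest of the site stays crawlable.
print(parser.can_fetch("AnyBot", "/public.html"))  # True
```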

And I’m sure you already know what to do if you have an existing robots.txt, sitemap.xml/sitemap.txt, etc.