Hey!
Do you have a website? A personal one or perhaps something more serious?
Whatever the case, if you don’t want AI companies training on your website’s contents, add the following to your robots.txt file:
User-agent: *
Allow: /
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: PiplBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: cohere-ai
Disallow: /
User-agent: ChatGPT-User
Disallow: /
User-agent: Omgilibot
Disallow: /
User-agent: Omgili
Disallow: /
There are, of course, more crawlers out there, and even if you added them they might not cooperate, since robots.txt is a request rather than something a server enforces. Still, this should get the biggest AI companies to leave your site alone.
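If you want to sanity-check the rules before uploading them, here is a minimal sketch using Python's standard-library urllib.robotparser. The shortened rule list, the is_allowed helper and the example.com URL are placeholders made up for illustration, and a real crawler may read the file differently than this parser does.

```python
from urllib.robotparser import RobotFileParser

# Shortened copy of the rules (placeholder; paste your full robots.txt here).
ROBOTS_TXT = """\
User-agent: *
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: CCBot
Disallow: /
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if this parser thinks user_agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # Hypothetical page URL, used only for the check.
    for bot in ("GPTBot", "CCBot", "Googlebot"):
        print(bot, is_allowed(bot, "https://example.com/blog/post.html"))
```

The listed AI crawlers should come back as not allowed, while a bot that isn't on the list (Googlebot here) falls through to the "User-agent: *" group and is allowed.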
Important note: The first two lines declare that anything not on the list is allowed to access everything on the site. If you don’t want this, add “Disallow:” lines after them with the relative paths of anything you don’t want any bot, including Google Search, to access. For example:
User-agent: *
Allow: /
Disallow: /super-secret-pages/secret.html
If that were in the robots.txt of example.com, it would tell all bots not to access
https://example.com/super-secret-pages/secret.html
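In case you're wondering why the Disallow line wins even though "Allow: /" comes first: under the robots.txt standard (RFC 9309), the most specific (longest) matching path is the one that counts, not the first one listed. Here is a rough sketch of that rule in Python. The names is_path_allowed and RULES are made up for illustration, wildcards aren't handled, and individual crawlers may not follow the standard exactly.

```python
def is_path_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Longest-match check for one group of Allow/Disallow rules.

    rules is a list of (directive, path_prefix) pairs.
    Wildcards such as * and $ are deliberately not handled in this sketch.
    """
    best_len = -1
    allowed = True  # no matching rule means access is allowed
    for directive, prefix in rules:
        # An empty path matches nothing; skip rules that don't prefix the path.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The example group from above, as (directive, path) pairs.
RULES = [
    ("Allow", "/"),
    ("Disallow", "/super-secret-pages/secret.html"),
]

if __name__ == "__main__":
    print(is_path_allowed("/super-secret-pages/secret.html", RULES))  # False
    print(is_path_allowed("/index.html", RULES))  # True
```

Some simpler parsers apply the first matching rule instead, so if you want to be on the safe side you could put the Disallow line before "Allow: /".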
And I’m sure you already know what to do if you already have a robots.txt, sitemap.xml/sitemap.txt, etc.