If you're a content creator or publisher (e.g., of blog content), here are some ways to protect your content from being scraped as AI training data:
- Use the well-known robots.txt mechanism for controlling how bots index your site (details in Option 1 below). Be aware that some AI bots reportedly ignore robots.txt, which is unfortunate.
- If you use a CDN, you can inspect the request's User-Agent header and block the crawler at the edge, before it ever reaches your site (details in Option 2 below).
Option 1:
You can block AI crawlers by adding them to your site's robots.txt file as disallowed user agents, following each AI company's published instructions. A quick way to sanity-check the result is sketched after the listing below.
User-agent: anthropic-ai
Disallow: /
User-agent: Claude-Web
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: FacebookBot
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: GPTBot
Disallow: /
User-agent: PiplBot
Disallow: /
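To verify which agents your robots.txt fully disallows, here is a small TypeScript sketch (runnable in Node 18+ or Deno, where fetch is global). It is a deliberately simplified parser, handling only User-agent and Disallow lines and treating "Disallow: /" as a full block; it is not a complete robots.txt implementation (no wildcards, no Allow). "example.com" is a placeholder for your own domain.

```typescript
// List the user agents that a robots.txt file disallows entirely.
// Simplified: only User-agent / Disallow lines, exact "Disallow: /".
function blockedAgents(robotsTxt: string): Set<string> {
  const blocked = new Set<string>();
  let currentAgents: string[] = []; // agents in the current group
  let sawDirective = false; // a new User-agent after a directive starts a new group

  for (const rawLine of robotsTxt.split("\n")) {
    const line = rawLine.split("#")[0].trim(); // strip comments and whitespace
    if (!line) continue;
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      if (sawDirective) {
        currentAgents = [];
        sawDirective = false;
      }
      currentAgents.push(value);
    } else if (field === "disallow") {
      sawDirective = true;
      if (value === "/") {
        for (const agent of currentAgents) blocked.add(agent);
      }
    }
  }
  return blocked;
}

// Usage: fetch your own robots.txt and print the fully blocked agents.
fetch("https://example.com/robots.txt")
  .then((res) => res.text())
  .then((txt) => console.log("Fully disallowed agents:", [...blockedAgents(txt)]));
```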
Option 2:
Popular CDN providers like Cloudflare make this easy to do; see https://lnkd.in/gVdU9y7U. A minimal sketch of the idea follows.
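If your site sits behind Cloudflare, one way to implement the edge block is a Worker that matches on the User-Agent header. This is a minimal sketch, assuming module Worker syntax and reusing the bot names from the robots.txt above; Cloudflare's WAF rules can achieve the same without code, per the link above. The case-insensitive substring match is a simple heuristic (real headers look like "GPTBot/1.0").

```typescript
// Cloudflare Worker sketch: block known AI crawlers at the edge.
const BLOCKED_BOTS = [
  "anthropic-ai",
  "claude-web",
  "ccbot",
  "facebookbot",
  "google-extended",
  "gptbot",
  "piplbot",
];

export default {
  async fetch(request: Request): Promise<Response> {
    const ua = (request.headers.get("User-Agent") ?? "").toLowerCase();

    // Reject the request if the User-Agent contains any listed bot token.
    if (BLOCKED_BOTS.some((bot) => ua.includes(bot))) {
      return new Response("Forbidden", { status: 403 });
    }

    // Otherwise pass the request through to the origin unchanged.
    return fetch(request);
  },
};
```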
You can also combine options 1 and 2: robots.txt asks crawlers politely, while the edge rule enforces the block for bots that don't comply.