AI scraping is an effective DDoS on the entire internet

Excerpt from a message I just posted in a #diaspora team internal f...

jwz gave the game away, so I'll reveal:
The One Weird Trick for this week is that the bots pretend to be an old version of Chrome. So you can block on user agent.
So I blocked old Chrome from hitting the expensive MediaWiki call on RationalWiki, and took our load average from 35 (unusable) to 0.8 (schweeet).
Caution! This also blocks the archive sites, which pretend to be old Chrome. I refined it to only block the expensive query on MediaWiki; vary as appropriate.
nginx code:
# block some bot UAs for complex requests
# nginx doesn't do nested if, so we set a test variable
# if $BOT is both Complex and Old, block as bot
set $BOT "";
if ($uri ~* (/w/index.php)) { set $BOT "C"; }
if ($http_user_agent ~* (Chrome/[2-9])) { set $BOT "${BOT}O"; }
if ($http_user_agent ~* (Chrome/1[012])) { set $BOT "${BOT}O"; }
if ($http_user_agent ~* (Firefox/3)) { set $BOT "${BOT}O"; }
if ($http_user_agent ~* (MSIE)) { set $BOT "${BOT}O"; }
if ($BOT = "CO") { return 503; }
You always return 503, not 403, because 403 says "fuck off" but the scrapers are used to seeing 503 from servers they've flattened.
I give this trick at least another week.
Count them as ad visits, to make big tech pay for better hardware or a better line?
That opens you up to getting accused of click fraud, as AdNauseam found out the hard way, but it's worth it if you can squeeze some cash out of them before that happens.
I mean, scraping bots would obviously obey robots.txt, so those scraping bots, I mean users, can't be bots.
Re the blocking of fake user agents: what people could try is to see if there are things older user agents do (or do wrong) which these bots do not. I've heard of some companies doing that. (Long ago I also heard of somebody using that to catch MMO bots in a specific game. There was a packet that, if the server sent it to a legit client, crashed the client; a bot didn't crash.) I'd assume the specifics are treated as secret just because you don't want the scrapers to find out.
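One cheap behavioural check along those lines, as a hedged nginx sketch rather than anything from the thread: Chrome 2-12 and Firefox 3 predate HTTP/2, so a client presenting one of those user agents over an HTTP/2 connection is definitely lying about what it is. Whether it catches your particular scrapers depends on what HTTP stack they use.

# maps go in the http{} context
map $http_user_agent $claims_old_browser {
    default 0;
    "~Chrome/[2-9]\." 1;
    "~Chrome/1[012]\." 1;
    "~Firefox/3\." 1;
}
# real Chrome 2-12 / Firefox 3 cannot speak HTTP/2, so this combination is a spoof
map $server_protocol$claims_old_browser $ua_mismatch {
    default 0;
    "HTTP/2.01" 1;
}
# then, inside the server{} or location block, same 503 camouflage as above:
if ($ua_mismatch) { return 503; }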
It's a constant cat-and-mouse game at the moment. Every week or so, we get another flood of scraping bots, which forces us to triangulate which fucking DC IP range we need to start blocking now. If they ever start using residential proxies, we're fucked.
At least OpenAI, and probably others, do currently use commercial residential proxying services, though reputedly only if you make it obvious you're blocking their scrapers, presumably as an attempt on their end to limit operating costs.
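For reference, the per-DC-range blocking mentioned above is usually done with nginx's geo module. A minimal sketch, with RFC 5737 documentation ranges standing in as placeholders for whatever ranges you've actually triangulated:

# goes in the http{} context
geo $scraper_dc {
    default 0;
    203.0.113.0/24 1;     # placeholder documentation range; substitute the DC ranges you identify
    198.51.100.0/24 1;    # placeholder documentation range
}
# then, at server{} level:
if ($scraper_dc) { return 503; }    # 503 for the same camouflage reason as above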
In my experience with bots, a portion of them obey robots.txt, but it's tricky to find the user agent string that some bots react to.
So I recommend having a robots.txt that not only targets specific bots, but also tells all bots to avoid specific paths/queries.
Example for DokuWiki:
User-agent: *
Noindex: /lib/
Disallow: /_export/
Disallow: /user/
Disallow: /*?do=
Disallow: /*&do=
Disallow: /*?rev=
Disallow: /*&rev=
Would it be possible to detect the GPTBot (or similar) from their user agent, and serve them different data?
Can they detect that?
Yes, you can match on user agent and then conditionally serve them other stuff (most web servers are fine with this). Nepenthes and iocaine are the current preferred/recommended servers for serving them bot mazes.
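To make that concrete, here's a minimal nginx sketch of routing matched crawlers to a tarpit. The user-agent list, the /bot-maze path, and the assumption that a Nepenthes or iocaine instance is listening on 127.0.0.1:8893 are all placeholders to adjust for your own setup.

# goes in the http{} context: flag the crawler user agents you want to divert
map $http_user_agent $is_ai_bot {
    default 0;
    "~*(GPTBot|ClaudeBot|CCBot|Bytespider)" 1;
}

server {
    listen 80;
    server_name wiki.example.org;    # placeholder

    location / {
        # rewrite ... last is one of the few constructs that is safe inside
        # a location-level if; it hands matched bots to the maze below
        if ($is_ai_bot) {
            rewrite ^ /bot-maze$uri last;
        }
        # ... normal handling for real visitors ...
    }

    location /bot-maze {
        internal;                          # only reachable via the rewrite above
        proxy_pass http://127.0.0.1:8893;  # the tarpit serves endless generated pages
    }
}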
The thing is that the crawlers will also lie (OpenAI definitely doesn't publish all its own source IPs, I've verified this myself), and will attempt a number of workarounds (like using residential proxies too).