Yeah, this will have absolutely no impact on gathering training data.
I assumed it was to block AI agents crawling it during requests, which they’d be unlikely to bypass in the web UI.
But no company spending millions on training will hesitate to have an agent appear as a regular desktop user to scrape data.
Does Cloudflare still look at the user agent? I thought they had more reliable data points.
I meant an AI agent, not the browser's user agent. All data points can be spoofed, and if they can't be, they’ll pay a human to scrape before they pay for content.
Okay, fair enough, I thought you meant just the user agent. The trouble with having a bot make it look like an actual user is viewing the data is that it’s slow and inefficient. The trouble with paying humans to scrape the data is the same: it’s slow and inefficient. These companies want to ingest data ridiculously fast because there’s so much of it. If all else fails, they’ll resort to paying the content creators, but only if it’s data they really think gives their model a competitive edge in some metric and they can’t pirate it. E.g., I can see them paying for scientific research they can’t get from LibGen, but not for some rando’s blog post or a local news website.