The National Center for Missing and Exploited Children said it received more than 1 million reports of AI-related child sexual abuse material in 2025, with “the vast majority” stemming from Amazon.
But if they’re uniquely good at producing CSAM, odds are it’s due to a proprietary dataset.
This is why I use the word ‘proliferation’ in the nuclear sense, though ‘contamination’ may be more apt… Since the days of SD1, these illegal capabilities have become more and more prevalent in the local image model space. The advent of model merging, mixing, and retraining/finetuning has caused a significant increase in the proportion of model releases that are contaminated.
What you’re saying is ultimately true, but it was more true in the early days. Animated, drawn, and CGI content has always been a problem, but photorealistic capability was very limited and rare, often coming from homebrewed proprietary finetunes published on shady forums. Since then, such models have become much more prevalent. It’s estimated that roughly a fourth to a third of photorealistic SDXL-based NSFW models released on civit.ai during 2025 have some degree of this capability. (Speaking purely in a boolean metric… I don’t think anyone has done a study on the perceptual quality of these capabilities, for obvious reasons.)
Just as LLM benchmark test answers have contaminated open-source models, illegal capabilities gained from illegal datasets have contaminated image models, to the point where plenty of well-intentioned authors are unknowingly contributing to the problem. Some go out of their way to poison models (usually with false-association training on specific keywords), but few bother, or even know, to do so.