08-30-2024, 09:47 PM
After all the talk about "AI" overtaking the primacy of the human race, I really have to wonder.
As of now, everything actively promoted as "AI" must be "trained on existing data." (I'll set aside, for the moment, that a) it isn't actually AI, it never has been, and b) it likely won't be anytime "soon.")
As it turns out, certain "AI" was being trained on databases that contain child sex imagery and text.
From ArsTechnica: Nonprofit scrubs illegal content from controversial AI training dataset
Subtitled: After backlash, LAION cleans child sex abuse materials from AI training data.
After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023.
Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it "is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."
To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched with hashed images in the online safety organizations' databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION's partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.
In his study, Thiel warned that "the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content."
Thiel urged LAION and other researchers scraping the Internet for AI training data that a new safety standard was needed to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that "CSAM generated by AI is still CSAM.")
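For anyone wondering what "matched with hashed images" looks like in practice, here is a rough sketch of the general idea: compute a digest for each image a link points to and drop any link whose digest appears in a blocklist of known-bad hashes. To be clear, this is not LAION's actual pipeline; the file names, CSV columns, SHA-256 digest, and stubbed downloader below are all my own illustrative assumptions. Real systems lean on hash databases maintained by organizations like IWF and C3P, and typically use perceptual hashing rather than exact byte digests.

```python
# Illustrative sketch only -- not LAION's real tooling. Assumes a hypothetical
# blocklist file ("known_bad_hashes.txt", one hex digest per line) and a
# hypothetical dataset CSV ("dataset_links.csv" with "url" and "caption" columns).
import csv
import hashlib

def load_blocklist(path: str) -> set[str]:
    """Load known-bad image digests, one lowercase hex string per line."""
    with open(path) as f:
        return {line.strip().lower() for line in f if line.strip()}

def image_digest(image_bytes: bytes) -> str:
    """Hash the raw image bytes; production systems use perceptual hashes instead."""
    return hashlib.sha256(image_bytes).hexdigest()

def scrub_dataset(rows, blocklist, fetch):
    """Yield only (url, caption) rows whose fetched image is not on the blocklist.

    `fetch` is a caller-supplied function mapping a URL to image bytes,
    or None if the link is dead.
    """
    for url, caption in rows:
        data = fetch(url)
        if data is None:
            continue  # dead link: drop rather than keep an unverifiable entry
        if image_digest(data) in blocklist:
            continue  # matches a known-bad hash: remove the link
        yield url, caption

if __name__ == "__main__":
    blocklist = load_blocklist("known_bad_hashes.txt")
    with open("dataset_links.csv", newline="") as f:
        rows = [(r["url"], r["caption"]) for r in csv.DictReader(f)]
    # A real fetcher would download each URL; stubbed out here for illustration.
    fetch = lambda url: None
    cleaned = list(scrub_dataset(rows, blocklist, fetch))
```

The point is that the dataset itself only stores links and captions, so "scrubbing" it really means checking what those links point to against the watchdogs' hash lists and removing the matches.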
There are lots of pointed questions this story could raise, most of them eclipsed by the criminal nature of the problem. More questions need to be asked about how training material is selected and vetted. I'll grant that many objections will veer into the morass of social sensitivities our culture manifests. It is an iceberg of a problem, and this story is only its tip.
Who taught the teachers what they teach?