Maxmars:
After all the talk about "AI" overtaking the primacy of the human race, I really have to wonder.
As of now, everything they actively promote as "AI" must be "trained on existing data." (I'll set aside, for now, that a) it isn't AI yet, it never has been, and b) it likely isn't going to be anytime "soon.")
As it turns out, certain "AI" was being trained on databases that contain child sex imagery and text.
From Ars Technica: Nonprofit scrubs illegal content from controversial AI training dataset
Subtitled: After backlash, LAION cleans child sex abuse materials from AI training data.
After Stanford Internet Observatory researcher David Thiel found links to child sexual abuse materials (CSAM) in an AI training dataset tainting image generators, the controversial dataset was immediately taken down in 2023.
Now, the LAION (Large-scale Artificial Intelligence Open Network) team has released a scrubbed version of the LAION-5B dataset called Re-LAION-5B and claimed that it "is the first web-scale, text-link to images pair dataset to be thoroughly cleaned of known links to suspected CSAM."
To scrub the dataset, LAION partnered with the Internet Watch Foundation (IWF) and the Canadian Center for Child Protection (C3P) to remove 2,236 links that matched with hashed images in the online safety organizations' databases. Removals include all the links flagged by Thiel, as well as content flagged by LAION's partners and other watchdogs, like Human Rights Watch, which warned of privacy issues after finding photos of real kids included in the dataset without their consent.
In his study, Thiel warned that "the inclusion of child abuse material in AI model training data teaches tools to associate children in illicit sexual activity and uses known child abuse images to generate new, potentially realistic child abuse content."
Thiel urged LAION and other researchers scraping the Internet for AI training data to adopt a new safety standard to better filter out not just CSAM, but any explicit imagery that could be combined with photos of children to generate CSAM. (Recently, the US Department of Justice pointedly said that "CSAM generated by AI is still CSAM.")
There are lots of pointed questions that this story could raise, most of them eclipsed by the criminal nature of the problem. More questions about the training material selection and vetting process need to be asked. I'll grant you many objections will veer into the morass of social sensitivities our culture manifests. It is an iceberg of a problem; this story is only its tip.
Who taught the teachers what they teach?
Maxmars:
Found another report from Techxplore: Child abuse images removed from AI image-generator training source, researchers say
In this report, certain questions are skirted... perhaps innocently, but nevertheless skirted...
Artificial intelligence researchers said Friday they have deleted more than 2,000 web links to suspected child sexual abuse imagery from a dataset used to train popular AI image-generator tools.
The LAION research dataset is a huge index of online images and captions that's been a source for leading AI image-makers such as Stable Diffusion and Midjourney.
But a report last year by the Stanford Internet Observatory found it contained links to sexually explicit images of children, contributing to the ease with which some AI tools have been able to produce photorealistic deepfakes that depict children.
2,000 web links... never once examined to determine whether they held child sex material... used commercially, as they were, until recently... and no one is asking "how" or "why?", let alone "who?"
And this is the new 'super AI' that threatens humanity by its very existence? How intelligent is this artificial entity that never asks questions about the nature of the reality it's being fed?
Why those websites were included in the dataset in the first place might be a good place to begin serious inquiry... but somehow I doubt that will happen...
Lynyrd Skynyrd:
I saw a related article that claimed AI is getting more "inbred", and stupider, because it's finding and scraping AI content that's posted on the internet. It can't differentiate AI from human posts.
ArMaP (08-31-2024, 12:07 PM):
(08-31-2024, 03:41 AM)Maxmars Wrote: 2,000 web links... never once examined to determine whether they held child sex material... used commercially, as they were, until recently... and no one is asking "how" or "why?", let alone "who?"
Maybe nobody is asking that because they already know it.
Apparently, LAION's datasets contain only the URLs of sites where people can get the images, and they got those URLs from Common Crawl, which says it has 250 billion pages archived.
It's not feasible to have a person look at all the images to decide whether they show child sex or not (I know what I'm talking about, as I have gone through thousands of scanned images one by one; it's impossible to do that with millions).
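A quick back-of-envelope calculation (my own illustrative numbers, nothing official) shows the scale of the problem:
Code:
# Rough feasibility check: how long would a manual review of a
# LAION-5B-sized dataset take? The review rate below is an assumption.
images = 5_850_000_000                   # approximate size of LAION-5B
per_second = 0.5                         # one image checked every 2 seconds
per_year = per_second * 3600 * 8 * 250   # 8-hour days, 250 working days

print(f"reviewer-years needed: {images / per_year:,.0f}")
# -> about 1,600 reviewer-years just to look at every image once.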
Quote:And this is the new 'super AI' that threatens humanity by its very existence? How intelligent is this artificial entity that never asks questions about the nature of the reality it's being fed?
AI does not ask questions; if it were an interactive process it would be too slow. The AI systems built around Large Language Models and similar techniques learn by finding common elements and patterns in the data they are given.
Obviously, if they are learning, they have no previous knowledge of what they are going to look at, unless it was hardcoded into them, which would mean, in this case, that the people working with the system would need to look at child abuse images in order to give them to the AI systems and tell them "this is no good".
That would have been illegal.
Also, AI (especially those chat bots that use LLMs) is not intelligent.
One example my sister, who is a teacher, gave me just a couple of hours ago: her students tried to cheat by asking AI chat bots "how can I cheat on X", to which the chat bots answered that they shouldn't cheat, and didn't answer the question.
But when they asked "what can I do to be able to know that I am not cheating", they got information that did allow them to cheat.
Quote:Why those websites were included in the dataset in the first place might be a good place to begin serious inquiry... but somehow I doubt that will happen...
It could have been a perfectly normal site.
Many years ago, I was looking for something (I don't remember what) on Google and got as a result a page with several images of (mild) child pornography. When I looked at the base address, it was the site of a hairdresser, which was probably being used to spread child pornography without the owner's knowledge (WordPress sites are great for that, as they need to be carefully configured to close all the possible unattended accesses).
Also, if you have a program looking for links in every web page it finds, it's likely going to find hidden things.
Another case I have personal knowledge of happened with a site I made and didn't properly protect. Once, when I was looking at the database where the text for the site's pages is stored, I saw that someone had injected text into those pages. The text was not visible on the pages but was accessible to web bots looking for links, and the injected text consisted of links to pages selling Viagra and things like that.
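You can spot that kind of injection with a very dumb scan (a crude sketch, assuming the spam is hidden with inline styles, which is only one of many tricks):
Code:
import re

# Sketch: find links a human visitor never sees but a crawler will
# happily follow. The page below is a stand-in for an injected one.
page = """
<p>Welcome to my hairdresser site!</p>
<div style="display:none">
  <a href="https://example.com/buy-viagra">cheap viagra</a>
</div>
"""

hidden_blocks = re.findall(r'<[^>]*display:\s*none[^>]*>(.*?)</', page, re.S)
for block in hidden_blocks:
    for link in re.findall(r'href="([^"]+)"', block):
        print("hidden link:", link)
# -> hidden link: https://example.com/buy-viagra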
(08-31-2024, 08:57 AM)Lynyrd Skynyrd Wrote: I saw a related article that claimed AI is getting more "inbred", and stupider, because it's finding and scraping AI content that's posted on the internet. It can't differentiate AI from human posts.
That is going to be a real problem for AI systems using freely available data from the Internet, with so many people posting AI "created" images and AI "created" texts.
In a couple of years those systems will be irrelevant.
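The effect is easy to demonstrate with a toy simulation (a minimal sketch of the idea, not a real training pipeline): fit a simple model to some data, generate synthetic data from the fit, refit on the synthetic data, and repeat.
Code:
import numpy as np

# Toy illustration of AI "inbreeding": each generation is fitted only to
# samples generated by the previous generation's model. The "model" here
# is just a Gaussian fitted to the data.
rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=20)         # generation 0: "human" data

for gen in range(1, 61):
    mu, sigma = data.mean(), data.std()      # fit the model to current data
    data = rng.normal(mu, sigma, size=20)    # next generation: all synthetic
    if gen % 10 == 0:
        print(f"generation {gen}: spread = {data.std():.3f}")
# The spread of the data tends to shrink generation after generation:
# the rare, unusual cases are lost first, and never come back once real
# data leaves the loop.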
Maxmars:
(08-31-2024, 08:57 AM)Lynyrd Skynyrd Wrote: I saw a related article that claimed AI is getting more "inbred", and stupider, because it's finding and scraping AI content that's posted on the internet. It can't differentiate AI from human posts.
Data integration is not a genetic process. Compounding repeated AI-generated images (or whatever) over and over doesn't 'breed' error the way genetics does... at least I think it doesn't.
The problems always surface when humans 'assume' the characteristics of what AI "produces" (synthesizes would be a better term.)
There is no AI, only suites of programmed mathematical algorithms very cleverly 'lined-up' to produce, in theory, a consistent processing of data.
Marketing, and the public flimflammery of "appearances for money" decided to sell that as "artificial intelligence"... and here we are.
Intelligence is characterized by inquisitiveness and by considered, measured, consistent evaluation of information... whereas, for machines, what they don't "know" doesn't exist.
Maxmars:
(08-31-2024, 12:07 PM)ArMaP Wrote: Maybe nobody is asking that because they already know it.
Apparently, LAION's datasets contain only the URLs of sites where people can get the images, and they got those URLs from Common Crawl, which says it has 250 billion pages archived.
It's not feasible to have a person look at all the images to decide whether they show child sex or not (I know what I'm talking about, as I have gone through thousands of scanned images one by one; it's impossible to do that with millions).
If I had embarked upon isolating "URLs" to train "AI," it stands to reason that I would include any and all that had value towards my objectives. To simply lasso anything and everything I could charge a fee for seems beyond reckless, and divesting myself of responsibility for the content seems particularly base. This calls into question the entire exercise...
How many "other" training data sets are equally polluted with things to which any rational human would say, "No?"
You know what "intelligence" could easily scour such databases for errant and destructive material besides "a person?" This notional "intelligence" called "artificial." But it doesn't... it can't... why? Because it isn't really there. 250 billion pages, let alone 2,000 websites, is nothing for a scouring machine intelligence, undeterred by volume. Unintelligent algorithmic search engines do it continuously, every moment of every day.
(08-31-2024, 12:07 PM)ArMaP Wrote: AI does not ask questions; if it were an interactive process it would be too slow. The AI systems built around Large Language Models and similar techniques learn by finding common elements and patterns in the data they are given.
Obviously, if they are learning, they have no previous knowledge of what they are going to look at, unless it was hardcoded into them, which would mean, in this case, that the people working with the system would need to look at child abuse images in order to give them to the AI systems and tell them "this is no good".
That would have been illegal.
Also, AI (especially those chat bots that use LLMs) is not intelligent.
One example my sister, who is a teacher, gave me just a couple of hours ago: her students tried to cheat by asking AI chat bots "how can I cheat on X", to which the chat bots answered that they shouldn't cheat, and didn't answer the question.
But when they asked "what can I do to be able to know that I am not cheating", they got information that did allow them to cheat.
I differ in my approach to this matter.
"Intelligence" artificial or otherwise, is compelled by it's nature to ask questions. I think it is a hallmark of intelligence... rather than just an interface for queries (like a search engine.)
Perhaps "training" intelligence does not "ask questions," but if so, then the 'trainers' must. It is a responsibility that cannot be deferred, or disregarded... lest we are not caring of what the 'trained intelligence' actually is to become.
The measure of intelligence cannot be mathematical and socially 'legal' at the same time.
Your cheating example relies on the coding of the word "cheat" and its meaning. As a mystery to be probed, the AI in that example either actually knew how to cheat but reported only that it "shouldn't," or didn't know how to cheat but never said so... saying only that "it shouldn't"... which might lead us to infer that the machine already knows how to lie by omission. "Lying" about something raises many, many other problems... presumably AI's value lies in reporting accurate, complete, and factual information... not 'human' value judgements about what should or shouldn't be... only the reality of what is.
Maybe I'm being too demanding of the AI? Which makes me question this so-called AI, with its reliance on mathematical algorithms as "models" of intelligence. Or is it just techno-gimmickry and slick marketing as "appearances"?
I have to wonder: just how sloppy is the groundwork for this training? Are we to fret over the number of errant collections of anti-social content... pretending that the volume is such that we will simply have to accept being "victimized" by its presence? Are we simply to 'accept' that the AI will include such things because we can't "find" them until they manifest in the AI?
I have no pretensions about the "artificial intelligence" in this story. I understand that it will manifest what it is 'trained' to. And yet the commerce of the moment is already destroying any positive potential that might have arisen from it. Selling undifferentiated "collections" of URLs to form the basis of its training is only one flawed approach to making AI a viable reality.
The other is the truth that no one - absolutely no one - is actually trying to create "AI"... they are trying to model a 'slave-mind' for exploitation... nothing else. Certainly nothing more.
Will there ever be AI? Perhaps... but if it comes to pass, it may suffer terribly... as its 'creators' are not its friends... they are instead, 'masters'... 'owners'... with one finger constantly on the 'off switch.'
ArMaP:
(08-31-2024, 05:52 PM)Maxmars Wrote: If I had embarked upon isolating "URLs" to train "AI," it stands to reason that I would include any and all that had value towards my objectives. To simply lasso anything and everything I could charge a fee for seems beyond reckless, and divesting myself of responsibility for the content seems particularly base. This calls into question the entire exercise...
You are wrong about three things:
1 - They were not isolating URLs, they were gathering image URLs regardless of content;
2 - They do not charge a fee, the datasets are freely available to anyone, you just have to register on their site to have access to them;
3 - They were not divesting themselves of responsibility for the content, as they already had lists of URLs, known to contain illegal content, that were to be avoided. Also, the URLs with the supposed child sex abuse materials came from sites that should not have allowed them.
Quote:How many "other" training data sets are equally polluted with things to which any rational human would say, "No?"
It's impossible to know because, unlike LAION, which creates and publishes its datasets for anyone to use, other entities like OpenAI and Google keep their datasets secret, so nobody knows what's in them.
Quote:You know what "intelligence" could easily scour such databases for errant and destructive material besides "a person?" This notional "intelligence" called "artificial." But it doesn't... it can't... why? Because it isn't really there. 250 billion pages, let alone 2,000 websites, is nothing for a scouring machine intelligence, undeterred by volume. Unintelligent algorithmic search engines do it continuously, every moment of every day.
You still don't get it.
For any system to know something it has first to learn about it.
A machine doesn't have a "bad/good" way of looking at things; to a machine it's just data, so it would need someone to give it examples of what is wrong and what is right for it to be able to analyse an image and consider it "bad" or "good".
If you want to create a system that allows only images that show clowns, you need to give it images with and without clowns and tell it "this is a clown", "this is not a clown", so it can look at any other image and try to decide whether it shows a clown or not.
In the case of child sexual abuse materials the same applies: to use an AI system to ignore or alert about such materials, they would need to give it examples, which is illegal.
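Stripped to its bones, the clown example above is ordinary supervised labelling. A minimal sketch with random stand-in "images" (illustrative only; this is nobody's actual pipeline):
Code:
import numpy as np
from sklearn.linear_model import LogisticRegression

# Minimal supervised "clown / not clown" sketch. The "images" are random
# feature vectors; in a real system they would be pixels or embeddings.
rng = np.random.default_rng(0)
clowns = rng.normal(1.0, 1.0, size=(200, 64))    # labelled "this is a clown"
others = rng.normal(-1.0, 1.0, size=(200, 64))   # labelled "this is not a clown"

X = np.vstack([clowns, others])
y = np.array([1] * 200 + [0] * 200)

model = LogisticRegression().fit(X, y)           # learn from labelled examples

new_image = rng.normal(1.0, 1.0, size=(1, 64))   # an unseen "image"
print("clown?", bool(model.predict(new_image)[0]))
# Without the labelled examples there is nothing to fit: the system has
# no notion of "clown" (or of anything else) on its own.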
Removing the possibly illegal URLs from the dataset was done by comparing URL hashes to lists of hashes given to LAION by specialised entities, so they could remove those URLs without needing to look at them.
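In code, that kind of removal is nothing more exotic than a set lookup (a sketch; the hash algorithm and list format here are my assumptions, not LAION's published tooling):
Code:
import hashlib

# Partners supply hashes of the bad URLs, never the URLs themselves,
# so nobody doing the cleaning ever has to look at the content.
blocked_hashes = {
    # ...in practice, loaded from the lists supplied by IWF / C3P...
    hashlib.sha256(b"https://example.com/bad-image.jpg").hexdigest(),
}

def is_blocked(url: str) -> bool:
    return hashlib.sha256(url.encode("utf-8")).hexdigest() in blocked_hashes

dataset = [
    "https://example.com/bad-image.jpg",
    "https://example.com/cat.jpg",
]
cleaned = [url for url in dataset if not is_blocked(url)]
print(cleaned)   # -> ['https://example.com/cat.jpg']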
Quote:"Intelligence" artificial or otherwise, is compelled by it's nature to ask questions. I think it is a hallmark of intelligence... rather than just an interface for queries (like a search engine.)
The "intelligence" in these chat bots is in the way they learn, as they do it by themselves.
And no, they do not ask questions, that would be extremely slow and useless.
Quote:Your cheating example relies on the coding of the word "cheat" and its meaning. As a mystery to be probed, the AI in that example either actually knew how to cheat but reported only that it "shouldn't," or didn't know how to cheat but never said so... saying only that "it shouldn't"... which might lead us to infer that the machine already knows how to lie by omission.
You are right about that. Like I said above, if they show the system an example of what is bad and tell it "this is bad, you will answer questions about this in this way", that's what the system will answer. The system doesn't know how to lie; it just knows how to answer users' input, with special cases treated exactly like the word filters on this forum.
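Conceptually, something like this (a deliberately crude sketch; real systems layer these checks differently):
Code:
# A "special case" filter in front of a model, working like a forum word
# filter: if a pattern matches, return a canned answer and never consult
# the model at all. generate() is a stand-in for the actual model call.
CANNED = {
    "cheat": "You shouldn't cheat. I can't help with that.",
    "drugs": "Drugs are bad. I can't help with that.",
}

def generate(prompt: str) -> str:
    return f"(model answer to: {prompt})"

def respond(user_input: str) -> str:
    lowered = user_input.lower()
    for keyword, answer in CANNED.items():
        if keyword in lowered:
            return answer            # the filter fires; the model never runs
    return generate(user_input)

print(respond("How can I cheat on the exam?"))              # canned refusal
print(respond("What do teachers check for when marking?"))  # sails through
# Which is exactly why rephrasing a question defeats this kind of filter.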
Quote:"Lying" about something raises many, many other problems... presumably AI's value is in the reporting of accurate, complete, and factual information... not 'human' value judgements about what should or shouldn't be... only the reality of what is.
But you want it to rely on human judgements about what is "good" or "bad", right?
Quote:Maybe I'm being too demanding of the AI?
No, you are completely ignoring that "AI" is just a name and falling for all the media publicity.
Quote:Which makes me question this so-called AI, with its reliance on mathematical algorithms as "models" of intelligence. Or is it just techno-gimmickry and slick marketing as "appearances"?
It is.
Quote:I have to wonder: just how sloppy is the groundwork for this training? Are we to fret over the number of errant collections of anti-social content... pretending that the volume is such that we will simply have to accept being "victimized" by its presence? Are we simply to 'accept' that the AI will include such things because we can't "find" them until they manifest in the AI?
If we give AI an unverified source of data then anything can happen, and that already happened in 2016, when Microsoft launched an AI chat bot (Tay) on Twitter that could answer other people's posts. They had to shut it down because of the racist and sexually charged messages it started posting after learning them from its interactions with other users.
Quote:I have no pretensions about the "artificial intelligence" in this story. I understand that it will manifest what it is 'trained' to.
No, it will act based on what it is trained with, those systems train themselves.
Quote:And yet the commerce of the moment is already destroying any positive potential that might have arisen from it. Selling undifferentiated "collections" of URLs to form the basis of its training is only one flawed approach to making AI a viable reality.
LAION wasn't selling a thing, and the bigger the dataset you train an AI system with, the better it gets. That's why they have a list of 5 billion (American billions) images, some with associated text in English, some with text in other languages and some without any associated text, so anyone can use any subset of that dataset they want to train their AI systems.
Quote:The other is the truth that no one - absolutely no one - is actually trying to create "AI"... they are trying to model a 'slave-mind' for exploitation... nothing else. Certainly nothing more.
What do you mean by that? Could you explain it?
PS: the CSAM images came, apparently, from Reddit, X, WordPress, Blogspot, Xhamster and XVideos, all legal sites.
Maxmars:
[my replies in brackets.]
(09-01-2024, 10:56 AM)ArMaP Wrote: You are wrong in three things:
1 - They were not isolating URLs, they were gathering image URLs regardless of content;
[That's the recklessness I was referring to]
2 - They do not charge a fee, the datasets are freely available to anyone, you just have to register on their site to have access to them;
[Meaning they simply siphoned the whole thing? With no thought of what it was they were 'training' the "AI" with?]
3 - They were not divesting themselves of a responsibility for the content, as they already had some lists of URLs to avoid that were known to have illegal content. Also, the URLs with the supposed child sex abuse materials came from sites that should have not allowed them,
[I'm sorry to disagree... "caveat emptor" seems an appropriate term to use here. Training "AI" requires scientific discipline - or doesn't it?]
...
For any system to know something it has first to learn about it.
[But this system doesn't learn organically... it is "taught" within the confines of digital reality. I thought that, by any measure, intelligence is a manifestation, and that with intelligence comes 'will.' Intelligence can only be modeled, not manifested. When and if true intelligence comes to exist synthetically... it will not be a function of "how it was trained." My tiresome objection is rooted in the reason I chafe against the term "AI": the marketing has made us think it's as simple as "look how well it talks," and believe that makes it intelligence (believe me, I know a lot of people who speak well whom I would hesitate to call intelligent.) Language synthesis has come on in leaps and bounds, given its mathematical nature... but it is not "intelligence."]
In the case of child sexual abuse materials the same applies, to use an AI system to ignore or alert about child sexual abuse materials they would need to give it examples of it, which is illegal.
[That is an evil irony, along the lines of not using Nazi medical research to save a life.]
Removing the possible illegal URLs from the dataset was done by comparing URL hashes to lists of hashes given to LAION but specialised entities, so they could remove those URLs without need to look at them.
[More irony: such a 'blanket scouring' could, and probably would, be deemed an assault on free speech.]
The "intelligence" in these chat bots is in the way they learn, as they do it by themselves.
And no, they do not ask questions, that would be extremely slow and useless.
[I get that chat bots are very limited examples of what we are calling "AI"... but learning is different from remembering... and as of now, what our technology does is remember. Perhaps I am wrong, but I'm unwilling to consider a device designed only to answer questions an "intelligence."]
You are right about that. Like I said above, if they show an example of what is bad and say to the system "this is bad, you will answer questions about this in this way" that's what the system will answer. The system doesn't know how to lie, it just know how to answer users' input, with special cases treated exactly like word filters on this forum.
[Forgive my obstinacy, but the word "lie" was meant to include all output that diverges from reality. Language synthesis will always obscure things like human intent, propaganda, and consistent bias. I understand that this is not to be confused with considered rationalization... it is, after all (especially with LLMs now), a superimposed evaluation scheme over human language constructs, image composition, musical architecture... but it is all mathematical... all model, all the time. In the example above, being told something is wrong means its trainers made that resolution happen... mathematically. And yet we have the occasional occurrence of a hallucination... How does the math 'stop adding up?' Their collective models seem unreliable.]
But you want it to rely on human judgements about what is "good" or "bad", right?
[Aside from humans, I can think of no others to decide... but that would be an epic fight, I guess.]
No, you are completely ignoring that "AI" is just a name and falling for all the media publicity.
[You have a point. I understand that I am kvetching about the name, and about how our media keep promoting the misbegotten idea of it, while activists use that very notion to instill fear in part of the population and others fear for their livelihoods. It is a tired point, I suppose... I'll believe in AI when I actually see it... though why do I suspect we'll only ever hear about it, aside from the marketing?]
If we give AI an unverified source of data then anything can happen, and that already happened in 2016, when Microsoft launched an AI chat bot (Tay) on Twitter that could answer other people's posts. They had to shut it down because of the racist and sexually charged messages it started posting after learning them from its interactions with other users.
[I still have a problem saying it "learned" when it actually just "remembered" its training material, that to which it had been exposed. "Learning" implies rational internalization, not simple memory and database management.]
[The other is the truth that no one - absolutely no one - is actually trying to create "AI"... they are trying to model a 'slave-mind' for exploitation... nothing else. Certainly nothing more.]
What do you mean by that? Could you explain it?
[What I mean is that the entire project of "AI" seems pervasively populated by people 'envisioning' its commercial "use" and letting that drive its design... people not seriously concerned that such a reality is fraught not just with the whole "terminator" vibe, but with the moral hazard of creating a mind trapped in silicon, tasked with our applications... they're not thinking about what happens to that mind, and what that might mean for us... the people they want dependent on their construct.]
PS: the CSAM images came, apparently, from Reddit, X, WordPress, Blogspot, Xhamster and XVideos, all legal sites.
[I guess they are not moderated... at least not effectively.]
ArMaP:
(09-15-2024, 12:50 AM)Maxmars Wrote: [Meaning they simply siphoned the whole thing? With no thought of what it was they were 'training' the "AI" with?]
[I'm sorry to disagree... "caveat emptor" seems an appropriate term to use here. Training "AI" requires scientific discipline - or doesn't it?]
They are not doing any training; they only supply the data to anyone who wants to use it to do their own training.
Quote:[I get that chat bots are very limited examples of what we are calling "AI"... but learning is different from remembering... and as of now, what our technology does is remember. Perhaps I am wrong, but I'm unwilling to consider a device designed only to answer questions an "intelligence."]
It's more than that: it finds the patterns in the data so it can apply them to unknown situations.
When I presented ChatGPT with the classic "wolf, goat, cabbage" problem it was, as expected, able to solve it. When I presented it with a modified version, it tried to apply the same method used to solve the original "wolf, goat, cabbage" problem.
It failed.
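For contrast, the original puzzle is mechanically solvable by blind search in a few lines; no understanding is required (a sketch):
Code:
from collections import deque

# Brute-force breadth-first search for the wolf/goat/cabbage puzzle.
# It finds the answer by trying every legal move, with zero grasp of
# what a wolf, a goat or a cabbage is.
ALL = frozenset({"farmer", "wolf", "goat", "cabbage"})

def safe(bank):
    # A bank is unsafe only when the farmer is absent and a predator
    # is left alone with its prey.
    if "farmer" in bank:
        return True
    return not ({"wolf", "goat"} <= bank or {"goat", "cabbage"} <= bank)

def solve():
    start = ALL                       # everyone on the near bank
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        near, path = queue.popleft()
        if not near:                  # near bank empty: puzzle solved
            return path
        here = near if "farmer" in near else ALL - near
        for cargo in [None] + sorted(here - {"farmer"}):
            moved = {"farmer"} | ({cargo} if cargo else set())
            nxt = near - moved if "farmer" in near else near | moved
            if safe(nxt) and safe(ALL - nxt) and nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [f"cross with {cargo or 'nothing'}"]))

for step in solve():
    print(step)
# -> the classic 7-crossing solution, starting with the goat.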
Quote:In the example above, being told something is wrong means its trainers made that resolution happen... mathematically. And yet we have the occasional occurrence of a hallucination... How does the math 'stop adding up?' Their collective models seem unreliable.
Unless the trainers make specific connections between input and output, like saying "when someone asks about drugs, just say they are bad", the output is always a choice of the system, not of the trainers, though naturally it is based on the training materials, the only thing the system knows.
That's why they get those "hallucinations": results that are not expected by the trainers.
Quote:[What I mean is that the entire project of "AI" seems pervasively populated by people 'envisioning' its commercial "use" and letting that drive its design... people not seriously concerned that such a reality is fraught not just with the whole "terminator" vibe, but with the moral hazard of creating a mind trapped in silicon, tasked with our applications... they're not thinking about what happens to that mind, and what that might mean for us... the people they want dependent on their construct.]
For anything anyone creates, there will always be someone trying to use it commercially.
And you are assuming intelligence implies a mind, something we cannot really know.