Deny Ignorance
Unwelcome AI data scraping ... we do what we want. - Printable Version

+- Deny Ignorance (https://denyignorance.com)
+-- Forum: Science, Mystery, & Paranormal (https://denyignorance.com/Section-Science-Mystery-Paranormal)
+--- Forum: Science & Technology (https://denyignorance.com/Section-Science-Technology)
+--- Thread: Unwelcome AI data scraping ... we do what we want. (/Thread-Unwelcome-AI-data-scraping-we-do-what-we-want)



Unwelcome AI data scraping ... we do what we want. - Maxmars - 06-24-2024

First came the marketing "full court press" about how we now have to worry about "AI."

Never mind that AI - as far as we have seen - does not yet exist.

But that didn't stop the Big Tech effort to sell algorithmic language modelling as AI... just for the big bucks.  And nearly every media outlet went on the march, with a litany of productions and employment opportunities for those who spew words about topics they know little about... again, for the big bucks.  Companies began their "investment" games, marketers began their speculative prospecting, and many amateurs played "monetized videos" for fun and profit.

But you see, the reality is that LLMs and algorithmic processes are not AI (any more than eyeballs and ear lobes are "humans.")

LLMs kept failing, and people ran with 'the products' to test their limits and flexibility, always encountering the same oddness and irregularities in the machine output... hinging mostly on a limitation that these models all share: they lack "training."  By training I mean exposure to words strung together to convey meaning.  So the solution for the business is to expose the models to as much human-created content as possible...

Hence, making the machine "crawl" through websites similar to this one, extracting phrases, identifying patterns, and rendering the creations of people into mathematical representations that "work" linguistically (but never mind context.)

Such a process takes bandwidth, resources, and time, as well as creating a new form of data consumption which the hosting framework was never designed to accommodate.

From Engadget: AI companies are reportedly still scraping websites despite protocols meant to block them
Subtitled: Multiple AI companies are bypassing robots.txt instructions, according to Reuters.
 

Perplexity, a company that describes its product as "a free AI search engine," has been under fire over the past few days. Shortly after Forbes accused it of stealing its story and republishing it across multiple platforms, Wired reported that Perplexity has been ignoring the Robots Exclusion Protocol, or robots.txt, and has been scraping its website and other Condé Nast publications. Technology website The Shortcut also accused the company of scraping its articles. Now, Reuters has reported that Perplexity isn't the only AI company that's bypassing robots.txt files and scraping websites to get content that's then used to train their technologies.

Reuters said it saw a letter addressed to publishers from TollBit, a startup that pairs them up with AI firms so they can reach licensing deals, warning them that "AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites." The robots.txt file contains instructions for web crawlers on which pages they can and can't access. Web developers have been using the protocol since 1994, but compliance is completely voluntary.



So-called AI is actually just a set of algorithmically arranged programs, each on its own not dissimilar from the "bots" that some companies used to aggressively scrape data from websites en masse... mostly search engines.  That was ordinarily considered an abusive strain on databases and the equipment that housed them, so as a restraint the robots.txt file was developed: a set of instructions telling search engine crawlers which URLs they can access on your site.  Most creators and website designers welcome search engines that can direct traffic to them, but there have to be limits, or scouring their sites could turn their end-users' experience into a crawling nightmare (pun intended.)
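As a concrete illustration, here is a minimal robots.txt sketch (the directory and crawler names are hypothetical): one block of rules per "User-agent", with `Disallow` marking paths that crawler is asked not to fetch. Note that nothing in this file enforces anything; it only states the site owner's wishes.

```
# Hypothetical robots.txt sketch.
# Rules for all crawlers: stay out of one directory.
User-agent: *
Disallow: /admin/

# Rules for one specific (hypothetical) AI training crawler: stay out entirely.
User-agent: ExampleAIBot
Disallow: /
```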

These "AI training" forays are NOT search engines... but what they are trying to do is just as bad.  And a bit more complicated than simply acquiring active links...
 

In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company "is not ignoring the Robot Exclusions Protocol and then lying about it." That doesn't mean, however, that it isn't benefiting from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers on top of its own, and that the crawler Wired identified was one of them. When Fast Company asked if Perplexity told the crawler provider to stop scraping Wired's website, he only replied that "it's complicated."

Srinivas defended his company's practices, telling the publication that the Robots Exclusion Protocol is "not a legal framework" and suggesting that publishers and companies like his may have to establish a new kind of relationship.



So the idea is... "Your complaints are irrelevant, your robots.txt file is not a law."  (Pssh - whatever, I do what I want.)


RE: Unwelcome AI data scraping ... we do what we want. - Swada - 06-30-2024

"Never mind that AI - as far as we have seen - does not yet exist."

I disagree. Imagine our world is a chess board. The human will use their intuition to find the best possible move, whereas the computer, through brute force, will test all possible moves to determine which provides the best possible outcome.  Computers beat the world champion of chess in 1997. Today's more impressive computers could test the intelligence of humans in all fields.
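The brute-force idea described above can be sketched in a few lines of Python: a toy minimax search over a made-up game tree (not chess; the tree and its scores are hypothetical), where the machine simply tests every move and every possible reply before choosing.

```python
# Toy minimax sketch: the "test all possible moves" approach.
# A node is either a number (a finished game, scored for the maximizer)
# or a list of child positions (the moves still available).

def minimax(node, maximizing):
    """Return the best score reachable from this position by exhaustive search."""
    if isinstance(node, (int, float)):  # leaf: game over, return its score
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# A tiny three-ply game tree; the maximizing player moves first.
tree = [
    [[3, 5], [2, 9]],   # continuations after the first candidate move
    [[0, 1], [7, 4]],   # continuations after the second candidate move
]
print(minimax(tree, True))  # prints 5
```

Real chess engines prune this search heavily (the full tree is astronomically large), but the principle is the same exhaustive testing of outcomes.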

But unlike Descartes, I don't believe sentience is a result of intelligence.  Yet films like Ex Machina suggest they may well be capable of mimicking sentience.


RE: Unwelcome AI data scraping ... we do what we want. - Maxmars - 06-30-2024

Perhaps it's best to define our terms.

Artificial Intelligence is now a well-marketed concept.  One definition which seems practically generic reads: "... artificial intelligence is the ability of a machine to perform tasks that are commonly associated with intelligent beings, including reasoning, learning, generalization, and intelligence." (Irony of ironies, this is an AI-provided definition.)  For the sake of ease, I will accept that as the legitimately intended definition.

Yet all we have been treated to (at least publicly) are machines that 'sort of approximate' what reasoning humans 'speak or communicate', and they inevitably fail in substantive form over time.  Large language models (LLMs) have now been reported to "hallucinate" text meanings and context, even to the point of purposefully fabricating nonexistent 'supporting links' for their outputs. 

This is an indication that we are not dealing with intelligence, and that it is far from sentience... which is the ultimate boogeyman used for fear-fodder, (and far from a reliable communicator of factual reality, which is the desired model.)

No, unless you can provide other 'non-algorithmic' methods for modeling human sentience/intelligence, you will never have a similarly sentient computer.  LLMs are output filters, not 'thinking' processes.

(This is all of course, my opinion.)


RE: Unwelcome AI data scraping ... we do what we want. - broccoli - 06-30-2024

(06-30-2024, 12:49 AM)Maxmars Wrote: Perhaps it's best to define our terms.

Artificial Intelligence is now a well-marketed concept.  One definition which seems practically generic reads: "... artificial intelligence is the ability of a machine to perform tasks that are commonly associated with intelligent beings, including reasoning, learning, generalization, and intelligence." (Irony of ironies, this is an AI-provided definition.)  For the sake of ease, I will accept that as the legitimately intended definition.

Yet all we have been treated to (at least publicly) are machines that 'sort of approximate' what reasoning humans 'speak or communicate', and they inevitably fail in substantive form over time.  Large language models (LLMs) have now been reported to "hallucinate" text meanings and context, even to the point of purposefully fabricating nonexistent 'supporting links' for their outputs. 

This is an indication that we are not dealing with intelligence, and that it is far from sentience... which is the ultimate boogeyman used for fear-fodder, (and far from a reliable communicator of factual reality, which is the desired model.)

No, unless you can provide other 'non-algorithmic' methods for modeling human sentience/intelligence, you will never have a similarly sentient computer.  LLMs are output filters, not 'thinking' processes.

(This is all of course, my opinion.)

Took the words right out of my mouth with this and your OP, well done Max. I agree 100%.
Pretty soon, we will have to start putting "A.I." in quotation marks because it's not what most people believe it to be.
However, I will add this small story because it kind of fits here if you define intelligence.
When I was younger I once asked my uncle how he knows God is real.
He told me that if intelligence is the ability to recognize patterns, anticipate patterns, or even create patterns, then the lack thereof is random. We are of intelligent design, a system, every cell, every time, everywhere, in nature, all of it. There is no random, only the illusion of random from beings that can't perceive the pattern yet.

Not trying to have a religious debate lol. I am sure my beliefs are unlike anyone else's here, and maybe we will have that conversation in another post one day, but to me intelligence is the ability to recognize patterns, anticipate patterns, or even create patterns, as stated above. For something artificial to do that... we've had that for a long time, in very many basic forms no less, from LLMs to security software that learns from being attacked. The boogeyman only exists in the minds of those that don't understand it. That's ok though ;) Max will moderate that boogeyman away.

Oh, I forgot to comment on the robots.txt lol. Mostly, robots.txt files are used so search engines do not scrape directories you would like to keep private. It in no way stops them from doing so if they have other orders; it's more of a courtesy thing. Like if I had a .htaccess file or a config file I didn't want anyone to find without seriously digging. Google Dorking would be another, probably better, example of something it would want to prevent.
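That "courtesy" point can be seen directly in code. Python's standard library ships `urllib.robotparser`, which a *polite* crawler uses to check robots.txt before fetching; the rules and URLs below are hypothetical. The key point: the check is something the crawler chooses to run, and a crawler that skips it can fetch the "disallowed" pages anyway.

```python
# Sketch of a polite crawler's robots.txt check (hypothetical rules/URLs).
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
# A real crawler would call rp.set_url("https://example.com/robots.txt")
# and rp.read(); we parse the rules inline so the example is self-contained.
rp.parse([
    "User-agent: *",
    "Disallow: /private/",
])

print(rp.can_fetch("MyCrawler", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler", "https://example.com/private/x"))   # False
```

Nothing happens if the crawler never consults `can_fetch` at all, which is exactly the voluntary-compliance gap the thread is about.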


RE: Unwelcome AI data scraping ... we do what we want. - putnam6 - 06-30-2024

My first thought about the AI takeover is that it's being promoted by the same type of azzhats that promoted the Y2K potential meltdown. Yes, the marketing of it also bugs TF out of me. Add that we are already seeing incredible AI pics and videos, and it's not that there isn't a tell in them all, but for enough people it's a quick glance, and boom, it's accepted as real tangible fact, clicked on, saved, and reposted. 

Because of the layered marketing I'm already tired of it; all the generic collections of bullet points most open and free AIs produce are only useful for giving you more ideas for research. 

Until it's open-sourced and vetted, it is still a technology that's ripe for abuse, such as slanted or biased presentation, manipulation, and/or brainwashing indoctrination, among others. Let me ask AI about the dangers of AI.

If it's like RoboCop it might just burn out its motherboard...well shit that didn't work, but at least it's self-aware enough and semi-honest about itself to admit its faults. That's a trait our leadership might ought to learn from AI, BTW.

What is going to happen? Will there be hearings and legislation to regulate AI? The companies that lobby the hardest and make campaign contributions will be the ones rolled out, get business contracts, and be influenced and manipulated by the politicians and the department and agency heads and officials whose hands they grease.
 
Quote:The use of artificial intelligence (AI) presents several inherent dangers that must be addressed to ensure the technology benefits society without leading to catastrophic consequences. Here are some of the key dangers associated with AI:
  1. Lack of Transparency and Explainability
  2. Job Displacement
  3. Social Manipulation and Bias
  4. Privacy and Legal Issues
  5. Algorithmic Bias and Bad Data
  6. Market Volatility and Economic Impact
  7. Weapons Automation
  8. Uncontrolled Self-Awareness and Behavior
To mitigate these dangers, it is crucial to implement ethical guidelines, ensure AI systems are designed with transparency and explainability in mind, and establish clear legal frameworks that protect consumers and workers alike. Furthermore, investing in research and development to improve the fairness and reliability of AI algorithms should be a priority for stakeholders in the technology industry.

FWIW, that sounds like Big Tech will require extra funding for this potential problem that Big Tech researched, developed, and promoted. Then there is this nugget about Y2K... minimizing sumbeech

In conclusion, while the world did overreact to Y2K, the reaction was deemed necessary and appropriate at the time due to the gravity of the potential consequences. The overreaction served as a testament to the global collaboration and ingenuity required to solve a complex technological issue

Its potential use for graphic design and videos is phenomenal; however, much like Photoshop, I expect companies to install access through monthly-fee bullshit. 

If today's AI were autonomous, omniscient, and all-seeing and knowing, would it need monthly fees for its platinum or professional levels to finance itself? Money-grubbing little bastages 

Lastly, it would seem the lower one's IQ, the more susceptible one becomes to saturation bombing, i.e. the COVID overreaction and mandated vaccine push.

Is there any doubt an early AI model probably pushed that...

Here's why AI has a long way to go...

My question....
 
Quote:Wasn't mandatory COVID vaccination, masking and distance a massive overreaction?

and Mister Spock and HAL 9000's inspired creation by KOMO AI's smirking, condescending answer, which is just pulled from the usual suspects with the most #hashtags 

The query suggests a skepticism regarding the necessity of mandatory COVID-19 vaccination, masking, and social distancing measures. However, the information provided from various sources indicates that these actions are crucial components of the global response to the COVID-19 pandemic. Here's a detailed explanation supported by the sources:

Mandatory COVID-19 Vaccination:
  • The primary goal of the COVID-19 vaccines is to prevent severe illness, hospitalization, and death. Vaccination is shown to significantly reduce the risk of symptomatic infection.
  • It's important to note that achieving herd immunity through vaccination is essential to control the spread of the virus. This requires a substantial proportion of the population being vaccinated, which can lower the overall amount of virus circulating.
Masking:
  • Face masks are proven to reduce the spread of the virus, particularly when worn over the nose and mouth. They help to prevent the spread of respiratory droplets.
  • The Centers for Disease Control and Prevention (CDC) initially recommended that everyone aged 2 and older wear a mask in public settings. Although the guidelines have relaxed, the CDC still advises wearing masks in certain situations, such as in crowded, poorly ventilated spaces.
Physical Distancing:
  • Maintaining physical distance from others, especially in crowded or indoor settings, is a key strategy to prevent the spread of the virus.
  • The CDC has emphasized that avoiding medium and large gatherings and close contact with individuals outside one's household is important, especially in areas with high levels of community transmission.
To address the concern that these measures might have been a "massive overreaction," it's necessary to consider the following points:
  1. The rapid spread of the Delta variant and the potential for future variants underscores the ongoing threat posed by COVID-19.
  2. The vaccines are highly effective, offering 94% to 95% protection against severe disease. However, the effectiveness may vary depending on the type of vaccine and the timing relative to the second dose.
  3. The CDC's guidance on masking has evolved, reflecting the changing nature of the pandemic and the availability of vaccines.
  4. Local mask mandates have been associated with reductions in COVID-19 case rates, suggesting that masking can be effective in mitigating the spread of the virus.
In conclusion, while the situation with COVID-19 has improved due to vaccinations, it's premature to declare the end of the pandemic. The virus remains a significant health threat, and the actions taken to combat it, such as vaccination, masking, and social distancing, are prudent and science-based.