

Unwelcome AI data scraping ... we do what we want.
#1
First came the marketing "full court press" about how we now have to worry about "AI."

Never mind that AI - as far as we have seen - does not yet exist.

But that didn't stop the Big Tech effort to sell algorithmic language modelling as AI... just for the big bucks.  And nearly every media outlet went on the march, with a litany of productions and employment opportunities for those who spew words about topics they know little about... again, for the big bucks.  Companies began their "investment" games, and marketers began their speculative prospects, while many amateurs played "monetized videos" for fun and profit.

But you see, the reality is that LLMs and algorithmic processes are not AI (any more than eyeballs and ear lobes are "humans.")

LLMs kept failing, and people ran with 'the products' to test their limits and flexibility, always encountering the same oddness and irregularities in the machine output... hinging mostly on the limitation that these models all share... they lack "training."  By training I mean exposure to words strung together to convey meaning.  So the solution for the business is to expose the models to as much human-created content as possible...

Hence, making the machine "crawl" through websites similar to this, extracting phrases, identifying patterns, and rendering the creations of people into mathematical representations that "work" linguistically (but never mind context.)

Such a process takes bandwidth, resources, and time, and it creates a new form of data consumption which the hosting framework was never designed to accommodate.

From Engadget: AI companies are reportedly still scraping websites despite protocols meant to block them
Subtitled: Multiple AI companies are bypassing robots.txt instructions, according to Reuters.
 

Perplexity, a company that describes its product as "a free AI search engine," has been under fire over the past few days. Shortly after Forbes accused it of stealing its story and republishing it across multiple platforms, Wired reported that Perplexity has been ignoring the Robots Exclusion Protocol, or robots.txt, and has been scraping its website and other Condé Nast publications. Technology website The Shortcut also accused the company of scraping its articles. Now, Reuters has reported that Perplexity isn't the only AI company that's bypassing robots.txt files and scraping websites to get content that's then used to train their technologies.

Reuters said it saw a letter addressed to publishers from TollBit, a startup that pairs them up with AI firms so they can reach licensing deals, warning them that "AI agents from multiple sources (not just one company) are opting to bypass the robots.txt protocol to retrieve content from sites." The robots.txt file contains instructions for web crawlers on which pages they can and can't access. Web developers have been using the protocol since 1994, but compliance is completely voluntary.



So-called AI is actually just a set of algorithmically arranged programs, each on its own not dissimilar from the "bots" that some companies (mostly search engines) used to run aggressively to scrape data from websites en masse.  That was ordinarily considered an abusive strain on databases and the equipment that housed them, so as a restraint someone developed the robots.txt file: a set of instructions telling search engine crawlers which URLs they can access on your site.  Most creators and website designers welcome search engines that can direct traffic to them, but there have to be limits, or scouring their sites could bring their end users' experience to a crawling nightmare (pun intended.)
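For anyone who hasn't looked at one, here is a rough sketch of what a robots.txt check looks like from the crawler's side, using Python's standard urllib.robotparser module. The bot names (ExampleSearchBot, ExampleAITrainer), the forum.example URLs, and the is_allowed helper are all made up for illustration; they are not taken from any real site or company.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: the search crawler may read everything except
# /private/, while the (made-up) AI-training crawler is told to stay out.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Disallow: /private/

User-agent: ExampleAITrainer
Disallow: /

User-agent: *
Disallow: /private/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file's lines; a real crawler would download
    # /robots.txt from the site before crawling anything else.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for bot in ("ExampleSearchBot", "ExampleAITrainer"):
        for url in ("https://forum.example/thread/1",
                    "https://forum.example/private/inbox"):
            print(bot, url, is_allowed(bot, url))
```

Notice that nothing here is enforced by the website. The check only happens if the crawler's author bothers to run it, which is exactly why compliance is "completely voluntary."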

These "AI training" forays are NOT search engines... but what they are trying to do is just as bad.  And a bit more complicated than simply acquiring active links...
 

In an interview with Fast Company, Perplexity CEO Aravind Srinivas told the publication that his company "is not ignoring the Robot Exclusions Protocol and then lying about it." That doesn't mean, however, that it isn't benefiting from crawlers that do ignore the protocol. Srinivas explained that the company uses third-party web crawlers on top of its own, and that the crawler Wired identified was one of them. When Fast Company asked if Perplexity told the crawler provider to stop scraping Wired's website, he only replied that "it's complicated."

Srinivas defended his company's practices, telling the publication that the Robots Exclusion Protocol is "not a legal framework" and suggesting that publishers and companies like his may have to establish a new kind of relationship.



So the idea is... "Your complaints are irrelevant, your robots.txt file is not a law."  (Pssh - whatever, I do what I want.)



Messages In This Thread
Unwelcome AI data scraping ... we do what we want. - by Maxmars - 06-24-2024, 01:44 AM


