This framing is so gross. To see (human!) generated (ahem: English) text as a "vital resource" you have to be deeply committed to the project of building AI models, and in this particular way.
Link to original tweet:
https://twitter.com/emollick/status/1605756428941246466
Link to paper:
https://arxiv.org/pdf/2211.04325.pdf
@emilymbender There is a demand for low-background steel, steel produced before the mid-century nuclear tests, for use in Geiger counters. They produce it by scavenging ships sunk during World War One, as it's the only way they can be sure there is no radiation.
The same is going to happen for internet data: only archives pre-2022 will be usable for sociology research and the like, as the rest will be contaminated by AI nonsense. Absolute travesty.
@divclassbutton
Furthermore, training on the AI-generated text will produce feedback loops. We move from hallucinating models to ongoing fever dreams.
@emilymbender
And some of those feedbacks will be extraordinary. There is a guy, Matt Loughrey, who used AI models at the start of 2021 to recolour B+W photos of the victims of the Khmer Rouge. The colouring was pretty good... But it also changed the photos so the victims are smiling.
If those photos are part of current AI models, that'll represent a total rewrite of history, in an absolutely frightening way.
@divclassbutton
Oh, wow. That's a really dramatic and compelling example.
@divclassbutton @OwenK @emilymbender do you have a link for this horrifying story? It's a useful parable for thinking about the bias baked into the large models; it would be nice to have a pointer to the specific examples
@trochee
Yeah, I agree. Here's the best thing I found:
https://www.irishtimes.com/culture/art-and-design/visual-art/the-khmer-rouge-controversy-why-colourising-old-photos-is-always-a-falsification-of-history-1.4536637
It focuses on how transforming part of the historical record is problematic, but it also has a decent description of what happened in this particular case. If you find a more informative source, I'd like to know about it.
@divclassbutton @emilymbender
@divclassbutton @OwenK @emilymbender I wish @marinamaral2@twitter.com could see these. Grotesque.
We don't even have to wait for an AI (#KI) to really mess things up. Humans still manage that quite well on their own. How a colourisation with AI tools became a grotesque falsification of history: https://www.irishtimes.com/culture/art-and-design/visual-art/the-khmer-rouge-controversy-why-colourising-old-photos-is-always-a-falsification-of-history-1.4536637
A good example for #FediLZ
Hat tip to @divclassbutton and @OwenK for sharing this!
@divclassbutton @emilymbender @cstross
We see this already. Having a standing Google Search for my name, I'll see an original human-written article quoting me, and then a series of increasingly word-salad automatic re-edits by ripoff sites, and it's been going on for a year+ now. I suspect that, unless human-curated, no dataset with data after 2020 is valid for ML training and a bunch of other purposes.
@divclassbutton @emilymbender @cstross I believe that @janellecshane has christened this uncontaminated data “low-botground data”.
@divclassbutton @emilymbender I am sure someone will be touting an AI that will go through archives to remove all text written by an AI.
@divclassbutton @emilymbender As the earliest viable brain scan, MMAcevedo is one of a very small number of brain scans to have been recorded before widespread understanding of the hazards of uploading and emulation.
I view data contamination as a probable necessity given how few scruples businesses and govs have in using our data.
@divclassbutton
The problematic isotopes had short half-lives and have decayed to a point where low-background steel is no longer a thing. But the internet archives will glow in the dark for some time yet, I suspect.
@emilymbender
@divclassbutton @emilymbender For years, there was a vast "mothball fleet" of WWI warships chained together just north of the Benicia Bridge on the Sacramento River near San Francisco. I believe they were kept there in case they needed recommissioning. I've noticed that most of them are gone now. It was a sight to see dozens of WWI combat ships chained together floating on the river.