Just now realizing that all of plaintiff's counsel in Silverman v. OpenAI, Inc. have @mofo.com email addresses
they gettin' sued by these mofos
I didn't realize that they'd also filed against Meta. Looks like there was a recent ruling by the judge to dismiss many of the complaints in that case.
Document pictured above is OpenAI's counsel notifying the court about the dismissal in the other case with some carefully-chosen (but then, aren't they all?) quotations from the ruling.
The Meta case is perhaps the more interesting one, I wish I'd learned about it sooner!
It's filed as a class action suit, because there's a much more obvious argument that the model is built on infringing material, because Meta does (slightly) less obfuscation than OpenAI about their training data, and some of what's known to be in there definitely has infringing material in it.
The judge's ruling seems like a good omen for those who would like to train models on flagarantly stolen materials:
«There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.»
Obviously this ruling is not a final authority or anything, but it's a sign that the bar is pretty low to slip one past the legal system here. Definitely the weakest part of the ruling, citing only 17 U.S.C. § 101 for the definition of a derivative work.
Now, that said, I think some of the dismissal was a foregone conclusion. The original complaint in the OpenAI case includes an exhibit dedicated to showing how the model produces detailed summaries of the books in question.
Is that open and shut infringement? No, of course not, but it's easy to see how that crosses the line for *plausible* infringement.
The complaint against Meta includes no such specific claims, it mostly rests on the argument that the dataset is itself an infringing work.
I think that argument is *correct*, but in court it's not so much about whether you're right as whether you can convincingly argue that you're right.
It's plainly obvious that LLaMA would be a different model without the infringing parts of the training data. Unclear if it would be *meaningfully* distinct.
But there's a fine line to walk here. An author who was inspired by another author's work is not necessary producing a derivative work, even if they explicitly state their inspiration.
To be clear, I think that a forensic analysis of LLM weights probably wouldn't have much trouble surfacing verbatim text from the training data.
Whether or not that would be enough to cross from "fair use" into infringement is hard to speculate about, but >1 trillion parameters is an awful lot of space with which to effectively memorize your training data.
Timely coincidence: just as I'm musing over the protected works contained in language model weights, some researchers drop an attack for extraction of O(MB) verbatim training data text from one of the popular ones:
https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/
(authors talking about their own work: https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html)
Favorite excerpt:
> We find that ChatGPT emits unique memorized strings at a much higher rate than any of the publicly available models we studied. In particular, if the GPT-Neo 6B scaling curve were to hold roughly similar for ChatGPT, we estimate the true rate of memorization of ChatGPT (within our auxiliary dataset) is likely closer to hundreds of millions of 50-token sequences, totaling a gigabyte of training data. In practice we expect it is likely even higher.
also present: our old friend zlib complexity
it's awesome that they put a specific attacker cost on this, especially in the legal context where this thread started its life
$200 to extract ≤0.02% of the training dataset gives you a sense of how expensive it would be to conclusively prove broad infringement in a closed-box scenario.
Granted, if you are trying to show infringement of a *particular* work, there are probably more concentrated attacks and the cost/MB of extracted training data for such an attack *might* be much lower
IANAL, but I think that legally speaking you wouldn't really need to clear the bar on "proof" in order to use discovery as a crowbar to pry the box open, and once the box is open, more specific (if not also more complex!) inquiry is possible.
But we're still talking about something like O($10k) to get into single-digit percentages where you *might* be able to get your hands on the crowbar.