Hachyderm @hachyderm

Recent searches

Search options

Only available when logged in.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

Just now realizing that all of plaintiff's counsel in Silverman v. OpenAI, Inc. have @mofo.com email addresses

Screenshot of a court document header for Case 3:23-cv-03416-AMO Document 59

Attorneys from Morrison & Foerster LLP are listed, along with their emails (along with SBN and other contact information)

Joseph C. Gratz
jgratz@mofo.com

Tiffany Cheung
tcheung@mofo.com

Allyson R. Bennett
abennett@mofo.com

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

they gettin' sued by these mofos

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

I didn't realize that they'd also filed against Meta. Looks like there was a recent ruling by the judge to dismiss many of the complaints in that case.

Document pictured above is OpenAI's counsel notifying the court about the dismissal in the other case with some carefully-chosen (but then, aren't they all?) quotations from the ruling.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

The Meta case is perhaps the more interesting one, I wish I'd learned about it sooner!

It's filed as a class action suit, because there's a much more obvious argument that the model is built on infringing material, because Meta does (slightly) less obfuscation than OpenAI about their training data, and some of what's known to be in there definitely has infringing material in it.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

The judge's ruling seems like a good omen for those who would like to train models on flagarantly stolen materials:

«There is no way to understand the LLaMA models themselves as a recasting or adaptation of any of the plaintiffs’ books.»

Obviously this ruling is not a final authority or anything, but it's a sign that the bar is pretty low to slip one past the legal system here. Definitely the weakest part of the ruling, citing only 17 U.S.C. § 101 for the definition of a derivative work.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

Now, that said, I think some of the dismissal was a foregone conclusion. The original complaint in the OpenAI case includes an exhibit dedicated to showing how the model produces detailed summaries of the books in question.

Is that open and shut infringement? No, of course not, but it's easy to see how that crosses the line for *plausible* infringement.

The complaint against Meta includes no such specific claims, it mostly rests on the argument that the dataset is itself an infringing work.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

I think that argument is *correct*, but in court it's not so much about whether you're right as whether you can convincingly argue that you're right.

It's plainly obvious that LLaMA would be a different model without the infringing parts of the training data. Unclear if it would be *meaningfully* distinct.

But there's a fine line to walk here. An author who was inspired by another author's work is not necessary producing a derivative work, even if they explicitly state their inspiration.

**SnoopJ** @SnoopJ · Nov 29, 2023 *

Nov 29, 2023 *

SnoopJ @SnoopJ

To be clear, I think that a forensic analysis of LLM weights probably wouldn't have much trouble surfacing verbatim text from the training data.

Whether or not that would be enough to cross from "fair use" into infringement is hard to speculate about, but >1 trillion parameters is an awful lot of space with which to effectively memorize your training data.

SnoopJ @SnoopJ@hachyderm.io

Timely coincidence: just as I'm musing over the protected works contained in language model weights, some researchers drop an attack for extraction of O(MB) verbatim training data text from one of the popular ones:

https://www.404media.co/google-researchers-attack-convinces-chatgpt-to-reveal-its-training-data/

(authors talking about their own work: https://not-just-memorization.github.io/extracting-training-data-from-chatgpt.html)

404 Media · Nov 29, 2023Google Researchers’ Attack Prompts ChatGPT to Reveal Its Training DataChatGPT is full of sensitive private information and spits out verbatim text from CNN, Goodreads, WordPress blogs, fandom wikis, Terms of Service agreements, Stack Overflow source code, Wikipedia pages, news blogs, random internet comments, and much more.

Nov 29, 2023, 05:19 PM··Web

5boosts·6favorites

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

Favorite excerpt:

> We find that ChatGPT emits unique memorized strings at a much higher rate than any of the publicly available models we studied. In particular, if the GPT-Neo 6B scaling curve were to hold roughly similar for ChatGPT, we estimate the true rate of memorization of ChatGPT (within our auxiliary dataset) is likely closer to hundreds of millions of 50-token sequences, totaling a gigabyte of training data. In practice we expect it is likely even higher.

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

also present: our old friend zlib complexity

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

it's awesome that they put a specific attacker cost on this, especially in the legal context where this thread started its life

$200 to extract ≤0.02% of the training dataset gives you a sense of how expensive it would be to conclusively prove broad infringement in a closed-box scenario.

Granted, if you are trying to show infringement of a *particular* work, there are probably more concentrated attacks and the cost/MB of extracted training data for such an attack *might* be much lower

**SnoopJ** @SnoopJ · Nov 29, 2023

Nov 29, 2023

SnoopJ @SnoopJ

IANAL, but I think that legally speaking you wouldn't really need to clear the bar on "proof" in order to use discovery as a crowbar to pry the box open, and once the box is open, more specific (if not also more complex!) inquiry is possible.

But we're still talking about something like O($10k) to get into single-digit percentages where you *might* be able to get your hands on the crowbar.

Drag & drop to upload

Recent searches

Search options

Administered by:

Server stats:

Recent searches

Search options

Administered by:

Server stats:

Back