I haven't been very worried about AI, even though I'm a writer.
Why?
Because it takes a while for the legal teams employed by the titans of old media to rumble into action, but it was always clear they were coming. These are the teams that don't sue other companies unless they're certain of winning.
And today, the New York Times sued OpenAI for several billion dollars.
https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html
BTW: this isn't even the lawsuit OpenAI is *really* scared of.
The House of Mouse has yet to step up to the plate against them, or Midjourney, or the like.
Modern copyright laws are terrifying, y'all, and the courts have found definitively and repeatedly that AI products are derivative material and cannot be copyrighted. Unless copyright law is completely rewritten (and BTW, it needs to be), the *only* thing AI can be used for is to build better search.
Just another tech hustle, like crypto & NFTs.
@Impossible_PhD Heh. Was wondering when that'd happen.
I guess, regarding the whole thing, I've been of a mindset to give it time and see where it ends up after the FUD/hype shakes out.
@Impossible_PhD what I'm curious about is: how much effort is it to re-train such a generative model?
They're gonna get sued again and again, and each time it's gonna end with "remove our stuff from your model and pay us damages, or keep it in and pay us damages and licensing fees". And as far as I understand the way these models work, it's impossible to "remove" anything, because the training data isn't stored inside as discrete units; it all contributes to the weights and biases of the artificial neurons.
So will they have to re-train their models after each lawsuit with one source fewer? Or how is that gonna work?
@amberage @Impossible_PhD Pretty much. It takes the same effort and cost as it took to train them in the first place, every time you need to remove something from the training set. There are shortcuts, but they might not hold up in a legal sense.
@theartlav @amberage Correct. Post-training, the way these things are made, material *cannot* be removed. You have to retrain from scratch.
@Impossible_PhD @theartlav @amberage And and and...you have to have a method for enumerating what it's been trained on, which right now no one seems to care about. Can't wait for pipeline engineers to become all the rage.
@Impossible_PhD@hachyderm.io @theartlav@hachyderm.io @amberage@eldritch.cafe I tried an experiment. I have a coding assistant called Tabnine that is only trained on code licensed under something like MIT or Apache, i.e. code you can use in free or non-free software. It is helpful for looking things up, for one thing.
I asked it "Who is Karl Marx?" and it started to answer before it refreshed and told me off for breaking the terms of service regarding politics, religion, etc. It is possible to add guardrails after the fact, imperfectly: as demonstrated, there was enough in the dataset for it to attempt an answer, so it did, and the flicker of an answer I got before it cut out appeared correct, not nonsense.
@Impossible_PhD@hachyderm.io @theartlav@hachyderm.io @amberage@eldritch.cafe I expect they may also be using a generic base model that has been trained on more than just code, though they don't seem to disclose what it is.
@Impossible_PhD@hachyderm.io @theartlav@hachyderm.io @amberage@eldritch.cafe I think for the package I'm using they may be able to train on my code as well, but that's not a problem in this case because I am willing to consent to that.
@amberage@eldritch.cafe @Impossible_PhD@hachyderm.io @simontoth@hachyderm.io they could train models on only copyright-free/copyleft/their own sources, but 1) the quality would be lower, and 2) they still couldn't copyright the results, so they couldn't use it for certain things; they'd just be, like, lazy stock art generators
Edit: copyleft/other licenses still wouldn't be enough for them to use. The main point was that they'd be limited significantly by being held to legal uses.
@rachel @amberage @simontoth The problem with that is the 95-year threshold for copyright expiry. Anything trained on such old material would spit out prose that's *hopelessly*, unrecognizably weird.
@Impossible_PhD@hachyderm.io @amberage@eldritch.cafe @simontoth@hachyderm.io true. There is art/writing/etc. that people produce today and publish with explicit copyright-free/copyleft licences. Some realms have more of it than others, and it is absolutely in lower volume than what is otherwise obtainable for training.
Overall my biggest fear is how it'll be used to accelerate the crumbling of the web as we know it, and the torrent of shit news articles.
@Impossible_PhD@hachyderm.io @amberage@eldritch.cafe @simontoth@hachyderm.io mostly thinking of things like open-source code, or social media posts used for training by the social media companies themselves, and even that is still full of legal landmines
@rachel @amberage @simontoth Courts have already found that tweets are copyrightable material, and property of their writers. So yeah, they can't be used.
@Impossible_PhD@hachyderm.io @amberage@eldritch.cafe @simontoth@hachyderm.io it would please me so very much to see musky get sued by his own site's users for his attempts at making an anti-woke fancy autocomplete
They probably added something into the unenforceable EULA saying they can do it
@rachel @Impossible_PhD @amberage @simontoth
If the business case gets made well enough, I’ve figured for a while we’d end up in an uncomfortably dystopian scenario where content farming by AI companies becomes normal.
I.e., people are hired just to produce stuff for which they've signed away the rights.
Historically high rates of education/skill plus the capitalism hellscape lead pretty neatly to such a state, IMO.
These “content cattle” will be doing what they’re passionate about anyway.
@rachel @Impossible_PhD @amberage @simontoth
Like, “are you a philosopher but need to pay rent? Come, debate ideas and publish for pay!”
“Music career not panning out and sick of the YouTube algorithm? … just jam everyday with us!”
Seems to tie too well into the education industrial complex that’s developed around the middle class and their “passion coddling”.
@rachel @Impossible_PhD @amberage @simontoth stuff I publish (code) under open-source or copyleft licenses still isn't suitable training data: there are license obligations of varying intensity, and nobody training these massive models is obeying them. I adamantly oppose treating publicly shared content as obligation-free, even if all the tech bros wish it were for their bottom line. I have a bigger rant in the footer of my website, but I won't dox my alt
@rachel @Impossible_PhD @amberage @simontoth tldr: use my code and follow the associated license. Don't train your plagiarism machine on it.
@Specialist_Being_677@hachyderm.io @Impossible_PhD@hachyderm.io @amberage@eldritch.cafe @simontoth@hachyderm.io oh yeah, I definitely should not have included copyleft there; just like GitHub Copilot's training on GPL software is certainly a violation
@rachel @Impossible_PhD @amberage @simontoth
There are a lot of newer results showing that a smaller model trained on a smaller but well-curated data set can be as good as these huge models trained on the garbage that is the entire internet.
My guess is the legal issues will accelerate the move towards this approach. It lets you use traceable and properly licensed data, and it costs less overall to both train and use.
@Impossible_PhD @rachel @amberage @simontoth
My favourite bit of meaningless trivia is that both James Joyce's *Ulysses* and Timothy Dexter's *A Pickle for the Knowing Ones* are in the public domain, and I would kill to see a plagiaristic, stochastic parrot poisoned by them and so, so many others.
@rachel @Impossible_PhD @amberage @simontoth No, copyleft is even worse for them. It means they have to make the model and all derivative works free under the same license, if they can. If they can't (because of other conflicting legal obligations), they can't distribute the derivative works at all.
@Impossible_PhD Unless I missed something, the only cases around AI were simply about "not made by a human and therefore cannot be copyrighted".
@simontoth Mmm hmmm. Thing is, that means it's derivative *by definition*.
Originality is the determinant of what is and is not copyrightable. Everything that is not original is either derivative or out-and-out theft. In either case, massive infringement. Proving originality is the only defense against a robust copyright infringement lawsuit.
@Impossible_PhD Well, no. The "Monkey selfie" was original, but it was rejected for the same reason.
@Impossible_PhD I will make a personal speculation: the Mouse had two events to wait for: a) the twin strikes, which centred on generative AI, and b) what happens to the actual mouse on 1 Jan, when it goes out of copyright.
those outcomes no doubt affect some of the phrasing in the filings.
not that my speculations matter in the slightest, it's just fun to muse on it
@Impossible_PhD This article https://web.law.duke.edu/cspd/mickey/ makes me think that the house of mouse might not be anxious to enter this fray, given their reliance on both sides of copyright. I also think that OpenAI/MSFT will use a very different tactic in court, relying on the argument that AI training is not an expressive use of the material. EU regulation and US law can be read in a way that would support such an argument. Only the courts can decide if it would apply.
@Impossible_PhD but i sure do wish we’d rewrite copyright law
@Impossible_PhD
Hmm. That's not quite enough to stop me worrying, but it's nice to see.
@Impossible_PhD Download PrivateGPT while you can…
@knutson_brain @Impossible_PhD
Seriously, I'd recommend GGML. No dependencies, works with all sorts of models, all local.
And I do recommend playing with it; it shows just how bad these things can be on average, without having to deal with the sites and limits.
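If you want to try that locally, here's a minimal sketch using the llama-cpp-python bindings (one common wrapper around the GGML/llama.cpp runtime); the model filename below is a placeholder for whichever quantised model you download, not a specific recommendation.

    # Run a local GGML/GGUF model entirely offline (pip install llama-cpp-python).
    from llama_cpp import Llama

    llm = Llama(model_path="./some-7b-model.Q4_K_M.gguf", n_ctx=2048)  # placeholder filename
    out = llm("Q: Why run a language model locally?\nA:", max_tokens=128)
    print(out["choices"][0]["text"])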
@Impossible_PhD hahaha awesome! I'm a writer too and so far am not worried about AI either (art is a different story). I played with ChatGPT and the output is awful. It could not do my job. I hope the NYT wins!!
@Jennifer Nah. Disney gets to eat the Midjourneys of the world, and Disney has no mercy whatsoever. When the House of Mouse eventually files its lawsuit, it'll make this look like amateur hour.
@Impossible_PhD I'm not familiar with Midjourney. I'm not a fan of Disney, but if OpenAI is infringing on any of their copyrights, that will be one hell of a lawsuit!
@Jennifer Midjourney is ChatGPT, except with pictures. OpenAI incorporating image generation into GPT-4 was a duuuumb move, because now Disney gets to come for them too.
@Impossible_PhD ooooh ok! LOL. My husband is an artist so AI for art makes me kind of mad.
@Impossible_PhD This is good news. Waiting to see if they succeed.
@anne_twain I'd bet just about everything I own that they will. OpenAI has no actual legal defense here, given prior findings that OAI product is non-copyrightable. Their only way out is to settle, and I'm 99% sure that the Times won't settle, because their actual objective here is to shut down OpenAI.
@Impossible_PhD @anne_twain
Whether or not the output of generative AI can be copyrighted isn't really a question here though. It's unrelated to the question of copyright infringement.
I'm still of the opinion that these various copyright infringement cases are going to fail, that OpenAI and others have a relatively strong fair use defense. But we'll see how things are ruled in the end.
@hybridhavoc @Impossible_PhD Well spotted. It's not the question we're discussing.
The question we're asking is whether generative AI is infringing the copyright of the newspaper. That's what the lawsuit is about.
@Impossible_PhD@hachyderm.io Except we know OpenAI and Microsoft will have planned for this, and Microsoft has very much been through and won on anti-trust suits, so getting an unethical and potentially illegal business model past legal challenges isn't new ground for them. My guess would be that, as then, they intend to come through with most but not all of it intact.
@alastair This isn't a regulatory case, though. Not even close. The law is completely different, and copyright has almost no loopholes--decades of lobbying from Disney and other old media titans have closed almost every single one, and the remainders don't stand a chance of protecting OAI here.
@Impossible_PhD@hachyderm.io I'm inclined to assume that, broadly speaking, anything Disney etc. have considered, OpenAI/Microsoft will have too. So to my mind that means they have planned for losing this, which makes me wonder what the next gambit would be. They could perhaps pay the media companies for training data, and I'm fairly skeptical that would lead to much, if anything, extra for a lot of writers in many cases.
@alastair They've already said publicly that even the lowest possible training licensing fees would make AI impossible: we're talking tenths of a cent per trained item here. The volume they need to feed these things is *mind-shatteringly* large.
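For a sense of scale, some purely illustrative arithmetic; both numbers below are assumptions made up for the example, not figures from OpenAI or the lawsuit.

    # Back-of-envelope: a "tenths of a cent" fee at web-crawl scale (assumed numbers).
    items_in_training_corpus = 250_000_000_000   # assume ~250 billion crawled pages
    fee_per_item_usd = 0.002                     # assume 0.2 cents per trained item
    total = items_in_training_corpus * fee_per_item_usd
    print(f"${total:,.0f} per full training run")  # -> $500,000,000 per full training run

And, per the retraining discussion upthread, every forced removal would repeat that bill.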
@Impossible_PhD @mattly the NYT is something you consider aligned with good?
“We have IP extortion and protection rackets on our side” doesn’t make me feel better
The NYT is going to sue the AI makers, win some large amount of money, and then make a deal with them that totally fails to protect the rest of us.
The real reason AI will fail is that it can't actually do any of the things that it's supposed to do.
@Impossible_PhD@hachyderm.io OpenAI is great for creating first drafts of things, especially for people who aren't hugely comfortable writing. Then, once that first draft has been generated, a human will need to take over and confirm information, verify sources, and begin the editing/rewriting process.
@thelaughingmuse Yeah, fun fact: if you start with copyright infringement, no matter what you do with it, the product is a derivative work.
It's still copyright infringement.
And just as a note, I'm a professor of writing. I know very, very well the complexities of the writing process. There are far better ways, which will yield a superior product regardless of comfort or proficiency.
@Impossible_PhD The Times is suing over verbatim reproduction of their articles.
It remains to be decided if training on copyrighted material is a problem or not.
Assuming ChatGPT really will spit out articles verbatim in response to appropriate prompts, I think the Times has a good case.
I'm personally much more skeptical about the training copyright violation assertions.
@colo_lee I mean, courts have already determined that AI writing is not copyrightable because it's inherently derivative work.
I'm not skeptical. It's barely a hard breath from current findings to that.
@Impossible_PhD Yeah, we'll see.
Human work is clearly copyrightable, and AI output is not. That seems right.
But it doesn't mean the use of copyrighted works to train AI is a violation.
It'll be interesting to see how this plays out legally
Thank you @Impossible_PhD for the hopeful take.
For those who want the article without the NYT site silliness: https://web.archive.org/web/20231227131618/https://www.nytimes.com/2023/12/27/business/media/new-york-times-open-ai-microsoft-lawsuit.html