Suing Writers Seethe at OpenAI's Excuses in Court
Writers who are suing OpenAI for training ChatGPT on their work without permission aren't having any of the AI startup's nonsense.
![Suing Writers Seethe at OpenAI's Excuses in Court](https://lemmy.ml/pictrs/image/54908e65-66fb-4e1d-ae47-f683798f72dd.jpeg?format=webp)
Did anyone expect them to go "oh, okay, that makes sense after all"?
At the crux of the author's lawsuit is the argument that OpenAI is ruthlessly mining their material to create "derivative works" that will "replace the very writings it copied."
The authors shoot down OpenAI's excuse that "substantial similarity is a mandatory feature of all copyright-infringement claims," calling it "flat wrong."
Goodbye Star Wars, Avatar, Tarantino’s entire filmography, every slasher film since 1974…
Uh, yeah, a massive corporation sucking up all intellectual property to milk it is not the own you think it is.
AI training isn’t only for mega-corporations. We can already train our own open source models, so we shouldn't applaud someone trying to erode our rights and let people put up barriers that will keep out all but the ultra-wealthy. We need to be careful not to weaken fair use and hand corporations a monopoly on a public technology by making it prohibitively expensive for regular people to keep developing our own models. Mega corporations already have their own datasets, and the money to buy more. They can also make users sign predatory ToS allowing them exclusive access to user data, effectively selling our own data back to us. Regular people, who could have had access to a corporate-independent tool for creativity, education, entertainment, and social mobility, would instead be left worse off with fewer rights than where they started.
This actually reminds me of a sci-fi series I read where, in the future, they use an AI to scan any new work to see which intellectual property that the big corporations own may have been used as an influence, in order to halt the production of any new media not tied to a pre-existing IP, including 100% of independent and fan-made works.
Which is one of the contributing factors towards the apocalypse. So 500 years later, after the apocalypse has been reversed and human colonies are enjoying post-scarcity, one of the biggest fads is rediscovering the 20th century, now that all the copyrights have expired and people can datamine the ruins of Earth to find all the media that couldn't be properly preserved heading into Armageddon thanks to copyright trolling.
It's referred to in universe as "Twencen"
The series is called FreeRIDErs, if anyone is curious. Unfortunately the series may never have a conclusion (untimely death of a co-creator), but most of its story arcs were finished, so there's still a good chunk of meat to chew through and I highly recommend it.
OpenAI is trying to argue that the whole work has to be similar to infringe, but that's never been true. You can write a novel and infringe on page 302 and that's a copyright infringement. OpenAI is trying to change the meaning of copyright; otherwise, the output of their model is oozing with various infringements.
I can quote work that's already been published; that's allowable and I don't have to get the author's consent to do that. I don't have to get consent because I'm not passing the work off as my own, I am quoting it with reference.
So if I ask the AI to produce something in the style of Stephen King no copyright is violated because it's all original work.
If I ask the AI to quote Stephen King (and it actually does it) then it's a quote and it's not claiming the work is its own.
Under the current interpretation of copyright law (and current law is broken beyond belief, but that's a completely different issue) a copyright breach has not occurred in either scenario.
The only argument I can see working is that if the AI actually can quote Stephen King, that will prove that it has the works of Stephen King in its data set, but that doesn't really prove anything other than that the works of Stephen King are in its data set. It doesn't definitively prove OpenAI didn't pay for the works.
Speaking of slasher films, does anybody know of any movies that have terrible everything except a really good plot?
The Godfather Part III
I don't care what works a neural network gets trained on. How else are we supposed to make one?
Should I care more about modern eternal copyright bullshit? I'd feel more nuance if everything a few decades old was public-domain, like it's fucking supposed to be. Then there'd be plenty of slightly-outdated content to shovel into these statistical analysis engines. But there's not. So fuck it: show the model absolutely everything, and the impact of each work becomes vanishingly small.
Models don't get bigger as you add more stuff. Training only twiddles the numbers in each layer. There are two-gigabyte networks that have been trained on hundreds of millions of images. If you tried to store those images, verbatim, they would each weigh barely a dozen bytes. And the network gets better as that number goes down.
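To put a number on the "barely a dozen bytes" claim, here is the back-of-the-envelope arithmetic as a quick sketch. The figures are assumptions for illustration: "two-gigabyte" is taken as 2 × 10^9 bytes, and "hundreds of millions" is pinned to a hypothetical 200 million images.

```python
# Illustrative figures only: the sizes below are assumed stand-ins for the
# "two-gigabyte network" and "hundreds of millions of images" mentioned above.
model_size_bytes = 2 * 10**9        # assumed: 2 GB model file
num_training_images = 200_000_000   # assumed: 200 million training images

# Average share of the model file available per training image.
bytes_per_image = model_size_bytes / num_training_images
print(f"{bytes_per_image:.0f} bytes of model per training image")  # prints "10 bytes ..."
```

With a larger training set and the same model size, that per-image share only shrinks further.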
The entire point is to force the distillation of high-level concepts from raw data. We've tried doing it the smart way and we suck at it. "AI winter" and "good old-fashioned AI" were half a century of fumbling toward the acceptance that we don't understand how intelligence works. This brute-force approach isn't chosen for cost or ease or simplicity. This is the only approach that works.
Models don’t get bigger as you add more stuff.
They will get less coherent and/or "forget" the earlier data if you don't increase the parameters with the training set.
There are two-gigabyte networks that have been trained on hundreds of millions of images
You can take a huge tiff of an image, put it through JPEG with the quality cranked all the way down and get a tiny file out the other side, which is still a recognizable derivative of the original. LLMs are extremely lossy compression of their training set.
which is still a recognizable derivative of the original
Not in twelve bytes.
Deep models are a statistical distillation of a metric shitload of data. Smaller models with more training on more data don't get worse, they get more abstract - and in adversarial uses they often kick big networks' asses.
Copyright is already just a band-aid for what is really an issue of resource allocation.
If writers and artists weren't at risk of losing their means of living, we wouldn't need to concern ourselves with the threat of an advanced tool supplanting them. Never mind how the tool is created, it is clearly very valuable (otherwise it would not represent such a large threat to writers) and should be made as broadly available (and jointly-owned and controlled) as possible. By expanding copyright like this, all we're doing is gatekeeping the creation of AI models to the largest of tech companies, and making them prohibitively expensive to train for smaller applications.
If LLMs are truly the start of a "fourth industrial revolution" as some have claimed, then we need to consider the possibility that our economic arrangement is ill-suited for the kind of productivity it is said AI will bring. Private ownership (over creative works, and over AI models, and over data) is getting in the way of what could be a beautiful technological advancement that benefits everyone.
Instead, we're left squabbling over who gets to own what and how.
"fourth industrial revolution" as some have claimed
The people claiming this are often the shareholders themselves.
prohibitively expensive to train for smaller applications.
There is so much work out there for free, with no copyright. The biggest cost in training is most likely the hardware, and I see no added value in having AI train on Stephen King ☠️
Copyright is already just a band-aid for what is really an issue of resource allocation.
God damn right but I want our government to put a band aid on capitalists just stealing whatever the fuck they want "move fast and break things". It's yet another test for my confidence in the state. Every issue, a litmus test for how our society deals with the problems that arise.
There is so much work out there for free, with no copyright
There's actually a lot less than you'd think (since copyright lasts for so long), and even less now that online and digitized sources are being locked down and charged for by the domain owners. But even if it were abundant, it would likely not satisfy the true concern here. If there were enough data to produce an LLM of similar quality without using copyrighted data, it would still threaten the security of those writers. What is to say a user couldn't provide a sample of Stephen King's writing to the LLM and have it still produce derivative work without having trained it on copyrighted data? If the user had paid for that work, are they allowed to use the LLM in the same way? If they aren't, who is really at fault, the user or the owner of the LLM?
The law can't address the complaints of these writers because interpreting the law that way is simply too restrictive and sets an impossible standard. The best way to address the complaint is to simply reform copyright law (or regulate LLMs through some other mechanism). Frankly, I do not buy that the LLMs are a competing product to the copyrighted works.
The biggest cost in training is most likely the hardware
That's right for large models like the ones owned by OpenAI and Google, but with the amount of data needed to effectively train and fine-tune these models, if that data suddenly became scarce and expensive it could easily overtake hardware cost. To say nothing of small consumer models that are run on consumer hardware.
capitalists just stealing whatever the fuck they want “move fast and break things”
I understand this sentiment, but keep in mind that copyright ownership is just another form of capital.
seethe
Very concerning word use from you.
The issue art faces isn't that there's not enough throughput, but rather there's not enough time, both to make them and enjoy them.
That's always been the case, though, imo. People had to make time for art. They had to go to galleries, see plays and listen to music. To me it's about the fair promotion of art, and the ability for the art enjoyer to find art that they themselves enjoy rather than what some business model requires of them, and the ability for art creators to find a niche and to be able to work on their art as much as they would want to.
Headline is stupid.
Millennial journalism has fucking got to stop with these clown word choices...
I think the place we haven't quite gotten to yet is that copyright is probably the wrong law for this. What the AI is doing is reverse engineering the author's magic formula for creating new works, which would likely fall under patent law.
In the past this hasn't really been possible for a person to do reliably, and it isn't really quantifiable as far as filing a patent for your process, yet the AI does it anyway, leaving us in a weird spot.
US patent professional here
Ya, saying it isn't possible to do under patent law is no understatement. Even making the patent applications possible to allow would require changes to 35 U.S.C. 112(a) (and probably also 112(b)) and 35 U.S.C. 101. This all assumes that all authors would have the time and money and energy to file a patent, which even with a good attorney is many, many hours of work, and filing pro se would be like writing a whole new book. After the patent is allowed, the costs of continuation applications to account for changes in the process as the author learns and grows would be a hellish burden. After this comes the 20 year lifespan of a patent (assuming all maintenance fees are paid, which is quite the assumption, those are not cheap), at which point the patent protections are dead and the author needs to invent a new process to be protected. Don't even get me started on enforcing a patent.
Patent law is fundamentally flawed to be sure, but even if every author got infinite money and time to file patents with, the changes needed to patent law to let them do so would leave patent law utterly broken for other purposes.
Using patent law for this is a good idea to bring up, but for the above reasons I don't think it is viable at all. It would be better and more realistic to have Congress change copyright law than to change patent law, I think. Sadly, I don't think that is particularly likely either. :(
And patent law is even more broken than copyright law.
I don't know if I would say more broken; at least patents have limits on how long they can exist for, putting an upper bound on how much damage they can cause. Then again, limiting the production of vaccines during a pandemic is a lot more urgent than letting people do Mickey Mouse cartoons, so the standard for what counts as broken has to be a lot more stringent. It is more important for patent law to not be broken than it is for copyright law, so the same amount of brokenness feels worse with patents.
What the AI is doing is reverse engineering the author's magic formula for creating new works
Great but the humans involved knowingly let it scrub pirated works.
This is the best summary I could come up with:
ChatGPT creator OpenAI has been on the receiving end of two high profile lawsuits by authors who are absolutely livid that the AI startup used their writing to train its large language models, which they say amounts to flaunting copyright laws without any form of compensation.
One of the lawsuits, led by comedian and memoirist Sarah Silverman, is playing out in a California federal court, where the plaintiffs recently delivered a scolding on ChatGPT's underlying technology.
At the crux of the author's lawsuit is the argument that OpenAI is ruthlessly mining their material to create "derivative works" that will "replace the very writings it copied."
The authors shoot down OpenAI's excuse that "substantial similarity is a mandatory feature of all copyright-infringement claims," calling it "flat wrong."
It can brag that it's a leader in a booming AI industry, but in doing so it's also painted a bigger target on its back, making enemies of practically every creative pursuit.
High profile literary luminaries behind that suit include George R. R. Martin, Jonathan Franzen, David Baldacci, and legal thriller maestro John Grisham.
The original article contains 369 words, the summary contains 180 words. Saved 51%. I'm a bot and I'm open source!
Here’s current guidance from US Congress regarding AI copyright infringement.
Page 3 includes guidance on fair use.
"substantial similarity is a mandatory feature of all copyright-infringement claims"
Is that not a requirement? Time for me to start suing people!
I take it we don't use the phrase "good writers borrow, great writers steal" in this day and age...
Wait till they find out photographers spend their whole careers trying to emulate the style of previous generations. Or that Adobe has been implementing AI-driven content creation into Photoshop and Lightroom for years now, and we've been pretending we don't notice because it makes our jobs easier.
Wah. Waaaah. Cry more rich people.
Writers are rich because they've made artwork and sold it. I personally hold that to a higher value than CEOs.
And while these ones may not be badly off, most writers are far from rich.
Amazing how every new generation of technology has a generation of users of the previous technology who do whatever they can to stop its advancement. This technology takes human creativity and output to a whole new level, it will advance medicine and science in ways that are difficult to even imagine, it will provide personalized educational tutoring to every student regardless of income, and these people are worried about the technicality of what the AI is trained on and often don't even understand enough about AI to make an argument about it. If people like this win, whatever country's legal system they win in will not see the benefits that AI can bring. That society is shooting itself in the foot.
Your favorite musician listened to music that inspired them when they made their songs. Listening to other people's music taught them how to make music. They paid for the music (or somebody did via licensing fees or it was freely available for some other reason) when they listened to it in the first place. When they sold records, they didn't have to pay the artist of every song they ever listened to. That would be ludicrous. An AI shouldn't have to pay you because it read your book and millions like it to learn how to read and write.
You’re humanizing the software too much. Comparing software to human behavior is just plain wrong. GPT can’t even reason properly yet. I can’t see this as anything other than a more advanced collage process.
OpenAI used intellectual property without consent of the owners. Major fucked.
If ‘anybody’ does anything similar to tracing, copy&pasting or even sampling a fraction of another person’s imagery or written work, that anybody is violating copyright.
If ‘anybody’ does anything similar to tracing, copy&pasting or even sampling a fraction of another person’s imagery or written work, that anybody is violating copyright.
Ok, but tracing is literally a part of the human learning process. If you trace a work and sell it as your own that's bad. If you trace a work to learn about the style and let that influence your future works that is what every artist already does.
The artistic process isn't copyrighted, only the final result. The exact same standards can apply to AI generated work as already do to anything human generated.
sampling a fraction of another person's imagery or written work.
So citing is a copyright violation? A scientific discussion on a specific text is a copyright violation? This makes no sense. It would mean your work couldn't build on anything else, and that's plain stupid.
Also, to your first point about reasoning and an advanced collage process: you are right and wrong. Yes, an LLM doesn't have the ability to use all the information a human has or be as precise, therefore it can't reason the same way a human can. BUT, and that is a huge caveat, the inherent goal of AI, and in its simplest form neural networks, was to replicate human thinking. If you look at the brain and then at AIs, you will see how close the process is. It usually goes like this: the AI is given an input, the AI tries to give the desired output, then the AI gets told what it should have looked like, and then it backpropagates to reinforce its process. This is already pretty advanced and human-like (even look at how the brain is made up and then how AI models are made up, it's basically the same concept).
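For anyone who wants to see that loop concretely, here is a minimal sketch of it with a toy "network" that has a single weight, so backpropagation collapses to one derivative. Everything here is made up for illustration (the `train` function, the learning rate, and the target rule y = 2x); real models have billions of weights and use a framework to compute the gradients.

```python
# Toy model with one weight: prediction = w * x.
# Hypothetical target rule for the example: the "right answer" is always 2 * x.
def train(samples, learning_rate=0.05, epochs=200):
    w = 0.0  # start with an uninformed weight
    for _ in range(epochs):
        for x, target in samples:
            prediction = w * x              # the model tries to give the output
            error = prediction - target     # it is told what the output should have been
            gradient = 2 * error * x        # derivative of error**2 with respect to w
            w -= learning_rate * gradient   # nudge the weight to shrink the error
    return w

samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
learned_w = train(samples)
print(round(learned_w, 4))  # close to 2.0
```

Note that the samples themselves are not kept anywhere after training; only the adjusted weight remains.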
Now you would be right to say "well, in its simplest form LLMs like GPT are just predicting which character or word comes next" and you would be partially right. But in that process it incorporates all of the "knowledge" it got from its training sessions and a few valuable tricks to improve. The truth is, differences between a human brain and an AI are marginal, and it mostly boils down to efficiency and training time.
And to say that LLMs are just "an advanced collage process" is like saying "a car is just an advanced horse". You're not technically wrong but the description is really misleading if you look into the details.
And for detail's sake, this is what the paper for Llama2 looks like; the latest big LLM from Facebook that is said to be the current standard for LLM development:
No that's not how it works. It stores learned information like "word x is more likely to follow word y than word a" or "people from country x are more likely to consume food a than b". That is what is distributed when the AI model is shared. To learn that, it just reads books zillions of times and updates its table of likelihoods. Just like an artist might listen to a Lil Wayne album hundreds of times and each time they learn a little bit more about his rhyme style or how beats work or whatever. It's more complicated than that, but that's a layperson's explanation of how it works. The book isn't stored in there somewhere. The book's contents aren't transferred to other parties.
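As a rough sketch of that "table of likelihoods" idea, here is a word-pair counter in Python. To be clear about assumptions: the function names and the sample sentence are invented for the example, and an actual LLM does not keep an explicit table like this. It encodes its statistics in neural network weights, so treat this only as the layperson's version described above.

```python
from collections import Counter, defaultdict

def build_bigram_table(text):
    """Count how often each word is followed by each other word."""
    words = text.lower().split()
    table = defaultdict(Counter)
    for current, following in zip(words, words[1:]):
        table[current][following] += 1
    return table

def most_likely_next(table, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    followers = table.get(word)
    if not followers:
        return None
    return followers.most_common(1)[0][0]

# Hypothetical training text, purely for illustration.
sample = "the cat sat on the mat and the cat slept"
table = build_bigram_table(sample)
print(most_likely_next(table, "the"))  # prints "cat" (seen twice, versus "mat" once)
```

What gets shared in this toy version is the counts, which is the point the comment is making, although a table built from a single short text like this one is obviously much closer to the source than one built from millions of books.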
It's less about copying the work, it's more like looking at patterns that appear in a work.
To bring a very rudimentary example: if I wanted a word and the first letter was Q, what would the second letter be?
Of course, statistically, the next letter is u, and it's not common for words starting with Q to have a different letter after that. ML/AI is like taking these small situations, but having a ridiculous number of parameters to come up with something based on several internal models. These parameters of course generally have some context.
It's like if you were told to read a book thoroughly, and then afterwards were told to reproduce the same book. You probably could not make it 1:1, but could probably get the general gist of the story. The difference between you and the machine is that the machine has read a lot of books and contextually knows patterns, so it can generate something similar faster and more accurately, but not the original one-for-one.
I don't think that Sarah Silverman and the others are saying that the tech shouldn't exist. They're saying that the input to train them needs to be negotiated as a society. And the businesses also care about the input to train them because it affects the performance of the LLMs. If we do allow licensing, watermarking, data cleanup, synthetic data, etc. in a way that is transparent, I think it's good for the industry and it's good for the people.
It's a bit more than that if the AI is told to make something in the style of.
Amazing how every generation of technology has an asshole billionaire or two stealing shit to be the first in line to try and monopolize society's progress.
This technology takes human creativity and output to a whole new level,
No, it doesn't. There's nothing "human" or "creative" about the output of AI.