World News @lemmy.world sabbah @lemmy.world 1 yr. ago

Two authors are suing OpenAI for training ChatGPT with their books. Could they win?

theconversation.com

Mona Awad and Paul Tremblay’s lawsuit claims their books were used without their consent. But copyright protection doesn’t apply to ideas – they’ll need to demonstrate the likelihood of economic loss.

41 comments

I think there's an argument that using someone's art or writing to train an AI is like charging for a screening of a movie in your garage. You're using their work and labor, without their permission, for something that will make a profit. It's not like Fair Use for educational purposes: the AI isn't a human being who can make a choice as to what they do with their education, it's a mathematical prediction engine that is going to be used for industry purposes.

I can read someone else's book. I can read someone else's book to a child. I can't post someone else's book on my website and charge 5 bucks to read it. I can't reprint someone's book on my website with ads. So why can someone use someone else's book to develop an LLM chatbot that will be placed on a website that gains ad revenue? Or that will be sold to software companies to write technical instructions or code?

With that in mind, the lawsuit here is based on COPYING the book, by scanning it, into an internal database to train on. They are arguing that the book was reproduced to gain a profit, basically the same thing as pirating a movie and selling tickets to a private screening.
- I can't post someone else's book on my website and charge 5 bucks to read it.
  
  No, but you can read someone else's book and then later write a book inspired by theirs and sell that.
  
  Which is what ai does, as far as I know.
  
  I'm not trying to argue with the rest of your comment, but that middle part looks like false equivalency to me. "I can do this but not that, so why would ai developers be allowed to do this completely different thing" just has no logic to it.
  
  The AI isn't redistributing copies of even sections of the book, it just learnt from it. It's like when you read books and gain an understanding of how they are structured and such and then you write your own book based on what you've learnt from reading books.
  
  Also, screw it. I'll say it. If the LLM chatbot producing text from having scanned other books is the same as a person being inspired by reading books, then the LLM should get PAID.
  
  If not, then it's just a tool. And it's a tool they built using uncompensated labor.
  
  An LLM is mathematically calculating the probability of the words being used. That is not inspiration.
  
  I said right in the comment, it's not like using the book to educate a child. A child will grow up and make their own decisions. The LLM has no ability to choose a different life path. The LLM is not getting IDEAS from the book. The LLM is a mathematical engine that will produce what has been asked for, and it will do that by calculating the most likely words to be used based on what has been fed to it.
  
  The LLM is a machine used to make profit for its programmer, it is not an independent person creating out of inspiration.
  
  Don't believe the hype. They have NOT produced actual Artificial Intelligence.
- Are the AIs reprinting? Seems like they are quoting, and when there’s not verbatim content whatever’s coming out is a derivative work transformed by combination with the rest of the training set and the prompt.
  
  Like, have we seen a chatbot post a passage of a story or textbook without it being in a context like “hey quote me some of that story or textbook”?
It would be cool to see some kind of legal or practical protection creators can place on their work that would prevent AIs from being able to use them for training.
- It exists. It is copyright. We just haven't seen the ends of the current batch of lawsuits just yet.
Yeah this is a weird one. I don't really know how the line gets drawn between training an AI and plagiarism. My gut feeling is that this feels like suing somebody for being inspired by your work or learning a new word from it.
- Yeah, I'm not sure how I feel about it... But I somehow instinctively feel that a human being "inspired" by other works is different to a neural network being trained on a novel. I don't know that I can articulate specifically why one feels okay and the other doesn't... But that's how it feels to me.
  
  Part of the problem is that AI research likes to use terminology that sounds like what people do, when that's not what the AI actually does.
  
  Large language models are not intelligent in any sense. They are autocomplete on steroids. This is a computer program that was fed a book someone wrote, then mathematically tweaked to be able to guess the next word in a sentence in a way that resembles that book. That's all it does. It does not think or learn in any sense we'd apply to a human.
  
  To me, LLMs sound like a massive plagiarism engine, and I think they should need to get a license from the authors whose works they used to make the LLM under whatever terms that author wants to give, just like a publisher needs to get permission to print a copy of the work. But copyright law has no easy "bright line" for what counts and what doesn't. So the courts will have to decide whether what the AI "creates" is similar enough to the original works to count as a violation, or if the AI and its results are transformative enough to count as something new.
  
  I agree with you but, since I can't come up with a reasonable explanation for it, my brain wants to err on the side of them being largely the same for whatever reason
  
  In part it feels that way because you, along with pretty much every other human being online today, have been propagandized for decades now with sci-fi inspired by dystopian futurist predictions around AI, which are almost universally obsolete and misinformed by now but still persist due to anchoring bias.
  
  AI trained to predict collective human thought ends up replicating quite a lot more than most people thought would be possible in our lifetimes.
  
  And yet when it exhibits emotional intelligence it's called creepy, when it exhibits above average reasoning capabilities it's called scary, and when it displays a potential for automating large swaths of busywork for most humans it's called a threat.
  
  Next to no one I see discussing the topic is considering the opportunity costs here, as the media influence on perceiving AI as 'other' is so pervasive that most humans fall into treating it like a monkey from another forest competing for bananas rather than treating it like a much better stick.
- There are already laws regarding producing works too similar to copyrighted material.
  
  Production is infringement, not training.
  
  If I feed all of Stephen King into an LLM such that it learns what well-written horror narratives look like, and it produces a story with original and different plot elements distinct from copyrighted works, that's fine.
  
  If it starts writing about killer clowns thwarted by child orgies in the sewers then you might have an infringement problem.
  
  And ironically, the best tool for protecting copyrighted material from infringement is going to be...LLMs (acting in a discriminator role comparing indexed copy to protected works).
  
  If 'training' ends up successfully labeled as infringement we're going to end up with much worse long term outcomes in jurisdictions that honor that ruling than we otherwise would.
  
  This is the long-tail masses adopting MPAA math to tally potential losses. In their efforts to protect the status quo, they are shooting themselves in the foot when it comes to laying claim to the future of the industry, which inevitably leads to being left out of the next round of growth.
  
  Also, from an 'infringement' standpoint it just means we'll see fewer open models and more closed ones, which end up using models from other jurisdictions to launder copyrighted materials into synthetic training data.
  
  This is beyond dumb.
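The idea above of comparing generated copy against protected works can be sketched roughly. To be clear, this is not an LLM acting as a discriminator: it swaps in a much cruder stand-in, plain word n-gram overlap, just to show the shape of the check. The function names, the sample sentences, and the choice of `n=5` are all hypothetical.

```python
def ngrams(text, n):
    """Return the set of word n-grams in text (lowercased, split on whitespace)."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(candidate, protected, n=5):
    """Fraction of the candidate's n-grams that also appear in the protected text.

    Assumption: verbatim word-sequence overlap is used as a rough proxy for
    copying. A real system would need something far more robust than this.
    """
    cand = ngrams(candidate, n)
    if not cand:
        # Candidate is shorter than n words, so there is nothing to compare.
        return 0.0
    return len(cand & ngrams(protected, n)) / len(cand)


# Made-up sample texts, purely for illustration.
protected_text = "it was a dark and stormy night in the old town"
copied = "it was a dark and stormy night"
original = "the sun rose over a quiet harbor at dawn"

print(overlap_ratio(copied, protected_text))    # high ratio: verbatim reuse
print(overlap_ratio(original, protected_text))  # no shared 5-word sequences
```

A high ratio only flags verbatim reuse. It says nothing about plot or character similarity, which is where the harder legal questions sit.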
On a related note, I would be very curious to see how something like ChatGPT trained exclusively on works in the public domain would turn out. It would likely have a very different diction and style based on the older source material, but I wonder what other differences there would be.
- What do they mean by train? If they mean reading, then how can that be wrong? But if they mean copying the text and using it as its own work, that would be wrong.
  
  After reading the article, I think the authors are fucking stupid. Makes me not want to support their books. If they can get mad because AI read their book, then they could sue me if someone asked me about the authors' books and I wrote a description of what I read.
  
  The problem I have with this view is that AI "reading" a book is not the same as you or I reading. It doesn't actually learn; it's just predicting the most likely sequence of words to be a response to whatever prompt it receives. In that sense, the words are just data, not actual words. Given how valuable data is in this day and age, I think it makes perfect sense for OpenAI to have to either use only public domain/authorized works or pay the creators for their work.
  
  Here, these videos are a fairly good explanation of how AI is created and "trained":
  
  https://youtu.be/R9OHn5ZF4Uo
  
  https://youtu.be/wvWpdrfoEv0
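For anyone who wants to see what "predicting the most likely next word" means in the smallest possible form, here is a toy sketch. It is a simple bigram counter, not how ChatGPT actually works (real LLMs are neural networks trained on vastly more text), and the function names and the sample sentence are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(text):
    """Count, for each word, which words follow it in the training text."""
    counts = defaultdict(Counter)
    words = text.lower().split()
    for current, following in zip(words, words[1:]):
        counts[current][following] += 1
    return counts


def next_word_probs(counts, word):
    """Turn the follower counts for a word into probabilities."""
    followers = counts.get(word.lower())
    if not followers:
        return {}
    total = sum(followers.values())
    return {w: c / total for w, c in followers.items()}


def predict(counts, word):
    """Return the most probable next word, or None if the word was never seen."""
    probs = next_word_probs(counts, word)
    if not probs:
        return None
    return max(probs, key=probs.get)


# Hypothetical training text, standing in for "the books it was fed".
corpus = "the cat sat on the mat and the cat slept"
model = train_bigram(corpus)

print(next_word_probs(model, "the"))  # "cat" follows "the" twice, "mat" once
print(predict(model, "the"))
```

The "model" here is nothing but a table of counts derived from the training text, which is roughly the sense in which commenters above call the words "just data".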
I feel like things created by AI are transformative enough that it's hard to argue that the resultant works inherently infringe on any copyrights by the very nature of how they were created
- I really need you to read this: https://softwarecrisis.dev/letters/llmentalist/
I really think artists/authors/etc. are going about this the wrong way. ChatGPT and other trained models aren't really the issue here. How the data is available and collected by other software and groups is.

What we should really be talking about is data privacy: who can access the data people put on the internet, and how easily.
- Well of course, putting it on the open internet is very intentionally making it available for everyone to see. If you don't want everyone to see it, don't put it on the open internet. The issue is what people do with it, not whether they can access it. Copyright forbids distributing copyrighted data. The entire point of it is that you can make your work available to be seen while it stays protected from people copying it. However, there is no distribution or storage of copyrighted material with an LLM - there is no copy. I think OpenAI will be OK, but these things are never certain when the big lawyers are let loose.
  
  Distributing the training dataset, though, that could well be a problem.
