OpenAI says it’s “impossible” to create useful AI models without copyrighted material

Apparently, stealing other people's work to create a product for money is now "fair use," according to OpenAI, because they are "innovating" (stealing). Yeah. Move fast and break things, huh?

"Because copyright today covers virtually every sort of human expression—including blogposts, photographs, forum posts, scraps of software code, and government documents—it would be impossible to train today’s leading AI models without using copyrighted materials," wrote OpenAI in the House of Lords submission.

OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."

244 comments
  • Some relevant comments from Ars:

    leighno5

    The absolute hubris required for OpenAI here to come right out and say, 'Yeah, we have no choice but to build our product off the exploitation of the work others have already performed' is stunning. It's about as perfect a representation of the tech bro mindset that there can ever be. They didn't even try to approach content creators in order to do this, they just took what they needed because they wanted to. I really don't think it's hyperbolic to compare this to modern day colonization, or worker exploitation. 'You've been working pretty hard for a very long time to create and host content, pay for the development of that content, and build your business off of that, but we need it to make money for this thing we're building, so we're just going to fucking take it and do what we need to do.'

    The entitlement is just...it's incredible.

    4qu4rius

    20 years ago, high school kids were sued for millions & years in jail for downloading a single Metallica album (if I remember correctly minimum damage in the US was something like 500k$ per song).

    All of a sudden, just because they are the dominant ones doing the infringement, they should be allowed to scrape the entire (digital) human knowledge? Funny (or not) how the law always benefits the rich.

  • Any reasonable person can reach the conclusion that something is wrong here.

    What I'm not seeing a lot of acknowledgement of is who really gets hurt by copyright infringement under the current U.S. scheme. (The quote is obviously directed toward the UK, but I'm reasonably certain a similar situation exists there.)

    Hint: It's rarely the creators, who usually get paid once while their work continues to make money for others.

    Let's say the New York Times wins its lawsuit. Do you really think the reporters who wrote the infringed-upon material will be getting royalty checks to be made whole?

    This is not OpenAI vs creatives. OK, on a basic level it is, but expecting no one to scrape blogs and forum posts rather goes against the idea of the open internet in the first place. We've all learned by now that what goes on the internet stays there, with attribution totally optional unless you have a legal department. What's novel here is the scale of scraping, but I see some merit to the "transformational" fair-use defense given that the ingested content is not being reposted verbatim.

    This is corporations vs corporations. Framing it as millions of people missing out on what they'd have otherwise rightfully gotten is disingenuous.

  • ...so stop doing it!

    This explains why Valve was, until recently, not so cavalier about AI: They didn't want to hold the bag on copyright matters outside of their domain.

  • As with many things, the golden rule applies. They who have the gold, make the rules.

  • I think viral outrage aside, there is a very open question about what constitutes fair use in this application. And I think the viral outrage misunderstands the consequences of enforcing the notion that you can't use openly scrapable online data to build ML models.

    Effectively what the copyright argument does here is make it so that ML models are only legally allowed to be made by Meta, Google, Microsoft and maybe a couple of other companies. OpenAI can say whatever, I'm not concerned about them, but I am concerned about open source alternatives getting priced out of that market. I am also concerned about what it does to previously available APIs, as we've seen with Twitter and Reddit.

    I get that it's fashionable to hate on these things, and it's fashionable to repeat the bit of misinformation about models being a copy or a collage of training data, but there are ramifications here people aren't talking about and I fear we're going to the worst possible future on this, where AI models are effectively ubiquitous but legally limited to major data brokers who added clauses to own AI training rights from their billions of users.

    • People hate them not because it is fashionable, but because they can see what is coming.

      Tech companies want to create tools that would replace millions of jobs without compensating the very people that created these works in the first place.

      • That's not "coming", it's an ongoing process that has been going on for a couple hundred years, and it absolutely does not require ChatGPT.

        People genuinely underestimate how many of these things have been an ongoing concern. Much like crypto isn't that different from what you can do with a server, "AI" isn't a magic key that unlocks automation. I don't even know how this mental model works. Is the idea that companies who are currently hiring millions of copywriters will just rely on automated tools? I get that yeah, a bunch of call center people may get removed (again, a process that has been ongoing for decades), but how is compensating Facebook for scraping their social media posts for text data going to make that happen less?

        Again, I think people don't understand the parameters of the problem, which is different from saying that there is no problem here. If anything the conversation is a net positive in that we should have been having it in 2010 when Amazon and Facebook and Google were all-in on this process already through both ML tools and other forms of data analysis.

      • Tech companies will create those tools no matter what. Then they will charge everyone through the nose for using them.

        The question is whether:

        • ONLY tech companies capable of paying scraps for 70 years after the author's death are allowed to create those tools
        • EVERYONE is allowed to train their own tool, without having to raise a few billion in seed capital

        In this case, OpenAI is acting as "the devil's advocate"... and it's working to fool people into supporting the opposite position.

    • It is an open question. As others have pointed out, a human taking inspiration from the work of others is totally fine. My issue is that AI are not human.

      A human's production of work is limited. A human can only produce so fast for so long. An AI could theoretically be scaled infinitely and produce indefinitely. I don't want to live in a world where FAANGCORP's OmniAI is responsible for 90% of all art, media, and music because humans can't keep pace with it.

      • A lot of this can be traced back to the invention of photography, which is a fun point of reference, if one goes to dig up the debate at the time.

        In any case, the idea that humans can only produce so fast for so long and somehow that cleans the channel just doesn't track. We are flooded by low quality content enabled by social media. There are seven billion of us, two or three billion of whom are on social platforms, and a whole bunch of the content being shared in channels is created by using corporate tools to make stuff by pointing phones at it. I guarantee that people will still go to museums to look at art regardless of how much cookie cutter AI stuff gets shared.

        However, I absolutely wouldn't want a handful of corporations to have the ability to empower their employed artists with tools to run 10x faster than freelance artists. That is a horrifying proposition. Art is art. The difficulty isn't in making the thing technically (say hello, Marcel Duchamp, I bet you thought you had already litigated this). Artists are gonna art, but it's important that nobody has a monopoly on the tools to make art.

      • "It's too fast" is a really really dumb argument against AI

      • Mass produced garbage is still mass produced garbage. As you point out AIs aren't human and while that removes the limitations of the flesh (including limitations that we might want there - no human ever says oops, I made a child porn), it imposes limitations of the machine. AI output isn't that good at anything practical. It writes garbage code that even if you manage to get it working, the business manager or whoever isn't capable of seeing the flaws in it. The art is devoid of any sort of soul and almost always has glaring flaws that require actual humans to identify and fix.

        We are about to be inundated with AI produced garbage, sure, but that only proves the lie that shady internet sites and social media have always been a cesspool of shitty, unreliable content, and connecting with hundreds of thousands of faceless strangers was never a good idea. Hopefully we'll come up with (or go back to) solutions that don't treat the problem as simply one of volume.

  • Or, or, or, hear me out:

    Maybe their particular approach to making an AI is flawed.

    It's like people do not know that there are many different approaches to AI.

    Many of them do not rely on basically a training set that is the cumulative sum of all human generated content of every imaginable kind.

  • All the AI race has done is surface the long standing issue of how broken copyright is for the online internet era. Artists should be compensated but trying to do that using the traditional model which was originally designed with physical, non infinitely copyable goods in mind is just asinine.

    One such model could be to make the copyright owner automatically assigned by first upload on any platform that supports the API. An API provided and enforced by the US copyright office. A percentage of the end use case can be paid back as royalties. I haven't really thought out this model much further than this.

    Machine learning is here to stay and is a useful tool that can be used for good and evil things alike.

    • Nah. Copyright is broken, but it's broken because it lasts too long, and it can be held by constructs. People should still reserve the right to not have the things they've made incorporated into projects or products they don't want to be associated with.

      The right to refusal is important. Consent is important. The default permission should not be shifted to "yes" in anybody's mind.

      The fact that a not insignificant number of people seem to think the only issue here is money points to some pretty fucking entitled views among the would-be-billionaires.

      • My major issue with copyright is how published works can have major cultural significance. How they can shift ideas and shape minds. But you're not allowed to have some fun with them on a personal level. How can it be the norm that the most important scientific knowledge and other culturally significant material is locked behind such restrictive measures? Essentially ensuring that middle class and especially poor people are locked out.

        If you publish something, even if it's paid, you don't deserve such restrictive rights. You deserve to be compensated for your work but you don't deserve to make it into an extortion racket.

        My view on your second point is if you have posted it publicly with no paywall, maybe you should still get some percentage of revenue, but you don't have a say in what it can be used for. To place restrictions on what it can be used for when posting it publicly is academic as it's basically unenforceable.

        We live in a society which revolves around the discovery and sharing of ideas. We are all entitled to a certain amount of the sharing of that information. That's the whole point. To have some business man who was in the right place at the right time create an extortion racket out of something culturally significant they almost certainly didn't create is wrong.

        Sorry if this is all over the place. I'm writing this while tired.

  • Could they be legally required to open source the LLM? I believe them, but that doesn’t make it right

  • OpenAI now needs to go to court and argue fair use forever. That's the burden of our system. Private ownership is valued higher than anything else so ... Good luck we're all counting on you (unfortunately).

  • 🤖 I'm a bot that provides automatic summaries for articles:

    Further, OpenAI writes that limiting training data to public domain books and drawings "created more than a century ago" would not provide AI systems that "meet the needs of today's citizens."

    OpenAI responded to the lawsuit on its website on Monday, claiming that the suit lacks merit and affirming its support for journalism and partnerships with news organizations.

    OpenAI's defense largely rests on the legal principle of fair use, which permits limited use of copyrighted content without the owner's permission under specific circumstances.

    "Training AI models using publicly available internet materials is fair use, as supported by long-standing and widely accepted precedents," OpenAI wrote in its Monday blog post.

    In August, we reported on a similar situation in which OpenAI defended its use of publicly available materials as fair use in response to a copyright lawsuit involving comedian Sarah Silverman.

    OpenAI claimed that the authors in that lawsuit "misconceive[d] the scope of copyright, failing to take into account the limitations and exceptions (including fair use) that properly leave room for innovations like the large language models now at the forefront of artificial intelligence."


    Saved 58% of original text.
