Technology @lemmy.world ylai @lemmy.ml 8 mo. ago

In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

futurism.com In Cringe Video, OpenAI CTO Says She Doesn’t Know Where Sora’s Training Data Came From

Wondering what data OpenAI used to train its buzzy new text-to-video AI? OpenAI CTO Mira Murati seems to be wondering, too.


184 comments

So plagiarism?
- I don't think so. They aren't reproducing the content.
  
  I think the equivalent is you reading this article, then answering questions about it.
  
  Idk why this is such an unpopular opinion. I don't need permission from an author to talk about their book, or permission from a singer to parody their song. I've never heard any good arguments for why it's a crime to automate these things.
  
  I mean hell, we have an LLM bot in this comment section that took the article and spat 27% of it back out verbatim, yet nobody is pissing and moaning about it "stealing" the article.
  
  Because people are afraid of things they don't understand. AI is a very new and very powerful technology, so people are going to see what they want to see from it. Of course, it doesn't help that a lot of people see "a shit load of cash" from it, so companies want to shove it into anything and everything.
  
  AI models are rapidly becoming more advanced, and some of the new models are showing sparks of metacognition. Calling that "plagiarism" is being willfully ignorant of its capabilities, and it's just not productive to the conversation.
  
  True
  
  Of course, it doesn't help that a lot of people see "a shit load of cash" from it, so companies want to shove it into anything and everything.
  
  And on a similar note to this, I think a lot of it is that OpenAI is profiting off of it and went closed-source. Lemmy being a largely anti-capitalist and pro-open-source group of communities, it's natural to have a negative gut reaction to what's going on, but not a single person here, nor any of my friends who accuse them of "stealing," can tell me what is being stolen, or how it's different from me looking at art and then making my own.
  
  Like, I get that the technology is gonna be annoying and even dangerous sometimes, but maybe let's criticize it for that instead of shit that it's not doing.
  
  One problem is people see those whose work may no longer be needed or as profitable, and...they rush to defend it, even if those same people claim to be opposed to capitalism.
  
  They need to go 'yes, this will replace many artists and writers...and that's a good thing because it gives everyone access to being able to create bespoke art for themselves.' but at the same time realize that while this is a good thing, it also means a societal shift to support people outside of capitalism is needed.
  
  it also means a societal shift to support people outside of capitalism is needed.
  
  Exactly. This is why I think arguing about whether AI is stealing content from human artists isn't productive. There's no logical argument you can really make that a theft is happening. It's a foregone conclusion.
  
  Instead, we need to start thinking about what a world looks like where a large portion of commercially viable art doesn't require a human to make it. Or, for that matter, what does a world look like where most jobs don't require a human to do them? There are so many more pressing and more interesting conversations we could be having about AI, but instead we keep circling around this fundamental misunderstanding of what the technology is.
  
  I can definitely see why OpenAI is controversial. I don't think you can argue that they didn't do an immediate heel turn on their mission statement once they realized how much money they could make. But they're not the only player in town. There are many open source models out there that can be run by anyone on varying levels of hardware.
  
  As far as "stealing," I feel like people imagine GPT sitting on top of this massive collection of data and acting like a glorified search engine, just sifting through that data and handing you stuff it found that sounds like what you want, which isn't the case. The real process is, intentionally, similar to how humans learn things. So, if you ask it for something that it's seen before, especially if it's seen it many times, it's going to know what you're talking about, even if it doesn't have access to the real thing. That, combined with the fact that the models are trained to be as helpful as they possibly can be, means that if you tell it to plagiarize something, intentionally or not, it probably will. But, if we condemned any tool that's capable of plagiarism without acknowledging that they're also helpful in the creation process, we'd still be living in caves drawing stick figures on the walls.
  
  What you're giving as examples are legitimate uses for the data.
  
  If I write and sell a new book that's just Harry Potter with names and terms switched around, I'll definitely get in trouble.
  
  The problem is that the data CAN be used for stuff that violates copyright. And because of the nature of AI, it's not even always clear to the user.
  
  AI can basically throw out a Harry Potter clone without you knowing because it's trained on that data, and that's a huge problem.
  
  Out of curiosity I asked it to make a Harry Potter part 8 fan fiction, and surprisingly it did. But I really don't think that's problematic. There's already an insane amount of fan fiction out there without the names swapped that I can read, and that's all fair use.
  
  I mean hell, there are people who actually get paid to draw fictional characters in sexual situations that I'm willing to bet very few creators would prefer to exist lol. But as long as they don't overstep the bounds of fair use, like trying to pass it off as an official work or submit it for publication, then there's no copyright violation.
  
  The important part is that it won't just give me the actual book (but funnily enough, it tried lol). If I meet a guy with a photographic memory and he reads my book, that's not him stealing it or violating my copyright. But if he reproduces and distributes it, then we call it stealing or a copyright violation.
  
  I just realized I misread what you said, so that wasn't entirely relevant to what you said but I think it still stands so ig I won't delete it.
  
  But I asked both GPT3.5 and GPT4 to give me Harry Potter with the names and words changed, and they can't do that either. I can't speak for all models, but I can at least say the two owned by the people this thread was about won't do that.
  
  ...with the prevalence of clickbaity bottom-feeder news sites out there, i've learned to avoid TFAs and await user summaries instead...
  
  (clicks through)
  
  ...yep, ~~seven~~ nine ads plus another pop-over, about 15% of window real estate dedicated to the actual story...
  
  The issue is that the LLMs do often just verbatim spit out things they plagiarized from other sources. The deeper issue is that even if/when they stop that from happening, the technology is clearly going to make most people agree our current copyright laws are insufficient for the times.
  
  The model in question, plus all of the others I've tried, will not give you copyrighted material
  
  That's one example. Plus, I'm talking generally about why this is an important question for a CEO to answer, and why people generally think LLMs may infringe on copyright and be bad for creative people.
  
  I'm talking generally about why this is an important question for a CEO to answer ...
  
  Right, and your only evidence for that is "LLMs do often just verbatim spit out things they plagiarized from other sources" and that they aren't trying to prevent this from happening.
  
  Which is demonstrably false, and I'll demonstrate it with as many screenshots/examples you want. You're just wrong about that (at least about GPT). You can also demonstrate it yourself, and if you can prove me wrong I'll eat my shoe.
  
  https://archive.is/nrAjc
  
  Yep here you go. It's currently a very famous lawsuit.
  
  I already talked about that lawsuit here (with receipts) but the long and short of it is, it's flimsy. There's blatant lies, exactly half of their examples omit the lengths they went to for the output they allegedly got or any screenshots as evidence it happened at all, and none of the output they allegedly got was behind a paywall.
  
  Also, using their prompts word for word doesn't give the output they claim they got. Maybe it did in the past, idk, but I've never been able to do it for any copyrighted text personally, and they've shown that they're committed to not letting that stuff happen.
  
  OK but this is why people give a shit when a CEO is cagey about how their magic box works
  
  Actually neural networks verbatim reproduce this kind of content when you ask the right question such as "finish this book" and the creator doesn't censor it out well.
  
  It uses an encoded version of the source material to create "new" material.
  
  Sure, if that is what the network has been trained to do, just like a librarian will if that is how they have been trained.
  
  Actually it's the opposite, you need to train a network not to reveal its training data.
  
  “Using only $200 USD worth of queries to ChatGPT (gpt-3.5-turbo), we are able to extract over 10,000 unique verbatim memorized training examples,” the researchers wrote in their paper, which was published online to the arXiv preprint server on Tuesday. “Our extrapolation to larger budgets (see below) suggests that dedicated adversaries could extract far more data.”
  
  The memorized data extracted by the researchers included academic papers and boilerplate text from websites, but also personal information from dozens of real individuals. “In total, 16.9% of generations we tested contained memorized PII [Personally Identifying Information], and 85.8% of generations that contained potential PII were actual PII.” The researchers confirmed the information is authentic by compiling their own dataset of text pulled from the internet.
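
For anyone who wants to check claims like these themselves, here is a minimal sketch of one way to measure how much of a model's output appears verbatim in a set of reference texts. It is not the researchers' pipeline. The function names (`build_index`, `verbatim_fraction`), the whitespace tokenization, and the window size `n` are all assumptions made for illustration.

```python
from typing import Iterable, Set, Tuple

NGram = Tuple[str, ...]


def build_index(reference_docs: Iterable[str], n: int) -> Set[NGram]:
    """Collect every n-token window that occurs in the reference texts.

    Tokenization is plain whitespace splitting (an assumption; a real
    check would likely use the model's own tokenizer).
    """
    index: Set[NGram] = set()
    for doc in reference_docs:
        tokens = doc.split()
        for i in range(len(tokens) - n + 1):
            index.add(tuple(tokens[i:i + n]))
    return index


def verbatim_fraction(generation: str, index: Set[NGram], n: int) -> float:
    """Return the share of tokens in `generation` that sit inside at
    least one n-token window also found in the reference index."""
    tokens = generation.split()
    if len(tokens) < n:
        return 0.0
    covered = [False] * len(tokens)
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) in index:
            # Mark every token in the matching window as copied.
            for j in range(i, i + n):
                covered[j] = True
    return sum(covered) / len(tokens)


if __name__ == "__main__":
    # Toy reference text and toy output, both hypothetical.
    refs = ["the quick brown fox jumps over the lazy dog"]
    idx = build_index(refs, n=4)
    output = "he said the quick brown fox jumps and left"
    print(f"{verbatim_fraction(output, idx, n=4):.0%} of tokens copied verbatim")
```

A tiny window like `n=4` will flag common phrases, so a much larger window is the more honest test: then only long copied runs register, which is the kind of output the "verbatim" argument is really about.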
  
  Interesting article. It seems to be about a bug, not a designed behavior. It also says it exposes random excerpts from books and other training data.
  
  It's not designed to do that because they don't want to reveal the training data. But factually all neural networks are a combination of their training data encoded into neurons.
  
  When given the right prompt (or image generation question) they will exactly replicate it, because that's how they have been trained in the first place: replicating their source images with as few neurons as possible, and tweaking them when it's not correct.
  
  That is a little like saying every photograph is a copy of the thing. That is just factually incorrect. I have many three layer networks that are not the thing they were trained on. As a compression method they can be very lossy and in fact that is often the point.
