OpenAI now tries to hide that ChatGPT was trained on copyrighted books, including J.K. Rowling’s Harry Potter series

A new research paper laid out ways in which AI developers should try to avoid showing that LLMs have been trained on copyrighted material.

  • Blapoo@lemmy.ml
    English
    arrow-up
    70
    arrow-down
    8
    ·
    1 year ago

    We have to distinguish between LLMs

    • Trained on copyrighted material and
    • Outputting copyrighted material

    They are not one and the same

    • Even_Adder@lemmy.dbzer0.com
      English
      arrow-up
      23
      arrow-down
      7
      ·
      1 year ago

      Yeah, this headline is trying to make it seem like training on copyrighted material is or should be wrong.

      • scv@discuss.online
        English
        arrow-up
        22
        arrow-down
        1
        ·
        1 year ago

        Legally the output of the training could be considered a derived work. We treat brains differently here, that’s all.

        I think the current intellectual property system makes no sense and AI is revealing that fact.

      • Jumper775@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        Legally they will decide it is wrong, so it doesn’t matter. Power is in money and those with the copyrights have the money.

      • TropicalDingdong@lemmy.world
        English
        arrow-up
        0
        arrow-down
        1
        ·
        1 year ago

        I think this brings up broader questions about the currently quite extreme interpretation of copyright. Personally I don’t think it’s wrong to sample from or create derivative works from something that is accessible. If it’s not behind lock and key, it’s free to use. If you have a problem with that, then put it behind lock and key. No one is forcing you to share your art with the world.

    • Tetsuo@jlai.lu
      English
      arrow-up
      3
      arrow-down
      1
      ·
      1 year ago

      Output from an AI has just been recently considered as not copyrightable.

      I think it stemmed from the recent actors’ strikes.

      It was stated that only work originating from a human can be copyrighted.

      • Anders429@lemmy.world
        English
        arrow-up
        3
        ·
        1 year ago

        Output from an AI has just been recently considered as not copyrightable.

        Where can I read more about this? I’ve seen it mentioned a few times, but never with any links.

        • Even_Adder@lemmy.dbzer0.com
          English
          arrow-up
          4
          ·
          1 year ago

          They clearly only read the headline. If they’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.

    • TwilightVulpine@lemmy.world
      English
      arrow-up
      0
      ·
      1 year ago

      Should we distinguish it though? Why shouldn’t (and didn’t) artists have a say if their art is used to train LLMs? Just as publicly displaying art doesn’t grant permission to copy it and use it for other unspecified purposes, it would be reasonable for the same to apply to AI training.

      • Blapoo@lemmy.ml
        English
        arrow-up
        1
        ·
        1 year ago

        Ah, but that’s the thing. Training isn’t copying. It’s pattern recognition. If you train a model on “The dog says woof” and then ask the model “What does the dog say”, it’s not guaranteed to say “woof”.

        Similarly, just because a model was trained on Harry Potter, all that means is it has a good corpus of how the sentences in that book go.

        Thus the distinction. Can I train on a comment section discussing the book?
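
For anyone who wants to poke at the "not guaranteed to say woof" idea, here is a toy sketch in Python. It is only a word-pair (bigram) counter, nothing like how ChatGPT is actually built, and the function names `train_bigram` and `next_word` plus the three-sentence corpus are made up for illustration. The point it tries to show is narrow: the model keeps counts of which word follows which, not the sentences themselves, and sampling from those counts can produce any continuation it has seen after "says".

```python
import random
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for each word, which words followed it in the training text."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, following in zip(words, words[1:]):
            counts[current][following] += 1
    return counts


def next_word(counts, word, rng):
    """Sample a next word in proportion to how often it followed `word`."""
    followers = counts.get(word.lower(), Counter())
    if not followers:
        return None  # the model never saw this word during training
    choices = list(followers)
    weights = [followers[w] for w in choices]
    return rng.choices(choices, weights=weights, k=1)[0]


# Hypothetical three-sentence "training set".
corpus = [
    "the dog says woof",
    "the dog says hello",
    "the cat says meow",
]

model = train_bigram(corpus)
rng = random.Random(0)  # fixed seed so repeated runs sample the same way

# Ask what follows "says" twenty times; the answer is not always "woof".
samples = [next_word(model, "says", rng) for _ in range(20)]
print(Counter(samples))
```

Note that this tiny model has also lost track of which animal was speaking, which is the pattern-versus-copy distinction in miniature. A real LLM conditions on far more context, so take this as an intuition pump rather than a description of any production system.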

    • OkToBeTakei@lemm.ee
      English
      arrow-up
      14
      arrow-down
      2
      ·
      edit-2
      1 year ago

      If you then use your brain to reproduce a product based on copyrighted work owned by Warner (without their permission) and try to sell it, then you could be in violation of the law.

      the thing is: AI and your brain are very different things

        • OkToBeTakei@lemm.ee
          English
          arrow-up
          6
          arrow-down
          1
          ·
          1 year ago

          I suppose you could say I’m making a good-faith assumption. It’s possible that he is.

          my point is still valid.

          • kmkz_ninja@lemmy.world
            English
            arrow-up
            4
            arrow-down
            3
            ·
            1 year ago

            His point is equally valid. Can an artist be compelled to show the methods of their art? Is it as right to force an artist to give up methods if another artist thinks they are using AI to derive copyrighted work? Haven’t we already seen that LLMs are really poor at evaluating whether or not something was created by an LLM? Wouldn’t making strong laws on such an already opaque and difficult-to-prove issue be more of a burden on smaller artists vs. large studios with lawyers-in-tow.

            • OkToBeTakei@lemm.ee
              English
              arrow-up
              3
              arrow-down
              2
              ·
              edit-2
              1 year ago

              His point is equally valid.

              but it’s irrelevant.

              Can an artist be compelled to show the methods of their art? Is it as right to force an artist to give up methods if another artist thinks they are using AI to derive copyrighted work?

              none of this is relevant in copyright law. the only thing that matters is who published it first and who is then using that copyrighted work for profit without first having gotten permission of the owner.

              Haven’t we already seen that LLMs are really poor at evaluating whether or not something was created by an LLM?

              also irrelevant

              Wouldn’t making strong laws on such an already opaque and difficult-to-prove issue

              laws are not written with the idea of whether they’re hard or easy to prove as a consideration. also, your claims of such things being easy/hard to prove are a matter of opinion, and I don’t agree with your chain of deduction here. OpenAI admitted that Rowling’s works (among many others) were used for training ChatGPT.

              be more of a burden on smaller artists vs. large studios with lawyers-in-tow.

              also irrelevant, unless you’re arguing that, because it’s too difficult for small artists to defend their copyrighted work, they should just shut up and deal with it. currently, there is no legal precedent as to whether this is or is not copyright infringement, which is what a lawsuit like this is intended to set. For most, I would say, it seems that it clearly is; for others, it’s not so clear.

      • Asuka@sh.itjust.works
        English
        arrow-up
        1
        arrow-down
        1
        ·
        1 year ago

        If I read Harry Potter and wrote a novel of my own, no doubt ideas from it could consciously or subconsciously influence it and be incorporated into it. How is that any different from what an LLM does?

    • TropicalDingdong@lemmy.world
      English
      arrow-up
      11
      arrow-down
      7
      ·
      1 year ago

      Exactly. If I write some Looney Tunes fan fiction, Warner doesn’t own that. This ridiculous view of copyright (that’s not being challenged in the public discourse) needs to be confronted.

      • OkToBeTakei@lemm.ee
        English
        arrow-up
        11
        arrow-down
        2
        ·
        edit-2
        1 year ago

        that’s not exactly what’s in dispute: the product that LLMs produce. That would probably be ruled as a derivative work under the DMCA’s “Fair Use” clause, and, therefore, public domain.

        the issue at hand is that the company accessed the copyrighted material without paying for it and is now using that training to earn more money without fair compensation.

        these language models or even proper AI can’t create original creative works the way a human can. The best it can do is create a pastiche or composition that simulates originality but is really just a jumble of recycled ideas that it’s been trained on. There’s a fair argument to be made that the owners of the copyrights of those previous works are entitled to fair compensation, especially since AI is a tool used by a company to churn out profit off the work of others.

          • Sethayy@sh.itjust.works
            English
            arrow-up
            1
            ·
            1 year ago

            Can’t, but they’re pretty open on how they trained the model, so like almost admitted guilt (though they weren’t hosting the pirated content, it’s still out there and would be trained on). Cause unless they trained it on a paid Netflix account, there’s no way to get it legally.

            Idk where this lands legally, but I’d assume not in their favour

    • CoderKat@lemm.ee
      English
      arrow-up
      3
      ·
      edit-2
      1 year ago

      It’s honestly a good question. It’s perfectly legal for you to memorize a copyrighted work. In some contexts, you can recite it, too (particularly the perilous fair use). And even if you don’t recite a copyrighted work directly, you are most certainly allowed to learn to write from reading copyrighted books, then try to come up with your own writing based off what you’ve read. You’ll probably try your best to avoid copying anyone, but you might still make mistakes, simply by forgetting that some idea isn’t your own.

      But can AI? If we want to view AI as basically an artificial brain, then shouldn’t it be able to do what humans can do? Though at the same time, it’s not actually a brain nor is it a human. Humans are pretty limited in what they can remember, whereas an AI could be virtually boundless.

      If we’re looking at intent, the AI companies certainly aren’t trying to recreate copyrighted works. They’ve actively tried to stop it as we can see. And LLMs don’t directly store the copyrighted works, either. They’re basically just storing super hard to understand sets of weights, which are a challenge even for experienced researchers to explain. They’re not denying that they read copyrighted works (like all of us do), but arguably they aren’t trying to write copyrighted works.

    • SubArcticTundra@lemmy.ml
      English
      arrow-up
      2
      ·
      1 year ago

      No, because you paid for a single viewing of that content with your cinema ticket. And frankly, I think that the price of a cinema ticket (= a single viewing, which it was) should be what OpenAI should be made to pay.

  • RadialMonster@lemmy.world
    English
    arrow-up
    19
    arrow-down
    1
    ·
    1 year ago

    what if they scraped a whole lot of the internet, and those excerpts were in random blogs and posts and quotes and memes etc etc all over the place? They didn’t ingest the material directly, or knowingly.

    • chemical_cutthroat@lemmy.world
      English
      arrow-up
      3
      arrow-down
      3
      ·
      1 year ago

      That’s why this whole argument is worthless, and why I think that, at its core, it is disingenuous. I would be willing to bet a steak dinner that a lot of these lawsuits are just fishing for money, and the rest are set up by competition trying to slow the market down because they are lagging behind. AI is an arms race, and it’s growing so fast that if you got in too late, you are just out of luck. So, companies that want in are trying to slow down the leaders, at best, and at worst they are trying to make them publish their training material so they can just copy it. AI training models should be considered IP, and should be protected as such. It’s like trying to get the Colonel’s secret recipe by saying that all the spices that were used have been used in other recipes before, so it should be fair game.

  • ClamDrinker@lemmy.world
    English
    arrow-up
    16
    ·
    edit-2
    1 year ago

    This is just OpenAI covering their ass by attempting to block the most egregious and obvious outputs in legal gray areas, something they’ve been doing for a while, hence why their AI models are known to be massively censored. I wouldn’t call that ‘hiding’. It’s kind of hard to hide it was trained on copyrighted material, since that’s common knowledge, really.

  • Tetsuo@jlai.lu
    English
    arrow-up
    18
    arrow-down
    4
    ·
    edit-2
    1 year ago

    If I’m not mistaken AI work was just recently considered as NOT copyrightable.

    So I find it interesting that an AI learning from copyrighted work is an issue even though what will be generated will NOT be copyrightable.

    So even if you generated some copy of Harry Potter you would not be able to copyright it. So in no way could you really compete with the original art.

    I’m not saying that it makes it ok to train AIs but I think it’s still an interesting aspect of this topic.

    As others probably have stated, the AI may be creating content that is transformative and therefore under fair use. But even if that work is transformative it cannot be copyrighted because it wasn’t created by a human.

    • Even_Adder@lemmy.dbzer0.com
      English
      arrow-up
      7
      ·
      edit-2
      1 year ago

      If you’re talking about the ruling that came out this week, that whole thing was about trying to give an AI authorship of a work generated solely by a machine and having the copyright go to the owner of the machine through the work-for-hire doctrine. So an AI itself can’t be an author or hold a copyright, but humans using one can still be copyright holders of any qualifying works.

    • habanhero@lemmy.ca
      English
      arrow-up
      3
      ·
      1 year ago

      How do you tell if a piece of work contains AI generated content or not?

      It’s not hard to generate a piece of AI content, put in some hours to round out AI’s signatures / common mistakes, and pass it off as your own. So in practise it’s still easy to benefit from AI systems by masking generated content as largely your own.

        • habanhero@lemmy.ca
          English
          arrow-up
          1
          ·
          1 year ago

          Sure, but even under this guidance copyright owners of the training data are still shafted, based on how the data is scraped pretty much freely. Can an opportunist generate an unofficial sequel to Harry Potter, do the minimum to ensure they get copyright and reap the reward from it?

          • Even_Adder@lemmy.dbzer0.com
            English
            arrow-up
            2
            ·
            1 year ago

            That’s how copyright has always worked. Fair use is complex, but as long as you’re not straight up copying someone’s work you’re fine. 50 Shades of Grey started out as Twilight fanfiction. So yeah, you could.

            • habanhero@lemmy.ca
              English
              arrow-up
              1
              ·
              edit-2
              1 year ago

              Yes, fair use has its stipulations, but AI is breaking new ground here. We are no longer dealing with reaction videos but with an era where literally dozens of pages of content can be generated in a matter of minutes, with relatively little human involvement. Perhaps it’s time to revisit whether the law still holds in light of these new technologies and tools.

    • Lucidlethargy@sh.itjust.works
      English
      arrow-up
      2
      arrow-down
      1
      ·
      1 year ago

      That’s not how copyright works. I’m embarrassed for you, and all the people who blindly upvoted you. Like… Yikes. What’s happening to this world?

      You can’t publish copyrighted work as your own just because you’re legally not able to publish copyrighted work. That’s an open-and-shut case of copyright infringement. Why do I have to say this? Am I on candid camera?

    • XEAL@lemm.ee
      English
      arrow-up
      1
      ·
      1 year ago

      How are they going to prove if something was written by an AI? Also, you can take the AI’s output and then modify it.

      • Tetsuo@jlai.lu
        English
        arrow-up
        2
        ·
        1 year ago

        That’s definitely an issue. At what point does copyright apply if you are just helped by an AI?

        I guess the courts will have to decide that…

  • afraid_of_zombies@lemmy.world
    English
    arrow-up
    13
    ·
    1 year ago

    I am sure they have patched it by now, but at one point I was able to get ChatGPT to give me copyrighted text from books by asking for ever larger quotations. It seemed more willing to do this with books out of print.

  • Thorny_Thicket@sopuli.xyz
    English
    arrow-up
    14
    arrow-down
    3
    ·
    1 year ago

    I don’t get why this is an issue. Assuming they purchased a legal copy that it was trained on then what’s the problem? Like really. What does it matter that it knows a certain book from cover to cover or is able to imitate art styles etc. That’s exactly what people do too. We’re just not quite as good at it.

    • Hildegarde@lemmy.world
      English
      arrow-up
      8
      arrow-down
      3
      ·
      1 year ago

      A copyright holder has the right to control who has the right to create derivative works based on their copyright. If you want to take someone’s copyright and use it to create something else, you need permission from the copyright holder.

      The one major exception is Fair Use. It is unlikely that AI training is a fair use. However this point has not been adjudicated in a court as far as I am aware.

      • FatCat@lemmy.world
        English
        arrow-up
        9
        arrow-down
        2
        ·
        1 year ago

        It is not a derivative it is transformative work. Just like human artists “synthesise” art they see around them and make new art, so do LLMs.

      • LordShrek@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        this is so fucking stupid though. almost everyone reads books and/or watches movies, and their speech is developed from that. the way we speak is modeled after characters and dialogue in books. the way we think is often from books. do we track down what percentage of each sentence comes from what book every time we think or talk?

  • Uriel238 [all pronouns]@lemmy.blahaj.zone
    English
    arrow-up
    14
    arrow-down
    4
    ·
    edit-2
    1 year ago

    Training AI on copyrighted material is no more illegal or unethical than training human beings on copyrighted material (from library books or borrowed books, no less!). And trying to challenge the veracity of generative AI systems on the notion that they were trained on copyrighted material only raises the specter that IP law has lost its validity as a public good.

    The only valid concern about generative AI is that it could displace human workers (or swap out skilled jobs for menial ones) which is a problem because our society recognizes the value of human beings only in their capacity to provide a compensation-worthy service to people with money.

    The problem is this is a shitty, unethical way to determine who gets to survive and who doesn’t. All the current controversy about generative AI does is kick this can down the road a bit. But we’re going to have to address soon that our monied elites will be glad to dispose of the rest of us as soon as they can.

    Also, amateur creators are as good as professionals, given the same resources. Maybe we should look at creating content by other means than for-profit companies.

  • Technoguyfication@lemmy.ml
    English
    arrow-up
    11
    arrow-down
    5
    ·
    1 year ago

    People are acting like ChatGPT is storing the entire Harry Potter series in its neural net somewhere. It’s not storing or reproducing text in a 1:1 manner from the original material. Certain material, like very popular books, has likely been interpreted tens of thousands of times due to how many times it was reposted online (and therefore how many times it appeared in the training data).

    Just because it can recite certain passages almost perfectly doesn’t mean it’s redistributing copyrighted books. How many quotes do you know perfectly from books you’ve read before? I would guess quite a few. LLMs are doing the same thing, but on mega steroids with a nearly limitless capacity for information retention.

    • abbotsbury@lemmy.world
      English
      arrow-up
      4
      ·
      1 year ago

      but on mega steroids with a nearly limitless capacity for information retention.

      That sounds like redistributing copyrighted books

    • Teritz@feddit.de
      English
      arrow-up
      4
      arrow-down
      4
      ·
      1 year ago

      Using copyrighted work, such as art, as examples still influences the AI, which they make a profit from.

      If they use my works, then they need to pay, that’s it.

      • coheedcollapse@lemmy.world
        English
        arrow-up
        12
        arrow-down
        2
        ·
        1 year ago

        Still kinda blows my mind how like the most socialist people I know (fellow artists) turned super capitalist the second a tool showed like an inkling of potential to impact their bottom line.

        Personally, I’m happy to have my work scraped and permutated by systems that are open to the public. My biggest enemy isn’t the existence of software scraping an open internet, it’s the huge companies who see it as a way to cut us out of the picture.

        If we go all copyright crazy on the models for looking at stuff we’ve already posted openly on the internet, the only companies with access to the tools will be those who already control huge amounts of data.

        I mean, for real, it’s just mind-blowing seeing the entire artistic community pretty much go full-blown “Metallica with the RIAA” after decades of making the “you wouldn’t download a car” joke.

        • Sir_Kevin@lemmy.dbzer0.com
          English
          arrow-up
          4
          arrow-down
          1
          ·
          1 year ago

          Fuckin preach! I feel like I’m surrounded by children that didn’t live through the many other technologies that have come along and changed things. People lost their shit when Photoshop became mainstream, when music started using samples, etc. AI is here to stay. These same people are probably listening to autotuned music all day while they complain on the internet about AI looking at their art.

        • angstylittlecatboy@reddthat.com
          English
          arrow-up
          3
          arrow-down
          1
          ·
          edit-2
          1 year ago

          I feel like a lot of internet people (not even just socialists) go from seeing copyright as at best a compromise that allows the arts to have value under capitalism to treating it like a holy doctrine when the subject of LLMs comes up.

          Like, people who will say “piracy is always okay” will also say “ban AI, period” (and misrepresent organizations that want regulations on its use as wanting a full ban.)

          Like, growing up with an internet full of technically illegal content (or grey area at best) like fangames and YouTube Poops made me a lifelong copyright skeptic. It’s outright confusing to me when people take copyright as seriously as this.

          • Draedron@lemmy.dbzer0.com
            English
            arrow-up
            1
            ·
            1 year ago

            I say piracy is always okay but I am also a big fan of AI. I had ChatGPT write my last cover letter and got the job.

  • madsen@lemmy.world
    English
    arrow-up
    4
    ·
    1 year ago

    The response from OpenAI, and the likes of Google, Meta, and Microsoft, has mostly been to stop disclosing what data their AI models are trained on.

    That’s really the biggest problem, IMO. I don’t really care whether it’s trained on copyrighted material or not, but I do want it to “cite its sources”, so to speak.

  • Jat620DH27@lemmy.world
    English
    arrow-up
    4
    arrow-down
    1
    ·
    1 year ago

    I thought everyone knew that OpenAI has the same access to books and knowledge that human beings have.

    • Redditiscancer789@lemmy.world
      English
      arrow-up
      0
      ·
      1 year ago

      Yes, but it’s what it is doing with it that is the murky grey area. Anyone can read a book, but you can’t use those books for your own commercial stuff. Rowling and other writers are making the case their works are being used in an inappropriate way commercially. Whether they have a case iunno ianal but I could see the argument at least.

      • Touching_Grass@lemmy.world
        English
        arrow-up
        1
        ·
        1 year ago

        Harry Potter uses so many tropes and so much inspiration from other works that came before. How is that different? Wizards of the Coast should sue her into the ground.