But investigative journalism has not disappeared. If anything, it has grown.
Get ahead in terms of what? Do you believe that public domain material and legally available content that doesn't violate copyright is not enough for AI/LLM research, or is the concern purely about commercial interests?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind?
Absolutely: if copyright is slowing down innovation, we should abolish copyright.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship even if it is dressed as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, the presence of which serves only (and only sometimes) to drive content into open secrecy.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
> throw out people's stake in how their lives are run
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold up in court. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAIs computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get GitHub to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.
GitHub has ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR).
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
> If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet you won't see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
Will we see human-washing, where AI art or works get a 'Made by man' final touch in some third-world mechanical turk den? Would that add another financially detracting layer to the AI winter?
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
The law generally takes a dim view of such attempts to get around things like that. AI's biggest defense is claiming they are so beneficial to society that what they are doing is fine.
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
The problem is when a human company profits off its scrape... this isn't a non-profit run by volunteers, and it's a far cry from autonomous robots learning on their own.
We are discussing an emergent issue with social and ecological consequences. Servers are power-hungry machines that may or may not run on a sustainable grid (which has its own pile of problems, like heavy chemicals leaking during solar panel production, hydroelectric plants destroying their surroundings, etc.), and the current state of hardware production involves sweatshops and conflict minerals.
And that's setting aside the violation of creators' copyright, which is written into the legal code of almost every country. No artist is making billions from the abuse of their creative rights (they are often pretty chill about getting their stuff mentioned, remixed, and whatever).
If humanity ever gets to the point where intelligent robots are capable of watching TV like humans can, having to adjust copyright law seems like the least of our problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things.
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
The problem is, we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than a human.
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.
AI's approach to copyright is very much "rules for thee but not for me".
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories: if you were to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
Exactly. Also core to the copyright extremists’ delusional train of thought is the fact that they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
Isn't this the same thing Google has been doing for years with their search engine? Only difference is Google keeps the data internal, whereas OpenAI spits it out to you. But it's still scraped and stored in both cases.
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google of course is just as guilty of as OpenAI) completely obsolete the original publication. The argument now is that the derivative work (the LLM model) is transformative, i.e., different enough that it doesn't economically compete with the original. I think it's a losing argument but we'll see what the courts arrive at.
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it had created something new; an LLM is obfuscated plagiarism. Some claim derivative work, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape that level of scrutiny.
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM development as well, on top of manufacturing. Realistically that'll be banned because of national security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
Wouldn't the better method be to throw all your money at one suit you can make an example of, and try to win that one? You can't effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep settling out of court and make a buck.
the very idea of "this digital asset is exclusively mine" cannot die soon enough
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
This is, in fact, the core value of the hacker ethos. HackerNews.
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
I don't understand what the "hacker ethos" could have to do with defending OpenAI's blatant stealing of people's content for their own profit.
OpenAI is not sharing their data (they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to OpenAI for free?
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.
Luckily, most people seem to ignore OpenAI's hypocritical TOS against using their outputs for training. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak it, like NovelAI did.
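(For the curious: "training on their output" is sequence-level distillation. Below is a minimal sketch of the data-collection half, assuming a hypothetical teacher_generate() wrapper around whatever API is being sampled; the resulting JSONL is what a standard fine-tuning pipeline would consume.)

```python
import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling the teacher model's API."""
    raise NotImplementedError  # replace with a real API call

# Prompts the student model should learn to answer.
prompts = [
    "Explain DMCA section 1202 in one paragraph.",
    "Summarize the four fair use factors.",
]

# Collect (prompt, teacher output) pairs. Fine-tuning a student on these
# pairs is the standard sequence-level distillation recipe.
with open("distill.jsonl", "w") as f:
    for p in prompts:
        record = {"prompt": p, "completion": teacher_generate(p)}
        f.write(json.dumps(record) + "\n")
```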
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed at an individual level to get justice.
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.
So basically the reasoning is this:
- NYT vs OpenAI, neither is disenfranchised
- OpenAI vs individual creators, creators are disenfranchised
- NYT vs individual model trainers, model trainers are disenfranchised
- Individual model trainers vs individual creators, neither are disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
What "information" are you talking about? It's a text and image generator.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (eg "ruthless", "harvested" to mean "copied"). It's not productive to the conversation.
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economical value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
Yes, I have no idea either. I find it disappointing.
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
It won't hold up in court, and given that the post-office will mail/deliver unsealed letters (which may then be sealed after the fact), will be viewed rather dimly.
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
Mailing yourself using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However, this may not provide the protection he needs. Copyright law differs from patent law, and he should seek legal advice.
The US moved to first-to-file years ago. Whoever files first gets it, except that if the inventor publishes publicly there is a one-year grace period (it would not apply to a self-mail or private mail to other people).
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public.
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
It's so weird to me seeing journalists complaining about copyright and people taking something they did.
The whole of journalism is taking the acts of others and repeating them, so why does a journalist claim to have rights over someone else's actions when someone simply looks at something they did and repeats it?
If no one else ever did anything, the journalist would have nothing to report; it's inherently about replicating the work and acts of others.
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.
I would trust AI a lot more if it gave answers more like:
“Source A on date 1 said XYX”
“Source B …”
“Synthesizing these, it seems that the majority opinion is X but Y is also a commonly held opinion.”
Instead of what it does now, which is make extremely confident, unsourced statements.
It looks like the copyright lawsuits are rent-seeking as much as anything else; another reason I hate copyright in its current form.
> which is make extremely confident,
One of the results the LLM has available to itself is a confidence value. It should, at the very least, provide this along with its answer. Perhaps if it did, people would stop calling it 'AI'.
My understanding is that this confidence value is not a measure of how likely something is correct/true, but more along the lines of how likely that sentence would be. Including it could be more misleading than helpful, for example if it is repeating commonly misunderstood information.
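(Concretely: what the model can actually expose is per-token likelihood, not truthfulness. A minimal sketch using the OpenAI Python client's logprobs option; the model name and prompt are placeholders, and a high probability here only means "likely continuation".)

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Who invented the telephone?"}],
    logprobs=True,
    top_logprobs=3,
)

# Each entry is the log-probability of the token the model emitted,
# i.e. "how likely is this continuation", not "how likely is this true".
for tok in resp.choices[0].logprobs.content:
    print(f"{tok.token!r}: p={math.exp(tok.logprob):.3f}")
```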
I'm not sure that it's possible to produce anything reasonable in that space. It would need to know how far away it is from correct to provide a usable confidence value; otherwise it'd just be hallucinating a number in the same way as the result.
An analogy. Take a former commuter friend of mine, Mr Skol (named after his favourite breakfast drink). We met on a minibus I had to take to work years ago and shared many interesting conversations. Now, he was a confident expert on everything. If asked to rate his confidence in a subject, it would be a good 95% at least. However, he spoke absolute garbage, because his brain had rotted away from drinking Skol for breakfast, plus the odd crack chaser. I suspect his model was still better than GPT-4o. But an average person could determine the veracity of his arguments.
Thus confidence should be externally rated, as an entity with knowledge cannot necessarily rate itself, for it has bias. Which then brings up the question of how you do that. Well, you'd have to do the research you were going to do anyway and compare. So now you've used the AI and done the research you would have had to do if the AI didn't exist. At this point the AI becomes a cost over benefit if you need any level of confidence and accuracy.
Thus the value is zero unless you need crap information, which is, at least here, never, unless I'm generating a picture of a goat driving a train or something. And I'm not sure that has any commercial value. But it's fun at least.
ChatGPT Search provides this, by the way, though it relies a lot on the quality of Bing search results. Consensus.app does this but for research papers, and has been very useful to me.
More often than not in my experience, clicking these sources takes me to pages that either don’t exist, don’t have the information ChatGPT is quoting, or ChatGPT completely misinterpreted the content.
Interesting. Two key quotes:
> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.
> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.
Tens of thousands of violations at $2500 each would amount to tens of millions of dollars in damages. I am not familiar with this field, does anyone have a sense of whether the total cost of retraining (without these alleged DMCA violations) might compare to these damages?
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data instead of just the content you were most recently sued over (especially if it sets a precedent)?
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.
They can't stop doing things some people don't like (people who also won't stop doing things other people don't like). The legality of the claims is questionable which is why most are getting thrown out, but we'll see if this narrow approach works out.
I'm sure there are also a number of easy technical ways to "include" the metadata while mostly ignoring it during training that would skirt the letter of the law if needed.
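(One such trick, sketched under the assumption of a standard PyTorch training loop: keep the metadata tokens in the context so they are technically "included", but mask them out of the loss so the model never learns from them. This is just the usual ignore_index convention, not anything OpenAI is known to actually do.)

```python
import torch
import torch.nn.functional as F

vocab, seq_len, batch = 1000, 8, 2
logits = torch.randn(batch, seq_len, vocab)          # stand-in model output
tokens = torch.randint(0, vocab, (batch, seq_len))   # input token ids
is_metadata = torch.zeros(batch, seq_len, dtype=torch.bool)
is_metadata[:, :3] = True  # pretend the first 3 tokens are title/author metadata

# Copy the tokens as targets, but set metadata positions to -100 so
# cross_entropy skips them entirely: "included" in the context,
# invisible to the gradient.
targets = tokens.clone()
targets[is_metadata] = -100

loss = F.cross_entropy(
    logits.reshape(-1, vocab),
    targets.reshape(-1),
    ignore_index=-100,
)
print(loss)
```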
If we really want to be technical, in common law systems anything is legal as long as the highest court to challenge it decides it's legal.
I guess I should have used the phrase "common sense stealing in any other context" to be more precise?
I wonder if they can say something like "we aren't scraping your protected content, we are merely scraping this old model we don't maintain anymore, and it happened to have protected content in it from before the ruling." Then you've essentially won all of humanity's output, as you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills output is just going to be a poor summarization of that primary information.
Other factors help this effort of an old model plus new public-facing data being complete: other forms of media, like storytelling and music, have already converged on certain prevailing patterns. For stories, we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we've always had. For art, there are just too few of us actually going out of our way to get familiar with novel art, versus the vast bulk of the world's present-day artistic effort, which goes toward product advertisement, which once again follows patterns people have been publishing in psychology journals for decades now.
In a sense we've already put out enough data and made enough of our world formulaic that I believe we've set up a perfect singularity already, in terms of what can be generated for the average person who looks at a screen today. And because of that I think even a total lack of new training on such content wouldn't hurt OpenAI at all.
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)
But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)
But who knows. Maybe it can be done for more fact-like stuff.
> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.
All of that and more, all at the same time.
Attribution at inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.
RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
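(A toy illustration of the retrieval half, using TF-IDF instead of a real embedding model so it stays self-contained; the corpus and source names are made up. The point is that attribution comes from the retrieval index, not from the model's weights.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in corpus; a real system would store chunked documents
# with their provenance alongside.
docs = [
    ("The Intercept sued OpenAI under 17 USC 1202.", "theintercept.com"),
    ("Raw Story and AlterNet seek $2,500 per violation.", "niemanlab.org"),
    ("Fair use weighs market harm among other factors.", "copyright.gov"),
]

texts = [text for text, _ in docs]
vec = TfidfVectorizer().fit(texts)
doc_matrix = vec.transform(texts)

def retrieve(query: str, k: int = 2):
    """Return the k most similar (text, source) pairs for a query."""
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    return [(texts[i], docs[i][1]) for i in sims.argsort()[::-1][:k]]

# The retrieved pairs get prepended to the prompt, so the model can cite
# its sources: attribution survives because we kept it in the index.
for text, source in retrieve("What damages are the publishers asking for?"):
    print(f"[{source}] {text}")
```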
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
Being part of a public court record doesn't seem like something that would invalidate copyright.
What about "bombing"? Couldn't you smuggle DMCA-protected content into training sets hoping for a payout?
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.
"well someone snuck in some DMCA content" when sharing family photos and doesn't suddenly make it legal to share that DMCA protected content with your photos...
But all content is DMCA-protected. Avoiding copyrighted content means not having content, as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.
The apparent loophole is the gap between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
Yes, that's how every other industry that redistributes content works.
You have to license content you want to use, you cant just use it for free because it's on the internet.
Netflix doesn't just start hosting shows and hope they don't get a copyright suit...
Re-training can be done, but, and it is not a small but, models already exist and can be run locally, suggesting that the milk was spilled too long ago at this point. Separately, neutering them effectively lowers their value compared to their non-neutered counterparts.
Is there a way to figure out if OpenAI ingested my blog? (A crude probe is sketched below.) If the settlements are $2,500 per article then I'll take a free used car's worth of payments if it's available.
I suppose the cost of legal representation would cancel it out. I can just imagine a class action where anyone who posted on blogger.com between 2002 and 2012 eventually gets a check for 28 dollars.
If I were more optimistic I could imagine a UBI funded by lawsuits against AGI, some combination of lost wages and intellectual property infringement. Can't figure out exactly how much more impact an article on The Intercept had on shifting the weights than your Hacker News comments; might as well just pay everyone equally, since we're all equally screwed.
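(On the "did OpenAI ingest my blog" question above: there is no definitive outside test, but one crude probe is to feed the model the first half of an old post and measure how closely its continuation tracks the real second half. A sketch, with generate() as a hypothetical stand-in for whatever model API you would query; only scores near 1.0 would hint at memorization.)

```python
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError  # replace with a real API call

def regurgitation_score(article: str) -> float:
    """Prompt with the first half; compare output to the real second half."""
    half = len(article) // 2
    prefix, suffix = article[:half], article[half:]
    continuation = generate(prefix)[: len(suffix)]
    return SequenceMatcher(None, continuation, suffix).ratio()

# Near-1.0 scores suggest memorization; ordinary paraphrase-level
# similarity proves nothing either way.
```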
If you posted on blogger.com (or any platform with enough money to hire lawyers) you probably gave them a license that is irrevocable, non-exclusive and able to be sublicensed.
There are reasons for that (they need a license to show it on the platform) but usually these agreements are overly broad because everyone except the user is covering their ass too much.
Those licenses will now be used to sell that content/data for purposes that nobody thought about when you started your account.
Wouldn't the point of the class action be to dilute the cost of representation? If the damages per article are high and there are plenty of class members, I imagine the limit would be how much OpenAI is actually able to pay out.
I understand that regulations exist and that there can be copyright violations, but shouldn't we be concerned that other, more lenient governments (mainly China) that are opposed to the US will use this to get ahead, if OpenAI is significantly set back?
No. OpenAI is suspected to be worth over $150B. They can absolutely afford to pay people for data.
Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information.
I can't believe there are so many apologists on HN for what amounts to vacuuming up people's data for financial gain.
The OpenAI that is assumed to be able to keep harvesting every form of IP without compensation is valued at $150B; an OpenAI that has to pay for data would be worth significantly less. They're currently not even expecting to turn a profit until 2029, and that's without paying for data.
https://finance.yahoo.com/news/report-reveals-openais-44-bil...
OpenAI is not profitable, and to achieve what they have achieved they had to scrape basically the entire internet. I don't have a hard time believing that OpenAI could not exist if they had to respect copyright.
https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
Technically OpenAI has respected copyright, except in the (few) instances where they produce non-fair-use amounts of copyrighted material.
The DMCA does not cover scraping.
That's not real money though. You need actual cash on hand to pay for stuff, and OpenAI only has the money they've been given by investors. I suspect that many of the investors wouldn't have been so keen if they knew that OpenAI would need an additional couple of billion a year to pay for data.
That doesn’t mean they have $150B to hand over. What you can cite is the $10 billion they got from Microsoft.
I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models.
This type of argument is ignorant, cowardly, shortsighted, and regressive. Both technology and society will progress when we find a formula that is sustainable and incentivizes everyone involved to maximize their contributions without it all blowing up in our faces someday. Copyright law is far from perfect, but it protects artists who want to try and make a living from their work, and it incentivizes creativity that places without such protections usually end up just imitating.
When we find that sustainable framework for AI, China or <insert-boogeyman-here> will just end up imitating it. Idk what harms you're imagining might come from that ("get ahead" is too vague to mean anything), but I just want to point out that that isn't how you become a leader in anything. Even worse, if they are the ones who find that formula first while we take shortcuts to "get ahead", then we will be the ones doing the imitation in the end.
Copyright is a dead man walking and that's a good thing. Let's applaud the end of a temporary unnatural state of affairs.
Should we also be concerned that other governments use slave labor (among other human rights violations) and will use that to get ahead?
It's hysterical to compare training an ML model with slave labour. It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
> It's hysterical to compare training an ML model with slave labour.
Nobody did that.
> It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
It makes sense. There is always scale to consider in these things.
Isn't it a greater risk that creators lose their income and nobody is creating the content anymore?
Take for instance what has happened to news because of the internet. Not exactly the same, but similar forces at work. It turned into a race to the bottom, with everyone trying to generate content as cheaply as possible to get maximum engagement while tech companies siphoned the revenue. Expensive, investigative pieces from educated journalists disappeared in favor of stuff that looks like spam. Pre-internet news was higher quality.
Imagine that same effect happening to all content: art, writing, academic pieces. It's a real risk that OpenAI has peaked in quality.
Lots of people create without getting paid to do it. A lot of music and art is unprofitable. In fact, you could argue that when the mainstream media companies got completely captured by suits with no interest in the things their companies invested in, that was when creativity died and we got consigned to genre-box superhero pop hell.
I don’t know. When I look at news from before, there never was investigative journalism. It was all opinion-swaying editorials, until alternate voices voiced their counternarratives. It’s just not in newspapers, because they are too politically biased to produce the two sides of stories that we’ve always asked them to; it’s on other media.
But investigative journalism has not disappeared. If anything, it has grown.
Get ahead in terms of what? Do you believe that the material in public domain or legally available content that doesn't violate copyrights is not enough to research AI/LLMs or is the concern about purely commercial interests?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind?
Absolutely: if copyright is slowing down innovation, we should abolish copyright.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
ChatGPT doesn't violate copyright, it's a software application. "Open"AI does, it's a company run by humans (for now).
Humans do get to ignore copyright, when they do the same thing OpenAI has been doing.
Exactly.
Should I be paying a proportion of my salary to all the copyright holders of the books, songs, TV shows and movies I consumed during my life?
If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?
Yeah it turns out humans have more rights than computer programs and tech startups.
I'm more concerned that some people in the tech world are conflating Sam Altman's interest with the national interest.
Am I jazzed about Sam Altman making billions? No.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship, even when it is dressed as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, whose presence serves only (and only sometimes) to drive content into open secrecy.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
> throw out people's stake in how their lives are run
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
I read the headline as the copyright violation claim being core to the lawsuit.
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold in courts. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAI's computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
> he can order all of OpenAI's computer equipment be impounded.
Arrrrr matey, this is going to be fun.
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get Github to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.
GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR)
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
Their stance is that GitHub revoked that license by blocking their account.
It won't happen. Judges only order that punishment for the little guys.
“If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.”
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
That’s good. I hope more do. This list has those doing it under the Fairly Trained banner:
https://www.fairlytrained.org/certified-models
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet, you won’t see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
That is OpenAI's problem, not their victims'.
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
> Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
Have you read 1202? It's all about hiding your infringement.
Will we see human-washing, where AI art or works get a "Made by man" final touch in some third-world Mechanical Turk den? Would that add another financially detracting layer to the AI winter?
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
However, I get my personal gain regardless of whether or not the text is also ingested into ChatGPT.
In fact, since I use ChatGPT a lot, I get more gain if it is.
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
The law generally takes a dim view of attempts to get around things like that. AI's biggest defense is claiming it is so beneficial to society that what it is doing is fine.
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
Worked for Purdue.
YouTube, AirBnB, Uber, and many many others have all done stuff that’s blatant against the law but gotten away with it due to utility.
That is not in any way the biggest defense
It’s worked for many startups and court cases in the past. Copyright even has explicit examples of the utility loophole; look at, say: https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
The problem is when a human company profits off their scrape... this isn't a non-profit running on volunteers, and it's a far cry from autonomous robots learning their way by themselves.
We are discussing an emergent cause that has social and ecological consequences. Servers are power-hungry stuff that may or may not run on a sustainable grid (which itself has a bazillion problems, like solar panel production leaking heavy chemicals, hydroelectric plants destroying their surroundings, etc.), and the current state of hardware production means sweatshops or conflict minerals. And that's before we get to creators' copyright, which is written into the law code of almost every existing country; no artist is making billions out of the abuse of their creation rights (often they are pretty chill about getting their stuff mentioned, remixed, and whatever).
If humanity ever gets to the point where intelligent robots are capable of watching TV like a human can, having to adjust copyright laws seems like the least of problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things?
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
The problem is that we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
> copyright extremists
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than a human.
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
Except LLMs are in no way violating copyright in the true sense of the word. They aren’t spitting out a copy of what they ingested.
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.
AI's approach to copyright is very much "rules for thee but not for me".
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories: if you were to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
100% agree. But now the million-dollar question: how would you deal with AI when it comes to copyright? What rules could we possibly put in place?
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
You have a lot of opinions on AI for somebody who has only read stuff in the public domain
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
Exactly. Also core to the copyright extremists’ delusional train of thought is the fact that they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
Who else would be forever grateful if OpenAI removed all of The Intercept's content permanently and refused to crawl it in the future?
Isn't this the same thing Google has been doing for years with their search engine? The only difference is Google keeps the data internal, whereas OpenAI spits it out to you. But it's still scraped and stored in both cases.
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google, of course, is just as guilty of as OpenAI) completely obsolete the original publication. The argument now is that the derivative work (the LLM model) is transformative, i.e., different enough that it doesn't economically compete with the original. I think it's a losing argument, but we'll see what the courts arrive at.
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it had created something new. An LLM is obfuscated plagiarism. Some claim derivative works, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape this level of scrutiny.
Meanwhile China is using everything available to train their AI models
We don't want to be like China.
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM development as well, on top of manufacturing. Realistically that'll be banned because of national security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
I’m still of the opinion that we should be allowed to train on any data a human can read online.
Yeah, let's stop the progress because a few magazines no one cares about are unhappy.
Maybe just don't use data from the unhappy magazines you don't care about in the first place?
Forecast: OpenAI and The Intercept will settle and OpenAI users will pay for it.
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
Wouldn’t the better method be to throw all your money at one suit you can make an example of, and try to win that one? You can’t effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
Too high risk. Every year you can delay you keep lining your pockets.
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep extracting out-of-court settlements.
the very idea of "this digital asset is exclusively mine" cannot die soon enough
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
let us say no to digital scarcity
This is, in fact, the core value of the hacker ethos. HackerNews.
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
http://www.catb.org/jargon/html/H/hacker-ethic.html
Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
I don't understand what the "hacker ethos" could have to do with defending openai's blatant stealing of people's content for their own profit.
OpenAI is not sharing their data (they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to OpenAI for free?
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.
Luckily, most people seem to ignore OpenAI's hypocritical TOS against training on their outputs. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak them, as happened with NovelAI.
I think an ethical hacker is someone who uses their expertise to help those without.
How could an ethical hacker side with OpenAI, when OpenAI is using its technological expertise to exploit creators without?
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed on an individual level to get justice.
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.
So basically the reasoning is this:
- NYT vs OpenAI: neither is disenfranchised
- OpenAI vs individual creators: creators are disenfranchised
- NYT vs individual model trainers: model trainers are disenfranchised
- Individual model trainers vs individual creators: neither is disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
What "information" are you talking about? It's a text and image generator.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
You need to draw the line somewhere.
Creators freely sharing with attribution requested is different than creations being ruthlessly harvested and repurposed without permission.
https://creativecommons.org/share-your-work/
> freely sharing with attribution requested
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (e.g., "ruthless", "harvested" to mean "copied"). They're not productive to the conversation.
If not respected, some creators will strike, lie flat, not post, or go underground.
Ignoring moral rights of creators is the issue.
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economic value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
> we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here
?!?! Comparing and equating commenting to creative works. ?!?!
These comments are NOT equivalent to the 17 full time months it took me to write a nonfiction book.
Or an 8 year art project.
When I give away my work I decide to whom and how.
oh please, then riddle me this: why does my comment have -1 votes on "hacker" news?
which has indeed turned into "i-am-rich-cuz-i-own-tech-stock" news
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
Yes, I have no idea either. I find it disappointing.
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
Most people aren't anarchists, and think the state is necessary for complex societies to function.
My sense is that the constituency of people who prefer deprecation of the US state is much larger than just anarchists.
Really? Are Food Not Bombs and the IWW that popular where you live?
It's extremely lousy that you have to pre-register copyright.
That would make the USCO a de facto clearinghouse for news.
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
It won't hold up in court, and given that the post office will mail/deliver unsealed letters (which may then be sealed after the fact), it will be viewed rather dimly.
Buy your family member a copy of:
https://www.goodreads.com/book/show/58734571-patent-it-yours...
Surely the NSA will retain a copy which can be checked
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
> There is precedent that you cannot subpoena NSA for their intercepts
I know it's tangential to this thread but could you link to further reading?
but only in a real democracy
Mailing yourself something using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However, it may not provide the protection he needs. Copyright law differs from patent law, and he should seek legal advice.
Even if the date is verifiable, what would it even prove? If it's not public then I don't believe it can count as prior art to begin with.
The US moved to first-to-file years ago. Whoever files first gets it, except that if the inventor publishes publicly there is a one-year inventor's grace period (which would not apply to a self-mail or private mail to other people).
This is patent, not copyright.
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public:
https://en.wikipedia.org/wiki/Notary_public
presumably the intention is to prove the existence of the specific plans at a specific time?
I guess the modern version would be to sha256 the plans and shove it into a bitcoin transaction
good luck explaining that to a judge
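The sha256 half of that is trivial, at least; a minimal sketch, assuming the plans live in a hypothetical plans.pdf (embedding the digest in an OP_RETURN output, or submitting it to a timestamping service like OpenTimestamps, is the part left out):

    # Compute the SHA-256 digest of the plans. The digest is what you'd
    # embed in a Bitcoin OP_RETURN output (or a timestamping service) to
    # prove the document existed at a point in time without revealing it.
    import hashlib

    def sha256_file(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256_file("plans.pdf"))  # hypothetical filename

Even then, the digest only proves existence at a point in time, not authorship or novelty, which is part of why it wouldn't help much with a patent claim anyway.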
Right, you can register before you bring a lawsuit. Pre-registration makes your claim stronger, as does notice of copyright.
> It's hard to see how they can argue with a straight face that it is accidental
It's another instance of "move fast, break things" (i.e. "keep your eyes shut while breaking the law at scale")
Yes, because all progress depends upon the unreasonable man.
That's what I thought too, but why does the article say:
> Infringement suits require that relevant works were first registered with the U.S. Copyright Office (USCO).
OK so it turns out I am wrong here! Cool.
I had it upside down, diametrically wrong, however you want to put it: right that the structures exist, exactly wrong on how they apply.
It is registration that guarantees access to statutory damages:
https://www.justia.com/intellectual-property/copyright/infri...
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
It's so weird to me seeing journalists complaining about copyright and people taking something they did.
The whole of journalism is taking the acts of others and repeating them. Why does a journalist claim rights to someone else's actions when someone simply looks at something they did and repeats it?
If no one else ever did anything, the journalist would have nothing to report; it's inherently about replicating the work and acts of others.
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.
> The whole of journalism is taking the acts of others and repeating them
Hilarious (and depressing) that this is what people think journalists do.
What is a "journalist?" It sounds old-fashioned.
They are "content creators" now.
Or you could just not do illegal and/or immoral things that are worthy of reporting.
This is terribly unpersuasive