But investigative journalism has not disappeared. If anything, it has grown.
Get ahead in terms of what? Do you believe that public domain material and legally available content that doesn't violate copyright is not enough for AI/LLM research, or is the concern purely about commercial interests?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind?
Absolutely: if copyright is slowing down innovation, we should abolish copyright.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship even if it is dressed as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, the presence of which serves only (and only sometimes) to drive content into open secrecy.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
> throw out people's stake in how their lives are run
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold up in court. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAIs computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get GitHub to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.
GitHub has ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR).
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
> If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet you won't see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
Will we see human-washing, where AI art or works get a 'Made by man' final touch in some third-world mechanical turk den? Would that add another financially detracting layer to the AI winter?
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
The law generally takes a dim view of such attempts to get around things like that. AI's biggest defense is claiming they are so beneficial to society that what they are doing is fine.
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
The problem is when a human company profits off its scrape... this isn't a non-profit run by volunteers, and it's a far cry from autonomous robots learning on their own.
We are discussing an emergent issue with social and ecological consequences. Servers are power-hungry machines that may or may not run on a sustainable grid (which has its own pile of problems, like heavy chemicals leaking during solar panel production, hydroelectric plants destroying their surroundings, etc.), and the current state of hardware production involves sweatshops and conflict minerals.
And that's setting aside the violation of creators' copyright, which is written into the legal code of almost every country. No artist is making billions from the abuse of their creative rights (they are often pretty chill about getting their stuff mentioned, remixed, and whatever).
If humanity ever gets to the point where intelligent robots are capable of watching TV like humans can, having to adjust copyright law seems like the least of our problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things.
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
The problem is, we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than a human.
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.
AI's approach to copyright is very much "rules for thee but not for me".
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories: if you were to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
Exactly. Also core to the copyright extremists’ delusional train of thought is the fact that they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
Isn't this the same thing Google has been doing for years with their search engine? Only difference is Google keeps the data internal, whereas OpenAI spits it out to you. But it's still scraped and stored in both cases.
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google of course is just as guilty of as OpenAI) completely obsolete the original publication. The argument now is that the derivative work (the LLM model) is transformative, i.e., different enough that it doesn't economically compete with the original. I think it's a losing argument but we'll see what the courts arrive at.
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it had created something new; an LLM is obfuscated plagiarism. Some claim derivative work, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape that level of scrutiny.
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM development as well, on top of manufacturing. Realistically that'll be banned because of national security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
Wouldn't the better method be to throw all your money at one suit you can make an example of, and try to win that one? You can't effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep settling out of court and make a buck.
the very idea of "this digital asset is exclusively mine" cannot die soon enough
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
This is, in fact, the core value of the hacker ethos. HackerNews.
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
I don't understand what the "hacker ethos" could have to do with defending OpenAI's blatant stealing of people's content for their own profit.
OpenAI is not sharing their data (they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to OpenAI for free?
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.
Luckily, most people seem to ignore OpenAI's hypocritical TOS against using their outputs for training. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak it, like NovelAI did.
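(For the curious: "training on their output" is sequence-level distillation. Below is a minimal sketch of the data-collection half, assuming a hypothetical teacher_generate() wrapper around whatever API is being sampled; the resulting JSONL is what a standard fine-tuning pipeline would consume.)

```python
import json

def teacher_generate(prompt: str) -> str:
    """Hypothetical stand-in for sampling the teacher model's API."""
    raise NotImplementedError  # replace with a real API call

# Prompts the student model should learn to answer.
prompts = [
    "Explain DMCA section 1202 in one paragraph.",
    "Summarize the four fair use factors.",
]

# Collect (prompt, teacher output) pairs. Fine-tuning a student on these
# pairs is the standard sequence-level distillation recipe.
with open("distill.jsonl", "w") as f:
    for p in prompts:
        record = {"prompt": p, "completion": teacher_generate(p)}
        f.write(json.dumps(record) + "\n")
```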
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed at an individual level to get justice.
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.
So basically the reasoning is this:
- NYT vs OpenAI, neither is disenfranchised
- OpenAI vs individual creators, creators are disenfranchised
- NYT vs individual model trainers, model trainers are disenfranchised
- Individual model trainers vs individual creators, neither are disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
What "information" are you talking about? It's a text and image generator.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (eg "ruthless", "harvested" to mean "copied"). It's not productive to the conversation.
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economical value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
Yes, I have no idea either. I find it disappointing.
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
It won't hold up in court, and given that the post-office will mail/deliver unsealed letters (which may then be sealed after the fact), will be viewed rather dimly.
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
Mailing yourself using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However, this may not provide the protection he needs. Copyright law differs from patent law, and he should seek legal advice.
The US moved to first-to-file years ago. Whoever files first gets it, except that if the inventor publishes publicly there is a one-year grace period (it would not apply to a self-mail or private mail to other people).
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public.
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
It's so weird to me seeing journalists complaining about copyright and people taking something they did.
The whole of journalism is taking the acts of others and repeating them, so why does a journalist claim to have rights over someone else's actions when someone simply looks at something they did and repeats it?
If no one else ever did anything, the journalist would have nothing to report; it's inherently about replicating the work and acts of others.
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.
I would trust AI a lot more if it gave answers more like:
“Source A on date 1 said XYX”
“Source B …”
“Synthesizing these, it seems that the majority opinion is X but Y is also a commonly held opinion.”
Instead of what it does now, which is make extremely confident, unsourced statements.
It looks like the copyright lawsuits are rent-seeking as much as anything else; another reason I hate copyright in its current form.
> which is make extremely confident,
One of the results the LLM has available to itself is a confidence value. It should, at the very least, provide this along with its answer. Perhaps if it did, people would stop calling it 'AI'.
My understanding is that this confidence value is not a measure of how likely something is correct/true, but more along the lines of how likely that sentence would be. Including it could be more misleading than helpful, for example if it is repeating commonly misunderstood information.
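(Concretely: what the model can actually expose is per-token likelihood, not truthfulness. A minimal sketch using the OpenAI Python client's logprobs option; the model name and prompt are placeholders, and a high probability here only means "likely continuation".)

```python
import math
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[{"role": "user", "content": "Who invented the telephone?"}],
    logprobs=True,
    top_logprobs=3,
)

# Each entry is the log-probability of the token the model emitted,
# i.e. "how likely is this continuation", not "how likely is this true".
for tok in resp.choices[0].logprobs.content:
    print(f"{tok.token!r}: p={math.exp(tok.logprob):.3f}")
```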
I'm not sure that it's possible to produce anything reasonable in that space. It would need to know how far away it is from correct to provide a usable confidence value; otherwise it'd just be hallucinating a number in the same way as the result.
An analogy. Take a former commuter friend of mine, Mr Skol (named after his favourite breakfast drink). We met on a minibus I had to take to work years ago and shared many interesting conversations. Now, he was a confident expert on everything. If asked to rate his confidence in a subject, it would be a good 95% at least. However, he spoke absolute garbage, because his brain had rotted away from drinking Skol for breakfast, plus the odd crack chaser. I suspect his model was still better than GPT-4o. But an average person could determine the veracity of his arguments.
Thus confidence should be externally rated, as an entity with knowledge cannot necessarily rate itself, for it has bias. Which then brings up the question of how you do that. Well, you'd have to do the research you were going to do anyway and compare. So now you've used the AI and done the research you would have had to do if the AI didn't exist. At this point the AI becomes a cost over benefit if you need any level of confidence and accuracy.
Thus the value is zero unless you need crap information, which is, at least here, never, unless I'm generating a picture of a goat driving a train or something. And I'm not sure that has any commercial value. But it's fun at least.
ChatGPT Search provides this, by the way, though it relies a lot on the quality of Bing search results. Consensus.app does this but for research papers, and has been very useful to me.
More often than not in my experience, clicking these sources takes me to pages that either don’t exist, don’t have the information ChatGPT is quoting, or ChatGPT completely misinterpreted the content.
Interesting. Two key quotes:
> It is unclear if the Intercept ruling will embolden other publications to consider DMCA litigation; few publications have followed in their footsteps so far. As time goes on, there is concern that new suits against OpenAI would be vulnerable to statute of limitations restrictions, particularly if news publishers want to cite the training data sets underlying ChatGPT. But the ruling is one signal that Loevy & Loevy is narrowing in on a specific DMCA claim that can actually stand up in court.
> Like The Intercept, Raw Story and AlterNet are asking for $2,500 in damages for each instance that OpenAI allegedly removed DMCA-protected information in its training data sets. If damages are calculated based on each individual article allegedly used to train ChatGPT, it could quickly balloon to tens of thousands of violations.
Tens of thousands of violations at $2500 each would amount to tens of millions of dollars in damages. I am not familiar with this field, does anyone have a sense of whether the total cost of retraining (without these alleged DMCA violations) might compare to these damages?
If you're going to retrain your model because of this ruling, wouldn't it make sense to remove all DMCA-protected content from your training data instead of just the content you were most recently sued over (especially if it sets a precedent)?
It would make sense from a legal standpoint, but I don't think they could do that without massively regressing their models' performance, to the point that it would jeopardize their viability as a company.
I agree, just want to make sure "they can't stop doing illegal things or they wouldn't be a success" is said out loud instead of left to subtext.
They can't stop doing things some people don't like (people who also won't stop doing things other people don't like). The legality of the claims is questionable which is why most are getting thrown out, but we'll see if this narrow approach works out.
I'm sure there are also a number of easy technical ways to "include" the metadata while mostly ignoring it during training that would skirt the letter of the law if needed.
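(One such trick, sketched under the assumption of a standard PyTorch training loop: keep the metadata tokens in the context so they are technically "included", but mask them out of the loss so the model never learns from them. This is just the usual ignore_index convention, not anything OpenAI is known to actually do.)

```python
import torch
import torch.nn.functional as F

vocab, seq_len, batch = 1000, 8, 2
logits = torch.randn(batch, seq_len, vocab)          # stand-in model output
tokens = torch.randint(0, vocab, (batch, seq_len))   # input token ids
is_metadata = torch.zeros(batch, seq_len, dtype=torch.bool)
is_metadata[:, :3] = True  # pretend the first 3 tokens are title/author metadata

# Copy the tokens as targets, but set metadata positions to -100 so
# cross_entropy skips them entirely: "included" in the context,
# invisible to the gradient.
targets = tokens.clone()
targets[is_metadata] = -100

loss = F.cross_entropy(
    logits.reshape(-1, vocab),
    targets.reshape(-1),
    ignore_index=-100,
)
print(loss)
```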
If we really want to be technical, in common law systems anything is legal as long as the highest court to challenge it decides it's legal.
I guess I should have used the phrase "common sense stealing in any other context" to be more precise?
I wonder if they can say something like "we aren't scraping your protected content, we are merely scraping this old model we don't maintain anymore, and it happened to have protected content in it from before the ruling." Then you've essentially won all of humanity's output, as you can already scrape the new primary information (scientific articles and other datasets designed for researchers to freely access), and whatever junk the content mills output is just going to be a poor summarization of that primary information.
Other factors help this effort of an old model plus new public-facing data being complete: other forms of media, like storytelling and music, have already converged on certain prevailing patterns. For stories, we expect a certain style of plot development and complain when it's missing or not as we expect. For music, most anything being listened to is lyrics no one is deeply reading into, put over the same old chord progressions we've always had. For art, there are just too few of us actually going out of our way to get familiar with novel art, versus the vast bulk of the world's present-day artistic effort, which goes toward product advertisement, which once again follows patterns people have been publishing in psychology journals for decades now.
In a sense we've already put out enough data and made enough of our world formulaic that I believe we've set up a perfect singularity already, in terms of what can be generated for the average person who looks at a screen today. And because of that I think even a total lack of new training on such content wouldn't hurt OpenAI at all.
They might make it work by (1) having lots of public domain content, for the purpose of training their models on basic language use, and (2) preserving source/attribution metadata about what copyrighted content they do use, so that the models can surface this attribution to the user during inference. Even if the latter is not 100% foolproof, it might still be useful in most cases and show good faith intent.
The latter one is possible with RAG solutions like ChatGPT Search, which do already provide sources! :)
But for inference in general, I'm not sure it makes too much sense. Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc. Which is kind of too fundamental to be attributed to, IMO. (Attribution: Humanity)
But who knows. Maybe it can be done for more fact-like stuff.
> Training data is not just about learning facts, but also (mainly?) about how language works, how people talk, etc.
All of that and more, all at the same time.
Attribution at inference level is bound to work more or less the same way humans attribute things during conversations: "As ${attribution} said, ${some quote}", or "I remember reading about it in ${attribution-1} - ${some statements}; ... or maybe it was in ${attribution-2}?...". Such attributions are often wrong, as people hallucinate^Wmisremember where they saw or heard something.
RAG obviously can work for this, as well as other solutions involving retrieving, finding or confirming sources. That's just like when a human actually looks up the source when citing something - and has similar caveats and costs.
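(A toy illustration of the retrieval half, using TF-IDF instead of a real embedding model so it stays self-contained; the corpus and source names are made up. The point is that attribution comes from the retrieval index, not from the model's weights.)

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Tiny stand-in corpus; a real system would store chunked documents
# with their provenance alongside.
docs = [
    ("The Intercept sued OpenAI under 17 USC 1202.", "theintercept.com"),
    ("Raw Story and AlterNet seek $2,500 per violation.", "niemanlab.org"),
    ("Fair use weighs market harm among other factors.", "copyright.gov"),
]

texts = [text for text, _ in docs]
vec = TfidfVectorizer().fit(texts)
doc_matrix = vec.transform(texts)

def retrieve(query: str, k: int = 2):
    """Return the k most similar (text, source) pairs for a query."""
    sims = cosine_similarity(vec.transform([query]), doc_matrix)[0]
    return [(texts[i], docs[i][1]) for i in sims.argsort()[::-1][:k]]

# The retrieved pairs get prepended to the prompt, so the model can cite
# its sources: attribution survives because we kept it in the index.
for text, source in retrieve("What damages are the publishers asking for?"):
    print(f"[{source}] {text}")
```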
Only half-serious, but: I wonder if they can dance with the publishers around this issue long enough for most of the contested text to become part of public court records, and then claim they're now training off that. <trollface>
Being part of a public court record doesn't seem like something that would invalidate copyright.
What about "bombing"? Couldn't you smuggle DMCA-protected content into training sets hoping for a payout?
The onus is on the person collecting massive amounts of data and circumventing DMCA protections to ensure they're not doing anything illegal.
"well someone snuck in some DMCA content" when sharing family photos and doesn't suddenly make it legal to share that DMCA protected content with your photos...
But all content is DMCA-protected. Avoiding copyrighted content means not having content, as all material is automatically copyrighted. One would be limited to licensed content, which is another minefield.
The apparent loophole is the gap between copyrighted work and copyrighted work that is also registered. But registration can occur at any time, meaning there is little practical difference. Unless you have perfect licenses for all your training data, which nobody does, you have to accept the risk of copyright suits.
Yes, that's how every other industry that redistributes content works.
You have to license content you want to use, you cant just use it for free because it's on the internet.
Netflix doesn't just start hosting shows and hope they don't get a copyright suit...
Re-training can be done, but, and it is not a small but, models already exist and can be run locally, suggesting that the milk was spilled too long ago at this point. Separately, neutering them effectively lowers their value compared to their non-neutered counterparts.
Is there a way to figure out if OpenAI ingested my blog? (A crude probe is sketched below.) If the settlements are $2,500 per article then I'll take a free used car's worth of payments if it's available.
I suppose the cost of legal representation would cancel it out. I can just imagine a class action where anyone who posted on blogger.com between 2002 and 2012 eventually gets a check for 28 dollars.
If I were more optimistic I could imagine a UBI funded by lawsuits against AGI, some combination of lost wages and intellectual property infringement. Can't figure out exactly how much more impact an article on The Intercept had on shifting the weights than your Hacker News comments; might as well just pay everyone equally, since we're all equally screwed.
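(On the "did OpenAI ingest my blog" question above: there is no definitive outside test, but one crude probe is to feed the model the first half of an old post and measure how closely its continuation tracks the real second half. A sketch, with generate() as a hypothetical stand-in for whatever model API you would query; only scores near 1.0 would hint at memorization.)

```python
from difflib import SequenceMatcher

def generate(prompt: str) -> str:
    """Hypothetical stand-in for a call to the model under test."""
    raise NotImplementedError  # replace with a real API call

def regurgitation_score(article: str) -> float:
    """Prompt with the first half; compare output to the real second half."""
    half = len(article) // 2
    prefix, suffix = article[:half], article[half:]
    continuation = generate(prefix)[: len(suffix)]
    return SequenceMatcher(None, continuation, suffix).ratio()

# Near-1.0 scores suggest memorization; ordinary paraphrase-level
# similarity proves nothing either way.
```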
If you posted on blogger.com (or any platform with enough money to hire lawyers) you probably gave them a license that is irrevocable, non-exclusive and able to be sublicensed.
There are reasons for that (they need a license to show it on the platform) but usually these agreements are overly broad because everyone except the user is covering their ass too much.
Those licenses will now be used to sell that content/data for purposes that nobody thought about when you started your account.
Wouldn't the point of the class action be to dilute the cost of representation? If the damages per article are high and there are plenty of class members, I imagine the limit would be how much OpenAI is actually able to pay out.
I understand that regulations exist and that there can be copyright violations, but shouldn't we be concerned that other, more lenient governments (mainly China) that are opposed to the US will use this to get ahead, if OpenAI is significantly set back?
No. OpenAI is suspected to be worth over $150B. They can absolutely afford to pay people for data.
Edit: People commenting need to understand that $150B is the discounted value of future revenues. So... yes they can pay out... yes they will be worth less... and yes that's fair to the people who created the information.
I can't believe there are so many apologists on HN for what amounts to vacuuming up people's data for financial gain.
The OpenAI that is assumed to be able to keep harvesting every form of IP without compensation is valued at $150B; an OpenAI that has to pay for data would be worth significantly less. They're currently not even expecting to turn a profit until 2029, and that's without paying for data.
https://finance.yahoo.com/news/report-reveals-openais-44-bil...
OpenAI is not profitable, and to achieve what they have achieved they had to scrape basically the entire internet. I don't have a hard time believing that OpenAI could not exist if they had to respect copyright.
https://www.cnbc.com/2024/09/27/openai-sees-5-billion-loss-t...
Technically OpenAI has respected copyright, except in the (few) instances where they produce non-fair-use amounts of copyrighted material.
The DMCA does not cover scraping.
That's not real money though. You need actual cash on hand to pay for stuff, and OpenAI only has the money they've been given by investors. I suspect that many of the investors wouldn't have been so keen if they knew that OpenAI would need an additional couple of billion a year to pay for data.
That doesn’t mean they have $150B to hand over. What you can cite is the $10 billion they got from Microsoft.
I’m sure they could use a chunk of that to buy competitive I.P. for both companies to use for training. They can also pay experts to create it. They could even sell that to others for use in smaller models to finance creating or buying even more I.P. for their models.
This type of argument is ignorant, cowardly, shortsighted, and regressive. Both technology and society will progress when we find a formula that is sustainable and incentivizes everyone involved to maximize their contributions without it all blowing up in our faces someday. Copyright law is far from perfect, but it protects artists who want to try and make a living from their work, and it incentivizes creativity that places without such protections usually end up just imitating.
When we find that sustainable framework for AI, China or <insert-boogeyman-here> will just end up imitating it. Idk what harms you're imagining might come from that ("get ahead" is too vague to mean anything), but I just want to point out that that isn't how you become a leader in anything. Even worse, if they are the ones who find that formula first while we take shortcuts to "get ahead", then we will be the ones doing the imitation in the end.
Copyright is a dead man walking and that's a good thing. Let's applaud the end of a temporary unnatural state of affairs.
Should we also be concerned that other governments use slave labor (among other human rights violations) and will use that to get ahead?
It's hysterical to compare training an ML model with slave labour. It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
> It's hysterical to compare training an ML model with slave labour.
Nobody did that.
> It's perfectly fine and accepted for a human to read and learn from content online without paying anything to the author when that content has been made available online for free, it's absurd to assert that it somehow becomes a human rights violation when the learning is done by a non-biological brain instead.
It makes sense. There is always scale to consider in these things.
Isn't it a greater risk that creators lose their income and nobody is creating the content anymore?
Take for instance what has happened to news because of the internet. Not exactly the same, but similar forces at work. It turned into a race to the bottom, with everyone trying to generate content as cheaply as possible to get maximum engagement while tech companies siphoned the revenue. Expensive, investigative pieces from educated journalists disappeared in favor of stuff that looks like spam. Pre-internet news was higher quality.
Imagine that same effect happening to all content: art, writing, academic pieces. It's a real risk that OpenAI has peaked in quality.
Lots of people create without getting paid to do it. A lot of music and art is unprofitable. In fact, you could argue that when the mainstream media companies got completely captured by suits with no interest in the things their companies invested in, that was when creativity died and we got consigned to genre-box superhero pop hell.
I don’t know. When I look at news from before, there never was investigative journalism. It was all opinion-swaying editorials, until alternate voices voiced their counternarratives. It’s just not in newspapers, because they are too politically biased to produce the two sides of stories that we’ve always asked them to; it’s on other media.
But investigative journalism has not disappeared. If anything, it has grown.
Get ahead in terms of what? Do you believe that the material in public domain or legally available content that doesn't violate copyrights is not enough to research AI/LLMs or is the concern about purely commercial interests?
China also supposedly has abusive labor practices. So, should other countries start relaxing their labor laws to avoid falling behind?
Absolutely: if copyright is slowing down innovation, we should abolish copyright.
Not just turn a blind eye when it's the right people doing it. They don't even have a legal exemption passed by Congress - they're just straight-up breaking the law and getting away with it. Which is how America works, I suppose.
Exactly. They rushed to violate copyright on a massive scale quickly, and now are making the argument that it shouldn't apply to them and they couldn't possibly operate in compliance with it. As long as humans don't get to ignore copyright, AI shouldn't either.
ChatGPT doesn't violate copyright, it's a software application. "Open"AI does, it's a company run by humans (for now).
Humans do get to ignore copyright, when they do the same thing OpenAI has been doing.
Exactly.
Should I be paying a proportion of my salary to all the copyright holders of the books, songs, TV shows and movies I consumed during my life?
If a Hollywood writer says she "learnt a lot about writing by watching the Simpsons" will Fox have an additional claim on her earnings?
Yeah it turns out humans have more rights than computer programs and tech startups.
I'm more concerned that some people in the tech world are conflating Sam Altman's interest with the national interest.
Am I jazzed about Sam Altman making billions? No.
Am I even more concerned about the state having control over the future corpus of knowledge via this doomed-in-any-case vector of "intellectual property"? Yes.
I think it will be easier to overcome the influence of billionaires when we drop the pretext that the state is a more primal force than the internet.
100% disagree. "It'll be fine bro" is not a substitute for having a vote over policy decisions made by the government. What you're talking about has a name. It starts with F and was very popular in Italy in the early to mid 20th century.
Rapidity of Godwin's law notwithstanding, I'm not disputing the importance of equity in decision-making. But this matter is more complex than that: it's obvious that the internet doesn't tolerate censorship, even when it is dressed as intellectual property. I prefer an open and democratic internet to one policed by childish legacy states, whose presence serves only (and only sometimes) to drive content into open secrecy.
It seems particularly unfair to equate any questioning of the wisdom of copyright laws (even when applied in situations where we might not care for the defendant, as with this case) with fascism.
It's not Godwin's law when it's correct. Just because it's cool and on the Internet doesn't mean you get to throw out people's stake in how their lives are run.
> throw out people's stake in how their lives are run
FWIW, you're talking to a professional musician. Ostensibly, the IP complex is designed to protect me. I cannot fathom how you can regard it as the "people's stake in how their lives are run". Eliminating copyright will almost certainly give people more control over their digital lives, not less.
> It's not Godwin's law when it's correct.
Just to be clear, you are doubling down on the claim that sunsetting copyright laws is tantamount to nazism?
The claim that's being allowed to proceed is under 17 USC 1202, which is about stripping metadata like the title and author. Not exactly "core copyright violation". Am I missing something?
I read the headline as the copyright violation claim being core to the lawsuit.
The plaintiffs focused on exactly this part - removal of metadata - probably because it's the most likely to hold in courts. One judge remarked on it pretty explicitly, saying that it's just a proxy topic for the real issue of the usage of copyrighted material in model training.
I.e., it's some legalese trick, but "everyone knows" what's really at stake.
Violations of 17 USC 1202 can be punished pretty severely. It's not about just money, either.
If, during the trial, the judge thinks that OpenAI is going to be found to be in violation, he can order all of OpenAI's computer equipment be impounded. If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.
Whether you call that "core" or not, OpenAI cannot afford to lose these parts that are left of this lawsuit.
> he can order all of OpenAI's computer equipment be impounded.
Arrrrr matey, this is going to be fun.
People have been complaining about the DMCA for 2+ decades now. I guess it's great if you are on the winning side. But boy does it suck to be on the losing side.
And normal people can't get on the winning side. I'm trying to get Github to DMCA my own repositories, since it blocked my account and therefore I decided it no longer has the right to host them. Same with Stack Exchange.
GitHub's ignored me so far, and Stack Exchange explicitly said no (then I sent them an even broader legal request under GDPR)
When you uploaded your code to GitHub you granted them a license to host it. You can’t use DMCA against someone who’s operating within the parameters of the license you granted them.
Their stance is that GitHub revoked that license by blocking their account.
It won't happen. Judges only order that punishment for the little guys.
“If OpenAI is found to be in violation, he can then order permanent destruction of the models and OpenAI would have to start over from scratch in a manner that doesn't violate the law.”
That is exactly why I suggested companies train some models on public domain and licensed data. That risk disappears or is very minimal. They could also be used for code and synthetic data generation without legal issues on the outputs.
That's what Adobe and Getty Images are doing with their image generation models, both are exclusively using their own licensed stock image libraries so they (and their users) are on pretty safe ground.
That’s good. I hope more do. This list has those doing it under the Fairly Trained banner:
https://www.fairlytrained.org/certified-models
The problem is that you don't get the same quality of data if you go about it that way. I love ChatGPT and I understand that we're figuring out this new media landscape but I really hope it doesn't turn out to neuter the models. The models are really well done.
If I steal money, I can get way more done than I do now by earning it legally. Yet, you won’t see me regularly dismissing legitimate jobs by posting comparisons to what my numbers would look like if I were stealing I.P.
We must start with moral and legal behavior. Within that, we look at what opportunities we have. Then, we pick the best ones. Those we can’t have are a side effect of the tradeoffs we’ve made (or tolerated) in our system.
That is OpenAI's problem, not their victims'.
Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
It seems to me that it shouldn't really affect model quality all that much, should it?
Also, in the amended complaint:
> not to notify ChatGPT users when the responses they received were protected by journalists’ copyrights
Wasn't it already quite clear that as long as the articles weren't replicated, it wasn't protected? Or is that still being fought in this case?
In the decision:
> I agree with Defendants. Plaintiffs allege that ChatGPT has been trained on "a scrape of most of the internet," Compl. ¶ 29, which includes massive amounts of information from innumerable sources on almost any given subject. Plaintiffs have nowhere alleged that the information in their articles is copyrighted, nor could they do so. When a user inputs a question into ChatGPT, ChatGPT synthesizes the relevant information in its repository into an answer. Given the quantity of information contained in the repository, the likelihood that ChatGPT would output plagiarized content from one of Plaintiffs' articles seems remote. And while Plaintiffs provide third-party statistics indicating that an earlier version of ChatGPT generated responses containing significant amounts of plagiarized content, Compl. ¶ 5, Plaintiffs have not plausibly alleged that there is a "substantial risk" that the current version of ChatGPT will generate a response plagiarizing one of Plaintiffs' articles.
> Also, is there really any benefit to stripping author metadata? Was it basically a preprocessing step?
Have you read 1202? It's all about hiding your infringement.
Will we see human-washing, where AI art or works get a "Made by man" final touch in some third-world Mechanical Turk den? Would that add another financially detracting layer to the AI winter?
That will probably happen to some extent if not already. However I think people will just stop publishing online if malicious corps like OpenAI are just going to harvest works for their own gain. People publish for personal gain, not to enrich the public or enrich private entities.
However, I get my personal gain regardless of whether or not the text is also ingested into ChatGPT.
In fact, since I use ChatGPT a lot, I get more gain if it is.
There's no point in having third world mechanical turk dens do finishing passes on AI output unless you're trying to make it worse.
Artists are already using AI to photobash images, and writers are using AI to outline and create rough drafts. The point of having a human in the loop is to tell the AI what is worth creating, then recognize where the AI output can be improved. If we have algorithms telling the AI what to make and content mill hacks smearing shit on the output to make it look more human, that would be the worst of both worlds.
The law generally takes a dim view of attempts to get around things like that. AI's biggest defense is claiming it is so beneficial to society that what it is doing is fine.
That argument stands on the mother of all slippery slopes! Just find a way to make your product impressive or ubiquitous and all of a sudden it doesn't matter how much you break the law along the way? That's so insane I don't even know where to start.
Why not, considering copyright law specifically has fair use outlined for that kind of thing? It's not some overriding consequence of law, it's that copyright is a granting of a privilege to individuals and that that privilege is not absolute.
Worked for Purdue.
YouTube, AirBnB, Uber, and many many others have all done stuff that’s blatant against the law but gotten away with it due to utility.
That is not in any way the biggest defense
It’s worked for many startups and court cases in the past. Copyright even has explicit examples of the utility loophole; look at, say: https://en.wikipedia.org/wiki/Sony_Corp._of_America_v._Unive....
Eventually we're going to have embodied models capable of live learning and it'll be extremely apparent how absurd the ideas of the copyright extremists are. Because in their world, it'd be illegal for an intelligent robot to watch TV, read a book or browse the internet like a human can, because it could remember what it saw and potentially regurgitate it in future.
You have to understand, the media companies don't give a shit about the logic, in fact I'm sure a lot of the people pushing the litigation probably see the absurdity of it. This is a business turf war, the stated litigation is whatever excuse they can find to try and go on the offensive against someone they see as a potential threat. The pro copyright group (big media) sees the writing on the wall, that they're about to get dunked on by big tech, and they're thrashing and screaming because $$$.
The problem is when a human company profits off their scrape... this isn't a non-profit running on volunteers, and it's a far cry from autonomous robots learning their way by themselves.
We are discussing an emergent cause that has social and ecological consequences. Servers are power-hungry stuff that may or may not run on a sustainable grid (which itself has a bazillion problems, like solar panel production leaking heavy chemicals, hydroelectric plants destroying their surroundings, etc.), and the current state of hardware production means sweatshops or conflict minerals. And that's before we get to creators' copyright, which is written into the law code of almost every existing country; no artist is making billions out of the abuse of their creation rights (often they are pretty chill about getting their stuff mentioned, remixed, and whatever).
If humanity ever gets to the point where intelligent robots are capable of watching TV like a human can, having to adjust copyright laws seems like the least of problems. How about having to adjust almost every law related to basic "human" rights, ownership, being able to establish a contract, being responsible for crimes, and endless other things?
But for now your washing machine cannot own other things, and you owning a washing machine isn't considered slavery.
The problem is that we can't come up with a solution where both parties are happy, because in the end consumers choose one (getting information from news agencies) or the other (getting information from ChatGPT). So both are fighting for their lives.
> copyright extremists
It's not copyright "extremism" to expect a level playing field. As long as humans have to adhere to copyright, so should AI companies. If you want to abolish copyright, by all means do, but don't give AI a special exemption.
It's actually the opposite of what you're saying. I can 100% legally do all the things that they're suing OpenAI for. Their whole argument is that the rules should be different when a machine does it than a human.
Only because it would be unconscionable to apply copyright to actual human brains, so we don't. But, for instance, you absolutely can commit copyright violation by reading something and then writing something very similar, which is one reason why reverse engineering commonly uses clean-room techniques. AI training is in no way a clean room.
Except LLMs are in no way violating copyright in the true sense of the word. They aren’t spitting out a copy of what they ingested.
Go make a movie using the same plot as a Disney movie, that doesn't copy any of the text or images of the original, and see how far "not spitting out a copy" gets you in court.
AI's approach to copyright is very much "rules for thee but not for me".
That might get you pretty far in court, actually. You'd have to be pretty close in terms of the sequence of events, character names, etc. Especially considering how many Disney movies are based on pre-existing stories: if you were to, say, make a movie featuring talking animals that more or less followed the plot of Hamlet, you would have a decent chance of prevailing in court, given the resources to fight their army of lawyers.
100% agree. But now the million-dollar question: how would you deal with AI when it comes to copyright? What rules could we possibly put in place?
The same rules we already have: follow the license of whatever you use. If something doesn't have a license, don't use it. And if someone says "but we can't build AI that way!", too bad, go fix it for everyone first.
You have a lot of opinions on AI for somebody who has only read stuff in the public domain
Leaving aside the hypothetical "live learning AGI" of the future (given that money is made or lost now), would a human regurgitating content that is not theirs - but presented as if it is - be acceptable to you?
I don't know about you but my friends don't tell me that Joe Schmoe of Reuters published a report that said XYZ copyright XXXX. They say "XYZ happened."
Exactly. Also core to the copyright extremists’ delusional train of thought is the fact that they don’t seem to understand (or admit) that ingesting, creating a model, and then outputting based on that model is exactly what people do when they observe others’ works and are inspired to create.
Who else would be forever grateful if OpenAI removed all of The Intercept's content permanently and refused to crawl it in the future?
Isn't this the same thing Google has been doing for years with their search engine? The only difference is Google keeps the data internal, whereas OpenAI spits it out to you. But it's still scraped and stored in both cases.
A component of fair use is to what degree the derivative work displaces the original. Google's argument has always been that they direct traffic to the original, whereas AI summaries (which Google, of course, is just as guilty of as OpenAI) completely obsolete the original publication. The argument now is that the derivative work (the LLM model) is transformative, i.e., different enough that it doesn't economically compete with the original. I think it's a losing argument, but we'll see what the courts arrive at.
Is this specific to AI or specific to summaries in general? Do summaries, like the ones found in Wikipedia or Cliffs Notes, not have the same effect of making it such that people no longer have to view the original work as much?
Note: do you mean the model is transformative, or the summaries are transformative? I think your comment holds up either way but I think it's better to be clear which one you mean.
In my opinion (not a lawyer), Google at least references where it obtained the data and did not regurgitate it as if it had created something new. An LLM is obfuscated plagiarism. Some claim derivative works, but I have always seen that as quite a stretch. People here expect me to cite references, yet LLMs somehow escape this level of scrutiny.
Meanwhile China is using everything available to train their AI models
We don't want to be like China.
Fair. But I made a comment somewhere else that, if their models become better than ours, they'll be incorporated into products. Then we're back to being dependent on China for LLM development as well, on top of manufacturing. Realistically that'll be banned because of national security laws or something, but companies tend to choose the path of "best and cheapest" no matter what.
I’m still of the opinion that we should be allowed to train on any data a human can read online.
Yeah, let's stop the progress because a few magazines no one cares about are unhappy.
Maybe just don't use data from the unhappy magazines you don't care about in the first place?
Forecast: OpenAI and The Intercept will settle and OpenAI users will pay for it.
Yep, the game plan is to keep settling out of court so that (they hope) no legal precedent is set that would effectively make their entire business model illegal. That works until they run out of money I guess, but they probably can't keep it up forever.
Wouldn’t the better method be to throw all your money at one suit you can make an example of, and try to win that one? You can’t effectively settle every single suit if you have no realistic chance of winning; otherwise every single publisher on the internet will come and try to get their money.
That's a good strategy, but you have to have the right case. One where OpenAI feels confident they can win and establish favorable precedent. If the facts of the case aren't advantageous, it's probably not worth the risk.
Too high risk. Every year you can delay you keep lining your pockets.
Side question: why don't other companies get the same attention? Anthropic, xAI and others have deep pockets, and scraped the same data, I'm assuming? It could be a gold mine for all these news agencies to keep extracting out-of-court settlements.
the very idea of "this digital asset is exclusively mine" cannot die soon enough
let real physically tangible assets keep the exclusivity problem
let's not undo the advantages unlocked by the digital internet; let us prevent a few from locking down this grand boon of digital abundance such that the problem becomes saturation of data
let us say no to digital scarcity
This is, in fact, the core value of the hacker ethos. HackerNews.
> The belief that information-sharing is a powerful positive good, and that it is an ethical duty of hackers to share their expertise by writing open-source code and facilitating access to information and to computing resources wherever possible.
> Most hackers subscribe to the hacker ethic in sense 1, and many act on it by writing and giving away open-source software. A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
http://www.catb.org/jargon/html/H/hacker-ethic.html
Perhaps if the Internet didn't kill copyright, AI will. (Hyperbole)
(Personally my belief is more nuanced than this; I'm fine with very limited copyright, but my belief is closer to yours than the current system we have.)
I don't understand what the "hacker ethos" could have to do with defending openai's blatant stealing of people's content for their own profit.
OpenAI is not sharing their data (they're keeping it private to profit off of), so how could it be anywhere near the "hacker ethos" to believe that everyone else needs to hand over their data to OpenAI for free?
Following the "GNU-flavour hacker ethos" as described, one concludes that it is right for OpenAI to copy data without restriction, it is wrong for NYT to restrict others from using their data, and it is also wrong for OpenAI to restrict the sharing of their model weights or outputs for training.
Luckily, most people seem to ignore OpenAI's hypocritical TOS against training on their outputs. I would go one step further and say that they should share the weights completely, but I understand there are practical issues with that.
Luckily, we can kind of "exfiltrate" the weights by training on their output. Or wait for someone to leak them, as happened with NovelAI.
I think an ethical hacker is someone who uses their expertise to help those without.
How could an ethical hacker side with OpenAI, when OpenAI is using its technological expertise to exploit creators without?
I won't necessarily argue against that moral view, but in this case it is two large corporations fighting. One has the power of tech, the other has the power of the state (copyright). So I don't think that applies in this case specifically.
Aren't you ignoring that common law is built on precedent? If they win this case, that makes it a lot easier for people whose copyright is being infringed on an individual level to get justice.
You're correct, but I think many don't realize how many small model trainers and fine-tuners there are currently. For example, PonyXL, or the many models and fine-tunes on CivitAI made by hobbyists.
So basically the reasoning is this:
- NYT vs OpenAI: neither is disenfranchised
- OpenAI vs individual creators: creators are disenfranchised
- NYT vs individual model trainers: model trainers are disenfranchised
- Individual model trainers vs individual creators: neither is disenfranchised
And if only one can win, and since the view is that information should be free, it biases the argument towards the model trainers.
What "information" are you talking about? It's a text and image generator.
Your argument is that it's okay to scrape content when you are an individual. It doesn't change the fact those individuals are people with technical expertise using it to exploit people without.
If they wrote a bot to annoy people but published how many people got angry about it, would you say it's okay because that is information?
You need to draw the line somewhere.
Creators freely sharing with attribution requested is different than creations being ruthlessly harvested and repurposed without permission.
https://creativecommons.org/share-your-work/
> freely sharing with attribution requested
If I share my texts/sounds/images for free, harvesting and regurgitating them omits the requested attribution. Even the most permissive CC license (excluding CC0 public domain) still requires an attribution.
> A few go further and assert that all information should be free and any proprietary control of it is bad; this is the philosophy behind the GNU project.
In this view, the ideal world is one where copyright is abolished (but not moral rights). So piracy is good, and datasets are also good.
Asking creators to license their work freely is simply a compromise due to copyright unfortunately still existing. (Note that even if creators don't license their work freely, this view still permits you to pirate or mod it against their wishes.)
(My view is not this extreme, but my point is that this view was, and hopefully is, still common amongst hackers.)
I will ignore the moralizing words (e.g., "ruthless", "harvested" to mean "copied"). They're not productive to the conversation.
If not respected, some creators will strike, lie flat, not post, or go underground.
Ignoring moral rights of creators is the issue.
Moral rights involve the attribution of works where reasonable and practical. Clearly doing so during inference is not reasonable or practical (you'll have to attribute all of humanity!) but attributing individual sources is possible and is already being done in cases like ChatGPT Search.
So I don't think you actually mean moral rights, since it's not being ignored here.
But the first sentence of your comment still stands regardless of what you meant by moral rights. To that, well... we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here.
And yes, funding is a thing, which I agree needs copyright for the most part unfortunately. But does training AI on, for example, a book really reduce the need to buy the book, if it is not reproduced?
Remember, training is not just about facts, but about learning how humans talk, how languages work, how books work, etc. Learning that won't reduce the book's economic value.
And yes, summaries may reduce the value. But summaries already exist. Wikipedia, Cliff's Notes. I think the main defense is that you can't copyright facts.
> we're still commenting here, are we not? Despite it with almost 100% certainty being used to train AI. We're still here
?!?! Comparing and equating commenting to creative works. ?!?!
These comments are NOT equivalent to the 17 full time months it took me to write a nonfiction book.
Or an 8 year art project.
When I give away my work I decide to whom and how.
oh please, then riddle me this: why does my comment have -1 votes on "hacker" news?
which has indeed turned into "i-am-rich-cuz-i-own-tech-stock" news
I did not contribute a vote either way to your comment above, but I would point out that you get more of what you reward. Maybe the reward is monetary, like an author paid for spending their life writing books. Maybe it’s smaller, more reputational or social—like people who generate thoughtful commentary here, or Wikipedia’s editors, or hobbyists’ forums.
When you strip people’s names from their words, as the specific count here charges; and you strip out any reason or even way for people to reward good work when they appreciate it; and you put the disembodied words in the mouth of a monolithic, anthropomorphized statistical model tuned to mimic a conversation partner… what type of thought is it that becomes abundant in this world you propose, of “data abundance”?
In that world, the only people who still have incentive to create are the ones whose content has negative value, who make things people otherwise wouldn’t want to see: advertisers, spammers, propagandists, trolls… where’s the upside of a world saturated with that?
Yes, I have no idea either. I find it disappointing.
I think people simply like it when data is liberated from corporations, but hate it when data is liberated from them. (Though this case is a corporation too so idk. Maybe just "AI bad"?)
I think you'll find that most people aren't comfortable with this in practice. They'd like e.g. the state to be able to keep secrets, such as personal information regarding citizens and the stuff foreign spies would like to copy.
Obviously we're all impacted in these perceptions by our bubbles, but it would surprise me at this particular moment in the history of US politics to find that most people favor the existence of the state at all, let alone its ability to keep secret personal information regarding citizens.
Most people aren't anarchists, and think the state is necessary for complex societies to function.
My sense is that the constituency of people who prefer deprecation of the US state is much larger than just anarchists.
Really? Are Food Not Bombs and the IWW that popular where you live?
It's extremely lousy that you have to pre-register copyright.
That would make the USCO a de facto clearinghouse for news.
You don't have to pre-register copyright in any Berne Convention countries. Your copyright exists from the moment you create something.
(ETA: This paragraph below is diametrically wrong. Sorry.)
AFAIK in the USA, registered copyright is necessary if you want to bring a lawsuit and get more than statutory damages, which are capped low enough that corporations do pre-register work.
Not the case in all Berne countries; you don't need this in the UK for example, but then the payouts are typically a lot lower in the UK. Statutory copyright payouts in the USA can be enough to make a difference to an individual author/artist.
As I understand it, OpenAI could still be on the hook for up to $150K per article if it can be demonstrated it is wilful copyright violation. It's hard to see how they can argue with a straight face that it is accidental. But then OpenAI is, like several other tech unicorns, a bad faith manufacturing device.
You seem to know more about this than me. I have a family member who "invented" some electronics things. He hasn't done anything with the inventions (I'm pretty sure they're quackery).
But to ensure his patent, he mailed himself a sealed copy of the plans. He claims the postage date stamp will hold up in court if he ever needs it.
Is that a thing? Or is it just more tinfoil business? It's hard to tell with him.
It won't hold up in court, and given that the post office will mail/deliver unsealed letters (which may then be sealed after the fact), it will be viewed rather dimly.
Buy your family member a copy of:
https://www.goodreads.com/book/show/58734571-patent-it-yours...
Surely the NSA will retain a copy which can be checked
Even if they did, it in fact cannot be checked. There is precedent that you cannot subpoena NSA for their intercepts, because exactly what has been intercepted and stored is privileged information.
> There is precedent that you cannot subpoena NSA for their intercepts
I know it's tangential to this thread but could you link to further reading?
but only in a real democracy
Mailing yourself something using registered mail is a very old tactic to establish a date for your documents using an official government entity, so this can be meaningful in court. However, it may not provide the protection he needs. Copyright law differs from patent law, and he should seek legal advice.
Even if the date is verifiable, what would it even prove? If it's not public then I don't believe it can count as prior art to begin with.
The US moved to first-to-file years ago. Whoever files first gets it, except that if the inventor publishes publicly there is a one-year inventor's grace period (which would not apply to a self-mail or private mail to other people).
This is patent, not copyright.
Honestly I don't know whether that actually is a meaningful thing to do anymore; it may be with patents.
It certainly used to be a legal device people used.
Essentially it is low-budget notarisation. If your family member believes they have something which is timely and valuable, it might be better to seek out proper legal notarisation, though -- you'd consult a Notary Public:
https://en.wikipedia.org/wiki/Notary_public
presumably the intention is to prove the existence of the specific plans at a specific time?
I guess the modern version would be to sha256 the plans and shove it into a bitcoin transaction
good luck explaining that to a judge
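The sha256 half of that is trivial, at least; a minimal sketch, assuming the plans live in a hypothetical plans.pdf (embedding the digest in an OP_RETURN output, or submitting it to a timestamping service like OpenTimestamps, is the part left out):

    # Compute the SHA-256 digest of the plans. The digest is what you'd
    # embed in a Bitcoin OP_RETURN output (or a timestamping service) to
    # prove the document existed at a point in time without revealing it.
    import hashlib

    def sha256_file(path: str) -> str:
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(8192), b""):
                h.update(chunk)
        return h.hexdigest()

    print(sha256_file("plans.pdf"))  # hypothetical filename

Even then, the digest only proves existence at a point in time, not authorship or novelty, which is part of why it wouldn't help much with a patent claim anyway.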
Right, you can register before you bring a lawsuit. Pre-registration makes your claim stronger, as does notice of copyright.
> It's hard to see how they can argue with a straight face that it is accidental
It's another instance of "move fast, break things" (i.e. "keep your eyes shut while breaking the law at scale")
Yes, because all progress depends upon the unreasonable man.
That's what I thought too, but why does the article say:
> Infringement suits require that relevant works were first registered with the U.S. Copyright Office (USCO).
OK so it turns out I am wrong here! Cool.
I had it upside down, diametrically wrong, however you want to put it: right that the structures exist, exactly wrong on how they apply.
It is registration that guarantees access to statutory damages:
https://www.justia.com/intellectual-property/copyright/infri...
Without registration you still have your natural copyright, but you would have to try to recover the profits made by the infringer.
Which does sound like more of an uphill struggle for The Intercept, because OpenAI could maybe just say "anything we earn from this is de minimis considering how much errr similar material is errrr in the training set"
Oh man it's going to take a long time for me to get my brain to accept this truth over what I'd always understood.
It's so weird to me seeing journalists complaining about copyright and people taking something they did.
The whole of journalism is taking the acts of others and repeating them. Why does a journalist claim rights to someone else's actions when someone simply looks at something they did and repeats it?
If no one else ever did anything, the journalist would have nothing to report; it's inherently about replicating the work and acts of others.
That’s a pretty narrow view of journalism. If you look into newspapers, it’s not just a list of events but also opinion pieces, original research, reports etc. The main infringement isn’t with the basic reporting of facts but with the original part that’s done by the writer.
> The whole of journalism is taking the acts of others and repeating them
Hilarious (and depressing) that this is what people think journalists do.
What is a "journalist?" It sounds old-fashioned.
They are "content creators" now.
Or you could just not do illegal and/or immoral things that are worthy of reporting.
This is terribly unpersuasive