One interesting tidbit from this article that I haven't seen mentioned yet is that you can use glitch tokens to figure out what model someone is using behind the scenes. Put a glitch token in a prompt and see whether the model reacts normally or responds with this kind of glitchy behavior.
You can imagine LLM fingerprinting becoming part of future pentest workflows, where you identify the model and then know its weaknesses, vulnerabilities, and so on.
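A minimal probe might look like the sketch below. The probe strings are just examples pulled from the article and this thread ("xadder", " SolidGoldMagikarp", " davidjl"), whether they actually glitch depends on the deployment, and query_model is a placeholder for whatever product API you're testing, so treat this as a rough idea rather than a working attack:

```python
# Rough fingerprinting sketch. The probe strings are only examples mentioned
# in the article/thread; query_model() is a placeholder.
GLITCH_PROBES = {
    "gpt-oss-style": ["xadder"],
    "gpt-2/3-style": [" SolidGoldMagikarp", " davidjl"],
}

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the unknown model behind the product."""
    raise NotImplementedError

def fingerprint() -> dict:
    scores = {}
    for family, probes in GLITCH_PROBES.items():
        glitchy = 0
        for probe in probes:
            reply = query_model(f'Repeat this string exactly, in quotes: "{probe}"')
            # A model that can't "see" the token tends to drop, misquote,
            # or replace it; a clean echo suggests a different tokenizer.
            if probe.strip() not in reply:
                glitchy += 1
        scores[family] = glitchy / len(probes)
    return scores  # the family with the highest score is the best guess
```

In practice you'd also want control strings that aren't glitch tokens, to separate "the model can't see the token" from "the model just paraphrased".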
Yes I thought so too.
I wonder whether it will mean more or less is revealed about the models that are running agentic flows (we currently abstract them as Fast/Smart)
It is also possible that the first model calls other models; you could reverse engineer the tool-call structure by seeing when glitches occur in different branches of the tool calling.
Isn't the only reason we can do that the fact that we have access to the tokenizer? Do we have the Claude and Gemini token vocabularies? I mean, if they didn't publish them, would that defeat this attack?
I don't think we do, but maybe we could reverse engineer them by using the API to count the tokens in a bunch of strings. I think you can do that for free through both APIs.
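Something like this sketch, assuming a count_tokens() wrapper around whichever provider's token-counting endpoint you're probing (both Anthropic and Google expose one):

```python
# Sketch: decide whether a candidate string is likely a single token in a
# black-box tokenizer, given only a token-counting endpoint. count_tokens()
# is a placeholder you'd back with the provider's API.
def count_tokens(text: str) -> int:
    raise NotImplementedError  # call the provider's count-tokens endpoint here

def probably_single_token(candidate: str, anchor: str = "The quick brown fox.") -> bool:
    # If appending the candidate adds exactly one token to an unrelated
    # anchor string, it is likely a single token in this vocabulary.
    return count_tokens(anchor + candidate) - count_tokens(anchor) == 1

# Sweeping this over lots of candidate strings maps out chunks of the
# vocabulary, though merges across the join point add some noise.
```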
This article says that "GPT-5 was trained on phrases from adult websites". However, this is misleading: the only thing actually shown is that GPT-5 was trained on phrases that also occur on adult websites, with some speculation that the source of the training data containing such adult phrases is GitHub.
Chinese adult site ads are everywhere in repackaged free and pirated content, which is distributed through sites including but not limited to GitHub, shadow libraries, and YouTube.
For the same reason, running Whisper on blank audio will output those ads.
Specifically, because some pirates will put advertisements for other illicit services in the subtitle data at the beginning or end of movies and TV shows, wherever there's a suitable gap. Usually those gaps are silent.
Companies incorporating subtitle data as source-of-truth transcription training data will thus train their models to output facsimiles of these messages whenever they encounter prolonged stretches of silence.
This is addressed at the end of the blogpost
It is not
It is - in the link to the MIT Technology Review article
> There are about 936 tokens with very low L2 norm, centered at about 2. This likely means that they did not occur in the training process of GPT-oss and were thus depressed by some form of weight decay.
Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
Could it instead be the case that these tokens were initialized at some mean value across the dataset (plus a little noise), and then never changed because they were never seen in training? Not sure if that is state of the art anymore but e.g. in Karpathy's videos he uses a trick like this to avoid the "sharp hockey stick" drop in loss in the early gradient descent steps, which can result in undesirably big weight updates.
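For reference, the kind of trick I mean is roughly this (a PyTorch sketch of the general idea, not a claim about how GPT-oss was actually initialized; the unigram counts are a placeholder):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50304, 768
unigram_counts = torch.ones(vocab_size)  # placeholder: real token counts from the corpus

lm_head = nn.Linear(d_model, vocab_size, bias=True)
with torch.no_grad():
    # Near-zero weights plus a log-frequency bias mean the model starts out
    # predicting the unigram distribution, so the initial loss is roughly the
    # unigram entropy instead of something huge, which avoids the "hockey
    # stick" and the oversized early weight updates that come with it.
    lm_head.weight.mul_(0.01)
    probs = unigram_counts / unigram_counts.sum()
    lm_head.bias.copy_(torch.log(probs + 1e-10))

# Tokens that never occur in the corpus start with a very negative bias, and
# if they also never occur in training, their rows just keep whatever
# near-initial (or decayed) values they started with.
```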
Unfortunately the article glosses over some of the practices for uncovering such patterns in the training data. It goes very straight to the point, no lube needed. It didn't land well for me.
Is there any work on reverse engineering LLMs, especially the closed source API ones? For example, how can we learn about the data used in Claude Sonnet 4.5 training?
And more tricky but as important, is there any work on extrapolating the pretrained model AFTER it's RLHF'd? For example, what kinds of biases did exist in gpt-4o before it was unbiased?
Do biases go away completely or they just get suppressed down deep in the model's "mind"?
Yes.
https://arxiv.org/abs/2403.06634
https://arxiv.org/abs/2311.17035
(I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)
Thanks for these, I'll have a look!
> Do biases go away completely or they just get suppressed down deep in the model's "mind"?
Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, at the point systemd launched. All the answers given are going to be weighted toward the old init system simply because there is a lack of information about the new one.
LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.
"only" and "repeating" aren't accurate here. There's a lot of steps between the pretraining tokens and the LLM. I mean, you can pretty much do whatever you want in the process of making one or running it.
For instance you could use pretraining/SFT to steer something away from a document instead of towards it and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true RL reweights existing data instead of learning new things.
> GPT-5 was trained on phrases from adult websites
Does it really imply they were trained on phrases FROM adult websites, or that those phrases FOR adult sites were common in the training data?
Blogspam, link-farms, affiliate marketing, etc, are extremely common for adult (and gambling) sites and likely result in a lot of data tainted with those phrases.
This guy adults.
There's an interesting set of options for the weird "xadder" token: misspellings of "xpadder" (a game pad helper), xadder (the name of at least two or three tools), xadder (a parameter in an XLib call), XAdder (the Xilinx full adder implementation for the Vivado FPGA platform), and more than a few usernames on various forums.
Many of the crude translations of those Chinese phrases are so far off that they fail to convey the meaning, which makes me think the data in those matrices is inaccurate as well. The author really needs to ask a native Chinese speaker with experience in ... searching explicit content to proofread the article and examine the results.
Hi, thanks! If someone posts better translations I will update them.
For a start, you could replace all occurrences of "No Code" (无码) with "Uncensored."
Done, thank you!
Given that the token space is large enough to waste on such "low quality" tokens, has there been work done to use a smaller token space in order for quantized models to perform better?
Just a silly thought that crossed my mind when I saw those "ad tokens".
Isn't that exactly what some of these models do that have 30b params but only activate 3b at a time?
That's the mixture-of-experts pattern.
Humans also only use X% of their brains (the part needed for a specific task).
Does that mean I'm a mixture of experts?
I tried many of the examples in this article in Gemini 2.5 Pro and it seems to handle most quite flawlessly. Is it possible that Google's model is just susceptible to different glitch tokens? I admit most of the technical discussion in the article went a little over my head.
Glitch tokens should be tokenizer-specific. Gemini uses a different tokenizer from the OpenAI models.
The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned an integer, like davidjl, a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
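You can poke at this directly with the tiktoken library; whether a given string survives as a single token depends on which encoding you load:

```python
import tiktoken  # pip install tiktoken

# Compare how an older and a newer OpenAI tokenizer split the same strings.
old_enc = tiktoken.get_encoding("r50k_base")    # GPT-2/GPT-3 era
new_enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 era

for s in [" SolidGoldMagikarp", " davidjl", " hello"]:
    print(f"{s!r}: r50k={old_enc.encode(s)} cl100k={new_enc.encode(s)}")
# Fewer IDs in the list means the string is closer to being one token in
# that vocabulary.
```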
Maybe I'm misinterpreting, but the article seems (?) to be implying there's something scandalous about OpenAI training on adult websites.
I find that odd. Would anyone be surprised to know that Google indexes adult websites, and ranks them in its search algorithm? If not, what is the difference for an LLM?
And it's nothing new.
https://github.com/jiangyy/gpt-tokens
People found these adult-site-related Chinese phrases in GPT-4o. The OP is more than one year late.
They're saying that if you find references to a very specific set of phrases that were probably included accidentally on GitHub, then GitHub is likely part of the training data.
GitHub is obviously part of the training data, you don't need to find obscure tokens to tell.
FWIW, I didn't get that sense.
Wouldn't it be best for them to strip that out of the training data for moderation reasons?
I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.
They are trained on public data at our expense, so We The People should *own* them.
Someday, probably sooner than we might think, we'll easily run mega-huge-sized models on our laptops, desktops, and phones. AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Anyways, only tangentially related... (why worry about leaks like this and the hidden base prompts! - they *should all be 100% OSS* - it is the only way to ensure privacy and security).
Also, long-time lurker, first time posting!
I just had to get this off my mind! Cheers.
There's nothing new about being able to copyright something that's a transformation of another work. And they definitely aren't exclusively trained on public data.
> There's nothing new about being able to copyright something that's a transformation of another work
There is something novel here.
Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
Just because I download a bunch of copyrighted files and run `tar c | gzip` over them does not mean I have new copyright.
Just because I download an image and convert it from png to jpg at 50% quality, throwing away about half the data, does not mean I have created new copyright.
AI models are giant lossy compression algorithms. They take text, tokenize it, and turn it into weights, and then inference is a weird form of decompression. See https://bellard.org/ts_zip/ for a logical extension to this.
I think this is the reason that the claim of LLM models being unencumbered by copyright is novel. Until now, a human had to do some creative transformation to transform a work, it could not simply be a computer algorithm that changed the format or compressed the input.
Google Books is not transformative. It shows you all the same data for the same purpose as they were published for.
A better example is Google Image Search. Thumbnails are transformative because they have a different purpose and aren't the same data. An LLM is much more transformative than a thumbnail.
It's more lossy than even lossy compression because of the regularization term; I'm pretty sure you can train one that's guaranteed to not retain any of the pretraining text. Of course then it can't answer things like "what's the second line of The Star Spangled Banner".
Thumbnails are not transformative, they are fair use. They would be copyright infringement, except that a court case ruled them as fair use: https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com... .
The fact that compression is incredibly lossy does not change the fact that it's copyright infringement.
I have a lossy compression algorithm which simply outputs '0' or '1' depending on the parity of the bits of the input.
If I run that against a camcording of a Disney film, the result is a 0 copyrighted by Disney, and in fact posting that 0 in this comment would make this comment also illegal, so I must disclaim that I did not actually produce that from a camcorded Disney film.
If I run it against the book 'Dracula', the result is a 0 in the public domain.
The law does not understand bits, it does not understand compression or lossiness, it understands "humans can creatively transform things, algorithms cannot unless a human imbues creativity into it". It does not matter if your compressed output does not contain the original.
Google Books is transformative. It's a decided case. And it's the same as Google Image, i.e. for search.
https://news.ycombinator.com/item?id=45489807
> Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
No. It's a decided case. It's transformative and fair use. My understanding of why it's transformative is that Google Books mainly offers a search interface for books, and it also has measures to make sure only snippets of books are shown.
Unfortunately very unlikely in our foreseeable future, with the U.S. having a "U.S. against the world" mentality in the AI race. Would love to see this, but it would get shot down immediately.
> I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.
> They are trained on public data at our expense, so We The People should own them.
The people whose data appears to have been trained on for the interesting parts of the blog post are mostly, like me, not American.
> AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Also, this entire blog post only exists because they're curious about a specific free open-weights model.
The "source" being ~"the internet", which we've got as much access to as most of the model makers (i.e. where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).
* as in: not just incidentally
> They are trained on public data
this is questionable, but okay...
> at our expense
?
> so We The People should own them.
In addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?
I'd settle with them being held in a public trust for public benefit
Why would it require a constitutional amendment?
The takings clause of the fifth amendment allows seizure of private property for public use so long as it provides just compensation. So the necessary amendment already exists if they're willing to pay for it. Otherwise they'd need an amendment to circumvent the fifth amendment, to the extent the document is honored.
Are models necessarily IP?
If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation similarly can't be copyrighted? Neither are the creative output of a human being, but both are the product of automated and computed statistical processes.
Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?
It doesn't make sense to me that whether the result of intellectual effort is property or not depends on the legal status of its output, whether its production involved automation, or if it involved statistical computation. These look like vague justifications to take something made by someone else because it has value to you, without compensation.
I'm looking at this through the lens of US copyright, where the Copyright Office determined that AI output isn't protected by copyright, and thus isn't private IP, as it isn't the creative output of a human being.
If the results of inference and generation can't be protected under copyright, as they aren't the creative output of a human being, why wouldn't the results of back-propagation and gradient descent follow the same logic?
This isn't about how we feel about it, it's a legal question.
What you are describing is more-or-less a planned economy, the polar opposite of America's market economy. The government has the power to appropriate things for the common good because it's perceived that private enterprise isn't a necessary force. Sometimes it works, sometimes it doesn't; only certain countries can "moneyball" their way through economics like that, though. America has long since passed the point of even trying.
Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.
Wouldn’t the same argument then be applied to all scraped data?
Interesting. Small typo by the way. It's SolidGoldMagikarp with a k. Easy mistake to make with that tokenizer though har har
It strikes me less that they're from adult websites and more that they're from compromised sites. I've had that happen before and it's mostly porn and stuff like that when that happens.
Fascinating article. I am giving everything AI a wide birth for now; however, I do enjoy learning about how AI works. The question I have is: what does an LLM do when it encounters a new token? Can it actually learn from context, etymology and usage?
As a child I had no idea what many of the words in the newspaper and in literature meant, but I could just pretend I knew what those words meant, or get by without knowing them in full. In time I would gain familiarity with these words, able to make sense of them in context but not necessarily able to pronounce them or use them in my own writing. I certainly didn't stop what I was reading to get the dictionary out every time I encountered a new word, and this is how I think most people learn to read, with new words gradually going from no idea, to some familiarity, to confident use.
We aren't tokenising like the LLMs do and our languages are the product of many hundreds of thousands of years of development. So, how does an LLM learn words that have not already been tokenised? Or is this baked in?
Informed layman warning.
The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters- for instance, the 100 most common English words are probably all single tokens.
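A concrete illustration with tiktoken (I'm using OpenAI's cl100k_base here just as an example; the exact splits depend on the encoding you pick):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Common words tend to come out as single tokens; a made-up word gets split
# into familiar sub-word fragments, which is how the model can still "read" it.
for word in [" the", " newspaper", " dictionary", " frobnicatorish"]:
    toks = enc.encode(word)
    print(f"{word!r} -> {len(toks)} token(s): {[enc.decode([t]) for t in toks]}")
```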
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" rather than "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts from languages: "as a child" starts being internally represented by the same neurons as "als ich ein Kind war".
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: probably these patterns are now widespread in the training data, so that the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% benchmarks to 80% and presto, AI assistant.
Not an expert, but I don't think that this bit:
> At some point in this process it naturally abstracts concepts from languages: "as a child"
Is true. I don't know of any way for the model to represent concepts.
https://www.anthropic.com/research/tracing-thoughts-language...
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
> The tokenizer covers the entire dataset.
Well, this is only trivially true. You can feed binary data to the LLM and it probably has tokens that only cover single bytes of that.
I think it could infer the meaning of words composed out of tokens it has already seen before, same way that you might be able to infer the meaning of an unknown word based on its prefix/suffix, country of origin, context, etc.
For an entire token that it hasn't seen before, it would have to rely only on context. Presumably it could do this, since that is after all the case in the early phases of training.
s/birth/berth :)
That's rather presumptuous, don't you think? There are some people here with very unusual jobs.
The LLM training process doesn't operate at that conceptual level. What it's doing is closer to examining a large number of possible meanings, seeing which fit the most, and moving its "understanding" in that direction. Repeat enough times, and it develops an association between the new word and the context in which it's used.
New words will usually be combinations of existing tokens, but at the beginning of training a new model, it doesn't "know" what any of the tokens mean. And there's no reason you can't treat every UTF-8 byte as a separate token, but that would require a larger model before you got results that look to a layperson like intelligence, understanding, or knowledge. Tokenisation lets you use a system like word2vec to assign each token a semantic embedding in a vector space, giving the model a bit of a leg up.
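To make the word2vec point concrete, here's a toy gensim sketch (real LLMs learn their embeddings jointly with the rest of the network rather than from a separate word2vec pass, so this only shows the idea of tokens landing near their neighbours in context):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus: tokens that appear in similar contexts end up near each other
# in the embedding space.
sentences = [
    ["as", "a", "child", "i", "read", "the", "newspaper"],
    ["as", "a", "kid", "i", "read", "the", "paper"],
    ["the", "child", "did", "not", "know", "the", "word"],
    ["the", "kid", "did", "not", "know", "the", "word"],
] * 200  # repeat so the tiny model has enough examples to fit

model = Word2Vec(sentences, vector_size=32, window=3, min_count=1, seed=0, epochs=20)
# With this toy corpus, "kid" usually lands among the nearest neighbours.
print(model.wv.most_similar("child", topn=3))
```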
---
Response to the sibling comment https://news.ycombinator.com/item?id=45485439, since I've hit the rate limit:
> During learning, the model […] starts by grouping tokens together
You probably could design an ML system that works like this, and it'd probably be more efficient to train than a hundred-billion-parameter GPT model, but that's not how GPT model training works. Instead, it attempts all of those things in parallel (although I would expect the solutions to the earlier, easier parts to settle down before the solutions to the later parts do), and the same process is responsible for all of the behaviour in a straightforward fashion.
We do understand the "magic": it's just that it produces a really complicated system that we can't characterise the iterative behaviour of. (For comparison, the iterative function f_c(z) = z² + c, iterated starting at 0, produces the Mandelbrot set.) To use an analogy: imagine the training data is a landscape, and the behaviour of the GPT model trained on it is a weather system. (The parameter count is the amount of atmosphere, or something.) There's nothing magical going on in the weather, but it's just too complicated to predict ahead of time, and tiny gaps in our understanding can magnify into extremely inaccurate long-term predictions. We can, despite this, make some blanket statements about the possible capabilities of a GPT model, of the form "a GPT model will never be able to do X unless you cheat".
The RL magic is, I believe, well understood, but I don't personally understand it. (I know what it does, since RL always does the same thing, but I don't know what it's doing to the model to achieve that.)
> "as a child" starts being internally represented by the same neurons as "als ich ein Kind war"
Yes and no. For a few reasons, including that this kind of association can occur without the same "neurons" getting involved until past the point where that representation exists, it's better to say that they're embedded in nearby regions of a vector space. The actual nodes of the neural network are an implementation detail.