One interesting tidbit from this article that I haven't seen mentioned yet is that you can use glitch tokens to figure out what model someone is using behind the scenes. Put a glitch token in a prompt and see whether the model reacts normally or responds with this kind of glitchy behavior.
You can imagine LLM fingerprinting becoming part of future pentest workflows, where you identify the model and then know its weaknesses, vulnerabilities, and so on.
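A minimal probe might look like the sketch below. The probe strings are just examples pulled from the article and this thread ("xadder", " SolidGoldMagikarp", " davidjl"), whether they actually glitch depends on the deployment, and query_model is a placeholder for whatever product API you're testing, so treat this as a rough idea rather than a working attack:

```python
# Rough fingerprinting sketch. The probe strings are only examples mentioned
# in the article/thread; query_model() is a placeholder.
GLITCH_PROBES = {
    "gpt-oss-style": ["xadder"],
    "gpt-2/3-style": [" SolidGoldMagikarp", " davidjl"],
}

def query_model(prompt: str) -> str:
    """Placeholder: send `prompt` to the unknown model behind the product."""
    raise NotImplementedError

def fingerprint() -> dict:
    scores = {}
    for family, probes in GLITCH_PROBES.items():
        glitchy = 0
        for probe in probes:
            reply = query_model(f'Repeat this string exactly, in quotes: "{probe}"')
            # A model that can't "see" the token tends to drop, misquote,
            # or replace it; a clean echo suggests a different tokenizer.
            if probe.strip() not in reply:
                glitchy += 1
        scores[family] = glitchy / len(probes)
    return scores  # the family with the highest score is the best guess
```

In practice you'd also want control strings that aren't glitch tokens, to separate "the model can't see the token" from "the model just paraphrased".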
Yes I thought so too.
I wonder whether it will mean more or less is revealed about the models that are running agentic flows (we currently abstract them as Fast/Smart)
It is also possible that the first model calls other models; you could reverse engineer the tool-call structure by seeing when glitches occur in different branches of the tool calling.
Isn't the only reason we can do that the fact that we have access to the tokenizer? Do we have the Claude and Gemini token vocabularies? I mean, if they didn't publish them, would that defeat this attack?
I don't think we do, but maybe we could reverse engineer them by using the API to count the tokens in a bunch of strings. I think you can do that for free through both APIs.
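Something like this sketch, assuming a count_tokens() wrapper around whichever provider's token-counting endpoint you're probing (both Anthropic and Google expose one):

```python
# Sketch: decide whether a candidate string is likely a single token in a
# black-box tokenizer, given only a token-counting endpoint. count_tokens()
# is a placeholder you'd back with the provider's API.
def count_tokens(text: str) -> int:
    raise NotImplementedError  # call the provider's count-tokens endpoint here

def probably_single_token(candidate: str, anchor: str = "The quick brown fox.") -> bool:
    # If appending the candidate adds exactly one token to an unrelated
    # anchor string, it is likely a single token in this vocabulary.
    return count_tokens(anchor + candidate) - count_tokens(anchor) == 1

# Sweeping this over lots of candidate strings maps out chunks of the
# vocabulary, though merges across the join point add some noise.
```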
This article says that "GPT-5 was trained on phrases from adult websites". However, this is misleading: the only thing actually shown is that GPT-5 was trained on phrases that also occur on adult websites, with some speculation that the source of the training data containing such adult phrases is GitHub.
Chinese adult site ads are everywhere in repackaged free and pirated content, which is distributed through sites including but not limited to GitHub, shadow libraries, and YouTube.
For the same reason, running Whisper on blank audio will output those ads.
Specifically, because some pirates will put advertisements for other illicit services in the subtitle data at the beginning or end of movies and TV shows, wherever there's a suitable gap. Usually those gaps are silent.
Companies incorporating subtitle data as source-of-truth transcription training data will thus train their models to output facsimiles of these messages whenever they encounter prolonged stretches of silence.
This is addressed at the end of the blogpost
It is not
It is - in the link to the MIT Technology Review article
> There are about 936 tokens with very low L2 norm, centered at about 2. This likely means that they did not occur in the training process of GPT-oss and were thus depressed by some form of weight decay.
Afaik embedding and norm params are excluded from weight decay as standard practice. Is this no longer true?
E.g., they exclude them in minGPT: https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab...
Could it instead be the case that these tokens were initialized at some mean value across the dataset (plus a little noise), and then never changed because they were never seen in training? Not sure if that is state of the art anymore but e.g. in Karpathy's videos he uses a trick like this to avoid the "sharp hockey stick" drop in loss in the early gradient descent steps, which can result in undesirably big weight updates.
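For reference, the kind of trick I mean is roughly this (a PyTorch sketch of the general idea, not a claim about how GPT-oss was actually initialized; the unigram counts are a placeholder):

```python
import torch
import torch.nn as nn

vocab_size, d_model = 50304, 768
unigram_counts = torch.ones(vocab_size)  # placeholder: real token counts from the corpus

lm_head = nn.Linear(d_model, vocab_size, bias=True)
with torch.no_grad():
    # Near-zero weights plus a log-frequency bias mean the model starts out
    # predicting the unigram distribution, so the initial loss is roughly the
    # unigram entropy instead of something huge, which avoids the "hockey
    # stick" and the oversized early weight updates that come with it.
    lm_head.weight.mul_(0.01)
    probs = unigram_counts / unigram_counts.sum()
    lm_head.bias.copy_(torch.log(probs + 1e-10))

# Tokens that never occur in the corpus start with a very negative bias, and
# if they also never occur in training, their rows just keep whatever
# near-initial (or decayed) values they started with.
```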
Unfortunately the article glosses over some of the practices for uncovering such patterns in the training data. It goes very straight to the point, no lube needed. It didn't land well for me.
Is there any work on reverse engineering LLMs, especially the closed source API ones? For example, how can we learn about the data used in Claude Sonnet 4.5 training?
And more tricky but as important, is there any work on extrapolating the pretrained model AFTER it's RLHF'd? For example, what kinds of biases did exist in gpt-4o before it was unbiased?
Do biases go away completely or they just get suppressed down deep in the model's "mind"?
Yes.
https://arxiv.org/abs/2403.06634
https://arxiv.org/abs/2311.17035
(I just have these ones off the top of my head because I'm a Nicholas Carlini fan and we interviewed him about these attacks.)
Thanks for these, I'll have a look!
> Do biases go away completely or they just get suppressed down deep in the model's "mind"?
Bias is a human term, and couching the conversation in that context does nothing to address the issue here, because it gets into the quagmire of social context.
Let's say LLMs had taken off 15 years ago, at the point systemd launched. All the answers given are going to be weighted toward the old init system simply because there is a lack of information about the new one.
LLMs are only repeating the data they are given, and it's cheaper to remove the data after the fact than it is to try to scrub it out of the training data.
"only" and "repeating" aren't accurate here. There's a lot of steps between the pretraining tokens and the LLM. I mean, you can pretty much do whatever you want in the process of making one or running it.
For instance you could use pretraining/SFT to steer something away from a document instead of towards it and that wouldn't be "only repeating" it. Though I don't know if that's actually possible, and afaik it is true RL reweights existing data instead of learning new things.
> GPT-5 was trained on phrases from adult websites
Does it really imply they were trained on phrases FROM adult websites, or that those phrases FOR adult sites were common in the training data?
Blogspam, link-farms, affiliate marketing, etc, are extremely common for adult (and gambling) sites and likely result in a lot of data tainted with those phrases.
This guy adults.
There's an interesting set of options for the weird "xadder" token: misspellings of "xpadder" (a game pad helper), xadder (the name of at least two or three tools), xadder (a parameter in an XLib call), XAdder (the Xilinx full adder implementation for the Vivado FPGA platform), and more than a few usernames on various forums.
Many of the crude translations of those Chinese phrases are so far off that they fail to convey the meaning, which makes me think the data in those matrices is inaccurate as well. The author really needs to ask a native Chinese speaker with experience in ... searching explicit content to proofread the article and examine the results.
Hi, thanks! If someone posts better translations I will update them.
For a start, you could replace all occurrences of "No Code" (无码) with "Uncensored."
Done, thank you!
Given that the token space is large enough to waste on such "low quality" tokens, has there been work done to use a smaller token space in order for quantized models to perform better?
Just a silly thought that crossed my mind when I saw those "ad tokens".
Isn't that exactly what some of these models do that have 30b params but only activate 3b at a time?
That's the mixture-of-experts pattern.
Humans also only use X% of their brains (the part needed for a specific task).
Does that mean I'm a mixture of experts?
I tried many of the examples in this article in Gemini 2.5 Pro and it seems to handle most quite flawlessly. Is it possible that Google's model is just susceptible to different glitch tokens? I admit most of the technical discussion in the article went a little over my head.
Glitch tokens should be tokenizer-specific. Gemini uses a different tokenizer from the OpenAI models.
The origins of the OpenAI glitch tokens are pretty interesting: they trained an early tokenizer on common strings in their early training data, but it turns out popular subreddits caused some weird strings to be common enough to get assigned an integer, like davidjl, a frequent poster in the https://reddit.com/r/counting subreddit. More on that here: https://simonwillison.net/2023/Jun/8/gpt-tokenizers/#glitch-...
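You can poke at this directly with the tiktoken library; whether a given string survives as a single token depends on which encoding you load:

```python
import tiktoken  # pip install tiktoken

# Compare how an older and a newer OpenAI tokenizer split the same strings.
old_enc = tiktoken.get_encoding("r50k_base")    # GPT-2/GPT-3 era
new_enc = tiktoken.get_encoding("cl100k_base")  # GPT-3.5/GPT-4 era

for s in [" SolidGoldMagikarp", " davidjl", " hello"]:
    print(f"{s!r}: r50k={old_enc.encode(s)} cl100k={new_enc.encode(s)}")
# Fewer IDs in the list means the string is closer to being one token in
# that vocabulary.
```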
Maybe I'm misinterpreting, but the article seems (?) to be implying there's something scandalous about OpenAI training on adult websites.
I find that odd. Would anyone be surprised to know that Google indexes adult websites, and ranks them in its search algorithm? If not, what is the difference for an LLM?
And it's nothing new.
https://github.com/jiangyy/gpt-tokens
People found these adult-site-related Chinese phrases in GPT-4o. The OP is more than one year late.
They're saying that if you find references to a very specific set of phrases that were probably included accidentally on GitHub, then GitHub is likely part of the training data.
GitHub is obviously part of the training data, you don't need to find obscure tokens to tell.
FWIW, I didn't get that sense.
Wouldn't it be best for them to strip that out of the training data for moderation reasons?
I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.
They are trained on public data at our expense, so We The People should *own* them.
Someday, probably sooner than we might think, we'll easily run mega-huge-sized models on our laptops, desktops, and phones. AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Anyways, only tangentially related... (why worry about leaks like this and the hidden base prompts! - they *should all be 100% OSS* - it is the only way to ensure privacy and security).
Also, long-time lurker, first time posting!
I just had to get this off my mind! Cheers.
There's nothing new about being able to copyright something that's a transformation of another work. And they definitely aren't exclusively trained on public data.
> There's nothing new about being able to copyright something that's a transformation of another work
There is something novel here.
Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
Just because I download a bunch of copyrighted files and run `tar c | gzip` over them does not mean I have new copyright.
Just because I download an image and convert it from png to jpg at 50% quality, throwing away about half the data, does not mean I have created new copyright.
AI models are giant lossy compression algorithms. They take text, tokenize it, and turn it into weights, and then inference is a weird form of decompression. See https://bellard.org/ts_zip/ for a logical extension to this.
I think this is the reason that the claim of LLM models being unencumbered by copyright is novel. Until now, a human had to do some creative transformation to transform a work, it could not simply be a computer algorithm that changed the format or compressed the input.
Google Books is not transformative. It shows you all the same data for the same purpose as they were published for.
A better example is Google Image Search. Thumbnails are transformative because they have a different purpose and aren't the same data. An LLM is much more transformative than a thumbnail.
It's more lossy than even lossy compression because of the regularization term; I'm pretty sure you can train one that's guaranteed to not retain any of the pretraining text. Of course then it can't answer things like "what's the second line of The Star Spangled Banner".
Thumbnails are not transformative, they are fair use. They would be copyright infringement, except that a court case ruled them as fair use: https://en.wikipedia.org/wiki/Perfect_10,_Inc._v._Amazon.com... .
The fact that compression is incredibly lossy does not change the fact that it's copyright infringement.
I have a lossy compression algorithm which simply outputs '0' or '1' depending on the parity of the bits of the input.
If I run that against a camcording of a Disney film, the result is a 0 copyrighted by Disney, and in fact posting that 0 in this comment would make this comment also illegal, so I must disclaim that I did not actually produce that from a camcorded Disney film.
If I run it against the book 'Dracula', the result is a 0 in the public domain.
The law does not understand bits, it does not understand compression or lossiness, it understands "humans can creatively transform things, algorithms cannot unless a human imbues creativity into it". It does not matter if your compressed output does not contain the original.
Google Books is transformative. It's a decided case. And it's the same as Google Image, i.e. for search.
https://news.ycombinator.com/item?id=45489807
> Google Books created a huge online index of books, OCRing, compressing them, and transforming them. That was copyright infringement.
No. It's a decided case. It's transformative and fair use. My understanding of why it's transformative is that Google Books mainly offers a search interface for books, and it also has measures to make sure only snippets of books are shown.
Unfortunately very unlikely in our foreseeable future, with the U.S. having a "U.S. against the world" mentality in the AI race. Would love to see this, but it would get shot down immediately.
> I wish we had a constitutional amendment that open-sourced all commercial AI models and required documentation and links to all training data and base prompts.
> They are trained on public data at our expense, so We The People should own them.
The people whose data appears to have been trained on for the interesting parts of the blog post are mostly, like me, not American.
> AI should be free. Overhyped and overpriced. I would love this setup for privacy and security.
Also, this entire blog post only exists because they're curious about a specific free open-weights model.
The "source" being ~"the internet", which we've got as much access to as most of the model makers (i.e. where you don't, they've got explicit licensing rights anyway), and possibly also some explicitly* pirated content (I've not been keeping track of which model makers have or have not done that).
* as in: not just incidentally
> They are trained on public data
this is questionable, but okay...
> at our expense
?
> so We The People should own them.
In addition to training data, it is my understanding that a model's architecture also largely determines its efficacy. Why should we own the architecture?
I'd settle with them being held in a public trust for public benefit
Why would it require a constitutional amendment?
The takings clause of the fifth amendment allows seizure of private property for public use so long as it provides just compensation. So the necessary amendment already exists if they're willing to pay for it. Otherwise they'd need an amendment to circumvent the fifth amendment, to the extent the document is honored.
Are models necessarily IP?
If generative AI models' output can't be copyrighted and turned into private IP, who is to say the output of gradient descent and back-propagation similarly can't be copyrighted? Neither are the creative output of a human being, but both are the product of automated and computed statistical processes.
Similarly, if AI companies want to come at dataset compilation and model training from a fair use angle, would it not be fair use to use the same models for similar purposes if models were obtained through eminent domain? Or through, like in Anthropic's training case, explicit piracy?
It doesn't make sense to me that whether the result of intellectual effort is property or not depends on the legal status of its output, whether its production involved automation, or if it involved statistical computation. These look like vague justifications to take something made by someone else because it has value to you, without compensation.
I'm looking at this through the lens of US copyright, where the Copyright Office determined that AI output isn't protected by copyright, and thus isn't private IP, as it isn't the creative output of a human being.
If the results of inference and generation can't be protected under copyright, as they aren't the creative output of a human being, why wouldn't the results of back-propagation and gradient descent follow the same logic?
This isn't about how we feel about it, it's a legal question.
What you are describing is more-or-less a planned economy, the polar opposite of America's market economy. The government has the power to appropriate things for the common good because it's perceived that private enterprise isn't a necessary force. Sometimes it works, sometimes it doesn't; only certain countries can "moneyball" their way through economics like that, though. America has long since passed the point of even trying.
Your heart is in the right place here (I agree about FOSS), but there is a snowball's chance in hell that any of this ever happens in the USA. We'll be lucky if AI doesn't resemble cable TV by 2030.
Wouldn’t the same argument then be applied to all scraped data?
Interesting. Small typo by the way. It's SolidGoldMagikarp with a k. Easy mistake to make with that tokenizer though har har
It strikes me less that they're from adult websites and more that they're from compromised sites. I've had that happen before and it's mostly porn and stuff like that when that happens.
Fascinating article. I am giving everything AI a wide birth for now; however, I do enjoy learning about how AI works. The question I have is: what does an LLM do when it encounters a new token? Can it actually learn from context, etymology and usage?
As a child I had no idea what many of the words in the newspaper and in literature meant, but I could just pretend I knew what those words meant, or get by without knowing them in full. In time I would gain familiarity with these words, able to make sense of them in context but not necessarily able to pronounce them or use them in my own writing. I certainly didn't stop what I was reading to get the dictionary out every time I encountered a new word, and this is how I think most people learn to read, with new words gradually going from no idea, to some familiarity, to confident use.
We aren't tokenising like the LLMs do and our languages are the product of many hundreds of thousands of years of development. So, how does an LLM learn words that have not already been tokenised? Or is this baked in?
Informed layman warning.
The tokenizer covers the entire dataset. It's basically just a fixed-size Huffman code, grouping together common fragments of letters- for instance, the 100 most common English words are probably all single tokens.
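A concrete illustration with tiktoken (I'm using OpenAI's cl100k_base here just as an example; the exact splits depend on the encoding you pick):

```python
import tiktoken  # pip install tiktoken

enc = tiktoken.get_encoding("cl100k_base")

# Common words tend to come out as single tokens; a made-up word gets split
# into familiar sub-word fragments, which is how the model can still "read" it.
for word in [" the", " newspaper", " dictionary", " frobnicatorish"]:
    toks = enc.encode(word)
    print(f"{word!r} -> {len(toks)} token(s): {[enc.decode([t]) for t in toks]}")
```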
During learning, the model proceeds in roughly the same way a child would: it starts by grouping tokens together, learning the deep regularities of language such as "news[paper]" being more likely than "news[q77.bfe]". Then it incrementally assembles these fragments into larger and larger chains. Similarly, it first learns thematic groupings, such as "word" being more likely somewhere after "dictionary" rather than "stop what I was reading to get the dictionary out every time I encountered a banana assault hungry". Then it starts to pick up "patterns": "as a [baby|child|kid] I had no [idea|concept|clue]". At some point in this process it naturally abstracts concepts from languages: "as a child" starts being internally represented by the same neurons as "als ich ein Kind war".
Then some magic happens that we don't understand, and out pops a neural network that you can talk to and that can write programs and use tools. To be clear, this is the case before RL: probably these patterns are now widespread in the training data, so that the model already understands how to "complete the pattern" on its own. RL then does some magic on top of that to bring it from 20% benchmarks to 80% and presto, AI assistant.
Not an expert, but I don't think that this bit:
> At some point in this process it naturally abstracts concepts from languages: "as a child"
Is true. I don't know of any way for the model to represent concepts.
https://www.anthropic.com/research/tracing-thoughts-language...
>Claude sometimes thinks in a conceptual space that is shared between languages, suggesting it has a kind of universal “language of thought.” We show this by translating simple sentences into multiple languages and tracing the overlap in how Claude processes them.
> The tokenizer covers the entire dataset.
Well, this is only trivially true. You can feed binary data to the LLM and it probably has tokens that only cover single bytes of that.
I think it could infer the meaning of words composed out of tokens it has already seen before, same way that you might be able to infer the meaning of an unknown word based on its prefix/suffix, country of origin, context, etc.
For an entire token that it hasn't seen before, it would have to rely only on context. Presumably it could do this, since that is after all the case in the early phases of training.
s/birth/berth :)
That's rather presumptuous, don't you think? There are some people here with very unusual jobs.
The LLM training process doesn't operate at that conceptual level. What it's doing is closer to examining a large number of possible meanings, seeing which fit the most, and moving its "understanding" in that direction. Repeat enough times, and it develops an association between the new word and the context in which it's used.
New words will usually be combinations of existing tokens, but at the beginning of training a new model, it doesn't "know" what any of the tokens mean. And there's no reason you can't treat every UTF-8 byte as a separate token, but that would require a larger model before you got results that look to a layperson like intelligence, understanding, or knowledge. Tokenisation lets you use a system like word2vec to assign each token a semantic embedding in a vector space, giving the model a bit of a leg up.
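To make the word2vec point concrete, here's a toy gensim sketch (real LLMs learn their embeddings jointly with the rest of the network rather than from a separate word2vec pass, so this only shows the idea of tokens landing near their neighbours in context):

```python
from gensim.models import Word2Vec  # pip install gensim

# Toy corpus: tokens that appear in similar contexts end up near each other
# in the embedding space.
sentences = [
    ["as", "a", "child", "i", "read", "the", "newspaper"],
    ["as", "a", "kid", "i", "read", "the", "paper"],
    ["the", "child", "did", "not", "know", "the", "word"],
    ["the", "kid", "did", "not", "know", "the", "word"],
] * 200  # repeat so the tiny model has enough examples to fit

model = Word2Vec(sentences, vector_size=32, window=3, min_count=1, seed=0, epochs=20)
# With this toy corpus, "kid" usually lands among the nearest neighbours.
print(model.wv.most_similar("child", topn=3))
```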
---
Response to the sibling comment https://news.ycombinator.com/item?id=45485439, since I've hit the rate limit:
> During learning, the model […] starts by grouping tokens together
You probably could design an ML system that works like this, and it'd probably be more efficient to train than a hundred-billion-parameter GPT model, but that's not how GPT model training works. Instead, it attempts all of those things in parallel (although I would expect the solutions to the earlier, easier parts to settle down before the solutions to the later parts do), and the same process is responsible for all of the behaviour in a straightforward fashion.
We do understand the "magic": it's just that it produces a really complicated system that we can't characterise the iterative behaviour of. (For comparison, the iterative function f_c(z) = z² + c, iterated starting at 0, produces the Mandelbrot set.) To use an analogy: imagine the training data is a landscape, and the behaviour of the GPT model trained on it is a weather system. (The parameter count is the amount of atmosphere, or something.) There's nothing magical going on in the weather, but it's just too complicated to predict ahead of time, and tiny gaps in our understanding can magnify into extremely inaccurate long-term predictions. We can, despite this, make some blanket statements about the possible capabilities of a GPT model, of the form "a GPT model will never be able to do X unless you cheat".
The RL magic is, I believe, well understood, but I don't personally understand it. (I know what it does, since RL always does the same thing, but I don't know what it's doing to the model to achieve that.)
> "as a child" starts being internally represented by the same neurons as "als ich ein Kind war"
Yes and no. For a few reasons, including that this kind of association can occur without the same "neurons" getting involved until past the point where that representation exists, it's better to say that they're embedded in nearby regions of a vector space. The actual nodes of the neural network are an implementation detail.