I haven’t gotten around to adding Klein to my GenAI Showdown site yet, but if it’s anything like Z-Image Turbo, it should perform extremely well.
For reference, Z-Image Turbo scored 4 out of 15 points on GenAI Showdown. I’m aware that doesn’t sound like much, but given that Flux.2 (32b), one of the largest and heaviest models, only managed to outscore ZiT (a 6b model) by a single point, that’s still damn impressive.
Local model comparisons only:
https://genai-showdown.specr.net/?models=fd,hd,kd,qi,f2d,zt
I think it shows problems with your tests, tbh. The bigger models are way more capable than you make them out to be. They are also better trained on, and better at understanding, CGI render outputs used as references, like normal maps or ID masks. Your testing suite is a perfect example of structured data implying false confidence. Pure t2i is not a good benchmark anymore.
Can you fix the information bubble on mobile please? When pressing one, it vanishes instantly...
Hey Bombthecat, sorry about that! I can't repro this issue on any of the devices I have (Android Pixel 7, an iPad, etc).
If you get a chance, could you list your mobile device specs? That way I can at least try it on Browserstack and see if I can figure out a fix.
Yeah works fine for me on a Pixel 9.
Samsung, brave browser
Update: Huh, now it's working
I am amazed, though not entirely surprised, that these models keep getting smaller while the quality and effectiveness increase. Z-Image Turbo is wild; I'm looking forward to trying this one out.
An older thread on this has a lot of comments: https://news.ycombinator.com/item?id=46046916
There are probably some more subtle tipping points that small models hit too. One of the challenges of a 100GB model is the non-trivial difficulty of downloading and running the thing, which a 4GB model doesn't face. At 4GB I think it might be reasonable to assume that most devs can just try it and see what it does.
Is there a theoretical minimum number of parameters for a given output? I saw news about GPT 3.5, then DeepSeek training models at a fraction of that cost, then laptops running a model that beats 3.5. When does it stop?
Quality is increasing, but these small models have very little knowledge compared to their big brothers (Qwen Image/Full size Flux 2). As in characters, artists, specific items, etc.
Agreed - given what Tongyi-MAI Lab was able to accomplish with a 6b model - I would love to see what they could do with something larger. Somewhere in the range of 15-20b, between these smaller models (ZiT, Klein) and the significantly larger models (Flux.2 dev).
I smell the bias-variance tradeoff. By underfitting more, they get closer to the degenerate case of a model that only knows one perfect photo.
That's what LoRAs are for.
And small models are also much easier to fine tune than large ones.
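For anyone who hasn't tried it, here is a rough sketch of what patching a missing concept into a small model looks like with the diffusers library; the model id and LoRA path below are placeholders, not real releases:

    import torch
    from diffusers import DiffusionPipeline

    # Hypothetical small base checkpoint plus a concept LoRA; swap in
    # whatever checkpoint you actually run locally.
    pipe = DiffusionPipeline.from_pretrained(
        "some-org/small-image-model",      # placeholder model id
        torch_dtype=torch.bfloat16,
    ).to("cuda")
    pipe.load_lora_weights("./pogo-stick-lora")  # hypothetical fine-tune

    image = pipe(
        "a tiger jumping on a pogo stick",
        num_inference_steps=8,
        guidance_scale=3.5,
    ).images[0]
    image.save("tiger_pogo.png")

The idea being that the concept doesn't have to live in the base weights: a relatively small adapter trained on a modest dataset can cover the gap.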
It cannot create an image of a pogo stick.
I was trying to get it to create an image of a tiger jumping on a pogo stick, which is way beyond its capabilities, but it cannot even create an image of a pogo stick in isolation.
When given an image of an empty wine glass, it can't fill it to the brim with wine. The pogo stick drawers and wine glass fillers can enjoy their job security for months to come!
You can still taste wine in the metaverse with the mouth adapter and can get a buzz by gently electrifying your neuralink (time travel required)
It's a tough test for local models (gpt-image and NB had zero problems); the only one that came reasonably close was Qwen-Image.
Z-Image / Flux 2 / Hidream / Omnigen2 / Qwen Samples:
https://imgur.com/a/tB6YUSu
This is where smaller models are just going to be more constrained and will require additional prompting to coax out the physical description of a "pogo stick". I had similar issues when generating Alexander the Great leading a charge on a hippity-hop / space hopper.
You are right. I just tried, and even with reference images it can't do it for me. Maybe with some good prompting.
Because, in theory, I would say that knowledge is something that does not have to be baked into the model; it could be added using reference images, if the model is capable enough to reason about them.
Those are both good benchmark prompts. Z-Image Turbo doesn't like them either:
Tiger on pogo stick: https://i.imgur.com/lnGfbjy.jpeg
Dunno what this is, but it's not a pogo stick: https://i.imgur.com/OmMiLzQ.jpeg
Nano Banana Pro FTW: https://i.imgur.com/6B7VBR9.jpeg
> FLUX.2 [klein] 4B: The fastest variant in the Klein family. Built for interactive applications, real-time previews, and latency-critical production use cases.
I wonder what kind of use cases could be "latency-critical production use cases"?
Local models. I'm not gonna wait 10 min for one image on my computer like I did back in the Stable Diffusion days. And image editing in particular.
Maybe fast image editing, since it supports that.
Think of GenAI models as a form of compression. Generally, text compresses extremely well. Images and video do not. Yet state-of-the-art text-to-image and text-to-video models are often much smaller (in parameter count) than large language models like Llama-3. Maybe vision models are small because we’re not actually compressing very much of the visual world: the training data covers a narrow, human-biased manifold of common scenes, objects, and styles, while the combinatorial space of visual reality remains largely unexplored. I’m curious what else is out there beyond that human-biased manifold.
> Generally, text compresses extremely well. Images and video do not.
Is that actually true? I'm not sure it's fair to compare lossless compression ratios of text (abstract, noiseless) to images and video that innately have random sampling noise. If you look at humanly indistinguishable compression, I'd expect that you'd see far better compression ratios for lossy image and video compression than lossless text.
The comparison makes sense in what I am charitably assuming is the case the GP is referring to: we know how to build a tight embedding space from a text corpus and get outputs out of it that are tolerably similar to the inputs, for the purposes they're put to. That is lossy compression, just not in the sense anyone talking about conventional lossless text compression algorithms would use the words. I'm not sure we can say the same of image embeddings.
I find it likely that we are still missing a few major efficiency tricks with LLMs. But I would also not underestimate the amount of implicit knowledge and skill an LLM is expected to carry on a meta level.
Images and video compress vastly better than text. You're lucky to get 4:1 to 6:1 compression of text [1], while the best perceptual codecs for static images are typically visually lossless at 10:1 and still look great at 20:1 or higher. Video compression is much better still due to temporal coherence.
[1]: Although it looks like the current Hutter competition leader is closer to 9:1, which I didn't realize. Pretty awesome by historical standards.
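If you want a quick sanity check of the lossless text figure on your own corpus, something like this works (the file path is whatever you point it at; zlib is a weak baseline, and dedicated text compressors do noticeably better):

    import sys
    import zlib

    # Measure a lossless compression ratio for an arbitrary file.
    # Ratios vary a lot by corpus and codec; this is just a baseline.
    data = open(sys.argv[1], "rb").read()
    packed = zlib.compress(data, 9)
    print(f"{len(data):,} -> {len(packed):,} bytes "
          f"({len(data) / len(packed):.1f}:1)")

Comparing that number against the file size of a JPEG or AV1 encode of similar perceptual quality makes the lossy-vs-lossless asymmetry pretty concrete.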
How does this compare to the GPT version in terms of interactive capabilities?
Neat, I really enjoyed Flux 1. Currently I use Z-Image Turbo for messing around.
I will wait for Invoke to add Flux.2 Klein.
I appreciate that they released a smaller version that is actually open source. It creates a lot more opportunities when you do not need a massive budget just to run the software. The speed improvements look pretty significant as well.
2026 will be the year of small/open models
Flux.2 Klein isn’t some generational leap or anything. It’s good, but let’s be honest, this is an ad.
What will be really interesting to me is the release of Z-Image; if that goes the way it’s looking, it’ll be a natural-language SDXL 2.0, which seems to be what people really want.
Releasing the Turbo/Distilled/Finetune months ago was a genius move, really. It hurt the Flux and Qwen releases on the implication of a possible future release alone.
If this was intentional, I can’t think of the last time I saw such shrewd marketing.
The team behind Z-Image Turbo has told us multiple times in their paper that the output quality of the Turbo model is superior to the larger base model.
I think that information still did not get through to most users.
"Notably, the resulting distilled model not only matches the original multi-step teacher but even surpasses it in terms of photorealism and visual impact."
"It achieves 8-step inference that is not only indistinguishable from the 100-step teacher but frequently surpasses it in perceived quality and aesthetic appeal"
https://arxiv.org/abs/2511.22699
It's important for fine-tuning, LoRA training, and as a refiner...
I also heard that it would mainly be useful for training, with the resulting LoRA then applied to the distilled Turbo model.
However, I wonder what has been the source of the delay with its release and if there were problems with that approach.
I’m a bit confused, both you and another commenter mention something called Z-Image, presumably another Flux model?
Your framing of it is speculative, i.e. it is forthcoming; theirs is present tense. Could I trouble you to give us plebes some more context? :)
E.g., parsed as-is, and setting aside the general confusion if you’re unfamiliar, it is unclear how one can observe “the way it is looking”, especially if Turbo was released months ago and the other model is unreleased. I chose to bother you because the other comment was less focused on lab-on-lab strategy.
Z-Image is another open-weight image-generation model by Alibaba [1]. Z-Image Turbo was released around the same time as (non-Klein) FLUX.2 and received generally warmer community response [2] since Z-image Turbo was faster, also high-quality, and reportedly better at generating NSFW material. The base (non-Turbo) version of Z-Image is not yet released.
[1] https://tongyi-mai.github.io/Z-Image-blog/
[2] https://www.reddit.com/r/StableDiffusion/comments/1p9uu69/no...
Z-Image is roughly as censored as Flux 2, from my very limited testing. It got popular because Flux 2 is just really big and slow. It is, however, great at editing, has an amazing breadth of built-in knowledge, and has great prompt adherence.
Z-Image got popular because people stuck with 12GB video cards could still use it, and hell, probably train on it, at least once the base version comes out. I think most people disparaging Flux 2 never tried it, as they wouldn't want to deal with how slowly it would run on their system, if they even realized that they could run it.
Ahh I see, and Klein is basically a response to Z-Image Turbo, i.e. another 4-8B sized model that fits comfortably on a consumer GPU.
It’ll be interesting to see how the NSFW catering plays out for the Chinese labs. I was joking to someone a couple months ago that Seedream 4’s talent for undressing was an attempt to sow discord, and it was interesting that it flew under the radar.
Post-Grok going full gooner pedo, I wonder if Grok will take the heat alone moving forward.
They are underselling Z-Image Turbo somewhat. It's arguably the best overall model for local image generation for several reasons including prompt adherence, overall output quality and realism, and freedom from censorship, even though it's also one of the smallest at 6B parameters.
ZIT is not far short of revolutionary. It is kind of surreal to contemplate how much high-quality imagery can be extracted from a model that fits on a single DVD and runs extremely quickly on consumer-grade GPUs.
Hold on now. Z-Image Turbo has gotten a lot of hype, but it's worse than Qwen Image and Flux 2 (the full-sized version) at all of those things, other than perhaps looking like it was shot on a cell phone camera. Once you get away from photographic portraits of people it quickly shows just how little it can do.
It is, however, small and quick.
Not in my experience. Flux 2 is much larger and heavily censored, and Qwen-Image is just plain not as good. You can fool me into thinking that Z-Image Turbo output isn't AI, while that's rarely the case with Qwen.
Look at the images I posted elsewhere in this section. They are crappy excuses for pogo sticks, but they absolutely do NOT look like they came from a cell phone.
Also see vunderba's page at https://genai-showdown.specr.net/ . Even when Z-Image Turbo fails a test, it still looks great most of the time.
Edit re: your other comment -- don't make the mistake of confusing censorship with lack of training data. Z-Image will try to render whatever you ask for, but at the end of the day it's a very small model that will fail once you start asking for things it simply wasn't trained on. They didn't train it with much NSFW material, so it has some rather... unorthodox anatomical ideas.
Everything you said is exactly the truth.
However... I’m already expecting the blowback when a Z-Image release doesn’t wow people like the Turbo finetune does. SDXL hasn’t been out two years yet; it seems like a decade.
We’ll see. I’m hopeful that Z works as expected and sets the new watermark. I just am not sure it does it right out the gate.
>Post-Grok going full gooner pedo
Almost afraid to ask, but anytime Grok or X or Musk comes up, I am never sure if it is some reality-based thing or some “I just need to hate this” thing. Sometimes they’re the same thing, other times they aren’t.
My guess here is that, because Grok likely uses WAN, someone wrote some gross prompts and then pretended this is an issue unique to Grok, for effect?
A few days ago people were replying to every image on Twitter saying "Grok, put him/her/it in a bikini" and Grok would just do it. It was minimum effort, maximum damage trolling and people loved it.
Ah. So, see, this is exactly why I need to check apparently.
Personally, I go between “I don’t care at all” and “well it’s not ideal” on AI generations. It’s already too late, but the barrier of entry is a lot lower than it was.
But I’m applying a good faith argument where GP does not seem to have intended one.
Reducing it to some people put people in bikinis for a couple days for the lulz is...not quite what happened.
You may note I am no shrinking violet, nor do I lack perspective, as evidenced by my notes on Seedream. And fortuitously, I only mentioned it before being dismissed as bad faith: I could not have foreseen needing to call it out as credentials until now.
I don't think it's kind to accuse others of bad faith, as evidenced by me not passing judgement on the description given by the person you are replying to.
I do admit it made my stomach churn a little bit to see how quickly people will other. Not on you, I'm sure I've done this too. It's stark when you're on the other side of it.
Nah it's been happening for months and involved kids, over and over, albeit for the same reasoning, lulz & totally based. I am a bit surprised that you thought this was just a PG-rated stunt on X for a couple days, it's been in the news for weeks, including on HN.
Damn, they really counterattacked after the Z-Image release, huh.
Good competition breeds innovation.