If you want to have an opinion on it, just install LM Studio and run the Q8_0 version, e.g. here: https://huggingface.co/bartowski/Qwen_Qwen3-4B-Instruct-2507....
You can even run it on a 4 GB Raspberry Pi with the Q4_K_L quant (Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf). https://lmstudio.ai/
Keep in mind that if you run it at the full 262,144 tokens of context you'll need ~65 GB of RAM.
Anyway, if you're on a Mac you can search for "qwen3 4b 2507 mlx 4bit" and run the MLX version, which is often faster on M-series chips. Crazy impressive what you get from a 2 GB file, in my opinion.
It's pretty good for summaries etc., and can even make simple index.html sites if you're teaching students, but it can't really vibecode in my opinion. However, for local automation tasks like summarizing your emails, home automation, or whatever, it is excellent.
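If you'd rather script it than use the LM Studio GUI, here's a minimal sketch using the llama-cpp-python bindings; it assumes you've already downloaded the Q4_K_L GGUF named above, and the 8k context and token limit are just illustrative choices, not recommendations.

    # Minimal sketch: run the GGUF mentioned above via llama-cpp-python
    # (pip install llama-cpp-python). Assumes the file is in the working
    # directory; n_ctx is kept small to limit RAM usage.
    from llama_cpp import Llama

    llm = Llama(
        model_path="Qwen_Qwen3-4B-Instruct-2507-Q4_K_L.gguf",
        n_ctx=8192,  # well below the 262k maximum; the KV cache grows with context
    )

    result = llm.create_chat_completion(
        messages=[{"role": "user", "content": "Summarize this email thread in three bullet points: ..."}],
        max_tokens=256,
    )
    print(result["choices"][0]["message"]["content"])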
It's crazy that we're at this point now.
Thank you. To spare Mac readers time:
mlx 4bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 5bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 6bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
mlx 8bit: https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
edit: corrected the 4b link
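And for anyone who wants to script these rather than chat through LM Studio, a minimal sketch with the mlx-lm package; the repo ID below is a placeholder because the links above are truncated, so paste in the full lmstudio-community repo name.

    # Minimal sketch: run an MLX build on Apple silicon with mlx-lm (pip install mlx-lm).
    # The repo ID is a placeholder -- substitute the full name from the links above.
    from mlx_lm import load, generate

    model, tokenizer = load("lmstudio-community/<full-Qwen3-4B-MLX-repo-name>")
    print(generate(model, tokenizer, prompt="Explain KV caches in two sentences.", max_tokens=200))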
Did you mean mlx 4bit:
https://huggingface.co/lmstudio-community/Qwen3-4B-Thinking-...
This comment saved 3 tons of CO2
> if you run it at the full 262144 tokens of context youll need ~65gb of ram
What is the relationship between context size and RAM required? Isn't the size of RAM related only to number of parameters and quantization?
The context cache (or KV cache) is where intermediate attention results are stored, one entry per token. Its size depends on the model architecture and dimensions.
KV cache size = 2 * batch_size * context_len * num_key_value_heads * head_dim * num_layers * element_size. The "2" is for the two parts, key and value. Element size is the precision in bytes. This model uses grouped query attention (GQA), which reduces num_key_value_heads compared to a multi-head attention (MHA) model.
With batch size 1 (for low-latency single-user inference), 32k context (recommended in the model card), and fp16 precision:
2 * 1 * 32768 * 8 * 128 * 36 * 2 = 4.5 GiB.
I think, anyway. It's hard to keep up with this stuff. :)
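To make that concrete, here's the same arithmetic as a small Python sketch; the 8 KV heads, head_dim of 128, and 36 layers are the figures quoted above for this model, not something re-derived from the config.

    # KV cache bytes = 2 (key + value) * batch * context * kv_heads * head_dim * layers * bytes_per_element
    def kv_cache_bytes(batch, context, kv_heads, head_dim, layers, elem_bytes):
        return 2 * batch * context * kv_heads * head_dim * layers * elem_bytes

    # Figures from the comment above: Qwen3-4B-style dims, fp16 cache (2 bytes/element)
    at_32k = kv_cache_bytes(batch=1, context=32_768, kv_heads=8, head_dim=128, layers=36, elem_bytes=2)
    print(f"32k context:  {at_32k / 2**30:.1f} GiB")   # ~4.5 GiB

    at_full = kv_cache_bytes(batch=1, context=262_144, kv_heads=8, head_dim=128, layers=36, elem_bytes=2)
    print(f"262k context: {at_full / 2**30:.1f} GiB")  # ~36 GiB, before weights and other overhead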
Yes, but you can quantise the KV cache too, just like the weights.
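Roughly speaking, that just shrinks the element_size term in the formula above; the q8/q4 byte counts below are approximations, since quantized caches also store some scaling metadata.

    # Same formula, varying only the bytes per cache element (approximate values)
    for label, elem_bytes in [("fp16", 2.0), ("q8_0 ~1 byte", 1.0), ("q4_0 ~0.5 byte", 0.5)]:
        gib = 2 * 1 * 32_768 * 8 * 128 * 36 * elem_bytes / 2**30
        print(f"{label:>15}: {gib:.2f} GiB at 32k context")  # 4.50 / 2.25 / 1.13 GiB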
What's the space complexity with respect to context size? And who is trying to get it down to linear complexity?
A 24 GB GPU can run a ~30B-parameter model at 4-bit quantization with about 8k-12k of context before every GB of VRAM is occupied.
[dead]
No. Your KV cache is kept in memory also.
I mean...where do you think context is stored?
How about on Apple silicon, for the iPhone?
https://joejoe1313.github.io/2025-05-06-chat-qwen3-ios.html
Is there a crowd-sourced sentiment score for models? I know all these scores are juiced like crazy. I stopped taking them at face value months ago. What I want to know is if other folks out there actually use them or if they are unreliable.
Besides the LM Arena leaderboard mentioned by a sibling comment, if you go to the r/LocalLlama subreddit you can very unscientifically get a rough sense of the models' performance by reading the comments (and maybe even checking the upvotes). I think the crowd's knee-jerk reaction is unreliable, but that's what you asked for.
Not anymore, though. It used to be the place to vibe-check a model ~1 year ago, but lately it's filled with toxic my-team-vs.-your-team posturing, memes about CEOs (wtf), and generally poor takes on a lot of things.
For a while it was China vs. the world, but lately it's even more divided, with heavy camping on specific models. You can still get some signal, but you have to either ban a lot of accounts or read /new during different time zones to get some of that "I'm just here for the tech stack" vibe from posters.
Yeah, some people just can't stop acting as if tech companies were sports teams, and it gets annoying fast.
I don't really go there much anymore, but when I did there seemed to be an inordinate amount of Chinese nationalism from young accounts speaking odd English.
This has been around for a while: https://lmarena.ai/leaderboard/text/coding
OpenRouter usage stats:
https://openrouter.ai/rankings
The new qwen3 model is not out yet.
Since the ranking is based on token usage, wouldn't it be skewed by the fact that small models' APIs are often used for consumer products, especially free ones? Meanwhile, reasoning models skew it in the opposite direction, though to what extent I don't know.
It's an interesting proxy, but I don't know how reliable it'd be.
Also, these small models are meant to be run locally, so they're not going to appear on OpenRouter...
This one should work on personal computers! I'm thankful for Chinese companies raising the floor.
[flagged]
You can see that's not the case by clicking on their name. I wouldn't assume every positive comment about China is a 'shill' - there are many people unhappy with our current neo-cold war.
Astroturf/shill accusations are also against the HN ethos. https://news.ycombinator.com/item?id=11257034
I'm American, just giving some background on the sentiment. There's some discontent within some Western communities (localllama) that Chinese developers have been open-weighting all of their models while most Western models have remained closed weights.
Meta seems to have now stepped out of the running despite being the local LLM catalyst. Anthropic has done nothing. IBM's Granite and Microsoft's Phi are both very far behind. AWS doesn't even attempt to compete. Grok failed to make good on their promise. OpenAI only entered the game yesterday, so it's hard to tell if they're actually serious, since they released such an overly censored model that isn't really better than the Qwen options and at its smallest still requires a decent computer. Google seems to be the only domestic contender on the level of what Chinese companies are doing, and they're being very careful not to cannibalize Gemini with Gemma.
Right now, though, China is dropping huge improvements across the entire spectrum of model sizes with Qwen, Kimi, DeepSeek, GLM, and Yi. We've also got Mistral doing competitive self-hosted models, but they're French. Local AI tooling is plainly _not_ being driven forward by the United States.
China has a business model where you can lose money and it doesn't matter. The state's modus operandi is to just fund things until the leader changes his mind about them.
This is why the Chinese labs are so open: they don't ever need to make a profit, they just need to make good AI.
Sure, except all of these Chinese labs are attached to large, profitable Chinese tech & finance companies. But yeah, it's all unfair competition.
I don't follow
Xi Jinping has set AI as a national priority. This means that lenders (Chinese party-run banks) will "lend" money to AI orgs with no financial strings attached. No business evaluation needed. This is how China does growth: they just fund it in the direction they want without much attention to profitability or returns. It's how you get billion-dollar high-speed rail lines that transport 50 people a day along the route.
Despite the many capitalist facets of China, its core operations are still those of a planned, communist economy.
That you do not follow is apparent from your comment. I don't think this will be a productive discussion - you're speaking in such broad generalities about what are provable facts about specific companies (Alibaba, Baidu, etc.) that it is difficult to respond. Take care.
Sure. The US isn't morally any better on this, though. VC-backed companies and the "get users, figure out profitability later" strategy that have made outside countries unable to compete have been enabled by the US government tweaking interest rates and the supply and value of the dollar. And the US backs all sorts of unprofitable things through grants, contracts, and bailouts.
The US government just hasn't yet found a reason to directly pay for a domestic company to release open models. But it's not like we're above that at all.
There isn't really a question of morality here; China simply operates differently than the US. US companies will need to provide a return to investors and want a secret competitive edge. Chinese companies will need to provide the state with AI; money isn't a concern.
As for why the US gets all the VC love, it's because the US has an extremely friendly business environment. People leave their home countries to start businesses in the US because of it. Europe has done a pretty good job suffocating their tech industry in comparison.
The US government only plays a relatively minor role in this too, because unlike China the US is not a planned economy.
> US government tweaking interest rates
They actually don't do that and, if you're up to speed on current events, Trump is pretty upset about it.
Let's say any country creates the most powerful - and thus best - LLMs. Over time they infiltrate them with their political will. Over 20-30 years, I'd imagine people asking those LLMs will have their minds shifted.
But that's just me and my pessimistic sci-fi scenario.
I think that, just as Perplexity actually created a DeepSeek fine-tune(?) [1], there is more and more incentive towards making uncensored models. To be honest, Kimi K2 isn't that censored, but I tried the GGUF variant of this model on my local PC and it definitely is censored / biased towards China (like Taiwan being part of it, and so on).
Still, the most recent American FOSS model, gpt-oss, is so filled with censorship that it's just not worth it in the name of "safety". So to me both are doing censorship, but I'd much rather take the Chinese censorship since it only covers Chinese topics, and I personally wouldn't ever be asking Chinese models Chinese questions anyway. But maybe that's just me.
And even if I would, I would probably ask an uncensored one. In fact, I was actually thinking of creating a fine-tune like Perplexity's, or something along those lines, to break the Chinese censorship.
I also think some better idea needs to emerge for a multi-model approach, so that censorship could be removed by mixing and matching American and Chinese models. I think it's far from reality, but I read a recent comment about Harmony, which gpt-oss uses, and it does look promising; I'm not sure.
[1]: https://huggingface.co/perplexity-ai/r1-1776
The problem I've had with Chinese models is that when they get confused they revert to Chinese, which is, of course, just gibberish to me. American models revert to English in those situations, but that's fine for me.
And models can be abliterated to remove the censorship. I use llama3 that way.
Individualized recommendation systems are enough to drive everyone nuts.
According to the benchmarks, this one improves on the previous version in every one of them, in some cases beating 30B-A3B. Definitely worth a try; it'll easily fit into memory and token generation speed will be pleasantly fast.
There is a new Qwen3-30B-A3B; you are comparing it to the old one. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
So this 4B dense model gets very similar performance to the 30B MoE variant with a 7.5x smaller footprint.
It gets similar performance to the old version of the 30B MoE model, but not the updated version. https://huggingface.co/Qwen/Qwen3-30B-A3B-Thinking-2507
I still think it's very commendable though.
I am running this beast on my dumb PC with no GPU; now we're talking!
It is interesting to think about how they are achieving these scores. The evals are rated by GPT-4.1. Beyond just overfitting to benchmarks, is it possible the models are internalizing how to manipulate the rating model/agent? Is anyone manually auditing these performance tables?
Is there like a leaderboard or power rankings sort of thing that tracks these small open models and assigns ratings or grades to them based on particular use cases?
https://artificialanalysis.ai/leaderboards/models?open_weigh...
Qwen3-30B-A3B-2507 is much faster on my machine than gpt-oss-20B. This leaderboard does not reflect that.
This is perfect. Thanks.
Compare these rankings to actual usage: https://openrouter.ai/rankings
Claude is not cheap, so why is it far and away the most popular if it's not top 10 in performance?
Qwen3 235B ranks highest on these benchmarks among open models, but I have never met someone who prefers its output over DeepSeek R1. It's extremely wordy and often gets caught in thought loops.
My interpretation is that the models at the top of ArtificialAnalysis are the ones focusing the most on public benchmarks in their training. Note I am not saying xAI is necessarily doing this nefariously; it could just be that they decided it's better bang for the buck to rely on public benchmarks than to focus on building their own evaluation systems.
But Grok is not very good compared to the anthropic, openai, or google models despite ranking so highly in benchmarks.
OpenRouter rankings conflate many factors like output quality, popularity, price, and legal concerns. They cannot tell us whether a model is popular because it is genuinely good, or because many people have heard about it, or because it is free, or because the lawyers trust the provider.
The OpenRouter rankings can be biased.
For example, Google's inexplicable design decisions around libraries and APIs mean it's often worth the 5% premium to just use OpenRouter to access their models. In other cases it's about which models particular agents default to.
Sonnet 4 is extremely good for tool-using agentic setups, though - something I have found other models struggle with over a long context.
Thanks for sharing that. Interesting that the leaderboard is dominated by Anthropic, Google, and DeepSeek. OpenAI doesn't even register.
OpenAI has a lot of share that simply doesn’t exist via OpenRouter. Typical enterprise chat bot apps use it directly without paying a tax and may use litellm with another vendor for fallback.
I shared a link to small, open source models; Claude is neither.
> But Grok is not very good compared to the anthropic, openai, or google models despite ranking so highly in benchmarks.
That's political, I think. I know several alt-right types who swear by Grok because "Elon doesn't give it any of that woke crap". They don't care that there are better options; for them it's the only viable one.
Claude Opus is in the top 10. Also, people on OpenRouter mostly use these models for coding, and Claude models are particularly good at that; the benchmark doesn't only account for coding capabilities, though.
Grok is not bad; I think 4 is better than Claude for most things other than tool calling.
Of course, this is a politically charged subject now, so fair assessments might be hard to come by - as evidenced by the downvotes I've already gotten on this comment.
Am I reading this right: is this model way better than Gemma 3n [1]? (Only for the benchmarks that are common to both models.)
LiveCodeBench: E4B IT 13.2 vs. Qwen 55.2
AIME25: E4B IT 11.6 vs. Qwen 81.3
[1]: https://huggingface.co/google/gemma-3n-E4B
Reasoning models do a lot better at AIME than non-reasoning models, with o3 mini getting 85% and 4o-mini getting 11%. It makes some sense that this would apply to small models as well.
I've been trying this today, and I'm getting a lot of hallucinations in its suggestions. However, the analysis of problems is really quite good.