I like Node.js' simple and fully isolated concurrency model. You shouldn't be blocking the main event loop for 30 seconds! The main event loop is not intended to be used for heavy processing.
You can just set up a separate child process for that. The main event loop which handles connections should just co-ordinate and delegate work to other programs and processes. It can await their completion asynchronously; that way the event loop is not blocked.
I recall people have been able to get up to around a million (idle) WebSocket connections handled by a single process.
I was able to comfortably get 20k concurrent sockets per process each churning out 1 outbound message every 3 to 5 seconds (randomized to spread out the load).
It is a good thing that Node.js forces developers to think about this because most other engines which try to hide this complexity tend to impose a significant hidden cost on the server in the form of context switching... With Node.js, there is no such cost, your process can basically have a whole CPU core for itself and it can orchestrate other processes in a maximally efficient way if you write your code correctly... Which Node.js makes very easy to do. Spawning child processes and communicating with them in Node.js is a breeze.
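To make the "delegate and await" idea concrete, here is a minimal sketch, assuming the heavy job can be expressed as a small script run by a second Node process. The Fibonacci job, the `FIB_N` environment variable and the function name are made up for illustration.

```javascript
const { execFile } = require('node:child_process');

// Hypothetical CPU-heavy job: naive Fibonacci, executed by a separate Node process.
const CHILD_SOURCE = `
  const fib = (n) => (n < 2 ? n : fib(n - 1) + fib(n - 2));
  process.stdout.write(String(fib(Number(process.env.FIB_N))));
`;

// Resolves with the child's result; the parent's event loop stays free meanwhile.
function fibInChildProcess(n) {
  return new Promise((resolve, reject) => {
    execFile(
      process.execPath,                              // the same node binary that runs the parent
      ['-e', CHILD_SOURCE],
      { env: { ...process.env, FIB_N: String(n) } }, // pass the input via the environment
      (err, stdout) => (err ? reject(err) : resolve(Number(stdout))),
    );
  });
}
```

Something like `fibInChildProcess(35).then(console.log)` lets the parent keep serving connections while the child burns a core. A long-lived child with an IPC channel would avoid paying process startup per job, but that is a different sketch.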
You like limitations, which is ok, I'll take a better toolset.
In any other language you can still follow these patterns, but you aren't limited to them when you need more.
Reading the article, I didn’t see this answered: why not scale to more nodes if your workload is CPU bound? Spin up a container with 1 cpu and a few gb of ram and scale that as wide as you need?
e.g., this certainly helps when the event loop is blocked, but so could FFI calls to another language for the CPU bound work. I’d only reach for a new Node thread if these didn’t pan out, because there’s usually a LOT that goes into spinning up a new node process in a container (isolating the data, making sure any bundlers and transpilers are working, making sure the worker doesn’t pull in all the app code, etc.).
Side car processes aren’t free, either. Now your processes are contending for the same pool of resources and can’t share anything, which IME means more likelihood of memory issues, esp if there isn’t anything limiting the workers your app can spawn.
Still, good article! Love seeing the ways people tackle CPU bound work loads in an otherwise I/O bound Node app.
> but so could FFI calls to another language for the CPU bound work
Worker threads can be more convenient than FFI, as you don't need to compile anything, you can reuse the main application's functions, etc.
True! Although in a lot of Node you DO have a compile chain (typescript) you need to account for. There’s a transactional cost there to get these working well, and only sharing the code it needs. These days it’s much smaller than it used to be, though, so worker functions are seeing more use.
I make my comment to note tho that in many envs it’s easier to scale out than account for all the extra complications of multiple processes in a single container.
> few gb of ram ...
5 years ago I never would have given this comment a second thought.
Now I read it and have to wonder: when does the price of ram start showing up in the butcher's bill from your cloud provider?
You have to pay that cost in a worker thread anyway, too. There’s no free lunch.
I don't know about you, but my cloud provider has been charging me for the ram on my compute instances since the beginning.
Ram has always been one of the major price drivers...
But the prices have gotten stupid: https://pcpartpicker.com/trends/price/memory/
https://appleinsider.com/articles/26/02/27/the-global-ram-an...
The article calls worker threads "problematic", but it doesn’t really make a strong case for why they’re supposedly problematic.
Having a separate isolate in each thread spawned via worker threads, with a minimal footprint of 10MB, does not seem like a high price to pay. It's not like you're going to spawn hundreds of them anyway, is it? You will very likely spawn at most as many threads as your CPU cores can handle concurrently. You typically don't run a hundred OS threads; you use a thread pool and cap the concurrency by setting a maximum number of threads to spawn.
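A capped pool along those lines could look roughly like this. It is only a sketch: the worker body (squaring a number) and the name `mapWithPool` are hypothetical, and the worker source is passed inline with `eval: true` just to keep the example in one file.

```javascript
const os = require('node:os');
const { Worker } = require('node:worker_threads');

// Hypothetical worker body: squares each number it receives.
const WORKER_SOURCE = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', (n) => parentPort.postMessage(n * n));
`;

// Runs every input through a pool of at most `size` workers (default: one per core).
async function mapWithPool(inputs, size = os.cpus().length) {
  const results = new Array(inputs.length);
  let next = 0; // index of the next input nobody has claimed yet
  const poolSize = Math.max(1, Math.min(size, inputs.length));

  const runWorker = () => new Promise((resolve, reject) => {
    const worker = new Worker(WORKER_SOURCE, { eval: true });
    let current = -1; // index this worker is busy with
    const feed = () => {
      if (next >= inputs.length) {
        // nothing left to claim: shut this worker down
        worker.terminate().then(resolve, reject);
        return;
      }
      current = next++;
      worker.postMessage(inputs[current]);
    };
    worker.on('message', (value) => { results[current] = value; feed(); });
    worker.on('error', reject);
    feed();
  });

  await Promise.all(Array.from({ length: poolSize }, runWorker));
  return results;
}
```

Each worker pulls the next unclaimed input when it finishes the previous one, so the number of live threads never exceeds `poolSize` no matter how many inputs there are.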
This is also how goroutines work under the hood: they are "green threads", an abstraction that operates on top of a much smaller OS thread pool.
Worker threads have constraints but most of them are intentional, and in many cases desirable.
I’d also add that SharedArrayBuffer doesn’t limit you to “shared counters or coordination primitives”. It’s just raw memory; you could store structured data in it using your own memory layout. There are libraries out there that implement higher-level data structures this way already.
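As a rough illustration of "your own memory layout", here is a sketch that packs fixed-size records into a SharedArrayBuffer through a DataView. The record shape (an int32 id plus a float64 price, 16 bytes per record) and the helper names are assumptions made up for the example.

```javascript
// Hypothetical layout per record: int32 id at offset 0, 4 bytes of padding,
// float64 price at offset 8.
const RECORD_BYTES = 16;

function createStore(capacity) {
  const buffer = new SharedArrayBuffer(capacity * RECORD_BYTES);
  return { buffer, view: new DataView(buffer) };
}

function writeRecord(store, index, { id, price }) {
  const base = index * RECORD_BYTES;
  store.view.setInt32(base, id, true);          // true = little-endian
  store.view.setFloat64(base + 8, price, true);
}

function readRecord(store, index) {
  const base = index * RECORD_BYTES;
  return {
    id: store.view.getInt32(base, true),
    price: store.view.getFloat64(base + 8, true),
  };
}
```

`store.buffer` can be handed to a worker, which wraps it in its own DataView and sees the same bytes. DataView reads and writes are not atomic, so concurrent writers to the same record would still need coordination (for example via `Atomics` on an Int32Array over the same buffer).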
The problematic part is mostly in nomenclature. They’re called “threads” but don’t really behave the way you’d expect threads to.
They’re heavy, they don’t share the entire process memory space (ie can’t reference functions), and I believe their imports are separate from each other (ie reparsed for each worker into its own memory space).
In many ways they’re closer to subprocesses in other languages, with limited shared memory.
It’s not “clean” to spin up thousands of threads, but it does work and sometimes it’s easier to write and reason about than a whole pipeline of distributing work to worker threads. I probably wouldn’t do it in a server, but in a CLI I would totally do something like spawn a thread for each file a user wants to analyze and let the OS do task scheduling for me. If they give me a thousand files, they get a thousand threads. That overhead is pretty minimal with OS threads (on Linux, Windows is a different beast).
No, they're threads as far as the OS is concerned (they'll map to OS threads) and actually _do_ share physical process and memory (that's how SharedArrayBuffer works).
However, apart from atomic "plain" memory, no objects are directly shared (for Node/V8 they live in so-called Isolates, iirc), so from a logical standpoint they're kinda like a process.
The underlying reason is that in JavaScript objects are by default open to modification, ie:
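A small sketch of what "open to modification" means in practice (the object and method names are made up for illustration): any code holding a reference can reshape an object, and even built-in prototypes can be patched at runtime, which is a big part of why sharing live objects between threads would be hard to do safely.

```javascript
// Hypothetical object, for illustration only.
const order = { id: 1, qty: 10 };

// Any code holding a reference can change the object's shape at any time...
order.qty = 'ten';                 // change a field's type
delete order.id;                   // remove a property
order.cancel = () => 'cancelled';  // bolt on a method

// ...or patch a prototype that every array in the same isolate shares.
Array.prototype.first = function () { return this[0]; };
const firstItem = [7, 8, 9].first();
```

If two threads could do this to the same object concurrently, the engine would have to guard every property access and every hidden-class transition.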
To get sane performance out of JS there are a ton of tricks the runtime does under the hood; the bad news is that those are all either slow (think Python GIL) or heavily exploitable in a multithreaded scenario.
If you've done multithreaded C/C++ work and touched upon Erlang, the JS Worker design is the logical conclusion: message passing works for small packets (work orders, structured cloning), whilst shipping large data can be problematic with cloning.
This is why SharedArrayBuffers allow no-copy sharing: the plain memory arrays they expose don't offer any security surprises in terms of code execution (Spectre-style attacks are another story), and they also allow for work subdivision if needed.
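A minimal sketch of that no-copy sharing plus work subdivision, assuming each worker gets a disjoint slice so no locking is needed. The doubling job and the name `doubleInParallel` are hypothetical.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical worker: doubles its slice of the shared array in place.
const WORKER_SOURCE = `
  const { parentPort, workerData } = require('node:worker_threads');
  const data = new Float64Array(workerData.buffer); // same memory, no copy
  for (let i = workerData.start; i < workerData.end; i++) data[i] *= 2;
  parentPort.postMessage('done');
`;

function doubleInParallel(values, workerCount = 2) {
  const buffer = new SharedArrayBuffer(values.length * Float64Array.BYTES_PER_ELEMENT);
  const shared = new Float64Array(buffer);
  shared.set(values);
  const chunk = Math.ceil(values.length / workerCount);

  const jobs = Array.from({ length: workerCount }, (_, w) => new Promise((resolve, reject) => {
    const start = w * chunk;
    const end = Math.min(values.length, start + chunk);
    // The SharedArrayBuffer inside workerData is shared with the worker, not cloned.
    const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: { buffer, start, end } });
    worker.once('message', resolve);
    worker.once('error', reject);
  }));

  // Once every worker has reported back, the results are already in `shared`.
  return Promise.all(jobs).then(() => shared);
}
```

Only the tiny `'done'` message goes through structured cloning; the payload itself never leaves the shared buffer.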
I think the isolation and memory safety guarantees that worker threads (or Web Workers) provide are very welcome. The friction mainly comes from ergonomics, as pointed out in the article. So there’s definitely room for improvement there (even within the current constraints).
A worker thread or Web Worker runs in its own isolate, so it needs to initialise it by parsing and executing its entry point. I'm not quite sure whether that's something that already happens but you could imagine optimising this by caching or snapshotting the initial state of an isolate when multiple workers use the same entry point, so new workers can start faster.
That cannot be done with the original main thread isolate because usually the worker environment has both different capabilities than the main isolate and a different entry point.
If I have to handle 1000 files in a small CLI I would probably just use Node.js asynchronous IO in a single thread and let it handle platform specifics for me! You’ll get very good throughput without having to handle threads yourself.
If it’s any comfort, I don’t hear many JS/TS/Node/etc developers calling them threads or really thinking of them that way. Usually just Workers or Web Workers — "worker threads" mostly slips in from Node. Even then, "worker" dominates.
In terms of tradeoffs, if you’re coming from the single event loop model, they’re pretty consistent with the rest of JS. Isolation-first, explicit sharing, fewer footguns. So I think the tradeoffs are the right tradeoffs.
FWIW, traditional threads have their own tradeoffs (especially around IO). In JS that’s mostly a non-issue, so the "I need 1000s of threads" case just doesn’t come up very often.
The worker situation would be much better with inline workers (or modules).
https://github.com/tc39/proposal-module-declarations
Unfortunately the JS standards folks have refused so far to make this situation better.
Ex. it should just be `new Worker(module { ... })`.
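In the meantime, Node's `Worker` constructor does accept source text when you pass `eval: true`, which gets part of the way to inline workers. A rough sketch, with a made-up helper name, and with the big caveat that the function is serialized via `toString()` so it cannot close over anything from the outer scope:

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical helper: run a self-contained arrow/function expression in a one-off worker.
function runInline(fn, input) {
  const source = `
    const { parentPort, workerData } = require('node:worker_threads');
    Promise.resolve((${fn.toString()})(workerData))
      .then((result) => parentPort.postMessage(result));
  `;
  return new Promise((resolve, reject) => {
    const worker = new Worker(source, { eval: true, workerData: input });
    worker.once('message', resolve); // worker exits on its own after replying
    worker.once('error', reject);    // a throw inside fn surfaces here
  });
}
```

Usage would be something like `runInline((n) => n * 2, 21)`. It works, but it also shows why a real language-level construct would be nicer: no tooling can type-check or bundle code that only exists as a string.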
We haven't refused, it just takes time! There was an update at the meeting two weeks ago [1]. There's a lot of other machinery which needs to be specified and implemented before module declarations will work but it's coming along.
[1] https://docs.google.com/presentation/d/1inTcnb4hugyAvKrjFX_X...
It's all about security, see my other comment. https://news.ycombinator.com/item?id=47480080
Related tangent: Platformatic's "Watt" server^1 takes a pretty interesting approach to Node, leveraging worker threads on all available cores for maximum efficiency.
1. https://docs.platformatic.dev/docs/overview/architecture-ove...
I get that it's a constraint of the language, but the ubiquity of bundlers and differing toolchains in the JS world has always made me regret trying to use worker primitives, whether they be web workers, worker threads or anything else. Not to mention that trying to ship them to users via a library is a nightmare, as mentioned in the article.
Almost none of them treat these consistently (if they consider these at all) and all require you to work around them in strange ways.
It feels like there is a lot they could help with in the web world, especially in complex UI and moving computation off the main thread but they are just so clunky to use that almost nobody tries to work around it.
The ironic part is if bundlers, transpilers, compilers etc. weren't used at all they would probably have much more widespread use.
Yea, for the vast, vast majority of workloads just forking a separate Node process ends up being better than mucking with threads.
It’s not weird that you can’t share state between totally different processes except by passing in args.
And you can make it thread-like if you prefer by creating a “load balancer” setup to begin with to keep them CPU bound.
Spawn a process for each CPU, bind data you need, and it can feel like multithreading from your perspective. More here: https://github.com/bennyschmidt/simple-node-multiprocess
I love the simplicity of Node.js that each process or child process can have its own CPU core with essentially no context switching (assuming you have enough CPU cores).
Most other ways are just hiding the context switching costs and complicating monitoring IMO.
I'm currently writing simulations of trading algorithms for my own use. I'm using worker_threads + SharedArrayBuffer and running them in Bun. I also tried porting the code to C# and Go, but the execution time ended up being very similar to the Bun version. NodeJS was slower. Only C gave a clear, noticeable performance advantage — but since I haven't written C in a long time, the code became significantly harder to maintain.
I built an algorithmic trader years ago just using Python, but for the hot paths I gave each algorithm its own function in its own file, and then I would compile the files with Cython. The speedup was pretty significant. I barely wrote any "Cython" stuff (meaning declaring variables and other minor assists). The code is still very much python, just with a few little extras that are easy to understand.
I went through a similar journey trying worker threads for CPU-bound work in Node. The serialization cost of passing data between threads ate most of my gains, especially with larger inputs. Ended up going the napi-rs route instead — Rust addon running in the main thread with near-zero FFI overhead. Different tradeoff since you lose the parallelism, but for my workload the raw speed was already enough.
The lack of backpressure handling and of a promise API for postMessage is also quite annoying; I had many OOMs because of it.
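A promise layer over `postMessage` is not hard to hand-roll, for what it's worth. Here is a minimal sketch, assuming every request carries an id that the worker echoes back; the message shape, the upper-casing worker and `createRpcWorker` are all made up for the example.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical worker: echoes back the request id with an upper-cased payload.
const WORKER_SOURCE = `
  const { parentPort } = require('node:worker_threads');
  parentPort.on('message', ({ id, payload }) => {
    parentPort.postMessage({ id, result: String(payload).toUpperCase() });
  });
`;

function createRpcWorker() {
  const worker = new Worker(WORKER_SOURCE, { eval: true });
  const pending = new Map(); // request id -> { resolve, reject }
  let nextId = 0;

  worker.on('message', ({ id, result }) => {
    pending.get(id)?.resolve(result);
    pending.delete(id);
  });
  worker.on('error', (err) => {
    // fail everything still waiting on this worker
    for (const { reject } of pending.values()) reject(err);
    pending.clear();
  });

  return {
    call(payload) {
      return new Promise((resolve, reject) => {
        const id = nextId++;
        pending.set(id, { resolve, reject });
        worker.postMessage({ id, payload });
      });
    },
    inFlight: () => pending.size, // crude signal a producer can throttle on
    close: () => worker.terminate(),
  };
}
```

Checking `inFlight()` before each `call()` and awaiting an outstanding promise when it gets too high is a cheap way to keep the message queue from growing without bound.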
It's not ideal, the api is kind of low-yet-high-level and that brings some complications.
Move backpressure handling onto the task producer and use a SharedArrayBuffer between the producer and worker, where the worker atomically updates a work-count or current work item ID in that SharedArrayBuffer that the producer can read (atomically) to determine how far along the worker has gotten.
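Roughly, that could look like the sketch below. It assumes a single worker, a single shared Int32 slot used as a "done" counter, and a producer that polls on a timer; all names are hypothetical and the polling is a simplification.

```javascript
const { Worker } = require('node:worker_threads');

// Hypothetical worker: "processes" each item, then bumps the shared done-counter.
const WORKER_SOURCE = `
  const { parentPort, workerData } = require('node:worker_threads');
  const done = new Int32Array(workerData.progress);
  parentPort.on('message', (item) => {
    // ... real work on item would go here ...
    Atomics.add(done, 0, 1);
  });
`;

async function produce(items, maxInFlight = 4) {
  const progress = new SharedArrayBuffer(Int32Array.BYTES_PER_ELEMENT);
  const done = new Int32Array(progress);
  const worker = new Worker(WORKER_SOURCE, { eval: true, workerData: { progress } });
  let failure = null;
  worker.on('error', (err) => { failure = err; });

  let sent = 0;
  let peak = 0; // highest number of unacknowledged items observed
  const pause = () => new Promise((resolve) => setTimeout(resolve, 1));

  while (Atomics.load(done, 0) < items.length) {
    if (failure) throw failure;
    // Only send while fewer than maxInFlight items are unacknowledged.
    while (sent < items.length && sent - Atomics.load(done, 0) < maxInFlight) {
      worker.postMessage(items[sent++]);
      peak = Math.max(peak, sent - Atomics.load(done, 0));
    }
    await pause(); // yield so the producer never blocks its own event loop
  }
  await worker.terminate();
  return { processed: Atomics.load(done, 0), peak };
}
```

The producer never has more than `maxInFlight` items queued in the worker, so memory stays bounded even if the input is huge. The same counter doubles as a progress indicator.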
- you should be using multiple node processes
- you should be spawning tools to do heavy computation
Or you could use Elixir with Postgrest and not have to bolt on all these wonky 3rd-party paid tools for basic stuff like background jobs:
https://elixirisallyouneed.dev/tools?q=Pgflow