It is very strange that a post trying to explain the concept of "let it crash" in Elixir (which runs on the BEAM VM) does not mention the doctoral thesis of Joe Armstrong: "Making reliable distributed systems in the presence of software errors".
It should be compulsory reading for anybody interested in reliable systems, even if they do not use the BEAM VM.
There are a few stages, and each improves on the previous ones:
1. Detect crashes at runtime and, by default, stop/crash to prevent continuing with invalid program state.
2. Detect crashes at runtime and handle them according to the business context (e.g. crash, retry, fall back, ...) to prevent bad UX from crashes.
3. Detect potential crashes at compile time to prevent the dev from forgetting to handle them according to the business context.
4. Don't just detect the possibility of crashes but also the specific type and context, to prevent the dev from making a logical mistake and causing a potential runtime error during the error handling itself.
An example of stage 4 would be the compiler checking that a fallback option will actually always resolve the error and not potentially introduce a new error/error type. For instance, falling back to another URL does not always resolve the problem; there still needs to be handling for when the request to the alternative URL fails.
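To illustrate why a fallback is not a complete handler, here is a minimal Elixir sketch (all module, function, and URL names are hypothetical; note that Elixir itself does not verify exhaustiveness at compile time, which is exactly the gap stage 4 describes):

```elixir
defmodule Fetch do
  # `fetch/1` is a stand-in for a real HTTP call.
  def fetch("https://primary.example"), do: {:error, :timeout}
  def fetch("https://backup.example"), do: {:ok, "payload"}

  def fetch_with_fallback(primary, backup) do
    case fetch(primary) do
      {:ok, body} ->
        {:ok, body}

      {:error, _} ->
        # The fallback request can fail as well; stage 4 means the
        # tooling (or at least the code) accounts for this branch too.
        case fetch(backup) do
          {:ok, body} -> {:ok, body}
          {:error, reason} -> {:error, {:both_failed, reason}}
        end
    end
  end
end
```

The inner `case` is the point: without it, a failure of the backup URL would be an unhandled path, which is the "new error introduced by the error handling" scenario described above.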
The philosophy described in the article is basically just stage 1 and a (partial) default restart instead of a default crash, which is maybe a slight improvement but not really sufficient, at least not by my personal standards.
Based on your list there is an opportunity to define stage -1 of error handling sanity, the Eval-Rinse-Reload loop, as implemented by FuckItJS, the original Javascript Error Steamroller: https://github.com/mattdiamond/fuckitjs
> Through a process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly compiles your code, detecting errors and slicing those lines out of the script. To survive such a violent process, FuckItJS reloads itself after each iteration, allowing the onerror handler to catch every single error in your terribly written code.
> [...]
> This will keep evaluating your code until all errors have been sliced off like mold on a piece of perfectly good bread. Whether or not the remaining code is even worth executing, we don't know. We also don't particularly care.
> This organization corresponds nicely to a idealized human organization of bosses and workers - bosses say what is to be done, workers do stuff. Bosses do quality control and check that things get done, if not they fire people, re-organize and tell other people to do the stuff. If they fail (the bosses) they get sacked etc. <<note I said, idealized organization, usually if projects fail the bosses get promoted and given more workers for their next project>>
How does restarting the process fix the crash? If the process crashed because a file was missing, it will still be missing when the process is restarted. Is an infinite crash-loop considered success in Erlang?
I’m only an armchair expert on Erlang. But, having looked into it repeatedly for a couple decades, my take-away is the “Let it crash” slogan is good. But, also presented a bit out of context. Or, at least assuming context that most people don’t have.
Erlang is used in situations involving a zillion incoming requests. If an individual request fails… Maybe it was important. Maybe it wasn’t. If it was important, it’s expected they’ll try again. What’s most important is that the rest of the requests are not interrupted.
What makes Erlang different is that it is natural and trivial to be able to shut down an individual request on the event of an error without worrying about putting any other part of the system into a bad state.
You can pull this off in other languages via careful attention to the details of your request-handling code. But, the creators of the Erlang language and foundational frameworks have set their users up for success via careful attention to the design of the system as a whole.
That’s great in the contexts in which Erlang is used. But, in the context of a Java desktop app like Open Office, it’s more like saying “Let it throw”. “It” being some user action. And, the slogan being to have a language and framework with such robust exception handling built-in that error handling becomes trivial and nearly invisible.
Let it crash, so that if something goes wrong, it does not do so silently.
Let it crash, because a relevant manager will detect it, report it, clean it up, and restart it, without you having to write a line of code for that.
Let it crash as soon as possible, so that any problem (like a crash loop) is readily visible. It's very easy to replace arbitrary bits of Erlang code in a running system, without affecting the rest of it. "Fix it in prod" is better than "miss it in prod", especially when you cannot stop the prod ever.
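A minimal sketch of that "relevant manager" in Elixir/OTP, assuming a trivial Agent-backed worker (the module name is illustrative): the supervisor restarts the crashed worker without any recovery code being written for it.

```elixir
defmodule Worker do
  use Agent
  # A trivially restartable worker holding some state.
  def start_link(_), do: Agent.start_link(fn -> 0 end, name: __MODULE__)
end

# The supervisor is the "manager": it detects the crash and restarts.
{:ok, _sup} = Supervisor.start_link([Worker], strategy: :one_for_one)

pid1 = Process.whereis(Worker)
Process.exit(pid1, :kill)   # simulate a crash
Process.sleep(50)           # give the supervisor a moment to restart
pid2 = Process.whereis(Worker)
IO.inspect(is_pid(pid2) and pid1 != pid2)   # a fresh worker is running
```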
Are individual agents deployable on their own or does the entire "app" of agents need to be deployed as a single group? If individually deployable, what does this look like from a version control and a CI/CD perspective?
To the best of my knowledge: yes, individual parts are deployable separately, within reason. There is explicitly no need to deploy the whole thing at once, and especially not to shut it all down at once.
Erlang works by message passing and duck typing, so, as long as your interfaces are compatible (backwards or forwards), you can alter the implementation, and evolve the interfaces. Think microservices, but when every function can be a microservice, at an absolutely trivial cost.
> You can pull this off in other languages via careful attention to the details of your request-handling code. But, the creators of the Erlang language and foundational frameworks have set their users up for success via careful attention to the design of the system as a whole.
+10. So many people miss this very important point. If you have lots of mutable shared state, or can accidentally leak such into your actor code then the whole actor/supervision tree thing falls over very easily... because you can't just restart any actor without worrying about the rest of the system.
I think this is a large (but not the only[0]) part of why actors/supervisors haven't really caught on anywhere outside of Erlang, even for problem spaces where they would be suitable.
[0] I personally feel the model is very hard to reason about compared to threaded/blocking straight-line code using e.g. structured concurrency, but that may just be a me thing.
The alternative to straight-line code used to be called "spaghetti code".
There was a joke article parodying "GOTO considered harmful" by suggesting a "COME FROM" command. But in a lot of ways, that's exactly what many modern frameworks and languages aim for.
I have worked with Elixir/Erlang and Rust a lot, and I agree. Rust in particular gives ownership semantics to threaded/blocking/locking code, which I often find _much_ easier to understand than a series of messages sent between tasks/processes in Elixir/Erlang.
However, in a world where you have to do concurrent blocking/locking code without the help of rigorous compiler-enforced ownership semantics, Elixir/Erlang is like water in the desert.
Elixir dev: It does not solve all issues. But sometimes you have some kind of rare bug that only happens when X, Y, and Z occur in a specific order. If the process is restarted, it might not happen that way again. Or it might be a temporary problem: you are reaching for an API that temporarily has issues, and it might not have them anymore in 50 ms.
But of course, if it crashes because you are reading a file that does not exist, restarting doesn't solve the issue (though it avoids crashing the whole system).
Note that "let it crash" doesn't mean we shouldn't fix bugs. It is more that, if there is a bug we haven't fixed yet, it is better for the crash to take down a tiny part of the program than the whole program.
It's not going to be missing the next time around. Usually the file is missing due to some concurrency-problem where the file only gets to exist a little later. A process restart certainly fixes this.
If the problem persists, a larger part of the supervision tree is restarted. This eventually leads to a crash of the full application, if nothing can proceed without this application existing in the Erlang release.
The key point is that there's a very large class of errors which is due to the concurrent interaction of different parts of the system. These problems often go away on the next try, because the risk of them occurring is low.
> Is an infinite crash-loop considered success in Erlang?
Of course not, but usually that's not what happens, instead a process crashes because some condition was not considered, the corresponding request is aborted, and a supervisor restarts the process (or doesn't because the acceptor spawns a process per request / client).
Or a long-running worker got into an incorrect state and crashed, and a supervisor will restart it in a known good state (that's a pretty common thing to do in hardware, BEAM makes that idiomatic in software).
Both of your examples look like infinite crash-loops if your work needs to be correct more than it needs to be available. E.g. there aren't any known good states prior to an unexpected crash, you're just throwing a hail mary because the alternatives are impractical.
When a process crashes, its supervisor restarts it according to some policy. These policies specify whether to restart the sibling processes in their startup order or only the crashed process.
But a supervisor also sets limits, like “10 restarts in a timespan of 1 second.” Once the limits are reached, the supervisor crashes. Supervisors have supervisors.
In this scenario the fault cascades upward through the system, triggering more broad restarts and state-reinitializations until the top-level supervisor crashes and takes the entire system down with it.
An example might be losing a connection to the database. It’s not an expected fault to fail while querying it, so you let it crash. That kills the web request, but then the web server ends up crashing too because too many requests failed, then a task runner fails for similar reasons. The logger is still reporting all this because it’s a separate process tree, and the top-level app supervisor ends up restarting the entire thing. It shuts everything off, tries to restart the database connection, and if that works everything will continue, but if not, the system crashes completely.
Expected faults are not part of “let it crash.” E.g. if a user supplies a bad file path or network resource. The distinction is subjective and based around the expectations of the given app. Failure to read some asset included in the distribution is both unlikely and unrecoverable, so “let it crash” allows the code to be simpler in the happy path without giving up fault handling or burying errors deeper into the app or data.
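The restart strategies and limits described a few paragraphs up can be sketched in Elixir (child ids are hypothetical); `:one_for_one` restarts only the crashed child, `:rest_for_one` also restarts the siblings started after it, and `:one_for_all` restarts every child:

```elixir
children = [
  %{id: :a, start: {Agent, :start_link, [fn -> :a end]}},
  %{id: :b, start: {Agent, :start_link, [fn -> :b end]}}
]

{:ok, sup} =
  Supervisor.start_link(children,
    strategy: :one_for_one,
    max_restarts: 10,   # at most 10 restarts...
    max_seconds: 1      # ...within 1 second, after which the supervisor itself exits
  )

IO.inspect(Supervisor.count_children(sup))
```

Exceeding the `max_restarts`/`max_seconds` budget is what makes the fault cascade upward: the supervisor terminates and becomes its own supervisor's problem.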
> there aren't any known good states prior to an unexpected crash
If there aren't any good states then the program straight up doesn't work in the first place, which gets diagnosed pretty quickly before it hits the field.
> your work needs to be correct more than it needs to be available.
"Correctness over availability" tends not to be a thing: if you assume you can reach perfect and full correctness, then either you never release or reality quickly proves you wrong in the field. So maximally resilient and safe systems generally plan for errors happening and for how to recover from them instead of assuming they don't happen. There are very few fully proven non-trivial programs, and there were even fewer 40 years ago.
And Erlang / BEAM was designed in a telecom context, so availability is the prime directive. Which is also why distribution is built-in: if you have a single machine and it crashes you have nothing.
I recommend https://ferd.ca/the-zen-of-erlang.html starting from "if my configuration file is corrupted, restarting won't fix anything". The tl;dr is it helps with transient bugs.
> if you feel that your well-understood regular failure case is viable, then all your error handling can fall-through to that case.
This is my favourite line, because it generalizes the underlying principle beyond the specific BEAM/OTP model in a way that carries over well to the more common sort of database-backed services that people tend to write.
...and does no harm for unfixable bugs. It's the logical equivalent of "switch off and on again" that as we know fixes most issues by itself, but happening only on a part of your software deployment, so most of it will keep running.
> in which case the whole program may finally crash.
This may happen if you let it, but it's basically never the desired outcome. If you were handling a user request, it should stop by returning a HTTP 500 to the client, or if you were processing a background job of some sort, it should stop with a watchdog process marking the job as a failure, not with the entire system crashing.
That's not what "let it crash" is about. Letting something crash in Erlang means that a process (actor) is allowed to crash, but then it gets restarted to try again, which would resolve the situation in case of transient errors.
The equivalent of "let it crash" outside of Erlang is a mountain of try-catch statements and hand-rolled retry wrappers with time delays, with none of the observability and tooling that you get in Erlang.
"Let it crash" is a sentence that gets attention. It makes a person want to know more about it, as it sounds controversial and different. "Let it heal" doesn't have that.
It also has a deeper philosophical meaning: unexpected software bugs should be noisy and obvious instead of causing silent corruption or a misleading user experience. If monitoring doesn’t catch the failure, customers will, and it can be fixed right away (whether it’s the software, a hardware error, a dependency issue, etc.).
A web service returning a 500 error code is a lot more obvious than a 200 with an invalid payload. A crashed app with a stack trace is easier to debug and will cause more user feedback than an app that hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not blindly handling or swallowing exceptions that business code had no business caring about. Does your account management code really think it knows how to properly handle an InterruptedException? Unless your answer is rollback and reset the interrupted flag it’s probably wrong. Can’t write a test for a particular failure scenario? That better blow up loudly with enough context that makes it possible to understand the error condition (and then write a test for it).
It is very common to interpret taglines at face value, and I believe the author did just that, although the point brought up is valid.
In order to “let it crash”, we must design the system in a way that crashes would not be catastrophic, stability wise. Letting it crash is not a commandment, though: it is a reminder that, in most cases, a smart healing strategy might be overkill.
Maybe I didn’t make myself clear. “Let it crash” is not something that should be thought of at the component level, it should be thought of at the system level. The fact that the application crashes “gracefully” or not is not what is really important. You should design the system in a crash-friendly way, and not to write the application and think: “oh, I believe it is OK to let it crash here”.
My personal interpretation is that systems must be able to handle crashing processes gracefully. There is no benefit in letting processes crash just for the sake of it.
Actually, now I thought about it, I know exactly what irked me about the approach. I hope the author takes it as constructive feedback:
Saying "let it crash is a tagline that actually means something else because the BEAM is supposed to be used in this particular way" sounds slightly "cargo-cultish", to the point where we have to challenge the meaning of the actual word to make sense of it.
Joe Armstrong's e-mail, on the other hand, says (and I paraphrase): "the BEAM was designed from the ground up to help developers avoid the creation of ad-hoc protocols for process communication, and the OTP takes that into consideration already. Make sure your system, not your process, is resilient, and literally let processes crash." Boom. There is no gotcha there. Also, there is the added benefit that developers for other platforms now understand that the rationale is justified by the way BEAM/OTP were designed and may not be applicable to their own platforms.
If I sounded snarky that wasn't my intention. At the end of the day though it doesn't feel like you read the article which was clearly in a different context than the one in which you responded. FWIW I didn't expect this small article speaking to a small audience (Elixir devs) to make the rounds on hacker news.
I agree on the importance of defining terms, and I think the important thing here is that "process" in Joe's parlance is not an OS level process, it is one of a fleet of processes running inside the BEAM VM. And the "system" in this case is the supervisory system around it, which itself consists of individual processes.
I'm critiquing a common misunderstanding of the phrase "Let it crash", whereby effectively no local error handling is performed. This leads to worse user experiences and worse outcomes in general. I understand that you're offering critique, but it again sounds like you're critiquing a reductive element (the headline itself).
I did read the article. I concede that I might not have understood it. Again, I never said it is wrong, but rather that it has a blind spot. I am familiar with Joe Armstrong’s work because I worked on a proprietary (and rather worse tbf) native distributed systems middleware in the past.
I actually skimmed the article before posting. I have some exposure to Erlang, but not to Elixir. As I’ve already mentioned, I think the author’s covering of application behavior is OK, but there is more to the tagline than meets the eye.
Ah this makes sense. I always thought "let it crash" made it sound like Elixir devs just don't bother with error checking, like writing Java without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds way more reasonable. Similar idea to this but less fancy: https://flawless.dev/
It's a pretty terrible slogan if it makes your language sound worse than it actually is.
It can't work in the general case because replaying a sequence of syscalls is not sufficient to put the machine back in the same state as it was last time. E.g. second time around open behaves differently so you need to follow the error handling.
However sometimes that approach would work. I wonder how wide the area of effective application is. It might be wide enough to be very useful. The all or nothing database transaction model fits it well.
I think the slogan was meant to be provocative but unfortunately it has been misinterpreted more often than not.
For example, imagine you're working with a 3rd party API and, according to the documentation, it is supposed to return responses in a certain format. What if suddenly that API stops working? Or what if the format changes?
You could write code to handle that "what if" scenario, but in trying to handle every hypothetical, your code becomes bloated, more complicated, and hard to understand.
So in these cases, you accept that the system will crash. But to ensure reliability, you don't want to bring down the whole system. So there are primitives that let you control the blast radius of the crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are issues that you expect to happen. You handle those just as you would in any programming language.
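A small Elixir sketch of that distinction (module and message strings are illustrative): an expected fault like a missing user-supplied file is matched and handled in-line, while a truly unexpected path would use `File.read!/1` and simply crash the process.

```elixir
defmodule Input do
  # Expected faults: match the error tuple and turn it into a useful message.
  def load(path) do
    case File.read(path) do
      {:ok, contents} -> {:ok, contents}
      {:error, :enoent} -> {:error, "no such file: #{path}"}
      {:error, reason} -> {:error, "cannot read #{path}: #{inspect(reason)}"}
    end
  end
end
```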
I've been seeing a lot of these durable workflow engines around lately, for some reason. I'm not sure I understand the pitch. It just seems like a thin wrapper around some very normal patterns for running background jobs. Persist your jobs in a db, checkpoint as necessary, periodically retry. I guess they're meant to be a low-code alternative to writing the db tables yourself, but it seems like you're not saving much code in practice.
Imagine that you’re trying to access an API, which for some reason fails.
“Let it crash” isn’t an argument against handling the timeout, but rather that you should only retry a few, bounded times rather than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request processing (returning the request to the queue) and make that your manager’s problem. Your managing process can then restart you, reassign the work to healthy workers, etc. If your manager can’t get things working and the queue overflows, it throws it into dead letters and crashes. That might restart the server, it might page oncall, etc.
The core idea is that within your business logic is the wrong place to handle system health — and that many problems can be solved by routing around problems (ie, give task to a healthy worker) or restarting a process. A process should crash when it isn’t scoped to handle the problem it’s facing (eg, server OOM, critical dependency offline, bad permissions). Crashing escalates the problem until somebody can resolve it.
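A sketch of the bounded-retry half of this, assuming a hypothetical flaky operation passed in as a function: retry a fixed number of times, then crash and make it the supervisor's problem.

```elixir
defmodule BoundedRetry do
  # Retry up to `attempts` times; when the budget is spent, crash and
  # let the supervising process decide what happens next.
  def call(fun, attempts \\ 3)
  def call(_fun, 0), do: raise "giving up -- escalating to the supervisor"

  def call(fun, attempts) do
    case fun.() do
      {:ok, result} -> result
      {:error, _reason} -> call(fun, attempts - 1)
    end
  end
end
```

The deliberate absence of an unbounded backoff loop is the point: past the budget, the process is not scoped to handle the problem, so it escalates by crashing.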
A condition that "should not happen" might still be a problem specific to a particular request. If you "just crash" it turns this request from one that only triggers a http 500 response to one that crashes the process. This increases the risk of Query of Death scenarios where the frontend that needs to serve this particular request starts retrying it with different backends and triggers restarts faster than the processes come back up.
So being too eager to "just crash" may turn a scenario where you fail to serve 1% of requests into a scenario where you serve none because all your processes keep restarting.
You should try to do some load testing of a real Erlang system and compare how it handles this scenario against other languages/frameworks. What you are describing is one of the exact things the Erlang system is strong against due to the scheduler.
Processes can be marked as temporary, which means they are not restarted, and that’s what is used when managing http connections, as you can’t really restart a request on the server without the client. So the scenario above wouldn’t happen.
You still want those processes to crash though, as it allows it to automatically clean up any concurrent work. For example, if during a request you start three processes to do concurrent work, like fetching APIs, then the request process crashes, the concurrent processes are automatically cleaned up.
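That automatic cleanup can be sketched with raw processes and links (the "helpers" here are hypothetical stand-ins for concurrent API fetches): when the request process dies, its linked helpers are taken down with it, with no cleanup code written.

```elixir
me = self()

# The "request" process spawns three linked helper processes.
request =
  spawn(fn ->
    helpers = for _ <- 1..3, do: spawn_link(fn -> Process.sleep(:infinity) end)
    send(me, {:helpers, helpers})
    Process.sleep(:infinity)
  end)

helpers = receive do: ({:helpers, h} -> h)

Process.exit(request, :kill)   # the request process crashes...
Process.sleep(50)
IO.inspect(Enum.any?(helpers, &Process.alive?/1))   # ...and the helpers are gone
```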
My impression is that in Erlang land each process handler is really cheap so you can just keep on showing up with process handlers and not reach exhaustion like you do with other systems (at least in pre-async worlds...)
This is funny given Elixir/Erlang's whole idea is "let it crash". In Go I just have a Recovery Middleware for any type of problem. Don't know how other languages do it, though.
erlang doesn't crash the program, it crashes the thread. erlang has a layered management system built in as part of OTP (open telecom platform, erlang was built for running highly concurrent telephony hardware). when a thread crashes, it dies and signals its parent. the parent then decides what to do. usually, that's just restarting the worker. maybe if ten workers have crashed in a minute, the manager itself will die and restart. issues bubble up, and managers restart subsystems automatically. for some things, like parsing user data, you might never cause the manager to die, and just always restart the worker.
the article, if you should choose to read it, is explaining that people have the misconception you appear to be having due to the 'let it fail' catchphrase. it goes into detail about this system, when failing is appropriate, and when trying to work around errors is appropriate.
as erlang uses greenthreads, restarting a thread for a user API is effectively instant and free.
It's not a misconception given that Elixir Forum and its Discords members will say that to you. Also I never assumed the whole program crashed so why would you explain this to me?
Why would one Blog guy know it better than a lot of other Elixir devs?
It’s well known among elixir devs that for reasons unkown, Elixir Forum is populated predominantly by people who don’t know what they’re talking about.
Question as a complete outsider: If I run idempotent Python applications in Kubernetes containers and they crash, Kubernetes will eventually restart them. Of course, knowing what to do on IO errors is nicer than destroying and restarting everything with a really bigger hammer (as the article also mentions, you can serve a better error message for whoever has to “deal” with the problem), but eventually they should end up in the same workable state.
Is this conceptually similar, but perhaps at code-level instead?
Somewhat, yes but it's much less powerful. In the BEAM these are trees of supervisors and monitors/links that choose how to restart and receive the stacktrace/error reason of the failure respectively. This gives a lot of freedom on how to handle the failure. In k8s, it's often just a dumb monitor/controller that knows little about how to remediate the issue on boot. Nevermind the boot time penalty.
Conceptually similar, different implementation. The most visible difference is perhaps that supervisors aren’t polling application state but are notified about errors (crashes), and restarting is extremely low latency. Erlang/BEAM was invented for telephony, and it is possible for a restart to happen in the middle of a protocol exchange without the user even noticing.
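That push-based notification can be shown with a bare monitor (the crashing process here is a stand-in): the watcher receives a `:DOWN` message carrying the exit reason the moment the process dies, rather than discovering the failure by polling.

```elixir
crasher =
  spawn(fn ->
    Process.sleep(100)
    exit(:boom)   # simulate a crash with a specific reason
  end)

ref = Process.monitor(crasher)

reason =
  receive do
    {:DOWN, ^ref, :process, _pid, r} -> r   # the failure reason is delivered
  after
    1_000 -> :timeout
  end

IO.inspect(reason)   # :boom
```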
In general, if you can move any kind of logic to a lower level, that's better.
For example, testing that kubernetes restarts work correctly is tricky and requires a complicated setup. Testing that an erlang process/actor behaves as expected is basically a unit test.
Oh of course, I'm sure the kubernetes project tests that they trigger restarts correctly etc.
But that doesn't cover the behavior of your app, the specific configuration you ask kubernetes to use and how the app uses its health endpoints etc. - this is all purely about your own code/config, the kubernetes team can't test that.
I don't code in Erlang or Elixir, aside from messing about. But I've found that letting an entire application crash is something that I can do under certain circumstances, especially when "you have a very big problem and will not go to space today". For example, if there's an error reading some piece of data that's in the application bundle and is needed to legitimately start up in the first place (assets for my game for instance). Then upon error it just "screams and dies" (spits out a stack trace and terminates).
Errors during initialization of a BEAM language application will crash the entire program, and you can decide to exit/crash a program if you get into some unrecoverable state. The important thing is the design of individual crashable/recoverable units.
I think a lot of folks who have never looked at Erlang or Elixir and BEAM before misunderstand this concept because they don't understand how fine-grained processes are, or can be, in Erlang. A very important note: Processes in BEAM languages are cheap, both to create and for context switching, compared to OS threads. While design-wise they offer similar capabilities, this cost difference results in a substantially different approach to design in Erlang than in systems where the cost of introducing and switching between threads is more expensive.
In a more conventional language where concurrency is relatively expensive, and assuming you're not an idiot who writes 1-10k SLOC functions, you end up with functions that have a "single responsibility" (maybe not actually a single responsibility, but closer to it than having 100 duties in one function) near the bottom of your call tree, but they all exist in one thread of execution. In a system, hypothetical, created in this model if your lowest level function is something like:
retrieve_data(db_connection, query_parameters) -> data
And if the database connection fails, would you attempt to restart the database connection in this function? Maybe, but that'd be bad design. You'd most likely raise an exception or change the signature so you could express an error return; in Rust and similar it would become something like `retrieve_data(db_connection, query_parameters) -> Result<Data, Error>`.
Somewhere higher in the call stack you have a handler which will catch the exception or process the error and determine what to do. That is, the function `retrieve_data` crashes, it fails to achieve its objective and does not attempt any corrective action (beyond maybe a few retries in case the error is transient).
In Erlang, you have a supervision tree which corresponds to this call tree concept but for processes. The process handling data retrieval, having been given some db_conn handler and the parameters, will fail for some reason. Instead of handling the error in this process, the process crashes. The failure condition is passed to the supervisor which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic assumption of transient errors, maybe a second or third attempt will succeed). It might have other retry policies, like trying the request again but with a different db_connection (that other one must be bad for some reason, perhaps the db instance it references is down). If it continues to fail, then this supervisor will either handle the error some other way (signaling to another process that the db is down, fix it or tell the supervisor what to do) or perhaps crash itself. This repeats all the way up the supervision tree, ultimately it could mean bringing down the whole system if the error propagates to a high enough level.
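A two-level version of that tree can be sketched in Elixir (the `:data_sup` and `:retriever` names are hypothetical): a subsystem supervisor owns the retrieval worker and carries its own restart budget; exhausting that budget escalates the failure to the top-level supervisor.

```elixir
worker = %{id: :retriever, start: {Agent, :start_link, [fn -> :ready end]}}

# A supervisor can itself be a supervised child.
data_sup = %{
  id: :data_sup,
  type: :supervisor,
  start:
    {Supervisor, :start_link,
     [[worker], [strategy: :one_for_one, max_restarts: 3, max_seconds: 5]]}
}

{:ok, top} = Supervisor.start_link([data_sup], strategy: :one_for_one)
IO.inspect(Supervisor.count_children(top))
```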
This is conceptually no different than how errors and exceptions are handled in sequential, non-concurrent systems. You have handlers that provide mechanisms for retrying or dealing with the errors, and if you don't the error is propagated up (hopefully you don't continue running in a known-bad state) until it is handled or the program crashes entirely.
In languages that offer more expensive concurrency (traditional OS threads), the cost of concurrency (in memory and time) means you end up with a policy that sits somewhere between Erlang's and a straight-line sequential program. Your threads will be larger than Erlang processes so they'll include more error handling within themselves, but ultimately they can still fail and you'll have a supervisor of some sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's virtual threads), system designs have a chance to shift closer to Erlang than that straight-line sequential approach if people are willing to take advantage of it.
There's really not more that's useful to say than the relevant section (4.4) of Joe Armstrong's thesis says:
>How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
>Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do.
>[...]
>The defensive code detracts from the pure case and confuses the reader—the diagnostic is often no better than the diagnostic which the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, encountering something it can't handle is an error per the above definitions, and the process should just die, since there's nothing better for it to do; for a supervisor process supervising such processes-doing-work, "my child process exited" is an exception at worst, and usually not even an exception since the standard library supervisor code already handles that.
The truth is that different errors have to lead to different results if you want a good organisational outcome. These could be:
- Fundamental/Fatal error: something without which the process cannot function, e.g. we are missing an essential config option. Exiting with an error is totally adequate. You can't just heal from that, as it would involve guessing information you don't have. Admins need to fix it.
- Critical error: something that should never occur, e.g. having an active user without a password or email. You don't exit; you skip it if that is possible and ensure the first occurrence is logged and admins are contacted.
- Expected/Regular error: something that is expected to happen during the normal operation of the service, e.g. the other server you make requests to is being restarted and thus unreachable. Here the strategy may vary, but it could be something like retrying with random exponential backoff. Or you could briefly accept that the values provided by that server are unknown and periodically retry to fill them in. Or you could escalate to a critical error after a certain number of retries.
- Warnings: these are usually about something not being exactly ideal, but they do not impede the flow of the program at all. They usually have to do with bad data quality.
If you can proceed without degrading the integrity of the system, you should; the next thing is to decide how important it is for humans to hear about it.
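The "expected error" strategy above can be sketched in a few lines of Elixir (the module name and its defaults are made up for illustration; a production version would also add random jitter to the delay):

```elixir
# A sketch of bounded retries with exponential backoff, escalating to an
# error once the attempts run out. `request` is any zero-arity function
# returning {:ok, _} or {:error, _}.
defmodule Retry do
  def with_backoff(request, attempts \\ 5, delay_ms \\ 100)

  # Out of attempts: escalate instead of retrying forever.
  def with_backoff(_request, 0, _delay_ms), do: {:error, :retries_exhausted}

  def with_backoff(request, attempts, delay_ms) do
    case request.() do
      {:ok, _} = ok ->
        ok

      {:error, _reason} ->
        # Wait, then retry with a doubled delay.
        Process.sleep(delay_ms)
        with_backoff(request, attempts - 1, delay_ms * 2)
    end
  end
end
```

At the point where the retries are exhausted, the caller decides whether that now counts as a critical error.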
>When people say “let it crash”, they are referring to the fact that practically any exited process in your application will be subsequently restarted. Because of this, you can often be much less defensive around unexpected errors. You will see far fewer try/rescue, or matching on error states in Elixir code.
I just threw up in my mouth when I read this. I've never used this language so maybe my experience doesn't apply here but I'm imagining all the different security implications that I've seen arise from failing to check error codes.
That’s actually a good example. Imagine someone forgot to check the error code from an API response. In some languages, they may attempt to parse it as if it were a successful request, and succeed, leading to a result with nulls, empty arrays, or missing data that then spreads through the system. In Elixir, parsing would most likely fail thanks to pattern matching [1], and if by any chance that fails in a core part of the system, the failure will be isolated and that particular component can be restarted.
Elixir is not about willingly ignoring error codes or failure scenarios. It is about naturally limiting the blast radius of errors without a need to program defensively (as in writing code for scenarios you don’t know “just in case”).
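A tiny Elixir illustration of that pattern-matching point (the payload shape is invented for the example):

```elixir
# Hypothetical decoded API response.
decoded = %{"status" => "ok", "items" => [1, 2, 3]}

# Match on the shape you expect instead of defensively probing for nils.
# If the API ever returns a different shape, this raises MatchError and
# crashes only the process doing the parsing; nothing half-parsed leaks out.
%{"status" => "ok", "items" => items} = decoded
```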
Ok, so it's not really that you're not checking error codes. It's that you can write stuff like
    ok = whatever().
If whatever is successful and idiomatic, it returns ok, or maybe a tuple of {ok, SomeReturn}. In that case, execution would continue. If it returns an error tuple like {error, Reason}... "Let it crash" says you can just let it crash. You didn't have anything better to do, so the built-in crash on {error, Reason} will do fine.
Or you could do a
    case whatever() of
        ok -> ok;
        {error, nxdomain} -> ok
    end.
If it was fine to get an nxdomain error, but any other error isn't acceptable... It will just crash, and that's good or at least ok. Better than having to enumerate all the possible errors, or having a catch-all that then explicitly throws an error. It's especially hard to enumerate all possible errors because the running system can change and may return a new error that wasn't enumerated when the requesting code was written.
There's lots of places where crashing isn't actually what you want, and you have to capture all errors, explicitly log it, and then move on... But when you can, checking for success or success and a handful of expected and recoverable errors is very nice.
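For Elixir readers, the same "assert success or a tolerated error" idea might look like this (the `fetch` function is a made-up stand-in for any ok/error-tuple API):

```elixir
# Stand-in for a real call that returns ok/error tuples.
fetch = fn
  :good -> {:ok, 42}
  :missing -> {:error, :nxdomain}
  :weird -> {:error, :timeout}
end

handle = fn arg ->
  case fetch.(arg) do
    {:ok, value} -> value
    # This particular error is acceptable; fall back to a default.
    {:error, :nxdomain} -> :default
    # Deliberately no catch-all clause: any other error raises
    # CaseClauseError, crashing the process, which is exactly what
    # "let it crash" asks for.
  end
end
```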
If you get a chance to read some Elixir/Erlang code, you'll see that pattern matching is used frequently to assert expected error codes. It does not mean ignore errors.
This is a common misunderstanding because unfortunately the slogan is frequently misinterpreted.
It is very strange that a post trying to explain the concept of "let it crash" in Elixir (which runs on the BEAM VM) does not mention the doctoral thesis of Joe Armstrong: "Making reliable distributed systems in the presence of software errors".
It should be compulsory reading for anybody interested in reliable systems, even if they do not use the BEAM VM.
https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A104...
Some core ideas from the paper for the impatient (failures, isolation, healing):
- Failures are inevitable, so systems must be designed to EXPECT and recover from them, NOT AVOID them completely.
- Let it crash philosophy allows components to FAIL and RECOVER quickly using supervision trees.
- Processes should be ISOLATED and communicate via MESSAGE PASSING, which prevents cascading failures.
- Supervision trees monitor other processes and RESTART them when they fail, creating a self-healing architecture.
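A minimal supervision-tree sketch in Elixir (the child module names are hypothetical):

```elixir
defmodule MyApp.Supervisor do
  use Supervisor

  def start_link(init_arg) do
    Supervisor.start_link(__MODULE__, init_arg, name: __MODULE__)
  end

  @impl true
  def init(_init_arg) do
    children = [
      # Each child is an isolated process; with :one_for_one, if
      # MyApp.Worker crashes only it is restarted, leaving MyApp.Cache
      # untouched.
      MyApp.Worker,
      MyApp.Cache
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```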
There are a few stages, and each improves on the previous ones:
1. Detect crashes at runtime and by default stop/crash to prevent continuing with invalid program state
2. Detect crashes at runtime and handle them according to the business context (e.g. crash or retry or fallback-to or ...) to prevent bad UX through crashes.
3. Detect potential crashes at compile-time to prevent the dev from forgetting to handle them according to the business context
4. Don't just detect the possibility of crashes but also the specific type and context to prevent the dev from making a logical mistake and causing a potential runtime error during error handling according to the business context
An example for stage 4 would be the compiler checking that a fallback option will actually always resolve the error and not potentially introduce a new error / error type. For instance, falling back to another URL does not actually always resolve the problem; there still needs to be handling for when the request to the alternative URL fails.
The philosophy described in the article is basically just stage 1 and a (partial) default restart instead of a default crash, which is maybe a slight improvement but not really sufficient, at least not by my personal standards.
Based on your list there is an opportunity to define stage -1 of error handling sanity, the Eval-Rinse-Reload loop, as implemented by FuckItJS, the original Javascript Error Steamroller: https://github.com/mattdiamond/fuckitjs
> Through a process known as Eval-Rinse-Reload-And-Repeat, FuckItJS repeatedly compiles your code, detecting errors and slicing those lines out of the script. To survive such a violent process, FuckItJS reloads itself after each iteration, allowing the onerror handler to catch every single error in your terribly written code.
> [...]
> This will keep evaluating your code until all errors have been sliced off like mold on a piece of perfectly good bread. Whether or not the remaining code is even worth executing, we don't know. We also don't particularly care.
Oh, thank you for the nostalgic reminder of that one. I read that a decade ago and found it hilarious.
https://erlang.org/pipermail/erlang-questions/2003-March/007...
The origin, as far as I know it. I think it still holds and is insightful as a general statement. "Let it heal" seems pretty close to what Joe was getting at.
>>This organization corresponds nicely to a idealized human organization of bosses and workers - bosses say what is to be done, workers do stuff. Bosses do quality control and check that things get done, if not they fire people re-organize and tell other people to do the stuff. If they fail (the bosses) they get sacked etc. <<note I said, idealized organization, usually if projects fail the bosses get promoted and given more workers for their next project>>
We miss you Joe :)
He was one of my favorite humans; the few emails I exchanged with him were funny and insightful.
How does restarting the process fix the crash? If the process crashed because a file was missing, it will still be missing when the process is restarted. Is an infinite crash-loop considered success in Erlang?
I’m only an armchair expert on Erlang. But, having looked into it repeatedly for a couple decades, my take-away is the “Let it crash” slogan is good. But, also presented a bit out of context. Or, at least assuming context that most people don’t have.
Erlang is used in situations involving a zillion incoming requests. If an individual request fails… Maybe it was important. Maybe it wasn’t. If it was important, it’s expected they’ll try again. What’s most important is that the rest of the requests are not interrupted.
What makes Erlang different is that it is natural and trivial to be able to shut down an individual request on the event of an error without worrying about putting any other part of the system into a bad state.
You can pull this off in other languages via careful attention to the details of your request-handling code. But, the creators of the Erlang language and foundational frameworks have set their users up for success via careful attention to the design of the system as a whole.
That’s great in the contexts in which Erlang is used. But, in the context of a Java desktop app like Open Office, it’s more like saying “Let it throw”. “It” being some user action. And, the slogan being to have a language and framework with such robust exception handling built-in that error handling becomes trivial and nearly invisible.
Let it crash, so that if something goes wrong, it does not do so silently.
Let it crash, because a relevant manager will detect it, report it, clean it up, and restart it, without you having to write a line of code for that.
Let it crash as soon as possible, so that any problem (like a crash loop) is readily visible. It's very easy to replace arbitrary bits of Erlang code in a running system, without affecting the rest of it. "Fix it in prod" is better than "miss it in prod", especially when you cannot stop the prod ever.
Are individual agents deployable on their own or does the entire "app" of agents need to be deployed as a single group? If individually deployable, what does this look like from a version control and a CI/CD perspective?
To the best of my knowledge: yes, individual parts are deployable separately, within reason. No, there is explicitly no need to deploy the whole thing at once, and especially not to shut it all down at once.
Erlang works by message passing and duck typing, so, as long as your interfaces are compatible (backwards or forwards), you can alter the implementation, and evolve the interfaces. Think microservices, but when every function can be a microservice, at an absolutely trivial cost.
> You can pull this off in other languages via careful attention to the details of your request-handling code. But, the creators of the Erlang language and foundational frameworks have set their users up for success via careful attention to the design of the system as a whole.
+10. So many people miss this very important point. If you have lots of mutable shared state, or can accidentally leak such into your actor code then the whole actor/supervision tree thing falls over very easily... because you can't just restart any actor without worrying about the rest of the system.
I think this is a large (but not the only[0]) part of why actors/supervisors haven't really caught on anywhere outside of Erlang, even for problem spaces where they would be suitable.
[0] I personally feel the model is very hard to reason about compared to threaded/blocking straight-line code using e.g. structured concurrency, but that may just be a me thing.
The alternative to straight-line code used to be called "spaghetti code".
There was a joke article parodying "GOTO considered harmful" by suggesting a "COME FROM" command. But in a lot of ways, that's exactly what many modern frameworks and languages aim for.
Haha... be the change! Program in INTERCAL! :)
I have worked Elixir/Erlang and Rust a lot, and I agree. Rust in particular gives ownership semantics to threaded/blocking/locking code, which I often times find _much_ easier to understand than a series of messages sent between tasks/processes in Elixir/Erlang.
However, in a world where you have to do concurrent blocking/locking code without the help of rigorous compiler-enforced ownership semantics, Elixir/Erlang is like water in the desert.
Elixir dev: It does not solve all issues. But sometimes you have some kind of rare bug that only happens when X, Y, and Z happen in a specific order. If the process is restarted, it might not happen that way again. Or it might be a temporary problem. You are reaching for an API and it temporarily has issues. It might not have them anymore in 50 ms.
But of course, if it crashes because you are reading a file that does not exist, restarting doesn't solve the issue (but it avoids crashing the whole system).
Note that "let it crash" doesn't mean we shouldn't fix bugs. It is more that, if there is a bug we haven't fixed, it is better for the crash to take down a tiny part of the program than the whole program.
Or more importantly, you can't design robust recovery and retry systems.
It's not going to be missing the next time around. Usually the file is missing due to some concurrency problem where the file only comes to exist a little later. A process restart certainly fixes this.
If the problem persists, a larger part of the supervision tree is restarted. This eventually leads to a crash of the full application, if nothing can proceed without this application existing in the Erlang release.
The key point is that there's a very large class of errors which is due to the concurrent interaction of different parts of the system. These problems often go away on the next try, because the risk of them occurring is low.
> Is an infinite crash-loop considered success in Erlang?
Of course not, but usually that's not what happens, instead a process crashes because some condition was not considered, the corresponding request is aborted, and a supervisor restarts the process (or doesn't because the acceptor spawns a process per request / client).
Or a long-running worker got into an incorrect state and crashed, and a supervisor will restart it in a known good state (that's a pretty common thing to do in hardware, BEAM makes that idiomatic in software).
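Restarting into a known good state falls out of the behaviour callbacks: a restarted GenServer simply re-runs `init/1`. A toy example (the module name is made up):

```elixir
defmodule MyApp.Counter do
  use GenServer

  def start_link(_arg), do: GenServer.start_link(__MODULE__, :ok, name: __MODULE__)

  @impl true
  # The known-good starting state. However corrupted the state was when the
  # process crashed, a supervisor restart lands back here.
  def init(:ok), do: {:ok, 0}

  @impl true
  def handle_call(:bump, _from, n), do: {:reply, n + 1, n + 1}
end
```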
Both of your examples look like infinite crash-loops if your work needs to be correct more than it needs to be available. E.g. there aren't any known good states prior to an unexpected crash, you're just throwing a hail mary because the alternatives are impractical.
When a process crashes, its supervisor restarts it according to some policy. These policies specify whether to restart sibling processes in their startup order or to restart only the crashed process.
But a supervisor also sets limits, like “10 restarts in a timespan of 1 second.” Once the limits are reached, the supervisor crashes. Supervisors have supervisors.
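In Elixir those limits are the `max_restarts`/`max_seconds` options (the child spec here is hypothetical):

```elixir
# Hypothetical child spec; in a real app this points at your worker module.
child = %{id: :worker, start: {MyApp.Worker, :start_link, []}}

# "10 restarts within 1 second" is the supervisor's restart intensity.
# Exceeding it crashes this supervisor, escalating to *its* supervisor.
{:ok, {flags, _child_specs}} =
  Supervisor.init([child], strategy: :one_for_one, max_restarts: 10, max_seconds: 1)
```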
In this scenario the fault cascades upward through the system, triggering more broad restarts and state-reinitializations until the top-level supervisor crashes and takes the entire system down with it.
An example might be losing the connection to the database. It’s not an expected fault to fail while querying it, so you let it crash. That kills the web request, but then the web server ends up crashing too because too many requests failed, then a task runner fails for similar reasons. The logger is still reporting all this because it’s a separate process tree, and the top-level app supervisor ends up restarting the entire thing. It shuts everything off, tries to restart the database connection, and if that works everything will continue, but if not, the system crashes completely.
Expected faults are not part of “let it crash.” E.g. if a user supplies a bad file path or network resource. The distinction is subjective and based around the expectations of the given app. Failure to read some asset included in the distribution is both unlikely and unrecoverable, so “let it crash” allows the code to be simpler in the happy path without giving up fault handling or burying errors deeper into the app or data.
> there aren't any known good states prior to an unexpected crash
If there aren't any good states then the program straight up doesn't work in the first place, which gets diagnosed pretty quickly before it hits the field.
> your work needs to be correct more than it needs to be available.
"correctness over availability" tends to not be a thing, if you assume you can reach perfect and full correctness then either you never release or reality quickly proves you wrong in the field. So maximally resilient and safe systems generally plan for errors happening and how to recover from them instead of assuming they don't. There are very few fully proven non-trivial programs, and there were even less 40 years ago.
And Erlang / BEAM was designed in a telecom context, so availability is the prime directive. Which is also why distribution is built-in: if you have a single machine and it crashes you have nothing.
If it has no good states you probably know it before deploying to production.
I recommend https://ferd.ca/the-zen-of-erlang.html starting from "if my configuration file is corrupted, restarting won't fix anything". The tl;dr is it helps with transient bugs.
> if you feel that your well-understood regular failure case is viable, then all your error handling can fall-through to that case.
This is my favourite line, because it generalizes the underlying principle beyond the specific BEAM/OTP model in a way that carries over well to the more common sort of database-backed services that people tend to write.
...and does no harm for unfixable bugs. It's the logical equivalent of "switch off and on again" that as we know fixes most issues by itself, but happening only on a part of your software deployment, so most of it will keep running.
If the rest of the program is still running while you fix it, yes?
Also, restarting endlessly is just one strategy between multiple others.
Typically you then let the error bubble up in the supervisor tree if restarting multiple times doesn't fix it.
Of course there are still errors that can't be recovered from, in which case the whole program may finally crash.
> in which case the whole program may finally crash.
This may happen if you let it, but it's basically never the desired outcome. If you were handling a user request, it should stop by returning a HTTP 500 to the client, or if you were processing a background job of some sort, it should stop with a watchdog process marking the job as a failure, not with the entire system crashing.
returning HTTP 500 as early as possible is an example of "let it crash" approach outside of Erlang.
That's not what "let it crash" is about. Letting something crash in Erlang means that a process (actor) is allowed to crash, but then it gets restarted to try again, which would resolve the situation in case of transient errors.
The equivalent of "let it crash" outside of Erlang is a mountain of try-catch statements and hand-rolled retry wrappers with time delays, with none of the observability and tooling that you get in Erlang.
"Let it crash" is a sentence that gets attention. It makes a person want to know more about it, as it sounds controversial and different. "Let it heal" doesn't have that.
It also has a deeper philosophical meaning: unexpected software bugs should be noisy and obvious instead of causing silent corruption or a misleading user experience. If monitoring doesn’t catch the failure, customers will, and it can be fixed right away (whether it’s the software, a hardware error, a dependency issue, etc.).
A web service returning a 500 error code is a lot more obvious than a 200 with an invalid payload. A crashed app with a stack trace is easier to debug and will cause more user feedback than an app that hangs in a retry loop.
When I had to deal with these things in the Java world, it meant not blindly handling or swallowing exceptions that business code had no business caring about. Does your account management code really think it knows how to properly handle an InterruptedException? Unless your answer is rollback and reset the interrupted flag it’s probably wrong. Can’t write a test for a particular failure scenario? That better blow up loudly with enough context that makes it possible to understand the error condition (and then write a test for it).
It is very common to interpret taglines by their face value, and I believe the author did just that, although the point brought up is valid.
In order to “let it crash”, we must design the system in a way that crashes would not be catastrophic, stability wise. Letting it crash is not a commandment, though: it is a reminder that, in most cases, a smart healing strategy might be overkill.
Author: I'm literally explaining not to interpret the tag line at face value.
Maybe I didn’t make myself clear. “Let it crash” is not something that should be thought of at the component level, it should be thought of at the system level. The fact that the application crashes “gracefully” or not is not what is really important. You should design the system in a crash-friendly way, and not to write the application and think: “oh, I believe it is OK to let it crash here”.
Then I don't think you understand how the phrase is used in Elixir/Erlang. The phrase is about letting processes crash.
No need for the snarky comment. If I am wrong, that is fine.
Of course Joe Armstrong could explain what I meant, but in a much better way: https://erlang.org/pipermail/erlang-questions/2003-March/007... (edit: see the "Why was error handling designed like this?" part for reference)
My personal interpretation is that systems must be able to handle crashing processes gracefully. There is no benefit in letting processes crash just for the sake of it.
Actually, now I thought about it, I know exactly what irked me about the approach. I hope the author takes it as constructive feedback:
Saying "let it crash is a tagline that actually means something else because the BEAM is supposed to be used in this particular way" sounds slightly "cargo-cultish", to the point where we have to challenge the meaning of the actual word to make sense of it.
Joe Armstrong's e-mail, on the other hand, says (and I paraphrase): "the BEAM was designed from the ground up to help developers avoid the creation of ad-hoc protocols for process communication, and the OTP takes that into consideration already. Make sure your system, not your process, is resilient, and literally let processes crash." Boom. There is no gotcha there. Also, there is the added benefit that developers for other platforms now understand that the rationale is justified by the way BEAM/OTP were designed and may not be applicable to their own platforms.
If I sounded snarky that wasn't my intention. At the end of the day though it doesn't feel like you read the article which was clearly in a different context than the one in which you responded. FWIW I didn't expect this small article speaking to a small audience (Elixir devs) to make the rounds on hacker news.
I agree on the importance of defining terms, and I think the important thing here is that "process" in Joe's parlance is not an OS level process, it is one of a fleet of processes running inside the BEAM VM. And the "system" in this case is the supervisory system around it, which itself consists of individual processes.
I'm critiquing a common misunderstanding of the phrase "Let it crash", whereby effectively no local error handling is performed. This leads to worse user experiences and worse outcomes in general. I understand that you're offering critique, but it again sounds like you're critiquing a reductive element (the headline itself).
I did read the article. I concede that I might not have understood it. Again, I never said it is wrong, but rather that it has a blind spot. I am familiar with Joe Armstrong’s work because I worked on a proprietary (and rather worse tbf) native distributed systems middleware in the past.
Yeah, but it's internet forum and for opinion pieces people first read comments and then maybe read the article if it's interesting.
I actually skimmed the article before posting. I have some exposure to Erlang, but not to Elixir. As I’ve already mentioned, I think the author’s covering of application behavior is OK, but there is more to the tagline than meets the eye.
Ah this makes sense. I always thought "let it crash" made it sound like Elixir devs just don't bother with error checking, like writing Java without any `catch`es, or writing Rust that only uses `.unwrap()`.
If they just mean "processes should be restartable" then that sounds way more reasonable. Similar idea to this but less fancy: https://flawless.dev/
It's a pretty terrible slogan if it makes your language sound worse than it actually is.
Flawless is interesting.
It can't work in the general case because replaying a sequence of syscalls is not sufficient to put the machine back in the same state it was in last time. E.g. the second time around, open behaves differently, so you need to follow the error-handling path.
However sometimes that approach would work. I wonder how wide the area of effective application is. It might be wide enough to be very useful. The all or nothing database transaction model fits it well.
I think the slogan was meant to be provocative but unfortunately it has been misinterpreted more often than not.
For example, imagine you're working with a 3rd party API and, according to the documentation, it is supposed to return responses in a certain format. What if suddenly that API stops working? Or what if the format changes?
You could write code to handle that "what if" scenario, but then, trying to handle every hypothetical, your code becomes bloated, more complicated, and hard to understand.
So in these cases, you accept that the system will crash. But to ensure reliability, you don't want to bring down the whole system. So there are primitives that let you control the blast radius of the crash if something unexpected happens.
Let it crash does not mean you skip validating user input. Those are issues that you expect to happen. You handle those just as you would in any programming language.
I've been seeing a lot of these durable workflow engines around lately, for some reason. I'm not sure I understand the pitch. It just seems like a thin wrapper around some very normal patterns for running background jobs. Persist your jobs in a db, checkpoint as necessary, periodically retry. I guess they're meant to be a low-code alternative to writing the db tables yourself, but it seems like you're not saving much code in practice.
As someone has linked it: https://erlang.org/pipermail/erlang-questions/2003-March/007...
It is about self-healing, too.
I think it’s more subtle:
Imagine that you’re trying to access an API, which for some reason fails.
“Let it crash” isn’t an argument against handling the timeout, but rather that you should only retry a few, bounded times rather than (eg) exponentially back off indefinitely.
When you design from that perspective, you just fail your request processing (returning the request to the queue) and make that your manager’s problem. Your managing process can then restart you, reassign the work to healthy workers, etc. If your manager can’t get things working and the queue overflows, it throws it into dead letters and crashes. That might restart the server, it might page oncall, etc.
The core idea is that within your business logic is the wrong place to handle system health — and that many problems can be solved by routing around problems (ie, give task to a healthy worker) or restarting a process. A process should crash when it isn’t scoped to handle the problem it’s facing (eg, server OOM, critical dependency offline, bad permissions). Crashing escalates the problem until somebody can resolve it.
This is great, thanks for sharing! I've been thinking about improving error handling in my liveview app and this might be a nice way to start.
A condition that "should not happen" might still be a problem specific to a particular request. If you "just crash" it turns this request from one that only triggers a http 500 response to one that crashes the process. This increases the risk of Query of Death scenarios where the frontend that needs to serve this particular request starts retrying it with different backends and triggers restarts faster than the processes come back up.
So being too eager to "just crash" may turn a scenario where you fail to serve 1% of requests into a scenario where you serve none because all your processes keep restarting.
You should try to do some load testing of a real Erlang system and compare how it handles this scenario against other languages/frameworks. What you are describing is one of the exact things the Erlang system is strong against due to the scheduler.
Processes can be marked as temporary, which means they are not restarted, and that’s what is used when managing http connections, as you can’t really restart a request on the server without the client. So the scenario above wouldn’t happen.
You still want those processes to crash though, as it allows automatic cleanup of any concurrent work. For example, if during a request you start three processes to do concurrent work, like fetching APIs, and the request process crashes, the concurrent processes are automatically cleaned up.
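That linked-cleanup behaviour is what `Task.async/1` gives you out of the box: the tasks are linked to the caller, so if the request process dies mid-flight, the tasks die with it. A small sketch with stubbed-out work:

```elixir
# The anonymous functions stand in for real API calls. Because Task.async/1
# links each task to the calling process, a crash in the caller tears down
# the in-flight tasks automatically; no manual cleanup code needed.
tasks = [
  Task.async(fn -> {:api_a, :fetched} end),
  Task.async(fn -> {:api_b, :fetched} end)
]

# Collect all results (in order) in the happy path.
results = Task.await_many(tasks)
```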
> If you "just crash" it turns this request from one that only triggers a http 500 response to one that crashes the process.
In phoenix each request has its own process and crashing that process will result in a 500 being sent to the client.
My impression is that in Erlang land each process handler is really cheap so you can just keep on showing up with process handlers and not reach exhaustion like you do with other systems (at least in pre-async worlds...)
This is funny given Elixir/Erlangs whole idea is "let it crash". In Go I just have a Recovery Middleware for any type of problem. Don't know how other langs do it tho
erlang doesn't crash the program, it crashes the thread. erlang has a layered management system built in as part of OTP (open telecom platform, erlang was built for running highly concurrent telephony hardware). when a thread crashes, it dies and signals its parent. the parent then decides what to do. usually, that's just restarting the worker. maybe if ten workers have crashed in a minute, the manager itself will die and restart. issues bubble up, and managers restart subsystems automatically. for some things, like parsing user data, you might never cause the manager to die, and just always restart the worker.
the article, if you should choose to read it, is explaining that people have the misconception you appear to be having due to the 'let it fail' catchphrase. it goes into detail about this system, when failing is appropriate, and when trying to work around errors is appropriate.
as erlang uses greenthreads, restarting a thread for a user API is effectively instant and free.
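That layered restart policy can be sketched in Elixir (the module names, the worker's state, and the exact intensity numbers are invented for illustration; the 10-restarts-in-60-seconds setting mirrors the "ten workers in a minute" example above):

```elixir
defmodule MyWorker do
  use GenServer

  def start_link(arg), do: GenServer.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(arg), do: {:ok, arg}
end

defmodule MySupervisor do
  use Supervisor

  def start_link(arg), do: Supervisor.start_link(__MODULE__, arg, name: __MODULE__)

  @impl true
  def init(_arg) do
    children = [
      {MyWorker, :some_state}
    ]

    # A crashed worker is simply restarted. If more than 10 restarts
    # happen within 60 seconds, the supervisor itself gives up and
    # crashes, escalating to *its* parent: the "issues bubble up" part.
    Supervisor.init(children, strategy: :one_for_one, max_restarts: 10, max_seconds: 60)
  end
end
```

Killing the worker and watching a fresh pid appear under the same registered name is the whole trick: the supervisor is notified of the exit and applies its policy.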
It's not a misconception given that Elixir Forum and Discord members will say that to you. Also, I never assumed the whole program crashed, so why would you explain this to me? Why would one blog guy know it better than a lot of other Elixir devs?
It’s well known among Elixir devs that, for reasons unknown, Elixir Forum is populated predominantly by people who don’t know what they’re talking about.
Blog guy here: I do, in fact, know it better than a lot of other Elixir devs.
I don’t know Go, but that sounds like someone has simply written part of Erlang in Go.
"Let it crash" in Erlang/Elixir means that the process that serves the request is allowed to crash. It then will be restarted by the supervisor.
Supervisors themselves form a tree, so for a crash to take down the whole app, it needs to propagate all the way to the top.
Another explanation for people familiar with exceptions in other languages: "Don't try to catch the exception inside a request handler".
Question as a complete outsider: If I run idempotent Python applications in Kubernetes containers and they crash, Kubernetes will eventually restart them. Of course, knowing what to do on IO errors is nicer than destroying and restarting everything with a much bigger hammer (as the article also mentions, you can serve a better error message for whoever has to “deal” with the problem), but eventually they should end up in the same workable state.
Is this conceptually similar, but perhaps at code-level instead?
Somewhat, yes, but it's much less powerful. In the BEAM there are trees of supervisors and monitors/links that choose how to restart and receive the stacktrace/error reason of the failure, respectively. This gives a lot of freedom in how to handle the failure. In k8s, it's often just a dumb monitor/controller that knows little about how to remediate the issue on boot. Never mind the boot-time penalty.
https://hexdocs.pm/elixir/1.18.4/Supervisor.html
BEAM apps run great on k8s.
Conceptually similar, different implementation. Perhaps the most visible difference is that supervisors aren’t polling application state but are notified about errors (crashes), and restarting is extremely low latency. Erlang/BEAM was invented for telephony, and it is possible for this to happen in the middle of a protocol without the user even noticing.
In general, if you can move any kind of logic to a lower level, that's better.
For example, testing that kubernetes restarts work correctly is tricky and requires a complicated setup. Testing that an erlang process/actor behaves as expected is basically a unit test.
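The shape of such a unit test, sketched as a plain Elixir script (the `:boom` exit reason is invented): a monitor turns the crash into an ordinary message the test can assert on.

```elixir
# spawn_monitor/1 returns the pid plus a monitor reference; when the
# process dies, we receive a :DOWN message carrying the exit reason.
{pid, ref} = spawn_monitor(fn -> exit(:boom) end)

reason =
  receive do
    {:DOWN, ^ref, :process, ^pid, r} -> r
  after
    1_000 -> :timeout
  end

# The crash is now plain data (:boom) that a test can assert on,
# with no external orchestration or restart machinery involved.
```

That is the whole "complicated setup": one function call and one receive.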
I bet the kubernetes project has tests for that; why should I as an application developer care about testing something other than my own code?
Oh of course, I'm sure the kubernetes project tests that they trigger restarts correctly etc.
But that doesn't cover the behavior of your app, the specific configuration you ask kubernetes to use and how the app uses its health endpoints etc. - this is all purely about your own code/config, the kubernetes team can't test that.
That's assuming your code is well-configured. How do you test your k8s configs?
This seems specific to BEAM, as crashing a FastCGI process is fine and the response will be handled correctly by Apache or nginx.
Unix/BSD -> Crash, fix, restart.
GNU/MIT/Lisp -> Detect, offer a fix, continue.
I don't code in Erlang or Elixir, aside from messing about. But I've found that letting an entire application crash is something that I can do under certain circumstances, especially when "you have a very big problem and will not go to space today". For example, if there's an error reading some piece of data that's in the application bundle and is needed to legitimately start up in the first place (assets for my game for instance). Then upon error it just "screams and dies" (spits out a stack trace and terminates).
Errors during initialization of a BEAM language application will crash the entire program, and you can decide to exit/crash a program if you get into some unrecoverable state. The important thing is the design of individual crashable/recoverable units.
“Reset on error” might be a better phrasing.
Hackers also love auto-restarting services.
Exploitation of vulnerabilities isn’t always 100% reliable. Heap grooming might be limited or otherwise inadequate.
A quick automatic restart keeps them in business without any other human interaction involved.
Took me a minute to realize what you meant with "hackers". Quite the irony, given the name of the site we are having this conversation on.
I think a lot of folks who have never looked at Erlang or Elixir and BEAM before misunderstand this concept because they don't understand how fine-grained processes are, or can be, in Erlang. A very important note: Processes in BEAM languages are cheap, both to create and for context switching, compared to OS threads. While design-wise they offer similar capabilities, this cost difference results in a substantially different approach to design in Erlang than in systems where the cost of introducing and switching between threads is more expensive.
In a more conventional language where concurrency is relatively expensive, and assuming you're not an idiot who writes 1-10k SLOC functions, you end up with functions that have a "single responsibility" (maybe not actually a single responsibility, but closer to it than having 100 duties in one function) near the bottom of your call tree, but they all exist in one thread of execution. In a hypothetical system created in this model, suppose your lowest-level function is something like a `retrieve_data(db_conn, params)` call that queries the database.
If the database connection fails, would you attempt to restart it in this function? Maybe, but that'd be bad design. You'd most likely raise an exception or change the signature so you could express an error return; in Rust and similar languages it would return something like `Result<Data, DbError>`. Somewhere higher in the call stack you have a handler which catches the exception or processes the error and determines what to do. That is, the function `retrieve_data` crashes: it fails to achieve its objective and does not attempt any corrective action (beyond maybe a few retries in case the error is transient).

In Erlang, you have a supervision tree which corresponds to this call-tree concept, but for processes. The process handling data retrieval, having been given some db_conn handle and the parameters, will fail for some reason. Instead of handling the error in this process, the process crashes. The failure condition is passed to the supervisor, which may or may not have a handler for this situation.
You might put the simple retry policy in the supervisor (that basic assumption of transient errors, maybe a second or third attempt will succeed). It might have other retry policies, like trying the request again but with a different db_connection (that other one must be bad for some reason, perhaps the db instance it references is down). If it continues to fail, then this supervisor will either handle the error some other way (signaling to another process that the db is down, fix it or tell the supervisor what to do) or perhaps crash itself. This repeats all the way up the supervision tree, ultimately it could mean bringing down the whole system if the error propagates to a high enough level.
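The worker side of that split can be sketched in Elixir (the module, functions, and fake query results are invented for illustration): the process asserts the happy path and leaves every failure to its supervisor.

```elixir
defmodule DataFetcher do
  # The happy path is the only path this process knows about.
  # If query/2 returns {:error, reason}, the match below fails,
  # the process crashes, and the supervisor applies its retry policy.
  def retrieve_data(db_conn, params) do
    {:ok, rows} = query(db_conn, params)
    rows
  end

  # Stand-in for a real database call.
  defp query(_conn, %{fail: true}), do: {:error, :connection_lost}
  defp query(_conn, _params), do: {:ok, [%{id: 1}]}
end
```

Note there is no rescue, no retry loop, and no error-return plumbing inside the worker; all of that lives one level up.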
This is conceptually no different than how errors and exceptions are handled in sequential, non-concurrent systems. You have handlers that provide mechanisms for retrying or dealing with the errors, and if you don't the error is propagated up (hopefully you don't continue running in a known-bad state) until it is handled or the program crashes entirely.
In languages that offer more expensive concurrency (traditional OS threads), the cost of concurrency (in memory and time) means you end up with a policy that sits somewhere between Erlang's and a straight-line sequential program. Your threads will be larger than Erlang processes so they'll include more error handling within themselves, but ultimately they can still fail and you'll have a supervisor of some sort that determines what happens next (hopefully).
As more languages move to cheap concurrency (Go's goroutines, Java's virtual threads), system designs have a chance to shift closer to Erlang than that straight-line sequential approach if people are willing to take advantage of it.
There's really nothing more useful to say than what the relevant section (4.4) of Joe Armstrong's thesis says:
>How does our philosophy of handling errors fit in with coding practices? What kind of code must the programmer write when they find an error? The philosophy is let some other process fix the error, but what does this mean for their code? The answer is let it crash. By this I mean that in the event of an error, then the program should just crash. But what is an error? For programming purpose we can say that:
>• exceptions occur when the run-time system does not know what to do.
>• errors occur when the programmer doesn’t know what to do.
>If an exception is generated by the run-time system, but the programmer had foreseen this and knows what to do to correct the condition that caused the exception, then this is not an error. For example, opening a file which does not exist might cause an exception, but the programmer might decide that this is not an error. They therefore write code which traps this exception and takes the necessary corrective action.
>Errors occur when the programmer does not know what to do. Programmers are supposed to follow specifications, but often the specification does not say what to do and therefore the programmer does not know what to do.
>[...]
>The defensive code detracts from the pure case and confuses the reader—the diagnostic is often no better than the diagnostic which the compiler supplies automatically.
Note that this "program" is a process. For a process doing work, encountering something it can't handle is an error per the above definitions, and the process should just die, since there's nothing better for it to do; for a supervisor process supervising such processes-doing-work, "my child process exited" is an exception at worst, and usually not even an exception since the standard library supervisor code already handles that.
https://fsharpforfunandprofit.com/rop/
Railway oriented programming to the rescue?
The truth is that different errors have to lead to different results if you want a good organisational outcome. These could be:
- Fundamental/fatal error: something without which the process cannot function, e.g. we are missing an essential config option. Exiting with an error is totally adequate. You can't just heal from that, as it would involve guessing information you don't have. Admins need to fix it.
- Critical error: something that should never occur, e.g. having an active user without a password and email. You don't exit; you skip it if that is possible and ensure the first occurrence is logged and admins are contacted.
- Expected/Regular error: something that is expected to happen during the normal operations of the service, e.g. the other server you make requests to is being restarted and thus unreachable. Here the strategy may vary, but it could be something like retrying with random exponential backoff. Or you could briefly accept the values provided by that server are unknown and periodically retry to fill the unknown values. Or you could escalate that into a critical error after a certain amount of retries.
- Warnings: these are usually about something being not exactly ideal, but they do not impede the flow of the program at all. Usually has to do with bad data quality.
If you can proceed without degrading the integrity of the system you should; the next thing is to decide how important it is for humans to hear about it.
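One way to sketch that split in Elixir (the error atoms, return shapes, and retry limit are all invented for illustration):

```elixir
defmodule ErrorPolicy do
  require Logger

  # Fatal: refuse to run without required config; admins must fix it.
  def handle(:missing_config), do: exit({:shutdown, :missing_config})

  # Critical: log loudly, skip the bad record, keep serving.
  def handle({:invariant_violated, record}) do
    Logger.error("active user without password/email: #{inspect(record)}")
    :skip
  end

  # Expected: retry with growing backoff while attempts remain...
  def handle({:upstream_unreachable, attempt}) when attempt < 5 do
    {:retry_in, attempt * 100}
  end

  # ...then escalate to a critical error after too many retries.
  def handle({:upstream_unreachable, _attempt}), do: {:escalate, :critical}

  # Warning: note it and move on.
  def handle({:data_quality, field}) do
    Logger.warning("suspicious value in #{field}")
    :ok
  end
end
```

The point is that the classification, not the error itself, decides whether the process exits, skips, retries, or merely logs.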
>When people say “let it crash”, they are referring to the fact that practically any exited process in your application will be subsequently restarted. Because of this, you can often be much less defensive around unexpected errors. You will see far fewer try/rescue, or matching on error states in Elixir code.
I just threw up in my mouth when I read this. I've never used this language, so maybe my experience doesn't apply here, but I'm imagining all the different security implications that I've seen arise from failing to check error codes.
That’s actually a good example. Imagine someone forgot to check the error code from an API response. In some languages, they may attempt to parse it as if it were a successful request, and succeed, leading to a result with nulls, empty arrays, or missing data that then spreads through the system. In Elixir, parsing would most likely fail thanks to pattern matching [1], and if by any chance that fails in a core part of the system, that failure will be isolated and that particular component can be restarted.
Elixir is not about willingly ignoring error codes or failure scenarios. It is about naturally limiting the blast radius of errors without a need to program defensively (as in writing code for scenarios you don’t know “just in case”).
1: https://dashbit.co/blog/writing-assertive-code-with-elixir
Ok, so it's not really that you're not checking error codes. It's that you can write stuff like
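For instance, a sketch in Elixir syntax (`whatever` here is just a stand-in anonymous function):

```elixir
# `whatever` stands in for any function following the
# {:ok, value} | {:error, reason} convention.
whatever = fn -> {:ok, 42} end

# Assert the success case directly. If the call returned
# {:error, reason} instead, this match would fail and the
# process would crash, which is exactly the point.
{:ok, result} = whatever.()
```
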
If `whatever` is successful and idiomatic, it returns `ok`, or maybe a tuple of `{ok, SomeReturn}`. In that case, execution continues. If it returns an error tuple like `{error, Reason}`... "let it crash" says you can just let it crash. You didn't have anything better to do; the built-in crash because of `{error, Reason}` will do fine. Or you could match only the errors you expect.
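A sketch of that selective matching in Elixir syntax (the `handle_lookup` function is invented, following the comment's DNS-style `nxdomain` example):

```elixir
# Handle only the outcomes you actually expect; any other
# {:error, reason} matches no clause and crashes the process.
handle_lookup = fn
  {:ok, addr} -> {:found, addr}
  {:error, :nxdomain} -> :not_found
end
```

There is deliberately no catch-all clause: an unexpected error is exactly the case where crashing is the right default.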
If it was fine to get an nxdomain error, but any other error isn't acceptable... it will just crash, and that's good, or at least OK. Better than having to enumerate all the possible errors, or having a catch-all that then explicitly throws an error. It's especially hard to enumerate all possible errors because the running system can change and may return a new error that wasn't enumerated when the requesting code was written.

There are lots of places where crashing isn't actually what you want, and you have to capture all errors, explicitly log them, and then move on... But when you can, checking for success, or for success and a handful of expected and recoverable errors, is very nice.
If you get a chance to read some Elixir/Erlang code, you'll see that pattern matching is used frequently to assert expected error codes. It does not mean ignoring errors.
This is a common misunderstanding because unfortunately the slogan is frequently misinterpreted.