As someone who has worked with serverless for multiple years (mostly AWS Lambda, but others too), I can absolutely confirm the author's points.
While it "takes away" some work from you, it adds that work back elsewhere, where you get to solve the "artificially induced problems".
Another example I hit was a hard upload limit. I ported an application to a serverless variant; it had an import API for huge customer exports. Shouldn't be a problem, right? Just set up an ingest endpoint and some background workers to process the data.
Then I learned: I can't upload more than 100 MB at a time through the "API Gateway" (basically their proxy that invokes your code), and when I asked if I could change it somehow, I was just told to tell our customers to upload smaller file chunks.
While from a "technical" perspective this sounds logical, our customers are not going to replace all their software so that we get a "nicer upload strategy".
For me this is comparable to "it works in a vacuum" type of things. It's cool in theory, but as soon as it hits reality, you realize quite fast that the time and money you saved by moving from permanently running machines to serverless, you will spend in other ways solving the serverless specialities.
The way to work around this issue is to provide a presigned S3 URL.
Have the users upload to S3 directly, and then they can either POST you what they uploaded or you can find some other means of correlating the input (e.g. files in S3 are prefixed with the request ID or something).
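For anyone who hasn't done it before, here's a minimal sketch of that flow with boto3 (bucket name, key layout, and expiry are placeholder choices):

```python
import uuid
import boto3

s3 = boto3.client("s3")

def make_upload_url(bucket: str = "customer-imports") -> dict:
    # Prefix the key with a request ID so a background worker can
    # correlate the upload with the original request later.
    request_id = str(uuid.uuid4())
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": bucket, "Key": f"{request_id}/export.csv"},
        ExpiresIn=3600,  # the URL stays valid for one hour
    )
    return {"request_id": request_id, "upload_url": url}
```

The client then PUTs the file straight to S3 with that URL, bypassing the API Gateway body-size limit entirely.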
I agree this is annoying, and maybe I've been in the AWS ecosystem for too long.
However, having an API that accepts an unbounded amount of data is a good recipe for DoS attacks. I suppose the 100 MB limit is outdated now that the internet has gotten faster, but eventually we do need some limit.
Well, I partly agree, and if I were the one building the counterpart, I probably would have used presigned S3 URLs too.
In this specific case I'm getting old-school file upload requests from software that was partly written before the 2000s - no one is going to adjust anything any more.
And yeah, just accepting giant uploads is far from great in terms of security (DoS and the like) - but we're talking about CSV files of somewhere between 100 and 300 MB (I called them "huge" because in terms of product data, 200-300 MB of text covers quite a lot). Not ideal, but we try to satisfy our customers' needs.
But yeah, like all the other points: everything is solvable somehow - it just requires us to spend more time solving something that technically wasn't a real problem in the first place.
Edit: Another funny example. In a similar process on another provider, I downloaded files in a similar size range from S3 to parse them - which died again and again. After contacting the hoster (because their logs literally just stopped - no error, no tracing, nothing), they told me that their setup basically only allows 10 MB of local storage - and the default (in this case the AWS S3 adapter for PHP) always downloads the whole file, even if you tell it to "stream". So I built a solution that used HTTP range requests to "fake-stream" the file into memory in smaller chunks, so I could process it afterwards without downloading it completely. Just another example of: yes, it's solvable, but annoying.
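The original was PHP, but here is a Python sketch of the same range-request trick (names and chunk size are illustrative):

```python
import boto3

s3 = boto3.client("s3")

def fake_stream(bucket: str, key: str, chunk_size: int = 8 * 1024 * 1024):
    """Yield an S3 object in chunks via HTTP Range requests, so no more
    than chunk_size bytes are ever held in memory at once."""
    size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
    offset = 0
    while offset < size:
        end = min(offset + chunk_size, size) - 1
        resp = s3.get_object(
            Bucket=bucket,
            Key=key,
            Range=f"bytes={offset}-{end}",  # fetch only this slice
        )
        yield resp["Body"].read()
        offset = end + 1
```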
I find with these types of customers it's always easier to just ask them to save files locally and grant me privileges to read the data. Sometimes they'll be on Google, Dropbox, Microsoft, etc., and I also run an SFTP server for this in case they want to move files over to my service.
Then I either batch/schedule the processing or give them an endpoint just to trigger it (/data/import?filename=demo.csv)
It’s actually so common that I just have the “data exchange” conversation and let them decide which fits their needs best. Most of it is available for self service configuration.
Uploads to an S3 bucket can trigger a Lambda… don't complicate things. The upload trigger can tell the system about the upload, and the client can continue on with their day.
The uploader on the client uses a presigned URL. S3 triggers a Lambda. The Lambda function takes the file path and tells the background workers about it, either via a queue, MQ, REST, gRPC, or by doing the lift in workflow ETL functions.
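A minimal sketch of that middle step (the queue URL is a placeholder; the S3 event shape is standard):

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/import-jobs"  # placeholder

def handler(event, context):
    # S3 put events arrive under event["Records"]; forward each object
    # reference to the background workers via SQS.
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]  # note: URL-encoded in the event
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"bucket": bucket, "key": key}),
        )
```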
And while you are being sarcastic, this is the Right Way to use queues.
Upload file to S3 -> trigger an SNS message for fanout if you need it -> SNS -> SQS trigger -> SQS to ETL jobs.
The ETL job can then be hosted using Lambda (easiest) or ECS/Docker/Fargate (still easy and scales on demand) or even a set of EC2 instances that scale based on the items in a queue (don’t do this unless you have a legacy app that can’t be containerized).
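The consuming end of that chain, sketched as an SQS-triggered Lambda (process_file stands in for the real ETL work):

```python
import json

def process_file(bucket: str, key: str) -> None:
    """Stand-in for the actual ETL work."""
    print(f"processing s3://{bucket}/{key}")

def etl_handler(event, context):
    # With an SQS trigger, Lambda delivers messages in batches. Raising
    # here returns the batch to the queue for retry, and repeated
    # failures land in the dead-letter queue - the retry/DLQ behavior
    # mentioned elsewhere in this thread comes for free.
    for record in event["Records"]:
        job = json.loads(record["body"])
        process_file(job["bucket"], job["key"])
```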
If your client only supports SFTP, there is the SFTP Transfer Service on AWS that will allow them to send the file via SFTP and it is automatically copied to an S3 bucket.
Alternatively, there are products that treat S3 as a mountable directory and they can just use whatever copy commands on their end to copy the file to a “folder”
If I have a user-facing upload button, why can't I simply have a webserver that receives the data and pushes it into S3 via multipart upload? Something that can be written in a framework of your choice in 10 minutes with zero setup.
For uploads under 50 MB you could also skip the multipart upload and take a naive approach without taking a significant hit.
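It is indeed short; a sketch with Flask and boto3 (the bucket name is a placeholder; upload_fileobj switches to multipart automatically once the body is large enough):

```python
import boto3
from flask import Flask, request

app = Flask(__name__)
s3 = boto3.client("s3")

@app.post("/upload")
def upload():
    f = request.files["file"]
    # upload_fileobj streams the request body through to S3 and
    # transparently uses multipart upload for large files.
    s3.upload_fileobj(f.stream, "customer-imports", f.filename)
    return {"status": "ok", "key": f.filename}
```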
Solving a serverless limitation with more serverless so you can continue doing serverless when you can’t FormUpload a simple 101mb zip file as an application/octet-stream. Doubling down on it for a triple beat.
I wouldn't really call it "more" serverless to rearrange the order a bit. Which makes it "solving a serverless limitation so you can continue doing serverless". And that's just a deliberately awkward way of saying "solving a serverless limitation", because if you can solve it easily, why would you not continue? Spite?
So I still don't see how it's notably worse than the idea of using serverless at all.
The controversy here is the fact that API Gateway limits the upload, resulting in having to engineer a workaround workflow using S3 and triggers (even if this is the serverless way) when all you want to do is upload a file. A POST call with an application/octet-stream. Let HTTP handle resume. But you can't, and you end up going around the side door, when all you really want is client_max_body_size.
The sarcasm of correctness yet playing down its complexity is entirely my own. We used to be able to do things easily.
Nope, you didn’t use terracottax so you failed anyway. 6 months before you can reapply in case the first humiliation wasn’t enough. Boss was looking for AWS Glue in there and you didn’t use it.
It actually is though. I don't need to build a custom upload client, I don't need to manage restart behavior, I get automatic restarts if any of the background workers fail, I have a dead letter queue built in to catch unusual failures, I can tie it all together with a common API that's a first class component of the system.
Working in the cloud forces you to address the hard problems first. If you actually take the time to do this everything else becomes _absurdly_ easy.
I want to write programs. I don't want to manage failures and fix bad data in the DB directly. I personally love the cloud and this separation of concerns.
For S3 you do need to generate a presigned URL, so you would have to add that logic somewhere instead of "just having a generic HTTP upload endpoint".
Unless the solution is "don't have the problem in the first place" the cloud limitations are just getting in the way here.
The solution is to use the appropriate tool for the job. If you're locked in to highly crusty legacy software, it's inevitably going to require workarounds. There are good technical reasons why arbitrary-size single-part file uploads are now considered an anti-pattern. If you must support them, then don't be shocked if you wind up needing EC2 or other lower-level service as a point of ingress into your otherwise-serverless ecosystem.
If we want to treat the architectural peculiarities of GP's stack as an indictment of serverless in general, then we could just as well point to the limitations of running LAMP on a single machine as an indictment of servers in general (which obviously would be silly, since LAMP is still useful for some applications, as are bare metal servers).
We downplay it, but while generating a signed URL is trivial - only a few lines and a function call - you then have to send this URL to the client. The client has to use it, then check back with you to see if the file arrived, resulting in a kind of pea-soup architecture unless your application is also entirely event-driven. Oh, how we get suckered in…
> Working in the cloud forces you to address the hard problems first.
It also forces you to address all the non-existent problems first, the ones you just wish you had, like all the larger companies that genuinely have to deal with thousands of file uploads per second.
And don't forget all the new infrastructure you added to do the job of just receiving the file in your app server and putting it where it was going to go anyway, but now via separate components that always seem to end up with individual repositories and separate deployment pipelines, and that can't be effectively tested in isolation without going into their target environment.
And all the additional monitoring you need on each of the individual components that were added, particularly on those helpful background workers, to make sure they're actually getting triggered (you won't know they're failing if they never get called in the first place due to misconfiguration).
And you're now likely locked into your upload system being directly coupled to your cloud vendor. Oh wait, you used MinIO to provide a backend-agnostic intermediate layer? Great, that's another layer that needs managing.
Is a content delivery network better suited to handling concurrent file uploads from millions of concurrent users than your app server? I'd honestly hope so, that's what it's designed for. Was it necessary? I'd like to see the numbers first.
At the end of the day, every system design decision is a trade off and almost always involves some kind of additional complexity for some benefit. It might be worth the cost, but a lot of these system designs don't need this many moving parts to achieve the same results and this only serves to add complexity without solving a direct problem.
If you're actually that company, good for you and genuinely congratulations on the business success. The problem is that companies that don't currently and may never need that are being sold system designs that, while technically more than capable, are over-designed for the problem they're solving.
> You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.
A lot of those failure-mode examples seem well suited to client-side retries and appropriate rate limiting. If we're talking file uploads then sure, there absolutely are going to be cases where having clients go to the third party is more beneficial than costly (high variance in allowed upload size would be one to consider), but for simple upload cases I'm not convinced that high-level client retries wouldn't work.
> if they never got called in the first place due to misconfiguration
I find it hard to believe that having more components to monitor will ever be simpler than fewer. If we're being specific about vendors, the AWS console is IMHO the absolute worst place to go for a good centralized logging experience, so you almost certainly end up shipping your logs into a better centralized logging system that has more useful monitoring and visualisation features than CloudWatch and has the added benefit of not being the AWS console. The cost here? Financial, time, and complexity/moving parts for moving data from one to the other. Oh and don't forget to keep monitoring on the log shipping component too, that can also fail (and needs updates).
> The protocol provided by S3 is available through dozens of vendors.
It's become a de facto standard for sure, and is helpful for other vendors to re-implement it but at varying levels of compatibility.
> It only matters if it is of equivalent or lesser cost.
This is precisely the point, I'm saying that adding boxes in the system diagram is a guaranteed cost as much as a potential benefit.
> Yet you explicitly ignore these
I repeatedly mentioned things that to me count as complexity that should be considered. Additional moving parts/independent components, the associated monitoring required, repository sprawl, etc.
> No, I just read the documentation, and then built it.
I also just 'read the documentation and built it', but other comments in the thread allude to vendor-specific training pushing for not only vendor-specific solutions (no surprise) but also the use of vendor-specific technology that maybe wasn't necessary for a reliable system. Why use a simple pull-based API with open standards when you can tie everything up in the world of proprietary vendor solutions that have their own common API?
We became the flagship customer for a division of AWS that was responsible for managing SSL certificates. We were doing vanity URLs, and vanity URLs generally require an individual SSL certificate for each domain name. We needed thousands, and AWS's tools for cert management at the time were really only happy with hundreds; they had backlog items to fix that, but those were behind a year or two of other work. It took them about three months to get far enough along for our immediate needs. It's surprising which parts of AWS have not adjusted to outliers that don't really seem all that exceptional.
I also thought Lambda looked promising at first, but we ultimately abandoned all our Lambda projects and started using containers as needed.
Lambda still requires you to update the Node runtime every year or two, while with your own containers you can decide on your own upgrade schedule.
I've observed massive back-office pipelines using dozens of interconnected Lambdas, batching, streaming, distributed storage for ephemeral data, and other Rube Goldberg contraptions to build what was ultimately a cron job on a modest server running for one hour.
Being in the cloud doesn't mean you need to accept timeouts/limitations. CDK + Fargate can easily run an ephemeral container to perform some offline processing.
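A sketch of kicking off such an ephemeral task with boto3 (cluster, task definition, and network values are placeholders):

```python
import boto3

ecs = boto3.client("ecs")

def run_offline_job():
    # Launch a one-off Fargate task; it runs to completion and goes
    # away, with no Lambda-style 15-minute timeout to work around.
    ecs.run_task(
        cluster="batch-cluster",          # placeholder
        taskDefinition="nightly-etl:3",   # placeholder
        launchType="FARGATE",
        networkConfiguration={
            "awsvpcConfiguration": {
                "subnets": ["subnet-0123456789abcdef0"],  # placeholder
                "assignPublicIp": "ENABLED",
            }
        },
    )
```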
Just to help future readers: there is an ecosystem of "tus" uploaders and endpoints that chunk uploads and feature resumable uploads, which would be ideal for this kind of restriction:
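For a flavor of the client side, a minimal sketch with the Python tus client (tuspy); the endpoint URL is a placeholder and the API is from memory, so check the tuspy docs:

```python
# pip install tuspy
from tusclient import client

# Any tus-enabled server will do; this URL is a placeholder.
tus = client.TusClient("https://uploads.example.com/files/")

uploader = tus.uploader("export.csv", chunk_size=5 * 1024 * 1024)
uploader.upload()  # sends 5 MB chunks; can resume after an interruption
```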
For some architectural arguments, I kick myself for not keeping a bibliography of all my justifications. The thing about mastering something is that you copy the rules into the intuitive part of your brain and no longer have to reason through them step by step like one of Socrates's lectures. You just know, and you do.
The biggest one I regret is "communicating through the file system is 10x dumber than you think it is, even if you think you know how dumb it is." I should have a three page bibliography on that. Mostly people don't challenge you on this, but I had one brilliant moron at my last job who did, and all I could do was stare at him like he had three heads.
Like the article says, I think serverless has its place, but I don't think it's for most applications. I can't see myself _ever_ using serverless services as a core part of my application for pretty much any startup, if I can avoid it. The infrastructure overhead is actually worse, IMO.
Everything is so platform specific and it's much stranger to test and develop against locally. Each platform has a different way to test, and the abstraction layers that exist (unless this has changed recently) always had pitfalls, since there are no true standards.
I'd much rather have a docker image as my deliverable "executable" so I can test, but still abstract away some stuff, like environment setup. Giving me a minimal Linux environment and filesystem feels like the most comfortable level of abstraction for me to develop well and also deploy and run in production effectively. I can also run that on demand or (most commonly) run that as a server that sits and waits for requests.
> Like the article says, I think serverless has its place, but I don't think it's for most applications
I feel this way about many of the more popular trends over the last decade or so.
A technology becomes popular because a really large organization uses it to solve problems that only really exist at that scale. Then they talk about how it works, and many people take that as an indicator that the solution is ideal, regardless of context.
GraphQL, React, Tailwind, NextJS, and many others. They solve specific problems, and those may be problems you have. No tool is a universally useful solution, though; it takes real experience and knowledge to discern what your problems are and how to best tackle them.
A lot of it is because platforms need a wedge against AWS/Azure.
Take edge computing for example – most apps are CRUD apps that talk to a single database, but vendors keep pushing for running things on edge servers talking to databases with eventual consistency because hey, at least AWS doesn't offer that! (or it would cost you 10k/month to run that on AWS)
For the problem of eventual consistency, Google's Spanner and Firestore both provide strong or tunable global consistency. Instead of "eventual consistency because AWS doesn’t offer that", GCP has "we give you strong consistency at scale."
That’s funny - I agree with all of those except Tailwind, which I find super convenient for small teams where most people aren’t CSS experts. Maybe that’s a specific problem, but it feels pretty general to want convenience on top of base CSS without writing and maintaining it all yourself.
I think the entire concept of HTML and CSS when it comes to browser-based app development (as opposed to styling documents) is just not that ergonomic compared to a native equivalent.
It might have seen a lot of improvements over the years to make it more flexible, but I can't imagine any serious project raw-dogging CSS rather than pulling in some kind of framework or abstraction to make it manageable.
Is the main benefit you find related to the resets and design system built in to tailwind?
I've always found that I need to know CSS to really use tailwind, and I need to additionally know tailwind-specific syntax to do anything somewhat complex related to layout or conditional styling.
As someone with years of experience in serverless on AWS I might be a bit biased, BUT I'd argue serverless is the sweet spot for most applications. You need to remember, however, that most applications aren't your typical startup or software product but simply rather boring line-of-business software that nobody outside the owning company knows of.
Considering how IT departments in most non-software companies are, the minimal operational burden is a massive advantage, and the productivity is great once you have a team with enough cloud expertise. Think bespoke e-commerce backends, product information management systems, or data platforms with teams of a handful of developers taking responsibility for the whole application lifecycle.
The cloud expertise part is a hard requirement though but luckily on AWS the curriculum is somewhat standardized through developer and solutions architect certifications. That helps if you need to do handovers to maintenance or similar.
That said, even as a serverless fan, I immediately thought of containers when the performance requirements came up in the article. Same with the earlier trending "serverless sucks" piece about video processing on AWS. Most of the time serverless is great, but it's definitely not a silver bullet.
I like your angle, but most applications is a big difference from most companies. Serverless comes after deciding whether or not to break up the monolith, and after breaking up engineering into separate teams. It's a good way to manage apps with high variance in traffic while keeping cloud spend down.
Localstack makes that pretty easy. Before Localstack I had a pre-staging environment ("dev") I would deploy to. Their free/community offering includes a Lambda environment: you deploy your dev "Lambda" locally to Docker, using the same Terraform / deploy flow you'd normally use, but pointed at a Localstack server which mimics the AWS API instead. Some of their other offerings require you to pay, though (CloudFront, ECS, others), and I don't use those, yet at least.
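The pointing-at-Localstack part is just an endpoint override, e.g. with boto3 (4566 is Localstack's default edge port):

```python
import boto3

# Same application code as production; only the endpoint differs.
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:4566",  # Localstack's default edge port
    aws_access_key_id="test",              # Localstack accepts dummy creds
    aws_secret_access_key="test",
    region_name="us-east-1",
)

s3.create_bucket(Bucket="dev-bucket")
print(s3.list_buckets()["Buckets"])
```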
I had a horrible time with Localstack. It's very similar to AWS but not exactly the same, so you hit all kinds of annoying edges and subtle bugs or differences and you're basically just doing multi-cloud at that point. The same kinds of problems you encounter with slightly inaccurate mocks in automated testing.
The better solution in my experience is to create an entirely different account to your production environments and just deploy things there to test. At least that way you're dealing with the real SQS, S3, etc, and if you have integration problems there, they're actually real things you need to fix for it to work in prod - not just weird bandaids to make it work with Localstack too.
I dealt with a microservice-style serverless backend that was spread out across roughly 50 Go-based Lambdas, one per API endpoint, with a few more for SQS and Cognito triggers. It was deployed via CloudFormation. Testing locally was an absolute nightmare, even with Localstack.
Made me wish for a simple VPS, an ansible script to setup nginx and fail2ban and all that shit, and a deployment pipeline that scp'd a single binary into the right place.
Developer-specific sandbox environments with hot code reload are the gold standard here, but Localstack is great if you can't do that due to (usually IT department-related, not technical) reasons.
That’s why Knative (Serverless on Kubernetes) accepts containers. It’s the standard packaging format that lets you lift and shift apps to many different platforms.
But they weren’t building an application. They were building a library that would get integrated in other stateful server applications not running on Cloudflare. The performance benefit comes from running their auth colocated with their customers, not anything else.
I've been doing AWS Lambda since it started up over 10 years ago. It solves a lot of problems for me. I don't ever have to worry about load balancing or scaling. I don't have to maintain a server. When it isn't being used, I am not paying for it. I've been running a pretty sophisticated project on Lambda for years, and I pay about $0.00/month for it. Most of the ~$0.45/mo I pay to AWS is in S3.
Lambda code is extremely easy to test locally, if you write it that way. I just run the file locally and it does what it would do in the cloud, there is literally no difference. But of course, YMMV depending on how you approach it.
I created my own build tools for Lambda about a month after Lambda was introduced as a product. It's been working great ever since. The workflow is very simple. When I update a file locally, it simply updates the Lambda function almost instantly. I can then test the Lambda live in the cloud. If I want to run the function locally, I just run it and it behaves the same way it would in the cloud. There's no need to run the function in AWS, if you write the code so it can be run locally. It's really, really easy to do, but I guess some people haven't figured that out yet.
I've never liked containers. It's always been way more opaque than writing Lambdas that can run locally as well as in the cloud.
Lambda does get expensive when call volume goes up. If you're handling 10rps typically, ECS becomes a lot cheaper.
It obviously depends on how long your requests last, but still.
As for running it locally, it depends what your upstream is. I can tell you that I've had to work around bugs in the marshalling from MSK, for example. You would never find that locally. If it's just web requests, sure.
I think developers are drowning in tools to make things "easy", when in truth many problems are already easy with the most basic stuff in our tool belt (a compiler, some bash scripts, and some libraries). You can always build up from there.
This tooling fetish hurts both companies and developers.
Yeah; IMO Docker was our last universal improvement to productivity, in 2013, and very little we've invented since then can be said to have had such a wide-ranging positive impact, with such few drawbacks. Some systems are helpful for some companies, but then try to get applied to other companies where they don't make sense and things fall apart or productivity suffers. Cloudflare and others are trying to make v8 isolates a thing, and while they are awesome for some workloads, people want them to be the "next docker", and they aren't.
The model "give me docker image, we put it on internet" is staggeringly powerful. It'll probably still be the most OP way to host applications in 2040.
Docker + IaC for me; GitOps, immutable servers, immutable code, immutable config, and (nearly) immutable infrastructure mean I haven't had to drop to the command line on a server since 2015. If something is wrong you restart the container; if that doesn't work you restart the host it's running on. The "downside" to this is that my "admin" shell skills outside of personal dev laptop commands have gotten rusty.
> If something is wrong you restart the container, if that doesn't work you restart the host it's running on
Haha, lucky you. If only world was this beautiful :) I regularly shell into Kubernetes nodes to debug memory leaks from non-limited pods, or to check some strange network issues.
It’s that, and the fact that precious few people seem to understand fundamentals anymore, which is itself fed by the desire to outsource everything to 3rd parties. You can build an entire stack where the only thing you’ve actually made is the core application, and even that is likely to be influenced if not built by AI.
The other troubling thing is that if you do invest time into learning fundamentals, you'll be penalized for it because it won't be what you're interviewed on and probably won't be what you're expected to do on the job.
I'm working on a project right now that should be two or three services running on a VM. Instead we have 40+ services spread across a K8s cluster with all the Helm Chart, ArgoCD, CICD pipeline fun that comes with it.
It drives me absolutely nuts. But hey if the company wants to pay me to add all that stuff to my resume, I guess I shouldn't complain.
Yeah, the previous company I worked for started with a Django monolith that someone had come in and taken an axe to, essentially at random, until there were 20 Django "microservices" that had to constantly talk to each other in order to do any operation, while trying to maintain consistency across a gigantic k8s cluster. They were even all still connected to the same original database that had served the monolith!
Unfortunately my campaign of "what if we stuck all the django back together and just had one big server" got cut short by being laid off because they'd spent too much money on AWS and couldn't afford employees any more.
I had to chuckle, how ironic this is...
I worked on a project where they had 6 microservices with about 2-3 endpoints each, and some of them would internally call other microservices to sync and join data. That was for 20 users, tops, and managed by 1 team. The cloud bill was exciting to look at!
I find myself exhausted after a time, when I have to switch between 2-3 apps and many more tabs trying to co-ordinate things or when debugging issues with a teammate. And this is with me working professionally for only ~3 years.
I think the tools are nice to use early on but quickly become tough to manage as I get caught up with work, and can't keep up with the best way to manage them. Takes a lot of mental effort and context switching to manage updates or track things everywhere.
Agreed, but not Bash. Bash should not be used for anything except interactive use. It's just way too error-prone and janky otherwise.
I am not at all a fan of Python but even so any script that you write in Bash would be better in Python (except stuff like installation scripts where you want as few dependencies as possible).
If it's worth saving in a file, it's worth not using Bash.
And that is actually the advantage of serverless, in my mind. For some low-traffic workloads, you can host for next to nothing. Per invocation, it is expensive, but if you only have a few invocations of a workload that isn't very latency sensitive, you can run an entirely serverless architecture for pennies per month.
Where people get burned is moving high traffic volumes to serverless... then they look at their bill and go, "Oh my god, what have I done!?" Or they try to throw all sorts of duct tape at serverless to make it highly performant, which is a fool's errand.
Exactly. I've always found that how people want to use lambda is the exact opposite of how to use it cost effectively.
I've seen a lot of people want to use lambdas as rest endpoints and effectively replace their entire API with a cluster of lambdas.
But that's about the most expensive way to use a lambda! 1 request, one lambda.
Where these things are useful is when you say, "I have this daily data pull and ETL that I need to do." Then all of a sudden the cost is pretty dang competitive.
The number of 0s in the price per second is mesmerizing, but just multiply it by 24 hours and 30 days and you are well within the price range of a better EC2 instance with much better performance - plus you can process 1000 req/s instead of 1 req/s for the same price.
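Back-of-the-envelope version (us-east-1 list prices as of writing; treat the numbers as illustrative):

```python
# One 1 GB Lambda kept busy 24/7 for 30 days (x86: ~$0.0000166667/GB-s):
gb_seconds = 1 * 60 * 60 * 24 * 30          # 2,592,000 GB-s
lambda_cost = gb_seconds * 0.0000166667     # ≈ $43.20, before request fees

# A t3.medium (2 vCPU, 4 GB) on-demand for the same month (~$0.0416/h):
ec2_cost = 0.0416 * 24 * 30                 # ≈ $29.95

print(f"Lambda: ${lambda_cost:.2f}, EC2: ${ec2_cost:.2f}")
```

And the Lambda figure buys exactly one concurrent execution, while the EC2 box serves many requests in parallel for the same money.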
"Cheap" is relevant if you are talking about work load that is one off and doesn't run continuously. A lot of people use serverless to run a 24-7 service which sort of defeats the purpose. It doesn't get that cheap anymore.
Serverless is good if you have one off tasks that are used intermittently and are not consistent.
I once started working at a company that sold one of those visual programming tools. During training I was tasked with making a simple program, and I was a bit overwhelmed by the amount of bugs and the lack of tools for building basic features, so I made a prototype of the application I wanted in Python, with the plan to port it later. I got it done in a couple of days.
The tool developers weren't keen on the idea; they told me, "Yeah, I can solve the problem with a script too - the challenge is to do it with our tool." And I thought it was kind of funny how they admitted that the premise of the tool didn't work.
It's like this holy-grail panacea that arises 20 times every month: developers want to invent something that will avoid the work of actually developing, so they sunk-cost-fallacy themselves into a deep hole out of which they can only escape if they admit that they are the ones tasked with automating - they cannot meta-automate themselves - and that they will have to, gasp, do some things manually and repeatedly, like any other working class.
This really is confirming my theory that the problem here is that "serverless" is so ill-defined that even its own name is nonsensical.
Like, there are still servers.
Appears just about as intelligent as calling them "electricity-less." I mean, yes, I no longer think about electricity when deploying things, but that doesn't tell me anything meaningful about what's going on here.
I once got a job at a company whose entire stack was AWS Lambda. Not a normal 'hey, we have this Flask application which handles 100 endpoints and we just use Lambda to handle a call' - more like 'every single route or function is its own Lambda function'. I left inside 2 weeks.
Saw this with an infamous 996 AI startup where the founders vibe coded their app.
Their app had like 3 or 4 pages, that was all, but every endpoint was a Lambda call. Every frontend log entry was a Lambda call. Every job execution was a Lambda call that called a dozen others. That was their idea of an architecture.
Even with <100 customers they were already paying >100k dollars a month on cloud, more than their entire dev team cost.
The constant firefighting, downtimes and billing issues were enough to keep the team busy instead of making new features.
I was freelancing but I didn't take the job after the "CTO" gave me a tour.
In principle, serverless should be very efficient and cheap if all the code is running in the same VM. The heterogeneous architecture must be what makes it expensive. Maybe serverless using WASM will one day make it cheap.
I don't think "serverless is bad" is necessarily the full lesson here. The bigger lesson is when a service has dependencies, moving that service closer to the client (without also moving those dependencies) will counterintuitively make the e2e experience slower, not faster.
Prefer building physically near your dependencies. If that's not fast enough, then you have to figure out how to move or sync all your dependencies closer to the client, which except in very simple cases, is almost always a huge can of worms.
> The bigger lesson is when a service has dependencies, moving that service closer to the client (without also moving those dependencies) will counterintuitively make the e2e experience slower, not faster
The problem here is that pretty much all services have dependencies - if they didn't, you could have already moved that logic client-side.
This bites edge-compute architectures really hard. You move compute closer to the client, but far from the DB, and in doing so your overall latency is almost always worse.
The most latency-friendly architecture is almost always going to be an in-memory DB on a monolithic server (albeit this comes with its own challenges around scaling and redundancy)
I don’t know if this is a good rule of thumb, I think it really depends on what you use the dependencies for, how often you need them, etc.
Consider for example a single DB dependency. Should the server be close to the DB or the client? It depends. How often does the client need the server? How often does the server need the DB? Which usecases are expected to be fast and which can be sacrificed as slow? What can be cached in the server? What can be cached in the client? etc etc.
And then of course you can split and do some things on the server and some in the edge…
Oh, it's a very good rule of thumb. It's probably not universal, but it's really close to it.
The problem is that nobody designs dependencies flexibly enough to let them run without fine-grained control. And the main application always wants to change the way it uses its dependencies, so it always needs further flexibility.
You can build an exception to the rule if you explicitly try. But I'm not sure one appears naturally. The natural way to migrate your server into the edge is by migrating entire workloads, dependencies included. You can split the work like you said, you just can't split single endpoints.
Well, I guess one can take one more step back and say this is all merely an example of "premature optimization is the root of all evil". Unless you know a priori that you have some very hard latency requirements, start with something simple and low-maintenance. If low-latency requirements come in later, then design for that holistically, not just looking at your component. Make sure you're measuring the right things; OOTB metrics often miss the e2e experience. And IME most latency issues come from unexpected places; I know I've spent weeks optimizing services to get an extra percent or two out of them, only to realize there was a config setting that reduced latency by half.
So generally, simplicity is your friend when it comes to latencies (among other things). Fewer things to cause long-tail spikes, more simple things you can try out that don't break the whole system, whereas if you start with a highly-optimized thing up-front, fixing some unexpected long-tail issue may require a complete rewrite.
Also, check with your PM or end users as to whether latency is even important. If the call to your service is generally followed up to a call to some ten-second process, users aren't going to notice the 20ms improvement to your own thing.
For the best price-to-performance ratio, create your own instances and do whatever is needed on them. Software stacks are not so complicated that you must delegate everything to the Wizards of Cloud Overcharging.
I think the "local maximum" we've gotten stuck at for application hosting is having a docker container as the canonical environment/deliverable, and injecting secrets when needed. That makes it easy to run and test locally, but still provides most of the benefits I think (infrastructure-as-code setups, reproducibility, etc). Serverless goes a little too far for most applications (in my opinion), but I have to admit some apps work really well under that model. There's a nearly endless number of simple/trivial utilities which wouldn't really gain anything from having their own infrastructure and would work just fine in a shared or on-demand hosting environment, and a massively scaled stateless service would thrive under a serverless environment much more than it would on a traditional server.
That's not to say that I think serverless is somehow only for simple or trivial use cases though, only that there's an impedance mismatch between the "classic web app" model, and what these platforms provide.
You are ready for misterio: https://github.com/daitangio/misterio
A tiny layer around a stateless Docker cluster.
I created it for my homelab and it went wild.
Docker is much like microservices. Appropriate for a subset of apps and yet touted as being 'the norm' when it shouldn't be.
There are drawbacks to using docker, such as security patching and operational overhead. And if you're blindly putting it into every project, how are you mitigating the risks it introduces?
Worse, the big reason it was useful - managing dependency hell - has largely been solved by developers defaulting to not installing dependencies globally.
We don't really need Docker anywhere near like we used to, and yet it persists as the default, unassailable.
Of course hosting companies must LOVE it, docker containers must increase their margins by 10% at least!
Someone else down thread has mentioned a tooling fetish, I feel Docker is part of that fetish.
It has downsides and risks involved, for sure. I think the security part is perhaps a bit overblown, though. In any environment, the developers either care about staying on top of security or they don't. In my experience, a dev team that skips proper security diligence when using Docker likely wouldn't handle it well outside of Docker either. The number of boxes out there running some old version of Debian that hasn't been patched in the last decade is probably higher than any of us would like.
Although I'm sure many people just do it because they believe (falsely) that it's a silver bullet, I definitely wouldn't call it part of a "tooling fetish". I think it's a reasonable choice much more often than the microservice architecture is.
I deeply disagree. Docker’s key innovation is not its isolation; it’s the packaging. There is no other language-agnostic way to say “here’s code, run it on the internet”. Solutions prior to Docker (eg buildpacks) were not so much language agnostic as they were language aware.
Even if you allow yourself the disadvantage that any non-Docker solution won’t be language-agnostic: how do you get the code bundle to your server? Zip & SFTP? How do you start it? ./start.sh? How do you restart under failure? Systemd? Congrats, you reinvented docker but worse. Want to upgrade a dependency due to a security vulnerability? Do you want to SSH into N replicated VMs and run your Linux distribution specific package update command, or press the little refresh icon in your CI to rebuild a new image then be done?
Docker is the one good thing the ops industry has invented in the last 15 years.
This is a really nice insight. I think years of linux have kind of numbed me to this. I've spent so much time on systems which use systemd now that going back to an Alpine Linux box always takes me a second to adjust, even though I know more or less how to do everything on there. I think docker's done a lot to help with that though since the interface is the same everywhere. A typical setup for me now is to have the web server running on the host and everything else behind docker, since that gives me the benefit of using the OS's configuration and security updates for everything exposed to the outside world (firewalls, etc).
Another thing about packaging. I've started noticing myself subconsciously adding even a trivial Dockerfile for most of my projects now just in case I want to run it later and not hassle with installing anything. That way it gives me a "known working" copy which I can more or less rely on to run if I need to. It took a while for me to get to that point though
It's all the same stuff. Docker just wraps what you'd do in a VM.
For the slight advantage of deploying every server with a single line, you've still got to write the multi-line build script, just for Docker instead. Plus all the downsides of Docker.
There's another idea too, that docker is essentially a userspace service manager. It makes things like sandboxing, logging, restarting, etc the same everywhere, which makes having that multi-line build script more valuable.
In a sense it's just the "worse is better" solution[0], where instead of applying the good practices (sandboxing, isolation, good packaging conventions, etc) which leads to those benefits, you just wrap everything in a VM/service manager/packaging format which gives it to you anyway. I don't think it's inherently good or bad, although I understand why it leaves a bad taste in people's mouths.
Docker images are self-running. Infrastructure systems do not have to be told how to run a Docker image; they can just run them. Scripts, on the other hand, are not; at the most simple level because you'd have to inform your infrastructure system what the name of the script is, but more comprehensively and typically because there's often dependencies the run script implies of its environment, but does not (and, frankly, cannot) express. Docker solves this.
> Docker just wraps what you'd do in a VM.
Docker is not a VM.
> Plus all the downsides of docker.
Of which you've managed to elucidate zero, so thanks for that.
Hard disagree. I've used Docker predominantly in monoliths, and it has served me well. Before that I used VMs (via Vagrant). Docker certainly makes microservices more tenable because of the lower overhead, but the core tenets of reproducibility and isolation are useful regardless of architecture.
There's some truth to this too honestly. At $JOB we prototyped one of our projects in Rust to evaluate the language for use, and only started using Docker once we chose to move to .NET, since the Rust deployment story was so seamless.
- Eliminated complex caching workarounds and data pipeline overhead
- Simplified architecture from distributed system to straightforward application
We, as developers/engineers (put whatever title you want), tend to make things complex for no reason sometimes. Not all systems have to follow state-of-the-art best practices. Many times, secure, stable, durable systems outperform these fancy techs and inventions.
Don't get me wrong, I love to use all of these technologies and fancy stuff, but sometimes that old, boring, monolithic API running on an EC2 solves 98% of your business problems, so no need to introduce ECS, K8S, Serverless, or whatever.
Anyway, I guess I'm getting old, or I understand the value of a resilient system, and I'm trying to find peace xD.
Last I heard (~5 years ago), lambda@edge doesn't actually run on edge POPs anyway; they're just hooks that you can put in your edge configs that execute logic in the nearest region before/after running your edge config. But it's definitely a datacenter round-trip to invoke them.
Adding that much compute to an edge POP is a big lift; even Firecracker gets heavy at scale. And there's security risk in executing arbitrary code, since these POPs don't have near the physical security of a datacenter, the small scale makes them more vulnerable to timing attacks, etc.
The takeaway here isn’t that serverless doesn’t work, it’s that the authors didn’t understand what they were building on. Putting a latency-critical API on a stateless edge runtime was a rookie mistake, and the pain they describe was entirely predictable.
Most cloud pain people experience is from a misunderstanding / abuse of solutions architecture and could have been avoided with a more thoughtful design. It tends to be a people problem, not a tool problem.
However, in my experience cloud vendors sell the snot out of their offerings, and the documentation is closer to marketing than truthful technical documentation. Their products’ genuine performance is a closely guarded proprietary secret, and the only way to find out… e.g. whether Lambdas are fast enough for your use case, or whether AWS RDS cross-region replication is good enough for you… is to run your own performance testing.
I’ve been burned enough times by AWS making it difficult to figure out exactly how performant their services are, and I’ve learned to test everything myself for the workloads I’ll be running.
> the documentation is closer to marketing than truthful technical documentation
I participated in AWS training and certification given by AWS for a company to obtain a government contract and I can 100% say that the PAID TRAINING itself is also 100% marketing and developer evangelism.
100% agree with you. I took a corporate training, and at one point crammed for the developer cert. It is just marketing. There is never a question where the answer is "Just run this service on EC2 yourself". It is all about maximizing your usage of AWS services.
Infra will always be full of so much nonsense because it’s really hard to tell successful developers their code and system design is unusable. People use it because they are paid to do so usually, but it’s literally some of the worst product development I’ve ever seen.
AWS will hopefully be reduced to natural language soon enough with AI, and their product team can move on (most likely they moved on a long time ago, and the revolving door at the company meant it was going remain a shittily thought out platform in long term maintenance).
Some things never change. I remember ~20 years ago a bunch of expensive F5s suddenly showing up at our offices because the CTO and enterprise architects were convinced that iRules could solve all their performance problems for something that wasn't even cacheable (gaming results), and it would have shoved too much of our logic onto the underpowered CPUs on them.
They were a much nicer, if overpriced, load balancing alternative to the Cisco Content Switch we were using, though.
This is exactly why I'd rather get a fat VPS from a reputable provider. As long as the bandwidth is sufficient the only limitation is vertical scaling.
I'm partial to this, the only thing I've found that is harder to achieve is the "edge" part of cloud services. Having a server at each continent is enough for most needs but having users route to the closest one is not as clear to me.
I know about Anycast but not how to make it operational for dynamic web products (not like CDN static assets). Any tips on this?
DIY Anycast is probably beyond most people’s reach, as you need to deal with BGP directly.
One cool trick is using GeoDNS to route the same domain to a different IP depending on the location of the user, but there are some caveats of course due to caching and TTL.
To get anycast working, you need BGP, and to get it working well, I think you need a good understanding of BGP and a lot of points of presence and well connected at each. BGP's default metric of distance is number of networks traversed, which does funny things.
Say you're in city A where you use transit provider 1 and city B where you use transit provider 2. If a user is in city B and their ISP is only connected to transit provider 1, BGP says deliver your traffic to city A, because then traffic doesn't leave transit provider 1 until it hits your network. So for every transit network you use, you really want to connect to it at all your PoPs, and you probably want to connect to as many transit networks as feasible. If you're already doing multihoming at many sites, it's something to consider; if not, it's probably a whole lot of headache.
GeoDNS as others suggested is a good option. Plenty of providers out there, it's not perfect, but it's alright.
Less so for web browsers, but you can also direct users to specific servers. Sample performance for each /24 and /48 and send users to the best server based on the statistics, use IP location as a fallback source of info. Etc. Not great for simple websites, more useful for things with interaction and to reduce the time it takes for tcp slow start (and similar) to reach the available bandwidth.
You could start using DNS traffic shaping, where the DNS server looks at the IP making the request and returns the IP of the closest server.
Azure/AWS/GCP all have solutions for this, and they do not require you to use their other services. There are probably other DNS providers that can do it as well.
Cloudflare can also do this as well but it's probably more expensive than DNS.
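With Route 53, for example, latency-based routing is just two records with the same name (a sketch; the zone ID and IPs are placeholders):

```python
import boto3

r53 = boto3.client("route53")

# Two A records for the same name; Route 53 answers each query with
# the region that has the lowest measured latency to the resolver.
for region, ip in [("us-east-1", "203.0.113.10"), ("eu-west-1", "203.0.113.20")]:
    r53.change_resource_record_sets(
        HostedZoneId="Z0000000000000000000",  # placeholder
        ChangeBatch={"Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": "app.example.com",
                "Type": "A",
                "SetIdentifier": region,
                "Region": region,  # latency-based routing key
                "TTL": 60,
                "ResourceRecords": [{"Value": ip}],
            },
        }]},
    )
```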
You took the words right out of my mouth. Between aggressive salespeople marketing any given product as a panacea for everything and mandates from above to arbitrarily use X thing to do Y, there’s a lot of just plain bad architecture out there.
I think they are shooting themselves in the foot with this approach. If you have to run a Monte Carlo simulation on every one of their services, at your own time and expense, just to understand performance and costs, people will naturally shy away from such black boxes.
> people will naturally shy away from such black boxes.
I don't think this is true. In fact, it seems that in the industry, many developers don't proceed with caution and go straight into usage, only to find the problems later down the road. This is a result of intense marketing on the part of cloud providers.
The fact is, most developers in most companies have very little choice. In many medium to large companies (1k-50k employees), the CTO gets wined and dined by AWS/Azure/Oracle and decides to move to that cloud. They bring in their solutions architects and do the training. The corporate architects for the divisions set the goals. So the rank-and-file developers get told that they have to make this work in AWS using RDS, and they have almost zero power over this choice.
It doesn't even have to be in companies that big. The AWS salespeople took the CTO and a couple of directors of engineering for dinner in a fancy restaurant. That was at a fintech that had around 200 employees. AWS also paid for the mandatory marketing... sorry, mandatory training sessions we tech managers had to do.
This is how much it takes for a CTO to demand the next week that "everything should be done with AWS cloud-native stuff if possible".
I feel like every cloud build meeting should have a moment where everyone has to defend the question "Wait! could this be a regular database with a regular app on a server with a regular cache?"
Bo Burnham said, "self-awareness does not absolve anyone of anything".
But here I don't think they (or their defenders) are aware of the real lesson yet.
There's literally zero information that's valuable here. It's like saying "we used an 18-wheeler as our family car and then we switched over to a regular Camry and solved all our problems." What is the lesson to be learned from that statement?
The really interesting post-mortem would be if they said, "God, in retrospect, what a stupid decision we took; what were we thinking? Why did we not take a step back earlier and think, why are we doing it this way?" If they wrote a blog post that way, it would likely have amazing takeaways.
What did your internal discussion conclude for the question "Why did we not take a step back earlier and think, why are we doing it this way?"
I'm genuinely curious, because this is not singling out your team or org - this is a very common occurrence among modern engineering teams, and I've often found myself on the losing end of such arguments. So I am all ears to hear at least one such team telling what goes on in their minds when they make terrible architecture decisions, and whether they learned anything philosophical that would prevent a repeat.
Oh we had it coming for quite some time and knew we would need to rebuild it, we just didn’t have the capacity to do it unfortunately.
I was working on it on and off moving one endpoint at a time but it was very slow until we hired someone who was able to focus on it.
It didn’t feel good at all. We knew the product had massive flaws due to the latency but couldn’t address it quickly. Especially because we had to build more workarounds as time went on - workarounds we knew would be made redundant by the reimplementation.
I think we had that “wtf are we doing here” discussion pretty early, but we didn’t act on it in the beginning; instead we tried different approaches to make it work within the serverless constraints, because that’s what we knew well.
I have had CTOs (two in my career) tell me we had to use our AWS credits since they were going to expire worthless. Both experiences were at vc-backed startups.
What's valuable about rediscovering that stateless architectures requiring network round-trips for state access are slower than in-memory state? This isn't new information, it's a predictable consequence of their architecture choice that anyone with distributed systems experience could have told them on day zero.
Sure, but there are some fundamentals about latency that any programmer should know [0] (absolute values outdated, but still useful as relative comparisons), like “network calls are multiple orders of magnitude slower than IPC.”
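For reference, the canonical rough figures behind that claim (order-of-magnitude only, and dated, but the ratios still hold):

```python
# Rough orders of magnitude from the classic "latency numbers" list:
LATENCY_NS = {
    "L1 cache reference":                    0.5,
    "Main memory reference":               100,
    "SSD random read":                 150_000,
    "Round trip within a datacenter":  500_000,
    "Round trip across continents": 150_000_000,
}
# A cross-region hop costs ~6 orders of magnitude more than a RAM
# access, which is why a chatty edge-runtime-to-origin-DB design hurts.
```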
I’m assuming you’re an employee of the company based on your comments, so please don’t take this poorly - I applaud any and all public efforts to bring back sanity to modern architecture, especially with objective metrics.
And yeah, you’re right - in hindsight it was a terrible idea to begin with.
I thought it could work, but I didn’t benchmark it enough and didn’t plan enough. It all looked great in early POCs, and all of these issues cropped up as we built it out.
You don't need experience, and there is not really a lot to know about "distributed systems" in this case; it's basic CS knowledge about networks, latency, and what "serverless" actually is - you can read about it.
To be honest, to me it reads like people who don't understand the problem they're solving, haven't acquired the necessary knowledge to solve it (either by learning themselves or by asking/hiring people who have it), and seeing such an amateurish mistake doesn't inspire confidence for the future.
You should either hire people that know what they are doing or upgrade your knowledge about systems you are using before making decisions to use them.
Sometimes I see a post about sorting algorithms online. Some people seem to benefit from reading about these things, but often, I find there isn't much new information for me. That's OK, because I know somebody somewhere benefits from knowing this.
It is your decision to make this a circlejerk of musings about how the company must be run by amateurs. Whatever crusade you're fighting in vividly criticising them is not valuable at all. People need to learn and share so we can all improve, stop distracting from that point.
I would not assume this was a "rookie mistake". I've been here once or twice, and a common story is that engineers don't want to do it a certain way, but management overrules them for some vague hand-wavy reason like, "This way is more modern." Another common story is that you know you're not choosing the most [scalable|robust|performant|whatever] design, but ancillary constraints like time and money push you into a "worse is better" decision.
Or maybe the original implementation team really didn't know what they were doing. But I'd rather give them the benefit of the doubt. Either way, I appreciate them sharing these observations because sharing these kinds of stories is how we collectively get better as a professional community.
> but management overrules them for some vague hand-wavy reason like, "This way is more modern."
This matches my experience. It's very difficult to argue against costly and/or inappropriate technical decisions in environments where the 'Senior Tech Leadership' team are just not that technical but believe they are, and so are influenced by every current industry trend masquerading as either 'scalable', 'modern' or (worst of all) 'best practice'.
What's even more dangerous is when senior tech leadership used to be technical but haven't actually got their hands dirty in 5 or 10 years, and don't realize that this means they aren't actually holding all the cards when they try to dictate these kinds of tactical, detail-oriented technical decisions.
I see this a lot in startups that grew big before they had a chance to grow up.
And to add, this rarely indicates anything about the depth and/or breadth of the 'used to' experience.
A lot of the strongest individual contributors I see want to stay in that track and use that experience to make positive and sensible change, while the ones that move into the management tracks don't always have such motivations. There's no gatekeeping intended here, just an observation that the ones that are intrinsically motivated by the detailed technical work naturally build that knowledge base through time spent hands-on in those areas and are best able to make more impactful systemic decisions.
People in senior tech leadership are also rarely exposed to the direct results of their decisions (if they even stay in the company long enough to see the outcome of longer-term decisions, which itself is rare).
While it's not impossible to find the folk that do have breadth of experience and depth of knowledge but are comfortable and want to be in higher-level decision making places, it's frustratingly rare. And in a lot of cases, the really good ones that speak truth to power end up in situations where 'Their last day was yesterday, we wish them all the best in their future career endeavours.' It's hardly surprising that it's a game that the most capable technical folks just don't want to play, even if they're the ones that should be playing it.
This all could just be anecdata from a dysfunctional org, of course...
My personal experience is that if you want guaranteed anything (quick scaling, latency, CPU, disk or network throughput), your best bet is to manually provision EC2 instances (or use some API that does). Once you give up control hoping to gain performance for free, you usually end up with an unfixable bottleneck.
If you're looking for a middle ground between VMs and serverless, ECS Fargate is a good option. Because a container is always running, you won't experience any cold start times.
Yes, though unless you’re provisioning your own EC2s for them to run on, you have no guarantee about the server generation, and IME AWS tends to provision older stuff for Fargate.
This may or may not matter to you depending on your application’s needs, but there is a significant performance difference between, say, an m4 family (Haswell / Broadwell) and an m7i family (Sapphire Rapids) - literally a decade of hardware improvements. Memory performance in particular can be a huge hit for latency-sensitive applications.
ECS is good, just expensive and still requires more devops than it should. Docker Swarm is an easy way to run production container services on VMs. I built a free golang tool called Rove that provisions fresh Ubuntu VMs in one command and diffs updates. It's also easy-enough to use Swarm directly.
I’ve used a modified version of this for 8 years - I didn’t write it. Updating your ECS Docker image is just passing in the parameter of your new image and updating the cloudformation stack.
Honestly I didn't have a good experience with ECS (Fargate) - I remember I had to write a ton of CF deployment scripts+bash scripts, setting up a private AWS docker registry, having a terrible time debugging while my CF deployment always failed, deploys taking forever, finding out that AWS is too miserly to pay Docker to use the official repo so they are stuck on the free tier, meaning sometimes deploys would fail due to Dockerhub kicking the AWS docker agent out etc. It had limitations like not being able to attach a block volume to the docker instance, so overall I remember spending a week setting up the IaC for a simple-ass CRUD app on Fargate ECS.
Setting up the required roles and permissions was also a nightmare. The deployment round trip time was also awful.
The two good experiences I had with AWS were when we had a super smart devops guy who set up the whole Docker pipeline on top of actual instances, so we could deploy our docker-compose straight to a server in under a minute (this wasn't a scaled app) and had everything working.
Lambda is also pretty cool: you can just zip everything up and deploy from the AWS CLI without much scripting, and the IaC is pretty straightforward.
A lot of AWS requires way too much config. It is a mystery to me why AWS doesn't lean into extending the capabilities of App Runner. I actually built a whole continuous deployment PaaS for AWS ECS with a Heroku-like UX, ended up shutting it down eventually because although useful, their pricing is pretty awful. What I need to do is figure out how to bring it back, just minus the hosted service so I can use it on corporate projects that require AWS...
Yeah I haven't had any issues with Swarm. Heard good things from people running substantial clusters. Would be interested in hearing about what rough edges people have run into as well!
You're confusing network proximity with application architecture. Edge deployment helps connection latency. Stateless runtime destroys it by forcing every cache access through the network.
The whole point of edge is NOT to make latency-critical APIs with heavy state requirements faster. It's to make stateless operations faster. Using it for the former is exactly the mismatch I'm describing.
Their 30ms+ cache reads vs sub-10ms target latency proves this. Edge proximity can't save you when your architecture adds 3x your latency budget per cache hit.
Realistically, they should be able to do sub-ms cache hits that land in the same datacenter. I know Cloudflare doesn't have "named" datacenters like other providers, but at the end of the day there are servers somewhere, and if your lambda runs twice in the same one, there is no reason why a pull-through cache can't experience a standard intra-datacenter latency hit.
I wonder if there is anything other than good engineering getting in the way of this, and even of sub-µs in-process pull-through caches for busy lambda functions. After all, if my lambda is getting called 1000x per second from the same point of presence, why wouldn't they keep the process in memory?
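A minimal sketch of such an in-process pull-through cache in Go; the fetch function stands in for whatever networked cache or DB sits behind it:

    package cache

    import (
        "sync"
        "time"
    )

    type entry struct {
        value   []byte
        expires time.Time
    }

    // PullThrough serves hot keys from local memory and only falls back
    // to the network on a miss or after the TTL expires.
    type PullThrough struct {
        mu    sync.RWMutex
        data  map[string]entry
        ttl   time.Duration
        fetch func(key string) ([]byte, error) // networked fallback
    }

    func New(ttl time.Duration, fetch func(string) ([]byte, error)) *PullThrough {
        return &PullThrough{data: map[string]entry{}, ttl: ttl, fetch: fetch}
    }

    func (c *PullThrough) Get(key string) ([]byte, error) {
        c.mu.RLock()
        e, ok := c.data[key]
        c.mu.RUnlock()
        if ok && time.Now().Before(e.expires) {
            return e.value, nil // hot path: no network involved
        }
        v, err := c.fetch(key) // network hit only on miss/expiry
        if err != nil {
            return nil, err
        }
        c.mu.Lock()
        c.data[key] = entry{value: v, expires: time.Now().Add(c.ttl)}
        c.mu.Unlock()
        return v, nil
    }

None of this survives on Workers, of course, because the isolate can be torn down between any two requests - which is the whole complaint.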
On serverless, whenever you call your code, the infrastructure first has to find a place to run it; if there's no running instance available, it must fire up a new one to run your code.
Their problem isn't serverless so much as Cloudflare Workers and WebAssembly.
All major cloud vendors have serverless solutions based on containers, with longer managed lifetimes between requests, and naturally the ability to use properly AOT-compiled languages in the containers.
Serverless only makes sense if the lifetime doesn't matter to your application, so if you find that you need to think about your lifetime then serverless is simply not the right technology for your use case.
I would doubt that this is categorically true. Serverless inherently makes the whole architecture more complex with more moving parts in most cases compared to classical web applications.
> Serverless inherently makes the whole architecture more complex with more moving parts
Why's that? Serverless is just the generic name for CGI-like technologies, and CGI is exactly how classical web applications were typically deployed historically, until Rails became such a large beast that it was too slow to continue using CGI, and thus running your application as a server to work around that problem in Rails pushed it to become the norm across the industry — at least until serverless became cool again.
Making your application the server is what is more complex with more moving parts. CGI was so much simpler, albeit with the performance tradeoff.
Perhaps certain implementations make things needlessly complex, but it is not clear why you think serverless must fundamentally be that way.
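To make that concrete, a complete "CGI application" in Go is about a dozen lines; the web server execs the binary once per request and the process exits afterwards, which is essentially the serverless lifecycle without the cloud branding:

    package main

    import (
        "fmt"
        "log"
        "net/http"
        "net/http/cgi"
    )

    func main() {
        // cgi.Serve reads the request from the CGI environment variables
        // and stdin, writes the response to stdout, then the process exits.
        err := cgi.Serve(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
            fmt.Fprintf(w, "Hello from a process that lives for exactly one request: %s\n", r.URL.Path)
        }))
        if err != nil {
            log.Fatal(err)
        }
    }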
That depends a lot on where those classical web applications are hosted, how big the infrastructure taking care of security, backups, scalability, and failovers is, and the amount of salaries being paid, including on-call bonuses.
Serverless is not a panacea. And the alternative isn't always "multiple devops salaries" - unless the only two options you see are serverless vs. an outrageously overcomplicated Kubernetes cluster to host a website.
There's a huge gap between serverless and full infra management. Also, IMO, serverless still requires engineers just to manage that. Your concerns shift, but then you need platform experts.
It can be good for connecting AWS stuff to AWS stuff. "On s3 update, sync change to dynamo" or something. But even then, now you've got a separate coding, testing, deployment, monitoring, alerting, debugging pipeline from your main codebase, so is it actually worth it?
But no, I'd not put any API services/entrypoints on a lambda, ever. Maybe you could manufacture a scenario where like the API gets hit by one huge spike at a random time once per year, and you need to handle the scale immediately, and so it's much cheaper to do lambda than make EC2 available year-round for the one random event. But even then, you'd have to ensure all the API's dependencies can also scale, in which case if one of those is a different API server, then you may as well just put this API onto that server, and if one of them is a database, then the EC2 instance probably isn't going to be a large percentage of the cost anyway.
Actually I don't even think connecting AWS services to each other is a good reason in most cases. I've seen too many cases where things like this start off as a simple solution, but eventually you get a use case where some s3 updates should not sync to dynamo. And so then you've got to figure out a way to thread some "hints" through to the lambda, either metadata on the s3 blob, or put it in a redis instance that the lambda can query, etc., and it gets all convoluted. In those kinds of scenarios, it's almost always better just to have the logic that writes to s3 also update dynamo. That way it's all in one place, can be stepped through in a debugger, gets deployed together, etc.
There are probably exceptions, but I can't think of a single case where doing this kind of thing in a lambda didn't cause problems at some point, whereas I can't really think of an instance where putting this kind of logic directly into my main app has caused any regrets.
For a thing which permanently has load, it makes little sense.
It can make sense if you have very uneven load with a few notable spikes, or if you're all in on managed services, where serverless functions act as event collectors for other services ("new file in object store" triggers a function to update some index).
Agree, it seems like they decided to use Cloudflare Workers and then fought them every step of the way instead of going back and evaluating if it actually fit the use case properly.
It reminds me of the companies that start building their application using a NoSQL database and then start building their own implementation of SQL on top of it.
Ironically, I really like cloudflare but actively dislike workers and avoid them when possible. R2/KV/D1 are all fantastic and being able to shard customer data via DOs is huge, but I find myself fighting workers when I use them for non-trivial cases. Now that Cloudflare has containers I'm pushing people that way.
In that scenario, how do you keep cold startup as fast as possible?
The nice thing about JS workers is that they can start really fast from cold. If you have low or irregular load, but latency is important, Cloudflare Workers or equivalent is a great solution (as the article says towards the end).
If you really need a full-featured container with AOT compiled code, won't that almost certainly have a longer cold startup time? In that scenario, surely you're better off with a dedicated server to minimise latency (assuming you care about latency). But then you lose the ability to scale down to zero, which is the key advantage of serverless.
Cloudflare has containers now too, and having used AppRunner and Cloud Run, it's much easier to work with. Once they get rid of the container caps and add more flexibility in terms of container resources, I would never go back to the big cloud containers, the price and ease of use of Cloudflare's containers just destroy them.
> We built chproxy specifically because ClickHouse doesn't like thousands of tiny inserts. It's a Go service that buffers events and sends them in large batches. Each Cloudflare Worker would send individual analytics events to chproxy, which would then aggregate and send them to ClickHouse.
While I understand how this isn't the only thing that needed to be buffered, for ClickHouse data specifically I'd be curious why they built a separate service rather than use asynchronous inserts:
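For context, async inserts are just per-query settings, so ClickHouse can do the batching itself; a rough sketch over its HTTP interface (host and table names are made up):

    package main

    import (
        "log"
        "net/http"
        "net/url"
        "strings"
    )

    func main() {
        q := url.Values{}
        q.Set("query", "INSERT INTO events FORMAT JSONEachRow")
        q.Set("async_insert", "1")          // server buffers and batches
        q.Set("wait_for_async_insert", "0") // ack before the batch is flushed
        row := `{"event":"key.verified","latency_ms":3}`
        resp, err := http.Post("http://clickhouse:8123/?"+q.Encode(),
            "application/json", strings.NewReader(row))
        if err != nil {
            log.Fatal(err)
        }
        defer resp.Body.Close()
    }

With wait_for_async_insert=0 the server acknowledges before the batch is flushed, which is roughly the same durability trade-off a separate buffering service makes anyway.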
Serverless seems to me like the tech version of the classic sales technique. You see it every Black Friday, get them in the door with the cheap alluring offers, then upsell the shit out of them. And in the case of serverless, lock them with as many provider specific services as possible.
I feel like we're headed the same direction with AI right now. It creates as many problems as it solves, and the solution is always to pay them for the newer faster model and use more tokens. With all the new AI security threats coming to light we'll start seeing them offer solutions for that too. They'll sell you the threats and the solutions, and the entire developer community will thank them for it.
As a soon-to-be graybeard, I think this has been fairly obvious from the start. Outside of specific workflows, you are adding unneeded complexity to a system that does not need it. In general, an anti-pattern, though it does have valid uses in some cases.
On the last project I worked on that came to involve serverless, it made no sense at all other than that it was "the fad".
For this system we had excellent knowledge of what the theoretical limit of users and connections was. This apparently needed to be done with containers, serverless, Kafka, blah blah.
Annoyed with the whole thing, I took a few nights to tear the logic out of the microservices (or nanoservices) and wrapped the whole thing into a Frankenstein monolith. At least 60% of the code dealt solely with passing information around between the different services, so the monolith was easier to maintain.
Well, my hacked-together monolith was not a great start for anything but a demo. I installed Postgres on my laptop, ran the monolith on it, had 3 servers each push the theoretical maximum load we would have, and what do you know, the performance was fine.
But the architecture was the architecture decided upon.
One thing I could not find in the write-up is the change in the expense. Did serverless save any money, compared to always-up VMs? Did much of their load run under the free tier limits?
Serverless shines when the load is very spiky, and you can afford high long-tail latency. Then you don't pay for all that time when your server would be idling. (This is usually not the case for auth APIs, unless they auth other infrequently invoked operations.)
"Taking hand off of boiling kettle decreased anxiety and increased focus in 97% of study participants."
Stephen King's Dark Tower series never resonated with me and I got stuck in book two. But it has one of my favorite philosophical insults of all time:
"Those who [X] have forgotten the faces of their fathers."
I feel like there's a collective amnesia just beginning to wear off as people remember the Fallacies of Distributed Computing and basic facts about multitasking. And that amnesia absolutely feels to me as if everyone has forgotten the faces of their fathers. <waves cane threateningly>
30ms p99 for a cache read! Serverless might have been a problem, but I'm not sure it was the problem. In my experience a p99 of 2ms is more typical - 30ms is the sort of time I'd expect for p99 on a database query in production serving.
You don't need process-local caches to get the sort of performance they're looking for, and there are good reasons why most teams avoid stateful processing, it's much harder to get right and has bad failure modes.
Fun fact: the overhead for a DB query (not including the network latency) is about 250 microseconds (0.25 ms), and a typical DB kernel can do at least 16 MB/core-second of throughput. So that 30 ms query, I would expect, is scanning several GB of data to find its results. Try doing that yourself and your latency would probably grow into seconds. Systems programming is hard...
PS: moving from stateful processing to something with a distributed cache is moving from RAM to network. That's at least a 10x drop-off in throughput.
After building my first serverless/Cloudflare Workers app, this is why I migrated to Deno.
Deno enables you to run the same codebase in Deno (self-hosted/local) and in Deno Deploy (Deno's serverless platform).
I wanted my app to be self-hostable as well, and Cloudflare Workers is a hard ecosystem lock-in to their platform, which makes it undesirable (imo).
I ported my Workers project to Django since Cloudflare Workers wouldn’t allow selecting the region workers are hosted in, which is generally required for data-compliance reasons. This is something all other cloud providers have offered from day one, yet Cloudflare made it an enterprise feature.
The vendor lock-in doesn’t help either, with Durable Objects and D1 instead of simply doing what Supabase and others are doing by providing Postgres or standard SQLite as a service.
Hey "vercel security checkpoint", I'm repeatedly getting a code 99 "Failed to verify your browser" on an iphone with a VPN exits in California and Canada via proton VPN and Firefox Focus.
"Self-Hosting : Being tied to Cloudflare's runtime meant our customers couldn't self-host Unkey. While the Workers runtime is technically open source, getting it running locally (even in dev mode) is incredibly difficult.
With standard Go servers, self-hosting becomes trivial:"
A key point that I always make. Serverless is good if you want a simple periodic task to run intermittently without worrying about a full time server. The moment things get more complex than that (which in real world it almost always is), you need a proper server.
Have you done new benchmarks since Cloudflare announced their latest round of performance improvements for Workers?
Just curious if this workload also saw some of the same improvements (on a quick read it seems like you could have been hitting the routing problem CF mentions)
Really great writeup. The charts tell the story beautifully, and the latency gains are surely a win for your company and customers. I always wonder about the tradeoffs. Is there a measurable latency difference for your non-colocated customers? What does maintenance look like for your Go servers? I assume that your Cloudflare costs dropped?
It’s faster for non-colocated customers too, weirdly.
I think it’s because connections can be reused more often. Cloudflare Workers are really prone to doing a lot of TLS handshakes because they spin up new instances constantly.
Right now we’re just using AWS Fargate for the Go servers, so there really isn’t much maintenance at all. We’ll be moving that into EKS soon though, because we’re starting to add more stuff and need k8s anyway.
I would love to know the net result to their financials for this move. I have no doubt they were able to improve their performance. I'm just wondering if the juice was worth the squeeze, especially if they could have been building other features that customers would want. I didn't read anything in the article about the opportunity cost, or even whether it was considered.
That said, as an example, an m8g.8xlarge gives you 32 vCPU / 128 GiB RAM for about $1000/month in us-east-1 for current on-demand pricing, and that drops to just under $700 if you can do a 1-year RI. I’m guessing this application isn’t super memory-heavy, so you could save even more by switching to the c-family: same vCPU, half the RAM.
Stick two of those behind a load balancer, and you have more compute than a lot of places actually need.
Or, if you have anything resembling PMF, spend $10K or so on a few used servers and put them into some good colo providers. They’ll do hardware replacement for you (for a fee).
They just use two servers and configure a load balancer within Cloudflare. Come on. Self-hosting is no rocket science. You don‘t have to make it seem complicated. People had been doing this for decades before AWS invented serverless.
Only if that system is stateless. If you have any sort of internal memory that sticks around between requests, then either you face a cold start problem (because of empty caches) or you somehow need to persist that state somewhere. And persisting that state either means you need a backup solution or your latency is terrible because you are hitting network for something that only needs to hit RAM.
If it’s got to be serverless, then use a PaaS that takes Docker images as its input format. That at least gives some level of platform portability if you need to shift later.
But yeah in this case 10ms requirement doesn’t leave a lot of room for elaborate anything
Incredible that these kinds of services were hosted like this.
I guess they never came out of MVP, which could warrant using serverless, but in the end it makes 0 sense to use some slow solution like this for the service they are offering.
Why didn't they go with a self-hosted backend right away?
It's funny how nowadays most devs are too scared to roll their own and just go with the cloud offerings that cost them tech debt and actual money down the road.
I doubt they literally said “perfect for low latency APIs” but their messaging is definitely trying to convince you that they’re fast globally, just look at the workers.cloudflare.com page
Unlikely? They could've just as well deployed their single go binary to a vm from day 1 and it would've been smooth sailing for their use case, while they acquire customers.
The cloudflare workers they chose aren't really suited for latency critical, high throughput APIs they were designing.
Many organizations would benefit from just running a well-architected monolith. Amazon pushed microservices hard internally, but they face scale challenges most companies do not have and will never have.
- if you don't understand your concept/market, build in a VPS. you can get away with scaling a VPS for a while.
- if you intend to be netflix, rent at the edge (mid game) and eventually own the edge servers in POPs (otherwise, "edge" compute isn't worth it). before that, start with a beefy VPS cluster with HA and SQLproxy.
- if you spin to zero, use lambdas (for things like codebuild or fractional needs of a computer). before that, build things in VPSes.
- if you spin up/down but not to zero, use container platforms?
- once you have a reliably steady understanding of your infrastructure, buy physical servers in a colo
I often don't know what to make of DHH. He's a living contradiction. On one hand he will continually rant about how bad the overhead and waste of cloud services is, and on the other hand he will staunchly defend the most inefficient programming language that is regularly used for backend development, as well as defend the enormous overfetching that ActiveRecord leads to.
Really I think DHH just likes to tell others what he likes.
In all fairness, the performance penalty for virtualization is 4x and the penalty for interpreted code is 1.5x. So he comes out ahead, but it's more in a "broken clock is right twice a day" sort of way.
It’s actually so common that I just have the “data exchange” conversation and let them decide which fits their needs best. Most of it is available for self service configuration.
Yep, I concur. You need to meet them on their (legacy) terrain, get access to their data and then you can do any fancy thing you want to do.
Uploads to an S3 bucket can trigger a lambda… don’t complicate things. The upload trigger can tell the system about the upload and the client can continue on their day.
Uploader on the client uses a presigned URL. S3 triggers a Lambda. The Lambda function takes the file path and tells background workers about it via a queue, MQ, REST, gRPC, or by doing the lift in workflow ETL functions.
Easy peasy. /s
And while you are being sarcastic, this is the Right Way to use queues.
Upload file to S3 -> trigger an SNS message for fanout if you need it -> SNS -> SQS trigger -> SQS to ETL jobs.
The ETL job can then be hosted using Lambda (easiest) or ECS/Docker/Fargate (still easy and scales on demand) or even a set of EC2 instances that scale based on the items in a queue (don’t do this unless you have a legacy app that can’t be containerized).
If your client only supports SFTP, there is the SFTP Transfer Service on AWS that will allow them to send the file via SFTP and it is automatically copied to an S3 bucket.
Alternatively, there are products that treat S3 as a mountable directory and they can just use whatever copy commands on their end to copy the file to a “folder”
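And the S3-triggered side of that flow is a small handler; a sketch using aws-lambda-go (the queue/ETL hand-off is left as a log line):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-lambda-go/events"
        "github.com/aws/aws-lambda-go/lambda"
    )

    func handler(ctx context.Context, evt events.S3Event) error {
        for _, rec := range evt.Records {
            // Hand the object reference to your SQS/ETL stage here.
            log.Printf("new upload: s3://%s/%s (%d bytes)",
                rec.S3.Bucket.Name, rec.S3.Object.Key, rec.S3.Object.Size)
        }
        return nil
    }

    func main() {
        lambda.Start(handler)
    }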
If I have a user facing upload button, why can't I simply have a webserver that receives the data and pushes it into s3 via multi-part upload. Something that can be written in a framework of your choice in 10 minutes with 0 setup?
For uploads under 50 MB you could also skip the multipart upload and take a naive approach without taking a significant hit.
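Roughly that 10-minute version, sketched with the AWS SDK for Go v2 (bucket name and error handling are placeholder-grade); the uploader does the multipart chunking and streams the request body without buffering it to disk:

    package main

    import (
        "context"
        "fmt"
        "log"
        "net/http"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/feature/s3/manager"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        uploader := manager.NewUploader(s3.NewFromConfig(cfg))

        http.HandleFunc("/upload", func(w http.ResponseWriter, r *http.Request) {
            // r.Body is streamed to S3 in parts as it arrives.
            _, err := uploader.Upload(r.Context(), &s3.PutObjectInput{
                Bucket: aws.String("my-ingest-bucket"),
                Key:    aws.String(r.URL.Query().Get("filename")),
                Body:   r.Body,
            })
            if err != nil {
                http.Error(w, err.Error(), http.StatusBadGateway)
                return
            }
            fmt.Fprintln(w, "ok")
        })
        log.Fatal(http.ListenAndServe(":8080", nil))
    }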
You can - you generate the pre-signed S3 URL and they upload it to the place your URL tells it to.
https://fullstackdojo.medium.com/s3-upload-with-presigned-ur...
And before you cry “lock in”, S3 API compatible services are a dime a dozen outside of AWS including GCP and even Backblaze B2.
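And generating the presigned URL really is only a few lines; a sketch with the AWS SDK for Go v2 (bucket, key, and expiry are placeholders):

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func main() {
        cfg, err := config.LoadDefaultConfig(context.Background())
        if err != nil {
            log.Fatal(err)
        }
        presigner := s3.NewPresignClient(s3.NewFromConfig(cfg))
        req, err := presigner.PresignPutObject(context.Background(),
            &s3.PutObjectInput{
                Bucket: aws.String("my-ingest-bucket"),
                Key:    aws.String("imports/request-1234.csv"),
            },
            s3.WithPresignExpires(15*time.Minute))
        if err != nil {
            log.Fatal(err)
        }
        // Hand req.URL to the client; they PUT the file bytes directly to S3.
        fmt.Println(req.URL)
    }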
> Uploads to an S3 bucket can trigger a lambda… don’t complicate things.
I read this and was getting ready to angrily start beating my keyboard. The best satire is hard to detect.
I don't really get the joke. S3 triggering a lambda doesn't sound meaningfully more complicated than using a lambda by itself. What am I missing?
Solving a serverless limitation with more serverless so you can continue doing serverless, when you can’t FormUpload a simple 101MB zip file as an application/octet-stream. Doubling down on it for a triple beat.
I wouldn't really call it "more" severless to rearrange the order a bit. Which makes it "solving a serverless limitation so you can continue doing severless". And that's just a deliberately awkward way of saying "solving a serverless limitation" because if you can solve it easily why would you not continue? Spite?
So I still don't see how it's notably worse than the idea of using serverless at all.
The controversy here is the fact that the API Gateway limits the upload, resulting in having to engineer a workaround workflow using S3 and triggers (even if this is the serverless way) when all you want to do is upload a file. A POST call with an octet stream. Let HTTP handle resume. But you can’t, and you end up going around the side door, when all you really want is client_max_body_size.
The sarcasm about the approach being correct while playing down its complexity is entirely my own. We used to be able to do things easily.
It gets really complex in this workflow to even achieve something like “file processed successfully” on the client side with this approach.
How will your client know if your backend lambda crashed or whatever? All it knows is that the upload to S3 succeeded.
Basically you’re turning a synchronous process into an asynchronous one.
unfortunately they ruined it at the end with that /s
Did I? I don’t think I did.
Every day we stray further from the light
If you don’t do it this way you fail the system design interview.
Nope, you didn’t use terracottax so you failed anyway. 6 months before you can reapply in case the first humiliation wasn’t enough. Boss was looking for AWS Glue in there and you didn’t use it.
> Easy peasy. /s
It actually is though. I don't need to build a custom upload client, I don't need to manage restart behavior, I get automatic restarts if any of the background workers fail, I have a dead letter queue built in to catch unusual failures, I can tie it all together with a common API that's a first class component of the system.
Working in the cloud forces you to address the hard problems first. If you actually take the time to do this everything else becomes _absurdly_ easy.
I want to write programs. I don't want to manage failures and fix bad data in the DB directly. I personally love the cloud and this separation of concerns.
> I don't need to build a custom upload client
GP said this is an app from the 2000s.
For S3 you do need to generate a presigned URL, so you would have to add this logic there somewhere instead of "just having a generic HTTP upload endpoint".
Unless the solution is "don't have the problem in the first place" the cloud limitations are just getting in the way here.
The solution is to use the appropriate tool for the job. If you're locked in to highly crusty legacy software, it's inevitably going to require workarounds. There are good technical reasons why arbitrary-size single-part file uploads are now considered an anti-pattern. If you must support them, then don't be shocked if you wind up needing EC2 or other lower-level service as a point of ingress into your otherwise-serverless ecosystem.
If we want to treat the architectural peculiarities of GP's stack as an indictment of serverless in general, then we could just as well point to the limitations of running LAMP on a single machine as an indictment of servers in general (which obviously would be silly, since LAMP is still useful for some applications, as are bare metal servers).
We play up how trivial it is to generate a signed URL - it’s only a few lines and a function call - but you then have to send this URL to the client, the client has to use it, and then check back with you to see if the file arrived, resulting in a kind of pea-soup architecture unless your application is also entirely event-driven. Oh how we get suckered in…
> Working in the cloud forces you to address the hard problems first.
It also forces you to address all the non-existent problems first, the ones you just wish you had like all the larger companies that genuinely have to deal with thousands of file upload per second.
And don't forget all the new infrastructure you added to do the job of just receiving the file in your app server and putting it into the place it was going to go anyway but via separate components that all always seem to end up with individual repositories, separate deployment pipelines, and that can't be effectively tested in isolation without going into their target environment.
And all the additional monitoring you need on each of the individual components that were added, particularly on those helpful background workers to make sure they're actually getting triggered (you won't know they're failing if they never got called in the first place due to misconfiguration).
And you're now likely locked into your upload system being directly coupled to your cloud vendor. Oh wait, you used Minio to provide a backend-agnostic intermediate layer? Great, that's another layer that needs managing.
Is a content delivery network better suited to handling concurrent file uploads from millions of concurrent users than your app server? I'd honestly hope so, that's what it's designed for. Was it necessary? I'd like to see the numbers first.
At the end of the day, every system design decision is a trade off and almost always involves some kind of additional complexity for some benefit. It might be worth the cost, but a lot of these system designs don't need this many moving parts to achieve the same results and this only serves to add complexity without solving a direct problem.
If you're actually that company, good for you and genuinely congratulations on the business success. The problem is that companies that don't currently and may never need that are being sold system designs that, while technically more than capable, are over-designed for the problem they're solving.
> the ones you just wish you had
You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.
> if they never got called in the first place due to misconfiguration
Centralized logging is built into all these platforms. Debugging these issues is one of the things that becomes absurdly easy.
> likely locked into your upload system
The protocol provided by S3 is available through dozens of vendors.
> Was it necessary?
It only matters if it is of equivalent or lesser cost.
> every system design decision is a trade off
Yet you explicitly ignore these.
> are being sold system designs
No, I just read the documentation, and then built it. That's one of those "trade offs" you're willingly ignoring.
> The protocol provided by S3 is available through dozens of vendors.
But not all of the S3 API is supported by other vendors - the asynchronous triggers for lambdas and the CloudTrail logs that you write code to parse.
> You will have these problems. Not as often as the larger companies but to imagine that they simply don't exist is the opposite of sound engineering.
A lot of those failure mode examples seem well suited to client-side retries and appropriate rate limiting. If we're talking file uploads then sure, there absolutely are going to be cases where the benefits of having clients go to the third-party is more beneficial than costly (high variance in allowed upload size would be one to consider), but for simple upload cases I'm not so convinced that high-level client retries aren't something that would work.
> if they never got called in the first place due to misconfiguration
I find it hard to believe that having more components to monitor will ever be simpler than fewer. If we're being specific about vendors, the AWS console is IMHO the absolute worst place to go for a good centralized logging experience, so you almost certainly end up shipping your logs into a better centralized logging system that has more useful monitoring and visualisation features than CloudWatch and has the added benefit of not being the AWS console. The cost here? Financial, time, and complexity/moving parts for moving data from one to the other. Oh and don't forget to keep monitoring on the log shipping component too, that can also fail (and needs updates).
> The protocol provided by S3 is available through dozens of vendors.
It's become a de facto standard for sure, and is helpful for other vendors to re-implement it but at varying levels of compatibility.
> It only matters if it is of equivalent or lesser cost.
This is precisely the point, I'm saying that adding boxes in the system diagram is a guaranteed cost as much as a potential benefit.
> Yet you explicitly ignore these
I repeatedly mentioned things that to me count as complexity that should be considered. Additional moving parts/independent components, the associated monitoring required, repository sprawl, etc.
> No, I just read the documentation, and then built it.
I also just 'read the documention and built it', but other comments in the thread allude to vendor-specific training pushing for not only vendor-specific solutions (no surprise) but also the use of vendor-specific technology that maybe wasn't necessary for a reliable system. Why use a simple pull-based API with open standards when you can tie everything up in the world of proprietary vendor solutions that have their own common API?
Enjoyed reading this, thanks for writing it.
People often don't know how something different might be easier for their case.
Following others, or the best practices, when they might not apply in their case can lead to social-proof architecture a little too often.
this kinda proves the point that you have to know a silly workaround
We became the flagship customer for a division of AWS that was responsible for managing SSL certificates. We were doing vanity URLs and vanity URLs generally require individual SSL certificates for each domain name. We needed thousands and AWS tools for cert management at the time was really only happy with hundreds and they had backlog items to fix it but those were behind a year or two of other work. It took them about three months to get far enough along for our immediate needs. It's surprising the parts of AWS that have not adjusted to outliers that don't seem really to be that exceptional.
I also thought Lambda looked promising at first, but we ultimately abandoned all our Lambda projects and started using containers as needed.
Lambda still requires you to update the Node runtime every year or two, while with your own containers you can decide on your own upgrade schedule.
Not if you deploy your container to Lambda…
I've observed massive back-office pipelines using dozens of interconnected Lambdas, batching, streaming, distributed storage for ephemeral data, and other Rube Goldberg contraptions to build what was ultimately a cron job on a modest server running for 1 hour.
Being in the cloud doesn't mean you need to accept timeouts/limitations. CDK+fargate can easily run an ephemeral container to perform some offline processing.
Just to help future readers, there is an ecosystem of "tus" uploaders and endpoints, that chunk uploads, and feature resumeable uploads, that would be ideal for this kind of restriction:
https://tus.io/
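The core of the protocol is plain HTTP, which is why resumability falls out almost for free. A rough sketch of the create + upload steps (a real client would use an existing tus library and resume via a HEAD request for the current offset; the endpoint URL is made up):

    package main

    import (
        "bytes"
        "fmt"
        "log"
        "net/http"
    )

    func main() {
        data := []byte("csv,goes,here\n")

        // 1. Create the upload; the server returns its URL in Location.
        req, _ := http.NewRequest("POST", "https://tus.example.com/files/", nil)
        req.Header.Set("Tus-Resumable", "1.0.0")
        req.Header.Set("Upload-Length", fmt.Sprint(len(data)))
        resp, err := http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
        uploadURL := resp.Header.Get("Location")

        // 2. Send bytes from the current offset. After a dropped connection,
        //    HEAD the upload URL to learn the offset and PATCH again from there.
        req, _ = http.NewRequest("PATCH", uploadURL, bytes.NewReader(data))
        req.Header.Set("Tus-Resumable", "1.0.0")
        req.Header.Set("Upload-Offset", "0")
        req.Header.Set("Content-Type", "application/offset+octet-stream")
        resp, err = http.DefaultClient.Do(req)
        if err != nil {
            log.Fatal(err)
        }
        resp.Body.Close()
    }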
The hardest problem in computer science is copying a file from one computer to another.
There are some architectural arguments where I kick myself for not establishing a bibliography of all of my justifications. The thing with mastering something is that you copy the rules into the intuitive part of your brain and you no longer have to reason through it step by step like one of Socrates's lectures. You just know and you do.
The biggest one I regret is "communicating through the file system is 10x dumber than you think it is, even if you think you know how dumb it is." I should have a three page bibliography on that. Mostly people don't challenge you on this, but I had one brilliant moron at my last job who did, and all I could do was stare at him like he had three heads.
Like the article says, I think serverless has its place, but I don't think it's for most applications. I can't see myself _ever_ using serverless services as a core part of my application for pretty much any startup, if I can avoid it. The infrastructure overhead is actually worse, IMO.
Everything is so platform specific and it's much stranger to test and develop against locally. Each platform has a different way to test, and the abstraction layers that exist (unless this has changed recently) always had pitfalls, since there are no true standards.
I'd much rather have a docker image as my deliverable "executable" so I can test, but still abstract away some stuff, like environment setup. Giving me a minimal Linux environment and filesystem feels like the most comfortable level of abstraction for me to develop well and also deploy and run in production effectively. I can also run that on demand or (most commonly) run that as a server that sits and waits for requests.
> Like the article says, I think serverless has it's place, but I don't think it's for most applications
I feel this way about many of the more popular trends over the last decade or so.
A technology becomes popular because a really large organization uses it to solve problems that only really exist at that scale. Then they talk about how it works, and many tend to take that as an indicator that the solution is ideal, regardless of context.
GraphQL, React, Tailwind, NextJS, and many others. They solve specific problems, and those may be problems you have. No tool is a universally useful solution though; it takes real experience and knowledge to discern what your problems are and how to best tackle them.
A lot of it is because platforms need a wedge against AWS/Azure.
Take edge computing for example – most apps are CRUD apps that talk to a single database, but vendors keep pushing for running things on edge servers talking to databases with eventual consistency because hey, at least AWS doesn't offer that! (or it would cost you 10k/month to run that on AWS)
For the problem of eventual consistency, Google's Spanner and Firestore both provide strong or tunable global consistency. Instead of "eventual consistency because AWS doesn’t offer that", GCP has "we give you strong consistency at scale."
That’s funny, I agree with all of those but Tailwind, which I find to be super convenient for small teams where most people aren’t CSS experts. Maybe that’s a specific problem but it feels pretty general to want convenience on top of base CSS stuff without writing and making it all yourself.
I think the entire concept of HTML and CSS when it comes to browser-based app development (as opposed to styling documents) is just not that ergonomic compared to a native equivalent.
It might have seen a lot of improvements over the years to make it more flexible, but I can't imagine any serious project raw-dogging CSS instead of pulling in some kind of framework or abstraction to make it manageable.
Is the main benefit you find related to the resets and design system built in to tailwind?
I've always found that I need to know CSS to really use tailwind, and I need to additionally know tailwind-specific syntax to do anything somewhat complex related to layout or conditional styling.
You need to know a little CSS, but way less. I don’t think I’ve ever gotten backed into using !important with Tailwind.
As someone with years of experience on serverless stuff on AWS I might be a bit biased BUT I'd argue serverless is the sweet spot for most applications. You need to remember however that most applications aren't your typical startups or other software products but simply some rather boring line of business software nobody outside the company owning it knows of.
Concerning how IT departments in most non-software companies are, the minimal operational burden is a massive advantage and the productivity is great once you have a team with enough cloud expertise. Think bespoke e-commerce backends, product information management systems or data platforms with teams of a handful of developers taking responsibility for the whole application lifecycle.
The cloud expertise part is a hard requirement though but luckily on AWS the curriculum is somewhat standardized through developer and solutions architect certifications. That helps if you need to do handovers to maintenance or similar.
That said, even as a serverless fan, I immediately thought of containers when the performance requirements came up in the article. Same with the earlier trending "serverless sucks" about video processing on AWS. Most of the time serverless is great but it's definitely not a silver bullet.
I like your angle, but "most applications" is very different from "most companies". Serverless comes after deciding whether or not to break up the monolith, and after breaking up engineering into separate teams. It's a good way to manage apps with high variance in traffic while keeping cloud spend down.
Let me tell you about all the fun I'm having trying to execute my amazon lambda app locally so I can test before deploying...
Localstack makes that pretty easy. Before Localstack I had a pre-staging environment (dev) target I would deploy to. Their free/community offering includes a Lambda environment; you deploy your dev "Lambda" locally to docker, using the same terraform / deploy flow you'd normally use but pointed at a Localstack server which mimics the AWS API instead. Some of their other offerings require you to pay though (Cloudfront, ECS, others) and I don't use those, yet at least.
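The client-side wiring is just an endpoint override; a sketch for the Go SDK, assuming LocalStack's default edge port (4566):

    package main

    import (
        "context"
        "log"

        "github.com/aws/aws-sdk-go-v2/aws"
        "github.com/aws/aws-sdk-go-v2/config"
        "github.com/aws/aws-sdk-go-v2/service/s3"
    )

    func newLocalS3(ctx context.Context) (*s3.Client, error) {
        cfg, err := config.LoadDefaultConfig(ctx, config.WithRegion("us-east-1"))
        if err != nil {
            return nil, err
        }
        return s3.NewFromConfig(cfg, func(o *s3.Options) {
            o.BaseEndpoint = aws.String("http://localhost:4566") // LocalStack
            o.UsePathStyle = true // bucket-in-path works better locally
        }), nil
    }

    func main() {
        if _, err := newLocalS3(context.Background()); err != nil {
            log.Fatal(err)
        }
    }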
I had a horrible time with Localstack. It's very similar to AWS but not exactly the same, so you hit all kinds of annoying edges and subtle bugs or differences and you're basically just doing multi-cloud at that point. The same kinds of problems you encounter with slightly inaccurate mocks in automated testing.
The better solution in my experience is to create an entirely different account to your production environments and just deploy things there to test. At least that way you're dealing with the real SQS, S3, etc, and if you have integration problems there, they're actually real things you need to fix for it to work in prod - not just weird bandaids to make it work with Localstack too.
How do you keep the accounts in sync? Isn't deploying to remote a fairly slow process?
I dealt with a microservice style serverless backend that was spread out across roughly 50 Go-based lambdas, one per API endpoint, with a few more for SQS and Cognito triggers. Was deployed via CloudFormation. Testing locally was an absolute nightmare even with Localstack.
Made me wish for a simple VPS, an ansible script to setup nginx and fail2ban and all that shit, and a deployment pipeline that scp'd a single binary into the right place.
Developer-specific sandbox environments with hot code reload is the golden standard here but Localstack is great if you can't do that due to (usually IT deparment-related, not technical) reasons.
That’s why Knative (Serverless on Kubernetes) accepts containers. It’s the standard packaging format that lets you lift and shift apps to many different platforms.
Lambda shifting to this model would be such a nice future, though even the Lambda variants that can run containers have some painful issues.
But they weren’t building an application. They were building a library that would get integrated in other stateful server applications not running on Cloudflare. The performance benefit comes from running their auth colocated with their customers, not anything else.
I've been doing AWS Lambda since it started up over 10 years ago. It solves a lot of problems for me. I don't ever have to worry about load balancing or scaling. I don't have to maintain a server. When it isn't being used, I am not paying for it. I've been running a pretty sophisticated project on Lambda for years, and I pay about $0.00/month for it. Most of the ~$0.45/mo I pay to AWS is in S3.
Lambda code is extremely easy to test locally, if you write it that way. I just run the file locally and it does what it would do in the cloud, there is literally no difference. But of course, YMMV depending on how you approach it.
I created my own build tools for Lambda about a month after Lambda was introduced as a product. It's been working great ever since. The workflow is very simple. When I update a file locally, it simply updates the Lambda function almost instantly. I can then test the Lambda live in the cloud. If I want to run the function locally, I just run it and it behaves the same way it would in the cloud. There's no need to run the function in AWS, if you write the code so it can be run locally. It's really, really easy to do, but I guess some people haven't figured that out yet.
I've never liked containers. It's always been way more opaque than writing Lambdas that can run locally as well as in the cloud.
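Concretely, the pattern is just keeping the Lambda types out of your logic and picking the entry point at startup; a sketch in Go (AWS_LAMBDA_RUNTIME_API is set by the real Lambda environment; everything else here is illustrative):

    package main

    import (
        "context"
        "fmt"
        "os"

        "github.com/aws/aws-lambda-go/lambda"
    )

    type Input struct {
        Name string `json:"name"`
    }

    // doWork is ordinary, testable Go with no Lambda types in sight.
    func doWork(ctx context.Context, in Input) (string, error) {
        return "hello " + in.Name, nil
    }

    func main() {
        if os.Getenv("AWS_LAMBDA_RUNTIME_API") != "" {
            lambda.Start(doWork) // running in the cloud
            return
        }
        // Running locally: just call the function directly.
        out, err := doWork(context.Background(), Input{Name: "local"})
        if err != nil {
            panic(err)
        }
        fmt.Println(out)
    }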
Lambda does get expensive when call volume goes up. If you're handling 10rps typically, ECS becomes a lot cheaper.
It obviously depends on how long your requests last, but still.
As for running it locally, it depends what your upstream is. I can tell you that I've had to work around bugs in the marshalling from MSK for example. You would never find that locally. If it's just web requests, sure.
What scaling and load balancing do you need if you pay $0 per month?
Don’t all “serverless” platforms accept docker images? I know Cloud Run does.
If I understand correctly your concern is mostly with “serverless functions” which abstracts away even more.
Cloud Run is more akin to AWS ECS (on Fargate), which also uses containers.
I think developers are drowning in tools to make things "easy", when in truth many problems are already easy with the most basic stuff in our tool belt (a compiler, some bash scripts, and some libraries). You can always build up from there.
This tooling fetish hurts both companies and developers.
Yeah; IMO Docker was our last universal improvement to productivity, in 2013, and very little we've invented since then can be said to have had such a wide-ranging positive impact, with such few drawbacks. Some systems are helpful for some companies, but then try to get applied to other companies where they don't make sense and things fall apart or productivity suffers. Cloudflare and others are trying to make v8 isolates a thing, and while they are awesome for some workloads, people want them to be the "next docker", and they aren't.
The model "give me docker image, we put it on internet" is staggeringly powerful. It'll probably still be the most OP way to host applications in 2040.
Even better than docker: just run some scripts on the machine.
Docker + IaC* for me; git ops, immutable servers, immutable code, immutable config, (nearly) immutable infrastructure means I haven't had to drop to the command line on a server since 2015. If something is wrong you restart the container, if that doesn't work you restart the host it's running on. The "downside" to this is my "admin" shell skills outside of personal dev laptop commands have gotten rusty.
*Terraform, imo, released in ~2014
> If something is wrong you restart the container, if that doesn't work you restart the host it's running on
Haha, lucky you. If only world was this beautiful :) I regularly shell into Kubernetes nodes to debug memory leaks from non-limited pods, or to check some strange network issues.
It’s that, and the fact that precious few people seem to understand fundamentals anymore, which is itself fed by the desire to outsource everything to 3rd parties. You can build an entire stack where the only thing you’ve actually made is the core application, and even that is likely to be influenced if not built by AI.
The industry is creating learned helplessness.
The other troubling thing is that if you do invest time into learning fundamentals, you'll be penalized for it because it won't be what you're interviewed on and probably won't be what you're expected to do on the job.
I'm working on a project right now that should be two or three services running on a VM. Instead we have 40+ services spread across a K8s cluster with all the Helm Chart, ArgoCD, CICD pipeline fun that comes with it.
It drives me absolutely nuts. But hey if the company wants to pay me to add all that stuff to my resume, I guess I shouldn't complain.
Yeah, the previous company I worked for started with a Django monolith that someone had come in and taken an axe to essentially at random until there were 20 Django "microservices" that had to constantly talk to each other in order to do any operation while trying to maintain consistency across a gigantic k8s cluster. They were even all still connected to the same original database that had served the monolith!
Unfortunately my campaign of "what if we stuck all the django back together and just had one big server" got cut short by being laid off because they'd spent too much money on AWS and couldn't afford employees any more.
I had to chuckle, how ironic this is... I worked on a project where they had 6 microservices with about 2-3 endpoints each and some of them would internally call other microservices to sync and join data. That was for 20 users top and managed by 1 team. The cloud bill was exciting to look at!
It's spreading like wildfire, in all areas, not just development.
50 tools in ops/backend to make my life easier.
75 tools in the frontend to make my life easier.
45 different tools used by product to make my life easier.
20 used by HR to make my life easier.
10 used by office management to make my life easier.
None of them really do.
I find myself exhausted after a time, when I have to switch between 2-3 apps and many more tabs trying to co-ordinate things or when debugging issues with a teammate. And this is with me working professionally for only ~3 years.
I think the tools are nice to use early on but quickly become tough to manage as I get caught up with work, and can't keep up with the best way to manage them. Takes a lot of mental effort and context switching to manage updates or track things everywhere.
Also, if you're trying to run lean and cheap you can go very far with just a simple VPS like Hetzner.
Helps resumes! No one got to Staff suggesting bash scripts
Agreed, but not Bash. Bash should not be used for anything except interactive use. It's just way too error-prone and janky otherwise.
I am not at all a fan of Python but even so any script that you write in Bash would be better in Python (except stuff like installation scripts where you want as few dependencies as possible).
If it's worth saving in a file, it's worth not using Bash.
A lot of people don’t know about compilers, bash scripts and libraries.
Except AWS Lambda is stupidly cheap!
For certain workloads :)
And that is actually the advantage of serverless, in my mind. For some low-traffic workloads, you can host for next to nothing. Per invocation, it is expensive, but if you only have a few invocations of a workload that isn't very latency sensitive, you can run an entirely serverless architecture for pennies per month.
Where people get burned is moving high traffic volumes to serverless... then they look at their bill and go, "Oh my god, what have I done!?" Or they try to throw all sorts of duct tape at serverless to make it highly performant, which is a fool's errand.
Exactly. I've always found that how people want to use lambda is the exact opposite of how to use it cost effectively.
I've seen a lot of people want to use lambdas as rest endpoints and effectively replace their entire API with a cluster of lambdas.
But that's about the most expensive way to use a lambda! 1 request, one lambda.
Where these things are useful is when you say "I have this daily data pull and ETL that I need to do." Then all of a sudden the cost is pretty dang competitive.
> Where these things are useful
All the backend processing and just general 'glue' in your architectures
The amount of 0s in the price per second is mesmerizing, but just multiply this by 24h and 30 days, and you are well within the price range of a better EC2 with much better performance, plus you can process 1000 req/s instead of 1 req/s for the same price.
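To put rough numbers on that (us-east-1 list prices at the time of writing; they change, so treat this as illustrative):

```python
# Back-of-envelope: a 1 GB Lambda kept busy 24/7 vs. a small EC2 instance.
LAMBDA_PER_GB_SECOND = 0.0000166667    # x86 compute charge
LAMBDA_PER_MILLION_REQS = 0.20
SECONDS_PER_MONTH = 60 * 60 * 24 * 30  # 2,592,000

compute = 1.0 * SECONDS_PER_MONTH * LAMBDA_PER_GB_SECOND
requests = (SECONDS_PER_MONTH / 1e6) * LAMBDA_PER_MILLION_REQS  # 1s per request
print(f"Lambda, 1 GB busy all month: ${compute:.2f} + ${requests:.2f} requests")
# -> about $43 + $0.52, and that buys you ONE request at a time

# A t3.medium (2 vCPU / 4 GiB) at ~$0.0416/h serves many concurrent requests:
print(f"t3.medium all month: ${0.0416 * 24 * 30:.2f}")  # -> about $30
```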
How long is a piece of string?
"Cheap" is relevant if you are talking about work load that is one off and doesn't run continuously. A lot of people use serverless to run a 24-7 service which sort of defeats the purpose. It doesn't get that cheap anymore.
Serverless is good if you have one off tasks that are used intermittently and are not consistent.
I once started working at a company that sold one of those visual programming things. During training I was tasked with making a simple program, and I was a bit overwhelmed by the amount of bugs and the lack of tools to make basic features, so I made a prototype of the application I wanted in Python, with the plan to port it later. I got it done in a couple of days.
The tool developers weren't keen on the idea. They told me "Yeah, I can solve the problem with a script too; the challenge is to do it with our tool". And I thought it was kind of funny how they admitted that the premise of the tool didn't work.
It's like this holy grail panacea that arises 20 times every month: developers want to invent something that will avoid the work of actually developing, so they sunk-cost-fallacy themselves into a deep hole out of which they can only escape by admitting that they are the ones tasked with automating, that they cannot meta-automate themselves, and that they will have to (gasp) do some things manually and repeatedly, like any other worker.
This really is confirming my theory that the problem here is that "serverless" is so ill-defined that even its own name is nonsensical.
Like, there are still servers.
Appears just about as intelligent as calling them "electricity-less." I mean, yes, I no longer think about electricity when deploying things, but that doesn't tell me anything meaningful about what's going on here.
It's more like cgi-bin as a service.
I once got a job for a company whose entire stack was AWS Lambda. Not like a normal 'hey we have this flask application which handles 100 endpoints and we just use lambda to handle a call' - more like 'every single route or function is its own lambda function'. I left within 2 weeks.
Saw this with an infamous 996 AI startup where the founders vibe coded their app.
Their app had like 3 or 4 pages, that was all, but every endpoint was a Lambda call. Every frontend log entry was a Lambda call. Every job execution was a Lambda call that called a dozen others. That was their idea of an architecture.
Even with <100 customers they were already paying >100k dollars a month on cloud, more than their entire dev team cost.
The constant firefighting, downtimes and billing issues were enough to keep the team busy instead of making new features.
I was freelancing but I didn't take the job after the "CTO" gave me a tour.
In principle, serverless should be very efficient and cheap if all code is running in the same VM. The heterogeneous architecture must make it expensive. Maybe serverless using WASM will one day make it cheap.
I don't think "serverless is bad" is necessarily the full lesson here. The bigger lesson is when a service has dependencies, moving that service closer to the client (without also moving those dependencies) will counterintuitively make the e2e experience slower, not faster.
Prefer building physically near your dependencies. If that's not fast enough, then you have to figure out how to move or sync all your dependencies closer to the client, which except in very simple cases, is almost always a huge can of worms.
> The bigger lesson is when a service has dependencies, moving that service closer to the client (without also moving those dependencies) will counterintuitively make the e2e experience slower, not faster
The problem here is that pretty much all services have dependencies - if they didn't, you could have already moved that logic client-side.
This bites edge-compute architectures really hard. You move compute closer to the client, but far from the DB, and in doing so your overall latency is almost always worse.
The most latency-friendly architecture is almost always going to be an in-memory DB on a monolithic server (albeit this comes with its own challenges around scaling and redundancy)
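A toy model of that tradeoff (every number below is an assumption, chosen only to show the shape of the curve):

```python
# Round-trip model: the edge wins the client hop once, then pays the
# faraway-DB penalty on every query the request makes.
CLIENT_EDGE_MS, EDGE_DB_MS = 10, 70        # worker near user, DB far away
CLIENT_ORIGIN_MS, ORIGIN_DB_MS = 60, 1     # server co-located with the DB

def edge_total(queries: int) -> int:
    return 2 * CLIENT_EDGE_MS + queries * 2 * EDGE_DB_MS

def origin_total(queries: int) -> int:
    return 2 * CLIENT_ORIGIN_MS + queries * 2 * ORIGIN_DB_MS

for q in (1, 3, 5):
    print(f"{q} queries: edge {edge_total(q)} ms vs origin {origin_total(q)} ms")
# 1 query:   edge 160 ms vs origin 122 ms
# 3 queries: edge 440 ms vs origin 126 ms
# 5 queries: edge 720 ms vs origin 130 ms
```

One query and the edge is already behind; anything chatty and it's not close.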
I don’t know if this is a good rule of thumb, I think it really depends on what you use the dependencies for, how often you need them, etc.
Consider for example a single DB dependency. Should the server be close to the DB or the client? It depends. How often does the client need the server? How often does the server need the DB? Which usecases are expected to be fast and which can be sacrificed as slow? What can be cached in the server? What can be cached in the client? etc etc.
And then of course you can split and do some things on the server and some in the edge…
Oh, it's a very good rule of thumb. It's probably not universal, but it's really close to it.
The problem is that nobody designs dependencies flexibly enough to let them run without fine-grained control. And the main application always wants to change the way it uses its dependencies, so it always needs further flexibility.
You can build an exception to the rule if you explicitly try. But I'm not sure one appears naturally. The natural way to migrate your server into the edge is by migrating entire workloads, dependencies included. You can split the work like you said, you just can't split single endpoints.
Well, I guess one can take one more step back and say this is all merely an example of "premature optimization is the root of all evil". Unless you know a priori that you have some very hard latency requirements, start with something simple and low-maintenance. If low-latency requirements come in later, then design for that holistically, not just looking at your component. Make sure you're measuring the right things; OOTB metrics often miss the e2e experience. And IME most latency issues come from unexpected places; I know I've spent weeks optimizing services to get an extra percent or two out of them, only to realize there was a config setting that reduced latency by half.
So generally, simplicity is your friend when it comes to latencies (among other things). Fewer things to cause long-tail spikes, more simple things you can try out that don't break the whole system, whereas if you start with a highly-optimized thing up-front, fixing some unexpected long-tail issue may require a complete rewrite.
Also, check with your PM or end users as to whether latency is even important. If the call to your service is generally followed up to a call to some ten-second process, users aren't going to notice the 20ms improvement to your own thing.
For the best price-to-performance ratio, create your own instances and do whatever is needed on them. Software stacks are not so complicated that you have to delegate everything to the Wizards of Cloud Overcharging.
I think the "local maximum" we've gotten stuck at for application hosting is having a docker container as the canonical environment/deliverable, and injecting secrets when needed. That makes it easy to run and test locally, but still provides most of the benefits I think (infrastructure-as-code setups, reproducibility, etc). Serverless goes a little too far for most applications (in my opinion), but I have to admit some apps work really well under that model. There's a nearly endless number of simple/trivial utilities which wouldn't really gain anything from having their own infrastructure and would work just fine in a shared or on-demand hosting environment, and a massively scaled stateless service would thrive under a serverless environment much more than it would on a traditional server.
That's not to say that I think serverless is somehow only for simple or trivial use cases though, only that there's an impedance mismatch between the "classic web app" model, and what these platforms provide.
You are ready for misterio: https://github.com/daitangio/misterio A tiny layer around a stateless Docker cluster. I created it for my homelab and it went wild.
That's really interesting, I might actually use that for mine too. Thanks for sharing.
Docker is much like microservices. Appropriate for a subset of apps and yet touted as being 'the norm' when it shouldn't be.
There are drawbacks to using docker, such as security patching and operational overhead. And if you're blindly putting it into every project, how are you mitigating the risks it introduces?
Worse, the big reason it was useful, managing dependency hell, has largely been solved by making developers default to not installing dependencies globally.
We don't really need Docker anywhere near like we used to, and yet it persists as the default, unassailable.
Of course hosting companies must LOVE it, docker containers must increase their margins by 10% at least!
Someone else down thread has mentioned a tooling fetish, I feel Docker is part of that fetish.
It has downsides and risks involved, for sure. I think the security part is perhaps a bit overblown, though. In any environment, the developers either care about staying on top of security or they don't. In my experience, a dev team that skips proper security diligence when using Docker likely wouldn't handle it well outside of Docker either. The number of boxes out there running some old version of Debian that hasn't been patched in the last decade is probably higher than any of us would like.
Although I'm sure many people just do it because they believe (falsely) that it's a silver bullet, I definitely wouldn't call it part of a "tooling fetish". I think it's a reasonable choice much more often than the microservice architecture is.
I deeply disagree. Docker’s key innovation is not its isolation; it’s the packaging. There is no other language-agnostic way to say “here’s code, run it on the internet”. Solutions prior to Docker (eg buildpacks) were not so much language agnostic as they were language aware.
Even if you allow yourself the disadvantage that any non-Docker solution won’t be language-agnostic: how do you get the code bundle to your server? Zip & SFTP? How do you start it? ./start.sh? How do you restart under failure? Systemd? Congrats, you reinvented docker but worse. Want to upgrade a dependency due to a security vulnerability? Do you want to SSH into N replicated VMs and run your Linux distribution specific package update command, or press the little refresh icon in your CI to rebuild a new image then be done?
Docker is the one good thing the ops industry has invented in the last 15 years.
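For illustration, the whole language-agnostic contract fits in a handful of lines; the app here is a hypothetical Python service, but the host neither knows nor cares what's inside:

```dockerfile
# Minimal sketch of "here's code, run it on the internet."
FROM python:3.12-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["python", "app.py"]
```

Anything that can run an OCI image can now start it, restart it on failure, and roll it back, with zero knowledge of the language inside.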
This is a really nice insight. I think years of linux have kind of numbed me to this. I've spent so much time on systems which use systemd now that going back to an Alpine Linux box always takes me a second to adjust, even though I know more or less how to do everything on there. I think docker's done a lot to help with that though since the interface is the same everywhere. A typical setup for me now is to have the web server running on the host and everything else behind docker, since that gives me the benefit of using the OS's configuration and security updates for everything exposed to the outside world (firewalls, etc).
Another thing about packaging. I've started noticing myself subconsciously adding even a trivial Dockerfile for most of my projects now just in case I want to run it later and not hassle with installing anything. That way it gives me a "known working" copy which I can more or less rely on to run if I need to. It took a while for me to get to that point though
Running script 1 is harder than running script 2?
It's all the same stuff. Docker just wraps what you'd do in a VM.
For the slight advantage of deploying every server with a single line, you've still got to write the multi-line build script, just for Docker instead. Plus all the downsides of Docker.
There's another idea too, that docker is essentially a userspace service manager. It makes things like sandboxing, logging, restarting, etc the same everywhere, which makes having that multi-line build script more valuable.
In a sense it's just the "worse is better" solution[0], where instead of applying the good practices (sandboxing, isolation, good packaging conventions, etc) which leads to those benefits, you just wrap everything in a VM/service manager/packaging format which gives it to you anyway. I don't think it's inherently good or bad, although I understand why it leaves a bad taste in people's mouths.
[0]: https://en.wikipedia.org/wiki/Worse_is_better
Docker images are self-running. Infrastructure systems do not have to be told how to run a Docker image; they can just run them. Scripts, on the other hand, are not; at the most simple level because you'd have to inform your infrastructure system what the name of the script is, but more comprehensively and typically because there's often dependencies the run script implies of its environment, but does not (and, frankly, cannot) express. Docker solves this.
> Docker just wraps what you'd do in a VM.
Docker is not a VM.
> Plus all the downsides of docker.
Of which you've managed to elucidate zero, so thanks for that.
Hard disagree. I've used Docker predominantly in monoliths, and it has served me well. Before that I used VMs (via Vagrant). Docker certainly makes microservices more tenable because of the lower overhead, but the core tenets of reproducibility and isolation are useful regardless of architecture.
Depends on the language. Java or Go you really don't need docker.
There's some truth to this too honestly. At $JOB we prototyped one of our projects in Rust to evaluate the language for use, and only started using Docker once we chose to move to .NET, since the Rust deployment story was so seamless.
What are you isolating it from? Everything runs on it's own box these days anyway.
Yeah that's becoming increasingly true. I guess it really depends on what your setup is.
The dirty secret of Docker is that almost every Docker container deployed is actually a VM, not just a container.
These two have resonated with me deeply.
- Eliminated complex caching workarounds and data pipeline overhead
- Simplified architecture from distributed system to straightforward application
We, as developers/engineers (put whatever title you want), tend to make things complex for no reason sometimes. Not all systems have to follow state-of-the-art best practices. Many times, secure, stable, durable systems outperform these fancy techs and inventions. Don't get me wrong, I love to use all of these technologies and fancy stuff, but sometimes that old, boring, monolithic API running on an EC2 solves 98% of your business problems, so no need to introduce ECS, K8S, Serverless, or whatever.
Anyway, I guess I'm getting old, or I understand the value of a resilient system, and I'm trying to find peace xD.
But when were serverless systems like lambda and cloud workers "best practices" for low latency apis?
According to their marketing material, when they started supporting running in edge POPs, they became the best option for low-latency APIs.
Last I heard (~5 years ago), lambda@edge doesn't actually run on edge POPs anyway; they're just hooks that you can put in your edge configs that execute logic in the nearest region before/after running your edge config. But it's definitely a datacenter round-trip to invoke them.
Adding that much compute to an edge POP is a big lift; even Firecracker gets heavy at scale. And there's security risk in executing arbitrary code, since these POPs don't have near the physical security of a datacenter, the small scale makes them more vulnerable to timing attacks, etc.
The takeaway here isn’t that serverless doesn’t work, it’s that the authors didn’t understand what they were building on. Putting a latency-critical API on a stateless edge runtime was a rookie mistake, and the pain they describe was entirely predictable.
I’ve found this to be true, with one caveat.
Most cloud pain people experience is from a misunderstanding / abuse of solutions architecture and could have been avoided with a more thoughtful design. It tends to be a people problem, not a tool problem.
However, in my experience cloud vendors sell the snot out of their offerings, and the documentation is closer to marketing than truthful technical documentation. Their products’ genuine performance is a closely guarded proprietary secret, and the only way to find out… e.g. whether Lambdas are fast enough for your use case, or whether AWS RDS cross-region replication is good enough for you… is to run your own performance testing.
I’ve been burned enough times by AWS making it difficult to figure out exactly how performant their services are, and I’ve learned to test everything myself for the workloads I’ll be running.
> the documentation is closer to marketing than truthful technical documentation
I participated in AWS training and certification given by AWS for a company to obtain a government contract and I can 100% say that the PAID TRAINING itself is also 100% marketing and developer evangelism.
100% agree with you. I took a corporate training, and at one point crammed for the developer cert. It is just marketing. There is never a question where the answer is "Just run this service on EC2 yourself". It is about maximizing your usage of AWS services.
Running on EC2 is hardly ever the correct answer. I’ve had to deploy to EC2 over the years and every method is a pain.
Just use Docker, there are plenty of services where deployment is simply - “hand your container to us and we run it”.
Even the most complicated popular ways to deploy Docker are simpler than deploying to a VM and a lot less error prone.
Platform dependency/lockin is never mentioned as a con[cern].
Infra will always be full of so much nonsense because it’s really hard to tell successful developers their code and system design is unusable. People use it because they are paid to do so usually, but it’s literally some of the worst product development I’ve ever seen.
AWS will hopefully be reduced to natural language soon enough with AI, and their product team can move on (most likely they moved on a long time ago, and the revolving door at the company meant it was going remain a shittily thought out platform in long term maintenance).
Some things never change. I remember ~20 years ago a bunch of expensive F5s suddenly showing up to our offices because the CTO and enterprise architects were convinced that irules could solve all their performance problems for something that wasn't even cacheable (gaming results) and would have shoved too much of our logic into the underpowered CPUs on them.
They were a much nicer, if overpriced, load balancing alternative to the Cisco Content Switch we were using, though.
This is exactly why I'd rather get a fat VPS from a reputable provider. As long as the bandwidth is sufficient the only limitation is vertical scaling.
I'm partial to this, the only thing I've found that is harder to achieve is the "edge" part of cloud services. Having a server at each continent is enough for most needs but having users route to the closest one is not as clear to me.
I know about Anycast but not how to make it operational for dynamic web products (not like CDN static assets). Any tips on this?
Someone correct me if I’m wrong but:
DIY Anycast is probably beyond most people’s reach, as you need to deal with BGP directly.
One cool trick is using GeoDNS to route the same domain to a different IP depending on the location of the user, but there are some caveats of course due to caching and TTL.
EDIT: Back to Anycast, there are also some providers who allow you BGP configuration, like those: https://www.virtua.cloud/features/your-ip-space - https://us.ovhcloud.com/network/byoip - https://docs.hetzner.com/robot/colocation/pricing/ ... However you still need to get the IPs by yourself, by dealing with your Regional Registry (RIPE in my case, in Europe)
To get anycast working, you need BGP, and to get it working well, I think you need a good understanding of BGP, a lot of points of presence, and good connectivity at each. BGP's default metric of distance is the number of networks traversed, which does funny things.
Say you're in city A where you use transit provider 1 and city B where you use transit provider 2. If a user is in city B and their ISP is only connected to transit provider 1, BGP says deliver your traffic to city A, because then traffic doesn't leave transit provider 1 until it hits your network. So for every transit network you use, you really want to connect to it at all your PoPs, and you probably want to connect to as many transit networks as feasible. If you're already doing multihoming at many sites, it's something to consider; if not, it's probably a whole lot of headache.
GeoDNS as others suggested is a good option. Plenty of providers out there, it's not perfect, but it's alright.
Less so for web browsers, but you can also direct users to specific servers. Sample performance for each /24 and /48 and send users to the best server based on the statistics, use IP location as a fallback source of info. Etc. Not great for simple websites, more useful for things with interaction and to reduce the time it takes for tcp slow start (and similar) to reach the available bandwidth.
You could start using DNS traffic shaping, where the DNS server looks at the IP making the request and returns the IP of the closest server.
Azure/AWS/GCP all have solutions for this, and they don't require you to use their other services. There are probably other DNS providers that can do it as well.
Cloudflare can do this too, but it's probably more expensive than plain DNS.
You took the words right out of my mouth. Between aggressive salespeople marketing any given product as a panacea for everything and mandates from above to arbitrarily use X thing to do Y, there’s a lot of just plain bad architecture out there.
>> is to run your own performance testing
I think they are shooting themselves in the foot with this approach. If you have to run a monte carlo simulation on every one of their services at your own time and expense just to understand performance and costs, people will naturally shy away from such black boxes.
> people will naturally shy away from such black boxes.
I don't think this is true. In fact, it seems that in the industry, many developers don't proceed with caution and go straight into usage, only to find the problems later down the road. This is a result of intense marketing on the part of cloud providers.
The fact is most developers in most companies have very little choice. Many medium to large companies (1k-50k employees) the CTO gets wined and dined by AWS/Azure/Oracle and they decide to move to that cloud. They bring in their solutions architects and do the training. The corporate architects for the divisions set the goals. So the rank and file developers get told that they have to make this work in AWS using RDS and they have almost zero power over this choice.
It doesn't even have to be in companies that big. The AWS salespeople took the CTO and a couple of directors of engineering out for dinner in a fancy restaurant. That was at a fintech that had around 200 employees. AWS also paid for the mandatory marketing... sorry, mandatory training sessions we tech managers had to do.
This is how much it takes for a CTO to demand the next week that "everything should be done with AWS cloud-native stuff if possible".
I feel like every cloud build meeting should have a moment where everyone has to defend the question "Wait! could this be a regular database with a regular app on a server with a regular cache?"
The takeaway isn't that they didn't understand, it's that they are sharing information which you agree is valuable
Bo Burnham said, "self awareness does not absolve anyone of anything"
But I don't think they (or their defenders) are aware of the real lesson here yet.
There's literally zero information that's valuable here. It's like saying "we used an 18-wheeler as our family car and then we switched over to a regular Camry and solved all our problems." What is the lesson to be learned in that statement?
The really interesting post-mortem would be if they went, "god, in retrospect, what a stupid decision we took; what were we thinking? Why did we not take a step back earlier and think, why are we doing it this way?" If they wrote a blog post that way, it would likely have amazing takeaways.
I can assure you that was pretty close to the internal conversation lol
Not sure what the different takeaways would be though?
What did your internal discussion conclude for the question "Why did we not take a step back earlier and think, why are we doing it this way?"
I'm genuinely curious, because this is not singling out your team or org; this is a very common occurrence among modern engineering teams, and I've often found myself on the losing end of such arguments. So I am all ears to hear at least one such team tell what goes on in their minds when they make terrible architecture decisions, and whether they learned anything philosophical that would prevent a repeat.
Oh we had it coming for quite some time and knew we would need to rebuild it, we just didn’t have the capacity to do it unfortunately.
I was working on it on and off, moving one endpoint at a time, but it was very slow going until we hired someone who was able to focus on it.
It didn't feel good at all. We knew the product had massive flaws due to the latency but couldn't address them quickly, especially because we had to build more workarounds as time went on. Workarounds we knew would be made redundant by the reimplementation.
I think we had that "wtf are we doing here" discussion pretty early, but we didn't act on it in the beginning; instead we tried different approaches to make it work within the serverless constraints, cause that's what we knew well.
I have had CTOs (two in my career) tell me we had to use our AWS credits since they were going to expire worthless. Both experiences were at vc-backed startups.
What's valuable about rediscovering that stateless architectures requiring network round-trips for state access are slower than in-memory state? This isn't new information, it's a predictable consequence of their architecture choice that anyone with distributed systems experience could have told them on day zero.
Not everyone is born with experience in distributed systems
Sure, but there are some fundamentals about latency that any programmer should know [0] (absolute values outdated, but still useful as relative comparisons), like “network calls are multiple orders of magnitude slower than IPC.”
I’m assuming you’re an employee of the company based on your comments, so please don’t take this poorly - I applaud any and all public efforts to bring back sanity to modern architecture, especially with objective metrics.
0: https://gist.github.com/hellerbarde/2843375
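For reference, a few of the rough orders of magnitude from that list (the absolute values have drifted, but the ratios haven't):

```
L1 cache reference                    ~0.5 ns
main memory reference                 ~100 ns
round trip within the same datacenter ~500 µs
packet CA -> Netherlands -> CA        ~150 ms
```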
I cofounded it yeah
And yeah you’re right in hindsight it was a terrible idea to begin with
I thought it could work but didn’t benchmark it enough and didn’t plan enough. It all looked great in early POCs and all of these issues cropped up as we built it
That's fair, but then the framing matters. The article criticizes serverless architecture rather than acknowledging an evaluation failure.
"Serverless was fighting us" vs "We didn't understand serverless tradeoffs" - one is a learning experience, the other is misdirected criticism.
Yeah that’s fair
You don't need experience, and there is not really a lot to know about "distributed systems" in this case; it's basic CS knowledge about networks, latency, and what "serverless" actually is, and you can read about it. To be honest, to me it reads like people who didn't understand the problem they were solving and hadn't acquired the necessary knowledge to solve it (either by learning it themselves or by asking/hiring people who have it), and seeing such an amateurish mistake doesn't inspire confidence for the future. You should either hire people who know what they are doing or upgrade your knowledge about the systems you are using before making decisions to use them.
Sometimes I see a post about sorting algorithms online. Some people seem to benefit from reading about these things, but often, I find there isn't much new information for me. That's OK, because I know somebody somewhere benefits from knowing this.
It is your decision to make this a circlejerk of musings about how the company must be run by amateurs. Whatever crusade you're fighting in vividly criticising them is not valuable at all. People need to learn and share so we can all improve, stop distracting from that point.
Uh, no, 95% of our architectures are stateless and it's fine, because the RTT isn't dogshit, unlike AWS Lambda's.
I would not assume this was a "rookie mistake". I've been here once or twice, and a common story is that engineers don't want to do it a certain way, but management overrules them for some vague hand-wavy reason like, "This way is more modern." Another common story is that you know you're not choosing the most [scalable|robust|performant|whatever] design, but ancillary constraints like time and money push you into a "worse is better" decision.
Or maybe the original implementation team really didn't know what they were doing. But I'd rather give them the benefit of the doubt. Either way, I appreciate them sharing these observations because sharing these kinds of stories is how we collectively get better as a professional community.
> but management overrules them for some vague hand-wavy reason like, "This way is more modern."
This matches my experience. It's very difficult to argue against costly and/or inappropriate technical decisions in environments where the 'Senior Tech Leadership' team are just not that technical but believe they are, and so are influenced by every current industry trend masquerading as either 'scalable', 'modern' or (worst of all) 'best practice'.
What's even more dangerous is when senior tech leadership used to be technical but haven't actually got their hands dirty in 5 or 10 years, and don't realize that this means they aren't actually holding all the cards when they try to dictate these kinds of tactical, detail-oriented technical decisions.
I see this a lot in startups that grew big before they had a chance to grow up.
> used to be technical
And to add, this rarely indicates anything about the depth and/or breadth of the 'used to' experience.
A lot of the strongest individual contributors I see want to stay in that track and use that experience to make positive and sensible change, while the ones that move into the management tracks don't always have such motivations. There's no gatekeeping intended here, just an observation that the ones that are intrinsically motivated by the detailed technical work naturally build that knowledge base through time spent hands-on in those areas and are best able to make more impactful systemic decisions.
People in senior tech leadership also are not often exposed to the direct results of their decisions too (if they even stay in the company for long enough to see the outcome of longer-term decisions, which itself is rare).
While it's not impossible to find the folk that do have breadth of experience and depth of knowledge but are comfortable and want to be in higher-level decision making places, it's frustratingly rare. And in a lot of cases, the really good ones that speak truth to power end up in situations where 'Their last day was yesterday, we wish them all the best in their future career endeavours.' It's hardly surprising that it's a game that the most capable technical folks just don't want to play, even if they're the ones that should be playing it.
This all could just be anecdata from a dysfunctional org, of course...
My personal experience is that if you want guaranteed anything (quick scaling, latency, CPU, disk or network throughput), your best bet is to manually provision EC2 instances (or use some API that does). Once you give up control hoping to gain performance for free, you usually end up with an unfixable bottleneck.
If you're looking for a middle ground between VMs and serverless, ECS Fargate is a good option. Because a container is always running, you won't experience any cold start times.
Yes, though unless you’re provisioning your own EC2s for them to run on, you have no guarantee about the server generation, and IME AWS tends to provision older stuff for Fargate.
This may or may not matter to you depending on your application’s needs, but there is a significant performance difference between, say, an m4 family (Haswell / Broadwell) and an m7i family (Sapphire Rapids) - literally a decade of hardware improvements. Memory performance in particular can be a huge hit for latency-sensitive applications.
https://aws.amazon.com/blogs/aws/announcing-amazon-ecs-manag...
ECS is good, just expensive and still requires more devops than it should. Docker Swarm is an easy way to run production container services on VMs. I built a free golang tool called Rove that provisions fresh Ubuntu VMs in one command and diffs updates. It's also easy-enough to use Swarm directly.
I’ve used a modified version of this for 8 years - I didn’t write it. Updating your ECS Docker image is just passing in the parameter of your new image and updating the cloudformation stack.
https://github.com/1Strategy/fargate-cloudformation-example/...
Thanks for sharing! I'll bookmark that.
Honestly I didn't have a good experience with ECS (Fargate). I remember having to write a ton of CF deployment scripts and bash scripts, setting up a private AWS Docker registry, having a terrible time debugging while my CF deployment kept failing, deploys taking forever, and finding out that AWS is too miserly to pay Docker to use the official repo so they're stuck on the free tier, meaning sometimes deploys would fail due to Dockerhub kicking the AWS Docker agent out, etc. It had limitations like not being able to attach a block volume to the Docker instance, so overall I remember spending a week setting up the IaC for a simple-ass CRUD app on Fargate ECS.
Setting up the required roles and permissions was also a nightmare. The deployment round trip time was also awful.
The first of the two good experiences I had with AWS was when we had a super smart devops guy who set up the whole Docker pipeline on top of actual instances, so we could deploy our docker-compose straight to a server in under 1 minute (this wasn't a scaled app), and had everything working.
The second was Lambda, which is pretty cool: you can just zip everything up and deploy from the AWS CLI without much scripting, and with pretty straightforward IaC.
I posted a link to a CloudFormation template I’ve used to deploy to ECS off an on for 8 years in a sibling reply. It’s stupid simple.
But the easy solution is just to use AWS’s own Docker registry and copy the images to it. Fargate has allowed you to attach EFS volumes for years.
A lot of AWS requires way too much config. It is a mystery to me why AWS doesn't lean into extending the capabilities of App Runner. I actually built a whole continuous deployment PaaS for AWS ECS with a Heroku-like UX, ended up shutting it down eventually because although useful, their pricing is pretty awful. What I need to do is figure out how to bring it back, just minus the hosted service so I can use it on corporate projects that require AWS...
Sounds useful! I hear mixed things about Swarm. You like it?
Edit: found it. Cool! https://rove.dev/
Yeah I haven't had any issues with Swarm. Heard good things from people running substantial clusters. Would be interested in hearing about what rough edges people have run into as well!
There isn't much for them to mess with in EKS either. It is very close to the metal and easy to reason about.
This is basically criticizing them for admitting to being one of today's 10,000.
https://xkcd.com/1053/
Personally, I appreciate the info and the admission.
> Putting a latency-critical API on a stateless edge runtime
Isn’t this the whole point of serverless edge?
It’s understood to be more complex, with more vendor lockin, and more expensive.
Trade off is that it’s better supported and faster by being on the edge.
Why would anyone bother to learn a proprietary platform for non critical, latency agnostic service?
You're confusing network proximity with application architecture. Edge deployment helps connection latency. Stateless runtime destroys it by forcing every cache access through the network.
The whole point of edge is NOT to make latency-critical APIs with heavy state requirements faster. It's to make stateless operations faster. Using it for the former is exactly the mismatch I'm describing.
Their 30ms+ cache reads vs sub-10ms target latency proves this. Edge proximity can't save you when your architecture adds 3x your latency budget per cache hit.
Realistically, they should be able to do sub-ms cache hits that land in the same datacenter. I know Cloudflare doesn't have "named" datacenters like other providers, but at the end of the day there are servers somewhere, and if your lambda runs twice in the same one, there is no reason why a pull-through cache can't experience a standard intra-datacenter latency hit.
I wonder if there is anything other than good engineering getting in the way of this, and even of sub-µs in-process pull-through caches for busy lambda functions. After all, if my lambda is getting called 1000x per second from the same point of presence, why wouldn't they keep the process in memory?
On serverless, whenever you call your code it has to be executed, but first the infrastructure has to find a place to run it, and sometimes, if there's no running instance available, it must fire up a new instance to run your code.
That's hot start vs. cold start.
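A minimal sketch of what that means in code, using the AWS Lambda Python runtime as the example (the env var name is hypothetical): module-level code runs once per container, on the cold start; the handler runs per invocation and reuses that state while the container stays warm.

```python
import os
import time

# Runs once per container (the cold start). Expensive setup belongs here:
# DB connection pools, SDK clients, parsed config, warmed caches...
BOOTED_AT = time.time()
CONFIG = {"table": os.environ.get("TABLE_NAME", "example")}  # hypothetical

def handler(event, context):
    # Runs once per invocation. On a cold start warm_for is ~0;
    # on warm invocations it keeps growing as the container is reused.
    warm_for = time.time() - BOOTED_AT
    return {"warm_for_seconds": round(warm_for, 3), "config": CONFIG}
```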
Agreed. Wondering what sort of discovery or design phase their legacy architecture went through.
But but it's webscale!
Their problem isn't serverless so much as Cloudflare Workers and WebAssembly.
All major cloud vendors have serverless solutions based on containers, with longer managed lifetimes between requests, and naturally the ability to use properly AOT-compiled languages in the containers.
At that point, why should I use serverless at all, if I have to think about the lifetime of the servers running my serverless functions?
Serverless only makes sense if the lifetime doesn't matter to your application, so if you find that you need to think about your lifetime then serverless is simply not the right technology for your use case.
Because it is still less management effort than taking full control of the whole infrastructure.
Usually it's a decision between more serverless or more DevOps salaries.
I would doubt that this is categorically true. Serverless inherently makes the whole architecture more complex with more moving parts in most cases compared to classical web applications.
> Serverless inherently makes the whole architecture more complex with more moving parts
Why's that? Serverless is just the generic name for CGI-like technologies, and CGI is exactly how classical web applications were typically deployed historically, until Rails became such a large beast that it was too slow to continue using CGI, and thus running your application as a server to work around that problem in Rails pushed it to become the norm across the industry, at least until serverless became cool again.
Making your application the server is what is more complex with more moving parts. CGI was so much simpler, albeit with the performance tradeoff.
Perhaps certain implementations make things needlessly complex, but it is not clear why you think serverless must fundamentally be that way.
Depends pretty much on where those classical web applications are hosted, how big the infrastructure taking care of security, backups, scalability, and failovers is, and the amount of salaries being paid, including on-call bonuses.
Serverless is not a panacea. And the alternative isn't always "multiple devops salaries", unless the only two options you see are serverless vs. an outrageously, stupidly complicated Kubernetes cluster to host a website.
There's a huge gap between serverless and full infra management. Also, IMO, serverless still requires engineers just to manage that. Your concerns shift, but then you need platform experts.
A smaller team, and from a business point of view, others take care of SLAs, which matters in cost-center budgets.
Pay 1 devops engineer 10% more and you'll get more than twice the benefit of 2 average engineers.
It can be good for connecting AWS stuff to AWS stuff. "On s3 update, sync change to dynamo" or something. But even then, now you've got a separate coding, testing, deployment, monitoring, alerting, debugging pipeline from your main codebase, so is it actually worth it?
But no, I'd not put any API services/entrypoints on a lambda, ever. Maybe you could manufacture a scenario where like the API gets hit by one huge spike at a random time once per year, and you need to handle the scale immediately, and so it's much cheaper to do lambda than make EC2 available year-round for the one random event. But even then, you'd have to ensure all the API's dependencies can also scale, in which case if one of those is a different API server, then you may as well just put this API onto that server, and if one of them is a database, then the EC2 instance probably isn't going to be a large percentage of the cost anyway.
Actually I don't even think connecting AWS services to each other is a good reason in most cases. I've seen too many cases where things like this start off as a simple solution, but eventually you get a use case where some s3 updates should not sync to dynamo. And so then you've got to figure out a way to thread some "hints" through to the lambda, either metadata on the s3 blob, or put it in a redis instance that the lambda can query, etc., and it gets all convoluted. In those kinds of scenarios, it's almost always better just to have the logic that writes to s3 also update dynamo. That way it's all in one place, can be stepped through in a debugger, gets deployed together, etc.
There are probably exceptions, but I can't think of a single case where doing this kind of thing in a lambda didn't cause problems at some point, whereas I can't really think of an instance where putting this kind of logic directly into my main app has caused any regrets.
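A sketch of the alternative being described (bucket and table names are placeholders): the writer updates both stores in one code path, so the "should this sync?" hint is an ordinary argument instead of metadata threaded through an event pipeline.

```python
import boto3

s3 = boto3.client("s3")
index = boto3.resource("dynamodb").Table("my-index-table")  # hypothetical

def save_document(doc_id: str, body: bytes, sync_to_index: bool = True) -> None:
    # One code path: deploys together, debugs together, steps through together.
    s3.put_object(Bucket="my-data-bucket", Key=f"docs/{doc_id}", Body=body)
    if sync_to_index:
        index.put_item(Item={"pk": doc_id, "size": len(body)})
```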
For a thing which permanently has load, it makes little sense.
It can make sense if you have very uneven load with a few notable spikes, or if you're all-in on managed services, where the serverless bits are event collectors from other services ("new file in object store" triggers a function to update some index).
Agree, it seems like they decided to use Cloudflare Workers and then fought them every step of the way instead of going back and evaluating if it actually fit the use case properly.
It reminds me of the companies that start building their application using a NoSQL database and then start building their own implementation of SQL on top of it.
Ironically, I really like cloudflare but actively dislike workers and avoid them when possible. R2/KV/D1 are all fantastic and being able to shard customer data via DOs is huge, but I find myself fighting workers when I use them for non-trivial cases. Now that Cloudflare has containers I'm pushing people that way.
Hey! Bet I can guess who
In that scenario, how do you keep cold startup as fast as possible?
The nice thing about JS workers is that they can start really fast from cold. If you have low or irregular load, but latency is important, Cloudflare Workers or equivalent is a great solution (as the article says towards the end).
If you really need a full-featured container with AOT compiled code, won't that almost certainly have a longer cold startup time? In that scenario, surely you're better off with a dedicated server to minimise latency (assuming you care about latency). But then you lose the ability to scale down to zero, which is the key advantage of serverless.
Apparently not nice enough, given that they rewrote the application in Go.
Serverless with containers is basically managed Kubernetes, where someone else has the headache to keep the whole infrastructure running.
Cloudflare has containers now too, and having used AppRunner and Cloud Run, it's much easier to work with. Once they get rid of the container caps and add more flexibility in terms of container resources, I would never go back to the big cloud containers, the price and ease of use of Cloudflare's containers just destroy them.
I doubt that the bill would be that much cheaper, nonetheless thanks for making me aware they are a thing now.
They're much cheaper, they're just DOs, and they get billed as such. They also have faster cold start times and automatic multi-region support.
What does DO mean in this context?
Durable Object
Indeed.
They get to the bottom of the post and drop:
> Fargate handles scaling for us without the serverless constraints
They dropped workers for containers.
You're saying serverless can have really low latency and be fast 24/7?
Isn't serverless, at its base, the old shared-VM model, except with a ton of people?
I'm old school I guess, baremetal for days...
Yes, check Cloud Run, AWS Lambda, Azure Functions with containers.
> We built chproxy specifically because ClickHouse doesn't like thousands of tiny inserts. It's a Go service that buffers events and sends them in large batches. Each Cloudflare Worker would send individual analytics events to chproxy, which would then aggregate and send them to ClickHouse.
While I understand how this isn't the only thing that needed to be buffered, for Clickhouse data specifically I'd be curious why they built a separate service rather than use asynchronous inserts:
https://clickhouse.com/docs/optimize/asynchronous-inserts
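Per those docs, the built-in buffering is roughly a per-insert settings change; the table and columns below are hypothetical:

```sql
-- ClickHouse batches these server-side instead of creating a part per insert.
INSERT INTO events (ts, user_id, name)
SETTINGS async_insert = 1, wait_for_async_insert = 0
VALUES (now(), 42, 'page_view')
```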
I think someone should make a timeline of software technology eras, each beginning with 'why XYZ is the future' and ending with articles like this.
Serverless seems to me like the tech version of the classic sales technique. You see it every Black Friday, get them in the door with the cheap alluring offers, then upsell the shit out of them. And in the case of serverless, lock them with as many provider specific services as possible.
I feel like we're headed the same direction with AI right now. It creates as many problems as it solves, and the solution is always to pay them for the newer faster model and use more tokens. With all the new AI security threats coming to light we'll start seeing them offer solutions for that too. They'll sell you the threats and the solutions, and the entire developer community will thank them for it.
As a soon-to-be graybeard, I think this has been fairly obvious from the start. Outside of specific workflows, you are adding unneeded complexity to a system that does not need it. In general, an anti-pattern, but it does have valid uses in some cases.
On the last project I worked on that came to involve serverless, it made no sense at all other than it was "the fad".
For this system we had excellent knowledge of what the theoretical limit of users and connections was.
This apparently needed to be done with containers, serverless, Kafka, blah blah.
Annoyed with the whole thing, I took a few nights to tear the logic out from the microservices (or nanoservices) and wrapped the whole thing into a Frankenstein monolith.
At least 60% of the code dealt solely with passing information around between the different services, supposedly so it was easier to maintain. Well, my hacked-together monolith was not a great start for anything but a demo.
I installed Postgres on my laptop, ran the monolith on it, pointed three servers at it, each pushing the theoretical maximum load we would have, and what do you know, the performance was fine. But the architecture was the architecture decided upon.
One thing I could not find in the write-up is the change in the expense. Did serverless save any money, compared to always-up VMs? Did much of their load run under the free tier limits?
Serverless shines when the load is very spiky, and you can afford high long-tail latency. Then you don't pay for all that time when your server would be idling. (This is usually not the case for auth APIs, unless they auth other infrequently invoked operations.)
"Taking hand off of boiling kettle decreased anxiety and increased focus in 97% of study participants."
Stephen King's Dark Tower series never resonated with me and I got stuck in book two. But it has one of my favorite philosophical insults of all time:
"Those who [X] have forgotten the faces of their fathers."
I feel like there's a collective amnesia just beginning to wear off as people remember the Fallacies of Distributed Computing and basic facts about multitasking. And that amnesia absolutely feels to me as if everyone has forgotten the faces of their fathers. <waves cane threateningly>
30ms p99 for a cache read! Serverless might have been a problem, but I'm not sure it was the problem. In my experience a p99 of 2ms is more typical - 30ms is the sort of time I'd expect for p99 on a database query in production serving.
You don't need process-local caches to get the sort of performance they're looking for, and there are good reasons why most teams avoid stateful processing, it's much harder to get right and has bad failure modes.
Fun fact: the overhead for a DB query (not including the network latency) is about 250 microseconds (0.25 ms). And a typical DB kernel can do at least 16 MB per core-second of throughput. So that 30 ms query, I would expect, is scanning several GB of data to find its results. Try doing that yourself and your latency would probably grow into seconds. Systems programming is hard...
PS moving from stateful processing to something with a distributed cache is moving from RAM to network. That's at least a 10x drop off in throughput.
After building my first serverless Cloudflare Workers app, this is why I migrated to Deno. Deno enables you to run the same codebase in Deno (self-hosted/local) and in Deno Deploy (Deno's serverless platform).
I wanted my app to be self-hostable as well, and Cloudflare Workers is a hard ecosystem lock-in to their platform, which makes it undesirable (imo).
Here is a link to my reasoning from back then: https://github.com/K0IN/Notify/pull/77#issuecomment-16776070...
I ported my worker project into Django since cloudflare workers wouldn’t allow selection of region for hosting workers which is generally required due to data compliances. This is something all cloud providers provide from day one yet cloudflare made it an enterprise feature.
Also, the vendor lock-in doesn't help, with Durable Objects and D1 instead of simply doing what Supabase and others are doing by providing Postgres or standard SQLite as a service.
Hey "vercel security checkpoint", I'm repeatedly getting a code 99 "Failed to verify your browser" on an iphone with a VPN exits in California and Canada via proton VPN and Firefox Focus.
What gives?
``` sfo1::1760587368-8k6JCK3uO27oMpuTbnS4Hb3X2K9bVsc ```
"Self-Hosting : Being tied to Cloudflare's runtime meant our customers couldn't self-host Unkey. While the Workers runtime is technically open source, getting it running locally (even in dev mode) is incredibly difficult.
With standard Go servers, self-hosting becomes trivial:"
A key point that I always make: serverless is good if you want a simple periodic task to run intermittently without worrying about a full-time server. The moment things get more complex than that (which in the real world they almost always do), you need a proper server.
Author of that blog here, happy to answer any questions :)
Have you done new benchmarks since Cloudflare announced their latest round of performance improvements for Workers?
Just curious if this workload also saw some of the same improvements (on a quick read it seems like you could have been hitting the routing problem CF mentions)
Really great writeup. The charts tell the story beautifully, and the latency gains are surely a win for your company and customers. I always wonder about the tradeoffs. Is there a measurable latency difference for your non-colocated customers? What does maintenance look like for your Go servers? I assume that your Cloudflare costs dropped?
It's faster for non-colocated customers too, weirdly.
I think it's because connections can be reused more often. Cloudflare Workers are really prone to doing a lot of TLS handshakes, because they spin up new ones constantly.
Right now we're just using AWS Fargate for the Go servers, so there really isn't much maintenance at all. We'll be moving that into EKS soon though, cause we're starting to add more stuff and need k8s anyways.
Not a question: thanks for the writeup and for the honesty of saying that serverless is not inherently bad, just not the right fit for your usecase!
Unfortunately too many comments here are quick to come to the wrong conclusion, based only on the title. Not a reason to change it though!
Thanks
It’s totally fair criticism that the title and wording is a bit clickbaity
But that’s ok
Do you have a clearer picture of what use-cases you would use serverless functions for in the future (if any)?
I would love to know the net result to their financials for this move. I have no doubt they were able to improve their performance. I'm just wondering if the juice was worth the squeeze, especially if they could have been building other features that customers would want. I didn't read anything about the opportunity cost in the article or even the consideration.
Linux servers running Go apps? Would be nice to see server cost and specs, backup strategy, etc.
What do you find so peculiar about it? A lot of people are running Go apps on VPSs.
Next article - why we switched from our own servers to serverless for reliability. A small performance hit was worth it.
TFA states that they’re running on AWS Fargate.
That said, as an example, an m8g.8xlarge gives you 32 vCPU / 128 GiB RAM for about $1000/month in us-east-1 for current on-demand pricing, and that drops to just under $700 if you can do a 1-year RI. I’m guessing this application isn’t super memory-heavy, so you could save even more by switching to the c-family: same vCPU, half the RAM.
Stick two of those behind a load balancer, and you have more compute than a lot of places actually need.
Or, if you have anything resembling PMF, spend $10K or so on a few used servers and put them into some good colo providers. They’ll do hardware replacement for you (for a fee).
They just use two servers and configure a load balancer within Cloudflare. Come on. Self-hosting is no rocket science. You don't have to make it seem complicated. People were doing this for decades before AWS invented serverless.
Yet, idiots remain.
Backup strategy? What do you mean by that?
Servers go down. What is the plan to get them "backup" and running ;)
They probably don't need one for the application servers. And they probably already have a backup strategy for their DBs.
Most server outages are caused by hardware failures which EC2 MOSTLY abstracts from you.
Also, if it's just Golang, point Ansible or whatever deploys at new server and trigger a deploy.
But is this not needed with the so-called cloud systems?
Only if that system is stateless. If you have any sort of internal memory that sticks around between requests, then either you face a cold start problem (because of empty caches) or you somehow need to persist that state somewhere. And persisting that state either means you need a backup solution or your latency is terrible because you are hitting network for something that only needs to hit RAM.
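Concretely, the state in question is often nothing fancier than a process-local TTL cache that answers from RAM instead of making a network hop; a minimal sketch:

```python
import time

class TTLCache:
    """Tiny in-process cache: ~nanoseconds to read, gone on a cold start."""

    def __init__(self, ttl_seconds: float = 60.0):
        self.ttl = ttl_seconds
        self._data: dict = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired: caller falls back to the slow path

    def put(self, key, value) -> None:
        self._data[key] = (value, time.monotonic())
```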
I'm assuming "High Availability" is what is really meant here.
About 5 years ago we made an API that took an image and gave you a lat/lon in return (basically a visual GPS).
The front end was entirely in Python AWS Lambdas, with a message queue that talked to the (slow) GPU backend.
The "lambda tax" was about 100 ms on average. Given that an average request was around 4 seconds, it seemed OK.
If it’s got to be serverless, then use a PaaS that takes Docker images as its input format. That at least gives some level of platform portability if you need to shift later.
But yeah, in this case the 10ms requirement doesn’t leave a lot of room for anything elaborate.
Incredible that these kinds of services were hosted like this.
I guess they never came out of MVP, which could warrant using serverless, but in the end it makes 0 sense to use some slow solution like this for the service they are offering.
Why didn't they go with a self-hosted backend right away?
It's funny how nowadays most devs are too scared to roll their own and just go with the cloud offerings that cost them tech debt and actual money down the road.
We did initially, but thought Cloudflare was a better solution for scalability and latency.
We believed their docs/marketing without doing extensive benchmarks, which is on us.
The appeal was also being able to use the same TypeScript stack across everything, which was nice to work with.
Where did their marketing or documentation say this service is perfect for low latency APIs?
I doubt they literally said “perfect for low latency APIs”, but their messaging is definitely trying to convince you that they’re fast globally; just look at the workers.cloudflare.com page.
It sounds like you picked the wrong platform because you didn’t understand what you were doing.
That’s not a technology issue.
30ms P99 does not a cache make.
Source: I work somewhere where you can easily get 1ms cached relational DB reads from outside the service.
30ms makes me suspect it went cross region.
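If you wanted to sanity-check that suspicion, a crude probe goes a long way. A generic Go sketch (the time.Sleep is just a stand-in for whatever read you're actually measuring); rough percentiles are enough to tell the cases apart:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// probe times fn n times and reports p50/p99. Rough numbers are all
// you need to distinguish an in-process read (microseconds), a
// same-region network hop (~1ms), and a cross-region round trip
// (tens of milliseconds).
func probe(n int, fn func()) {
	durs := make([]time.Duration, n)
	for i := range durs {
		start := time.Now()
		fn()
		durs[i] = time.Since(start)
	}
	sort.Slice(durs, func(i, j int) bool { return durs[i] < durs[j] })
	fmt.Printf("p50=%v p99=%v\n", durs[n/2], durs[n*99/100])
}

func main() {
	// Stand-in workload: replace with the real cache/DB read.
	probe(1000, func() { time.Sleep(time.Millisecond) })
}
```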
Interesting writeup. The serverless approach helped with GTM. (I speculate) raising capital afforded them extra devs who noticed the cache latency.
> The serverless approach helped with GTM
Unlikely? They could've just as well deployed their single Go binary to a VM from day one and it would've been smooth sailing for their use case while they acquired customers.
The Cloudflare Workers they chose aren't really suited to the latency-critical, high-throughput APIs they were designing.
We all love vendor lock-in, don’t we? Until it backstabs us and we go back to VMs.
I think this is what is being said:
"Down with serverless! Long live serverless!"
Many organizations would benefit from just running a well-architected monolith. Amazon pushed microservices hard internally, but they face scale challenges most companies do not have and will never have.
Seems like serverless is ideal for that sub-$10/month budget sweet spot, when you can't yet afford a VPS.
That mythical sweet spot where you can afford extra development work, but cannot afford $10/month?
da rulez:
- if you don't understand your concept/market yet, build on a VPS; you can get away with scaling a VPS for a while.
- if you intend to be Netflix, rent at the edge (mid game) and eventually own the edge servers in POPs (otherwise, "edge" compute isn't worth it). Before that, start with a beefy VPS cluster with HA and a SQL proxy.
- if you spin down to zero, use lambdas (for things like CodeBuild or fractional needs of a computer). Before that, build things on VPSes.
- if you spin up/down but not to zero, use container platforms?
- once you have a reliably steady understanding of your infrastructure, buy physical servers in a colo.
Somewhere in Denmark, DHH is smiling
I often don't know what to make of DHH. He's a living contradiction. On one hand he will continually rant about how bad the overhead and waste of cloud services is, and on the other hand he will staunchly defend the most inefficient programming language regularly used for backend development, as well as defend the enormous overfetching that Active Record leads to.
Really I think DHH just likes to tell others what he likes.
In all fairness, the performance penalty for virtualization is 4x and the penalty for interpreted code is 1.5x, so he comes out ahead, but it's more in a "broken watch is right twice a day" sort of way.
Gives him a break from writing out of touch screeds about countries he knows nothing about I guess.
If they use Cloudflare that automatically disqualifies them from me reading whatever they wrote, or caring about it.