> So: the matrix.org database secondary lost its FS due to a RAID failure earlier today (11:17 UTC). Then, we lost the primary at 17:26. We're trying to restore the primary DB FS (which could be fastish), while also doing a point-in-time backup restore from last night (which takes >10h). We believe the incremental DB traffic since last night is intact however. Apologies for the downtime; folks on their own homeserver are of course not impacted.
The stuff of absolute nightmares...
https://mastodon.matrix.org/@matrix/115136245785561439
10 hours seems like a long time for a db restore of a chat server. Matrix is still just a chat server, right? I have so many questions that maybe I should keep my nose out of.
[Edit] From another comment, 55TB?!? Holy wat-man...
There's more than you might think...
- Probably thousands of large chatrooms, and hundreds of millions of small chatrooms
- Probably hundreds of millions of messages that include a media upload like an image or video, including countless re-posts of random memes
- Overhead from ratchet algorithm cryptography, as well as additional message metadata that is likely in JSON format
- Huge excesses of messages from bridge bots, spam bots, and malfunctioning utility bots. To give a sense of scale... the entirety of Libera.Chat (the IRC network that succeeded Freenode) used to be bridged to matrix.org, meaning almost every single message from Libera would be copied to matrix.org automatically.
- Everything from other homeservers that federate with matrix.org and have been joined by at least one matrix.org user, including homeservers that no longer exist
There are around 50M rooms, and the dataset goes back to 2014.
However, much of the space is taken up by the Synapse DB schema being particularly denormalised (prioritising performance over disk footprint) - especially caching snapshots of historical key/value state for rooms, which currently takes up ~65x more space than the actual underlying dataset. Ironically, we're looking into that currently, but not fast enough to speed up this DB rebuild.
So that means it should be possible to get the DB size under 1TB, at which point a recovery would take less than an hour rather than >17 hours.
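For anyone curious where that overhead lives on their own instance, here is a minimal sketch (assuming psycopg2 and direct access to the homeserver's Postgres database; the connection parameters are placeholders, not anything from this thread) that lists the ten largest tables - on Synapse deployments the state_groups_state table is usually the one that dominates:

    # Minimal sketch: list the biggest tables in a Synapse Postgres DB.
    # Assumes psycopg2 is installed; connection parameters are placeholders.
    import psycopg2

    conn = psycopg2.connect("dbname=synapse user=synapse host=localhost")
    with conn, conn.cursor() as cur:
        cur.execute("""
            SELECT relname,
                   pg_size_pretty(pg_total_relation_size(relid)) AS total
            FROM pg_catalog.pg_statio_user_tables
            ORDER BY pg_total_relation_size(relid) DESC
            LIMIT 10;
        """)
        for name, size in cur.fetchall():
            print(f"{name:40s} {size}")
    conn.close()

That also makes the arithmetic above concrete: if cached state accounts for ~65x the underlying data, a 55TB database implies roughly 55 / 65 ≈ 0.85TB of actual dataset.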
Thanks for that explanation. It makes a lot more sense now. Back in the day I managed a corporate Jabber + Bandersnatch server that held many years of Jabber + AIM + MSN + ICQ + Yahoo chat, but the database was tiny, which is why I was very surprised to see a chat server take up so much space.
Good luck getting the schema overhead out of the way. I'm sure nowadays you are probably also using faster underlying SSD storage behind the RAID controllers. Dell/HP keep them overpriced, of course, but I found them very much worth it for databases, as did the DBAs.
I hope your on-call teams get to take a week off after that incident.
for the sake of completeness, we got the DB recovered after about 23h of downtime. postmortem will be forthcoming.
See also:
https://www.theregister.com/2025/09/03/matrixorg_raid_failur...
https://www.heise.de/en/news/Matrix-main-server-down-million...
> We are in the process of restoring the matrix.org database from a backup. The matrix.org homeserver will be offline until this has been completed
whoops
Ouch, sounds like bad luck. Best wishes for the recovery.
Best of luck to the team at Matrix/Element for restoring from a nasty outage.
Hope this gets done sooner, horrible outage
Now would be a good time to migrate to a different homeserver :).
Not really possible in a comprehensive way.
> Not really possible in a comprehensive way.
Now you're getting it
The failure to do and test DR.
erm, no? this literally was our DR in the event of losing both primary & secondary DBs, and we'd tested it, and it worked as well as expected.
Love y'all and love matrix. Thanks for the free matrix.org server. But maybe now is the time to research setting up a home server for mission critical stuff?
I don't immediately see an official doc on this; is it right under my nose?
Is this doc good? https://www.redpill-linpro.com/techblog/2025/04/08/matrix-ba...
Matrix.org itself doesn't publish an 'official' way to run a server, given there are multiple implementations and distros out there.
If you're happy using kubernetes, https://element.io/server-suite/community should be a good bet (or https://element.io/server-suite/pro if you are actually doing mission-critical stuff and want a version professionally supported by Element)
If you're happy using docker-compose, then https://github.com/element-hq/element-docker-demo is a very simple template for getting going.
Alternatively, https://github.com/spantaleev/matrix-docker-ansible-deploy is quite popular as a 3rd-party distro using ansible-managed docker containers.
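Whichever of the above you pick, it is worth sanity-checking the new homeserver once it is up. A minimal sketch (using the two unauthenticated endpoints defined by the Matrix spec; matrix.example.com is a hypothetical domain, and the requests library is assumed):

    # Minimal sketch: sanity-check a freshly deployed homeserver.
    # matrix.example.com is a placeholder domain - substitute your own.
    import requests

    BASE = "https://matrix.example.com"

    # Client-Server API: returns the spec versions the server supports.
    resp = requests.get(f"{BASE}/_matrix/client/versions", timeout=10)
    resp.raise_for_status()
    print("Client API OK, spec versions:", resp.json().get("versions"))

    # Federation API: returns the server implementation name and version.
    # Federation defaults to port 8448 unless delegated via .well-known.
    try:
        resp = requests.get(f"{BASE}:8448/_matrix/federation/v1/version", timeout=10)
        resp.raise_for_status()
        print("Federation API OK:", resp.json())
    except requests.RequestException as err:
        print("Federation endpoint not reachable; check .well-known delegation:", err)

The federation check matters most, since a homeserver that cannot federate will not be able to talk to rooms hosted elsewhere.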
Sorry all for the downtime on matrix.org - we're having to do a full 55TB db restore from backup which will take ~17 hours to run. :|
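For a sense of scale, those numbers imply a sustained restore throughput of close to a gigabyte per second - a rough back-of-the-envelope check (pure arithmetic, using only the 55TB and ~17h figures above):

    # Back-of-the-envelope: restoring 55 TB in ~17 hours.
    size_gb = 55 * 1000          # 55 TB expressed in GB (decimal units)
    seconds = 17 * 3600
    print(f"~{size_gb / seconds:.2f} GB/s sustained")  # roughly 0.9 GB/s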
This is the perfect chance to stress that people should choose different homeservers.
But it is hard to trust a random server, if all you know is the name and mean uptime. Mastodon shows the community posts and an introduction by the local admin, before you make an account. Matrix should do the same.
The best way to make it easier for people to choose a homeserver would be to have a complete migration system so that people could easily move their account to a different server if for whatever reason they decide they chose the wrong one.
Isn't the "official" way to just clone the matrix.org branch of Synapse and deploy it as-is, following all the instructions?
nope, because the matrix-org branch of synapse was archived at the end of 2023, so you'd be horribly out of date and insecure; instead the team who wrote Synapse switched to AGPL and now release it at github.com/element-hq.
matrix.org branch is actually 5 months old now: https://github.com/element-hq/synapse/tree/matrix-org-hotfix...
erm, that branch isn’t in use anywhere these days? it used to be what we ran on the matrix.org homeserver, but it’s dead.
Looks like it's going to take a while to come back up.
> Sorry, but it's bad news: we haven't been able to restore the DB primary filesystem to a state we're confident in running as a primary (especially given our experiences with slow-burning postgres db corruption). So we're having to do a full 55TB DB snapshot restore from last night, which will take >10h to recover the data, and then >4h to actually restore, and then >3h to catch up on missing traffic.
https://mastodon.matrix.org/@matrix/115136866878237078