What my pattern-matching eyes immediately spotted is that the hn username that posted this is rabinovich. The linked article speaks about Masha Rabinovich. Maybe a coincidence.
> in a 2012 F-Secure forum post, a “masharabinovich” complains about “my website http://archive.is/” being blacklisted. They pop up on Wikipedia as well getting told off for adding too many links to archive.is, including a mention that they’re using the Czech ISP fiber.cz
This feels like the start of a treasure-hunt-like game, between the username rabinovich (as others have pointed out) and the prior submission by rabinovich of an archive.today-like tool 3 months ago: https://ghostarchive.org/. When you click into the search query examples for ghostarchive, such as https://ghostarchive.org/search?term=https://docs.google.com, many of the documents are very weird indeed.
Remember when Archive.is/today used to send Cloudflare DNS users into an endless captcha loop because the creator had some kind of philosophical disagreement with Cloudflare? Not the first time they’ve done something petty like this.
That's still a thing. Happens to me as we speak.
For me it just doesn't resolve at all on Cloudflare dns. So annoying.
Hmm. If it is an attempt at DDoS attacks, it's probably not very fruitful:
Viewing the first IP address on https://bgp.he.net/ip/192.0.78.25 shows AS2635 (https://bgp.he.net/AS2635) is announcing 192.0.78.0/24. AS2635 is owned by https://automattic.com aka wordpress.com. I assume that for a managed environment at their scale, this is just another Wednesday for them.
I believe they're probably trying to get the blog suspended (automatically?) hence the cache busting; chewing through higher than normal resources all of a sudden might do the trick even if it doesn't actually take it offline.
It occurred to me while reading the article that I could also just have checked the TLS cert. The cert I was given presents "Common Name tls.automattic.com". However, maybe someone will discover bgp.he.net via this :-)
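For anyone who wants to script the certificate check mentioned above, here is a minimal Python sketch using only the standard library. The function names `common_name` and `fetch_cert` and the `sample` dict are made up for illustration; the sample is hand-written in the shape that `getpeercert()` returns, not a real capture from the blog's server.

```python
import socket
import ssl
from typing import Optional


def common_name(cert: dict) -> Optional[str]:
    """Return the subject Common Name from a dict shaped like getpeercert() output."""
    # 'subject' is a tuple of RDNs; each RDN is a tuple of (key, value) pairs.
    for rdn in cert.get("subject", ()):
        for key, value in rdn:
            if key == "commonName":
                return value
    return None


def fetch_cert(host: str, port: int = 443, timeout: float = 5.0) -> dict:
    """Open a TLS connection and return the validated peer certificate as a dict."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()


# Offline example with a hand-written certificate dict (an assumption, not a capture).
sample = {"subject": ((("commonName", "tls.automattic.com"),),)}
print(common_name(sample))
```

Against a live host this would be `common_name(fetch_cert("example.com"))`, with the hostname swapped for whichever site is being checked. It only reports what the certificate claims, so it complements the BGP lookup rather than replacing it.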
good ol' hurricane electric
> maybe someone will discover bgp.he.net via this
I did, thank you!
Add https://bgp.tools to the list
It is using the ?s= parameter which causes WordPress to initiate a search for a random string. This can result in high CPU usage, which I believe is one of the DoS vectors that works on hosted WordPress.
DDoSing but still archiving:
https://archive.is/https://gyrovague.com/2023/08/05/archive-...
Well that is a very silly way to punish the author of an article you don’t want people to know about.
"It’s a testament to their persistence that they’re managed to keep this up for over 10 years, and I for one will be buying Denis/Masha/whoever a well deserved cup of coffee."
https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...
And one where the author's cool with whoever is running archive.today.
> And one where the author's cool with whoever is running archive.today.
I don't think it really matters how "cool" you are with someone while actively trying to doxx them.
Revealing publicly available information (actually publicly available, in the sense of "any person can easily look this up", not "publicly available" in a sense of "publicly available in leaked databases", which actual doxxers use as an excuse for their actions) isn't doxxing.
OP frames this like they just stumbled across the blog post but they created an account matching the name discussed within it three months ago?
I’m confused.
Sometimes HN admins revive quality posts that didn't get much traction when they were first posted. When this happens, the timestamps are updated to make the post look new.
I can't say for sure whether this is what happened here, but it is a possible explanation.
https://news.ycombinator.com/item?id=45922875
“Behind the complaints: Our investigation into the suspicious pressure on Archive.today”
Given it's set to generate random pages on the site, is there even any possible explanation for this that isn't sketchy?
It's not random; setting the query string to a new value on every fetch is a cache-busting technique. It's trying to prevent the browser from caching the page, presumably to increase bandwidth usage.
It's trying to prevent the server from caching the search. Thousands of different searches will cause high CPU load, and WordPress might decide to suspend the blog.
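To make the server-side caching point concrete, here is a toy Python simulation. It assumes a cache keyed by the search term; `SearchCache` and `random_term` are hypothetical names, and none of this is WordPress code. It sends no network traffic. It only counts how often a repeated term versus a fresh random term would reach the expensive search path.

```python
import random
import string


class SearchCache:
    """Toy server-side cache keyed by search term (illustrative, not WordPress)."""

    def __init__(self):
        self.store = {}
        self.hits = 0
        self.misses = 0

    def search(self, term):
        if term in self.store:
            self.hits += 1
        else:
            # A miss stands in for the expensive database search.
            self.misses += 1
            self.store[term] = f"results for {term}"
        return self.store[term]


def random_term(rng, length=8):
    """Build a random lowercase string, like a randomized ?s= value."""
    return "".join(rng.choice(string.ascii_lowercase) for _ in range(length))


rng = random.Random(0)

# Same term every time: only the first request does real work.
repeated = SearchCache()
for _ in range(1000):
    repeated.search("archive")

# Fresh random term every time: almost every request does real work.
unique = SearchCache()
for _ in range(1000):
    unique.search(random_term(rng))

print("repeated:", repeated.hits, "hits,", repeated.misses, "misses")
print("unique:  ", unique.hits, "hits,", unique.misses, "misses")
```

The repeated term misses once and is then served from the cache, while the random terms miss on essentially every request. That is why a randomized query string defeats this kind of caching.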
There's really no interpretation of this which isn't malicious, although, not to defend this behaviour whatsoever, I'm not entirely surprised by it. The only real value of archive.is is its paywall bypassing abilities and, presumably, large swaths of residential proxies that allow it to archive sites that archive.org can't. Only somebody with some degree of lawlessness would operate such a project.
It's not just for paywall bypassing. Sometimes there are archive.today snapshots that aren't in the Wayback Machine (though I think your overall point about lawlessness still stands).
For example, there was some NASA debris that hit a guy's house in Florida and it was in the news. [1] Some news sites linked to a Twitter post he made with the images but he later deleted the post. [2]
The Wayback Machine has a ton of snapshots of the Twitter post but none of them render for me. [3]
But archive.today's snapshot works great. [4]
[1] https://www.bbc.com/news/articles/c9www02e49zo
[2] https://xcancel.com/Alejandro0tero/status/176872903149342722...
[3] https://web.archive.org/web/20240715000000*/https://twitter....
[4] https://archive.md/obuWr
Archive.today has a different approach to the baseline archive technology (executing javascript at archival time and saving the DOM instead of saving and replaying server responses verbatim). Additionally, Archive.today employs a number of site specific mitigations which aren't visible to the end user. In some cases, for instance, they use accounts, but then retroactively modify the DOM to mask this mitigation. [0] While the exact strategy they use for Twitter isn't known to me, they are doing something by their own admission. [1]
[0] https://blog.archive.today/post/708008224368001024/why-isnt-... compounded with personal observation.
[1] https://blog.archive.today/post/708565142782246912/pretty-pl...
Not excusing this malicious behavior, but I have to say, the mentioned blog post is a major dick move, too. Got quite the impression of a passive aggressive undertone, and there is clearly bittersweet irony in collecting and "archiving" an archiver's personal information from long ago traces. Maybe it's all some feud between two dicks, some backstory untold. Maybe the blog author wanted some information gone from archive.today, but was denied.
Blog post author here. Nope, I was just curious, since it's quite remarkable how huge archive.today is, how widely it's used, and how little we know about it. I do acknowledge the irony of an archiver being upset by an archive of their own work though :)
All that said, the post does not actually dox anyone (as far as I can tell, every name mentioned is an alias or red herring), and the "investigation" was basically punching things into my favorite search engine and seeing what came up. If a nation state level threat actor or even one of the copyright cabals wanted to find the maintainer, they have much better ways of going about it.
Perhaps, and yet I've referenced this article numerous times over the years. The most important property of an archive is that it saves an authentic copy of the source material—that is to say, the archive must be trusted. If archive.today is indeed a legitimate archival source first and foremost as it purports to be, the user has a reasonable interest in investigating the people behind it so that they can come to an informed conclusion about if they can trust the archive or not.
Pretty sure that blog is hosted on Wordpress.com infrastructure so it's not like the blog owner would even notice unless it generates so much traffic that WP itself notices.
That said, I don't think there are many non-malicious explanations for this. I would suggest writing to HN (hn@ycombinator.com) to see about blocking submissions from the domain.
Gyrovague here, author of the targeted blog post:
https://gyrovague.com/2023/08/05/archive-today-on-the-trail-...
In the past week or so, I have received a GDPR takedown attempt of the archive.today blog post (which my hosting provider rightly rejected), a politely worded request to take it down (which was sadly eaten by my spam filter), and now this (thanks to the HN reader who tipped me off).
Given that the proverbial cat has been out of the bag for 2.5 years at this point, I'm genuinely puzzled as to what they're hoping to achieve, but this does not seem like a very good way of going about it.
What did the politely worded request say, was it from the creator?
I will not be sharing any discussions publicly until/unless we come to an agreement, but yes, at least it appeared to be.
Great article, is the attack affecting you in any way?
Do you know when it began?
And what do you think of the account reporting this being named rabinovich, and having been created months ago?
I just tried in my browser (Firefox on Ubuntu) and got the same result. Deeply curious.
Worth blocking the URL for users of that Archive site then, avoid extra burden?
They might need to tweak a single word. Streisand readers won’t have a clue which.
Save the page now and compare a week later.
https://news.ycombinator.com/item?id=46628734 makes some good points; it shouldn't have been downvoted to death
And that's how advertising works, folks. If someone wants a website dead, I want to know more about it.