Something I wish we could have is some kind of peer mirror of archive.org. The main IA web application gets angry pretty quickly if you're trying to click through a few different dates. If there were some kind of way to slowly mirror (torrent-style) and offer pages as a peer of archive.org, that would be neat. It would be cool to show up as an alternative source for the data, and the archive.org app could fetch it from there at the user's choice and validate the checksum if required.
I've ended up just keeping my own ArchiveBox, and it's an all right experience, but it's only useful for things I knew I wanted to archive. For almost everything else I go to the IA, which has so much.
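On the "validate the checksum" part: something like this would be enough on the client side, since archive.org's item metadata endpoint already publishes per-file checksums. A minimal sketch, assuming a hypothetical peer mirror host and item identifier (both placeholders here):

```python
import hashlib
import json
import urllib.request

ITEM = "example-item-identifier"          # hypothetical item name
PEER_MIRROR = "https://peer.example.org"  # hypothetical peer mirror host

def expected_checksums(item):
    """Map file name -> sha1 as published in the item's metadata on archive.org."""
    with urllib.request.urlopen(f"https://archive.org/metadata/{item}") as resp:
        meta = json.load(resp)
    return {f["name"]: f.get("sha1") for f in meta.get("files", [])}

def fetch_from_peer_and_verify(item, filename):
    """Download a file from the peer mirror and reject it if the digest doesn't match."""
    with urllib.request.urlopen(f"{PEER_MIRROR}/{item}/{filename}") as resp:
        data = resp.read()
    digest = hashlib.sha1(data).hexdigest()
    if digest != expected_checksums(item).get(filename):
        raise ValueError(f"checksum mismatch for {filename}; discard peer copy")
    return data
```

The point is that a peer only needs to serve bytes; the trust can still come from IA's published digests.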
- I can confirm that the web archive can be really slow
- I think I have seen reports that AI scrapers create a bandwidth bottleneck
- For some digital archives you need to create a research account (I think Common Crawl works like that)
- The data can easily become very big. The goal is to store many things: not just the Internet, but the Internet with an added dimension of time
- Since there is so much data, it is difficult to navigate and search, so it can easily become unusable
- That is partly why I created my own metadata database; I needed some information about domains
Link:
https://github.com/rumca-js/Internet-Places-Database
Is there such thing as "versioned" torrents? Assuming you have the right PGP key you could mix bittorrent and packaging systems to get an update-able distribution
There is the BitTorrent v2 standard: https://blog.libtorrent.org/2020/09/bittorrent-v2/
But unfortunately most FOSS torrent clients do not support it, partly because at release libtorrent 2.0.x had poor I/O performance in some cases, so clients reverted to the 1.2.x branch.
A couple of BEPs related to updating torrents:
https://www.bittorrent.org/beps/bep_0039.html https://www.bittorrent.org/beps/bep_0046.html
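If client support for those BEPs isn't there, one way to approximate the "mix BitTorrent and packaging systems" idea is to publish a small signed manifest mapping a version number to the latest infohash, and have mirrors poll and verify it before switching torrents. A minimal sketch, assuming the `cryptography` package; the manifest layout and field names are made up for illustration and this is not BEP 39 or 46:

```python
import json
from cryptography.hazmat.primitives.asymmetric.ed25519 import (
    Ed25519PrivateKey, Ed25519PublicKey,
)

# Publisher side: sign a tiny manifest mapping a version to the latest infohash.
def sign_manifest(private_key: Ed25519PrivateKey, version: int, infohash: str) -> dict:
    payload = json.dumps({"version": version, "infohash": infohash},
                         sort_keys=True).encode()
    return {"payload": payload.decode(), "sig": private_key.sign(payload).hex()}

# Mirror side: accept an update only if the signature verifies and the version
# strictly increases (prevents rollback to an older snapshot).
def verify_manifest(public_key: Ed25519PublicKey, manifest: dict,
                    last_seen_version: int):
    payload = manifest["payload"].encode()
    public_key.verify(bytes.fromhex(manifest["sig"]), payload)  # raises if invalid
    data = json.loads(payload)
    if data["version"] <= last_seen_version:
        return None  # stale or replayed manifest
    return data

# Example round trip with a freshly generated key pair.
priv = Ed25519PrivateKey.generate()
manifest = sign_manifest(priv, version=2, infohash="hypothetical-infohash-v2")
print(verify_manifest(priv.public_key(), manifest, last_seen_version=1))
```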
A torrent would probably suffocate under the small-file distribution. I'm not sure how the ROM-set torrents work, but I thought they were versioned.
But torrent is probably the wrong tech. I’m sure there would be many players willing to host a few TB or more each, which could be fronted via something so it’s transparent to the user.
But a better option might be a subscription model; anything else will be slammed by crawlers.
Hi, I run the datacenter/infrastructure team at the Internet Archive! We would love to see you at our various events this fall but if paying for the ticket is difficult for you, please email me (in bio) and we'll get you in (if possible).
Are the events distributed all around the world, or just wherever the team is gathered (San Francisco, I guess)?
By the way, thank you to all the teams at IA; what you provide is such an important thing for humanity.
Would love technical details around this feat, e.g. how you even crawl to begin with, storage, etc.
I would love to work for IA but openings are rare
If you are in Europe, consider Software Heritage (similar to IA but for source code) too:
https://www.softwareheritage.org/jobs/
The Internet Archive now has a presence in Amsterdam.
What events are we talking about here?
Probably these: https://blog.archive.org/events/
Hey, Q., so what's the size of the Internet Archive?
I'm betting an exabyte, or close to it.
It is large enough that I wonder whether the data, captured as actual physical magnetic charges, has a heft a person could feel. Obviously the hardware would fill a house or something, but at what point does the world's data become a discernible physical reality, at least in theory?
Most of all, I'm curious how you reliably and securely store and host so many archived pages. Would you mind briefly explaining such a huge undertaking? Also, congratulations on this fantastic achievement. You guys are my go-to for so much information.
Edit: And how many terabytes it all amounts to.
We all know the NSA has access to servers hosted in the U.S. How are you protecting the archive from malicious tampering? Are you using any form of immutable storage? Is it post-quantum secure?
Why would they do that? Have you previously seen a case where they "maliciously tampered" with anyone's website?
I just question the integrity and immutability of the data IA is archiving, that's all
You want to know why they'd tamper with the data? I don't know.
https://blog.archive.org/2018/04/24/addressing-recent-claims...
The NSA already paid to backdoor RSA, got caught shipping pre-hacked routers, can rewrite pages mid-flight with QUANTUM, and can penetrate and siphon data from remote infected machines... what else could they do?
https://www.amnesty.org/en/latest/news/2022/09/myanmar-faceb...
If anyone wants to help feed in more stuff, ArchiveTeam is a related volunteer group that sends data to IA:
https://archiveteam.org/
1 trillion web pages archived is quite an achievement. But... there's no way to search them? You have to know what URL you want to pull from the archive, which reduces the usefulness of the service. I'd like to search through all those trillion pages for, say, the name of an artist, or a filename, or image content.
That would be hell to index
I imagine it would be no different than current indexing strategies with a temporal aspect baked in... it would act almost like a different site, and maybe roll up the results after the fact by domain
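For what it's worth, the URL-and-time index already exists as the Wayback Machine's CDX API, so the missing piece is really the full-text side. A minimal sketch of enumerating captures for a host over a date range and rolling them up by year; the parameter names are taken from the public CDX API as I remember it, so treat them as assumptions:

```python
import json
import urllib.parse
import urllib.request
from collections import Counter

def captures_by_year(host: str, start: str, end: str, limit: int = 1000) -> Counter:
    """Count Wayback captures of a host per year via the CDX index."""
    params = urllib.parse.urlencode({
        "url": host,
        "matchType": "host",   # assumed: match every URL on this host
        "output": "json",
        "from": start,         # e.g. "2015"
        "to": end,             # e.g. "2020"
        "limit": limit,
    })
    with urllib.request.urlopen(
        f"https://web.archive.org/cdx/search/cdx?{params}"
    ) as resp:
        rows = json.load(resp)
    if not rows:
        return Counter()
    header, entries = rows[0], rows[1:]           # first row is the column header
    ts_index = header.index("timestamp")
    return Counter(row[ts_index][:4] for row in entries)  # bucket by year

print(captures_by_year("example.com", "2015", "2020"))
```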
If it was a commercial problem, e.g. from Google, it would be solved.
The reality is that many things don't exist simply because someone isn't paid to do it.
Given how much AI companies have benefited by leeching off IA and Common Crawl, it's a shame there isn't at least some money flowing back in.
Consider the privacy implications of that. It would effectively create a parallel web where `robots.txt` counts for nothing and where it becomes - retroactively - impossible to delete one's site. Yes, there's ultimately no way to prevent it happening, given that the data is public. But to make the existing IA searchable is IMO just a terrible idea.
Actually, I believe the IA respects robots.txt retroactively, e.g. putting something on the disallow list now removes the same page's scrapes from a year ago from public access in the Wayback Machine, but I'd love to be corrected on that.
I use GPT web search, and I usually ask it to find textbooks from IA. It works really well for textbooks, but I'm not sure about web pages.
https://hoarding.support/
I wonder if Internet Archive and Common Crawl have worked together?
How does their scope or infrastructure compare?
I know they serve different purposes, but both are essentially doing similar things.
I think IA ingests crawl WARCs from CC, as well as other groups like ArchiveTeam.
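Both projects exchange crawls as WARC files, so the data interoperates regardless of who ran the crawl. A minimal sketch of walking one, assuming the `warcio` package and a placeholder local file path:

```python
from warcio.archiveiterator import ArchiveIterator

# Walk a (possibly gzipped) WARC file and print the URL and size of each
# response record. "example.warc.gz" is a placeholder path.
with open("example.warc.gz", "rb") as stream:
    for record in ArchiveIterator(stream):
        if record.rec_type == "response":
            url = record.rec_headers.get_header("WARC-Target-URI")
            body = record.content_stream().read()
            print(url, len(body))
```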
Congratulations!
I wonder if OpenAI has archived more pages by now.
Related blog post inviting stories:
https://blog.archive.org/2025/09/23/celebrating-1-trillion-w...
A great milestone for internet history!
Is there an index of all these pages?
I'm kinda surprised IA hasn't long been shut down by copyright chasers.
And for single page archives I tend to use archive.is nowadays. For as long as I can remember, IA has been unusably slow.
But still kudos to them for the effort.
I very much don't get how all of the show "King of the Hill" is up on there.
I thought this was going to be a technical article but there was nothing in it
Seeing some stats would be fun. I wonder what the amount of data is here. And the distribution would be interesting too, especially since some pages are archived at multiple points in time, and pages have been getting heavier these days.
I was hoping this would include a talk by Jason Scott (@textfiles); his talks are always so much fun.
Would be nice to have visit statistics per domain, so people who host live sites could see who visits what on archive.org under their domain vs. their live site :).
How do you prevent government (and other people who can access the data) from rewriting history?
Do you hash them in some sort of blockchain?
The inability to rewrite history will be a fantastic gift to the world.
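You don't need a full blockchain for tamper evidence; a plain hash chain over capture digests already makes any rewrite of an old page detectable, as long as someone independent keeps a copy of the chain head. A minimal sketch (illustrative only, not how IA actually stores things):

```python
import hashlib
from dataclasses import dataclass

@dataclass
class Link:
    url: str
    capture_digest: str   # hash of the archived page body
    prev_hash: str        # hash of the previous link in the chain
    link_hash: str        # hash over this link's contents

def append(chain: list, url: str, body: bytes) -> list:
    """Add a capture to the chain, binding it to everything that came before."""
    prev_hash = chain[-1].link_hash if chain else "0" * 64
    capture_digest = hashlib.sha256(body).hexdigest()
    link_hash = hashlib.sha256(
        f"{url}|{capture_digest}|{prev_hash}".encode()
    ).hexdigest()
    chain.append(Link(url, capture_digest, prev_hash, link_hash))
    return chain

def verify(chain: list) -> bool:
    """Recompute every link; any rewritten capture breaks all later hashes."""
    prev = "0" * 64
    for link in chain:
        expected = hashlib.sha256(
            f"{link.url}|{link.capture_digest}|{prev}".encode()
        ).hexdigest()
        if link.prev_hash != prev or link.link_hash != expected:
            return False
        prev = link.link_hash
    return True

chain = append([], "https://example.com", b"<html>original</html>")
chain = append(chain, "https://example.com", b"<html>later capture</html>")
print(verify(chain))  # True; altering any stored body or hash makes this False
```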
Yeah, but their view and download metrics are flat-out wrong all the time. If they weren't a nonprofit, they'd be sued for that. But still a great organization, and a place for obsolete AWS equipment to retire.
What do you mean?