Those of us who explored the digital frontier in the late 1990s and early 2000s will remember a vastly different Internet landscape from what exists today.
At the dawn of the World Wide Web in 1991, there was a single website online from the European Organisation for Nuclear Research (CERN) explaining the vision for its WWW project, still available today at info.cern.ch.
In just a few short years, by the time Yahoo launched in 1994, the number of websites had ballooned to around 3,000.
Today, that number has skyrocketed to nearly 1.09 billion websites, and those feeling nostalgic may want to revisit the old websites, blogs, forums, and online communities they once frequented.
Enter the Internet Archive, a US-based nonprofit that hosts a free digital archive, providing users with the opportunity to view websites and digital content that may have been altered or removed from the Internet over time, thereby preserving a snapshot of history.
Furthermore, it serves as a valuable resource for researchers and academics seeking access to past versions of websites, historical documents, and other pertinent materials.
Teachers can also use the Internet Archive as an educational tool to show the development of Internet technology and culture.
Window into history
The Internet Archive stores snapshots of websites dating back to its founding in 1996, which can be browsed via its aptly named Wayback Machine (web.archive.org).
This includes websites like Geocities, a hosting service that allowed users to create their own personal websites grouped into “neighbourhoods” based on themes (shut down in 2009), Yahoo Answers (ceased operations in 2021), and even The Star Online as it was when it originally launched in 1997.
The Internet Archive and other similar archivers use web crawlers, also known as robots, in their endeavour to capture a snapshot of the Internet.
These web crawlers function much like those used by search engines such as Google or Bing. However, instead of merely indexing webpages for display in search results, they undertake a more comprehensive task.
These web crawlers systematically visit websites, follow every link they encounter and download each page they come across.
This process ensures that the archive contains a snapshot of the Internet, preserving digital content for future access and historical reference.
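For readers curious about the mechanics, the crawl-and-save loop is simple enough to sketch in a few lines of Python. This is a toy illustration, not the Internet Archive's actual crawler (it uses purpose-built software such as Heritrix); the seed URL and the ten-page cap are placeholders.

```python
# A minimal sketch of the crawl-and-archive loop: fetch a page,
# keep a copy, then queue every link found on it.
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkParser(HTMLParser):
    """Collects the href of every <a> tag on a page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seed, limit=10):
    queue, seen, snapshots = [seed], set(), {}
    while queue and len(seen) < limit:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except (OSError, ValueError):
            continue  # unreachable page or malformed link; move on
        snapshots[url] = html  # a real archiver would store a dated copy
        parser = LinkParser()
        parser.feed(html)
        queue.extend(urljoin(url, link) for link in parser.links)
    return snapshots

pages = crawl("https://example.com")  # illustrative seed URL
print(f"Archived {len(pages)} page(s)")
```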
According to the Internet Archive’s Kahle, webpages have a lifespan of about 100 days before being altered or removed. — SEBASTIAAN TER BURG/Flickr
In an interview with Dutch newspaper de Volkskrant, Internet Archive founder Brewster Kahle said that a new snapshot of a webpage is taken every two months, though the nonprofit also stated that for certain websites it can be more or less frequent.
“People trust that these corporations are going to actually be there, and they’re not.
“Google video, this was before YouTube, they collected six or seven million videos from all sorts of people, and they’re all gone. Yahoo also had a video project, gone. Geocities, gone.
“So all of these companies are just going to come and go, so the only thing that’s going to survive are these longer-term organisations, whether they’re universities, libraries, or archives,” said Kahle in a video interview with de Volkskrant.
Users are also able to submit their web address or URL for archival purposes, though the time it takes to show up on the Wayback Machine can vary from a few hours to days.
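The submission can even be scripted: the Wayback Machine exposes its "Save Page Now" feature at web.archive.org/save/ followed by the address to capture. A minimal sketch, with example.com standing in for the page you want saved:

```python
# Submitting a URL to the Wayback Machine's "Save Page Now" endpoint.
from urllib.request import Request, urlopen

def save_page(url):
    req = Request(
        "https://web.archive.org/save/" + url,
        headers={"User-Agent": "archive-example/0.1"},  # identify the client
    )
    with urlopen(req, timeout=60) as resp:
        # On success, the response typically resolves to the address of
        # the freshly captured snapshot.
        return resp.url

print(save_page("https://example.com"))  # illustrative target
```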
“The wonder of the World Wide Web is that people are pouring their lives into it, that it’s got all this richness.
“We have hundreds of millions of people recording, but the life of a webpage is only about 100 days before it’s either changed or deleted.
“So if we don’t save it, it will be gone, it will just melt like the rain. So we collect one copy of every page from every website every two months,” Kahle told de Volkskrant.
However, not all websites want to be part of the archival process – some use a robots.txt file to signal that their pages should not be indexed, though the Internet Archive stopped honouring it in 2017.
As of 2019, according to Google, more than 500 million websites used robots.txt – a prevalence that has likely prompted most archival sites to disregard it – though the Internet Archive does allow website owners to issue removal requests.
Artificial intelligence (AI) companies, which extensively scrape data from websites to train their machine learning models, have also adopted this practice of disregarding robots.txt.
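For reference, robots.txt is just a plain-text file at a site's root listing which crawlers may visit which paths. An illustrative example follows; the paths are made up, and ia_archiver is the user-agent token the Wayback Machine historically honoured for exclusions:

```
# Illustrative robots.txt, served from a site's root (example.com/robots.txt)

User-agent: *           # rules for every crawler
Disallow: /private/     # ask crawlers to skip these paths
Disallow: /drafts/

User-agent: ia_archiver # historically, the Wayback Machine's token
Disallow: /             # ask it to skip the whole site
```

As the article notes, compliance is entirely voluntary: the file is a request, not an enforcement mechanism.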
Who else remembers using Windows 95 PCs and dial-up Internet to get online? — Image by freepik
Other archival sites like Archive.today and Archiveteam (wiki.archiveteam.org) are also dedicated to creating digital archives of the Internet. However, what distinguishes the Internet Archive is its extensive collection, encompassing not just web content but also books, software, videos, audio and images.
While strolling through the wild west of the early Internet, where “design best practices” were non-existent, might sound like a delightful way to spend a leisurely afternoon, the Wayback Machine has many more uses than that.
Beyond browsing
Frequent Wikipedia readers may have encountered a dead link among the citations in an article, leading to either an unavailable or defunct webpage. In such instances, having a copy of that page accessible via the Wayback Machine can be invaluable, with Wikipedia even providing a guide on how to accomplish this for its contributors.
Similarly, academics rely on this practice to combat “reference rot”, preserving sources and citations so that future researchers can still consult the original material.
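Fixing such a link is largely mechanical, since archived copies live at predictable addresses of the form web.archive.org/web/<timestamp>/<original-url>, and the Wayback Machine offers a public availability API for finding the snapshot closest to a given date. A sketch, with an illustrative URL and date:

```python
# Looking up an archived copy of a dead link via the Wayback Machine's
# public availability API.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

def nearest_snapshot(url, timestamp="20050101"):
    """Return the archived URL closest to the given YYYYMMDD date, if any."""
    query = urlencode({"url": url, "timestamp": timestamp})
    with urlopen("https://archive.org/wayback/available?" + query) as resp:
        data = json.load(resp)
    snapshot = data.get("archived_snapshots", {}).get("closest")
    return snapshot["url"] if snapshot else None

print(nearest_snapshot("https://example.com"))  # illustrative dead link
```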
“We also work with national libraries to collect their whole domains, and this is put together into this free service called the Wayback Machine on archive.org, where you can go and look for old websites and surf the Web as it was.
“It turns out that it’s doable, actually, because computers are getting so large that you can store so much information, it’s actually cost-effective to record everything ever written.
“To give you an idea, a book is about a megabyte, and the largest library in the world is the Library of Congress, which has 28 million books, so that’s 28TB.
“That’s four hard drives; you can have all of the words in the Library of Congress on four hard drives,” Kahle said in the video.
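Kahle's back-of-the-envelope numbers hold up: at roughly one megabyte per book, the sums work out as follows (the drive size is illustrative).

```python
# Checking Kahle's estimate, assuming ~1MB of plain text per book.
books = 28_000_000          # Library of Congress holdings, per the quote
bytes_per_book = 1_000_000  # roughly 1MB per book
total_tb = books * bytes_per_book / 1e12
print(f"{total_tb:.0f}TB")  # 28TB, which fits on four 8TB hard drives
```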
In the past, webpages archived by the Internet Archive have been used as evidence in legal cases, with a United States appeals court ruling in 2018 that such webpages constitute a legitimate form of evidence in litigation.
The 2018 case saw a hacker convicted, with archived websites being used as evidence to link the person with a virus and botnet, a collection of hijacked computers commonly used for DDoS (distributed denial-of-service) attacks.
However, this hasn’t spared the Internet Archive from controversy: it has been sued multiple times for alleged copyright infringement over the books it hosts.
In 2016, Nintendo requested that the website remove digital copies of its Nintendo Power magazine, which was discontinued in 2012.
In 2020, it was also sued by four major book publishers: Hachette, HarperCollins, John Wiley & Sons, and Penguin Random House.
In 2023, a judge ruled against the Internet Archive, restricting its ability to lend out scanned copies of books. The nonprofit also reached an undisclosed settlement with the publishers, but it has filed an appeal.
For those seeking a legal source of digital books, Project Gutenberg is a reliable option that hosts over 70,000 ebooks, which users can access for free.
The titles available on Project Gutenberg are older works that have entered the public domain, a concept that gained attention following the copyright expiration of the Steamboat Willie version of Mickey Mouse in January 2024.
A dead link in a Wikipedia citation can often be revived with an archived copy from the Wayback Machine. — Photo by Oberon Copeland @veryinformed.com on Unsplash
Simply put, a work enters the public domain when its copyright expires, allowing media, such as books on Project Gutenberg, to be freely distributed to the public.
This includes classic works of literature such as William Shakespeare’s Romeo And Juliet, Mary Shelley’s Frankenstein, and Lewis Carroll’s Alice In Wonderland.
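Because these works are in the public domain, downloading them is as simple as an HTTP request. A hedged sketch, assuming Project Gutenberg's current plain-text URL pattern and catalogue ID 84 for Frankenstein (both could change over time):

```python
# Fetching a public-domain title from Project Gutenberg as plain text.
from urllib.request import urlopen

def fetch_gutenberg_text(book_id):
    # URL pattern observed on gutenberg.org; the site may reorganise it.
    url = f"https://www.gutenberg.org/cache/epub/{book_id}/pg{book_id}.txt"
    with urlopen(url, timeout=30) as resp:
        return resp.read().decode("utf-8")

text = fetch_gutenberg_text(84)  # Mary Shelley's Frankenstein
print(text[:200])
```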
Project Gutenberg operates as a legitimate provider of free books; Z-Library, by contrast, was an illegal ebook repository that was shut down in November 2022 after the FBI seized its domains.
Z-Library hosted a digital collection of pirated books, with an estimated 13.35 million books and 84.8 million academic articles.
Flashback to fun
For individuals born in the 1990s, the demise of Adobe Flash in 2020 signified not only the end of an era but also the loss of countless Flash games that once entertained during the dial-up Internet days.
Though there are still websites hosting Flash games that can be played directly in the browser, it’s important to note that accessing these sites poses a cybersecurity risk.
With Adobe discontinuing support for Flash and most web browsers subsequently removing the feature, users are left with the option of downgrading to older versions of browsers to access Flash content. However, this practice is not advisable, as it leaves users vulnerable to threats such as malware due to missing security updates.
Because of this, there were fears that games not lucky enough to receive a modern revival, such as Neopets, Uberstrike or Henry Stickmin, could disappear forever.
But thanks to Flashpoint Archive, many of these games remain playable. The preservation project aims to host as many discontinued browser-based games and animations as possible, and to make them accessible without compromising security.
Many of the older Flash titles Flashpoint hosts have been classified as abandonware, as the original creators and rights owners have abandoned the intellectual property, no longer defending its copyright.
In an FAQ, the project says it will comply with content removal requests from copyright owners, which can be sent via email or its Discord channel.
This was the case with titles from game studio Nitrome, which said that it intended to remake its games in HTML5 and requested that its library of games be removed from the collection.
Flashpoint Archive also maintains a page listing curations that are not accepted on the platform due to the requests of copyright holders.
At present, the project has over 150,000 games and 25,000 animations, including familiar titles like The Fancy Pants Adventures, Heli Attack, Age Of War, and Bowmaster: Prelude.
The collection is playable within the Flashpoint program on desktop computers, which supports Windows, Mac, and Linux operating systems.
Users can choose from either Flashpoint Infinity, which downloads games on demand from the archive’s servers, or Flashpoint Ultimate, which is essentially a copy of everything the project has in its collection.
You need to ensure that you have sufficient storage when opting for the latter, as the current version takes up about 1.48TB of space.
In addition to the main collection, there’s also a separate selection of titles for Java mobile games under the Kahvibreak project. This curated assortment comprises nearly 2,000 games, providing users with a diverse array of options for nostalgic gaming experiences.