How to Get Cached Pages From Wayback Machine API

Archive.org’s Wayback Machine has long been a staple in the SEO industry for looking back at cached historical web pages. Each cached page is called a snapshot. It’s great for tracking progress, troubleshooting issues, or, if you are lucky, recovering data. The Wayback Machine GUI is not always quick or frustration-free, though. With the Python steps below, we’ll tap into the free API to return the snapshot nearest to a date we provide. This is great if you don’t know the exact date of the cached page you’re looking for.

I’m not aware of any call limits, but do be kind to their system and only grab what you can really use. See the Wayback API documentation for more information.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax understood
  • Access to a Linux installation (I recommend Ubuntu) or a Jupyter/Colab notebook

Starting the Script

First, let’s install a user agent module so that Wayback’s API is less likely to deny our requests.

pip3 install fake-useragent

Next, we import our required modules. We use requests and fake_useragent to call the API, datetime to build the timestamp, and json and re (regular expressions) to handle the response.

import requests
import json
import re
from datetime import datetime
from fake_useragent import UserAgent

Craft the Wayback API Call

In the code below we build the timestamp for right now, but you could modify this to find the nearest snapshot to any point in history (from Wayback’s inception). The API expects the timestamp as 1 to 14 digits in YYYYMMDDhhmmss format, not as an epoch value, so we format the date with strftime. Be sure to replace the URL below with your site/page.

url = "https://www.importsem.com"
ua = UserAgent()
headers = {"user-agent": ua.chrome}

# Wayback expects YYYYMMDDhhmmss (1 to 14 digits), not an epoch value
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + timestamp
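To look up a date in the past rather than today, a minimal sketch could format a datetime as the 14-digit YYYYMMDDhhmmss value that the availability endpoint documents, and URL-encode the query string so special characters in the target URL survive. The function name build_availability_url and the sample date are illustrative assumptions, not part of the original script.

```python
from datetime import datetime
from urllib.parse import urlencode

API_ENDPOINT = "https://archive.org/wayback/available"

def build_availability_url(page_url, when):
    """Return the availability API URL for page_url near the datetime `when`."""
    # The API documents timestamps as 1 to 14 digits in YYYYMMDDhhmmss form
    timestamp = when.strftime("%Y%m%d%H%M%S")
    # urlencode escapes characters such as ":" and "/" in the target URL
    query = urlencode({"url": page_url, "timestamp": timestamp})
    return API_ENDPOINT + "?" + query

# Hypothetical example: look for the snapshot closest to 1 June 2015
wburl = build_availability_url("https://www.importsem.com", datetime(2015, 6, 1))
print(wburl)
```

The resulting string can be passed to requests.get in place of a hand-concatenated URL.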

Make the Wayback API Call

Now we are ready to use the requests module to make the API call. Luckily, the call is a simple query string. We pass the URL and the header information, plus a timeout so the script cannot hang. Certificate validation stays on, which is the requests default; passing verify=False when a certificate error trips up the call also removes the protection against tampered responses. Then we load the JSON response into a data variable, which response.json() returns as a Python dictionary.

response = requests.get(wburl, headers=headers, timeout=30)
data = response.json()

Process the Wayback API Response

Let’s use the example response from the API documentation:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}
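As a rough sketch of how this structure can be read, the snippet below parses the documented example and pulls out the snapshot URL along with the capture time. The helper name closest_snapshot is an assumption made for illustration. It also covers the documented case where nothing is archived and archived_snapshots comes back as an empty object.

```python
import json
from datetime import datetime

# Example response copied from the API documentation
sample = """
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}
"""

def closest_snapshot(payload):
    """Return (url, captured_at) for the closest snapshot, or None if absent."""
    closest = payload.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Snapshot timestamps use the same YYYYMMDDhhmmss layout as the request
    captured_at = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return closest["url"], captured_at

result = closest_snapshot(json.loads(sample))
print(result)

# An empty archived_snapshots object means no snapshot was found
empty = closest_snapshot({"archived_snapshots": {}})
print(empty)
```

Returning None for the empty case lets the calling code decide how to report a missing snapshot.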

We can see we need to get to the url property. From here, we can get the snapshot URL in two ways. This JSON response is very simple, so reading it as a dictionary works well, but some API responses are extremely complex and in some cases malformed, so it helps to know another way.

The other way is to convert the JSON object into a string and then match via regular expression. You can choose either way. Note there are cases where the API will not find any snapshots for a URL, in which case archived_snapshots comes back empty, so we need to use try/except to catch the error and report it usefully.

Option 1

Since response.json() has already parsed the JSON into a Python dictionary, we can work with data directly; passing it to json.loads() would raise a TypeError because that function expects a string. The dictionary is a type of associative array that we can walk key by key. Next, we try to access the url property. If it is there, we load it; if not, we load “n/a” into the variable and print an error.

geturl = data

try:
    wayback = geturl['archived_snapshots']['closest']['url']
except KeyError:
    # A missing "closest" key means the API found no snapshot for this URL
    wayback = "n/a"
    print("No snapshot URL returned")

Option 2

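We convert the parsed response back into a JSON string with json.dumps(). Then we use a regular expression to search for the snapshot URL in the string. If the search results in a match, great, load it. If not, load “n/a” and print the error.

jsonstr = json.dumps(data)
# Raw string avoids an invalid-escape warning; https? matches either scheme
matchResult = re.search(r'https?://web\.archive\.org[^"]*', jsonstr)

try:
    wayback = matchResult[0]
except TypeError:
    # re.search returns None when nothing matches, and None cannot be indexed
    wayback = "n/a"
    print("No snapshot URL returned")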

Conclusion

From here you can use this information any way you like. You can easily automate this script and store the results in a database to have access to these snapshot URLs over time. I can even imagine loading a CSV from a Screaming Frog crawl and retrieving the latest snapshot date for every URL on your site, but again, be kind to the API.
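One way the CSV idea might look is sketched below. It assumes the crawl export keeps its URLs in a column named "Address" (adjust url_column if yours differs) and takes the lookup step as a function argument, so any of the approaches above can be plugged in. The names lookup_snapshots and fake_lookup are hypothetical, and the stand-in data lets the sketch run without touching the API.

```python
import csv
import io
import time

def lookup_snapshots(csv_file, lookup, url_column="Address", delay_seconds=1.0):
    """Map each unique URL in a CSV column to whatever `lookup` returns for it."""
    results = {}
    for row in csv.DictReader(csv_file):
        page_url = row[url_column].strip()
        if not page_url or page_url in results:
            continue  # skip blanks and duplicates to avoid wasted calls
        if results:
            time.sleep(delay_seconds)  # pause between calls to be kind to the API
        results[page_url] = lookup(page_url)
    return results

# Stand-in crawl data and lookup function, so no network access is needed
crawl = io.StringIO(
    "Address\n"
    "https://example.com/\n"
    "https://example.com/about\n"
    "https://example.com/\n"
)

def fake_lookup(page_url):
    return "snapshot-for:" + page_url

snapshots = lookup_snapshots(crawl, fake_lookup, delay_seconds=0)
print(snapshots)
```

In real use, lookup would be a function that calls the Wayback API for one URL and returns the snapshot URL or “n/a”, and delay_seconds would stay above zero.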

I’ve shown that with a tiny bit of Python code, you can retrieve cached pages from Wayback Machine even if you don’t know the exact date of the snapshot. Now get out there and try it out! Follow me on Twitter and let me know your Wayback Machine API applications and ideas!

Wayback Machine FAQ

What is the Wayback Machine API, and how does it work for retrieving cached pages?

The Wayback Machine API is a tool provided by the Internet Archive that allows users to look up historical snapshots of websites. It works by sending HTTP requests to the Wayback Machine API endpoint with the desired URL and an optional timestamp parameter; the API replies with details of the closest snapshot.

How can I use the Wayback Machine API to get cached pages programmatically?

To retrieve cached pages programmatically, you need to make HTTP requests to the Wayback Machine API endpoint, specifying the target URL and the desired timestamp. The API will then respond with a JSON record of the closest snapshot, including its URL, which you can request separately to get the archived content.

Can I retrieve cached pages for any website using the Wayback Machine API?

Yes, the Wayback Machine API allows you to retrieve cached pages for a wide range of websites. However, keep in mind that not all websites may have complete historical snapshots, and some content may not be available.

What format does the response from the Wayback Machine API come in?

The response from the Wayback Machine API endpoint used in this article is in JSON format. It describes the closest snapshot: whether it is available, its URL, its timestamp, and its HTTP status. The archived page itself is HTML, which you get by requesting the snapshot URL.

Are there any limitations or rate limits when using the Wayback Machine API?

I’m not aware of any published call limits for the endpoint used in this article. Still, keep your request volume modest and review the API documentation for current usage policies.

How far back in time can I retrieve cached pages using the Wayback Machine API?

The availability of historical snapshots depends on the specific website and the frequency of archiving by the Wayback Machine. Some websites may have extensive archives, while others may have limited snapshots.

Can I retrieve only specific elements or data from the cached pages using the Wayback Machine API?

The endpoint used in this article returns snapshot metadata rather than page content. If you need specific elements or data, request the snapshot URL it returns and parse that HTML to extract the desired information.

Are there any authentication requirements to use the Wayback Machine API?

As of the last update, the Wayback Machine API does not require authentication for basic usage. However, it’s advisable to check the API documentation for any changes or updates to authentication requirements.

Can I use the Wayback Machine API for commercial purposes?

The Wayback Machine API is generally free to use for non-commercial purposes. For commercial or high-volume usage, it’s recommended to review the Internet Archive’s terms of service and consider reaching out to them for specific agreements.

Where can I find more detailed documentation on using the Wayback Machine API?

You can find detailed documentation, including API endpoints, parameters, and examples, on the official Internet Archive website. Refer to the Wayback Machine API documentation (https://archive.org/help/wayback_api.php) for comprehensive information.

Greg Bernhardt