Python Wayback Machine API

Archive.org’s Wayback Machine has long been a staple in the SEO industry for looking back at cached historical web pages. Each cached page is called a snapshot. It’s great for tracking progress, troubleshooting issues or, if you are lucky, recovering data. The Wayback Machine GUI is not always quick or frustration-free, though. With the Python steps below, we’ll tap into the free API to return the snapshot nearest to a given date. This is great if you don’t know the exact date of the cached page you’re looking for.

I’m not aware of any call limits, but do be kind to their system and only grab what you really can use. See Wayback API documentation for more information.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax understood
  • Access to a Linux installation (I recommend Ubuntu) or a Jupyter/Colab notebook

Starting the Script

First, let’s install a user agent module to help avoid being denied by Wayback’s API.

pip3 install fake-useragent

Next we import our required modules. We use requests and fake_useragent to call the API, datetime to build the timestamp, and json and re (regular expressions) to handle the response.

import requests
import json
import re
from datetime import datetime
from fake_useragent import UserAgent

Craft the API Call

In the code below we build the timestamp for today, but you could modify this to find the nearest snapshot at any time in history (from Wayback’s inception). Note that the API expects the timestamp as 1 to 14 digits in YYYYMMDDhhmmss form rather than as an epoch value, so we format the current date and time that way. Be sure to replace url below with your own site/page.

url = "https://www.importsem.com"
ua = UserAgent()
headers = {"user-agent": ua.chrome}

# Wayback expects YYYYMMDDhhmmss, not an epoch value
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + timestamp
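
To look up a date in the past rather than today, the same URL can be built from any datetime. Here is a minimal sketch, assuming a hypothetical helper named build_wayback_url (my own name, not part of any library). It uses urlencode from the standard library so the target URL is escaped safely inside the query string:

```python
from datetime import datetime
from urllib.parse import urlencode

API_ENDPOINT = "https://archive.org/wayback/available"

def build_wayback_url(page_url, when):
    """Return the availability API URL for page_url nearest to the datetime `when`."""
    # The API expects the timestamp as YYYYMMDDhhmmss (1 to 14 digits).
    params = {"url": page_url, "timestamp": when.strftime("%Y%m%d%H%M%S")}
    return API_ENDPOINT + "?" + urlencode(params)

# Example: ask for the snapshot nearest to midnight on 1 January 2015
wburl_2015 = build_wayback_url("https://www.importsem.com", datetime(2015, 1, 1))
print(wburl_2015)
```

The resulting string can be passed to requests.get() exactly like wburl.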

Make the API Call

Now we are ready to use the requests module to make the API call. Luckily, the call is a simple GET request with query-string parameters. We pass the URL and header information and set verify=False to disable certificate validation, which sometimes trips up the call. Be aware that this skips the TLS certificate check, so drop that argument if the call works without it. Then we parse the JSON response into a data variable.

response = requests.get(wburl,headers=headers,verify=False)
data = response.json()
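
If you plan to reuse the call, it may help to wrap it in a function with a timeout and a status check. The sketch below is one possible shape, not the only one: fetch_availability is a hypothetical name, the 10-second timeout is an arbitrary assumption, and the optional session argument is only there so a requests.Session (or a stand-in object for testing) can be passed in. It lets requests build the query string through params and leaves certificate verification at its default.

```python
import requests

def fetch_availability(page_url, timestamp, session=None, timeout=10):
    """Call the availability API and return the decoded JSON as a dict."""
    # Use the supplied session if there is one, otherwise the requests module itself.
    http = session if session is not None else requests
    response = http.get(
        "https://archive.org/wayback/available",
        params={"url": page_url, "timestamp": timestamp},
        timeout=timeout,  # give up instead of hanging forever
    )
    response.raise_for_status()  # raise on 4xx/5xx rather than parsing an error page
    return response.json()
```

Calling fetch_availability("https://www.importsem.com", "20150101") would then return the same kind of dictionary as data above.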

Process the API Response

Let’s use the example response from the API documentation:

{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}

We can see we need to get to the url property. From here we can get the snapshot URL in two ways. This JSON response is very simple, but some API responses are extremely complex and in some cases malformed, so it helps to have a second approach.

That other way is to convert the JSON object into a string and then match via regular expression. You can choose either way. Note there are cases where the API will not find any snapshots for a URL, so we need to use try/except to catch the error and report it usefully.

Option 1

Because response.json() has already turned the JSON response into a Python dictionary, there is no need to call json.loads() on it (doing so raises a TypeError). A dictionary is a type of associative array that we can step through by key. Next we try to access the url property of the dictionary. If we can, we load it; if not, we load “n/a” into the variable and print an error.

geturl = data

try:
    wayback = geturl['archived_snapshots']['closest']['url']
except (KeyError, TypeError):
    # no snapshot found, or the response was not shaped as expected
    wayback = "n/a"
    print("No snapshot URL returned")

Option 2

We convert the JSON object into a Python string. Then we use a regular expression to search for the URL in the string. If the search results in a match, great, load it. If not, load “n/a” and print the error.

jsonstr = json.dumps(data)
matchResult = re.search(r'https?://web\.archive\.org[^"]*', jsonstr)

try:
    wayback = matchResult[0]
except TypeError:
    # re.search returned None, so there was no match
    wayback = "n/a"
    print("No snapshot URL returned")
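
Whichever option you pick, the lookup can be folded into a small helper that also turns the snapshot timestamp into a datetime. This is only a sketch: closest_snapshot is a hypothetical name of my own, and it assumes the response has the same shape as the documentation example shown earlier.

```python
from datetime import datetime

def closest_snapshot(data):
    """Return (snapshot_url, captured_at) from an API response dict, or None."""
    # .get() avoids a KeyError when the API found no snapshot at all.
    closest = data.get("archived_snapshots", {}).get("closest")
    if not closest or not closest.get("available"):
        return None
    # Snapshot timestamps use the YYYYMMDDhhmmss format.
    captured_at = datetime.strptime(closest["timestamp"], "%Y%m%d%H%M%S")
    return closest["url"], captured_at

# The example response from the API documentation, as a Python dict
sample = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}

print(closest_snapshot(sample))
print(closest_snapshot({"archived_snapshots": {}}))
```

Returning None instead of “n/a” is a design choice here; it lets the caller decide how to report a missing snapshot.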

Conclusion

From here you can use this information any way you like. You can easily automate this script and store the results in a database to keep access to these snapshot URLs over time. I can even imagine loading a CSV from a Screaming Frog crawl and retrieving the latest snapshot date for every URL on your site, but again, be kind to the API.
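
As a rough sketch of that crawl idea, the loop below reads URLs from a CSV and pauses between lookups. Several things here are assumptions rather than facts from this post: the function name snapshots_for_csv is hypothetical, the column name "Address" is my guess at what the crawl export calls its URL column, and the one-second default delay is arbitrary. The lookup argument is any function that takes a URL and returns a result, so the demo can run with a stand-in that makes no network calls.

```python
import csv
import io
import time

def snapshots_for_csv(csv_file, lookup, url_column="Address", delay=1.0):
    """Yield (url, lookup(url)) for each CSV row, pausing between calls."""
    for row in csv.DictReader(csv_file):
        page_url = row.get(url_column)
        if not page_url:
            continue  # skip rows without a URL
        yield page_url, lookup(page_url)
        time.sleep(delay)  # be kind to the API

# Demo with an in-memory CSV and a stand-in lookup (no network calls)
demo_csv = io.StringIO(
    "Address,Status Code\n"
    "https://example.com/,200\n"
    "https://example.com/about,200\n"
)
demo_results = list(snapshots_for_csv(demo_csv, lookup=lambda u: "n/a", delay=0))
print(demo_results)
```

In real use you would open the exported file with open(path, newline="") and pass in a lookup that calls the API for each URL.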

I’ve shown that with a tiny bit of Python code you can retrieve cached pages from Wayback Machine even if you don’t know the exact date of the snapshot. Feel free to comment or Tweet me any cool ways you’ve built Wayback Machine API into your own scripts and extended this one.

Greg Bernhardt