How to Get Cached Pages From Wayback Machine API
Archive.org’s Wayback Machine has long been a staple in the SEO industry for looking back at cached historical web pages. Each cached page is called a snapshot. It’s great for tracking progress, troubleshooting issues, or, if you are lucky, recovering data. The Wayback Machine GUI is not always quick or frustration-free, though. With the Python steps below, we’ll tap into the free API to return the snapshot nearest to a date we provide. This is great if you don’t know the exact date of the cached page you’re looking for.
I’m not aware of any call limits, but do be kind to their system and only grab what you can really use. See the Wayback API documentation for more information.
Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Access to a Linux installation (I recommend Ubuntu) or a Jupyter/Colab notebook
Starting the Script
First, let’s install a user agent module so our requests are less likely to be denied by Wayback’s API.
pip3 install fake-useragent
Next, we import our required modules. We use requests and fake_useragent to call the API, json and re (regular expressions) to handle the response, and datetime to build the timestamp.
import json
import re
from datetime import datetime

import requests
from fake_useragent import UserAgent
Craft the Wayback API Call
In the code below we grab the timestamp for right now, but you could modify this to find the nearest snapshot at any time in history (from Wayback’s inception). The API expects the timestamp as 1 to 14 digits in YYYYMMDDhhmmss form, the same form as the timestamp field in its response, rather than as an epoch value, so we format it with strftime. Be sure to replace the URL below with your site/page.
url = "https://www.importsem.com"
ua = UserAgent()
headers = {"user-agent": ua.chrome}
# Wayback expects YYYYMMDDhhmmss, not an epoch value
timestamp = datetime.now().strftime("%Y%m%d%H%M%S")
wburl = "https://archive.org/wayback/available?url=" + url + "&timestamp=" + timestamp
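As a hedged alternative to string concatenation, the query string can be assembled with the standard library's urllib.parse.urlencode, which percent-encodes the target URL for us. The names build_wayback_url and API_ENDPOINT are illustrative and not part of the original script, and the YYYYMMDDhhmmss timestamp format is assumed from the timestamp field in the documentation's example response.

```python
from datetime import datetime
from urllib.parse import urlencode

# Endpoint taken from the article; the constant name is illustrative
API_ENDPOINT = "https://archive.org/wayback/available"


def build_wayback_url(page_url, when=None):
    """Build an availability query for page_url, optionally near a datetime."""
    params = {"url": page_url}
    if when is not None:
        # Assumed format: YYYYMMDDhhmmss, matching the API's own timestamps
        params["timestamp"] = when.strftime("%Y%m%d%H%M%S")
    return API_ENDPOINT + "?" + urlencode(params)


# Nearest snapshot to 15 January 2020
print(build_wayback_url("https://www.importsem.com", datetime(2020, 1, 15)))
```

Leaving out the second argument omits the timestamp parameter altogether.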
Make the Wayback API Call
Now we are ready to use the requests module to make the API call. Luckily, the API call is a very simple query string type. We pass in the URL and header information, and since we trust the Wayback Machine we disable certificate validation, which sometimes trips up the call. Keep in mind that verify=False turns off TLS verification entirely, so drop it if the call works without it. Then we load the JSON response into a data variable.
response = requests.get(wburl, headers=headers, verify=False)
data = response.json()
Process the Wayback API Response
Let’s use the API documentation example response:
{
    "archived_snapshots": {
        "closest": {
            "available": true,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200"
        }
    }
}
We can see we need to get to the url property. From here, we can get the snapshot URL in two ways. This JSON response is very simple, but some API responses are extremely complex, and in some cases malformed, so it helps to have another way.
The other way is to convert the JSON object into a string and then match via regular expression. You can choose either way. Note there are cases where the API will not find any snapshots for a URL, so we need to use try/except to catch the error and report it usefully.
Option 1
The call to response.json() has already parsed the JSON into a Python dictionary object, a type of associative array we can step through by key, so there is no need to pass it through json.loads again. Next, we try to access the url property of the dictionary. If it is there, we load it; if not, we load “n/a” into the variable and print an error.
geturl = data  # response.json() already returned a dictionary
try:
    wayback = geturl['archived_snapshots']['closest']['url']
except KeyError:
    # No "closest" entry means no snapshot was found
    wayback = "n/a"
    print("No snapshot URL returned")
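A minimal sketch of the same lookup without try/except, using dict.get so that a missing key yields None. The function name closest_snapshot is hypothetical, the first sample is the documentation example quoted earlier, and the empty archived_snapshots object is an assumed shape for a response with no snapshot.

```python
def closest_snapshot(data):
    """Return the closest snapshot URL from a parsed response, or None."""
    closest = data.get("archived_snapshots", {}).get("closest", {})
    if closest.get("available"):
        return closest.get("url")
    return None


# Example response from the API documentation
found = {
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
}
# Assumed shape of a response when no snapshot exists
missing = {"archived_snapshots": {}}

print(closest_snapshot(found))
print(closest_snapshot(missing))
```

Returning None rather than “n/a” lets the caller decide how to report a miss.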
Option 2
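We serialize the response dictionary back into a JSON string. Then we use a regular expression to search for the snapshot URL in the string. If the search results in a match, great, load it. If not, re.search returns None, so indexing it raises a TypeError; we catch that, load “n/a”, and print the error.
jsonstr = json.dumps(data)
# Raw string so the backslashes reach the regex engine untouched
matchResult = re.search(r'https?://web\.archive\.org[^"]*', jsonstr)
try:
    wayback = matchResult[0]
except TypeError:
    # matchResult is None when nothing matched
    wayback = "n/a"
    print("No snapshot URL returned")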
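The regular expression route can also be wrapped in a small helper that checks the match object directly. This is only a sketch: snapshot_from_text and SNAPSHOT_PATTERN are made-up names, and the pattern assumes snapshot URLs always start with web.archive.org/web/ as in the documentation example.

```python
import json
import re

# Assumed prefix for snapshot URLs, based on the documentation example
SNAPSHOT_PATTERN = re.compile(r'https?://web\.archive\.org/web/[^"\s]*')


def snapshot_from_text(text):
    """Return the first snapshot URL found in raw response text, or None."""
    match = SNAPSHOT_PATTERN.search(text)
    return match.group(0) if match else None


# Serialized version of the documentation's example response
sample = json.dumps({
    "archived_snapshots": {
        "closest": {
            "available": True,
            "url": "http://web.archive.org/web/20130919044612/http://example.com/",
            "timestamp": "20130919044612",
            "status": "200",
        }
    }
})

print(snapshot_from_text(sample))
print(snapshot_from_text('{"archived_snapshots": {}}'))
```

Because it works on plain text, the helper could also be pointed at response.text when the body fails to parse as JSON.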
Conclusion
From here you can use this information any way you like. You can easily automate this script to store it in a database to have access to these snapshot URLs over time. I can even imagine loading a CSV from a Screaming Frog crawl and retrieving the last snapshot date for every URL on your site, but again be kind to the API.
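Here is a rough sketch of that batch idea, under a few stated assumptions: the crawl export has a column named Address, a lookup function that returns a Wayback timestamp string is supplied by you, and a fixed pause between calls is polite enough. The names snapshot_date, lookup_all, and fake_lookup are hypothetical, and fake_lookup stands in for a real API call so the sketch runs offline.

```python
import csv
import io
import time
from datetime import datetime


def snapshot_date(timestamp):
    """Convert a Wayback timestamp such as '20130919044612' to a datetime."""
    return datetime.strptime(timestamp, "%Y%m%d%H%M%S")


def lookup_all(csv_file, lookup, column="Address", delay=1.0):
    """Call lookup(url) for every row of a CSV, pausing between calls."""
    results = {}
    for row in csv.DictReader(csv_file):
        results[row[column]] = lookup(row[column])
        time.sleep(delay)  # be kind to the API
    return results


def fake_lookup(url):
    """Stand-in for a real API call; always returns the same timestamp."""
    return "20130919044612"


# In-memory CSV standing in for a crawl export (column name is assumed)
crawl = io.StringIO("Address\nhttp://example.com/\nhttp://example.com/about\n")
for address, stamp in lookup_all(crawl, fake_lookup, delay=0).items():
    print(address, snapshot_date(stamp).date())
```

Swapping fake_lookup for a function that calls the API and returns the closest snapshot's timestamp would turn this into a working batch job.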
I’ve shown that with a tiny bit of Python code, you can retrieve cached pages from Wayback Machine even if you don’t know the exact date of the snapshot. Now get out there and try it out! Follow me on Twitter and let me know your Wayback Machine API applications and ideas!