google cache dates python

Viewing cached links in Google is often used by SEOs as a troubleshooting or information recovery method. Google caches some of the web pages it crawls and creates a type of snapshot of that page at the time of the crawl. You’ll often notice some resources or images aren’t rendered so it’s rarely a perfect copy. It’s still useful to see what content is being cached and if something is missing finding out why. There is also a correlation between how often Google caches a page and its perceived importance.  High value pages are often cached daily whereas lesser value pages can go week or months without a new cache date. Below I’ll show you how to scrape the cache date for a set of URLs so you can do the analysis yourself.

Requirements and Assumptions

Starting the Script

First we install the fake_useragent module to use in our call header. This can help ward off light abuse detection. Google is for good reason very sensitive to non-human requests. If you find yourself blocked you can try using the module proxy-requests, but I didn’t have much success with it. Many of them were blocked already and the module too forever to run.  The most successful tactic I’ve found is call delays. Note if you are using Google Colab, put an exclamation mark at the beginning of pip3, so !pip3.

pip3 install fake_useragent

Next we import our modules we will use.

  • requests will be used to request the Google cache page.
  • re is for using the regular expression to grab the date from the cache page.
  • fake_useragent is to generate the random user agent
  • numpy is used to select a random number from a list for script delay
  • time will be used to cause the script delay
  • pandas if you have a csv file of URLs and for storing the scrape results
import requests
import re
from fake_useragent import UserAgent
import numpy as np
import time
import pandas as pd

Build the URL List

Let’s build our URL list for the pages we want to check their cache dates.

urls = ['https://www.rocketclicks.com/seo/','https://www.rocketclicks.com/ppc/','https://www.rocketclicks.com/about-us/']

If you have a csv file of URLs you can use the code below instead. Just modify to use the csv path and the column name where the URLs are stored. We load the csv into a pandas dataframe and then convert the url/address column to a list.

loadurls = pd.read_csv("PATH_TO_FILE")
urls = loadurls['COLUMN_NAME_WITH_URLS'].tolist()

Build Loop Counter and Pandas Dataframe

Now later on we’re going to set a script delay between calls to Google. So we don’t end up with an extra delay at the end of the script. Note, you could also move the delay we’ll build later on to the top of the for loop we’ll soon build, but this seems more fun to do. For this we’ll build a counter variable and find the total number of URLs in the list.

After that we build our pandas dataframe container where we will store the scrape results for each URL. It has two simple columns, URL and Cache Date.

counter = 0
urlnum = len(urls)

d = {'URL': [], 'Cache Date': []}
df = pd.DataFrame(data=d)

Begin Processing URLs

Next let’s start that loop to start processing the list of URLs. We’re going to use the cache: search operator and for a clean search we’ll strip anything before the root domain.

for url in urls:	
  urlform = url.replace('https://','')
  urlform = urlform.replace('http://','')
  urlform = urlform.replace('www.','')

Build the Cache Query

Now we can build the query using the cache: operator and the URL we just cleaned. Here we also build the delay list. This list contains numbers from 10-20 in increments of 1 and represents the seconds of delay between requests. I found this to be the most successful method of avoiding Google blocking your request. Anything quicker and Google starts blocking.  We’ll use this list with numpy to randomize our delay using that list. Lastly we build the full query URL we are going to request. I’ve just started learning about f-strings so I wanted to use them in this script a bit. It’s just a modern method for inserting variables into strings.

query = "cache:"+urlform

delays = [*range(10, 21, 1)]

queryurl = f"https://google.com/search?q={query}"

Request Google Cache’d Page

Next we will build our fake user agent and assign it to the header variable and make our request to Google. Fingers crossed!

ua = UserAgent() 
header = {"user-agent": ua.chrome}

resp = requests.get(queryurl, headers=header,verify=True)
print(urlform)

Process Cached Paged

The request has been made, let’s check the response. If Google returns a 200 status code you are all good! If not, you may be blocked. Most times it’s temporary and clears after an hour or so. If the request is successful we can use RegEx to grab the cache date that is just text on the page. Let me break down the RegEx for those who aren’t completely comfortable with it.

The cache date is in this format: Sep 12, 2020

[a-zA-z]{3}\s[0-9]{1,2},\s[0-9]{4}

[a-zA-z]{3} matches as 3 non-case sensitive letters

\s[0-9]{1,2} matches a space then between 1-2 numbers

,\s matches a comma then a space

[0-9]{4} matches 4 numbers

After matching the cache date we create a dictionary object using the URL and cache date. Then we append it to the dataframe we created earlier. Each URL will be added to the dataframe this way.

if resp.status_code == 200:
    getcache = re.search("[a-zA-z]{3}\s[0-9]{1,2},\s[0-9]{4}",resp.text)
    g_cache = getcache.group(0)
    
    newdate = {"URL":urlform,"Cache Date":g_cache}
    df = df.append(newdate, ignore_index=True)
else:
    print("Google may have blocked you, try again in an hour")

Set Script Delay and Print Dataframe

Finally we update our counter and check if there are any more URLs in the list. If there are it’s time to set that delay. We use the numpy module to randomly select a number in that delay list we create earlier and then use that in our sleep() function. This will pause the script for that random amount of time. This helps confuse Google and prevents them from detecting patterns in request time. aka, humans don’t make 100 searches at exactly  .1s intervals. At last we display the cache date information we’ve been collecting.

counter += 1

if counter != urlnum:
  delay = np.random.choice(delays)
  #print("sleeping for "+str(delay)+" seconds" + "\n")
  time.sleep(delay)

df

Example Output

google cache date

Conclusion

You can see how easy it is to grab cache dates for your URLs. The biggest challenge is keeping Google off your back and finding ways to prevent getting temporarily blocked. From here you can store this information in a database, use it in reporting, or insert the data into another script. Enjoy! Don’t forget to follow me and let me know how you are using this script and extending it.

Greg Bernhardt
Follow me

Leave a Reply