Automate Lighthouse with Python

Google’s web page scanner Lighthouse has long been one of the most important tools for evaluating a web page. At a high level it measures your page’s performance, SEO, accessibility, and best practices. At a deeper level it reports more granular metrics for each of those categories and displays recommendations. Most SEOs are very familiar with running Lighthouse within Google Chrome’s DevTools. Using Lighthouse within the browser is easy and handy, but it’s still manual and difficult to scale. What if you want to run Lighthouse on multiple pages on a daily basis? Python, as usual, comes to the rescue. This tutorial offers the bare bones needed to set up your automated Lighthouse scanning, and it should be clear and easy to extend for your own purposes.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax understood
  • Access to a Linux installation (I recommend Ubuntu). Google Colab will also work, but I don’t recommend it for this because performance is not great.
  • This script has been tested on Lighthouse 6.4.1 which is current as of this tutorial’s publishing date.

Starting the Script

First step, we install the Lighthouse CLI, which is easy to do using the Node package manager (npm). Be sure to read the documentation and consider the options available. An alternative to running Lighthouse locally is to use the PageSpeed Insights API. If you are running from Google Colab, put an exclamation point at the beginning; otherwise, enter the following in your terminal:

npm install -g lighthouse
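As a hedged sketch of the PageSpeed Insights alternative mentioned above: the v5 API returns the same Lighthouse report as JSON over HTTP, so no local install is needed. The helper names below (`build_psi_url`, `parse_scores`) are my own illustrations, not part of any library, and heavy usage would require your own API key.

```python
import urllib.parse

PSI_ENDPOINT = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"

def build_psi_url(page_url, categories=("performance", "seo", "accessibility", "best-practices")):
    # Each category must be passed as its own repeated "category" query parameter
    params = [("url", page_url)] + [("category", c) for c in categories]
    return PSI_ENDPOINT + "?" + urllib.parse.urlencode(params)

def parse_scores(psi_json):
    # In a PSI response, the Lighthouse report is nested under "lighthouseResult"
    cats = psi_json["lighthouseResult"]["categories"]
    return {name: round(data["score"] * 100) for name, data in cats.items()}
```

You could fetch `build_psi_url(...)` with any HTTP client and pass the decoded JSON to `parse_scores` to get the same 0–100 ratings we extract from the local report later in this tutorial.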

Next we import our required modules.

  • json: to process the json output from Lighthouse
  • os: to execute the local Lighthouse CLI
  • pandas: for storing the results and exporting to csv
  • datetime: used in naming the JSON file
  • time: to pause the script while Lighthouse runs
import json
import os
import time
import pandas as pd
from datetime import datetime

Create Dataframe

We’re going to store each URL’s high-level ratings in a dataframe. Let’s set up that empty dataframe to use later. Note: there are dozens of metrics you can pull from these reports. Open the JSON output file after you run Lighthouse to see the possibilities, then add to the dataframe columns below and make the corresponding edits where we grab data from the JSON file.

df = pd.DataFrame([], columns=['URL','SEO','Accessibility','Performance','Best Practices'])

Create support variables

Now we set up a few variables we’ll use throughout. We’ll use name and getdate for naming the output file, and the urls list is what we’re going to loop through and run Lighthouse on. You have two options for supplying the script with your list of URLs to scan: either hard-code them in a list, or import them from a csv, often a Screaming Frog or similar crawl file.

name = "RocketClicks" 
getdate = datetime.now().strftime("%m-%d-%y")

urls = ["https://www.rocketclicks.com","https://www.rocketclicks.com/seo/","https://www.rocketclicks.com/ppc/"]

Use the code below if you are importing from a crawl file. Change YOUR_CRAWL_CSV to the path/name of your crawl csv file. Then we convert the Address column to a flat Python list of strings (using .tolist() on the column itself; calling .values.tolist() on the dataframe would produce a list of one-item lists, which would break the loop below).

df_urls = pd.read_csv("YOUR_CRAWL_CSV.csv")
urls = df_urls["Address"].tolist()

Run Lighthouse

Now it’s time to loop through that list of URLs and run Lighthouse! We’ll use the OS Python module to execute Lighthouse via CLI. Be sure to check the docs for details on all the options available. Also change the output path to your local environment.

for url in urls:    
    stream = os.popen('lighthouse --quiet --no-update-notifier --no-enable-error-reporting --output=json --output-path=YOUR_LOCAL_PATH'+name+'_'+getdate+'.report.json --chrome-flags="--headless" ' + url)

Because Python is executing an application outside the script, we need to pause the script and wait for Lighthouse to finish. I’ve found 2 minutes (120 seconds) works for most pages. Tweak as needed if you get an error that the JSON output file doesn’t exist. The alternative to a fixed pause is a loop that checks for the output file and keeps waiting until it exists before the script continues. Once the pause is over and Lighthouse is likely finished, we build the full path to the file so we can process it in the next snippet. Be sure to change “YOUR_LOCAL_PATH”.

    time.sleep(120)
    print("Report complete for: " + url)

    json_filename = 'YOUR_LOCAL_PATH' + name + '_' + getdate + '.report.json'
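As a hedged sketch of the file-polling alternative described above (the function name wait_for_file is my own, not from any library):

```python
import os
import time

def wait_for_file(path, timeout=300, interval=5):
    """Poll for `path` until it exists or `timeout` seconds elapse.

    Returns True if the file appeared, False on timeout, so the caller
    can skip processing instead of crashing on a missing report.
    """
    waited = 0
    while not os.path.exists(path):
        if waited >= timeout:
            return False
        time.sleep(interval)
        waited += interval
    return True
```

Inside the loop, you would replace the fixed time.sleep(120) with a call to wait_for_file(json_filename) and skip the URL if it returns False.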

Process Report

Now let’s open that JSON report file and start processing it.

    with open(json_filename) as json_data:
        loaded_json = json.load(json_data)

As mentioned earlier, there is a ton of data in this report file, and I encourage you to go through it and pick out the things you want to store. For this tutorial we’re just going to grab the high-level ratings for each of the four main categories. Remember these scores are out of 100; we multiply by 100 here because Lighthouse records them as floats from 0.00 to 1.00, with 1.00 being a 100% score.

    seo = str(round(loaded_json["categories"]["seo"]["score"] * 100))
    accessibility = str(round(loaded_json["categories"]["accessibility"]["score"] * 100))
    performance = str(round(loaded_json["categories"]["performance"]["score"] * 100))
    best_practices = str(round(loaded_json["categories"]["best-practices"]["score"] * 100))

Add Data to Dataframe

Now we take those high-level ratings, put them in a dictionary, and append that row to the dataframe. Each URL gets its own row. After this, the loop moves on to the next URL if there is one.

    row = {"URL":url,"SEO":seo,"Accessibility":accessibility,"Performance":performance,"Best Practices":best_practices}
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True).sort_values(by='SEO', ascending=False)

Finally, we have all our ratings in the dataframe. From here you can do more manipulation on your own, pipe the data into another script, or store it in a database. In this tutorial we’re simply going to export the data to a csv file that can be opened in, say, Google Sheets or Excel. Be sure to replace “SAVE_PATH”.

df.to_csv('SAVE_PATH/lighthouse_' + name + '_' + getdate + '.csv')
print(df)

Output

(Screenshot: Lighthouse results opened in Excel)

Automating the Scan

If your Lighthouse script is working well when you run it manually, it’s time to automate it. Luckily, Linux already supplies us with a solution by using the crontab. The crontab stores entries of scripts where you can dictate when to execute them (like a scheduler). You have lots of flexibility with how you schedule your script (any time of day, day of the week, day of the month, etc.).

But first, if you are going this route, you should add a shebang to the very top of your script; it tells Linux to run the script using Python 3:

#!/usr/bin/python3

Now back to the crontab! To open it and add entries, run this command:

crontab -e

It will likely open the crontab file in the vi editor. On a blank line at the bottom of the file, type the code below, which will run the script at midnight every Sunday. To change the schedule to something else, use a cron schedule helper such as crontab.guru. Customize with your path to the script.

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py

If you want to create a log file to record each time the script ran, you can use this instead. Customize with your path to the script.

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py > PATH_TO_FILE/FILENAME.log 2>&1

Save the crontab file and you’re good to go! Just note, your computer needs to be on at the time the cronjob is set to run.

Conclusion

Lighthouse has become a standard tool for SEOs to understand their pages’ health. It’s time to level up our usage of Lighthouse and move beyond DevTools, and automation and granular customization using Python is one great way to achieve that. Level up this script by inserting the data into a database, or use the Google Sheets API to add results to an existing sheet! Please follow me on Twitter for feedback and to showcase interesting ways to extend the script. Enjoy!
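As a hedged sketch of the database idea above, here is one way to store the ratings dataframe in SQLite using only the standard library plus pandas. The function name and the table name lighthouse_runs are my own choices, not from the tutorial.

```python
import sqlite3
import pandas as pd

def save_to_sqlite(df, db_path="lighthouse.db"):
    # Append this run's ratings to a SQLite table, creating it if needed
    with sqlite3.connect(db_path) as conn:
        df.to_sql("lighthouse_runs", conn, if_exists="append", index=False)
        # Return the running total of stored rows as a quick sanity check
        return conn.execute("SELECT COUNT(*) FROM lighthouse_runs").fetchone()[0]
```

You could call save_to_sqlite(df) after the loop, instead of (or in addition to) the csv export, to accumulate a history of scores across daily runs.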

Looking for something more comprehensive? See Hamlet Batista’s BrightonSEO slides on Automating Lighthouse on a big scale!

Greg Bernhardt