Google’s web page scanner Lighthouse has become a fixture among the most important tools for evaluating a web page. At a high level, it measures your page’s performance, SEO, accessibility, and best practices. At a deeper level, it provides granular metrics for each of those categories and displays recommendations. Most SEOs are familiar with running Lighthouse within Google Chrome’s DevTools. Running it in the browser is easy and handy, but it’s still manual and difficult to scale. What if you want to run Lighthouse on multiple pages on a daily basis? Python, as usual, comes to the rescue. This tutorial offers the bare bones needed to set up automated Lighthouse scanning, and it should be easy to extend for your own purposes.
Key Points
- Google’s web page scanner Lighthouse measures page performance, SEO, accessibility, and best practices
- Python can be used to automate the use of Lighthouse
- Requirements and Assumptions include Python 3, access to Linux or Google Colab, and Lighthouse 6.4.1
- Install the Lighthouse CLI and import the required modules
- Create an empty dataframe and support variables
- Loop through the list of URLs and run Lighthouse
- Process the JSON report, open the file, and grab the high-level ratings
- Add ratings to the dataframe, export to CSV, and automate the scan with crontab
Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Access to a Linux installation (I recommend Ubuntu) or Google Colab, though I don’t recommend Colab for this because the performance is not great.
- This script has been tested on Lighthouse 6.4.1 which is current as of this tutorial’s publishing date.
Starting the Script
In the first step, we install the Lighthouse CLI, which is easy to do using the Node package manager (npm). Be sure to read the documentation and consider the options available. An alternative to running Lighthouse locally is to use the PageSpeed Insights API (a minimal sketch follows the install command below). If you are running from Google Colab, put an exclamation point at the beginning; otherwise, enter the following in your terminal:
npm install -g lighthouse
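For reference, here is a minimal sketch of the PageSpeed Insights API alternative mentioned above. It assumes the requests library is installed and that YOUR_API_KEY is a placeholder for your own API key:

import requests

# Sketch: fetch a Lighthouse report through the PageSpeed Insights API instead of the local CLI.
# YOUR_API_KEY is a placeholder for your own key from the Google Cloud console.
endpoint = "https://www.googleapis.com/pagespeedonline/v5/runPagespeed"
params = {
    "url": "https://www.rocketclicks.com",
    "category": ["PERFORMANCE", "SEO", "ACCESSIBILITY", "BEST_PRACTICES"],
    "key": "YOUR_API_KEY",
}
psi_json = requests.get(endpoint, params=params).json()
# The Lighthouse report lives under the "lighthouseResult" key of the response.
print(psi_json["lighthouseResult"]["categories"]["performance"]["score"])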
Next, we import our required modules.
- json: to process the JSON output from Lighthouse
- os: to execute the local Lighthouse CLI
- pandas: for storing the results and exporting to CSV
- datetime: used in naming the JSON file
- time: used to pause the script while Lighthouse finishes
import json
import os
import time
import pandas as pd
from datetime import datetime
Create Dataframe
We’re going to store each URL’s high-level ratings in a dataframe. Let’s set up that empty dataframe to use later. Note: there are dozens of metrics you can pull from these reports. Open the JSON output file once you run Lighthouse to see the possibilities, then add columns to the dataframe below and make the corresponding edits where we grab data from the JSON file.
df = pd.DataFrame([], columns=['URL','SEO','Accessibility','Performance','Best Practices'])
Create support variables
Now we set up a few easy variables we’ll use throughout. We’ll use the name and getdate for naming the output file, and the URL list is what we’re going to loop through and run Lighthouse on. You have two options for giving the script the list of URLs to scan: you can define them in a Python list, as below, or import them from a CSV, often a Screaming Frog or similar crawl file.
name = "RocketClicks"
getdate = datetime.now().strftime("%m-%d-%y")
urls = ["https://www.rocketclicks.com","https://www.rocketclicks.com/seo/","https://www.rocketclicks.com/ppc/"]
Use the code below if you are importing from a crawl file. Change YOUR_CRAWL_CSV to the path/name of your crawl CSV file. Then we convert the Address column to a flat Python list.
# Read the crawl file and flatten the Address column into a list of URLs
df_urls = pd.read_csv("YOUR_CRAWL_CSV.csv")
urls = df_urls["Address"].tolist()
Run Lighthouse
Now it’s time to loop through that list of URLs and run Lighthouse! We’ll use Python’s os module to execute Lighthouse via the CLI. Be sure to check the docs for details on all the options available. Also, change the output path to your local environment.
for url in urls:
    stream = os.popen('lighthouse --quiet --no-update-notifier --no-enable-error-reporting --output=json --output-path=YOUR_LOCAL_PATH' + name + '_' + getdate + '.report.json --chrome-flags="--headless" ' + url)
Because Python is executing an application outside the script, we need to pause and wait for Lighthouse to finish. I’ve found 2 minutes (120 seconds) works for most pages; tweak as needed if you get an error that the JSON output file doesn’t exist. The alternative to a fixed pause is a loop that polls for the output file and only continues once it exists (a sketch follows the next snippet). Once the pause is over and Lighthouse is likely finished, we build the full path to the file so we can process it in the next snippet. Be sure to change “YOUR_LOCAL_PATH”.
    time.sleep(120)
    print("Report complete for: " + url)
    json_filename = 'YOUR_LOCAL_PATH' + name + '_' + getdate + '.report.json'
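If you’d rather not rely on a fixed pause, here is a minimal sketch of the polling alternative. It would sit inside the same loop in place of the fixed sleep, with json_filename built first; the 300-second cap is an assumption to keep a failed run from looping forever.

    # Sketch of the polling alternative: build json_filename first, then wait for the
    # report file to appear, with an upper bound so a failed run can't hang the script.
    json_filename = 'YOUR_LOCAL_PATH' + name + '_' + getdate + '.report.json'
    waited = 0
    while not os.path.exists(json_filename) and waited < 300:
        time.sleep(5)
        waited += 5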
Process Report
Now let’s open that JSON report file and start processing it.
    with open(json_filename) as json_data:
        loaded_json = json.load(json_data)
As mentioned earlier, there is a ton of data in this report file, and I encourage you to go through it and pick out the things you want to store. For this tutorial, we’re just going to grab the high-level ratings for each of the 4 main categories. Lighthouse records these scores as floats from 0.0 to 1.0, so we multiply by 100 to express them as the familiar scores out of 100.
    seo = str(round(loaded_json["categories"]["seo"]["score"] * 100))
    accessibility = str(round(loaded_json["categories"]["accessibility"]["score"] * 100))
    performance = str(round(loaded_json["categories"]["performance"]["score"] * 100))
    best_practices = str(round(loaded_json["categories"]["best-practices"]["score"] * 100))
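As an example of going deeper than the category scores, here is a sketch that pulls a couple of individual audit values from the same report. The audit IDs shown ('largest-contentful-paint', 'cumulative-layout-shift') are standard Lighthouse audit keys, but check your own JSON output to confirm which audits your version reports; these lines would sit inside the same loop, alongside the category scores, and you’d add matching columns to the dataframe.

    # Sketch: pulling individual audit metrics from the same report.
    # Audit IDs live under loaded_json["audits"]; verify the keys in your own JSON output.
    lcp_ms = loaded_json["audits"]["largest-contentful-paint"]["numericValue"]
    cls = loaded_json["audits"]["cumulative-layout-shift"]["numericValue"]
    print("LCP (ms):", round(lcp_ms), "CLS:", round(cls, 3))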
Add Data to Dataframe
Now we take those high-level ratings, put them in a dictionary, and add that row to the dataframe. Each URL gets its own row. After this, the loop moves on to the next URL if there is one.
    # Build one row per URL and add it to the dataframe
    row = {"URL": url, "SEO": seo, "Accessibility": accessibility, "Performance": performance, "Best Practices": best_practices}
    df = pd.concat([df, pd.DataFrame([row])], ignore_index=True).sort_values(by='SEO', ascending=False)
Finally, we have all our ratings in the dataframe. From here you can do further manipulation, pipe the data into another script, or store it in a database (a sketch of the database option follows the export below). In this tutorial, we’re simply going to export the data to a CSV file that can be opened in, say, Google Sheets or Excel. Be sure to replace “SAVE_PATH”.
df.to_csv('SAVE_PATH/lighthouse_' + name + '_' + getdate + '.csv')
print(df)
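If you’d rather take the database route, here is a minimal sketch using pandas’ to_sql with a local SQLite file; "lighthouse.db" and "lighthouse_scores" are placeholder names.

import sqlite3

# Sketch: store the run's results in a local SQLite database instead of (or alongside) the CSV.
conn = sqlite3.connect("lighthouse.db")
df.to_sql("lighthouse_scores", conn, if_exists="append", index=False)
conn.close()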
Output
Automating the Scan
If your Lighthouse script is working well when you run it manually, it’s time to automate it. Luckily, Linux already supplies us with a solution: crontab. The crontab stores entries for scripts and lets you dictate when to execute them (like a scheduler). You have lots of flexibility in how you schedule your script (any time of day, day of the week, day of the month, etc.).
But first, if you are going this route, add a shebang to the very top of your script; it tells Linux to run the script using Python 3:
#!/usr/bin/python3
Now back to the crontab! To open it and add entries to the crontab, run this command:
crontab -e
It will likely open the crontab file in the vi editor. On a blank line at the bottom of the file, add the line below, which runs the script at midnight every Sunday. To change the schedule, use an online cron schedule editor. Customize with your path to the script.
0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py
If you want to create a log file to record each time the script ran, you can use this instead. Customize with your path to the script.
0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py > PATH_TO_FILE/FILENAME.log 2>&1
Save the crontab file and you’re good to go! Just note, your computer needs to be on at the time the cronjob is set to run.
Conclusion
Lighthouse has become a standard tool for SEOs to understand their pages’ well-being. It’s time to level up our usage of Lighthouse and move beyond DevTools, and automation and granular customization using Python is one great way to achieve that. Level up this script by inserting the data into a database or using the Google Sheets API to add results to an existing sheet, as sketched below! Please follow me on Twitter for feedback and to showcase interesting ways you extend the script. Enjoy!
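As one example, a minimal sketch of the Google Sheets idea using the gspread library might look like the following; the credentials file and sheet name are placeholders, and the sheet would need to be shared with your service account’s email address first.

import gspread

# Sketch: append the results to an existing Google Sheet via gspread.
# "service_account.json" and "Lighthouse Scores" are placeholders; the sheet must be
# shared with the service account's email address.
gc = gspread.service_account(filename="service_account.json")
worksheet = gc.open("Lighthouse Scores").sheet1
worksheet.append_rows(df.values.tolist())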
Looking for something more comprehensive? See Hamlet Batista’s BrightonSEO slides on Automating Lighthouse on a big scale!