Python automate lighthouse

Automate Google Lighthouse with Python for SEO Reports

Estimated Read Time: 7 minute(s)
Common Topics: data, offset, lighthouse, file, enlighter

Google’s web page scanner Lighthouse has been a fixture among the most important tools for evaluating web pages. At a high level, this scanner measures a page’s performance, SEO, accessibility, and best practices. At a deeper level, it provides granular metrics for each category and displays recommendations. Most SEOs are familiar with running Lighthouse within Google Chrome’s DevTools. Running Lighthouse in the browser is easy and convenient, but it is manual and difficult to scale. What if you want to run Lighthouse on multiple pages daily? Python, as usual, comes to the rescue. This tutorial provides the bare bones needed to set up automated Lighthouse scanning. You can easily extend it for your own purposes.

Key Points

  1. Google‘s web page scanner Lighthouse measures page performance, SEO, accessibility, and best practices
  2. Python can be used to automate the use of Lighthouse
  3. Requirements and Assumptions include Python 3, access to Linux or Google Colab, and Lighthouse 6.4.1
  4. Install the Lighthouse CLI and import the required modules
  5. Create an empty dataframe and support variables
  6. Loop through the list of URLs and run Lighthouse
  7. Process the JSON report, open the file, and grab the highlevel ratings
  8. Add ratings to the dataframe, export to CSV, and automate the scan with crontab

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax understood
  • Access to a Linux installation (I recommend Ubuntu) or Google Colab. I don’t recommend Google Colab for this, as performance is not great.
  • This script has been tested on Lighthouse 6.4.1, which was current as of this tutorial’s publishing date.

Starting the Script

First, install the lighthouse CLI from npm, using the node package manager. It is easy to install; be sure to read the documentation and consider the available options. An alternative to running Lighthouse locally is to use the PageSpeed Insights API. If you are running from Google Colab, put an exclamation point at the beginning; otherwise, enter the following in your terminal:

npm install -g lighthouse

Next, we import our required modules.

  • json: to process the JSON output from Lighthouse
  • os: to execute the local Lighthouse CLI
  • pandas: for storing the results and exporting to CSV
  • datetime: used in naming the JSON file
import json
import os
import pandas as pd
from datetime import datetime

Create Dataframe

We’re going to store each URL’s high-level ratings in a dataframe. Let’s set up that empty dataframe to use later. Note: there are dozens of metrics you can pull from these reports. Open the JSON output after running Lighthouse to see the available metrics, then add columns to the dataframe and adjust the JSON extraction accordingly.

df = pd.DataFrame([], columns=['URL','SEO','Accessibility','Performance','Best Practices'])

Create support variables

Now we set up a few simple variables we’ll use throughout. We’ll use the name and getdate variables to name the output file, and the URL list is what we’ll loop through and scan. You have two options for populating the list of URLs to scan: put them directly in a list in the code, or import them from a CSV (for example, a Screaming Frog crawl file).

name = "RocketClicks" 
getdate = datetime.now().strftime("%m-%d-%y")

urls = ["https://www.rocketclicks.com","https://www.rocketclicks.com/seo/","https://www.rocketclicks.com/ppc/"]

Use the code below if you are importing from a crawl file. Change YOUR_CRAWL_CSV to the path/name of your crawl CSV file. Then convert the dataframe to a Python list.

df_urls = pd.read_csv("YOUR_CRAWL_CSV.csv")[["Address"]]
urls = df_urls.values.tolist()

Run Lighthouse

Now it’s time to loop through the list of URLs and run Lighthouse. We’ll use the os Python module to execute Lighthouse via the CLI. Be sure to check the docs for details on available options. Also, change the output path to match your local environment.

for url in urls:    
    stream = os.popen('lighthouse --quiet --no-update-notifier --no-enable-error-reporting --output=json --output-path=YOUR_LOCAL_PATH'+name+'_'+getdate+'.report.json --chrome-flags="--headless" ' + url)

Because the script launches an external application, we need to pause and wait for Lighthouse to finish. I have found 2 minutes (120 seconds) works for most pages; tweak this as needed if you get an error that the JSON output file doesn’t exist. Alternatively, write a loop that waits until the output file exists before continuing. Once the pause is over and Lighthouse has likely finished, build the full path to the file so we can process it in the next snippet. Be sure to change “YOUR_LOCAL_PATH”.

    time.sleep(120)
    print("Report complete for: " + url)

    json_filename = 'YOUR_LOCAL_PATH' + name + '_' + getdate + '.report.json'

Process Report

Now let’s open that JSON report file and start processing it.

    with open(json_filename) as json_data:
        loaded_json = json.load(json_data)

As mentioned earlier, there is a lot of data in this report file; go through it and pick the fields you want to store. For this tutorial, we’re grabbing the high-level ratings for the four main categories. These scores are out of 100. Lighthouse records them as floats from 0.00 to 1.00, so we multiply by 100 (1.00 equals 100%).

   
    seo = str(round(loaded_json["categories"]["seo"]["score"] * 100))
    accessibility = str(round(loaded_json["categories"]["accessibility"]["score"] * 100))
    performance = str(round(loaded_json["categories"]["performance"]["score"] * 100))
    best_practices = str(round(loaded_json["categories"]["best-practices"]["score"] * 100))

Add Data to Dataframe

Now we take those high-level ratings, put them in a dictionary, and append them to the dataframe. Each URL will be added to this dataframe. After this, the loop continues to the next URL (if any).

    dict = {"URL":url,"SEO":seo,"Accessibility":accessibility,"Performance":performance,"Best Practices":best_practices}
    df = df.append(dict, ignore_index=True).sort_values(by='SEO', ascending=False)

Finally, we have all ratings in the dataframe. From here you can perform further manipulation, pipe results into another script, or store them in a database. In this tutorial, we’ll simply export the data to a CSV file for use in Google Sheets or Excel. Be sure to replace “SAVE_PATH”.

df.to_csv(SAVE_PATH/'lighthouse_' + name + '_' + getdate + '.csv')
print(df)

Output

lighthouse results in excel

Automating the Scan

If your Lighthouse script runs correctly when executed manually, it’s time to automate it. Linux provides a built-in scheduler via crontab. The crontab stores scheduled entries and lets you dictate when to execute them (time of day, day of week, day of month, etc.).

But first, add a shebang to the very top of your script; it tells Linux to run the script using Python 3:

#!/usr/bin/python3

To edit the crontab, run this command:

crontab -e

It will likely open the crontab file in the vi editor. On a blank line at the bottom of the file, type the following line. This will run the script at midnight every Sunday. To change the schedule, use the cronjob time editor. Customize the path to your script.

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py

If you want a log file to record each run, use this variant (customize paths):

0 0 * * SUN /usr/bin/python3 PATH_TO_SCRIPT/filename.py > PATH_TO_FILE/FILENAME.log 2>&1

Save the crontab file and you’re set. Note: your computer must be on when the cronjob is scheduled to run.

Conclusion

Lighthouse has become a standard tool for SEOs to understand their pages’ well-being. It’s time to level up your use of Lighthouse and move beyond DevTools. Automation and targeted customization using Python are great ways to achieve that. Extend this script by inserting data into a database or use the Google Sheets API to add results to a sheet. Please follow me on Twitter for feedback and to see interesting extensions of the script. Enjoy!

Looking for something more comprehensive? See Hamlet Batista’s BrightonSEO slides on Automating Lighthouse for large-scale approaches.

Google Lighthouse and Python FAQ

How can Python be utilized to automate Lighthouse reports for SEO purposes?

Python scripts can automate Lighthouse reports by invoking the Lighthouse CLI or interacting with the Lighthouse API, giving SEOs a repeatable workflow for performance analysis and reporting.

What Python libraries are commonly used for making API requests to Lighthouse and automating reports?

Common libraries include requests for HTTP interactions and pandas for structuring and exporting results. The tutorial also uses os and json for executing the CLI and parsing output.

What specific information can be obtained from Lighthouse reports using Python?

Python scripts can extract a wide range of data from Lighthouse reports, including performance metrics, accessibility scores, SEO scores, best-practices results, and PWA-related information.

Are there any considerations or limitations to be aware of when automating Lighthouse reports with Python?

Consider API rate limits, authentication requirements, and variability in report data. Ensure your automation respects Lighthouse and PageSpeed Insights usage policies.

Where can I find examples and documentation for using Python to automate Lighthouse reports for SEO?

Refer to the official Lighthouse and PageSpeed Insights documentation for examples and API details. Additionally, search online tutorials and Python reference material for practical implementations and sample scripts.

Greg Bernhardt
Follow me