Monitor robots.txt Changes with Python and Difflib
Robots.txt is a useful tool for SEOs to control crawling by spiders. However, it is sensitive: a simple mistake can cause significant damage. When collaborating with an SEO team or a client’s developers, someone might tinker where they shouldn’t, or accidentally make a change. Damage is easier to mitigate if caught early. Use the Python tutorial below to detect Robots.txt changes quickly.
Requirements and Assumptions
- Python 3 is installed and you understand basic Python syntax
- Access to a Linux installation (I recommend Ubuntu) or to Google Colab
Starting the Script
First, install the fake_useragent module to use in the request header. This can help avoid light abuse detection. If you are using Google Colab, prefix pip3 with an exclamation mark (for example, !pip3).
pip3 install fake_useragent
Now we import the needed modules. The os module is imported for file handling, although the steps below rely on the built-in open(). re searches the response for signs of blocking. requests grabs the robots.txt content. fake_useragent provides the user agent for the request. time controls script delays. difflib compares two pieces of content.
import os
import re
import requests
from fake_useragent import UserAgent
import time
import difflib
Assign Variables
First, set some variables. We’ll set a name to use in our file names and a URL for the site where we want to test the robots.txt file. Then we set up the fake user agent and replace any space in the name with an underscore. The script stores the current robots.txt, the previously stored robots.txt, and a changes HTML file.
name="pf"
url="https://www.physicsforums.com"
ua = UserAgent()
header = {"user-agent": ua.chrome}
name = name.replace(" ","_")
oldrobots = "old_robots_" + name + ".txt"
newrobots = "new_robots_" + name + ".txt"
changes = "changes_" + name + ".html"
Grab the current robots.txt file
Request the current robots.txt file and save it locally. This will overwrite any existing file, ensuring the local copy is up to date. Note the forward slash in the request (url + “/robots.txt”); make sure your url variable does not already end with a slash.
getrobotstxt = requests.get(url + "/robots.txt", headers=header, verify=True)
# Save the response body as the current copy, overwriting any earlier file
with open(newrobots, "wb") as f:
    f.write(getrobotstxt.content)
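One way to sidestep the trailing-slash issue is to normalize the URL before making the request. The sketch below uses a hypothetical helper, build_robots_url, which is not part of the original script; it strips any trailing slash before appending the path.

```python
def build_robots_url(site_url):
    """Return the robots.txt URL for a site, tolerating a trailing slash."""
    # rstrip("/") removes any trailing slashes, so "https://example.com"
    # and "https://example.com/" produce the same result.
    return site_url.rstrip("/") + "/robots.txt"


print(build_robots_url("https://www.physicsforums.com/"))
```

With a helper like this, the request could take build_robots_url(url) instead of url + "/robots.txt".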
Now open the previously stored robots.txt file if it exists. If you haven’t run this for the site yet, the file won’t exist and the script will create it using the current file, so the two will match on the first run. On subsequent runs the script becomes useful when the file has changed.
try:
    get_old_robotstxt = open(oldrobots, "r")
except FileNotFoundError:
    # First run for this site: seed the stored copy with the current file
    open(oldrobots, "wb").write(getrobotstxt.content)
    time.sleep(1)
    get_old_robotstxt = open(oldrobots, "r")
get_old_robotstxt_contents = get_old_robotstxt.read()
Next we load the current robots.txt file into a variable so we can compare it to the previous copy.
get_new_robotstxt = open(newrobots, "r")
get_new_robotstxt_contents = get_new_robotstxt.read()
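If you prefer to avoid try/except, the "open the stored copy or create it" step can also be written as a small function. This is only a sketch: load_or_seed is a hypothetical name, and it assumes the robots.txt content is available as text (for example, the response's .text attribute).

```python
import os


def load_or_seed(path, current_text):
    """Return the stored copy at path, creating it from current_text if absent."""
    if not os.path.exists(path):
        # First run: seed the stored copy so the first comparison matches.
        with open(path, "w", encoding="utf-8") as f:
            f.write(current_text)
    # The with block closes the file handle once the contents are read.
    with open(path, "r", encoding="utf-8") as f:
        return f.read()
```

On the first call the function returns current_text unchanged; on later calls it returns whatever was stored, even if current_text has changed since.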
Compare current robots.txt to old
Compare the two files at a high level before further processing. This step can also detect server-side blocking: sites using the Apache module ModSecurity, for example, may block the robots.txt request when it appears non-human, so the script searches the response for the word "Security". If the files match, the script prints a message and has nothing further to do; if they differ, it goes on to process the differences.
blocked = re.search("Security",getrobotstxt.text)
if blocked:
    print("blocked by modsecurity")
elif get_old_robotstxt_contents == get_new_robotstxt_contents:
    print(name + " Robots.txt Match")
else:
    print(name + " Robots.txt don't match")
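The same three-way check can be wrapped in a function that returns a status instead of printing, which makes it easier to reuse. This is a sketch with a hypothetical name, compare_status, and it keeps the same "Security" keyword test as the script above.

```python
import re


def compare_status(old_text, new_text, response_text):
    """Classify a check as 'blocked', 'match', or 'changed'."""
    # Same heuristic as the script: a response mentioning "Security"
    # is treated as a ModSecurity block page.
    if re.search("Security", response_text):
        return "blocked"
    if old_text == new_text:
        return "match"
    return "changed"


print(compare_status("Disallow: /a", "Disallow: /b", "Disallow: /b"))
```

Keep in mind the keyword test is a rough heuristic: a legitimate robots.txt that happens to contain the word "Security" (in a path, for instance) would also be reported as blocked.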
Next, split each file into a list of lines so you can compare corresponding lines. In my testing splitlines() works best when both inputs are the same type (both from a file or both from a string). Since the previous robots.txt is stored on disk, we also write and read the current robots.txt from a file.
try:
    oldrobot = get_old_robotstxt_contents.splitlines()
    newrobot = get_new_robotstxt_contents.splitlines()
except Exception:
    # Fall back to an empty previous version if the old contents can't be split
    oldrobot = []
    newrobot = get_new_robotstxt_contents.splitlines()
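To see what splitlines() does with line endings, here is a small standalone illustration using made-up robots.txt content. It is not part of the script; it only shows that the resulting line lists carry no line-ending characters.

```python
# The same two rules, once with Windows (\r\n) and once with Unix (\n) endings.
windows_text = "User-agent: *\r\nDisallow: /admin\r\n"
unix_text = "User-agent: *\nDisallow: /admin\n"

# The raw strings differ because of the line-ending characters...
print(windows_text == unix_text)  # False

# ...but splitlines() drops the line endings, so the line lists are equal.
print(windows_text.splitlines() == unix_text.splitlines())  # True
```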
If the current and previous files differ, feed the two line lists into difflib.HtmlDiff() to build an HTML comparison table. The script writes that table with embedded CSS to an HTML file you can open in a browser to review the differences.
if get_new_robotstxt_contents != get_old_robotstxt_contents:
    difftable = difflib.HtmlDiff(wrapcolumn=95).make_table(oldrobot,newrobot)
    open(changes, "w").write("<html><head><style>.diff_add{background-color:#cf9}.diff_sub{background-color:#fcf}.diff_next{display:none}.diff_chg{background-color:#fc9}.diff td{overflow-wrap:break-word;font-size:14px;overflow:auto;font-family:Consolas}.diff table{overflow:auto}td.diff_header{width:10px;padding-left:0;padding-right:0;text-align:center;font-weight:700}table{padding:5px}</style></head><body>" + difftable + "</body></html>")
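To try HtmlDiff without touching any files, you can feed it two short in-memory lists. The sample rules and the column labels "Previous" and "Current" below are made up for illustration; fromdesc and todesc are optional make_table() arguments that label the two columns.

```python
import difflib

# Made-up sample data: the new version adds one Disallow rule.
old_lines = ["User-agent: *", "Disallow: /admin"]
new_lines = ["User-agent: *", "Disallow: /admin", "Disallow: /private"]

# make_table() returns only the <table> markup, not a full HTML page.
table = difflib.HtmlDiff(wrapcolumn=95).make_table(
    old_lines, new_lines, fromdesc="Previous", todesc="Current"
)

# Peek at the start of the generated markup.
print(table[:200])
```

The added line is wrapped in a span with the diff_add class, which is what the CSS in the script colors green.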
Example Output
Pink = removed
Green = added
Orange = changed

Conclusion
Managing and monitoring your robots.txt file is important; this tutorial should give you a practical starting point. You can extend the script by storing the files in a database and by setting up an email notification if there is a change. I provide examples in other tutorials. Try it out, and follow me on Twitter to share your applications and ideas.
For robots.txt testing with Python, see advertools.
Difflib robots.txt FAQ
How can Python Difflib be used to detect and display changes in robots.txt files?
Use Python’s Difflib module to compare different versions of robots.txt and highlight the differences between them.
Is Difflib suitable for detecting changes in other types of files, or is it specific to robots.txt?
Difflib is versatile and applies to many kinds of text files, not just robots.txt. It compares sequences of lines and highlights differences.
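As a quick illustration on plain text, difflib.unified_diff() produces a text diff rather than an HTML table. The sample lines and the "old" and "new" labels below are placeholders.

```python
import difflib

# Placeholder content: one rule changes between versions.
old_lines = ["User-agent: *", "Disallow: /admin"]
new_lines = ["User-agent: *", "Disallow: /private"]

# lineterm="" because the input lines carry no trailing newlines.
diff_lines = list(
    difflib.unified_diff(old_lines, new_lines, fromfile="old", tofile="new", lineterm="")
)

for line in diff_lines:
    print(line)
```

Removed lines are prefixed with "-" and added lines with "+", which is handy for logs or plain-text notifications.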
Are there specific considerations for using Python Difflib with robots.txt files?
No special considerations are required. Difflib handles robots.txt like any other text file and provides insight into modifications across versions.
Can Python scripts utilizing Difflib be automated to check for robots.txt changes regularly?
Yes. Python scripts can be automated to check robots.txt files regularly using Difflib, enabling proactive monitoring.
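A minimal sketch of in-process scheduling is below. run_checks and its check argument are hypothetical names; check stands in for whatever function performs one robots.txt comparison. An external scheduler such as cron is a common alternative to a loop like this.

```python
import time


def run_checks(check, interval_seconds, max_runs):
    """Call check() max_runs times, pausing interval_seconds between calls."""
    results = []
    for run in range(max_runs):
        results.append(check())
        # No need to wait after the final run.
        if run < max_runs - 1:
            time.sleep(interval_seconds)
    return results


# Stand-in check that always reports a match, run 3 times with no delay.
print(run_checks(lambda: "match", 0, 3))
```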
Where can I find examples and documentation for implementing Python Difflib for robots.txt changes?
See Python’s official Difflib documentation and online tutorials that demonstrate using Difflib to detect and display changes in robots.txt files.







