Use Python to Scrape Technical Info for Domains
SEOs wear many hats. During a technical audit or troubleshooting, it’s useful to have a domain’s public technical information on hand. Below are some Python tools you can use to fetch that domain information. You can easily loop this over your client list—using a Python list or a database—and automate it to run every morning so you always have fresh data.
Install Modules
!pip3 install python-whois
!pip3 install dnspython
!pip3 install pyOpenSSL
Note that the whois.whois() call used below comes from the python-whois package. The similarly named whois package on PyPI exposes a different interface and depends on a whois client installed on your system. Windows and Google Colab do not include one by default, so that route is best run on Linux distributions such as Ubuntu. To ensure the whois client is installed and up to date on Ubuntu, run the commands below.
sudo apt-get update
sudo apt-get install whois
Import Modules
import whois
import json
import requests
import re
import socket
import dns.resolver
import ssl
import OpenSSL
After installing and importing the modules, set the domain variable to the domain you want to query.
domain="rocketclicks.com"
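To run the lookups in this post across a client list rather than a single domain, one option is to wrap each lookup in a function and loop over the list. The sketch below is only an assumption about how that could be organized, not part of the original script: collect_domain_info and lookup_ip are hypothetical names, and the only lookup shown is the socket.gethostbyname call used later in this post.

```python
import socket


def lookup_ip(domain):
    """Resolve a domain to a single IPv4 address."""
    return socket.gethostbyname(domain)


def collect_domain_info(domains, lookup=lookup_ip):
    """Run one lookup per domain and keep going if a domain fails."""
    results = {}
    for domain in domains:
        try:
            results[domain] = lookup(domain)
        except Exception as exc:
            # One bad domain should not stop the whole morning run
            results[domain] = f"n/a ({type(exc).__name__})"
    return results
```

Calling collect_domain_info(["rocketclicks.com", "example.com"]) would return a dictionary keyed by domain. Passing a different function as lookup lets the same loop drive any of the other checks.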
Get MX Records
mailservers = ""
for x in dns.resolver.resolve(domain, 'MX'):
    mailservers += x.to_text() + "\n"
print(mailservers)
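Each MX record in that output has the text form "preference exchange", and mail is delivered to the lowest preference value first. As a minimal sketch, assuming you want the records in delivery order, the hypothetical helper below parses those strings with plain Python. The sample hostnames are made up for illustration.

```python
def sort_mx_records(mx_lines):
    """Parse 'preference exchange' strings and sort by preference, lowest first."""
    parsed = []
    for line in mx_lines:
        parts = line.split()
        # Skip anything that is not exactly 'preference exchange'
        if len(parts) != 2 or not parts[0].isdigit():
            continue
        preference, exchange = parts
        parsed.append((int(preference), exchange.rstrip(".")))
    return sorted(parsed)


# Made-up sample in the same format that x.to_text() produces
records = ["20 alt1.mail.example.com.", "10 mail.example.com."]
for preference, exchange in sort_mx_records(records):
    print(preference, exchange)
```

With the loop above, mailservers.splitlines() gives a list in the format this helper expects.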
Get WHOIS Records
There is a lot of information available. Uncomment print(w) to view the JSON response and select the fields you need.
w = whois.whois(domain)
#print(w)
registrar = w['registrar']
expiration_date = w['expiration_date']
print(registrar)
print(expiration_date)
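For monitoring, the expiration date is often more useful as a countdown. The sketch below assumes the value may arrive as a single datetime, a list of datetimes, or something unusable, since WHOIS responses vary by registry. days_until_expiry is a hypothetical helper name, and time zones are dropped on the assumption that a day's precision is enough.

```python
from datetime import datetime


def days_until_expiry(expiration_date, now=None):
    """Return whole days until the earliest expiration date, or None if unknown."""
    if isinstance(expiration_date, (list, tuple)):
        candidates = expiration_date
    else:
        candidates = [expiration_date]
    # Keep only real datetimes and drop tzinfo so all values can be compared
    dates = [d.replace(tzinfo=None) for d in candidates if isinstance(d, datetime)]
    if not dates:
        return None
    now = now or datetime.now()
    return (min(dates) - now).days
```

You could then call days_until_expiry(expiration_date) and flag any domain whose result falls below a threshold you choose, such as 30.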
Get Domain IP
domainip = socket.gethostbyname(domain)
print(domainip)
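Note that socket.gethostbyname returns a single IPv4 address. If a domain sits behind several addresses or also publishes IPv6, a sketch like the following, built on the standard library's socket.getaddrinfo, can list them all. resolve_all is a hypothetical helper name, and returning an empty list on failure is an assumption about how you want errors handled.

```python
import socket


def resolve_all(host):
    """Return every distinct IPv4/IPv6 address the resolver reports for host."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # name could not be resolved
    addresses = []
    for _family, _type, _proto, _canonname, sockaddr in infos:
        ip = sockaddr[0]  # first element is the address for both IPv4 and IPv6
        if ip not in addresses:
            addresses.append(ip)
    return addresses
```

Calling resolve_all(domain) returns a list of address strings, which may be empty if the lookup fails.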
Get NameServers
dnsrecords = ""
getresolver = dns.resolver.Resolver()
getns = getresolver.resolve(domain, "NS")
for rdata in getns:
    dnsrecords += str(rdata) + "\n"
print(dnsrecords)
Get Text Records (SPF)
A try/except block is needed because not all domains have TXT records. See the SPF/DMARC module to extend this with validation and warnings.
textrecords = ""
getresolver = dns.resolver.Resolver()
try:
    gettext = getresolver.resolve(domain, "TXT")
    for rdata in gettext:
        textrecords += str(rdata) + "\n"
except Exception:
    # No TXT records, or the lookup failed
    textrecords = "n/a"
print(textrecords)
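The TXT output mixes SPF with verification tokens and other records. As a rough sketch, assuming each record is a string of one or more double-quoted chunks (which is how the records print above), the hypothetical find_spf helper below joins the chunks and keeps only values that begin with v=spf1. The sample values are made up.

```python
import re


def find_spf(txt_records):
    """Return the TXT values that look like SPF policies (start with 'v=spf1')."""
    spf = []
    for record in txt_records:
        # A long TXT record may be split into several quoted chunks; rejoin them
        chunks = re.findall(r'"([^"]*)"', record)
        value = "".join(chunks) if chunks else record.strip()
        if value.lower().startswith("v=spf1"):
            spf.append(value)
    return spf


# Made-up sample records in the quoted form that str(rdata) produces
sample = ['"google-site-verification=abc123"', '"v=spf1 include:_spf.example.com ~all"']
print(find_spf(sample))
```

With the code above, find_spf(textrecords.splitlines()) would do the filtering. A domain is expected to publish only one SPF record, so more than one result is worth flagging.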
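Get Server Response Header Info
# Build the URL to request; assumes the site is served over HTTPS
url = "https://" + domain
response = requests.head(url, verify=True, timeout=10)
header = dict(response.headers)
headerinfo = ""
for key, value in header.items():
    headerinfo += key + ': ' + value + "\n"
print(headerinfo)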
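Once the headers are in a dictionary, a common next step in an audit is checking which security headers are absent. The sketch below is one way to do that with plain Python. The list of expected headers is an assumption chosen for illustration, not a standard, and missing_security_headers is a hypothetical name.

```python
# Assumed checklist for illustration; adjust it to your own audit standards
EXPECTED_HEADERS = [
    "Strict-Transport-Security",
    "Content-Security-Policy",
    "X-Content-Type-Options",
    "X-Frame-Options",
]


def missing_security_headers(headers, expected=EXPECTED_HEADERS):
    """Return the expected header names not found in headers (case-insensitive)."""
    present = {name.lower() for name in headers}
    return [name for name in expected if name.lower() not in present]


# Made-up sample standing in for dict(response.headers)
sample = {"Server": "nginx", "strict-transport-security": "max-age=31536000"}
print(missing_security_headers(sample))
```

Passing the header dictionary from the request above gives the same kind of report for a live site.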
TLS Version and Certificate Info
# Fetch the certificate and pull out its expiry date and issuer
try:
    cert = ssl.get_server_certificate((domain, 443))
    x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
    # get_notAfter() returns bytes in the form YYYYMMDDhhmmssZ
    expobj = str(x509.get_notAfter())
    expiredate = re.search(r"[0-9]{8}", expobj)
    date1 = expiredate.group(0)
    datey = date1[0:4]
    datem = date1[4:6]
    dated = date1[6:8]
    date = datem + "-" + dated + "-" + datey
    issueobj = str(x509.get_issuer())
    issuer = re.search(r"CN=[a-zA-Z0-9\s'-]+", issueobj)
    issuer1 = issuer.group(0).replace("'", "")
    print(issuer1)
    sslinfo = "Expiry Date: " + date + " \n Issuer: " + issuer1
except Exception:
    sslinfo = "n/a"
# Connect over TLS and record the negotiated protocol version
hostname = domain
context = ssl.create_default_context()
try:
    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            tls = ssock.version()
            tls = tls.replace("TLSv", "")
except Exception:
    tls = "0"
print(sslinfo)
print(tls)
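The expiry date above is formatted for display. For alerting, a day count is easier to compare. As a minimal sketch, assuming the timestamp is in the YYYYMMDDhhmmssZ form that get_notAfter() returns, the hypothetical cert_days_remaining helper parses it with the standard library instead of a regular expression.

```python
from datetime import datetime, timezone


def cert_days_remaining(not_after, now=None):
    """Days until a certificate's notAfter timestamp (format YYYYMMDDhhmmssZ)."""
    if isinstance(not_after, bytes):
        not_after = not_after.decode("ascii")
    # The trailing Z means UTC, so attach that time zone explicitly
    expires = datetime.strptime(not_after, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```

Calling cert_days_remaining(x509.get_notAfter()) returns a negative number once the certificate has expired.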
Conclusion
Now you have a framework for collecting a domain’s public technical information, and there’s plenty of potential to expand. If you discover additional modules or methods for scraping domain technical information, let me know and I’ll add them to this list. In a follow-up post I’ll cover checking blacklists, reverse IPs, and detecting a site’s technologies. Stay tuned. Try this out, and follow me on Twitter to share your Python SEO applications and ideas!
Python and Domain Info FAQ
How can Python be employed to scrape technical information for domains?
You can use Python libraries such as dnspython, a WHOIS module, pyOpenSSL, and requests, along with the built-in socket and ssl modules, to pull a domain’s DNS records, registration details, certificate data, and response headers.
Are there specific Python libraries commonly used for web scraping technical information from domains?
Common choices include dnspython for DNS lookups, a WHOIS module for registration data, pyOpenSSL for certificate parsing, and requests for HTTP response headers.
What technical details can be scraped for domains using Python?
Python scripts can gather DNS records, SSL certificate details, server headers, and other data associated with a domain.
Are there any ethical considerations or legal implications when scraping technical information for domains?
Adhere to ethical scraping practices, respect site terms of service, and comply with applicable laws when extracting domain technical information.
Where can I find examples and documentation for using Python to scrape technical information for domains?
See online tutorials and the documentation for dnspython, pyOpenSSL, requests, and Python’s built-in socket and ssl modules for examples and guidance on extracting domain technical details.