Use Python to Scrape Technical Info for Domains

SEOs wear many hats. From time to time, whether during a technical audit or while troubleshooting, it’s nice to have a domain’s public technical information handy. Below are some Python tools you can use to grab that information. It would be easy to loop this over your entire client list, stored in a Python list or a database, and automate it to run every morning so you always have the freshest information at your disposal.
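The idea of looping over a client list could be sketched roughly as follows. This is only an illustration: `collect_domain_info` and the `lookups` mapping are hypothetical names that do not appear in the original post, and each lookup is assumed to be a function that takes a domain string and returns a value.

```python
def collect_domain_info(domains, lookups):
    """Run every lookup against every domain and collect the results.

    `lookups` maps a label (for example "ip") to a function taking a domain string.
    A failing lookup is recorded as "n/a" so one bad domain does not stop the run.
    """
    report = {}
    for domain in domains:
        row = {}
        for label, lookup in lookups.items():
            try:
                row[label] = lookup(domain)
            except Exception:
                row[label] = "n/a"
        report[domain] = row
    return report
```

For example, `collect_domain_info(["rocketclicks.com"], {"ip": socket.gethostbyname})` would return one dictionary per domain. Scheduling the run every morning (with cron, for instance) is left out of this sketch.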

Install Modules

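!pip3 install python-whois

!pip3 install dnspython

!pip3 install pyOpenSSL

!pip3 install requests

Note that the whois.whois() call used below comes from the python-whois package; the similarly named whois package on PyPI exposes a different interface and wraps the Whois command-line app. If your setup depends on a Whois app on your computer, be aware that Windows does not inherently have one, and neither does Google Colab. This is best run on Linux, such as Ubuntu. To make sure Whois is installed and updated on Ubuntu, run the commands below.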

sudo apt-get update
sudo apt-get install whois
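
If your setup relies on a system Whois client, a quick check from Python before running anything else might look like this. `has_system_whois` is a hypothetical helper name; `shutil.which` comes from the standard library.

```python
import shutil

def has_system_whois():
    """Return True if a `whois` executable is found on the PATH."""
    return shutil.which("whois") is not None

if not has_system_whois():
    print("No system whois client found on PATH.")
```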

Import Modules

import whois
import json
import requests
import re
import socket
import dns.resolver
import ssl
import OpenSSL

Now that the modules have been installed and imported, we need to set the domain variable, which holds the domain you want to look up.

domain="rocketclicks.com"
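
The lookups in this post expect a bare hostname rather than a full URL. If your client list holds URLs, a small normalizer along these lines may help. `to_domain` is a hypothetical helper, and this sketch assumes the input is either a URL or a plain host string.

```python
from urllib.parse import urlsplit

def to_domain(value):
    """Reduce a URL or host string to a bare lowercase hostname."""
    value = value.strip()
    # urlsplit only fills in the hostname when a scheme (or "//") is present.
    if "//" not in value:
        value = "//" + value
    host = urlsplit(value).hostname or ""
    return host.rstrip(".")
```

With this in place, `domain = to_domain("https://rocketclicks.com/about/")` would leave `domain` set to the hostname only.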

Get MX Records

mailservers = "" 
for x in dns.resolver.resolve(domain, 'MX'): 
    mailservers += x.to_text() + "\n"
print(mailservers)
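
Each MX line printed above has the form "preference hostname.", where a lower preference number means the server is tried first. A minimal sketch for turning that text into sorted pairs follows; `parse_mx` is a hypothetical helper that works on the `mailservers` string and does not call dnspython itself.

```python
def parse_mx(mailservers):
    """Turn lines like '10 mail.example.com.' into sorted (priority, host) tuples."""
    records = []
    for line in mailservers.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        priority, host = parts
        records.append((int(priority), host.rstrip(".").lower()))
    return sorted(records)
```

Calling `parse_mx(mailservers)[0]` would then give the primary mail server, assuming at least one MX record was returned.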

Get WHOIS Records

Note that there is a lot of information you can grab. Uncomment print(w) to see the full JSON-style response and pick out what you want.

w = whois.whois(domain)

#print(w)

registrar = w['registrar']
expiration_date = w['expiration_date']

print(registrar)
print(expiration_date)
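
For a morning report, the raw expiration date is often less useful than the number of days left. The sketch below assumes the WHOIS module returns either a single datetime, a list of datetimes, or None for this field, which is worth verifying against your own output. `days_until_expiry` is a hypothetical helper.

```python
from datetime import datetime, timezone

def days_until_expiry(expiration_date, now=None):
    """Return whole days until the earliest expiration date, or None if unknown."""
    if expiration_date is None:
        return None
    # Some registries yield several dates; treat a single value as a list of one.
    dates = expiration_date if isinstance(expiration_date, list) else [expiration_date]
    dates = [d for d in dates if isinstance(d, datetime)]
    if not dates:
        return None

    # Normalize to timezone-aware UTC so naive and aware values can be compared.
    def as_utc(d):
        return d.replace(tzinfo=timezone.utc) if d.tzinfo is None else d.astimezone(timezone.utc)

    now = as_utc(now or datetime.now(timezone.utc))
    return (min(as_utc(d) for d in dates) - now).days
```

Something like `days_until_expiry(expiration_date)` could then feed a warning when the value drops below a threshold you choose.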

Get Domain IP

domainip = socket.gethostbyname(domain)
print(domainip)
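
`socket.gethostbyname` returns a single IPv4 address. If a domain sits behind several addresses or also has IPv6, the standard library's `socket.getaddrinfo` can list them all. A minimal sketch, with `resolve_all` as a hypothetical helper name:

```python
import socket

def resolve_all(host):
    """Return every distinct IPv4/IPv6 address the resolver reports for host."""
    infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})
```

Like `gethostbyname`, this raises `socket.gaierror` when the name cannot be resolved, so wrap the call in a try/except if you loop over many domains.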

Get NameServers

dnsrecords=""
getresolver = dns.resolver.Resolver() 
getns = getresolver.resolve(domain, "NS") 
for rdata in getns:
    dnsrecords += str(rdata) + "\n"
print(dnsrecords)
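
If you run this daily, it can be handy to flag nameserver changes between runs. The sketch below compares two newline-separated listings such as `dnsrecords`; `nameserver_changes` is a hypothetical helper, and how you store yesterday's output (file, database) is up to you.

```python
def nameserver_changes(previous, current):
    """Compare two newline-separated NS listings; return (added, removed) as sorted lists."""
    def normalize(text):
        # Ignore case, surrounding whitespace, and the trailing dot of absolute names.
        return {line.strip().rstrip(".").lower() for line in text.splitlines() if line.strip()}

    before, after = normalize(previous), normalize(current)
    return sorted(after - before), sorted(before - after)
```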

Get Text Records (SPF)

You need the Try/Except because not all domains will have text records. See this SPF/DMARC module to extend this with validation and warning outputs.

textrecords = ""
getresolver = dns.resolver.Resolver()

try:
    gettext = getresolver.resolve(domain, "TXT")
    for rdata in gettext:
        textrecords += str(rdata) + "\n"
except dns.exception.DNSException:
    # No TXT records (or the lookup failed), so fall back to a placeholder
    textrecords = "n/a"
print(textrecords)
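
The TXT output usually mixes SPF with verification tokens and other records. A minimal sketch for pulling out just the SPF policy from the `textrecords` string follows. `find_spf` is a hypothetical helper; it assumes each record is printed as one or more double-quoted chunks per line, and it does not validate the policy.

```python
import re

def find_spf(textrecords):
    """Return the first SPF policy found in newline-separated TXT records, or None."""
    for line in textrecords.splitlines():
        # A long TXT record is presented as several quoted chunks; join them with no separator.
        chunks = re.findall(r'"((?:[^"\\]|\\.)*)"', line)
        value = "".join(chunks) if chunks else line.strip()
        lowered = value.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            return value
    return None
```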

Get Server Response Header Info

url = "https://" + domain
response = requests.head(url, verify=True)
header = dict(response.headers)
headerinfo = ""
for key, value in header.items():
    headerinfo += key + ': ' + value + "\n"
print(headerinfo)
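
Once the headers are converted with `dict(...)`, lookups become case-sensitive, and servers do not agree on capitalization. A small sketch for picking out a few headers of interest regardless of case follows. `pick_headers` and its default list of names are illustrative choices, not something from the original post.

```python
def pick_headers(header, wanted=("server", "x-powered-by", "strict-transport-security")):
    """Return the wanted headers from a plain dict, matching names case-insensitively."""
    lowered = {key.lower(): value for key, value in header.items()}
    return {name: lowered.get(name, "n/a") for name in wanted}
```

For instance, `pick_headers(header)["server"]` would give the reported web server software, or "n/a" if the header is absent.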

TLS Version and Certificate Info

# Fetch the server certificate and read its expiry date and issuer.
try:
    cert = ssl.get_server_certificate((domain, 443))
    x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)

    # get_notAfter() returns bytes that start with the date as YYYYMMDD.
    expobj = x509.get_notAfter().decode("ascii")
    expiredate = re.search(r"[0-9]{8}", expobj)

    date1 = expiredate.group(0)
    datey = date1[0:4]
    datem = date1[4:6]
    dated = date1[6:8]
    date = datem + "-" + dated + "-" + datey

    issueobj = str(x509.get_issuer())
    issurer = re.search(r"CN=[a-zA-Z0-9\s'-]+", issueobj)
    issurer1 = issurer.group(0).replace("'", "")
    print(issurer1)

    sslinfo = "Expiry Date: " + date + "\nIssuer: " + issurer1
except Exception:
    sslinfo = "n/a"

# Open a TLS connection to see which protocol version is negotiated.
hostname = domain
context = ssl.create_default_context()

try:
    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            tls = ssock.version().replace("TLSv", "")
except Exception:
    tls = "0"

print(sslinfo)
print(tls)
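
Instead of slicing the expiry string by hand, the value can be parsed into a datetime, which also makes it easy to compute the days remaining. This sketch assumes the certificate time arrives in the common "YYYYMMDDhhmmssZ" form (as bytes or str); other offset forms are not handled. `cert_days_left` is a hypothetical helper.

```python
from datetime import datetime, timezone

def cert_days_left(not_after, now=None):
    """Days until a certificate expires, given a time string like '20250101120000Z'."""
    if isinstance(not_after, bytes):
        not_after = not_after.decode("ascii")
    expires = datetime.strptime(not_after, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days
```

A call such as `cert_days_left(x509.get_notAfter())` could then drive a renewal reminder.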

Conclusion

Now you have the framework to begin gathering public technical information for your domains. If you find more modules and opportunities to scrape technical information for domains, please let me know and I’ll add them to this list! In a follow-up post, I’ll show how to check for blacklists, reverse IPs, and the technologies a website is using. Stay tuned! Now get out there and try it out! Follow me on Twitter and let me know your Python SEO applications and ideas!

Python and Domain Info FAQ

How can Python be employed to scrape technical information for domains?

Python can query public sources directly: DNS records with dnspython, registration data with a WHOIS module, response headers with requests, and certificate details with ssl and pyOpenSSL.

Are there specific Python libraries commonly used for web scraping technical information from domains?

The modules used in this post are a WHOIS module, dnspython (dns.resolver), requests, pyOpenSSL, and the standard library’s socket and ssl. HTML scraping libraries such as BeautifulSoup are only needed if you also want to parse page content.

What technical details can be scraped for domains using Python?

Python scripts can extract a range of technical details, including DNS records, SSL certificate information, server headers, and other relevant data associated with a domain.

Are there any ethical considerations or legal implications when scraping technical information for domains?

It’s crucial to adhere to ethical scraping practices, respect website terms of service, and ensure compliance with relevant laws and regulations when extracting technical information from domains using Python.

Where can I find examples and documentation for using Python to scrape technical information for domains?

Explore the documentation for dnspython, requests, pyOpenSSL, and Python’s socket and ssl modules, along with online tutorials focused on extracting technical details from domains, for practical examples and guidance.

Author

Greg Bernhardt

Sr. SEO Specialist for Shopify. Nearly 20 years of experience in web design, web development, and web marketing. Education in Information Sciences from UW-Milwaukee. Managing the largest online US physics community. Enjoy learning about search engines, SEO, chrome tricks, Python, knowledge graphs, data science, and more!
