Use Python to Scrape Technical Info for Domains
SEOs wear many hats, and from time to time, whether during a technical audit or while troubleshooting, it’s nice to have public technical information handy for a domain you’re working on. Below are some Python tools you can use to grab that publicly available domain information. You could loop this over your entire client list, stored in a Python list or a database, and schedule it to run every morning so you always have the freshest information at your disposal.
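The client-list loop mentioned above might look something like the sketch below. The names `collect_domain_info`, `lookup_ip`, and `client_domains` are hypothetical and not part of any library; the default lookup only resolves an IP address, and you would swap in whichever of the checks from this post you care about.

```python
import socket

def lookup_ip(domain):
    """Default lookup: resolve one IPv4 address for the domain."""
    return socket.gethostbyname(domain)

def collect_domain_info(domains, lookup=lookup_ip):
    """Run `lookup` for every domain and keep going if one fails."""
    results = {}
    for domain in domains:
        try:
            results[domain] = lookup(domain)
        except Exception as error:  # a failed lookup should not stop the loop
            results[domain] = "n/a (" + type(error).__name__ + ")"
    return results

# Hypothetical client list; replace with your own domains or a database query.
client_domains = ["rocketclicks.com", "example.com"]
# report = collect_domain_info(client_domains)  # needs network access
```

Passing the lookup in as a parameter keeps the loop reusable: the same function can collect MX records, WHOIS data, or headers by handing it a different callable.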
Install Modules
!pip3 install python-whois
!pip3 install dnspython
!pip3 install pyOpenSSL
Note that the Whois module is dependent on having a Whois app on your computer. Windows does not ship with one, and neither does Google Colab, so this is best run on Linux, such as Ubuntu. To make sure Whois is installed and up to date on Ubuntu, run the commands below.
sudo apt-get update
sudo apt-get install whois
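If you are not sure whether the Whois app is available on the machine running the script, a quick check along these lines can tell you before the WHOIS step fails. This is a small sketch using only the standard library; `whois_cli_path` is a hypothetical helper name.

```python
import shutil

def whois_cli_path():
    """Return the path to the `whois` executable, or None if it is not on PATH."""
    return shutil.which("whois")

path = whois_cli_path()
if path is None:
    print("whois command not found; install it before running the WHOIS step")
else:
    print("whois command found at " + path)
```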
Import Modules
import whois
import json
import requests
import re
import socket
import dns.resolver
import ssl
import OpenSSL
Now that the modules have been installed and imported, set the domain variable to the domain you want to look up.
domain="rocketclicks.com"
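The lookups in this post expect a bare hostname, not a full URL. If your client list stores URLs, a small normalizer like the one below can help. It is a sketch using the standard library; `to_hostname` is a hypothetical helper name, and it deliberately leaves any `www.` prefix alone since DNS records can differ between the two.

```python
from urllib.parse import urlparse

def to_hostname(value):
    """Return the lowercase hostname from a URL or a bare domain string."""
    value = value.strip()
    if "://" not in value:
        value = "//" + value  # lets urlparse treat a bare domain as a host
    return urlparse(value).hostname or ""

print(to_hostname("https://Example.com/path?x=1"))
print(to_hostname("rocketclicks.com"))
```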
Get MX Records
mailservers = ""
for x in dns.resolver.resolve(domain, 'MX'):
    mailservers += x.to_text() + "\n"
print(mailservers)
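Each MX line is a preference number followed by a host name, and the lowest preference is tried first. If you want the records in that order, a sketch like this can parse the `mailservers` string built above. The helper name `parse_mx` and the sample data are made up for illustration.

```python
def parse_mx(mailservers):
    """Turn 'preference host' lines into (preference, host) tuples, lowest preference first."""
    records = []
    for line in mailservers.splitlines():
        parts = line.split()
        if len(parts) != 2 or not parts[0].isdigit():
            continue  # skip blank or unexpected lines
        preference, host = parts
        records.append((int(preference), host.rstrip(".").lower()))
    return sorted(records)

# Made-up sample in the same shape as the MX text output.
sample = "20 alt1.mail.example.com.\n10 mail.example.com.\n"
print(parse_mx(sample))
# [(10, 'mail.example.com'), (20, 'alt1.mail.example.com')]
```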
Get WHOIS Records
Note that there is a lot of information you can grab. Uncomment print(w) to see the JSON response and pick out what you want.
w = whois.whois(domain)
#print(w)
registrar = w['registrar']
expiration_date = w['expiration_date']
print(registrar)
print(expiration_date)
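A raw expiration date is more useful as a countdown. The sketch below assumes `expiration_date` may come back as a single `datetime`, a list of them, or nothing at all, depending on the registry; check your own print(w) output to confirm what you get. `days_until_expiry` is a hypothetical helper name.

```python
from datetime import date, datetime

def days_until_expiry(expiration_date, today=None):
    """Return whole days until the earliest expiration date, or None if unknown."""
    if expiration_date is None:
        return None
    # Treat a single value as a one-item list so both shapes are handled alike.
    values = expiration_date if isinstance(expiration_date, list) else [expiration_date]
    dates = [value.date() for value in values if isinstance(value, datetime)]
    if not dates:
        return None
    today = today or date.today()
    return (min(dates) - today).days

# Made-up dates for illustration.
print(days_until_expiry(datetime(2025, 1, 31, 12, 0), today=date(2025, 1, 1)))  # 30
```

A negative result means the date has already passed, which is worth flagging loudly in any morning report.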
Get Domain IP
domainip = socket.gethostbyname(domain)
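`socket.gethostbyname` returns a single IPv4 address. Domains behind a CDN or load balancer often have several addresses, and some have IPv6 as well. As a sketch, `socket.getaddrinfo` can list them all; `resolve_all` is a hypothetical helper name.

```python
import socket

def resolve_all(host):
    """Return every distinct IP address (IPv4 and IPv6) the host resolves to."""
    try:
        infos = socket.getaddrinfo(host, None, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        return []  # host did not resolve
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})

# An IP literal resolves to itself without any DNS traffic.
print(resolve_all("127.0.0.1"))
```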
Get NameServers
dnsrecords = ""
getresolver = dns.resolver.Resolver()
getns = getresolver.resolve(domain, "NS")
for rdata in getns:
    dnsrecords += str(rdata) + "\n"
print(dnsrecords)
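If you run this every morning, the interesting part is usually what changed. The sketch below compares two newline-separated record strings, such as yesterday's and today's `dnsrecords`. The helper names and sample values are hypothetical, and storing the previous run is left up to you.

```python
def normalize(records):
    """Split a newline-separated record string into a set of lowercase names without trailing dots."""
    return {line.strip().rstrip(".").lower() for line in records.splitlines() if line.strip()}

def diff_records(previous, current):
    """Report which records appeared or disappeared between two runs."""
    before, after = normalize(previous), normalize(current)
    return {"added": sorted(after - before), "removed": sorted(before - after)}

# Made-up sample data for illustration.
yesterday = "ns1.old.net.\nns2.old.net.\n"
today = "NS2.old.net.\nns1.new.net.\n"
print(diff_records(yesterday, today))
# {'added': ['ns1.new.net'], 'removed': ['ns1.old.net']}
```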
Get Text Records (SPF)
You need the try/except because not all domains have text records. See this SPF/DMARC module to extend this with validation and warning outputs.
textrecords = ""
getresolver = dns.resolver.Resolver()
try:
    gettext = getresolver.resolve(domain, "TXT")
    for rdata in gettext:
        textrecords += str(rdata) + "\n"
except:
    textrecords = "n/a"
print(textrecords)
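TXT records hold more than SPF (site verification tokens, for example), and long values are split into several quoted chunks. The sketch below pulls just the SPF entries out of the `textrecords` string, assuming each line looks like one or more double-quoted strings. `txt_to_strings` and `find_spf` are hypothetical helper names, and this does not validate the SPF policy itself.

```python
import re

def txt_to_strings(textrecords):
    """Join the quoted chunks on each line into one plain string per record."""
    values = []
    for line in textrecords.splitlines():
        chunks = re.findall(r'"((?:[^"\\]|\\.)*)"', line)
        values.append("".join(chunks) if chunks else line.strip())
    return values

def find_spf(textrecords):
    """Return only the records that declare an SPF policy."""
    found = []
    for value in txt_to_strings(textrecords):
        lowered = value.lower()
        if lowered == "v=spf1" or lowered.startswith("v=spf1 "):
            found.append(value)
    return found

# Made-up sample in the same shape as the TXT output.
sample = '"google-site-verification=abc123"\n"v=spf1 include:_spf.example.com " "~all"\n'
print(find_spf(sample))
# ['v=spf1 include:_spf.example.com ~all']
```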
Get Server Response Header Info
# Build the URL from the domain; requests needs a scheme
url = "https://" + domain
response = requests.head(url, verify=True)
header = dict(response.headers)
headerinfo = ""
for key, value in header.items():
    headerinfo += key + ': ' + value + "\n"
print(headerinfo)
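Once the headers are in a plain dict, you will often want only a handful of them, and servers are inconsistent about capitalization. This sketch does a case-insensitive pick from a dict like `header` above; `pick_headers` is a hypothetical helper name and the sample values are invented.

```python
def pick_headers(headers, wanted):
    """Return the wanted headers, matched case-insensitively, with 'n/a' for missing ones."""
    lowered = {key.lower(): value for key, value in headers.items()}
    return {name: lowered.get(name.lower(), "n/a") for name in wanted}

# Invented sample headers for illustration.
sample_headers = {"Server": "nginx", "content-type": "text/html", "X-Frame-Options": "DENY"}
print(pick_headers(sample_headers, ["server", "Strict-Transport-Security"]))
# {'server': 'nginx', 'Strict-Transport-Security': 'n/a'}
```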
TLS Version and Certificate Info
# Certificate expiry date and issuer
try:
    cert = ssl.get_server_certificate((domain, 443))
    x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
    expobj = str(x509.get_notAfter())
    expiredate = re.search("[0-9]{8}", expobj)
    date1 = expiredate.group(0)
    datey = date1[0:4]
    datem = date1[4:6]
    dated = date1[6:8]
    date = datem + "-" + dated + "-" + datey
    issueobj = str(x509.get_issuer())
    # Raw string so \s is read as a regex class, not a string escape
    issuer = re.search(r"CN=[a-zA-Z0-9\s'-]+", issueobj)
    issuer1 = issuer.group(0).replace("'", "")
    print(issuer1)
    sslinfo = "Expiry Date: " + date + " \n Issuer: " + issuer1
except Exception:
    sslinfo = "n/a"

# Negotiated TLS version
hostname = domain
context = ssl.create_default_context()
try:
    with socket.create_connection((hostname, 443)) as sock:
        with context.wrap_socket(sock, server_hostname=hostname) as ssock:
            tls = ssock.version()
            tls = tls.replace("TLSv", "")
            sslerror = "0"
except Exception:
    tls = "0"

print(sslinfo)
print(tls)
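The regex above pulls the date out of the certificate's notAfter value as text. If you would rather work with a real date, for example to warn when a certificate is close to expiring, the sketch below parses the value directly. It assumes the value is in the `YYYYMMDDhhmmssZ` form, as bytes or text; `parse_not_after` and `days_left` are hypothetical helper names.

```python
from datetime import datetime, timezone

def parse_not_after(raw):
    """Parse a 'YYYYMMDDhhmmssZ' timestamp (bytes or str) into a UTC datetime."""
    text = raw.decode("ascii") if isinstance(raw, bytes) else raw
    return datetime.strptime(text, "%Y%m%d%H%M%SZ").replace(tzinfo=timezone.utc)

def days_left(raw, now=None):
    """Whole days between `now` (default: current UTC time) and the expiry timestamp."""
    now = now or datetime.now(timezone.utc)
    return (parse_not_after(raw) - now).days

# Invented timestamp for illustration.
print(parse_not_after(b"20250314235959Z").strftime("%m-%d-%Y"))  # 03-14-2025
```

With the code above, this would be called as `days_left(x509.get_notAfter())` inside the first try block.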
Conclusion
And there you have it! If you find more modules and opportunities to scrape technical information for domains, please let me know and I’ll add them to this list. In a follow-up post, I’ll show how to check for blacklists, reverse IPs, and the technologies a website is using. Stay tuned! Now get out there and try it out! Follow me on Twitter and let me know your Python SEO applications and ideas!
Python and Domain Info FAQ
How can Python be employed to scrape technical information for domains?
Python can query public sources for a domain directly: DNS lookups with dnspython, WHOIS records with a whois module, IP resolution with socket, response headers with requests, and certificate details with ssl and pyOpenSSL.
Are there specific Python libraries commonly used for web scraping technical information from domains?
The modules used in this post are whois, dnspython, pyOpenSSL, and requests, alongside the standard library’s socket, ssl, and re modules.
What technical details can be scraped for domains using Python?
Python scripts can extract a range of technical details, including WHOIS registration data, DNS records (MX, NS, TXT), the domain’s IP address, server headers, the TLS version, and SSL certificate information.
Are there any ethical considerations or legal implications when scraping technical information for domains?
It’s crucial to adhere to ethical scraping practices, respect website terms of service, and ensure compliance with relevant laws and regulations when extracting technical information from domains using Python.
Where can I find examples and documentation for using Python to scrape technical information for domains?
Explore the documentation for dnspython, pyOpenSSL, requests, and the whois module, along with online tutorials and Python resources focused on extracting technical details from domains, for practical examples and guidance.