In this tutorial, we will learn how to automate the collection of domain-related technical information using Python. The script gathers data such as WHOIS details, DNS records, SSL certificates, reverse IP lookups, blacklist status, robots.txt, and more. Using the pandas library, we will also show how to store the collected data in a CSV file. This tutorial is ideal for anyone interested in automating domain monitoring or data collection tasks, such as SEOs, webmasters, web hosts, or competitive data analysts.
Requirements
Before getting started, make sure you have Python installed on your system or notebook. You’ll also need the following Python packages:
- pandas – For working with data and saving results to CSV.
- requests – To handle HTTP requests to external APIs and websites.
- dnspython – For querying DNS records.
- pyOpenSSL – For retrieving SSL certificate information.
- python-whois – To fetch WHOIS details for a domain (imported as whois).
You can install these libraries using pip:
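```
pip install pandas requests dnspython pyOpenSSL python-whois
```

If you are following along in a single script or notebook, the snippets below assume these imports at the top:

```python
import json
import re
import socket
import ssl
from datetime import datetime

import dns.resolver
import OpenSSL
import pandas as pd
import requests
import whois
```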
Load Domain Data
We start by loading a CSV file that contains the list of domains we want to process. This CSV should include the domain URLs in a column named "url", along with a "clientid" column used later to identify each record. This DataFrame will hold the domain data and allow us to loop through each domain for information collection.
csv_file = "client_data.csv" df = pd.read_csv(csv_file)
Collect WHOIS Information
WHOIS data provides critical domain details such as the registrar, creation date, and expiration date. We use the whois module to fetch this data. This function retrieves the WHOIS record, extracts the relevant fields, and returns the registrar, creation date, and expiry date.
```python
def get_whois_info(domain):
    w = whois.whois(domain)
    # json.dumps() escapes the raw WHOIS text, so line breaks appear as literal \r\n
    string = json.dumps(w.text)
    registrar = re.search(r"Registrar:\s[^R]*", string)
    registrar = registrar.group(0).replace('\\r\\n', '') if registrar else "n/a"
    creation_date = re.search(r"Creation\sDate:\s[^A-Z]*", string)
    creation_date = creation_date.group(0).replace('\\r\\n', '') if creation_date else "n/a"
    expiry_date = re.search(r"Registry\sExpiry\sDate:\s[^A-Z]*", string)
    expiry_date = expiry_date.group(0).replace('\\r\\n', '') if expiry_date else "n/a"
    return registrar, creation_date, expiry_date
```
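As a quick sanity check, you can call the function directly (example.com is just a placeholder domain):

```python
registrar, created, expires = get_whois_info("example.com")
print(registrar, created, expires, sep="\n")
```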
Fetch DNS Records
DNS records, such as MX (mail servers), NS (name servers), and TXT (text records), are essential for understanding how a domain is configured. We use dnspython for this task. This function queries the MX, NS, and TXT records and returns them as strings.
```python
def get_dns_records(domain):
    resolver = dns.resolver.Resolver()

    # MX (mail server) records
    mailservers = ""
    try:
        for rdata in resolver.resolve(domain, "MX"):
            mailservers += rdata.to_text() + "<br>"
    except Exception:
        mailservers = "n/a"

    # NS (name server) records
    dnsrecords = ""
    try:
        for rdata in resolver.resolve(domain, "NS"):
            dnsrecords += str(rdata) + "<br>"
    except Exception:
        dnsrecords = "n/a"

    # TXT (text) records
    textrecords = ""
    try:
        for rdata in resolver.resolve(domain, "TXT"):
            textrecords += str(rdata) + "<br>"
    except Exception:
        textrecords = "n/a"

    return mailservers, dnsrecords, textrecords
```
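A quick test call, again with a placeholder domain:

```python
mx, ns, txt = get_dns_records("example.com")
print(mx, ns, txt, sep="\n")
```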
SSL Certificate Information
SSL certificates ensure secure communication between users and servers. This step uses pyOpenSSL to retrieve the server's certificate and extract the expiration date and issuer details.
```python
def get_ssl_info(domain):
    try:
        cert = ssl.get_server_certificate((domain, 443))
        x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
        # notAfter is returned as bytes in the form YYYYMMDDHHMMSSZ
        expiredate = x509.get_notAfter().decode("ascii")
        date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # Format to MM-DD-YYYY
        issuer_match = re.search(r"CN=[a-zA-Z0-9\s'-]+", str(x509.get_issuer()))
        issuer = issuer_match.group(0).replace("'", "") if issuer_match else "n/a"
        return date, issuer
    except Exception as e:
        return "n/a", str(e)
```
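Calling it on a placeholder domain returns the expiry date and issuer, or "n/a" plus the error message on failure:

```python
expiry, issuer = get_ssl_info("example.com")
print(f"Expiry: {expiry} | Issuer: {issuer}")
```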
Blacklist Check
Domain and IP blacklist checks are crucial for security. We query external APIs to check if a domain or IP is blacklisted. This function checks whether the domain or its associated IP is blacklisted and returns the status. Don’t forget to register for your HetrixTools API key.
```python
def get_blacklist_status(domainip, domain):
    hetrixtools_api_key = ''  # replace with your API key
    ipblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/ipv4/{domainip}/"
    domainblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/domain/{domain}/"
    try:
        ipresponse = requests.get(ipblacklist_url)
        domainresponse = requests.get(domainblacklist_url)
        ipdata = json.loads(ipresponse.text)
        domaindata = json.loads(domainresponse.text)
        # Flatten the JSON list of blacklists into a readable string
        ipblacklist = json.dumps(ipdata['blacklisted_on']).replace("[{", "").replace("{", "").replace("}", "").replace("}]", "").replace("null", "none")
        domainblacklist = json.dumps(domaindata['blacklisted_on']).replace("[{", "").replace("{", "").replace("}", "").replace("}]", "").replace("null", "none")
        return f"<b>By IP:</b> {ipblacklist} <br><b>By Domain:</b> {domainblacklist}"
    except Exception:
        return "n/a"
```
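With your HetrixTools key filled in, a test call might look like this (the domain is a placeholder; the IP is resolved the same way the main loop does later):

```python
ip = socket.gethostbyname("example.com")
print(get_blacklist_status(ip, "example.com"))
```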
Reverse IP Lookup
Reverse IP lookup allows you to find other domains hosted on the same server. We use an external API for this. This function performs a reverse IP lookup and returns other domains hosted on the same IP address. Note that this won’t work well for domains behind Cloudflare, which obscures the true IP. The HackerTarget API is free, but don’t abuse it or you’ll get blocked.
```python
def get_reverse_ip(domainip):
    rip_url = f"https://api.hackertarget.com/reverseiplookup/?q={domainip}"
    try:
        rip_response = requests.get(rip_url)
        reverseip = rip_response.text.strip()
        # Clean up stray quoting in the API response
        return reverseip.replace("b'", "").replace("''", "'")
    except Exception:
        return "n/a"
```
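Because this API expects an IP address rather than a domain, resolve the domain first (placeholder domain):

```python
print(get_reverse_ip(socket.gethostbyname("example.com")))
```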
Fetching Robots.txt
The robots.txt file provides guidelines for search engines and web crawlers. This function retrieves the robots.txt file from the domain and formats it for easier viewing.
```python
def get_robots_txt(url):
    try:
        robots_url = url + "/robots.txt"
        response = requests.get(robots_url, verify=False)
        # Convert newlines to <br> and escape single quotes for storage
        return response.text.replace("\n", "<br>").replace("'", "''")
    except Exception:
        return "n/a"
```
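Note that this function expects the full URL from the CSV, not the bare domain. For example (placeholder URL, with the &lt;br&gt; tags converted back to newlines for readability):

```python
robots = get_robots_txt("https://www.example.com")
print(robots.replace("<br>", "\n"))
```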
SSL Error and TLS Version
Checking for SSL errors and the TLS version is important for secure communication. This function attempts to establish an SSL connection, retrieves the TLS version used, and records any SSL error encountered.
```python
def get_ssl_error_and_tls(domain):
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, 443)) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                # e.g. "TLSv1.3" becomes "1.3"
                tls = ssock.version().replace("TLSv", "")
                sslerror = "0"
    except Exception as e:
        sslerror = str(e)
        tls = "0"
    return sslerror, tls
```
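A quick check on a placeholder domain:

```python
sslerror, tls = get_ssl_error_and_tls("example.com")
print(f"TLS version: {tls} | SSL error: {sslerror}")
```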
Collect Domain Technology Information with BuiltWith API
In addition to the technical data, we can fetch the technology stack and social media links associated with the domain. We use the BuiltWith API (an inexpensive paid service) to get information about the technologies used on a website.
This function:
- Queries the BuiltWith API for technology stack information.
- Retrieves the associated social media links for the domain.
```python
def get_technology_info(domain):
    builtwith_api_key = ''  # replace with your API key
    url = f"https://api.builtwith.com/v14/api.json?KEY={builtwith_api_key}&liveonly=yes&LOOKUP={domain}"
    response = requests.get(url)
    data = response.json()
    technology = ""
    social = ""
    # Build an HTML list of detected technologies with links and descriptions
    for result in data["Results"][0]["Result"]["Paths"]:
        for tech in result["Technologies"]:
            technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
    # Social links may be missing, so fall back to "n/a"
    try:
        for value in data["Results"][0]["Meta"]["Social"]:
            social += f"<a href='{value}'>{value}</a><br>"
    except Exception:
        social = "n/a"
    return technology, social
```
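With a valid BuiltWith key set inside the function, a single lookup might look like this (placeholder domain):

```python
tech, social = get_technology_info("example.com")
print(tech.replace("<br>", "\n"))
print(social.replace("<br>", "\n"))
```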
Looping Through Domains and Collecting Data
Now, we will loop through the list of domains, execute each function, and store the data in a new DataFrame.
```python
# Loop through records and process data
rows = []
for index, row in df.iterrows():
    domain = row['url'].replace("https://", "").replace("www.", "").replace("/", "")
    domainip = socket.gethostbyname(domain)

    # Collect domain data
    registrar, creation_date, expiry_date = get_whois_info(domain)
    mailservers, dnsrecords, textrecords = get_dns_records(domain)
    ssl_expiry, ssl_issuer = get_ssl_info(domain)
    blacklist = get_blacklist_status(domainip, domain)
    technology, social = get_technology_info(domain)
    reverseip = get_reverse_ip(domainip)
    robots = get_robots_txt(row['url'])
    sslerror, tls = get_ssl_error_and_tls(domain)

    # Collect all data in a dictionary
    rows.append({
        'clientid': row['clientid'],
        'date': datetime.now().strftime('%m/%d/%Y'),
        'domainip': domainip,
        'mailservers': mailservers,
        'whois': f"{registrar}<br>{creation_date}<br>{expiry_date}",
        'dnsrecords': dnsrecords,
        'textrecords': textrecords,
        'sslinfo': f"<b>Expiry Date:</b> {ssl_expiry} <br><b>Issuer:</b> {ssl_issuer}",
        'blacklist': blacklist,
        'tech': technology,
        'social': social,
        'robots': robots,
        'reverseip': reverseip,
        'sslerror': sslerror,
        'tls': tls
    })

# Build the result DataFrame from the collected rows
# (DataFrame.append() was removed in pandas 2.0, so we collect dicts and build once)
result_df = pd.DataFrame(rows)

# Display the resulting DataFrame
result_df

# Optionally save the result to a CSV file
result_df.to_csv("collected_domain_info.csv", index=False)
```
Conclusion
With the functions above, you can automate the collection of important technical data related to any domain. The script fetches WHOIS data, DNS records, SSL certificate information, reverse IP lookup results, blacklist status, technology stack, and social media links using the BuiltWith API. All of this data is stored in a structured format (CSV), making it easy to analyze, monitor, and report on domain status.
By leveraging Python’s powerful libraries such as pandas, requests, dnspython, and pyOpenSSL, this script automates domain monitoring tasks and helps you stay informed about the technical health and setup of domains you manage or monitor.
Follow me at: https://www.linkedin.com/in/gregbernhardt/