In this tutorial, we will learn how to automate the collection of domain-related technical information using Python. The script gathers data such as WHOIS details, DNS records, SSL certificates, reverse IP lookups, blacklist status, robots.txt, and more. Using the pandas library, we will also show how to store the collected data in a CSV file. This tutorial is ideal for anyone interested in automating domain monitoring or data collection tasks, such as SEOs, webmasters, web hosts, or competitive data analysts.
Requirements
Before getting started, make sure you have Python installed on your system or notebook. You’ll also need the following Python packages:
- pandas – For working with data and saving results to CSV.
- requests – To handle HTTP requests to external APIs and websites.
- dnspython – For querying DNS records.
- pyOpenSSL – For retrieving SSL certificate information.
- python-whois – To fetch WHOIS details for a domain (imported as whois).
You can install these libraries using pip:
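```
pip install pandas requests dnspython pyOpenSSL python-whois
```

If you are following along in a single script or notebook, the snippets below assume these imports at the top:

```python
import json
import re
import socket
import ssl
from datetime import datetime

import dns.resolver
import OpenSSL
import pandas as pd
import requests
import whois
```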
Load Domain Data
We start by loading a CSV file that contains the list of domains we want to process. This CSV should include the domain URLs in a column named "url", along with a "clientid" column used later to identify each record. This DataFrame will hold the domain data and allow us to loop through each domain for information collection.
csv_file = "client_data.csv" df = pd.read_csv(csv_file)
Collect WHOIS Information
WHOIS data provides critical domain details such as the registrar, creation date, and expiration date. We use the whois module to fetch this data. This function retrieves the WHOIS record, extracts the relevant fields, and returns the registrar, creation date, and expiry date.
```python
def get_whois_info(domain):
    w = whois.whois(domain)
    # json.dumps() escapes the raw WHOIS text, so line breaks appear as literal \r\n
    string = json.dumps(w.text)
    registrar = re.search(r"Registrar:\s[^R]*", string)
    registrar = registrar.group(0).replace('\\r\\n', '') if registrar else "n/a"
    creation_date = re.search(r"Creation\sDate:\s[^A-Z]*", string)
    creation_date = creation_date.group(0).replace('\\r\\n', '') if creation_date else "n/a"
    expiry_date = re.search(r"Registry\sExpiry\sDate:\s[^A-Z]*", string)
    expiry_date = expiry_date.group(0).replace('\\r\\n', '') if expiry_date else "n/a"
    return registrar, creation_date, expiry_date
```
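As a quick sanity check, you can call the function directly (example.com is just a placeholder domain):

```python
registrar, created, expires = get_whois_info("example.com")
print(registrar, created, expires, sep="\n")
```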
Fetch DNS Records
DNS records, such as MX (mail servers), NS (name servers), and TXT (text records), are essential for understanding how a domain is configured. We use dnspython for this task. This function queries the MX, NS, and TXT records and returns them as strings.
```python
def get_dns_records(domain):
    resolver = dns.resolver.Resolver()

    # MX (mail server) records
    mailservers = ""
    try:
        for rdata in resolver.resolve(domain, "MX"):
            mailservers += rdata.to_text() + "<br>"
    except Exception:
        mailservers = "n/a"

    # NS (name server) records
    dnsrecords = ""
    try:
        for rdata in resolver.resolve(domain, "NS"):
            dnsrecords += str(rdata) + "<br>"
    except Exception:
        dnsrecords = "n/a"

    # TXT (text) records
    textrecords = ""
    try:
        for rdata in resolver.resolve(domain, "TXT"):
            textrecords += str(rdata) + "<br>"
    except Exception:
        textrecords = "n/a"

    return mailservers, dnsrecords, textrecords
```
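A quick test call, again with a placeholder domain:

```python
mx, ns, txt = get_dns_records("example.com")
print(mx, ns, txt, sep="\n")
```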
SSL Certificate Information
SSL certificates ensure secure communication between users and servers. This step uses pyOpenSSL to retrieve the server's certificate and extract the expiration date and issuer details.
```python
def get_ssl_info(domain):
    try:
        cert = ssl.get_server_certificate((domain, 443))
        x509 = OpenSSL.crypto.load_certificate(OpenSSL.crypto.FILETYPE_PEM, cert)
        # notAfter is returned as bytes in the form YYYYMMDDHHMMSSZ
        expiredate = x509.get_notAfter().decode("ascii")
        date = f"{expiredate[4:6]}-{expiredate[6:8]}-{expiredate[:4]}"  # Format to MM-DD-YYYY
        issuer_match = re.search(r"CN=[a-zA-Z0-9\s'-]+", str(x509.get_issuer()))
        issuer = issuer_match.group(0).replace("'", "") if issuer_match else "n/a"
        return date, issuer
    except Exception as e:
        return "n/a", str(e)
```
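Calling it on a placeholder domain returns the expiry date and issuer, or "n/a" plus the error message on failure:

```python
expiry, issuer = get_ssl_info("example.com")
print(f"Expiry: {expiry} | Issuer: {issuer}")
```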
Blacklist Check
Domain and IP blacklist checks are crucial for security. We query external APIs to check if a domain or IP is blacklisted. This function checks whether the domain or its associated IP is blacklisted and returns the status. Don’t forget to register for your HetrixTools API key.
```python
def get_blacklist_status(domainip, domain):
    hetrixtools_api_key = ''  # replace with your API key
    ipblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/ipv4/{domainip}/"
    domainblacklist_url = f"https://api.hetrixtools.com/v2/{hetrixtools_api_key}/blacklist-check/domain/{domain}/"
    try:
        ipresponse = requests.get(ipblacklist_url)
        domainresponse = requests.get(domainblacklist_url)
        ipdata = json.loads(ipresponse.text)
        domaindata = json.loads(domainresponse.text)
        # Flatten the JSON list of blacklists into a readable string
        ipblacklist = json.dumps(ipdata['blacklisted_on']).replace("[{", "").replace("{", "").replace("}", "").replace("}]", "").replace("null", "none")
        domainblacklist = json.dumps(domaindata['blacklisted_on']).replace("[{", "").replace("{", "").replace("}", "").replace("}]", "").replace("null", "none")
        return f"<b>By IP:</b> {ipblacklist} <br><b>By Domain:</b> {domainblacklist}"
    except Exception:
        return "n/a"
```
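With your HetrixTools key filled in, a test call might look like this (the domain is a placeholder; the IP is resolved the same way the main loop does later):

```python
ip = socket.gethostbyname("example.com")
print(get_blacklist_status(ip, "example.com"))
```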
Reverse IP Lookup
Reverse IP lookup allows you to find other domains hosted on the same server. We use an external API for this. This function performs a reverse IP lookup and returns other domains hosted on the same IP address. Note that this won’t work well for domains behind Cloudflare, which obscures the true IP. The HackerTarget API is free, but don’t abuse it or you’ll get blocked.
```python
def get_reverse_ip(domainip):
    rip_url = f"https://api.hackertarget.com/reverseiplookup/?q={domainip}"
    try:
        rip_response = requests.get(rip_url)
        reverseip = rip_response.text.strip()
        # Clean up stray quoting in the API response
        return reverseip.replace("b'", "").replace("''", "'")
    except Exception:
        return "n/a"
```
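Because this API expects an IP address rather than a domain, resolve the domain first (placeholder domain):

```python
print(get_reverse_ip(socket.gethostbyname("example.com")))
```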
Fetching Robots.txt
The robots.txt file provides guidelines for search engines and web crawlers. This function retrieves the robots.txt file from the domain and formats it for easier viewing.
```python
def get_robots_txt(url):
    try:
        robots_url = url + "/robots.txt"
        response = requests.get(robots_url, verify=False)
        # Convert newlines to <br> and escape single quotes for storage
        return response.text.replace("\n", "<br>").replace("'", "''")
    except Exception:
        return "n/a"
```
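Note that this function expects the full URL from the CSV, not the bare domain. For example (placeholder URL, with the &lt;br&gt; tags converted back to newlines for readability):

```python
robots = get_robots_txt("https://www.example.com")
print(robots.replace("<br>", "\n"))
```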
SSL Error and TLS Version
Checking for SSL errors and the TLS version is important for secure communication. This function attempts to establish an SSL connection, retrieves the TLS version used, and records any SSL error encountered.
```python
def get_ssl_error_and_tls(domain):
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, 443)) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                # e.g. "TLSv1.3" becomes "1.3"
                tls = ssock.version().replace("TLSv", "")
                sslerror = "0"
    except Exception as e:
        sslerror = str(e)
        tls = "0"
    return sslerror, tls
```
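A quick check on a placeholder domain:

```python
sslerror, tls = get_ssl_error_and_tls("example.com")
print(f"TLS version: {tls} | SSL error: {sslerror}")
```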
Collect Domain Technology Information with BuiltWith API
In addition to the technical data, we can fetch the technology stack and social media links associated with the domain. We use the BuiltWith API (an inexpensive paid service) to get information about the technologies used on a website.
This function:
- Queries the BuiltWith API for technology stack information.
- Retrieves the associated social media links for the domain.
```python
def get_technology_info(domain):
    builtwith_api_key = ''  # replace with your API key
    url = f"https://api.builtwith.com/v14/api.json?KEY={builtwith_api_key}&liveonly=yes&LOOKUP={domain}"
    response = requests.get(url)
    data = response.json()
    technology = ""
    social = ""
    # Build an HTML list of detected technologies with links and descriptions
    for result in data["Results"][0]["Result"]["Paths"]:
        for tech in result["Technologies"]:
            technology += f"<a href='{tech['Link']}'><b>{tech['Name']}</b></a><br>{tech['Description']}<br><br>"
    # Social links may be missing, so fall back to "n/a"
    try:
        for value in data["Results"][0]["Meta"]["Social"]:
            social += f"<a href='{value}'>{value}</a><br>"
    except Exception:
        social = "n/a"
    return technology, social
```
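With a valid BuiltWith key set inside the function, a single lookup might look like this (placeholder domain):

```python
tech, social = get_technology_info("example.com")
print(tech.replace("<br>", "\n"))
print(social.replace("<br>", "\n"))
```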
Looping Through Domains and Collecting Data
Now, we will loop through the list of domains, execute each function, and store the data in a new DataFrame.
```python
# Loop through records and process data
rows = []
for index, row in df.iterrows():
    domain = row['url'].replace("https://", "").replace("www.", "").replace("/", "")
    domainip = socket.gethostbyname(domain)

    # Collect domain data
    registrar, creation_date, expiry_date = get_whois_info(domain)
    mailservers, dnsrecords, textrecords = get_dns_records(domain)
    ssl_expiry, ssl_issuer = get_ssl_info(domain)
    blacklist = get_blacklist_status(domainip, domain)
    technology, social = get_technology_info(domain)
    reverseip = get_reverse_ip(domainip)
    robots = get_robots_txt(row['url'])
    sslerror, tls = get_ssl_error_and_tls(domain)

    # Collect all data in a dictionary
    rows.append({
        'clientid': row['clientid'],
        'date': datetime.now().strftime('%m/%d/%Y'),
        'domainip': domainip,
        'mailservers': mailservers,
        'whois': f"{registrar}<br>{creation_date}<br>{expiry_date}",
        'dnsrecords': dnsrecords,
        'textrecords': textrecords,
        'sslinfo': f"<b>Expiry Date:</b> {ssl_expiry} <br><b>Issuer:</b> {ssl_issuer}",
        'blacklist': blacklist,
        'tech': technology,
        'social': social,
        'robots': robots,
        'reverseip': reverseip,
        'sslerror': sslerror,
        'tls': tls
    })

# Build the result DataFrame from the collected rows
# (DataFrame.append() was removed in pandas 2.0, so we collect dicts and build once)
result_df = pd.DataFrame(rows)

# Display the resulting DataFrame
result_df

# Optionally save the result to a CSV file
result_df.to_csv("collected_domain_info.csv", index=False)
```
Conclusion
With the functions above, you can automate the collection of important technical data related to any domain. The script fetches WHOIS data, DNS records, SSL certificate information, reverse IP lookup results, blacklist status, technology stack, and social media links using the BuiltWith API. All of this data is stored in a structured format (CSV), making it easy to analyze, monitor, and report on domain status.
By leveraging Python’s powerful libraries such as pandas, requests, dnspython, and pyOpenSSL, this script automates domain monitoring tasks and helps you stay informed about the technical health and setup of domains you manage or monitor.
Follow me at: https://www.linkedin.com/in/gregbernhardt/