Calculate Similarity Between Article Elements Using spaCy

In this Python SEO tutorial, we’ll walk through a script that uses spaCy to calculate similarity metrics between an article’s keywords and its body text. This analysis helps SEOs and content creators assess content relevance and keyword alignment.

Using natural language processing (NLP), we’ll compute similarity scores that gauge how closely the primary keyword and topic clusters match the main content. The script uses multiprocessing to speed up work on large datasets.

Requirements and Assumptions

Python 3 with spaCy installed: You’ll need spaCy’s en_core_web_md language model.
CSV file of content data: A CSV file (blog_data.csv) with the columns body_text, primary_keyword, and topic_clusters, which are the three columns the script reads.
1. Alongside those columns, the CSV should contain each article’s URL and title so the scores can be traced back to a page.
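
Before running the analysis, it can help to confirm that the CSV header really has the three columns the script reads. The following is a minimal sketch using only the standard library; the helper name missing_columns and the inline sample header are illustrative and not part of the original script.

```python
import csv
import io

# Columns the tutorial's script reads from blog_data.csv
REQUIRED_COLUMNS = {"body_text", "primary_keyword", "topic_clusters"}

def missing_columns(csv_file):
    """Return the required column names absent from the CSV header, sorted."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return sorted(REQUIRED_COLUMNS - header)

# Made-up header standing in for blog_data.csv (topic_clusters is left out on purpose)
sample = io.StringIO("url,title,body_text,primary_keyword\n")
print(missing_columns(sample))
```

With the real file, the same helper could be called on open("blog_data.csv", newline="", encoding="utf-8"); an empty list means all three columns are present.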

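Step 1: Install the spaCy Language Model

First, download the spaCy language model en_core_web_md, which includes the word vectors used for similarity calculations. The leading ! runs the command from a notebook cell; omit it when running in a terminal.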

!python -m spacy download en_core_web_md > /dev/null
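
For intuition about what those vectors are used for: by default, spaCy averages the token vectors of a document or span and scores similarity as the cosine between two such vectors. The sketch below reproduces that arithmetic in plain Python. The 3-dimensional vectors are invented for illustration (the real model uses far more dimensions), and the helper names are not spaCy functions.

```python
import math

def average(vectors):
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between a and b; 0.0 if either vector has zero length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

# Hypothetical word vectors for a two-word sentence and a one-word keyword
sentence_vec = average([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
keyword_vec = [1.0, 1.0, 0.0]
print(round(cosine_similarity(sentence_vec, keyword_vec), 3))
```

Because cosine similarity ignores vector length, the averaged sentence vector and the keyword vector here point in the same direction and score as a near-perfect match.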

Step 2: Import Libraries

Import the necessary libraries: spaCy for NLP, pandas for data handling, statistics.mean for averaging scores, and concurrent.futures for multiprocessing.

import pandas as pd
from statistics import mean
from concurrent.futures import ProcessPoolExecutor
import spacy

# Load the spaCy model once at module level so every function can reuse it
nlp = spacy.load("en_core_web_md")

Step 3: Define the Similarity Calculation Function

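The main function, get_sim, computes similarity between the article body and a keyword or topic string. Depending on istype, it compares that string against each sentence ("sent"), the noun phrases in each sentence ("np"), or the named entities in each sentence ("ent"), then returns the mean of the per-sentence scores rounded to three decimal places.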

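def get_sim(body_text, compare, istype):
    """Return the mean per-sentence similarity between body_text and compare.

    istype selects the comparison unit: "sent", "np", or "ent".
    """
    sim_set = []
    body_text_vec = nlp(body_text)  # parsed article body
    compare_vec = nlp(compare)      # parsed keyword or topic string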

    if not compare_vec.has_vector:
        return 0.0

    if istype == "sent":
        for sent in body_text_vec.sents:
            if sent.has_vector:
                sim_set.append(sent.similarity(compare_vec))

    elif istype == "np":
        for sent in body_text_vec.sents:
            nchunks = [nchunk.text for nchunk in sent.noun_chunks]
            nchunk_utt = nlp(" ".join(nchunks))
            if nchunk_utt.has_vector:
                sim_set.append(nchunk_utt.similarity(compare_vec))

    elif istype == "ent":
        for sent in body_text_vec.sents:
            ent = [ent.text for ent in sent.ents]
            ent_utt = nlp(" ".join(ent))
            if ent_utt.has_vector:
                sim_set.append(ent_utt.similarity(compare_vec))

    if sim_set:
        sim_mean = round(mean(sim_set), 3)
    else:
        sim_mean = 0.0

    return sim_mean
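
The final aggregation step in get_sim is easy to examine on its own. The sketch below isolates it with made-up per-sentence scores; the helper name aggregate is illustrative. It shows how a single sentence scoring 0.0 pulls the article-level mean down, which is worth keeping in mind when reading the noun-phrase and entity scores.

```python
from statistics import mean

def aggregate(scores):
    """Mean of per-sentence scores rounded to 3 places, or 0.0 when there are none."""
    return round(mean(scores), 3) if scores else 0.0

# Made-up per-sentence similarity scores
print(aggregate([0.8, 0.6, 0.7]))
print(aggregate([0.8, 0.6, 0.7, 0.0]))  # one zero-scoring sentence lowers the mean
print(aggregate([]))                    # no scorable sentences
```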

Step 4: Process Each Row of Content Data

The process_row function applies get_sim to compute several similarity metrics for each row. It calculates similarity scores between:

body_text and primary_keyword
body_text and topic_clusters
primary_keyword and topic_clusters

def process_row(row):
    body_text = row["body_text"]
    primary_keyword = row["primary_keyword"]
    topic_clusters = row["topic_clusters"]

    bt_kw_sent_sim = get_sim(body_text, primary_keyword, "sent")
    bt_kw_np_sim = get_sim(body_text, primary_keyword, "np")
    bt_kw_ent_sim = get_sim(body_text, primary_keyword, "ent")

    bt_tc_sent_sim = get_sim(body_text, topic_clusters, "sent")
    bt_tc_np_sim = get_sim(body_text, topic_clusters, "np")
    bt_tc_ent_sim = get_sim(body_text, topic_clusters, "ent")

    tc_kw_sim = get_sim(primary_keyword, topic_clusters, "sent")

    return (bt_kw_sent_sim, bt_kw_np_sim, bt_kw_ent_sim, bt_tc_sent_sim, bt_tc_np_sim, bt_tc_ent_sim, tc_kw_sim)
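
One practical caveat: pandas reads empty CSV cells as NaN (a float), while nlp() expects a string, so a row with a missing keyword can fail inside process_row. A minimal sketch of one way to guard against that, assuming blank text is an acceptable stand-in for missing values; clean_text is a hypothetical helper, not part of the original script.

```python
import math

def clean_text(value):
    """Return value as a stripped string, or "" for None and NaN."""
    if value is None:
        return ""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value).strip()

# Made-up row with a missing keyword and a missing topic cluster
row = {
    "body_text": "  Some article text. ",
    "primary_keyword": float("nan"),
    "topic_clusters": None,
}
cleaned = {key: clean_text(value) for key, value in row.items()}
print(cleaned)
```

Applying clean_text to the three fields at the top of process_row would let rows with missing values fall through to a score of 0.0 instead of raising.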

Step 5: Load Content Data from CSV

Load the content data from the CSV into a DataFrame with pandas.

qr_article_body_content = pd.read_csv("blog_data.csv")

Step 6: Apply Multi-Processing for Efficiency

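The script uses ProcessPoolExecutor to score rows in parallel across worker processes, which improves throughput for large datasets. executor.map returns results in the same order as the input rows. When running this as a standalone script on a platform that starts workers with spawn (the default on Windows and macOS), place the call under an if __name__ == "__main__": guard.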

with ProcessPoolExecutor() as executor:
    results = list(executor.map(process_row, qr_article_body_content.to_dict('records')))
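
The same pattern can be wrapped in a function so the worker count and batch size are explicit. This is a sketch under stated assumptions: score_row is a cheap stand-in for process_row that only counts words, and run_parallel, max_workers=2, and chunksize=8 are illustrative choices rather than settings from the original script.

```python
from concurrent.futures import ProcessPoolExecutor

def score_row(row):
    """Stand-in for process_row: returns a tuple of cheap, made-up 'scores'."""
    return (len(row["body_text"].split()), len(row["primary_keyword"].split()))

def run_parallel(rows, max_workers=2, chunksize=8):
    """Map score_row over rows in worker processes, preserving input order."""
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        # chunksize sends rows to workers in batches, cutting per-row overhead
        return list(executor.map(score_row, rows, chunksize=chunksize))
```

In a standalone script, run_parallel(rows) would be called from inside an if __name__ == "__main__": block so that worker processes can import the module without re-running the pool.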

Step 7: Store and Organize Results

Store the results back in the original DataFrame for straightforward analysis and export.

(
    qr_article_body_content["bt_kw_sent_sim"],
    qr_article_body_content["bt_kw_np_sim"],
    qr_article_body_content["bt_kw_ent_sim"],
    qr_article_body_content["bt_tc_sent_sim"],
    qr_article_body_content["bt_tc_np_sim"],
    qr_article_body_content["bt_tc_ent_sim"],
    qr_article_body_content["tc_kw_sim"],
) = zip(*results)

# Optional: Drop the original body_text column to reduce data size
qr_article_body_content.drop(columns=["body_text"], inplace=True)
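
The zip(*results) call is what turns a list of per-row tuples into one sequence per column. A small sketch with made-up scores and only three of the seven columns shows the transposition:

```python
# Made-up results: one tuple per row, in the order process_row returns them
results = [(0.71, 0.64, 0.52), (0.33, 0.41, 0.28)]
column_names = ["bt_kw_sent_sim", "bt_kw_np_sim", "bt_kw_ent_sim"]

# zip(*results) regroups the values by position, giving one tuple per column
columns = dict(zip(column_names, zip(*results)))
print(columns)
```

Each column tuple has one value per row, which is why it can be assigned directly to a DataFrame column of the same length.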

Step 8: Review Results

Now the DataFrame qr_article_body_content contains new columns for each similarity score, making analysis straightforward.

Example Output

After running the script, you will have a DataFrame that includes the following:

bt_kw_sent_sim: Similarity between body_text and primary_keyword at the sentence level.
bt_kw_np_sim: Similarity between body_text and primary_keyword based on noun phrases.
bt_kw_ent_sim: Similarity between body_text and primary_keyword based on named entities.
bt_tc_sent_sim: Similarity between body_text and topic_clusters at the sentence level.
bt_tc_np_sim: Similarity between body_text and topic_clusters based on noun phrases.
bt_tc_ent_sim: Similarity between body_text and topic_clusters based on named entities.
tc_kw_sim: Sentence-level similarity between primary_keyword and topic_clusters.
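
One way to use these columns is to surface the articles whose body text aligns least with the primary keyword. The sketch below works on plain dictionaries with invented URLs and scores; weakest_rows is a hypothetical helper, and the choice of bt_kw_sent_sim as the ranking column is only an example.

```python
def weakest_rows(records, column, n=2):
    """Return the n records with the lowest value in the given score column."""
    return sorted(records, key=lambda record: record[column])[:n]

# Made-up records standing in for rows of the scored DataFrame
records = [
    {"url": "/post-a", "bt_kw_sent_sim": 0.712},
    {"url": "/post-b", "bt_kw_sent_sim": 0.334},
    {"url": "/post-c", "bt_kw_sent_sim": 0.518},
]
for record in weakest_rows(records, "bt_kw_sent_sim"):
    print(record["url"], record["bt_kw_sent_sim"])
```

On the DataFrame itself, qr_article_body_content.nsmallest(2, "bt_kw_sent_sim") should give the equivalent rows.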

Conclusion

This Python script gives you seven similarity scores per article, covering sentence-, noun-phrase-, and entity-level comparisons against the primary keyword and topic clusters. It’s a useful starting point for SEOs and content analysts who want to measure how well content aligns with its target keywords.

Feel free to expand the functionality by exploring different NLP features or refining similarity calculations to better match your SEO goals.

Follow me at: https://www.linkedin.com/in/gregbernhardt/

Author

Greg Bernhardt

5+ years as Sr. SEO Specialist for Shopify. 25+ years of experience in web design, web development, and web marketing. Education in Information Sciences from UW-Milwaukee. Managing the largest online US physics community. Enjoy learning about AI, search engines, SEO, chrome tricks, Python, knowledge graphs, data science, and more!
