Find Interlinking Opps via Entity N-gram Matches Using Python
Any seasoned SEO knows that finding internal links at scale is difficult but important. This is especially true if your content isn’t well organized topically.
If your blog is disorganized or full of seemingly random articles and you need to add internal links intelligently, this tutorial is for you.
In this Python SEO tutorial, I’ll show you, line by line, how to break each article’s content into entity N-grams and compare that set of entities to every other article. Articles with high matches are more likely to be good fits for internal linking.
Table of Contents
Requirements and Assumptions
- Python 3 is installed and basic Python syntax is understood
- Access to a Linux installation (I recommend Ubuntu) or Google Colab
- Be careful copying code: indents may not be preserved
Import Modules
- requests: for calling both APIs
- pandas: storing the result data
- nltk: for NLP functions, mini-modules listed below
- nltk.download(‘punkt’)
- from nltk.util import ngrams
- from nltk.corpus import stopwords
- nltk.download(‘stopwords’)
- nltk.download(‘averaged_perceptron_tagger’)
- string: for manipulation of strings
- collections: for item frequency counting
- bs4: for web scraping
- json: for handling API call responses
import requests
import pandas as pd
import nltk
nltk.download('punkt')
from nltk.util import ngrams
from nltk.corpus import stopwords
nltk.download('stopwords')
nltk.download('averaged_perceptron_tagger')
import string
from collections import Counter
from bs4 import BeautifulSoup
import json
Import URLs to Dataframe
The first step is to get your URL CSV for the pages you want to process. In my case, I used ScreamingFrog, so the URL column is titled “Address”. Alter as necessary. I would not run this script on an uncleaned URL list. I would also not run this on thousands of pages or the processing may take quite a long time.
The list of URLs should contain only your content pages, such as blog posts, as you’ll see later when we limit where we scrape text from. The pages should use the same template.
We also create a small punctuation filter (which you can extend) and two lists for later.
df = pd.read_csv("urls.csv")
urls = df["Address"].tolist()
punct = ["’","”"]
ngram_count = []
ngram_dedupe = []
N-gram Function
Next, we create the N-gram function. It uses the NLTK library to split webpage text into successive N-word combinations. The parameter num is N: set num to 1 to generate 1-grams, 2 for 2-grams, and so on.
def extract_ngrams(data, num): n_grams = ngrams(nltk.word_tokenize(data), num) gram_list = [ ' '.join(grams) for grams in n_grams] return gram_list
Google Knowledge Graph API
Now we set up the function to send each n-gram in our list to Google Knowledge Graph API to determine if it’s an entity. This helps clean the n-gram list and ensure we’re matching words that are topical and contextually important. You can easily create a free API here if you don’t already have one. We loop through the n-gram list and send each term to the API. If the API returns a hit, we consider it an entity and add it to a list for later use; otherwise we discard it.
def kg(keywords):
kg_entities = []
apikey=''
for x in keywords:
url = f'https://kgsearch.googleapis.com/v1/entities:search?query={x}&key='+apikey+'&limit=1&indent=True'
payload = {}
headers= {}
response = requests.request("GET", url, headers=headers, data = payload)
data = json.loads(response.text)
try:
getlabel = data['itemListElement'][0]['result']['@type']
hit = "yes"
kg_entities.append(x)
except:
hit = "no"
return kg_entities
The next section is presented as a single block of code so you can see the full flow. In general, this is where we loop through each URL, scrape the HTML, and process it.
Chunk 1
Here we use BeautifulSoup to parse the HTML. To focus on the page’s main content, we remove non-topical elements—typically boilerplate such as the header, footer, and sidebar. We then select the parent DIV that contains the main body content using soup.find("div",{"class":"entry-content"}). Edit this selector to match your template; it’s likely to work for WordPress. Finally, we lowercase the text, collapse multiple spaces, and compress the text to a single line.
Chunk 2
Next, we pass the cleaned text to the NLTK N-gram function to generate N-grams (the code uses N=1 but you can change it). We then filter the N-grams to remove punctuation, any n-grams containing colons, and stopwords.
Chunk 3
This section focuses on list cleaning. We remove duplicate items and non-string entries, then filter out part-of-speech (POS) tags that cannot, by definition, be entities. See the linked guide for POS tag acronyms.
Chunk 4
In this final step, we add the cleaned list to our master lists.
for x in urls:
res = requests.get(x)
html_page = res.text
# CHUNK 1
soup = BeautifulSoup(html_page, 'html.parser')
for script in soup(["script", "noscript","nav","style","input","meta","label","header","footer","aside","h1","a","table","button"]):
script.decompose()
text = soup.find("div",{"class":"entry-content"}) #edit this to match your template
page_text = (text.get_text()).lower()
page_text = page_text.strip().replace(" ","")
text_content1 = "".join([s for s in page_text.splitlines(True) if s.strip("\r\n")])
# CHUNK 2
keywords = extract_ngrams(text_content1, 1)
keywords = [x.lower() for x in keywords if x not in string.punctuation]
keywords = [x for x in keywords if ":" not in x]
stop_words = set(stopwords.words('english'))
keywords = [x for x in keywords if not x in stop_words]
keywords = [x for x in keywords if not x in punct]
# CHUNK 3
keywords_dedupe = list(set(keywords))
tokens = [x for x in keywords_dedupe if isinstance(x, str)]
keywords = nltk.pos_tag(tokens)
keywords = [x for x in keywords if x[1] not in ['CD','RB','JJS','IN','MD','WDT','DT','RBR','VBZ','WP']]
keywords = [x[0] for x in keywords]
# CHUNK 4
attach_url = x
keywords.append(attach_url)
ngram_dedupe.append(keywords)
Now we create an empty DataFrame to store internal linking opportunities. We compare each URL to every other URL (skipping self-comparisons), collect match results in a dictionary, and append them to the DataFrame. The match count is stored in the variable ngram_matchs.
d = {'from url': [], 'to url': [], 'ngram match count':[]}
df2 = pd.DataFrame(data=d)
for x in ngram_dedupe:
item_url = x[-1]
for y in ngram_dedupe:
item_url2 = y[-1]
if item_url != item_url2:
ngram_matchs = set(x).intersection(y)
new_dict = {"from url":item_url,"to url":item_url2,"ngram match count":len(ngram_matchs)}
df2 = df2.append(new_dict, ignore_index=True)
Finally, we sort the results and save them as a CSV. Higher match counts indicate stronger topical interlinking opportunities between the two pages.
df2 = df2.sort_values(by=['ngram match count'],ascending=False)
df2.to_csv('ngram_interlinking.csv')
df2
Example Output

Conclusion
Internal linking at scale on large, unstructured sites is challenging. The framework above helps uncover opportunities across your content; extend and customize the scripts to fit your needs.
Try it out. Follow me on Twitter and share your Python SEO applications and ideas.
N-Gram Interlinking SEO FAQ
How can Python be utilized to find interlinking opportunities based on entity N-gram matches?
Python scripts can analyze content, extract entities via N-grams, and identify interlinking opportunities by finding shared entities across pages.
Which Python libraries are commonly used for finding interlinking opportunities with entity N-gram matches?
Commonly used Python libraries for this task include nltk for natural language processing, spaCy for entity recognition, and pandas for data manipulation.
What specific steps are involved in using Python for finding interlinking opportunities via entity N-gram matches?
The process includes fetching and processing content, applying NLP techniques for entity recognition using N-grams, and using Python to analyze common entities to identify potential interlinking opportunities.
Are there any considerations or limitations when using Python for this purpose?
Consider the diversity of content, the choice of N-gram lengths, and clearly defined criteria for identifying interlinking opportunities. Periodic reanalysis may be necessary as content changes.
Where can I find examples and documentation for finding interlinking opportunities with Python and entity N-gram matches?
Explore online tutorials, documentation for the relevant Python libraries, and resources specific to natural language processing and interlinking strategies for practical examples and detailed guides.
- Evaluate Subreddit Posts in Bulk Using GPT4 Prompting - December 12, 2024
- Calculate Similarity Between Article Elements Using spaCy - November 13, 2024
- Audit URLs for SEO Using ahrefs Backlink API Data - November 11, 2024














