python seo interlinking
Estimated Read Time: 6 minute(s)
Common Topics: keywords, nltk, data, list, import

Any seasoned SEO will know that finding internal linking at scale can be a difficult process but a very important one. This is especially true if your content is not well organized topically.

If you have a blog that is a mess or seemingly full of random articles and you’re tasked with internally linking them in an intelligent way then this tutorial is for you!

In this Python SEO tutorial, I’m going to show you line-by-line how to break down each article’s content into entity N-grams and then compare that set of entities to every other article. Articles with high matches are more likely to be good fits for internal linking.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax is understood
  • Access to a Linux installation (I recommend Ubuntu) or Google Colab
  • Be careful with copying the code as indents are not always preserved well

Import Modules

  • requestsfor calling both APIs
  • pandasstoring the result data
  • nltk: for NLP functions, mini-modules listed below
    • from nltk.util import ngrams
    • from nltk.corpus import stopwords
  • string: for manipulation of strings
  • collections: for item frequency counting
  • bs4: for web scraping
  • json: for handling API call responses
import requests
import pandas as pd
import nltk'punkt')
from nltk.util import ngrams
from nltk.corpus import stopwords'stopwords')'averaged_perceptron_tagger')
import string
from collections import Counter
from bs4 import BeautifulSoup
import json

Import URLs to Dataframe

The first step is to get your URL CSV for the pages you want to process. In my case, I used ScreamingFrog, so the URL column is titled “Address”. Alter as necessary. I would not run this script on an uncleaned URL list. I would also not run this on thousands of pages or the processing may take quite a long time.

The list of URLs should be only your content pages such as your blog as you’ll see later on when we limit where on the page we scrape text from. the pages need to have the same template.

We then also create a small punctuation filter which you can add to for removal during processing and we create two lists for later on.

df = pd.read_csv("urls.csv")
urls = df["Address"].tolist()
punct = ["’","”"]
ngram_count = []
ngram_dedupe = []

N-gram Function

Next, we create our N-gram function. This will use the NLTK library to split up webpage text into successive N-word combinations.  The variable num is N. So if you set the value of Num to 1 then you’re generating 1-grams, if 2 then 2-grams etc.

def extract_ngrams(data, num):
  n_grams = ngrams(nltk.word_tokenize(data), num)
  gram_list = [ ' '.join(grams) for grams in n_grams]
  return gram_list

Google Knowledge Graph API

Now we set up the function to send each n-gram in our list to Google Knowledge Graph API to determine if it’s an entity. This helps clean the n-gram list and ensure we’re matching words that are topical and contextually important. You can easily create a free API here if you don’t already have one. All that happens here is that we loop through our n-gram list and send each one into the API. If we get a hit then we know it’s an entity and add it to a new list we’ll use later. If not a hit then not an entity and we throw it out.

def kg(keywords):

  kg_entities = []

  for x in keywords:
    url = f'{x}&key='+apikey+'&limit=1&indent=True'
    payload = {}
    headers= {}
    response = requests.request("GET", url, headers=headers, data = payload)
    data = json.loads(response.text)

      getlabel = data['itemListElement'][0]['result']['@type']
      hit = "yes"
      hit = "no"

  return kg_entities

This next series of steps I’m going to leave as a large block of code so you can see the entirety of what’s happening and simply chunk out the description of what’s happening. In general, this is where we loop through each URL, scrape the HTML and then process it.

Chunk 1

Here we engage with the BS4 module to take the HTML code we got from the URL. To process the important context of the webpage we want to prune many of the elements of the webpage that are not topical. Usually, this is the boilerplate areas like the header, footer, and sidebar. Then we select only the text from the parent DIV or element that contains the main body content using soup.find("div",{"class":"entry-content"}). You’ll want to edit this to match your template unless it’s WordPress which that code will likely work. Then we lowercase all the text, do a replacement for multiple spaces compress the text into a single line.

Chunk 2

Next, we send the text we just cleaned to the NLTK N-gram function to create the list by whatever we want which is 1 in the code below, but you can change it to say 2. Then we take that N-gram list and clean it some more. We filter out any punctuation, any N-grams with colons, and any items that are stopwords.

Chunk 3

This section is more about list cleaning. We start this section by removing duplicate list items and items that are not string datatypes. Then we filter out POS word types that couldn’t by definition by entities. You can see what the POS word type acronyms are in this guide.

Chunk 4

In this last step, we simply add the newly cleaned to our master lists.

for x in urls:
  res = requests.get(x)
  html_page = res.text

  # CHUNK 1
  soup = BeautifulSoup(html_page, 'html.parser')
  for script in soup(["script", "noscript","nav","style","input","meta","label","header","footer","aside","h1","a","table","button"]):
  text = soup.find("div",{"class":"entry-content"}) #edit this to match your template
  page_text = (text.get_text()).lower()
  page_text = page_text.strip().replace("  ","")
  text_content1 = "".join([s for s in page_text.splitlines(True) if s.strip("\r\n")])

  # CHUNK 2
  keywords = extract_ngrams(text_content1, 1)
  keywords = [x.lower() for x in keywords if x not in string.punctuation]
  keywords = [x for x in keywords if ":" not in x]
  stop_words = set(stopwords.words('english'))
  keywords = [x for x in keywords if not x in stop_words]
  keywords = [x for x in keywords if not x in punct]

  # CHUNK 3
  keywords_dedupe = list(set(keywords))
  tokens = [x for x in keywords_dedupe if isinstance(x, str)]
  keywords = nltk.pos_tag(tokens)
  keywords = [x for x in keywords if x[1] not in ['CD','RB','JJS','IN','MD','WDT','DT','RBR','VBZ','WP']]
  keywords = [x[0] for x in keywords]

  # CHUNK 4
  attach_url = x

Now we create our empty dataframe where we’ll store our internal linking opportunities. Then we simply compare each URL to each other URL and avoid comparing against itself. Then we add the results for each comparison to a dictionary object and add the to dataframe we created earlier. The match count is stored in ngram_matchs.

d = {'from url': [], 'to url': [], 'ngram match count':[]}
df2 = pd.DataFrame(data=d)

for x in ngram_dedupe:
  item_url = x[-1]
  for y in ngram_dedupe:
    item_url2 = y[-1]
    if item_url != item_url2:
      ngram_matchs = set(x).intersection(y)
      new_dict = {"from url":item_url,"to url":item_url2,"ngram match count":len(ngram_matchs)}
      df2 = df2.append(new_dict, ignore_index=True)

All that is left is to print out the results and optionally save the results as a CSV. The higher the match the more likely there is a topical interlinking opportunity between the two pages.

df2 = df2.sort_values(by=['ngram match count'],ascending=False)

Example Output

interlinking by ngram-python


Internal linking at scale and with large unstructured websites can be a real challenge. Using the framework I’ve built above can help you uncover opportunities across your content. Remember to always keep extending the scripts and customize them to your own needs. This is just the beginning.

Now get out there and try it out! Follow me on Twitter and let me know your Python SEO applications and ideas!

Greg Bernhardt
Follow me

Leave a Reply