detect generic links python

Optimizing anchor text for internal links has been a staple activity within SEO for a very long time. Google even has an entry on anchor text in their SEO guidelines. Anchor text gives the user and the search engine valuable context about the topic of the page you’re linking to. It is a direct opportunity to tell Google and the user what the page they may visit next is about and why it matters in the context where the link appears.

Alas, we still see a lot of internal links on websites that are not descriptive and contain simple directives such as “click here”, “learn more”, and “link”. These are missed opportunities and should be cleaned up. This Python SEO guide will show you the framework for detecting generic anchor text across any number of pages and provide an output in table form ready to be optimized.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax is understood.
  • Access to a Linux installation (I recommend Ubuntu) or Google Colab.
  • Crawl CSV with URLs named urls.csv with URL column titled “url”.
  • The basics of regular expressions are understood.
  • Be careful when copying the code, as indentation is not always preserved well.

Import Python Modules

Let’s start by importing the modules we’ll need for the script.

  • bs4: HTML parser to retrieve links on a page
  • re: for regular expression usage in detecting words in anchor text
  • requests: makes the HTTP calls for each URL
  • pandas: for storing the results in a dataframe
from bs4 import BeautifulSoup
import re
import pandas as pd
import requests

Write RegEx for Text Matching

Now we’ll construct our regEx list. This contains the patterns we’re going to look for in our anchor text. You’re free to add as many as you like. I also added in ‘www’ and ‘http’ to surface links where the anchor text is the URL path. Let me explain the format I initially added.

  • (.*)? optionally matches any amount of content before the target word.
  • (?<![a-z]) is a negative lookbehind. It rejects the match if a letter comes immediately before the target word. This prevents words within words from being selected, such as “here” inside “there”. Uppercase letters are covered too because the pattern is applied later with re.IGNORECASE.
  • (?![a-z]) is a negative lookahead. It rejects the match if a letter comes immediately after the target word. This also prevents words within words from being selected.
  • (.*)? optionally matches any amount of content after the target word.
generic_anchors = [
'(.*)?(?<![a-z])here(?![a-z])(.*)?',
'(.*)?(?<![a-z])follow(?![a-z])(.*)?',
'(.*)?(?<![a-z])click(?![a-z])(.*)?',
'(.*)?(?<![a-z])learn(?![a-z])(.*)?',
'(.*)?(?<![a-z])read(?![a-z])(.*)?',
'(.*)?(?<![a-z])more(?![a-z])(.*)?',
'(.*)?(?<![a-z])go(?![a-z])(.*)?',
'(.*)?(?<![a-z])link(?![a-z])(.*)?',
'(.*)?(?<![a-z])watch(?![a-z])(.*)?',
'(.*)?(?<![a-z])find(?![a-z])(.*)?',
'(.*)?(?<![a-z])webpage(?![a-z])(.*)?',
'(.*)?(?<![a-z])website(?![a-z])(.*)?',
'(.*)?(?<![a-z])page(?![a-z])(.*)?',
'(.*)?(?<![a-z])www(?![a-z])(.*)?',
'(.*)?(?<![a-z])http(?![a-z])(.*)?'
]
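
To see what the lookarounds do in practice, here is a minimal sketch that tests one pattern from the list against a few made-up anchor strings. The sample strings and the helper name is_generic are assumptions for illustration and are not part of the script in this guide.

```python
import re

# One pattern from the list above, targeting the word "here"
pattern = r'(.*)?(?<![a-z])here(?![a-z])(.*)?'

def is_generic(text):
    """Return True if the text contains "here" as a whole word (hypothetical helper)."""
    return re.match(pattern, text.strip(), re.IGNORECASE) is not None

# Made-up anchor texts: some contain the whole word, some only embed it
samples = ["Click here", "HERE", "there", "adhere to the rules", "here's why"]
for s in samples:
    print(s, "->", is_generic(s))
```

With these samples, “there” and “adhere to the rules” are rejected because a letter sits directly before “here”, while the other three strings match.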

Data Import and Storage

Next, we import the URL list we will process and convert the URL column to a list for easier looping. Then we create an empty dataframe with three columns, where we’ll store the final data. Lastly, we create a few empty lists for temporary storage until we move the data to the dataframe.

df = pd.read_csv('urls.csv')
url_list = df['url'].tolist()

df1 = pd.DataFrame(columns = ['url', 'internal link', 'anchor text'])

url = []
internal_link = []
anchor_text = []

Process the URLs

Now it’s time to loop through the URL list from the CSV crawl file we just imported. First, we grab the HTML content of the page using get() and load it into a BS4 object. Then we use decompose() to remove several HTML tags that are likely to contain templated links such as navigation, so their links are not processed. Feel free to add other block-level HTML tags whose links you want to ignore. Lastly, we parse out all the remaining links and store them in the object links.

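for x in url_list:

  # Fetch the page HTML
  html = requests.get(x)

  soup = BeautifulSoup(html.text, 'html.parser')

  # Drop template blocks that tend to hold navigation links
  for tag in soup(["header","footer","aside","nav","script"]):
    tag.decompose()

  # Collect every remaining anchor tag
  links = soup.find_all('a')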
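
To check the effect of decompose() without crawling a live site, the same steps can be run against a small inline HTML string. This is only a sketch: the markup in sample_html and the variable name remaining are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page markup, used only for illustration
sample_html = """
<html><body>
  <nav><a href="/home">Home</a></nav>
  <main><p>Read our guide <a href="/guide">here</a>.</p></main>
  <footer><a href="/contact">Contact</a></footer>
</body></html>
"""

soup = BeautifulSoup(sample_html, 'html.parser')

# Remove template blocks so their links are not collected
for tag in soup(["header", "footer", "aside", "nav", "script"]):
    tag.decompose()

# Only links outside the removed blocks remain
remaining = [a.get("href") for a in soup.find_all('a')]
print(remaining)
```

Only the link inside the main content survives, since the nav and footer blocks were removed before the links were collected.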

Clean the Data

It’s not uncommon for the HTML to contain links with empty anchors or for the bs4 parser to pick up NoneType links, so we’ll use a list comprehension to filter them out; otherwise the next code block errors. Also at this point, we’ll combine our regex patterns from earlier using join(), which merges them into a single large regex string. We’ll feed this string into the regex processor in the next code block. Note that these lines continue inside the for loop from the previous step.

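  # Still inside the for loop: drop any None entries
  links = [link for link in links if link is not None]

  # Merge all patterns into one alternation
  combined = "(" + ")|(".join(generic_anchors) + ")"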
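
Because the combined pattern never changes between pages, one option is to build and compile it once before the loop rather than on every iteration. A small sketch of that idea, assuming a shortened two-item pattern list for readability (the names short_list and generic_re are made up here):

```python
import re

# Shortened stand-in for generic_anchors, for illustration only
short_list = [
    r'(.*)?(?<![a-z])click(?![a-z])(.*)?',
    r'(.*)?(?<![a-z])more(?![a-z])(.*)?',
]

# Same join as above: each pattern becomes one alternative in its own group
combined = "(" + ")|(".join(short_list) + ")"

# Compile once; the flag makes every match case-insensitive
generic_re = re.compile(combined, re.IGNORECASE)

print(bool(generic_re.match("Learn More")))
print(bool(generic_re.match("Pricing")))
```

The compiled object can then be reused for every link with generic_re.match(text), which keeps the flag in one place.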

Process each link

For the next code bit I’ll describe line by line:

  1.  Loop through each link in the links object.
  2.  If the length of the anchor text is not 0 and less than 20 then process the anchor text. Feel free to adjust these numbers for what works for you.
  3.  If the anchor text (with whitespace stripped and casing ignored) matches the regex we built earlier using re.match(), proceed.
  4.  Append the URL where the anchor text is found, the link URL, and the anchor text to lists for storage.
  5.  Process the next link in the list until exhausted.
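  for y in links:
    # Only consider short, non-empty anchor text
    if len(y.text) != 0 and len(y.text) < 20:
      if re.match(combined, y.text.strip(), re.IGNORECASE):
        url.append(x)
        # get() avoids a KeyError when an <a> tag has no href
        internal_link.append(y.get("href"))
        anchor_text.append(y.text)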
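
The link-level steps can also be wrapped in a function and tried on inline HTML before running a full crawl. This is a hedged sketch rather than the script above: the function name find_generic_links, its max_len parameter, and the sample markup are assumptions, and it strips whitespace before checking the length, which differs slightly from the loop shown here.

```python
import re
from bs4 import BeautifulSoup

def find_generic_links(html, page_url, pattern, max_len=20):
    """Return (page_url, href, anchor_text) tuples for generic anchors."""
    soup = BeautifulSoup(html, 'html.parser')
    rows = []
    for a in soup.find_all('a'):
        text = a.get_text().strip()
        # Skip empty anchors and long, likely descriptive ones
        if 0 < len(text) < max_len and re.match(pattern, text, re.IGNORECASE):
            # a.get("href") returns None when the tag has no href
            rows.append((page_url, a.get("href"), text))
    return rows

# Hypothetical markup and a single-word pattern for illustration
html = ('<p><a href="/a">Click here</a> '
        '<a href="/b">Python SEO guide</a> <a>here</a></p>')
pattern = r'(.*)?(?<![a-z])here(?![a-z])(.*)?'

rows = find_generic_links(html, "https://example.com/", pattern)
for row in rows:
    print(row)
```

The descriptive anchor “Python SEO guide” is skipped, while the anchor without an href is still reported with None as its link so it can be reviewed.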

Populate the dataframe

The final task is simply inserting the lists we used to store our data into the empty dataframe we created earlier and viewing the output.

df1["url"] = url
df1["internal link"] = internal_link
df1["anchor text"] = anchor_text
df1
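
As an alternative sketch, the dataframe can be built from the three lists in a single call and then written to a CSV for review in a spreadsheet. The sample values, the variable name results, and the file name generic_anchors.csv below are made up for illustration.

```python
import pandas as pd

# Hypothetical results standing in for the lists filled by the loop
url = ["https://example.com/", "https://example.com/blog"]
internal_link = ["/pricing", "/guide"]
anchor_text = ["click here", "learn more"]

# Build the dataframe in one call from the three equal-length lists
results = pd.DataFrame({
    "url": url,
    "internal link": internal_link,
    "anchor text": anchor_text,
})

# Write a CSV without the index column
results.to_csv("generic_anchors.csv", index=False)
print(results.shape)
```

All three lists must be the same length for this to work, which holds here because the loop appends to them together.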

Sample Output

[Image: dataframe of detected generic anchor text]

Conclusion

That’s all there is to it! You now have the framework for analyzing your internal links for generic anchor text. Feel free to adjust the regex, some of the parameters, and the words you consider generic. Remember, anchor text is an important value for communicating the topicality of the next page. Be descriptive and include keywords, but always be natural and user-friendly.

Remember to try and make my code even more efficient and extend it in ways I never thought of!  Now get out there and try it out! Follow me on Twitter and let me know your SEO applications and ideas for internal link analysis!

Greg Bernhardt