Detect Generic Anchor Text Links with Python

Optimizing anchor text for internal links has been a staple SEO activity for a very long time; Google even has an entry on anchor text in its SEO guidelines. Anchor text gives both the user and the search engine valuable context about the topic of the page you’re linking to. It’s a direct opportunity to tell Google and the user what the next page is about and why it matters in the context where the link appears.

Alas, we still see a lot of internal links on websites that aren’t descriptive and contain only generic directives such as “click here”, “learn more”, and “link”. These are missed opportunities and should be cleaned up. This Python SEO guide gives you a framework for detecting generic anchor text across any number of pages, with the output in table form, ready to be optimized.

Requirements and Assumptions

  • Python 3 is installed and basic Python syntax is understood.
  • Access to a Linux installation (I recommend Ubuntu) or Google Colab.
  • A crawl CSV named urls.csv containing your URLs, with the URL column titled “url” (see the example after this list).
  • Understand the basics of regular expressions.
  • Be careful when copying the code, as indents are not always preserved well.
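
For reference, a minimal urls.csv might look like this (the URLs below are placeholders for your own):

url
https://example.com/
https://example.com/blog/post-1
https://example.com/services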

Import Python Modules

Let’s start by importing the modules we’ll need for the script.

  • bs4: HTML parser used to retrieve the links on each page
  • re: regular expression support for detecting generic words in anchor text
  • requests: makes the HTTP call for each URL
  • pandas: stores the results in a dataframe

from bs4 import BeautifulSoup
import re
import pandas as pd
import requests
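
If any of these modules are missing from your environment, you can install them with pip (assuming a standard Python 3 setup):

pip3 install requests beautifulsoup4 pandas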

Write RegEx for Text Matching

Now we’ll construct our regex list. This contains the patterns we’re going to look for in our anchor text, and you’re free to add as many as you like. I also added ‘www’ and ‘http’ to surface links where the anchor text is the URL itself. Let me explain the pattern format.

  • (.*)? optionally matches any content before the target word.
  • (?<![a-z]) is a negative lookbehind. It rejects the match if a letter immediately precedes the target word, which prevents words within words from being selected. (We pass re.IGNORECASE later, so [a-z] covers uppercase letters too.)
  • (?![a-z]) is a negative lookahead. It rejects the match if a letter immediately follows the target word, for the same reason.
  • (.*)? optionally matches any content after the target word.
generic_anchors = [
'(.*)?(?<![a-z])here(?![a-z])(.*)?',
'(.*)?(?<![a-z])follow(?![a-z])(.*)?',
'(.*)?(?<![a-z])click(?![a-z])(.*)?',
'(.*)?(?<![a-z])learn(?![a-z])(.*)?',
'(.*)?(?<![a-z])read(?![a-z])(.*)?',
'(.*)?(?<![a-z])more(?![a-z])(.*)?',
'(.*)?(?<![a-z])go(?![a-z])(.*)?',
'(.*)?(?<![a-z])link(?![a-z])(.*)?',
'(.*)?(?<![a-z])watch(?![a-z])(.*)?',
'(.*)?(?<![a-z])find(?![a-z])(.*)?',
'(.*)?(?<![a-z])webpage(?![a-z])(.*)?',
'(.*)?(?<![a-z])website(?![a-z])(.*)?',
'(.*)?(?<![a-z])page(?![a-z])(.*)?',
'(.*)?(?<![a-z])www(?![a-z])(.*)?',
'(.*)?(?<![a-z])http(?![a-z])(.*)?'
]
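
As a quick sanity check of the lookarounds (a throwaway snippet, not part of the final script), notice how ‘here’ matches only as a standalone word:

import re

pattern = '(.*)?(?<![a-z])here(?![a-z])(.*)?'
print(bool(re.match(pattern, 'Click here', re.IGNORECASE)))          # True: "here" stands alone
print(bool(re.match(pattern, 'Adhere to the rules', re.IGNORECASE))) # False: "here" sits inside "Adhere"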

Data Import and Storage

Next, we import the URL list we’ll process and convert the URL column to a Python list for easier looping. Then we create an empty dataframe with three columns, where we’ll store our final data. Lastly, we create a few empty lists that we’ll use for temporary storage until we move the data into the dataframe.

df = pd.read_csv('urls.csv')
url_list = df['url'].tolist()

df1 = pd.DataFrame(columns = ['url', 'internal link', 'anchor text'])

url = []
internal_link = []
anchor_text = []

Process the URLs

Now it’s time to loop through the URL list from the CSV crawl file we just imported. First, we grab the HTML content of the page using get() and load it into a BeautifulSoup object. Then we use decompose() to remove several HTML tags from processing, due to their likelihood of containing templated links like navigation. Feel free to add other block-level HTML tags whose links you don’t want considered. Lastly, we parse out all the links and store them in the links object.

for x in url_list:

  # fetch the page HTML
  html = requests.get(x)

  soup = BeautifulSoup(html.text, 'html.parser')

  # drop templated areas so their links aren't evaluated
  for tag in soup(["header","footer","aside","nav","script"]):
    tag.decompose()

  # collect every remaining anchor tag
  links = soup.find_all('a')
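
One caveat: as written, a single failed request will halt the script. If your crawl list includes pages that time out or error, a hedged variant of the top of the loop might look like this (the timeout value and error handling are my own additions):

for x in url_list:
  try:
    html = requests.get(x, timeout=10)  # don't hang on slow pages
    html.raise_for_status()             # skip 4xx/5xx responses
  except requests.RequestException as e:
    print(f"Skipping {x}: {e}")
    continue

  soup = BeautifulSoup(html.text, 'html.parser')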

Clean the Data

It’s not uncommon for the HTML to contain links with empty anchors, or for the bs4 parser to pick up NoneType links, so we’ll do some list comprehension to filter them out; otherwise the next code block errors. Also at this point, we’ll combine our regex patterns from earlier using join(), which mashes them into one large regex string. We’ll feed this string into the regex matcher in the next code block. Note that both lines below stay inside the loop from the previous section.

  # still inside the loop: drop any None links bs4 picked up
  links = [link for link in links if link is not None]

  # mash the patterns into one big alternation (this could also be built once, before the loop)
  combined = "(" + ")|(".join(generic_anchors) + ")"

Process each link

I’ll describe the next code block line by line:

  1.  Loop through each link in the links object.
  2.  If the anchor text is non-empty and shorter than 20 characters, process it. Feel free to adjust these numbers for what works for you.
  3.  If the regex we built earlier matches the anchor text via re.match() (stripping whitespace and ignoring case), proceed.
  4.  Append the URL where the anchor text was found, the link URL, and the anchor text to the storage lists.
  5.  Process the next link in the list until exhausted.

  for y in links:
    # only consider short, non-empty anchors
    if len(y.text) != 0 and len(y.text) < 20:
      # flag anchors matching any generic pattern
      if re.match(combined, y.text.strip(), re.IGNORECASE):
        url.append(x)
        internal_link.append(y.get("href"))  # .get() avoids a KeyError on anchors without an href
        anchor_text.append(y.text)

Populate the dataframe

The final task is simply to insert the lists we used for temporary storage into the empty dataframe we created earlier and view the output.

df1["url"] = url
df1["internal link"] = internal_link
df1["anchor text"] = anchor_text
df1
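
If you’d rather share the results outside a notebook, you can also export the dataframe to a CSV (the filename is just a suggestion):

df1.to_csv('generic-anchor-text.csv', index=False)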

Sample Output

(Screenshot: the resulting dataframe showing the url, internal link, and anchor text columns.)

Conclusion

That’s all there is to it! You now have a framework for analyzing your internal links for generic anchor text. Feel free to adjust the regex, the length parameters, and the words you consider generic. Remember, anchor text is an important signal for communicating the topic of the next page. Be descriptive and include keywords, but always be natural and user-friendly.

Try to make my code even more efficient and extend it in ways I never thought of! Now get out there and try it out! Follow me on Twitter and let me know your SEO applications and ideas for internal link analysis!

Anchor Text FAQ

How can Python be employed to detect generic anchor text in links for SEO analysis?

Python scripts can fetch pages, extract their links, and match the anchor text against patterns that indicate generic terms or phrases, surfacing internal links that need more descriptive anchors.

Which Python libraries are commonly used for detecting generic anchor text in links?

Commonly used Python libraries for this task include beautifulsoup for HTML parsing, re for regular-expression matching, requests for fetching pages, and pandas for data manipulation.

What specific steps are involved in using Python to detect generic anchor text in links for SEO?

The process includes fetching each page, extracting its links, cleaning the anchor text, applying regular expressions to flag generic terms, and collecting the results in a dataframe for SEO analysis.

Are there any considerations or limitations when using Python for detecting generic anchor text in links?

Consider the diversity of anchor text, the choice of features for detection, and the need for a clear understanding of the goals and criteria for identifying generic terms. Regular updates to the analysis may be necessary.

Where can I find examples and documentation for detecting generic anchor text in links with Python?

Explore online tutorials, documentation for relevant Python libraries, and resources specific to natural language processing for practical examples and detailed guides on detecting generic anchor text using Python for SEO analysis.

Greg Bernhardt