The topic of machine learning has never been so hot, and for good reason. With machine learning (ML), you can take SEO automation, forecasting, classification, and more to the next level. The problem used to be that machine learning was reserved for only the most advanced data scientists and engineers due to its inherent sophistication.
Not anymore! Thanks to the heroic efforts of data scientists at Hugging Face (a popular machine learning organization), modules, datasets, and algorithms are now accessible even to those with only basic Python skills.
In this tutorial, I’m going to show you how easy it is to use Hugging Face’s ML packages to classify your website pages against a given list of categories in under 30 lines of code! Once you classify your content, you can uncover interesting insights across content categories.
Quick app teaser! Charly Wargnier is currently working hard on an SEO Streamlit app that extends a similar script in so many ways. I can’t wait for it to be released, possibly in November; it will be a game-changer! Stay tuned on Charly’s Twitter feed.
Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Access to a Linux installation (I recommend Ubuntu) or Google Colab
- Set of labels to be used for categorization
- A ScreamingFrog HTML crawl export (CSV)
Install Python Modules
We will use the Hugging Face transformers module to do all the heavy ML lifting. It’s quite simple to use. Remember, if you are running in Google Colab, to add an exclamation mark at the beginning:
!pip3 install transformers
Now let’s import the three modules we need:
- transformers: provides general-purpose architectures (BERT, GPT-2, RoBERTa, XLM, DistilBERT, XLNet…) for Natural Language Understanding (NLU)
- pandas: for storing the results in a dataframe
- collections: for counting how many times each label was assigned
# import modules
from transformers import pipeline
import pandas as pd
from collections import Counter
Import Crawl Data and Create Label List
Next, we’ll import our page titles and meta descriptions for each page from our ScreamingFrog CSV file. The second line is an optional filter if you only want to label a certain path structure. The third line is for stripping out any boilerplate branding that could influence the evaluation. Uncomment them if desired. Also, processing can be slow, which makes testing tedious if you have a long URL list, so I have added a snippet that grabs only the first 50 rows of the dataframe. Remove the # in front of [:50] if you want to take advantage of that for debugging.
df = pd.read_csv('internal_html.csv')[['Address','Title 1','Meta Description 1']]#[:50]
#df = df[df['Address'].str.contains("/blog/")]
#df['Title 1'] = df['Title 1'].str.replace('importSEM','',regex=True)
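One caveat before moving on: pages missing a title or meta description come through from the CSV as NaN, which will break the string concatenation we do later in the loop. A minimal guard of my own you can add right after the import:

df = df.fillna('')  # replace missing titles/metas with empty strings so concatenation doesn't fail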
Now we create our label list (below is my test list; be sure to create your own). The transformer will take this list, score each label against the title and meta description, and we’ll grab the highest score. This is how we categorize each page. Note that processing time grows with the length of the label list and the length of your titles/metas. It also helps accuracy if your labels are highly distinct from one another: the more similar they are, the more often you get split results.
labels = ['Google','Images','Keyword','API','Automation','Text Analysis','Scraping','Audit','NLP','Regex','Twitter','GSC']
Configure Machine Learning Transformer
This is the muscle behind it all. This simple piece of code loads the Hugging Face transformer pipeline. “zero-shot-classification” is the machine learning method in which “the already trained model can classify any text information given without having any specific information about data.” This has the amazing advantage that it can be deployed easily and quickly on any text and label list. The downside is that, depending on your lists and text, it may not be as accurate as a model trained on a more relevant dataset.
Speaking of datasets, the model attribute loads, you guessed it, the model. Here we have chosen valhalla/distilbart-mnli-12-3, a distilled version of BART trained on the MNLI dataset. The 12-3 suffix (~1GB) indicates the degree of distillation. In fact, you can choose the non-distilled version here: facebook/bart-large-mnli (~1.7GB). The upside to the distilled version is that it’s much smaller and faster, with only about a one percent loss in accuracy. There are other distilled versions in different sizes, like 12-1, 12-6, and 12-9, that you can try out. The size of the model matters when you are processing hundreds or thousands of pages. If any version of BART is too slow, you can try an uncased BERT model, which is super fast and light but has an accuracy rating of around 80%, compared to 88%+ with BART.
classifier = pipeline("zero-shot-classification", model="valhalla/distilbart-mnli-12-3")
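If you’d rather trade speed for that last bit of accuracy, swapping in the full model mentioned above is a one-line change; the same goes for any of the other distilled sizes:

# Alternative: the full, non-distilled model (~1.7GB): slower, slightly more accurate
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")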
Next, we simply create a couple of empty lists that will store all the label results.
label_list = []
label_list_score = []
Process Crawl Data
Now it’s time to iterate through the titles in our dataframe from earlier. At this point, you can decide to classify solely on the title, or, if you feel you need more context, you can append the meta description as I’m doing below. If you want to use only the title, simply remove + ". " + row[2] from the code below. At this stage, you may also choose to clean the data a bit; in some cases, you may want to lowercase everything or make some replacements. Furthermore, the meta can sometimes inject too much context and skew the labeling if it isn’t laser-focused. If you find that is the case, uncomment the [:50] slice at the end of the line; the number inside is how many characters of the meta description you want to include.
for index, row in df.iterrows():
    string_to_classify = row[1] + ". " + row[2]#[:50]
    result = classifier(string_to_classify, labels)
The processing time is determined by the weight of the model you are using, the number of labels in your list, and the length of the string (title + meta) you are classifying. In some cases, this can be as high as 30 seconds per evaluation. You can reduce this time by adjusting the factors listed above.
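Before committing to a full run, you can get a feel for what a single evaluation costs on your hardware with a quick timing sketch of my own (the text below is a stand-in, not from the crawl):

import time

start = time.perf_counter()
classifier("Example page title. Example meta description.", labels)  # stand-in text
print(f"One evaluation took {time.perf_counter() - start:.1f} seconds")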
Once the result is returned, it comes back as a Python dictionary that is easy to parse. We’ll grab the highest-scoring label and the corresponding score (1 is a perfect match). Note that every label gets evaluated and given a score, which is part of why processing can take a long time.
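For reference, a single result looks roughly like this (the values below are illustrative, not real output):

# Illustrative shape of one result; all labels appear, sorted highest score first
# {
#   'sequence': 'Page Title. Meta description text',
#   'labels': ['Automation', 'API', 'Google', ...],
#   'scores': [0.71, 0.12, 0.05, ...]
# }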
    label = result["labels"][0]
    label_score = str(round(result["scores"][0],3))
Next, we simply append the chosen label and the score to each corresponding list.
    label_list.append(label)
    label_list_score.append(label_score)
All that is left is to add the lists to our initial dataframe and export the CSV.
df['Label'] = label_list
df['Label Score'] = label_list_score
df.to_csv("title-with-meta3.csv")
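As a quick example of the cross-category insights mentioned at the start, here’s a small sketch of my own that prints the average confidence per label; a consistently low average can flag a label worth rethinking:

df['Label Score'] = df['Label Score'].astype(float)  # scores were stored as strings
print(df.groupby('Label')['Label Score'].mean().sort_values())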
Optional: Get Label Counts
If you’re interested in knowing right away how many times each label was used, you can add the code below. Depending on your results, you may want to tweak your labels if what comes back is not what you expect. Alternatively, you can import the CSV into Google Sheets and use the Query formula with Count() to get the same output.
counts = Counter(label_list).most_common(len(label_list))
for key, value in counts:
    print(key + ": " + str(value))
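Since the labels are already in the dataframe at this point, you can also get the same counts with a pandas one-liner:

# Equivalent label counts straight from the dataframe
print(df['Label'].value_counts())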
Example Output
Conclusion
Accessing complex machine learning tools for SEO has never been easier. You now have the ability to categorize content to your heart’s content. Keep in mind that you’ll still want to do some manual quality assurance on the results. The models are at best around 80-88% accurate and still depend heavily on the text you are labeling and the label list you provide. In other words, it’s not magic and not always perfect. Adjustments are often needed, and in some cases this method just doesn’t work and you still need to label by hand.
Thanks to Charly Wargnier, Andrea Volpini, Kane Jameson, Elias Dabbas, and JC Chouinard for their ML guidance over the past few weeks.
Now get out there and try it out! Follow me on Twitter and let me know your applications and ideas!