Analyze Crawled PDF Text Using Python for SEO
Google has been indexing PDFs for many years and ranks them alongside web pages, so it’s only logical to analyze your PDFs for optimization opportunities just like any other page. PDFs are harder to work with given the file format’s constraints, but Python gives us a solid starting point. I’m going to show you how easy it is to convert a PDF’s text into a plain Python string that we can then run any number of analyses on. We’ll work from a crawl CSV from Screaming Frog (export the PDF file data) because in many circumstances you may not even know whether, or how many, PDFs are on your site.
Requirements and Assumptions
- Python 3 is installed and basic Python syntax understood
- Access to a Linux installation (I recommend Ubuntu) or Google Colab.
- Crawl data with PDF files, or a CSV list of PDF URLs
Let’s start by installing pdfminer, a Python module that can extract a PDF’s text into a string we can work with. If you are using Google Colab, put an exclamation mark before the pip3 command below.
pip3 install pdfminer
Now we import a number of modules.
- io: provides StringIO, an in-memory buffer we write the extracted PDF text into
- urllib.request: downloading the PDF file
- pandas: adding PDF text data to a dataframe
- time: a small script delay
- pdfminer: will process the PDF
- %load_ext google.colab.data_table is a Google Colab extension to make our dataframes prettier.
from io import StringIO
import urllib.request
import pandas as pd
import time
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser

%load_ext google.colab.data_table
Now we import the PDF crawl CSV file, keeping only the Address column. Then we convert that column to a list so we can iterate through it and process the PDFs.
df_pdf = pd.read_csv('internal_pdf.csv')[['Address']]
list_pdf = df_pdf['Address'].tolist()
print(list_pdf)
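If your crawl export mixes content types, you can filter the list down to just PDF URLs before processing. A minimal sketch, assuming the column is named Address as in the Screaming Frog export above:

# Keep only URLs that end in .pdf (adjust if your PDF URLs carry query strings)
df_pdf = df_pdf[df_pdf['Address'].str.lower().str.endswith('.pdf')]
list_pdf = df_pdf['Address'].tolist()
print(len(list_pdf), "PDF URLs found")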
We’ll now set a counter to use later when we save the PDF files locally, and create an empty dataframe that will eventually hold the URL and extracted text for each PDF.
counter = 0
df_pdf = pd.DataFrame([], columns=['URL','Text'])
It’s time to process the PDF list! What we need to do is download each of the PDFs locally and then process them. The first variable set is filepath, and we use the counter to make each local PDF name unique. You could also strip everything but the file name from the URL, but this was easier (see the sketch after the loop for that alternative). Then we download the file with urllib.request.urlretrieve(), which fetches the URL in pdf and saves it to the local path in filepath in one step. Lastly, we sleep for 5 seconds between downloads so we don’t hammer the server.
for pdf in list_pdf:
    filepath = "pdf" + str(counter) + ".pdf"
    counter += 1
    urllib.request.urlretrieve(pdf, filepath)
    time.sleep(5)
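If you’d rather keep the original file names instead of the counter, here is a sketch of that alternative using only the standard library; it assumes each PDF URL ends in a unique file name, and you would swap it in for the filepath line inside the loop:

import os
from urllib.parse import urlparse

# Alternative: name the local file after the last segment of the URL path
# (assumes each PDF URL ends in a unique file name)
filepath = os.path.basename(urlparse(pdf).path)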
Now that we have downloaded a PDF file it’s time to process it, which involves a handful of classes from the pdfminer module we installed. In general, the file is opened and handed to a parser, then pdfminer’s parser, converter, and interpreter process each page in the PDF and write the text into the output_string buffer.
    # Still inside the for loop from above
    output_string = StringIO()
    with open(filepath, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
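As an aside, if you install pdfminer.six (the actively maintained fork) instead, it ships a higher-level helper that collapses the parser/converter/interpreter setup into a single call. A minimal sketch, assuming pdfminer.six is installed:

# pip3 install pdfminer.six
from pdfminer.high_level import extract_text

# One call extracts the text of every page in the file
text = extract_text(filepath)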
Now that the text is in the output_string buffer we can get its contents using getvalue(). Paragraph breaks come through as double newlines (\n\n), so we’ll replace those with a space.
    text = output_string.getvalue()
    text = text.replace('\n\n', ' ')
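PDF extractions often contain stray single newlines and runs of spaces as well. If you want a tidier string, a small cleanup with Python’s built-in re module (a sketch, not required for the rest of the tutorial) collapses all whitespace:

import re

# Collapse any run of whitespace (newlines, tabs, repeated spaces) into a single space
text = re.sub(r'\s+', ' ', text).strip()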
It’s time to load the PDF URL and its text into a dictionary and add it to the dataframe. At this point, if there are more PDFs in your list the loop starts again with the next PDF URL. When the loop finishes, the dataframe will print out for you to see.
    keyword_dict = {"URL": pdf, "Text": text}
    # DataFrame.append() was removed in pandas 2.0, so use pd.concat instead
    df_pdf = pd.concat([df_pdf, pd.DataFrame([keyword_dict])], ignore_index=True)

df_pdf.head()
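Once the loop finishes you may want to keep the extracted text around so you don’t have to re-download and re-parse the PDFs every time. A quick way, assuming a CSV named pdf_text.csv works for you:

# Persist the URL/text pairs for later analysis
df_pdf.to_csv('pdf_text.csv', index=False)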
From here it’s up to you how you analyze the text! The other tutorials here at importsem cover plenty of techniques you can point at this dataframe.
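As one quick example of what an analysis could look like, here is a minimal sketch that counts the most common words in each PDF for a rough term-frequency view; the stop-word set is purely illustrative:

import re
from collections import Counter

stopwords = {'the', 'and', 'of', 'to', 'a', 'in', 'for', 'is', 'on', 'with'}  # illustrative only

for _, row in df_pdf.iterrows():
    words = re.findall(r'[a-z]{3,}', row['Text'].lower())
    counts = Counter(w for w in words if w not in stopwords)
    print(row['URL'], counts.most_common(10))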
Conclusion
Now you have the framework to begin parsing your PDFs either to analyze or to data-mine and scrape! Lots more potential on this one!
Now get out there and try it out! Follow me on Twitter and let me know your Python SEO applications and ideas!