Analyze Crawled PDF Text Using Python for SEO
Google has indexed PDFs for many years and treats them like web pages, so it’s logical to analyze your PDFs for optimization opportunities just as you would a web page. This can be more challenging because of file constraints, but we can begin the process using Python. I’ll show how easy it is to convert a PDF’s contents into plain string text that you can analyze. We’ll use a crawl CSV from Screaming Frog (export PDF file data), since you may not know if — or how many — PDFs are on your site.
Requirements and Assumptions
- Python 3 is installed and you understand basic Python syntax
- Access to a Linux installation (I recommend Ubuntu) or Google Colab.
- Crawl data with PDF files, or a CSV list of PDF URLs
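Let’s start by installing PDFMiner, a Python module that extracts PDF content as usable string text. If you are using Google Colab, put an exclamation mark before pip3 below. The actively maintained fork is published as pdfminer.six and exposes the same pdfminer imports used here, so if the command below fails on your Python version, install that package instead.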
pip3 install pdfminer
Now we import a number of modules.
- io: to handle writing text from a file to a variable
- urllib.request: downloading the PDF file
- pandas: adding PDF text data to a dataframe
- time: a small script delay
- pdfminer: will process the PDF
- %load_ext google.colab.data_table is a Google Colab extension to make our dataframes prettier. It only works in Colab, so remove that line if you run the script elsewhere.
from io import StringIO
import urllib.request
import pandas as pd
import time
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.pdfpage import PDFPage
from pdfminer.pdfparser import PDFParser
%load_ext google.colab.data_table
Import the PDF crawl CSV file, selecting the Address column only. Then convert that column to a list so you can iterate over it and process the PDFs.
df_pdf = pd.read_csv('internal_pdf.csv')[['Address']]
list_pdf = df_pdf['Address'].tolist()
print(list_pdf)
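Crawl exports can contain duplicate rows, empty cells, or addresses that are not PDFs. As a minimal sketch, assuming you want to filter the list before downloading, something like the following could be run on list_pdf. The helper name clean_pdf_urls is hypothetical and not part of the original script.

```python
from urllib.parse import urlsplit

def clean_pdf_urls(urls):
    """Keep unique http(s) URLs whose path ends in .pdf, preserving order.

    Hypothetical helper for illustration only.
    """
    seen = set()
    cleaned = []
    for url in urls:
        if not isinstance(url, str):
            continue  # skip NaN values that pandas uses for empty cells
        url = url.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip blanks and non-web addresses
        if not parts.path.lower().endswith(".pdf"):
            continue  # skip addresses whose path is not a .pdf file
        if url in seen:
            continue  # skip duplicates
        seen.add(url)
        cleaned.append(url)
    return cleaned

example_urls = [
    "https://example.com/docs/guide.pdf",
    "https://example.com/docs/guide.pdf",        # duplicate
    "https://example.com/page.html",             # not a PDF
    "https://example.com/files/Report.PDF?v=2",  # upper-case extension, query string
]
print(clean_pdf_urls(example_urls))
```

Note that some sites serve PDFs from URLs without a .pdf extension. If your crawl export already contains only PDF content types, you may prefer to drop the extension check.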
Set a counter used when saving PDF files locally, and create an empty dataframe to hold each PDF’s URL and text.
counter = 0
df_pdf = pd.DataFrame([], columns=['URL','Text'])
Now we’ll process the PDF list. We download each PDF locally and then process it. The first variable set is filepath, and we use the counter to make each saved PDF name unique. You could also extract the file name from the URL, but this approach is simpler and avoids name collisions. We use urllib.request.urlretrieve() to download files because it ships with Python and saves straight to disk; the requests library would also work if you write the response content to a file yourself. The pdf variable holds the PDF URL and filepath is where the file will be saved locally. Finally, we pause for 5 seconds between downloads to avoid putting load on the server. The pause is a courtesy, not a requirement: urlretrieve() does not return until the file has been saved.
for pdf in list_pdf:
    filepath = "pdf" + str(counter) + ".pdf"
    counter += 1
    urllib.request.urlretrieve(pdf,filepath)
    time.sleep(5)
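If you would rather keep each PDF’s original file name than use a counter, the name can be pulled from the URL path without regex. This is a rough sketch using only the standard library; filename_from_url and its fallback value are hypothetical names chosen for illustration.

```python
import posixpath
from urllib.parse import urlsplit, unquote

def filename_from_url(url, fallback="download.pdf"):
    """Return the last path segment of a URL, decoded, or a fallback name.

    Hypothetical helper for illustration only.
    """
    # urlsplit separates the path from the query string and fragment
    path = unquote(urlsplit(url).path)
    # basename keeps only the part after the final slash
    name = posixpath.basename(path)
    return name if name else fallback

print(filename_from_url("https://example.com/files/annual%20report.pdf?download=1"))
print(filename_from_url("https://example.com/"))
```

Two PDFs in different folders can share a file name, so the counter approach remains the safer choice when you only need unique local files.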
Now that we’ve downloaded a PDF, we process it using several classes from the PDFMiner module we installed. In short, the file is opened in binary mode and passed to a PDFParser and PDFDocument; a TextConverter and PDFPageInterpreter then process each page and write the extracted text to the output_string variable.
    output_string = StringIO()
    with open(filepath, 'rb') as in_file:
        parser = PDFParser(in_file)
        doc = PDFDocument(parser)
        rsrcmgr = PDFResourceManager()
        device = TextConverter(rsrcmgr, output_string, laparams=LAParams())
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.create_pages(doc):
            interpreter.process_page(page)
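A crawled URL sometimes returns an HTML error page or a redirect instead of a real PDF, and the parser will raise an error on such files. One possible guard, sketched here under the assumption that a valid file starts with the PDF signature, is to check the first bytes before parsing. The function name looks_like_pdf is hypothetical.

```python
import os
import tempfile

def looks_like_pdf(path):
    """Return True if the file starts with the PDF signature b'%PDF-'.

    Hypothetical helper; this is a strict check and may reject the rare
    file that has stray bytes before the signature.
    """
    with open(path, "rb") as handle:
        return handle.read(5) == b"%PDF-"

# Small demonstration with two temporary files
with tempfile.TemporaryDirectory() as folder:
    good = os.path.join(folder, "good.pdf")
    bad = os.path.join(folder, "bad.pdf")
    with open(good, "wb") as f:
        f.write(b"%PDF-1.7\n...")
    with open(bad, "wb") as f:
        f.write(b"<html><body>404 Not Found</body></html>")
    print(looks_like_pdf(good), looks_like_pdf(bad))
```

In the loop above, a check like this could sit right after the download so that non-PDF files are skipped with continue rather than stopping the whole run.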
With text in the output_string object, retrieve the contents with getvalue(). Line breaks are retained as '\n' characters, so we replace each double line break '\n\n', which usually marks a gap between blocks of text, with a space.
    text = output_string.getvalue()
    text = text.replace('\n\n',' ')
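Replacing double line breaks still leaves single line breaks and other stray whitespace in the text. If you want a single clean line per PDF, a slightly more thorough cleanup could look like the sketch below. The function normalize_pdf_text is a hypothetical addition, and its hyphen handling is a heuristic.

```python
import re

def normalize_pdf_text(text):
    """Flatten extracted PDF text into one whitespace-normalized string.

    Hypothetical helper for illustration only.
    """
    # Rejoin words split by a hyphen at a line end: "optimi-\nzation" -> "optimization"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse any run of whitespace (spaces, tabs, newlines, form feeds) into one space
    text = re.sub(r"\s+", " ", text)
    return text.strip()

raw = "Search engine optimi-\nzation\n\nfor  PDFs\x0c"
print(normalize_pdf_text(raw))
```

The hyphen rule will also merge genuine compounds that happen to break at a line end (for example "well-" followed by "known"), so drop that line if exact hyphenation matters for your analysis.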
Load the PDF URL and its text into a dictionary, append it to the dataframe, and repeat for remaining PDFs. When finished, display the dataframe.
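    keyword_dict = {"URL":pdf,"Text":text}
    # DataFrame.append() was removed in pandas 2.0, so add the row with pd.concat()
    df_pdf = pd.concat([df_pdf, pd.DataFrame([keyword_dict])], ignore_index=True)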
df_pdf.head()
From here, you can choose how to analyze the text.
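As one simple starting point, you could count the most frequent terms in each PDF’s text to see which topics it emphasizes. This is only a sketch: the function name top_terms, the tiny stopword set, and the sample sentence are all assumptions made for illustration, and a real analysis would likely use a fuller stopword list.

```python
import re
from collections import Counter

# Deliberately small stopword set, for illustration only
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "is", "on", "with"}

def top_terms(text, n=5):
    """Return the n most common words in text as (word, count) pairs."""
    # Lower-case the text and pull out runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    # Ignore stopwords and very short tokens
    counts = Counter(w for w in words if w not in STOPWORDS and len(w) > 2)
    return counts.most_common(n)

sample = "PDF optimization for SEO: optimize the PDF title, PDF links, and SEO metadata."
print(top_terms(sample, 2))  # [('pdf', 3), ('seo', 2)]
```

To apply something like this to every row, you could pass the function to the Text column with df_pdf['Text'].apply(top_terms).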
Conclusion
You now have a framework to parse PDFs for analysis, data mining, or scraping. There’s a lot of potential to extend this approach.
Give it a try! Follow me on Twitter and share your Python SEO applications and ideas!