PDF files are ubiquitous in various industries, but programmatically extracting data from them can be complex. PDFMiner, a powerful Python library, helps parse and extract content from PDFs in formats like plain text, HTML, XML, or tagged text.
This tutorial explains how to use a comprehensive PDF extraction script. We’ll walk through its structure and functionality step by step, then look at how to apply it to SEO and similar use cases.
Requirements
Before diving into the script, ensure the following are installed:
- Python 3.x
- PDFMiner (`pdfminer.six`): Install via pip:

```
pip install pdfminer.six
```
You’ll also need sample PDFs to test the script.
Imports and Setup
```python
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import sys
import getopt
```
Core PDFMiner modules:
- `PDFParser`: Parses the raw PDF file.
- `PDFDocument`: Represents the parsed PDF document.
- `PDFResourceManager`: Manages shared resources like fonts and images.
- `PDFPageInterpreter`: Processes PDF pages.
- Converters (`TextConverter`, `HTMLConverter`, `XMLConverter`): Convert PDF data into different formats (text, HTML, XML).
- `LAParams`: Controls layout analysis, such as margins and spacing.
Command-line utilities:
- `sys`: Handles command-line arguments.
- `getopt`: Parses options passed to the script.
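As a quick illustration of how `getopt` splits flags from positional arguments, here is a standalone sketch using a shortened option string (a hypothetical subset of the one the script uses later):

```python
import getopt

# 'o:' and 't:' mean that -o and -t each take a value.
# Anything left over after the flags ends up in args.
argv = ['-o', 'out.txt', '-t', 'text', 'input.pdf']
opts, args = getopt.getopt(argv, 'o:t:')

print(opts)  # [('-o', 'out.txt'), ('-t', 'text')]
print(args)  # ['input.pdf']
```

`opts` is a list of `(flag, value)` pairs, which is why the script later iterates over it pairwise.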
The Main Function
The `main()` function drives the script. It:
- Parses command-line arguments.
- Sets up resources, converters, and layout parameters.
- Processes PDF pages.
- Outputs extracted data.
```python
def main(argv):
    def usage():
        print(f'usage: {argv[0]} [options] input.pdf ...')
        return 100
```
The `usage()` function displays instructions when the script is run without valid arguments or with the `-h` option.
Parsing Command-Line Arguments
```python
try:
    (opts, args) = getopt.getopt(argv[1:], 'dP:o:t:O:c:s:R:Y:p:m:SCnAVM:W:L:F:')
except getopt.GetoptError:
    return usage()
if not args:
    return usage()
```
The script accepts various options:
- `-P`: Password for encrypted PDFs.
- `-o`: Output file name.
- `-t`: Output format (`text`, `html`, `xml`, `tag`).
- `-p`: Page numbers to extract (comma-separated).
- `-R`: Rotate pages by a specified angle.
- `-m`: Maximum number of pages to process.
- `-S`: Strip control characters.
- `-C`: Disable caching.
- `-n`: Disable layout analysis.
- `-M`, `-W`, `-L`: Adjust character, word, and line margins.
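The exact handling of each flag appears later in the script. As one illustrative sketch (not the script's verbatim code), a comma-separated `-p` value can be turned into the zero-based page set that PDFMiner expects like this:

```python
def parse_pagenos(value):
    """Convert a -p style value like '1,3,5' into a zero-based page set."""
    return set(int(x) - 1 for x in value.split(','))

print(parse_pagenos('1,3,5'))  # {0, 2, 4}
```

The `- 1` shift matters because users think of pages as starting at 1, while PDFMiner indexes them from 0.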
Setting Default Parameters
```python
debug = 0
password = b''
pagenos = set()
maxpages = 0
outfile = None
outtype = None
imagewriter = None
rotation = 0
stripcontrol = False
layoutmode = 'normal'
encoding = 'utf-8'
caching = True  # needed later by the resource manager and page iterator
laparams = LAParams()
```
- Debugging: Controls the verbosity of logging.
- Password: For password-protected PDFs.
- Page Numbers: Specifies which pages to extract.
- Output Type: Sets the format for extracted data.
- Layout Parameters (`LAParams`): Customizes text layout analysis.
Handling Layout Analysis
```python
# opts is a list of (flag, value) pairs produced by getopt
for (k, v) in opts:
    if k == '-n':
        laparams = None
    elif k == '-A':
        laparams.all_texts = True
    elif k == '-M':
        laparams.char_margin = float(v)
    elif k == '-W':
        laparams.word_margin = float(v)
    elif k == '-L':
        laparams.line_margin = float(v)
```
- `all_texts`: Forces extraction of all text, including hidden elements.
- `char_margin`: Defines the spacing between characters for grouping into words.
- `word_margin`: Sets spacing between words.
- `line_margin`: Controls the spacing between lines of text.
Resource Manager and Device Setup
```python
rsrcmgr = PDFResourceManager(caching=caching)
if outtype == 'text':
    device = TextConverter(rsrcmgr, outfp, laparams=laparams)
elif outtype == 'html':
    device = HTMLConverter(rsrcmgr, outfp, laparams=laparams, layoutmode=layoutmode)
elif outtype == 'xml':
    device = XMLConverter(rsrcmgr, outfp, laparams=laparams)
```
- `PDFResourceManager`: Manages shared resources (e.g., fonts).
- Converters: Initialize a converter for the desired output format.
Processing Pages
```python
for fname in args:
    with open(fname, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        for page in PDFPage.get_pages(fp, pagenos, password=password, caching=caching):
            page.rotate = (page.rotate + rotation) % 360
            interpreter.process_page(page)
```
- PDFPage.get_pages: Iterates through selected pages in the input PDF.
- Rotation: Applies a rotation to the pages if specified.
- Interpreter: Processes each page and sends the content to the converter.
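The rotation line deserves a closer look: the `% 360` keeps the combined angle a valid PDF rotation even when the page's existing rotation plus the requested one exceeds a full turn. A standalone sketch of that arithmetic:

```python
def apply_rotation(page_rotate, rotation):
    """Combine a page's existing rotation with a requested one, wrapping at 360."""
    return (page_rotate + rotation) % 360

print(apply_rotation(0, 90))     # 90
print(apply_rotation(270, 180))  # 90
```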
Output Management
```python
if outfile:
    outfp = open(outfile, 'w', encoding=encoding)
else:
    outfp = sys.stdout
```
The script writes the output to a specified file or prints it to the console. Note that `outfp` must be set before the converter is created, since the converters write directly to it.
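One caveat with this pattern: a real file should be closed when you're done, but `sys.stdout` must not be. A small sketch of a safer variant (the helper name is ours, not the script's):

```python
import sys

def open_output(outfile, encoding='utf-8'):
    """Return (stream, should_close): a file if outfile is given, else stdout."""
    if outfile:
        return open(outfile, 'w', encoding=encoding), True
    return sys.stdout, False

outfp, should_close = open_output(None)
print(outfp is sys.stdout)  # True
if should_close:
    outfp.close()
```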
Application to SEO
Using the Script for SEO Research
PDFs often contain valuable SEO data:
- Reports from tools like SEMrush or Ahrefs.
- Whitepapers or case studies with keyword analysis.
- Research papers on search engine trends.
Here’s how to extract SEO-related data using the script.
Run the script to extract the plain text from an SEO report (e.g., with `-t text -o report_text.txt`), then search the text for your target keywords:

```python
import re

with open('report_text.txt', 'r') as file:
    content = file.read()

# The keyword names here are placeholders; substitute your own terms.
keywords = re.findall(r'\b(?:keyword1|keyword2|keyword3)\b', content, re.IGNORECASE)
print(keywords)
```
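To go a step further, you can count how often each keyword occurs rather than just listing the matches. A sketch using the standard library's `Counter` (keyword names are placeholders; `content` stands in for the extracted report text):

```python
import re
from collections import Counter

content = "keyword1 appears here, then keyword2, then keyword1 again."

matches = re.findall(r'\b(?:keyword1|keyword2|keyword3)\b', content, re.IGNORECASE)
# Lowercase before counting so 'Keyword1' and 'keyword1' tally together.
counts = Counter(match.lower() for match in matches)
print(counts)  # Counter({'keyword1': 2, 'keyword2': 1})
```

Frequency counts like these make it easy to compare keyword emphasis across several extracted reports.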
Conclusion
This script demonstrates the flexibility of Python and PDFMiner for extracting data from PDFs. By breaking down the script:
- We understand how to customize layout parameters, page selection, and output formats.
- Applications like SEO research highlight how the extracted data can be used for real-world insights.
With further customization, this script can automate data extraction for SEO, research, or other domains, unlocking the valuable information stored in PDFs.
Follow me at: https://www.linkedin.com/in/gregbernhardt/