Extracting Data from PDFs Using PDFMiner

PDF files are ubiquitous in various industries, but programmatically extracting data from them can be complex. PDFMiner, a powerful Python library, helps parse and extract content from PDFs in formats like plain text, HTML, XML, or tagged text.

This tutorial walks through a command-line PDF extraction script step by step, covering its structure and options, and then shows how to apply it to SEO and similar use cases.

Requirements

Before diving into the script, ensure the following are installed:

  • Python 3.x
  • PDFMiner (pdfminer.six): Install via pip:

    pip install pdfminer.six

You’ll also need sample PDFs to test the script.

Imports and Setup

from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdfparser import PDFParser
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import XMLConverter, HTMLConverter, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
import sys
import getopt

Core PDFMiner modules:

  • PDFParser: Parses the raw PDF file.
  • PDFDocument: Represents the parsed PDF document.
  • PDFResourceManager: Manages shared resources like fonts and images.
  • PDFPageInterpreter: Processes PDF pages.
  • Converters: Convert PDF data into different formats (text, HTML, XML).
  • LAParams: Controls layout analysis, such as margins and spacing.

Command-line utilities:

  • sys: Provides sys.argv for command-line arguments and sys.stdout for console output.
  • getopt: Parses options passed to the script.
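
To see how these classes fit together before reading the full script, here is a minimal sketch that extracts plain text into a string. The function name pdf_to_text and its parameters are illustrative and not part of the script. The pdfminer imports sit inside the function only so that the definition loads even where the library is not installed.

```python
from io import StringIO


def pdf_to_text(path, page_numbers=None, max_pages=0):
    """Return the text of a PDF as one string (hypothetical helper)."""
    # Imported lazily so this module can be loaded without pdfminer.six.
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams
    from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
    from pdfminer.pdfpage import PDFPage

    buffer = StringIO()
    rsrcmgr = PDFResourceManager()
    # The converter is the "device" that receives what the interpreter renders.
    device = TextConverter(rsrcmgr, buffer, laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        # page_numbers are zero-based; None (or an empty set) means every page.
        for page in PDFPage.get_pages(fp, page_numbers, maxpages=max_pages):
            interpreter.process_page(page)
    device.close()
    return buffer.getvalue()
```

Calling pdf_to_text('seo_report.pdf') would return the extracted text, assuming a file with that name exists and pdfminer.six is installed.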

The Main Function

The main() function drives the script. It:

  • Parses command-line arguments.
  • Sets up resources, converters, and layout parameters.
  • Processes PDF pages.
  • Outputs extracted data.

def main(argv):
    def usage():
        print(f'usage: {argv[0]} [options] input.pdf ...')
        return 100

The usage() function displays instructions when the script is run without valid arguments or with the -h option.

Parsing Command-Line Arguments

try:
    (opts, args) = getopt.getopt(argv[1:], 'dP:o:t:O:c:s:R:Y:p:m:SCnAVM:W:L:F:')
except getopt.GetoptError:
    return usage()
if not args: return usage()

The script accepts various options:

  • -P: Password for encrypted PDFs.
  • -o: Output file name.
  • -t: Output format (text, html, xml, tag).
  • -p: Page numbers to extract (comma-separated).
  • -R: Rotate pages by a specified angle.
  • -m: Maximum number of pages to process.
  • -S: Strip control characters.
  • -C: Disable caching.
  • -n: Disable layout analysis.
  • -M, -W, -L: Adjust character, word, and line margins.
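
getopt.getopt returns the options as a list of (flag, value) pairs, so the script has to loop over that list to turn flags into settings. The sketch below does this for a subset of the options listed above. The function name parse_options and the settings dictionary are assumptions made for illustration, as is the choice to treat -p values as 1-based page numbers typed by the user.

```python
import getopt


def parse_options(argv):
    """Return (settings, input_files) for a subset of the script's flags."""
    settings = {
        'password': '', 'outfile': None, 'outtype': None,
        'pagenos': set(), 'maxpages': 0, 'rotation': 0, 'caching': True,
    }
    # A trailing colon in the option string means the flag takes a value.
    opts, args = getopt.getopt(argv[1:], 'P:o:t:p:R:m:C')
    for flag, value in opts:
        if flag == '-P':
            settings['password'] = value
        elif flag == '-o':
            settings['outfile'] = value
        elif flag == '-t':
            settings['outtype'] = value
        elif flag == '-p':
            # Assumption: 1-based input, converted to the zero-based
            # indices that PDFPage.get_pages expects.
            settings['pagenos'].update(int(n) - 1 for n in value.split(','))
        elif flag == '-R':
            settings['rotation'] = int(value)
        elif flag == '-m':
            settings['maxpages'] = int(value)
        elif flag == '-C':
            settings['caching'] = False
    return settings, args
```

For example, parse_options(['script.py', '-p', '1,3', 'report.pdf']) yields a pagenos set of {0, 2} and the file list ['report.pdf'].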

Setting Default Parameters

debug = 0
password = ''  # pdfminer.six expects the password as a str
pagenos = set()
maxpages = 0
outfile = None
outtype = None
imagewriter = None
rotation = 0
stripcontrol = False
layoutmode = 'normal'
encoding = 'utf-8'
caching = True
laparams = LAParams()

  • Debugging: Controls the verbosity of logging.
  • Password: For password-protected PDFs.
  • Page Numbers: Specifies which pages to extract; an empty set means all pages.
  • Output Type: Sets the format for extracted data.
  • Caching: Enabled by default; the -C option turns it off.
  • Layout Parameters (LAParams): Customizes text layout analysis.

Handling Layout Analysis

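# opts is a list of (flag, value) pairs, so loop over it rather than testing membership.
for (k, v) in opts:
    if k == '-n': laparams = None
    elif k == '-A': laparams.all_texts = True
    elif k == '-M': laparams.char_margin = float(v)
    elif k == '-W': laparams.word_margin = float(v)
    elif k == '-L': laparams.line_margin = float(v)

  • all_texts: Also runs layout analysis on text that sits inside figures.
  • char_margin: Characters closer together than this margin (relative to character width) are grouped into the same line.
  • word_margin: Characters on the same line that are further apart than this margin are treated as separate words, and a space is inserted between them.
  • line_margin: Lines closer together than this margin (relative to line height) are grouped into the same paragraph.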
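
Because -n replaces laparams with None, a later -A, -M, -W, or -L on the same command line would try to set an attribute on None. One possible way around that, sketched here under the assumption that you are free to restructure the script, is to collect the layout flags into keyword arguments first and build LAParams once at the end. The helper name laparams_kwargs is hypothetical.

```python
def laparams_kwargs(opts):
    """Translate getopt (flag, value) pairs into LAParams keyword arguments.

    Returns None when -n (disable layout analysis) is present, regardless
    of where it appears among the other flags.
    """
    kwargs = {}
    disabled = False
    for flag, value in opts:
        if flag == '-n':
            disabled = True
        elif flag == '-A':
            kwargs['all_texts'] = True
        elif flag == '-M':
            kwargs['char_margin'] = float(value)
        elif flag == '-W':
            kwargs['word_margin'] = float(value)
        elif flag == '-L':
            kwargs['line_margin'] = float(value)
    return None if disabled else kwargs


# Intended use inside the script (requires pdfminer.six):
#   kwargs = laparams_kwargs(opts)
#   laparams = LAParams(**kwargs) if kwargs is not None else None
```

An empty dictionary simply produces LAParams() with its default margins.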

Resource Manager and Device Setup

rsrcmgr = PDFResourceManager(caching=caching)
if outtype == 'text':
    device = TextConverter(rsrcmgr, outfp, laparams=laparams)
elif outtype == 'html':
    device = HTMLConverter(rsrcmgr, outfp, laparams=laparams, layoutmode=layoutmode)
elif outtype == 'xml':
    device = XMLConverter(rsrcmgr, outfp, laparams=laparams)

  • PDFResourceManager: Manages shared resources (e.g., fonts).
  • Converters: Initialize a converter for the desired output format.
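
The defaults leave outtype as None, and the branches above only match 'text', 'html', or 'xml', so no device is created when -t is omitted. A small fallback step before the converter setup avoids that. The sketch below guesses the format from the output file's extension and otherwise falls back to plain text; the helper name resolve_outtype and the extension table are assumptions, not part of the script as shown.

```python
import os

# Hypothetical mapping from output file extension to output type.
EXTENSION_TO_OUTTYPE = {
    '.htm': 'html',
    '.html': 'html',
    '.xml': 'xml',
    '.tag': 'tag',
}


def resolve_outtype(outtype, outfile):
    """Pick an output type: explicit -t wins, then the file extension, then text."""
    if outtype:
        return outtype
    if outfile:
        ext = os.path.splitext(outfile)[1].lower()
        return EXTENSION_TO_OUTTYPE.get(ext, 'text')
    return 'text'
```

Calling outtype = resolve_outtype(outtype, outfile) just before the if/elif chain would guarantee that one of the known formats is selected, provided every value it can return has a matching branch.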

Processing Pages

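for fname in args:
    with open(fname, 'rb') as fp:
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        # maxpages=0 means no limit; pass it so the -m option takes effect.
        for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching):
            page.rotate = (page.rotate + rotation) % 360
            interpreter.process_page(page)
# Close the converter so any closing markup (HTML/XML) is written.
device.close()

  • PDFPage.get_pages: Iterates through the selected pages of the input PDF. Page numbers in pagenos are zero-based, and an empty set selects every page.
  • Rotation: Adds the requested rotation to each page's own rotation, wrapping at 360 degrees.
  • Interpreter: Processes each page and sends the content to the converter.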
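
The rotation arithmetic is easy to check in isolation. The sketch below wraps it in a hypothetical helper, combined_rotation, and adds a validation step that the script itself does not perform: PDF page rotations are defined in multiples of 90 degrees, so other values are rejected here.

```python
def combined_rotation(page_rotate, extra_rotation):
    """Add a user-requested rotation to a page's own rotation, wrapped to 0-359."""
    if extra_rotation % 90 != 0:
        raise ValueError('rotation must be a multiple of 90 degrees')
    # Python's % always returns a non-negative result for a positive modulus,
    # so negative (counter-clockwise) rotations wrap correctly too.
    return (page_rotate + extra_rotation) % 360
```

For instance, a page already rotated by 270 degrees and passed -R 180 ends up at 90 degrees.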

Output Management

if outfile:
    outfp = open(outfile, 'w', encoding=encoding)
else:
    outfp = sys.stdout

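The script writes the output to the file named by -o, or prints it to the console when no file is given. Because the converters are constructed with outfp, this block has to run before the converter setup shown earlier.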
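
One detail worth handling is cleanup: an output file should be closed when extraction finishes, while sys.stdout should be left open. A context manager is one way to express that. This is a sketch with a hypothetical name, open_output, rather than code from the script.

```python
import contextlib
import sys


@contextlib.contextmanager
def open_output(outfile=None, encoding='utf-8'):
    """Yield a writable text stream: the named file, or stdout if none is given."""
    if outfile:
        fp = open(outfile, 'w', encoding=encoding)
        try:
            yield fp
        finally:
            # Only files we opened ourselves are closed.
            fp.close()
    else:
        yield sys.stdout
```

Used as "with open_output(outfile) as outfp:", the converter setup and page loop would sit inside the with block.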

Application to SEO

Using the Script for SEO Research

PDFs often contain valuable SEO data:

  • Reports from tools like SEMrush or Ahrefs.
  • Whitepapers or case studies with keyword analysis.
  • Research papers on search engine trends.

Here’s how to extract SEO-related data using the script.

Run the Script: Extract the plain text from an SEO report:

python script.py -o report_text.txt -t text seo_report.pdf

Analyze Extracted Content: Open the report_text.txt file and look for keywords, metrics, or trends.

Automate Keyword Extraction: Use Python’s built-in re module to extract keywords:

import re

with open('report_text.txt', 'r', encoding='utf-8') as file:
    content = file.read()
# Replace keyword1, keyword2, keyword3 with the terms you want to find.
keywords = re.findall(r'\b(?:keyword1|keyword2|keyword3)\b', content, re.IGNORECASE)
print(keywords)
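
A list of raw matches is often less useful than a count per keyword. As a possible next step, the sketch below builds the pattern from a list of terms and tallies the matches case-insensitively. The function name count_keywords and the sample terms are illustrative.

```python
import re
from collections import Counter


def count_keywords(text, keywords):
    """Count case-insensitive whole-word occurrences of each keyword in text."""
    if not keywords:
        return Counter()
    # re.escape keeps characters such as '.' or '+' in a keyword literal.
    alternatives = '|'.join(re.escape(k) for k in keywords)
    pattern = re.compile(r'\b(?:' + alternatives + r')\b', re.IGNORECASE)
    # Lowercase each match so "Backlinks" and "backlinks" share one counter key.
    return Counter(match.lower() for match in pattern.findall(text))
```

Passing the content string read from report_text.txt together with your own keyword list returns a Counter, and its most_common() method ranks the terms by frequency.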

Conclusion

This script demonstrates the flexibility of Python and PDFMiner for extracting data from PDFs. By breaking down the script:

  • We understand how to customize layout parameters, page selection, and output formats.
  • Applications like SEO research highlight how the extracted data can be used for real-world insights.

With further customization, this script can automate data extraction for SEO, research, or other domains, unlocking the valuable information stored in PDFs.

Follow me at: https://www.linkedin.com/in/gregbernhardt/

Greg Bernhardt