Extract Tables from PDFs with Tesseract OCR

Extracting Tables from PDFs

Converting PDF documents into actionable data is a common challenge for many businesses. PDFs often contain valuable tabular data: financial figures, product specifications, research findings, and more. Manually extracting this tabular data is tedious and error-prone.

Automating the extraction of tables from PDFs unlocks several key benefits:

  • It saves enormous time and effort compared to manual data entry. Teams can focus on analyzing data instead of manually reformatting it.

  • It eliminates human error that inevitably occurs with manual extraction. Automated approaches extract tables with higher accuracy.

  • Data can be exported into spreadsheets or databases for further analysis. Instead of static tables in a PDF, you have structured, machine-readable data.

  • It enables large-scale processing of thousands of PDF documents, so extraction is no longer limited by manual effort.

  • Tables extracted from PDFs can be used to train AI systems. The extracted data helps create better document understanding models.

As more business-critical information gets captured in PDF documents, extracting tables provides immense value. It takes an unstructured document format and produces structured, usable data.

Step by Step with Python

Extracting Table Data from PDFs using Tesseract OCR

Prerequisites

  • Python installed
  • Tesseract OCR installed and available on your PATH
  • Poppler installed (pdf2image uses it to rasterize PDF pages)
  • pytesseract, pdf2image, pandas Python libraries installed

Step 1: Convert PDF to Images

First, convert each page of the PDF into separate images.

from pdf2image import convert_from_path

def convert_pdf_to_images(pdf_file):
    # Rasterize every page; a higher dpi gives Tesseract more pixels
    # to work with and usually improves OCR accuracy
    return convert_from_path(pdf_file, dpi=300)

images = convert_pdf_to_images('your_pdf_file.pdf')
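
Each element of images is a PIL Image object, so you can save a page out to disk to sanity-check the rendering before running OCR:

images[0].save('page_1.png')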

Step 2: Apply OCR on Images

Next, run Tesseract OCR on each page image to pull out the raw text.

import pytesseract

def extract_text_from_images(images):
    # Run Tesseract on each page image and collect the text per page
    text_list = []
    for image in images:
        text = pytesseract.image_to_string(image)
        text_list.append(text)
    return text_list

extracted_text = extract_text_from_images(images)
 

Step 3: Extract Table data

Extracting table data from OCR text can be tricky and may require custom processing based on the table format. Here, we demonstrate a basic approach using regular expressions (regex). More complex scenarios will require a much more robust method to locate and extract characters in a structured way.

import pandas as pd
import re
 
def extract_table_data(text_list):
    tables = []
    for text in text_list:
        # Assuming each line of the table starts with a digit
        # (customize this regex based on your table format)
        lines = re.findall(r'^\d+.*', text, re.MULTILINE)
        # Split each line into columns
        table = [line.split() for line in lines]
        df = pd.DataFrame(table)
        tables.append(df)
    return tables
 
tables = extract_table_data(extracted_text)
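
If every page carries the same table layout, the per-page DataFrames can be combined for analysis. A minimal sketch, assuming the pages produce matching column counts:

# Stack the per-page tables into one DataFrame with a fresh row index
combined = pd.concat(tables, ignore_index=True)
print(combined.head())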

Step 4: Output to CSV

Once you have extracted the table data, you can output each table to a separate CSV file.

def output_tables_to_csv(tables):
    # Write each extracted table to its own CSV file
    for i, table in enumerate(tables):
        # index=False omits the DataFrame's numeric row index
        table.to_csv(f'table_{i}.csv', index=False)

output_tables_to_csv(tables)
 

Challenges of Extracting Tables from PDFs

PDF documents present unique challenges for extracting tabular data compared to HTML webpages. This is because PDFs store data very differently than HTML.

  • PDFs contain a mix of text, images, and drawing commands that are rendered to create the visual layout. The content is not stored as structured tables with defined rows, columns, and cells like HTML.

  • Tables in PDF documents are created using combinations of text, spacing, and lines to organize information visually in a grid. There is no semantic data that identifies rows and columns in the raw PDF file.

  • The position and size of cells in PDF tables are defined by absolute coordinates instead of table structures. There are no clean tags identifying the start and end of tables.

  • OCR text extraction struggles to pull text out of tables cleanly because words get broken across cells and rows. The lack of semantic structure is the root cause.

  • PDF table data extraction requires reconstructing the table structure from the visual representation back into structured data. This is challenging because it requires mimicking the human understanding of tables.

The lack of semantic structure and table-specific metadata within raw PDF files makes extracting clean and structured tabular data from PDF documents a complex problem. The solution relies heavily on accurately reconstructing table structures from the visual layout and presentation.

Approaches to PDF Table Extraction

There are two main approaches to extracting tables from PDF documents - rule-based extraction and machine learning extraction.

Rule-based Extraction

Rule-based extraction relies on manually defined rules and patterns to identify tables within a PDF document. It looks at visual characteristics like horizontal and vertical lines, cell spacing patterns, and text formatting to detect tables.

Some key aspects of rule-based table extraction include:

  • Analyzing horizontal and vertical lines to detect table boundaries and cell divisions. Long contiguous lines likely indicate borders.

  • Looking at white space and gaps between text to identify column and row divisions. Regular patterns in spacing often signify structured tables (a minimal sketch of this heuristic appears after this list).

  • Checking for text formatting consistency within content clusters to detect header rows and data cells.

  • Using the presence and position of text relative to lines to differentiate data cells from other content.

  • Applying heuristics around expected table structures and layouts. Common patterns can help validate table detection.

  • Customizing and tuning rules over time based on the specific PDF document corpus.
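
As a rough illustration of the whitespace heuristic above, the sketch below scans OCR'd text lines for character positions that are blank on every line and treats long runs of them as column gutters. It assumes the OCR output preserves horizontal spacing (Tesseract's -c preserve_interword_spaces=1 option can help with that), and the min_gap threshold is an assumption to tune:

def find_column_gutters(lines, min_gap=2):
    # Pad lines to equal width so character columns can be compared
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    # A position belongs to a gutter if it is blank on every line
    blank = [all(line[i] == ' ' for line in padded) for i in range(width)]
    # Collapse consecutive blank positions into (start, end) gutter ranges
    gutters, start = [], None
    for i, is_blank in enumerate(blank + [False]):
        if is_blank and start is None:
            start = i
        elif not is_blank and start is not None:
            if i - start >= min_gap:
                gutters.append((start, i))
            start = None
    return gutters

Each returned range is a candidate column boundary; splitting every line at those ranges yields the cells for a rule-based extraction.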

Rule-based extraction can be effective when tables follow fairly standard layouts and formats within a given document domain. The main downside is that hand-crafted rules are brittle and often fail on complex or unusual table structures.

Machine Learning Extraction

Machine learning extraction instead trains models on labeled examples of tables. Object-detection models operating on page images can locate tables even when gridlines are absent, and layout models can infer row and column structure from the detected regions. These approaches generalize across varied layouts better than hand-crafted rules, but they require labeled training data and more compute.

Key Steps for PDF Table Extraction

Extracting tables from PDFs requires several key steps:

Detect Tables on Page

The first step is to detect all the tables present on each page of the PDF document. This involves scanning the layout of each page and identifying table-like structures based on visual cues like gridlines, spacing patterns, etc. Advanced algorithms can detect tables even without clear gridlines.

Identify Table Structure

Once table locations are detected, the next step is to identify the structure of each table. This means detecting the number of rows and columns, identifying column spans and row spans, and determining table headers and data cells. The spatial relationships between cells can reveal the table structure.

Extract Cell Data

After the table structure is determined, the actual cell data needs to be extracted. Text, numbers, and images within each cell are identified and extracted. Advanced OCR capability is needed to extract text accurately.

Output Structured Table Data

The final step is to output the extracted data in a structured format like CSV, JSON or XML rather than unstructured text. This tabular structure preserves the table relationships and allows easy importing into other apps.

Properly implementing these key steps results in accurate extraction of tables from PDF to structured data formats. Robust algorithms and OCR capabilities are needed to handle varied table layouts and data types. The resulting structured table data unlocks the ability to analyze, manipulate and share the tables from PDF documents.

Detecting Table Locations

Extracting tables from PDFs starts with identifying where the tables are located on the page. This involves using computer vision techniques to scan the PDF and detect areas that have the characteristics of tables.

Some key approaches for detecting table locations include:

  • Identifying regions with lines/boxes - One of the hallmarks of tables is that they contain lines and boxes outlining rows and columns. By scanning for continuous line segments and enclosed boxes, we can get a good sense of regions that potentially contain tables. This involves analyzing the vector information in the PDF to look for patterns of lines (a sketch of this line-based signal follows the list).

  • Finding areas with tabular data patterns - In addition to lines and boxes, tables exhibit strong horizontal/vertical alignment of text into rows and columns. By analyzing the formatting and spatial arrangement of text, we can detect areas with tabular patterns—even without visible lines. This relies on identifying proximity and alignment of text to infer columns and rows.

  • Distinguishing from figures or images - Some figures and images in a PDF can also contain grid-like shapes that may resemble tables. To avoid misidentifying these non-table elements, the table detection algorithm needs to look at other signals like text density and differentiation between graphics and text. Figures and images can then be filtered out as potential table locations.
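
Since this tutorial's pipeline works on rasterized page images rather than PDF vector data, the sketch below takes the image route instead: OpenCV morphological opening isolates long horizontal and vertical strokes. OpenCV is an extra dependency not used elsewhere in this tutorial, and the kernel lengths are assumptions to tune per document:

import cv2

def find_line_mask(page_image_path):
    # Load the page as grayscale and binarize it (inverted: ink becomes white)
    gray = cv2.imread(page_image_path, cv2.IMREAD_GRAYSCALE)
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV | cv2.THRESH_OTSU)
    # Opening with long, thin kernels keeps only long straight lines
    h_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (40, 1))
    v_kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (1, 40))
    horizontal = cv2.morphologyEx(binary, cv2.MORPH_OPEN, h_kernel)
    vertical = cv2.morphologyEx(binary, cv2.MORPH_OPEN, v_kernel)
    # Pixels dense in both line masks mark candidate table regions
    return cv2.add(horizontal, vertical)

Connected regions in the returned mask (for example via cv2.findContours) give candidate table bounding boxes to pass to the next stage.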

By combining these techniques, the table detection stage provides candidate table regions that can then be passed along for more detailed analysis and structure identification in the next steps. This detection process removes the vast majority of non-relevant content, making the full extraction task much more efficient and targeted.

Identifying Table Structure

Once the location of tables is detected, the next step is identifying the structure of each table. This involves determining the rows and columns as well as handling complexities like row/column spans and merged cells.

Determining the rows and columns provides the overall grid structure of the table. Look for horizontal dividing lines to detect rows and vertical lines to detect columns. The intersections of these lines form the cells of the table.

However, tables can have spans, where a single cell extends over multiple rows or columns to merge that cell space. When identifying the table structure, check for cells that cross the dividing lines and span multiple rows or columns.

Merged cells are another structural complexity to handle. These occur when cell borders are missing, causing cells to improperly merge together. To detect merged cells, look for missing borders and identify the distinct logical cells.

Properly identifying the table structure by detecting rows, columns, spans, and merged cells ensures the accuracy of the extracted table data. The table structure provides the template for mapping the extracted cell text into a structured tabular format.

Handling these complexities takes advanced logic and algorithms. But with the right approach, identifying the underlying table structure is achievable even for complex PDF table layouts.
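
When gridlines are faint or absent, one alternative is to cluster the word bounding boxes that Tesseract itself reports. Below is a minimal sketch using pytesseract's image_to_data; the row_tolerance value is an assumption you would tune to the document's line spacing:

import pytesseract
from pytesseract import Output

def group_words_into_rows(image, row_tolerance=10):
    # image_to_data returns one entry per detected word, with pixel coordinates
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    rows = {}
    for text, top, left in zip(data['text'], data['top'], data['left']):
        if not text.strip():
            continue
        # Words whose top edges fall in the same tolerance band share a row
        key = round(top / row_tolerance)
        rows.setdefault(key, []).append((left, text))
    # Sort rows top-to-bottom, and words within each row left-to-right
    return [sorted(words) for _, words in sorted(rows.items())]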

Extracting Cell Data

Once the location and structure of tables are identified, the next step is extracting the actual cell contents. This involves isolating each cell and identifying whether it contains text, images, or a combination.

To isolate cells, the space between cell boundaries must be detected. This can be challenging if lines are faint or obscured by other graphics or content. Advanced algorithms look at white space and change in font to delineate cell edges accurately.

When extracting cell contents, the system must recognize text characters through optical character recognition (OCR). Well-trained OCR models can identify text even when distorted, tilted, or against a colorful background. Images within cells also need to be detected and extracted separately from text.
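
Once a cell's bounding box is known (for example from the line mask or word clustering sketched earlier), the cell can be cropped out and OCR'd in isolation. A minimal sketch, assuming page_image is a PIL Image such as pdf2image produces; --psm 7 tells Tesseract to expect a single line of text:

import pytesseract

def ocr_cell(page_image, box):
    # box is (left, top, right, bottom) in pixels
    cell = page_image.crop(box)
    # Treat the crop as one line of text rather than a full page layout
    return pytesseract.image_to_string(cell, config='--psm 7').strip()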

Some cells contain a mix of text and images, numbers, symbols, and other multimedia. The cell extraction process must handle these complex data types and preserve the formatting and arrangement. This may involve recombining text, images, and other elements after extraction.

Robust cell extraction maximizes the accuracy when isolating and identifying all types of cell contents. This provides the cleanest data to feed into the structured table output.

Outputting Structured Table Data

Once the table data has been extracted from the PDF, the next step is outputting it in a structured format that can be easily consumed by other applications. Two common options for output formats are CSV (comma separated values) and Excel spreadsheets.

Converting the extracted data into CSV provides a simple tabular format that can be imported into databases and analyzed in spreadsheet programs. The CSV contains the rows and columns of data, separated by commas or other delimiters. This allows the table contents to be accessed programmatically by other systems.

For usage in Excel, the output can be formatted into .xlsx or .xls files. The tabular structure and cell contents are preserved from the PDF table. Formatting like colors, borders and text styles can also be applied to more closely resemble the original table styling. Excel files are useful for business analysis and data processing.
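
With the tables held as pandas DataFrames, all three formats fall out of the same object. A small sketch (to_excel additionally needs an engine such as openpyxl installed):

import pandas as pd

def export_table(df: pd.DataFrame, stem: str):
    df.to_csv(f'{stem}.csv', index=False)         # plain delimited text
    df.to_json(f'{stem}.json', orient='records')  # one object per row
    df.to_excel(f'{stem}.xlsx', index=False)      # spreadsheet output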

Additionally, PDF table extraction solutions may provide developer APIs that allow integrating the extraction capabilities into other applications. The API endpoints can accept a PDF file input, run the extraction process, and return structured data in JSON, XML or other formats. This enables developers to build table extraction into their own workflows and systems.
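
As an illustration of what such an API might look like, here is a minimal sketch built with FastAPI, reusing the Step 1-3 functions from the tutorial. The framework choice and the /extract-tables route are assumptions for illustration, not something prescribed by this article:

import tempfile

from fastapi import FastAPI, File, UploadFile

app = FastAPI()

@app.post('/extract-tables')
async def extract_tables(pdf: UploadFile = File(...)):
    # Persist the upload so pdf2image can read it from disk
    with tempfile.NamedTemporaryFile(suffix='.pdf', delete=False) as tmp:
        tmp.write(await pdf.read())
        path = tmp.name
    images = convert_pdf_to_images(path)           # Step 1
    text_list = extract_text_from_images(images)   # Step 2
    tables = extract_table_data(text_list)         # Step 3
    # Return each table as a list of row records
    return {'tables': [t.to_dict(orient='records') for t in tables]}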

Overall, converting the unstructured PDF table data into formats like CSV, Excel and APIs creates opportunities for further analysis, visualization and automation. The output options make the table data accessible to platforms beyond just PDF readers.

Real-World Applications

Extracting tables from PDF documents has many useful real-world applications for businesses and organizations:

Data Mining and Analysis from Reports

Financial reports, market research reports, and other business documents often contain valuable tabular data. By automatically extracting tables, companies can pull data from these reports into spreadsheets or databases for further quantitative analysis. This helps uncover insights and trends.

For example, a business intelligence analyst at a finance firm could extract earnings tables from thousands of quarterly reports to aggregate sales and profitability metrics by industry sector. The automated extraction saves huge amounts of manual work.

Database Population

Organizations like banks, insurance firms, utilities, and retailers have vast databases of customer records, transaction histories, product catalogs, and other structured information. Extracting tables from PDF documents provides a fast way to populate these databases.

The extracted spreadsheet-style data can be easily imported and merged into databases with minimal cleaning and transformations required. This database population accelerates analytics and operations.

Automating Workflows

In many document-driven business processes, key information lives in tables within PDF files. Extracting the tables allows automating workflows by feeding the extracted data into other systems.

For instance, an insurance company could extract policy details tables from application forms to auto-fill their policy admin system. Or a bank could extract transaction descriptions from account statements to auto-categorize transactions.

By extracting tables from PDFs, businesses can feed data seamlessly between documents and systems - saving vast amounts of manual work.

Conclusion

Extracting tables from PDFs has many uses across different industries and applications. As we've seen, there are some key challenges with PDF table extraction - PDFs lack semantic structure and tables can have complex layouts. However, by using advanced techniques like deep learning and natural language processing, it's possible to accurately detect table locations, identify table structure, and extract cell data.

In summary, the key steps for extracting tables from PDFs are:

  • Detecting table locations and boundaries in the PDF
  • Identifying rows and columns to determine table structure
  • Extracting text and data from table cells
  • Outputting the extracted data as a structured table

While existing solutions can achieve high accuracy on simple tables, there are still challenges with handling more complex tables, especially when they lack visual cues like borders. Performance on scanned documents with skewed tables is also lacking.

In the future, we can expect continued improvements to table extraction algorithms. More advanced deep learning and computer vision techniques will help better handle complex table layouts and scanned documents. Natural language processing will also play a bigger role in understanding table context and meaning.

Overall, extracting key information from PDF tables in an automated way has many benefits for data analysis and information extraction. As the technology continues to evolve, PDF table extraction will become even more robust and accurate across diverse document types. The ability to quickly convert PDF tables into accessible, editable, and analyzable data unlocks new possibilities for how businesses, researchers, and individuals can work with data stored in PDFs.
