Convert PDFs to Excel with AI and Computer Vision
Why is reading and extracting characters from PDFs so damn difficult? In reality, it’s not. We can write a super simple Python Flask app to accept a PDF via an HTTP POST request and output a text file:
import io

import PyPDF2
from flask import Flask, request, send_file

app = Flask(__name__)

@app.route('/upload-pdf', methods=['POST'])
def upload_pdf():
    # Basic request validation: the form field must be named 'pdf'.
    if 'pdf' not in request.files:
        return 'No file part', 400
    file = request.files['pdf']
    if file.filename == '':
        return 'No selected file', 400
    if file and allowed_file(file.filename):
        text = extract_text_from_pdf(file)
        output_filename = 'output.txt'
        with open(output_filename, 'w') as text_file:
            text_file.write(text)
        return send_file(output_filename, as_attachment=True)
    return 'Invalid file type', 400

def allowed_file(filename):
    return '.' in filename and filename.rsplit('.', 1)[1].lower() in ['pdf']

def extract_text_from_pdf(file):
    # Read the upload into memory and concatenate the text of every page.
    pdf_reader = PyPDF2.PdfReader(io.BytesIO(file.read()))
    text = ''
    for page in pdf_reader.pages:
        text += page.extract_text()
    return text

if __name__ == '__main__':
    app.run(debug=True)
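To sanity-check the endpoint, we can post a PDF to it with the requests library. This is a minimal sketch: the file name and the default Flask port are assumptions.

import requests

# "sample.pdf" is a placeholder; the field name must match request.files['pdf'].
with open("sample.pdf", "rb") as f:
    response = requests.post(
        "http://127.0.0.1:5000/upload-pdf",
        files={"pdf": f},
    )

with open("output.txt", "wb") as out:
    out.write(response.content)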
There we go! Job done: we have converted the text from the PDF into a text file that we can hook up to a frontend and download.
Well, it’s not that simple. Most people don’t want the entire block of text from a PDF. They want the tabular data, or the semi-structured data that may be only loosely structured, or not structured at all. This means we need to tweak our extraction algorithm a bit, or maybe a lot. A seemingly simple task becomes complex when dealing with large volumes of data or non-standardized PDF formats. Sure, we could use a library like Tabula-py to locate and extract tabular data from PDFs, but it’s not accurate enough, especially for the really wonky “tabular” data. Plus, it requires Java to be installed on the machine you deploy your code to: gross. Maybe we can use other existing tools at our disposal to get the job done?
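For reference, the Tabula-py route is only a few lines. This is a minimal sketch, assuming a local Java runtime is available and "report.pdf" is a placeholder file name:

import tabula

# Pull every table tabula can find into pandas DataFrames and dump each to Excel.
tables = tabula.read_pdf("report.pdf", pages="all", multiple_tables=True)
for i, df in enumerate(tables):
    df.to_excel(f"table_{i}.xlsx", index=False)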
Understanding the Challenge
PDFs, designed for consistent viewing across platforms, are not inherently structured for data extraction. They often contain a mix of text, images, and tables, making manual data entry a time-consuming and error-prone process. AI and Computer Vision offer a compelling solution by automating this process, but how exactly do they work?
AI and Computer Vision
- Optical Character Recognition (OCR): The first step in converting PDF to Excel involves OCR technology. OCR algorithms, powered by AI, scan the PDF file, recognizing and extracting text. Modern OCR is advanced enough to handle various fonts and even handwritten text with remarkable accuracy.
- Data Interpretation and Structuring: Once the text is extracted, AI algorithms interpret the context. They identify patterns and structures, like tables or lists, differentiating between mere text and data meant for tabulation.
- Computer Vision's Edge: Computer Vision comes into play particularly with non-text elements. For instance, if a PDF contains graphical representations of data, Computer Vision algorithms can interpret these images, turning bar graphs or pie charts into tabular data suitable for Excel.
The following example uses Pytesseract, pypdfium2, and LangChain libraries to process the PDFs and feed the extracted text to a large language model (LLM).
First, we need a function to convert the PDFs to images:
import pypdfium2 as pdfium
from io import BytesIO

def convert_pdf_to_images(file_path, scale=300/72):
    # Render each page of the PDF to a PIL image at roughly 300 DPI.
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = [i for i in range(len(pdf_file))]
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )
    final_images = []
    for i, image in zip(page_indices, renderer):
        # Store each page as JPEG bytes, keyed by its page index.
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        final_images.append({i: image_byte_array.getvalue()})
    return final_images
Next, we need the PyTesseract library to extract the text from the images:
from io import BytesIO

from PIL import Image
from pytesseract import image_to_string

def extract_text_with_pytesseract(list_dict_final_images):
    # Pull the raw JPEG bytes back out of each {page_index: bytes} dict.
    image_list = [list(data.values())[0] for data in list_dict_final_images]
    image_content = []
    for image_bytes in image_list:
        # Run Tesseract OCR on each page image.
        image = Image.open(BytesIO(image_bytes))
        raw_text = str(image_to_string(image))
        image_content.append(raw_text)
    return "\n".join(image_content)
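Chaining the two helpers gives us the raw text for an entire document (the file name here is a placeholder):

# Render the PDF pages to images, then OCR each page; "invoice.pdf" is a placeholder.
images = convert_pdf_to_images("invoice.pdf")
text = extract_text_with_pytesseract(images)
print(text[:500])  # peek at the first few hundred characters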
Another method we can leverage is LangChain, a great library for loading documents and feeding the extracted text to an LLM of our choice to handle the language processing:
from langchain.document_loaders.image import UnstructuredImageLoader

def convert_pdf_to_images(file_path, scale=300/72):
    # Same page-rendering helper as above, shown again for completeness.
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = [i for i in range(len(pdf_file))]
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )
    final_images = []
    for i, image in zip(page_indices, renderer):
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        final_images.append({i: image_byte_array.getvalue()})
    return final_images
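To actually use the loader we imported, one option is to write each rendered page to a temporary image file and let UnstructuredImageLoader turn it into LangChain documents. This is a minimal sketch, assuming the unstructured package is installed; exact import paths can vary between LangChain versions.

import os
import tempfile

def extract_text_with_langchain(list_dict_final_images):
    # Each entry is {page_index: jpeg_bytes}, as produced by convert_pdf_to_images.
    page_texts = []
    for page in list_dict_final_images:
        image_bytes = list(page.values())[0]
        with tempfile.NamedTemporaryFile(suffix=".jpeg", delete=False) as tmp:
            tmp.write(image_bytes)
            tmp_path = tmp.name
        try:
            loader = UnstructuredImageLoader(tmp_path)
            docs = loader.load()
            page_texts.append("\n".join(doc.page_content for doc in docs))
        finally:
            os.remove(tmp_path)
    return "\n".join(page_texts)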
These are just a couple of examples of how we can extract text from an unstructured document with existing tools that do some of the heavy lifting.
From here, things get interesting: there are many different options for parsing the text into meaningful data.
Methods for Extracting Meaningful Data from Text
Once you have extracted text from a PDF using OCR or similar methods, the next challenge is to extract meaningful data from this unstructured text. This involves parsing, processing, and organizing the text into a structured format. Below are some methods and techniques for achieving this.
1. Regular Expressions (Regex)
Regular expressions are tools for pattern matching and text manipulation, useful for identifying specific patterns in text like dates, email addresses, phone numbers, etc.
- Example Use Case: Extracting dates. Use a regex pattern like \d{4}-\d{2}-\d{2} to find dates in YYYY-MM-DD format.
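For instance, pulling every ISO-formatted date out of the OCR output could look like this:

import re

# Sample OCR output; the pattern matches YYYY-MM-DD dates.
ocr_text = "Invoice issued 2023-11-05, payment due 2023-12-05."
dates = re.findall(r"\d{4}-\d{2}-\d{2}", ocr_text)
print(dates)  # ['2023-11-05', '2023-12-05']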
2. Natural Language Processing (NLP)
NLP techniques can understand and extract meaning from human language data. Libraries like LangChain, NLTK, or spaCy are helpful for tokenization, part-of-speech tagging, named entity recognition, and more.
- Example Use Case: Extracting names or locations. Named Entity Recognition (NER) can identify proper nouns and categorize them.
A simple example using NLTK:
import nltk
from nltk.tokenize import LineTokenizer, WordPunctTokenizer

# Sample block of text with an embedded table
text_block = """
This is some introductory text.
Header1, Header2, Header3
Row1Cell1, Row1Cell2, Row1Cell3
Row2Cell1, Row2Cell2, Row2Cell3
This is some concluding text.
"""

# Tokenize the text by lines
line_tokenizer = LineTokenizer()
lines = line_tokenizer.tokenize(text_block)

# Assuming the table starts after a line with all headers
# and ends before a line without any commas
table_lines = []
inside_table = False
for line in lines:
    if ',' in line:
        inside_table = True
        table_lines.append(line)
    elif inside_table:
        break

# Further tokenize each line of the table into words/punctuations
word_tokenizer = WordPunctTokenizer()
table_data = [word_tokenizer.tokenize(line) for line in table_lines]

# Writing the extracted table data to a text file
with open('extracted_table_nltk.txt', 'w') as file:
    for row in table_data:
        file.write(", ".join(row) + "\n")

print("Table extracted and written to extracted_table_nltk.txt")
3. Parsing Structured Text Formats
If your text has semi-structured formats like CSV, JSON, or XML, use appropriate parsers to convert these sections into structured data.
- Example Use Case: Extracting data from CSV formatted strings.
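For example, if a chunk of the extracted text already looks like CSV, Python's built-in csv module handles the parsing (a small sketch using an in-memory string):

import csv
import io

# A CSV-shaped fragment pulled out of the extracted text.
csv_chunk = "Header1,Header2,Header3\nRow1Cell1,Row1Cell2,Row1Cell3"
rows = list(csv.reader(io.StringIO(csv_chunk)))
print(rows[0])   # ['Header1', 'Header2', 'Header3']
print(rows[1:])  # the data rows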
4. Custom Parsing Algorithms
For texts with a predictable but non-standard structure, write custom parsing algorithms.
- Example Use Case: Extracting tabular data where rows and columns have specific delimiters.
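As a sketch of the idea, a parser for rows whose columns are separated by a known delimiter (a pipe here, purely as an assumption) can be very small:

def parse_delimited_rows(text, delimiter="|"):
    # Keep only lines containing the delimiter, then split and strip each cell.
    rows = []
    for line in text.splitlines():
        if delimiter in line:
            rows.append([cell.strip() for cell in line.split(delimiter)])
    return rows

sample = "Name | Qty | Price\nWidget | 2 | 9.99"
print(parse_delimited_rows(sample))
# [['Name', 'Qty', 'Price'], ['Widget', '2', '9.99']]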
5. Machine Learning Models
For complex tasks where the structure is less predictable, machine learning models can identify and extract information.
- Example Use Case: Classifying text sections (e.g., identifying invoice details).
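A minimal sketch of that idea with scikit-learn: train a small classifier to label text sections, assuming you can hand-label a few examples per category (the snippets below are made up):

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data: a handful of hand-labelled text sections.
train_texts = [
    "Invoice number INV-1001 total due 450.00",
    "Thank you for your business, see you soon",
    "Invoice number INV-1002 total due 120.50",
    "Best regards, the accounting team",
]
train_labels = ["invoice_details", "other", "invoice_details", "other"]

model = make_pipeline(TfidfVectorizer(), LogisticRegression())
model.fit(train_texts, train_labels)
print(model.predict(["Invoice number INV-2042 total due 75.00"]))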
6. Text Preprocessing Techniques
Preprocessing like removing stop words, stemming, and lemmatization helps in cleaning and standardizing text.
- Example Use Case: Standardizing text for effective pattern matching.
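For example, with NLTK (the nltk.download calls fetch the required corpora on first run):

import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time corpus downloads; safe to re-run.
nltk.download("punkt")  # newer NLTK versions may also need "punkt_tab"
nltk.download("stopwords")
nltk.download("wordnet")

text = "The invoices were processed and the totals were updated"
tokens = word_tokenize(text.lower())
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
cleaned = [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]
print(cleaned)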
7. Using APIs for Specific Data Types
Specialized APIs can extract certain types of data from text, like financial or bibliographic data.
- Example Use Case: Extracting financial data using a dedicated API.
8. Rule-Based Extraction
Creating a set of rules for extraction is effective for domain-specific documents with consistent formats.
- Example Use Case: Extracting product information from catalog pages.
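A sketch of the idea: a small dictionary of regex "rules", one per field, applied to text with a consistent layout (the field names and patterns here are illustrative):

import re

# Illustrative rules for a consistently formatted product listing.
rules = {
    "sku": r"SKU:\s*([A-Z0-9-]+)",
    "price": r"Price:\s*\$?([\d.]+)",
    "name": r"Product:\s*(.+)",
}

page_text = "Product: Cordless Drill\nSKU: CD-4400\nPrice: $89.99"
record = {}
for field, pattern in rules.items():
    match = re.search(pattern, page_text)
    if match:
        record[field] = match.group(1).strip()
print(record)  # {'sku': 'CD-4400', 'price': '89.99', 'name': 'Cordless Drill'}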
The method you choose depends on the nature of your text data and the type of information you need to extract. Often, combining these methods yields the best results for complex documents.
Overcoming Format Variability
One of the biggest challenges in this conversion process is the variability of PDF formats. AI models are trained on robust datasets to recognize and adapt to a wide range of document layouts and styles. This training enables the AI to handle different types of PDFs, from simple text documents to complex reports with mixed content. Why re-invent the wheel? We should leverage the existing tools we have at our disposal to accelerate our productivity.
Ensuring Accuracy and Efficiency
Accuracy is critical in data conversion. AI algorithms continually learn and improve, reducing errors that are common in manual data entry. That said, accuracy can also be a bottleneck with AI, particularly because of hallucinations, so it is always critical to review the output of each iteration to ensure the extracted data is precise.
Endless Complexity
Converting PDFs to Excel is a complex task that blends various technologies and methodologies. From simple text scraping to OCR and AI, each plays a crucial role in transforming unstructured, often chaotic, PDF data into orderly, structured tables. The process is technically challenging, but it pays off in a significant way by increasing productivity and automation. Check out LedgerBox to start automating your workflow!