Extracting text from PDFs is a common and repeatable task for many applications.
In business, a repeatable task is one where the quicker you can do it, the better. Automation and AI are the future of this kind of work.
Data entry clerks face the prospect of losing their jobs.
If you are a data entry clerk, or you know someone who is, now is the time to prepare for the future of this job.
Many jobs are in decline now because of automation and AI. That’s because time is the most valuable resource, so every company is looking for quick and efficient ways to get the job done.
You might say that humans can extract text from PDFs more accurately than machines. Let’s suppose so. Now say you have a big PDF file with tricky parts that the AI can’t recognize. Even then you come out ahead: correcting those few parts by hand takes less time than extracting the whole document manually.
So what can you do to solve this issue as a data entry clerk?
Start with the basics. Learn a programming language to automate your work. Learn AI if you’re interested. Or learn to use a tool that will make your job easier.
One of the tools that you can use to extract text from PDF files is Amazon Textract.
In this tutorial, I’ll show you how to use Amazon Textract to get text blocks from a PDF file using Python. The goal is to extract text from PDFs quickly and with minimal code.
What is Amazon Textract?
Amazon Textract is not designed only for extracting text from PDFs. It is more general than that.
Amazon Textract is a service that automatically detects and extracts data from scanned documents. It goes beyond simple optical character recognition (OCR) to also identify the contents of fields in forms and information stored in tables. ~ AWS Docs
But in this tutorial, we’ll focus on extracting text from PDFs.
How Textract works
Textract is a machine learning service in the cloud. It combines natural language processing (NLP) and computer vision to extract text from documents. It can also extract tables and forms from scanned documents, as well as text from images.
How Detecting Text Works
Let’s see how Textract can detect text from a document. First, let’s learn what a Block means. A Block represents items that are recognized in a document within a group of pixels close to each other.
Amazon Textract provides synchronous and asynchronous operations that return only the text detected in a document. For both sets of operations, different types of Block objects are returned:
- The lines and words of the detected text.
- The relationships between the lines and words of detected text.
- The page that the detected text appears on.
- The location of lines and words of text on the document page.
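To make the list above concrete, here is a minimal sketch of how a few Block objects relate to each other. The sample dictionary is hand-written for illustration (the IDs, text, and coordinates are made up), and words_of_line is a hypothetical helper. The field names BlockType, Id, Text, Page, Geometry, and Relationships follow the Textract response format.

```python
# Hand-written sample imitating a Textract response (all values are made up).
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE", "Id": "p1", "Page": 1,
         "Relationships": [{"Type": "CHILD", "Ids": ["l1"]}]},
        {"BlockType": "LINE", "Id": "l1", "Page": 1, "Text": "Hello world",
         "Geometry": {"BoundingBox": {"Left": 0.1, "Top": 0.2,
                                      "Width": 0.3, "Height": 0.05}},
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"BlockType": "WORD", "Id": "w1", "Page": 1, "Text": "Hello"},
        {"BlockType": "WORD", "Id": "w2", "Page": 1, "Text": "world"},
    ]
}


def words_of_line(line_block, blocks_by_id):
    """Return the text of the WORD blocks that a LINE block points to."""
    words = []
    for relationship in line_block.get("Relationships", []):
        if relationship["Type"] == "CHILD":
            for child_id in relationship["Ids"]:
                child = blocks_by_id[child_id]
                if child["BlockType"] == "WORD":
                    words.append(child["Text"])
    return words


# Index blocks by Id so relationships can be followed quickly.
blocks_by_id = {block["Id"]: block for block in sample_response["Blocks"]}
line = blocks_by_id["l1"]
line_words = words_of_line(line, blocks_by_id)
line_box = line["Geometry"]["BoundingBox"]
print(line["Text"], "->", line_words, "on page", line["Page"], "at", line_box)
```

The LINE block holds the full text, its CHILD relationship lists the WORD blocks it is made of, Page gives the page number, and Geometry gives the location on that page.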
Amazon Textract Python Code
Let’s start with a function that will start the Textract detection process:
import boto3
import time


def start_job(client, s3_bucket_name, object_name):
    """Starts a text detection job.

    :param client: The Textract boto3 client.
    :param s3_bucket_name: The name of the S3 bucket.
    :param object_name: The name of the object in the S3 bucket.
    :return: The job ID.
    """
    # Point Textract at the document stored in S3
    response = client.start_document_text_detection(
        DocumentLocation={
            'S3Object': {
                'Bucket': s3_bucket_name,
                'Name': object_name
            }})
    return response["JobId"]
As you can see, the function starts the text detection process by calling the start_document_text_detection method on the Textract client. The DocumentLocation parameter is a dictionary that contains the S3 bucket and the object name of the document.
To follow along with this tutorial, you only need an AWS account and two permissions on your IAM user: one for the S3 bucket (AmazonS3FullAccess) and one for the Textract service (AmazonTextractFullAccess).
Finally, the start_job() function returns the job ID, which is used in the next function.
def is_job_complete(client, job_id):
    """Checks if a job is complete.

    :param client: The Textract boto3 client.
    :param job_id: The job ID.
    :return: True if the job is complete, otherwise False.
    """
    # Poll every 5 seconds until the job leaves the IN_PROGRESS state
    while True:
        time.sleep(5)
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        print("Job status: {}".format(status))
        if status != "IN_PROGRESS":
            break
    return status == "SUCCEEDED"
The is_job_complete() function returns a boolean value indicating whether the job completed successfully. It calls the get_document_text_detection() method with the job ID returned from the start_job() function.
While the status of that job is IN_PROGRESS, the function calls get_document_text_detection() again every five seconds. The loop ends once the status changes: SUCCEEDED means the detection is done, and any other status (such as FAILED) makes the function return False. What is left is to get the text blocks from the response.
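The polling described above can also be written with an upper bound on the waiting time, so that a stuck job does not block forever. This is a minimal sketch, not the code used in this tutorial: wait_for_job, poll_seconds, and timeout_seconds are hypothetical names, and FakeClient is a stand-in so the example runs without AWS credentials. Only get_document_text_detection and the JobStatus field come from the Textract API.

```python
import time


def wait_for_job(client, job_id, poll_seconds=5, timeout_seconds=300):
    """Poll until the job leaves IN_PROGRESS or the timeout expires.

    Returns the final status string, or "TIMED_OUT" (our own marker,
    not a Textract status) if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        response = client.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return status
        if time.monotonic() >= deadline:
            return "TIMED_OUT"
        time.sleep(poll_seconds)


class FakeClient:
    """Stand-in for the boto3 client: IN_PROGRESS twice, then SUCCEEDED."""

    def __init__(self):
        self.calls = 0

    def get_document_text_detection(self, JobId):
        self.calls += 1
        status = "IN_PROGRESS" if self.calls < 3 else "SUCCEEDED"
        return {"JobStatus": status}


fake = FakeClient()
final_status = wait_for_job(fake, "example-job-id", poll_seconds=0)
print(final_status)
```

Returning the status string instead of a boolean lets the caller tell a failed job apart from one that simply took too long.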
We get the blocks from the following function:
def get_job_results(client, job_id):
    """Gets the results of a detection job.

    :param client: The Textract boto3 client.
    :param job_id: The job ID.
    :return: The job results.
    """
    time.sleep(5)
    response = client.get_document_text_detection(JobId=job_id)
    return response
Now, the get_document_text_detection() method returns the response of the text detection job. The response contains the blocks of text, block types, and other information.
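One detail worth knowing: for longer documents, Textract may split the results across several responses and include a NextToken field when more results are available. The sketch below collects all blocks by following that token. get_all_blocks is a hypothetical helper name, and FakePagedClient is a stand-in with made-up data, used only so the example runs without AWS credentials.

```python
def get_all_blocks(client, job_id):
    """Collect Blocks from every page of results by following NextToken."""
    blocks = []
    next_token = None
    while True:
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = client.get_document_text_detection(**kwargs)
        blocks.extend(response.get("Blocks", []))
        next_token = response.get("NextToken")
        if not next_token:
            return blocks


class FakePagedClient:
    """Stand-in client that serves two batches of made-up results."""

    def get_document_text_detection(self, JobId, NextToken=None):
        if NextToken is None:
            return {"JobStatus": "SUCCEEDED",
                    "Blocks": [{"BlockType": "LINE", "Text": "first batch"}],
                    "NextToken": "token-1"}
        return {"JobStatus": "SUCCEEDED",
                "Blocks": [{"BlockType": "LINE", "Text": "second batch"}]}


all_blocks = get_all_blocks(FakePagedClient(), "example-job-id")
print([block["Text"] for block in all_blocks])
```

With a real client, the same function would take the Textract boto3 client and the job ID from start_job().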
Let’s do a quick test to see how these three functions work together:
if __name__ == "__main__":
    # Document
    s3_bucket_name = "ki-textract-demo-docs"
    document_name = "Amazon-Textract-Pdf.pdf"
    region = "us-east-1"

    client = boto3.client('textract', region_name=region)

    job_id = start_job(client, s3_bucket_name, document_name)
    print("Started job with id: {}".format(job_id))

    if is_job_complete(client, job_id):
        response = get_job_results(client, job_id)

        # Print detected text (only when the job succeeded)
        for item in response["Blocks"]:
            if item["BlockType"] == "LINE":
                print('\033[94m' + item["Text"] + '\033[0m')
We defined the S3 bucket name (bucket names are globally unique). You can replace it with your own bucket name, but for now, let’s keep it as is and use the same bucket (ki-textract-demo-docs) that AWS uses for the demo.
The document name is the name of the object in the S3 bucket (in our case, Amazon-Textract-Pdf.pdf).
We also defined the region name which is passed to the boto3 client to create the client object.
We then start the text detection job and wait until it is complete. If it succeeds, we get the job results, which are the JSON response of the text detection job.
Finally, we loop over the blocks, keep only those whose block type is LINE, and print their text (in blue).
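If you need the text per page rather than as one long stream, the same loop can group lines by the Page field of each block. This is a minimal sketch: lines_by_page is a hypothetical helper, the sample blocks are made up, and a block without a Page field is assumed to belong to page 1.

```python
from collections import defaultdict


def lines_by_page(blocks):
    """Group the text of LINE blocks by their page number."""
    pages = defaultdict(list)
    for block in blocks:
        if block["BlockType"] == "LINE":
            # Assumption: fall back to page 1 when Page is missing
            pages[block.get("Page", 1)].append(block["Text"])
    return dict(pages)


# Made-up blocks standing in for response["Blocks"]
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Title"},
    {"BlockType": "WORD", "Page": 1, "Text": "Title"},
    {"BlockType": "LINE", "Page": 2, "Text": "Second page text"},
]

grouped = lines_by_page(sample_blocks)
for page_number in sorted(grouped):
    print("Page {}: {}".format(page_number, " | ".join(grouped[page_number])))
```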
Final Thoughts
We’ve seen how to quickly get started with Textract to detect text in a PDF document. The approach is both quick and simple.