Extracting Text from PDF Files
You are probably already familiar with PDFs, one of the most widely used digital document formats. PDF stands for Portable Document Format, and its files use the .pdf extension. The format is used to present and exchange documents reliably, independent of software, hardware, or operating system.
The Python package PyPDF2 can do what we want here (text extraction), and it can do more than we need: it can also be used to generate, decrypt, and merge PDF files.
To install this package, type the following command in the terminal:
pip install PyPDF2
# importing required modules
import PyPDF2
# opening the pdf file in binary mode; the with block closes it afterwards
with open("Manager Human Resources Job Description.pdf", "rb") as pdf_file:
    # creating a pdf reader object
    # (PdfFileReader, numPages, getPage and extractText were removed in PyPDF2 3.0)
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    # printing number of pages in pdf file
    print(len(pdf_reader.pages))
    # extracting text from the first page (pages are indexed from 0)
    print(pdf_reader.pages[0].extract_text())
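The snippet above reads only the first page. To pull the text out of every page, the same calls can be wrapped in a loop. The following is a minimal sketch, not a definitive implementation: the function names collect_page_texts and extract_pdf_text are made up for illustration, and it assumes a PyPDF2 version that provides PdfReader, pages and extract_text (2.0 or later).

```python
from typing import Iterable, List


def collect_page_texts(pages: Iterable) -> List[str]:
    """Return the extracted text of each page, in order.

    `pages` is anything that yields objects with an extract_text() method,
    such as the `pages` attribute of a PyPDF2.PdfReader.
    """
    # Pages without a text layer (for example scanned images) usually yield
    # an empty string; `or ""` also guards against a None result.
    return [page.extract_text() or "" for page in pages]


def extract_pdf_text(path: str) -> str:
    """Hypothetical helper: return the text of every page of the PDF at `path`."""
    # Imported here so that collect_page_texts stays usable without PyPDF2.
    import PyPDF2

    with open(path, "rb") as pdf_file:
        pdf_reader = PyPDF2.PdfReader(pdf_file)
        # Join the pages with a newline so page boundaries stay visible.
        return "\n".join(collect_page_texts(pdf_reader.pages))
```

Calling print(extract_pdf_text("Manager Human Resources Job Description.pdf")) would then print the text of the whole document. Keeping the loop in its own function means it can be tried out on any objects that offer an extract_text() method, without needing a real PDF file.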