This article focuses on the Pytesseract, easyOCR, PyPDF2, and LangChain libraries. The experimentation data is a one-page PDF file and is freely available on my GitHub.
Both Pytesseract and easyOCR work with images, so the PDF files must be converted into images before performing the content extraction.
The conversion can be done using pypdfium2, which is a powerful library for PDF file processing, and its implementation is given below:
pip install pypdfium2
This function takes the path to a PDF file as input and returns a list with one entry per page, where each entry is a dictionary mapping the page index to that page's image as JPEG bytes.
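from io import BytesIO

import pypdfium2 as pdfium


def convert_pdf_to_images(file_path, scale=300/72):
    # Note: PdfDocument.render is version-dependent; newer pypdfium2
    # releases may require rendering each page individually instead.
    pdf_file = pdfium.PdfDocument(file_path)
    page_indices = [i for i in range(len(pdf_file))]

    # Render every page to a PIL image; scale=300/72 corresponds to 300 DPI.
    renderer = pdf_file.render(
        pdfium.PdfBitmap.to_pil,
        page_indices=page_indices,
        scale=scale,
    )

    final_images = []
    for i, image in zip(page_indices, renderer):
        # Encode the page as JPEG bytes held in memory.
        image_byte_array = BytesIO()
        image.save(image_byte_array, format='jpeg', optimize=True)
        image_byte_array = image_byte_array.getvalue()
        final_images.append(dict({i: image_byte_array}))

    return final_images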
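Two details of the function above are easy to miss: what the default scale of 300/72 means in pixels, and the shape of the returned list. The sketch below illustrates both with two hypothetical helpers, rendered_size and flatten_pages, which are not part of the original code or of pypdfium2. It relies only on the fact that PDF dimensions are expressed in points of 1/72 inch, and it uses stand-in bytes instead of a real PDF. The pixel size is approximate, since the library's own rounding may differ by a pixel.

```python
def rendered_size(width_pt, height_pt, scale=300 / 72):
    """Approximate pixel size of a page rendered at `scale`.

    PDF dimensions are in points (1/72 inch), so scale=300/72 means 300 DPI.
    Hypothetical helper for illustration only.
    """
    return round(width_pt * scale), round(height_pt * scale)


def flatten_pages(pages):
    """Merge [{0: b'...'}, {1: b'...'}] into one {page_index: jpeg_bytes} dict.

    Hypothetical helper; `pages` has the shape returned by convert_pdf_to_images.
    """
    merged = {}
    for page in pages:
        merged.update(page)
    return merged


# A US Letter page is 612 x 792 points.
print(rendered_size(612, 792))  # (2550, 3300)

# Stand-in bytes shaped like the output of convert_pdf_to_images.
fake_pages = [{0: b"page-0-jpeg"}, {1: b"page-1-jpeg"}]
print(sorted(flatten_pages(fake_pages)))  # [0, 1]
```

Flattening into a single dictionary is only a convenience for looking pages up by index; the list of one-key dictionaries returned by the function works just as well when the pages are processed in order.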