October 22, 2024
pdf

How can I extract clean Japanese text from a folder of PDF files in Python?


This is my code:

import os

import PyPDF2

# set the directory where the PDF files are located
pdf_directory = '/Users/humnerohit/Desktop/test_pdf_files'

# loop through each file in the directory
for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        # create a PDF file object
        pdf_file = open(os.path.join(pdf_directory, filename), 'rb')
        
        # create a PDF reader object
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        
        # loop through each page in the PDF file
        text=""
        for page_num in range(pdf_reader.numPages):
            # extract the text from the page
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
        
        # close the PDF file object
        pdf_file.close()
        
        # create a text file object
        text_file = open(os.path.join(pdf_directory, filename[:-4] + '.txt'), 'w')
        
        # write the extracted text to the text file
        text_file.write(text)
        
        # close the text file object
        text_file.close()

Output:

gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003
಩খ຀ࢢදொͷ࠶։ൃϏϧɾαϯϓϥβớຊࣾ಩খ຀Ờͷখࢁ౻ࢢ࿠ձ௕ͱ੪౻ਗ਼ࣾ௕͕Ұ೔ỏब೚͍͋ͭ͞ͷͨΊ಩খ຀ຽใࣾΛ๚Εỏ த୔ࣾ࣍ܒ௕ͱ࠙ஊỐủத৺֗ͷ֩ళฮͱͯ͠ỏ
ޙࠓ΋ؤுΓ·͢Ứͱ๊ෛΛड़΂ͨỐ
Ӻલ࠶։ൃϏϧͱͯ͠Ұࣣࣣ۝೥ʹΦồϓϯͨ͠αϯϓϥβ͸ỏࠓ೥૑ۀೋेೋ೥໨Ố֩ςφϯτͷμΠΤồͱͷे೥ؒͷܖ໿Λऴ
͑ỏࡢ೥શؗϦχỿồΞϧͨ͠ỐҰ෦ỏςφϯτ༠க͕஗Ε͕ͨỏ͜ͷ΄ͲΊͲ͕͍ͭͨͨΊỏখࢁલࣾ௕͸ࡢ೥ेೋ݄ͷגओ૯ձͰୀ೚Λਃ͠ೖΕỐࡾ݄ࡾे೔ͷऔక໾ձͰঝೝ͞ΕͨỐ৽ࣾ௕ʹ͸ࡾ੕ͷ੪౻ਗ਼ࠪ؂໾Λબ೚Ốখࢯࢁ͸ձ௕ʹब೚ͨ͠Ố
খࢁձ௕͸ủࢥ͍ग़Λ࿩ͤ͹͖Γ͕ͳ͍Ứͱ໨ΛࡉΊủαϯϓϥβ͸మೆͷ֩Ͱͳ͚Ε͹ͳΒͳ͍Ứͱޙࠓͷళͮ͘ΓʹҙཉỐ੪౻ࣾ௕΋ủখࢁձ௕ͷԿ෼ͷҰ΋Ͱ͖ͳ͍ͱࢥ͏͕ỏैۀһʹڠྗͯ͠΋Βỳͯؤுỳ͍͖͍ͯͨỨͱܾҙΛड़΂ͨỐ଍ݩ͔Β஍ٿͷ۱
ʑ
·Ͱỏڥ؀ѱԽ͕ਂࠁͷ౓߹͍Λ૿͍ͯ͠ΔỐμΠΦΩγϯỏ ԹஆԽỏࢎੑӍỏΦκϯ૚ഁյỏੜ෺छݮগỏީؾมಈ
/gai007ỐڥࠃΛ௒͑ỏͦ͢໺Λ޿͛ͯ࣍ʑʹಥ͖෇͚ΒΕΔҟมʹỏ஍Ҭ͸Ͳ͏ཱͪ޲͔͏͔Ố͔ͭͯ͸ओʹ֐ެۀ࢈ͷࢹ؂ʹ஫ҙΛ෷͏͚ͩͩ

The text is extracted from each PDF and saved in the same folder under the same name with a .txt extension.

However, the extracted data does not come out as readable Japanese text.

I expect to get clean Japanese text in the text file.
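
The `gai003`/`gai004` glyph names and the wrong-script characters in the output suggest that the PDF fonts use CID encodings which the old `extractText()` call does not map to Unicode. This is an assumption, since the PDFs themselves are not shown. A minimal sketch of one thing to try is below: swap in a different extractor, here `extract_text` from pdfminer.six (assumed to be installed, and it may still fail if the PDF carries no Unicode mapping at all), and write the output explicitly as UTF-8. The names `clean_text` and `convert_folder` are hypothetical helpers, and the NFKC normalization step is an optional cleanup choice that is not part of the original code.

```python
import os
import unicodedata


def clean_text(text):
    """Normalize width variants (NFKC) and drop non-printing control characters."""
    text = unicodedata.normalize("NFKC", text)
    # Keep newlines and tabs; drop other control characters such as form feeds.
    kept = (ch for ch in text if ch in "\n\t" or unicodedata.category(ch) != "Cc")
    return "".join(kept)


def convert_folder(pdf_directory, extract=None):
    """Write a UTF-8 .txt file next to every .pdf in pdf_directory.

    `extract` is any callable mapping a PDF path to a str. When omitted,
    pdfminer.six's extract_text is used (assumption: pdfminer.six is installed).
    """
    if extract is None:
        from pdfminer.high_level import extract_text as extract

    written = []
    for filename in sorted(os.listdir(pdf_directory)):
        if not filename.lower().endswith(".pdf"):
            continue
        pdf_path = os.path.join(pdf_directory, filename)
        txt_path = os.path.splitext(pdf_path)[0] + ".txt"
        text = clean_text(extract(pdf_path))
        # An explicit encoding avoids depending on the platform default.
        with open(txt_path, "w", encoding="utf-8") as text_file:
            text_file.write(text)
        written.append(txt_path)
    return written
```

Calling `convert_folder('/Users/humnerohit/Desktop/test_pdf_files')` would then process the same folder as the original script. Passing the extractor in as a parameter makes it easy to try another library without touching the loop. If every extractor returns garbage, the PDF probably has no usable text layer and OCR would be the next thing to look at.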


