How can I extract clean Japanese text from a folder of PDF files in Python?

This is my code:

import os

import PyPDF2

# set the directory where the PDF files are located
pdf_directory = '/Users/humnerohit/Desktop/test_pdf_files'

# loop through each file in the directory
for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        # create a PDF file object
        pdf_file = open(os.path.join(pdf_directory, filename), 'rb')
        
        # create a PDF reader object
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        
        # loop through each page in the PDF file
        text=""
        for page_num in range(pdf_reader.numPages):
            # extract the text from the page
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
        
        # close the PDF file object
        pdf_file.close()
        
        # create a text file object
        text_file = open(os.path.join(pdf_directory, filename[:-4] + '.txt'), 'w')
        
        # write the extracted text to the text file
        text_file.write(text)
        
        # close the text file object
        text_file.close()

Output:

gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003
಩খ຀ࢢදொͷ࠶։ൃϏϧɾαϯϓϥβớຊࣾ಩খ຀Ờͷখࢁ౻ࢢ࿠ձ௕ͱ੪౻ਗ਼ࣾ௕͕Ұ೔ỏब೚͍͋ͭ͞ͷͨΊ಩খ຀ຽใࣾΛ๚Εỏ த୔ࣾ࣍ܒ௕ͱ࠙ஊỐủத৺֗ͷ֩ళฮͱͯ͠ỏ
ޙࠓ΋ؤுΓ·͢Ứͱ๊ෛΛड़΂ͨỐ
Ӻલ࠶։ൃϏϧͱͯ͠Ұࣣࣣ۝೥ʹΦồϓϯͨ͠αϯϓϥβ͸ỏࠓ೥૑ۀೋेೋ೥໨Ố֩ςφϯτͷμΠΤồͱͷे೥ؒͷܖ໿Λऴ
͑ỏࡢ೥શؗϦχỿồΞϧͨ͠ỐҰ෦ỏςφϯτ༠க͕஗Ε͕ͨỏ͜ͷ΄ͲΊͲ͕͍ͭͨͨΊỏখࢁલࣾ௕͸ࡢ೥ेೋ݄ͷגओ૯ձͰୀ೚Λਃ͠ೖΕỐࡾ݄ࡾे೔ͷऔక໾ձͰঝೝ͞ΕͨỐ৽ࣾ௕ʹ͸ࡾ੕ͷ੪౻ਗ਼ࠪ؂໾Λબ೚Ốখࢯࢁ͸ձ௕ʹब೚ͨ͠Ố
খࢁձ௕͸ủࢥ͍ग़Λ࿩ͤ͹͖Γ͕ͳ͍Ứͱ໨ΛࡉΊủαϯϓϥβ͸మೆͷ֩Ͱͳ͚Ε͹ͳΒͳ͍Ứͱޙࠓͷళͮ͘ΓʹҙཉỐ੪౻ࣾ௕΋ủখࢁձ௕ͷԿ෼ͷҰ΋Ͱ͖ͳ͍ͱࢥ͏͕ỏैۀһʹڠྗͯ͠΋Βỳͯؤுỳ͍͖͍ͯͨỨͱܾҙΛड़΂ͨỐ଍ݩ͔Β஍ٿͷ۱
ʑ
·Ͱỏڥ؀ѱԽ͕ਂࠁͷ౓߹͍Λ૿͍ͯ͠ΔỐμΠΦΩγϯỏ ԹஆԽỏࢎੑӍỏΦκϯ૚ഁյỏੜ෺छݮগỏީؾมಈ
/gai007ỐڥࠃΛ௒͑ỏͦ͢໺Λ޿͛ͯ࣍ʑʹಥ͖෇͚ΒΕΔҟมʹỏ஍Ҭ͸Ͳ͏ཱͪ޲͔͏͔Ố͔ͭͯ͸ओʹ֐ެۀ࢈ͷࢹ؂ʹ஫ҙΛ෷͏͚ͩͩ

The text is extracted from each PDF and saved in the same folder under the same name with a .txt extension.

However, the extracted data does not come out as readable Japanese text.

I expect to get clean Japanese text in the text file.
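
For reference, here is a minimal sketch of the same folder loop using a different extractor. It swaps PyPDF2 for pdfminer.six, on the assumption that it is installed and that it decodes the fonts in these PDFs better. That is not guaranteed: if the PDFs lack a usable character mapping, or are scanned images, the output may still be garbled and OCR may be the only option. The names clean_text, extract_pdf_text and convert_folder are hypothetical helpers, not part of any library.

```python
from pathlib import Path


def clean_text(raw: str) -> str:
    """Drop page-break form feeds and trailing whitespace from extracted text."""
    # pdfminer.six ends each page with a form feed character.
    raw = raw.replace("\x0c", "")
    lines = [line.rstrip() for line in raw.splitlines()]
    return "\n".join(lines).strip() + "\n"


def extract_pdf_text(pdf_path: Path) -> str:
    """Extract text with pdfminer.six (assumption: pip install pdfminer.six)."""
    # Imported lazily so the rest of the module works without the package.
    from pdfminer.high_level import extract_text
    return extract_text(str(pdf_path))


def convert_folder(pdf_directory, extractor=extract_pdf_text):
    """Write a UTF-8 .txt file next to every .pdf file in pdf_directory."""
    written = []
    for pdf_path in sorted(Path(pdf_directory).glob("*.pdf")):
        txt_path = pdf_path.with_suffix(".txt")
        # Explicit UTF-8 so Japanese text is stored the same way on every OS.
        txt_path.write_text(clean_text(extractor(pdf_path)), encoding="utf-8")
        written.append(txt_path)
    return written
```

Calling convert_folder(pdf_directory) with the directory from the code above would write one .txt file per PDF. The extractor argument is there only so that another library can be tried without changing the loop.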
