This is my code:
import os
import PyPDF2

# set the directory where the PDF files are located
pdf_directory = '/Users/humnerohit/Desktop/test_pdf_files'

# loop through each file in the directory
for filename in os.listdir(pdf_directory):
    if filename.endswith('.pdf'):
        # create a PDF file object
        pdf_file = open(os.path.join(pdf_directory, filename), 'rb')
        # create a PDF reader object
        pdf_reader = PyPDF2.PdfFileReader(pdf_file)
        # loop through each page in the PDF file
        text = ""
        for page_num in range(pdf_reader.numPages):
            # extract the text from the page
            page = pdf_reader.getPage(page_num)
            text += page.extractText()
        # close the PDF file object
        pdf_file.close()
        # create a text file object
        text_file = open(os.path.join(pdf_directory, filename[:-4] + '.txt'), 'w')
        # write the extracted text to the text file
        text_file.write(text)
        # close the text file object
        text_file.close()
Output:
gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003/gai003/gai004/gai003
খࢢදொͷ࠶։ൃϏϧɾαϯϓϥβớຊࣾখỜͷখࢁ౻ࢢձͱ੪౻ਗ਼͕ࣾҰỏब͍͋ͭ͞ͷͨΊখຽใࣾΛ๚Εỏ தࣾ࣍ܒͱ࠙ஊỐủத৺֗ͷ֩ళฮͱͯ͠ỏ
ޙࠓؤுΓ·͢Ứͱ๊ෛΛड़ͨỐ
Ӻલ࠶։ൃϏϧͱͯ͠ҰࣣࣣʹΦồϓϯͨ͠αϯϓϥβỏࠓۀೋेೋỐ֩ςφϯτͷμΠΤồͱͷेؒͷܖΛऴ
͑ỏࡢશؗϦχỿồΞϧͨ͠ỐҰ෦ỏςφϯτ༠க͕Ε͕ͨỏ͜ͷ΄ͲΊͲ͕͍ͭͨͨΊỏখࢁલࣾࡢेೋ݄ͷגओ૯ձͰୀΛਃ͠ೖΕỐࡾ݄ࡾेͷऔకձͰঝೝ͞ΕͨỐ৽ࣾʹࡾͷ੪౻ਗ਼ࠪΛબỐখࢯࢁձʹबͨ͠Ố
খࢁձủࢥ͍ग़Λ͖ͤΓ͕ͳ͍ỨͱΛࡉΊủαϯϓϥβమೆͷ֩Ͱͳ͚ΕͳΒͳ͍Ứͱޙࠓͷళͮ͘ΓʹҙཉỐ੪౻ࣾủখࢁձͷԿͷҰͰ͖ͳ͍ͱࢥ͏͕ỏैۀһʹڠྗͯ͠Βỳͯؤுỳ͍͖͍ͯͨỨͱܾҙΛड़ͨỐݩ͔Βٿͷ۱
ʑ
·ͰỏڥѱԽ͕ਂࠁͷ߹͍Λ૿͍ͯ͠ΔỐμΠΦΩγϯỏ ԹஆԽỏࢎੑӍỏΦκϯഁյỏੜछݮগỏީؾมಈ
/gai007ỐڥࠃΛ͑ỏͦ͢Λ͛ͯ࣍ʑʹಥ͖͚ΒΕΔҟมʹỏҬͲ͏ཱ͔ͪ͏͔Ố͔ͭͯओʹެۀ࢈ͷࢹʹҙΛ͏͚ͩͩ
The text is extracted from each PDF and stored in the same folder under the same name with a .txt extension,
but the extracted data does not come out as Japanese text.
I expect to get clean Japanese text in the text file.
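To narrow down where the text goes wrong, a small diagnostic like the one below may help. It is only a sketch: `japanese_ratio` and `write_text_utf8` are hypothetical helper names, not part of PyPDF2, and the Unicode ranges used are a rough heuristic. If the ratio is already low on the string returned by the extraction call, the problem is probably in the extraction step (for example, fonts in the PDF that lack a usable Unicode mapping), and changing how the file is written would not fix it. If the ratio is high but the saved file looks wrong, the write-side encoding is the more likely cause, which is why the second helper passes `encoding="utf-8"` explicitly.

```python
import os


def japanese_ratio(text):
    """Return the fraction of non-whitespace characters that fall in
    common Japanese Unicode ranges (rough heuristic, not exhaustive)."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0

    def is_japanese(c):
        cp = ord(c)
        return (
            0x3000 <= cp <= 0x303F      # CJK symbols and punctuation
            or 0x3040 <= cp <= 0x30FF   # hiragana and katakana
            or 0x4E00 <= cp <= 0x9FFF   # CJK unified ideographs
        )

    return sum(1 for c in chars if is_japanese(c)) / len(chars)


def write_text_utf8(path, text):
    """Write text with an explicit UTF-8 encoding instead of relying
    on the platform default encoding."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(text)
```

In the loop above, one could call `japanese_ratio(text)` right after the page loop and print it next to `filename`, then save the result with `write_text_utf8(os.path.join(pdf_directory, filename[:-4] + '.txt'), text)` in place of the plain `open(..., 'w')` call.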