OiO.lk Community platform!


Python function to read doc file

  • Thread starter: DP92_A_Yellow (Guest)
I am working on a program in Python to read a doc/docx file. I found a module named docx, so I tried this function:

Code:
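from docx import Document
from docx.opc.exceptions import PackageNotFoundError

# Return the text of every paragraph in a .docx file, or None on failure
def getText(file_name):
    try:
        doc = Document(file_name)
    except PackageNotFoundError as exc:
        # Raised when the path is missing or the file is not a valid .docx package
        print("Exception:", exc)
        return None
    fullText = []
    for para in doc.paragraphs:
        fullText.append(para.text)
    return '\n'.join(fullText)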

I keep getting the same Exception output, and if I try to print the output

Code:
text = getText(file)
print(str(text))

it just prints None. The exception I got during debugging is PackageNotFoundError, so I wrapped the body of the getText function in a try-except block.
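
One way to narrow down a PackageNotFoundError is to check, before calling Document, whether the path exists and is a ZIP archive containing word/document.xml, since a .docx file is a ZIP package. The sketch below uses only the standard library. The helper name looks_like_docx is hypothetical, and the check is a heuristic, not python-docx's own validation.

```python
import os
import zipfile


def looks_like_docx(path):
    """Heuristic check: is `path` a ZIP archive with a main document part?

    Hypothetical helper, not part of python-docx.
    """
    if not os.path.isfile(path):
        return False  # missing file, or a directory
    if not zipfile.is_zipfile(path):
        return False  # e.g. a legacy .doc, an HTML error page, or an empty file
    with zipfile.ZipFile(path) as archive:
        # The main body of a Word document normally lives in this part
        return "word/document.xml" in archive.namelist()
```

If this returns False for a file that was copied from a remote folder, the copy step is a more likely culprit than the parsing step.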

The files are located in a remote Dropbox folder, so I tried saving them locally. When I open the locally saved files in Word, they are empty.

Code:
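import os
from docx import Document

# fname is the Dropbox path; its last component is the .docx file name
pth = os.path.abspath('files/' + fname.split('/')[-1])
if not os.path.isfile(pth):  # skip files already saved in an earlier debugging run
    doc = Document()  # note: Document() with no argument creates a new, empty document
    doc.save(pth)
text = getText(pth)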

I guess the formatting is not recognized by docx. Is there another library to solve this specific problem, or maybe a particular function from the same docx package library?

I considered textract, but I use Windows, and on another laptop with macOS I could not successfully install it.
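
For the "another library" part of the question, a dependency-free option for .docx files (not legacy .doc) is to read the main XML part directly with the standard library. This is a minimal sketch under stated assumptions: read_docx_text is a hypothetical name, it only collects w:t text runs from word/document.xml, it ignores headers and footnotes, and text in nested content such as text boxes may be repeated.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by elements in word/document.xml
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def read_docx_text(path):
    """Return paragraph text from a .docx file, one paragraph per line.

    Hypothetical helper: reads only the main body part of the package.
    """
    with zipfile.ZipFile(path) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Join the text of every w:t run inside this paragraph
        text = "".join(node.text or "" for node in para.iter(W_NS + "t"))
        paragraphs.append(text)
    return "\n".join(paragraphs)
```

Because it needs nothing beyond the standard library, this should behave the same on Windows and macOS; it will still fail on a file that is not a ZIP archive.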