Problems identifying an invisible character

Thread starter: Dominik (Guest)

I'm analyzing a corpus of documents and noticed 4 instances of tokens that look identical but are treated as different. Today I imported the dataset into another program, and it highlighted what looked like an empty space before the word:

[Screenshot: https://i.sstatic.net/HqcwrUOy.png]

I then copied and pasted this from Gephi into Excel, and a weird dot was displayed before the term: [Screenshot: https://i.sstatic.net/IYV6TEHW.png]

I tried pasting the text into text-compare.com to identify the character, and into my JupyterLab notebook to fix the words. Whenever I do that, though, the odd character disappears and is not picked up, so I can't select the terms I'm trying to correct. Any idea how to handle this?
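
Since the clipboard seems to drop the character, one option is to inspect the token where it already lives as a Python string (for example, after reading the extracted text file directly) instead of pasting it anywhere. The following is a minimal sketch of that idea, not a diagnosis: the helper name `describe_chars` and the sample token are hypothetical, and the zero-width space used here is only an assumed stand-in for whatever character is actually in the corpus.

```python
import unicodedata


def describe_chars(text):
    """Return (code point, Unicode category, Unicode name) for each character."""
    return [
        (
            f"U+{ord(ch):04X}",
            unicodedata.category(ch),
            unicodedata.name(ch, "<unnamed>"),
        )
        for ch in text
    ]


# Hypothetical token: an invisible zero-width space in front of "science".
token = "\u200bscience"

for code_point, category, name in describe_chars(token):
    print(code_point, category, name)

# ascii() escapes every non-ASCII character, so invisible ones become visible.
print(ascii(token))
```

Comparing `len(token)` with the number of visible letters is another quick check: a token that looks like "science" but has length 8 is carrying one extra character.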

Edit: I tried to copy/paste the string here, but it displays normally as "science". I also used strip() in Python during pre-processing, but that didn't help. The original documents were PDFs; I OCRed them and extracted the text as UTF-8.
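
A possible reason strip() has no effect is that it only trims characters Python counts as whitespace. Invisible format characters such as a zero-width space (U+200B), a byte order mark (U+FEFF), or a soft hyphen (U+00AD) are not whitespace, so they survive. The sketch below assumes the stray character falls into the Unicode "format" (Cf) or "control" (Cc) categories, which has not been confirmed for this corpus. The function name `clean_token` is hypothetical.

```python
import unicodedata


def clean_token(token):
    """Drop format (Cf) and control (Cc) characters, then trim whitespace."""
    kept = (
        ch for ch in token
        if unicodedata.category(ch) not in {"Cf", "Cc"}
    )
    return "".join(kept).strip()


# Hypothetical token with a leading zero-width space.
dirty = "\u200bscience"

print(dirty.strip() == "science")      # strip() alone leaves U+200B in place
print(clean_token(dirty) == "science")  # the filtered token matches
```

Note that the Cf category also contains characters that carry meaning in some scripts, such as the zero-width joiner, so this filter is best suited to a corpus where such characters are not expected.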
 