OiO.lk Community platform!

Match punctuation sign or end of a line

  • Thread starter: zest16 (Guest)
I want to improve the NLTK sentence tokenizer. Unfortunately, it doesn't work well when there is no whitespace between a period and the start of the next sentence.

Code:
from nltk.tokenize import sent_tokenize

text = "I love you.i hate you.I understand. i comprehend. i have 3.5 lines.I am bored"

sentences = sent_tokenize(text)
sentences

Output:

Code:
['I love you.i hate you.I understand.',
 'i comprehend.',
 'i have 3.5 lines.I am bored']

So with regex I can split the first line into 3 separate sentences. However, I don't know how to get the last sentence too, since it doesn't end in a punctuation mark.

Code:
import re

new_sentences = []
for i in sentences:
    sents = re.findall(r'\w+.*?[.?!$](?!\d)', i, flags=re.S)
    new_sentences.extend(sents)
new_sentences

Output:

Code:
['I love you.',
 'i hate you.',
 'I understand.',
 'i comprehend.',
 'i have 3.5 lines.']

I put the $ there to indicate the end of the line, but the regex doesn't seem to honor it.
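
A likely explanation: inside a character class such as [.?!$], the $ is a literal dollar sign, not an end-of-string anchor, so the final sentence never finds a terminator. A minimal sketch of one possible fix follows, moving $ out of the class and into an alternation. It assumes the sentences list is exactly the sent_tokenize output shown above (hard-coded here so the snippet runs without NLTK); SENTENCE_RE and split_on_punctuation are hypothetical names introduced only for illustration.

```python
import re

# sent_tokenize() output copied from the post, hard-coded so this runs without NLTK.
sentences = [
    'I love you.i hate you.I understand.',
    'i comprehend.',
    'i have 3.5 lines.I am bored',
]

# Inside [...] the "$" is a literal dollar sign, not an anchor,
# so a string without punctuation (or "$") finds no terminator.
assert re.search(r'[.?!$]', 'I am bored') is None

# Move "$" out of the class: a sentence ends at . ? or ! that is not
# followed by a digit, OR at the end of the string.
SENTENCE_RE = re.compile(r'\w+.*?(?:[.?!](?!\d)|$)', flags=re.S)


def split_on_punctuation(chunks):
    """Hypothetical helper: re-split each chunk using SENTENCE_RE."""
    result = []
    for chunk in chunks:
        result.extend(SENTENCE_RE.findall(chunk))
    return result


new_sentences = split_on_punctuation(sentences)
print(new_sentences)
```

The group is non-capturing, (?:...), on purpose: re.findall returns the contents of capturing groups when any exist, which would yield only the terminators instead of the whole sentences. With this pattern the trailing 'I am bored' is returned as a sixth sentence, and the 3.5 is still left intact by the (?!\d) lookahead.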