OiO.lk Community platform!

BeautifulSoup output not properly formatted

  • Thread starter: bsteo

I'm trying to scrape some text from a website; the problem is its HTML formatting.

Code:
        <div class="coptic-text html">
            <div class="htmlvis"><t class="translation" title="The book of the genealogy of Jesus Christ, the son of David, the son of Abraham."><div class="verse" verse="1"><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲱⲱⲙⲉ' target='_new'>ϫⲱⲱⲙⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲙ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϫⲡⲟ' target='_new'>ϫⲡⲟ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲓⲏⲥⲟⲩⲥ' target='_new'>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡⲉ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲭⲣⲓⲥⲧⲟⲥ' target='_new'>ⲭⲣⲓⲥⲧⲟⲥ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲇⲁⲩⲉⲓⲇ' target='_new'>ⲇⲁⲩⲉⲓⲇ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲡ' target='_new'>ⲡ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ϣⲏⲣⲉ' target='_new'>ϣⲏⲣⲉ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲛ' target='_new'>ⲛ</a></span><!--
--><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=ⲁⲃⲣⲁϩⲁⲙ' target='_new'>ⲁⲃⲣⲁϩⲁⲙ</a></span></span><!--
--><span class="word"><span class="norm"><a href='https://coptic-dictionary.org/results.cgi?quick_search=.' target='_new'>.</a></span></span></div></t><!--
--></span></div></t></div>

My desired output:

Code:
1: ⲡϫⲱⲱⲙⲉ ⲙⲡⲉϫⲡⲟ ⲛⲓⲏⲥⲟⲩⲥ ⲡⲉⲭⲣⲓⲥⲧⲟⲥ ⲡϣⲏⲣⲉ ⲛⲇⲁⲩⲉⲓⲇ ⲡϣⲏⲣⲉ ⲛⲁⲃⲣⲁϩⲁⲙ.

My output:

Code:
ⲡϫⲱⲱⲙⲉⲙⲡⲉϫⲡⲟⲛⲓ ⲏⲥⲟⲩⲥⲡⲉⲭⲣⲓ ⲥⲧⲟⲥⲡϣⲏⲣⲉⲛⲇⲁⲩⲉⲓ ⲇⲡϣⲏⲣⲉⲛⲁⲃⲣⲁϩⲁⲙ.

My code so far:

Code:
#coding: utf-8

import requests
from bs4 import BeautifulSoup
import signal
import sys
import os.path

signal.signal(signal.SIGINT, lambda x, y: sys.exit(0))

if len(sys.argv) != 4:
    print("Usage: %s <book name> <first chapter> <last chapter>" % os.path.basename(__file__))
    quit()

book_name = sys.argv[1]
start = int(sys.argv[2])
stop = int(sys.argv[3])

while start <= stop:
    out_file = open(f"./{book_name}_{str(start)}.txt", "a")

    try:
        response = requests.get(f'https://data.copticscriptorium.org/texts/new-testament/{book_name}_{str(start)}/sahidica')
        soup = BeautifulSoup(response.text, "lxml")
        content_list = soup.find_all("span", class_="norm")

        text = []
        print(f"[{str(start)}/{str(stop)}] https://data.copticscriptorium.org/texts/new-testament/{book_name}_{str(start)}/sahidica")
        for element in content_list:
            text.append(element.get_text())

        text = ''.join(text).strip()
        out_file.write("%s\n" % text)

    except:
        print("Error")
    start += 1

P.S. Language is old Coptic.
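
One possible direction, offered as a sketch rather than a confirmed fix: in the HTML above, each `span.word` groups the `span.norm` pieces of a single word, and each `div.verse` carries its number in a `verse` attribute. Joining the `norm` texts per word and then joining the words with spaces per verse should give the desired layout. The helper names `is_punctuation` and `verses_from_html`, the trimmed `SAMPLE` markup, and the rule that a punctuation-only word is glued to the previous word are assumptions made for illustration; `html.parser` is used here only so the sketch does not depend on lxml.

```python
import unicodedata

from bs4 import BeautifulSoup


def is_punctuation(token):
    """Return True when every character of token is Unicode punctuation."""
    return bool(token) and all(
        unicodedata.category(ch).startswith("P") for ch in token
    )


def verses_from_html(html):
    """Return one 'N: text' line per <div class="verse"> found in html."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for verse in soup.find_all("div", class_="verse"):
        words = []
        for word in verse.find_all("span", class_="word"):
            # Pieces of one word (span.norm) are joined without a separator.
            token = "".join(
                norm.get_text(strip=True)
                for norm in word.find_all("span", class_="norm")
            )
            if not token:
                continue
            if words and is_punctuation(token):
                # Assumption: a punctuation-only "word" belongs to the previous word.
                words[-1] += token
            else:
                words.append(token)
        # Words are separated by single spaces; the verse number is the prefix.
        lines.append(f"{verse.get('verse', '?')}: {' '.join(words)}")
    return lines


# Hypothetical, shortened sample modelled on the markup in the question.
SAMPLE = (
    '<div class="verse" verse="1">'
    '<span class="word"><span class="norm"><a>ⲡ</a></span><!--\n'
    '--><span class="norm"><a>ϫⲱⲱⲙⲉ</a></span></span><!--\n'
    '--><span class="word"><span class="norm"><a>ⲛ</a></span><!--\n'
    '--><span class="norm"><a>ⲓⲏⲥⲟⲩⲥ</a></span></span><!--\n'
    '--><span class="word"><span class="norm"><a>.</a></span></span></div>'
)

for line in verses_from_html(SAMPLE):
    print(line)
```

For the shortened sample this prints `1: ⲡϫⲱⲱⲙⲉ ⲛⲓⲏⲥⲟⲩⲥ.`. In the script above, the same function could be called on `response.text` in place of the `find_all("span", class_="norm")` loop, writing each returned line to `out_file`.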