OiO.lk Community platform!

Oio.lk is an excellent forum for developers, providing a wide range of resources, discussions, and support for those in the developer community. Join oio.lk today to connect with like-minded professionals, share insights, and stay updated on the latest trends and technologies in the development field.
  You need to log in or register to access the solved answers to this problem.
  • You have reached the maximum number of guest views allowed
  • Please register below to remove this limitation

detect and change website encoding in python

  • Thread starter Thread starter kl4us
  • Start date Start date
K

kl4us

Guest
I have a problem with website encoding. I maked a program to scrape a website but i didn't have successfully with changing encoding of readed content. My code is:

Code:
import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com" 
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding

I used external library (BSXPath an extension of BeautifulSoap) and the document.originalEncoding print the encoding of website and not the utf-8 encoding that I tried to change. Have anyone some suggestion?

Thanks
<p>I have a problem with website encoding. I maked a program to scrape a website but i didn't have successfully with changing encoding of readed content. My code is:</p>

<pre><code>import sys,os,glob,re,datetime,optparse
import urllib2

from BSXPath import BSXPathEvaluator,XPathResult
#import BeautifulSoup

#from utility import *

sTargetEncoding = "utf-8"

page_to_process = "http://www.xxxx.com"
req = urllib2.urlopen(page_to_process)
content = req.read()
encoding=req.headers['content-type'].split('charset=')[-1]
print encoding

ucontent = unicode(content, encoding).encode(sTargetEncoding)
#ucontent = content.decode(encoding).encode(sTargetEncoding)
#ucontent = content

document = BSXPathEvaluator(ucontent)

print "ORIGINAL ENCODING: " + document.originalEncoding
</code></pre>

<p>I used external library (BSXPath an extension of BeautifulSoap) and the document.originalEncoding print the encoding of website and not the utf-8 encoding that I tried to change.
Have anyone some suggestion?</p>

<p>Thanks</p>
 
Top