OiO.lk Community platform!

Oio.lk is an excellent forum for developers, providing a wide range of resources, discussions, and support for those in the developer community. Join oio.lk today to connect with like-minded professionals, share insights, and stay updated on the latest trends and technologies in the development field.
  You need to log in or register to access the solved answers to this problem.
  • You have reached the maximum number of guest views allowed
  • Please register below to remove this limitation

Reading xls file into pandas DataFrame with xlcd

  • Thread starter Thread starter pawel_kw
  • Start date Start date
P

pawel_kw

Guest
I'm trying to read an xls file into a DataFrame in Python. The file is made available online by a Spanish electrical distribution grid operator on their website: https://www.ufd.es/wp-content/uploads/2024/05/publicacion-capacidad-Junio.xls There must be something fishy about this file, because it's not only Python that has problems reading it: Numbers on Mac says the file format is invalid, however Excel under Windows has no problems opening the file (no errors or warnings reported). Google Sheets also opens the file without issues.

I tried opening the file using pandas.read_excel function, trying all available engines. I suppose xlrd should be the method of choice here, since it's an *.xls file. In the minimum code reproducing the issue I use requests.get() to download the file providing a valid header, otherwise the website blocks the download:

Code:
import requests
import pandas as pd

url = "https://www.ufd.es/wp-content/uploads/2024/05/publicacion-capacidad-Junio.xls"
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
}
response = requests.get(url, headers=headers)
file_name = "data.xls"
df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")

Versions of the libraries I use: xlrd: 2.0.1, pandas: 2.2.2, with Python 3.10.12.

This results in the following error:

Code:
./xls_test.py:10: FutureWarning: Passing bytes to 'read_excel' is deprecated and will be removed in a future version. To read from a byte string, wrap it in a `BytesIO` object.
  df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")
Traceback (most recent call last):
  File "./xls_test.py", line 10, in <module>
    df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")
  File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
    io = ExcelFile(
  File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
    self._reader = self._engines[engine](
  File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
    super().__init__(
  File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
    self.book = self.load_workbook(self.handles.handle, engine_kwargs)
  File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
    return open_workbook(file_contents=data, **engine_kwargs)
  File "/opt/venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
    bk = open_workbook_xls(
  File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 104, in open_workbook_xls
    bk.parse_globals()
  File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1211, in parse_globals
    self.handle_sst(data)
  File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1178, in handle_sst
    self._sharedstrings, rt_runlist = unpack_SST_table(strlist, uniquestrings)
  File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1474, in unpack_SST_table
    assert _unused_i == nstrings - 1
AssertionError

Ignoring the FutureWarning for now, the problem seems to be located in the unpack_SST_table() function of xlrd module. I tried digging a bit in there, trying to understand what is going on. I enabled some debug messages which were already present in the code and added some more. It looks like the function cannot handle the file at some specific point. The unpack_SST_table() function is defined at the end of the file book.py, located in my venv under /opt/venv/lib/python3.10/site-packages/xlrd/book.py. I added some print statements to show variable values, when the problem occurs, just before the asser in line 1474:

Code:
print("SST CMPRSD: nchars=%d pos=%d rawstrg=%r" % (nchars, pos, rawstrg))
print(f"pos={pos} datainx={datainx} datalen={datalen} ndatas={ndatas} _unused_i={_unused_i} nstrings={nstrings}")

Running the download and read script with these I see:

Code:
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=2 datalen=7930 ndatas=2 _unused_i=1657 nstrings=23799

If I wrapt the assertion in try...except the data frame loads, but I see that the problem occurs multiple times, always with the same rawstrg:

Code:
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=2 datalen=7930 ndatas=2 _unused_i=1657 nstrings=23799
_unused_i=1657 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=3 datalen=7930 ndatas=2 _unused_i=2424 nstrings=23799
_unused_i=2424 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=4 datalen=7930 ndatas=2 _unused_i=3191 nstrings=23799
_unused_i=3191 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=5 datalen=7930 ndatas=2 _unused_i=3958 nstrings=23799
_unused_i=3958 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=6 datalen=7930 ndatas=2 _unused_i=4725 nstrings=23799
_unused_i=4725 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=7 datalen=7930 ndatas=2 _unused_i=5492 nstrings=23799
_unused_i=5492 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=8 datalen=7930 ndatas=2 _unused_i=6259 nstrings=23799
_unused_i=6259 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=9 datalen=7930 ndatas=2 _unused_i=7026 nstrings=23799
_unused_i=7026 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=10 datalen=7930 ndatas=2 _unused_i=7793 nstrings=23799
_unused_i=7793 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=11 datalen=7930 ndatas=2 _unused_i=8560 nstrings=23799
_unused_i=8560 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=12 datalen=7930 ndatas=2 _unused_i=9327 nstrings=23799
_unused_i=9327 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=13 datalen=7930 ndatas=2 _unused_i=10094 nstrings=23799
_unused_i=10094 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=14 datalen=7930 ndatas=2 _unused_i=10861 nstrings=23799
_unused_i=10861 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=15 datalen=7930 ndatas=2 _unused_i=11628 nstrings=23799
_unused_i=11628 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=16 datalen=7930 ndatas=2 _unused_i=12395 nstrings=23799
_unused_i=12395 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=17 datalen=7930 ndatas=2 _unused_i=13162 nstrings=23799
_unused_i=13162 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=18 datalen=7930 ndatas=2 _unused_i=13929 nstrings=23799
_unused_i=13929 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=19 datalen=7930 ndatas=2 _unused_i=14696 nstrings=23799
_unused_i=14696 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=20 datalen=7930 ndatas=2 _unused_i=15463 nstrings=23799
_unused_i=15463 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=21 datalen=7930 ndatas=2 _unused_i=16230 nstrings=23799
_unused_i=16230 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=22 datalen=7930 ndatas=2 _unused_i=16997 nstrings=23799
_unused_i=16997 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=23 datalen=7930 ndatas=2 _unused_i=17764 nstrings=23799
_unused_i=17764 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=24 datalen=7930 ndatas=2 _unused_i=18531 nstrings=23799
_unused_i=18531 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=25 datalen=7930 ndatas=2 _unused_i=19298 nstrings=23799
_unused_i=19298 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=26 datalen=7930 ndatas=2 _unused_i=20065 nstrings=23799
_unused_i=20065 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=27 datalen=7930 ndatas=2 _unused_i=20832 nstrings=23799
_unused_i=20832 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=28 datalen=7930 ndatas=2 _unused_i=21599 nstrings=23799
_unused_i=21599 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=29 datalen=7930 ndatas=2 _unused_i=22366 nstrings=23799
_unused_i=22366 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=30 datalen=7930 ndatas=2 _unused_i=23133 nstrings=23799
_unused_i=23133 nstrings=23799
!!! get_externsheet_local_range: refx=65535, not in range(1)

I'm not sure if I'm reading in all the data properly this way. I'd appreciate any hints on what to do about it. Is it a problem with the file itself or is this some special case xlrd is not handling properly?
<p>I'm trying to read an xls file into a DataFrame in Python. The file is made available online by a Spanish electrical distribution grid operator on their website: <a href="https://www.ufd.es/wp-content/uploads/2024/05/publicacion-capacidad-Junio.xls" rel="nofollow noreferrer">https://www.ufd.es/wp-content/uploads/2024/05/publicacion-capacidad-Junio.xls</a> There must be something fishy about this file, because it's not only Python that has problems reading it: Numbers on Mac says the file format is invalid, however Excel under Windows has no problems opening the file (no errors or warnings reported). Google Sheets also opens the file without issues.</p>
<p>I tried opening the file using <code>pandas.read_excel</code> function, trying all available engines. I suppose <code>xlrd</code> should be the method of choice here, since it's an *.xls file. In the minimum code reproducing the issue I use <code>requests.get()</code> to download the file providing a valid header, otherwise the website blocks the download:</p>
<pre><code>import requests
import pandas as pd

url = "https://www.ufd.es/wp-content/uploads/2024/05/publicacion-capacidad-Junio.xls"
headers = {
"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/109.0.0.0 Safari/537.36",
}
response = requests.get(url, headers=headers)
file_name = "data.xls"
df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")
</code></pre>
<p>Versions of the libraries I use:
<code>xlrd: 2.0.1</code>, <code>pandas: 2.2.2</code>, with Python 3.10.12.</p>
<p>This results in the following error:</p>
<pre><code>./xls_test.py:10: FutureWarning: Passing bytes to 'read_excel' is deprecated and will be removed in a future version. To read from a byte string, wrap it in a `BytesIO` object.
df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")
Traceback (most recent call last):
File "./xls_test.py", line 10, in <module>
df = pd.read_excel(response.content, header=[0, 1], engine="xlrd")
File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 495, in read_excel
io = ExcelFile(
File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 1567, in __init__
self._reader = self._engines[engine](
File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 46, in __init__
super().__init__(
File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_base.py", line 573, in __init__
self.book = self.load_workbook(self.handles.handle, engine_kwargs)
File "/opt/venv/lib/python3.10/site-packages/pandas/io/excel/_xlrd.py", line 63, in load_workbook
return open_workbook(file_contents=data, **engine_kwargs)
File "/opt/venv/lib/python3.10/site-packages/xlrd/__init__.py", line 172, in open_workbook
bk = open_workbook_xls(
File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 104, in open_workbook_xls
bk.parse_globals()
File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1211, in parse_globals
self.handle_sst(data)
File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1178, in handle_sst
self._sharedstrings, rt_runlist = unpack_SST_table(strlist, uniquestrings)
File "/opt/venv/lib/python3.10/site-packages/xlrd/book.py", line 1474, in unpack_SST_table
assert _unused_i == nstrings - 1
AssertionError
</code></pre>
<p>Ignoring the <code>FutureWarning</code> for now, the problem seems to be located in the <code>unpack_SST_table()</code> function of <code>xlrd</code> module. I tried digging a bit in there, trying to understand what is going on. I enabled some debug messages which were already present in the code and added some more. It looks like the function cannot handle the file at some specific point. The <code>unpack_SST_table()</code> function is defined at the end of the file <code>book.py</code>, located in my venv under <code>/opt/venv/lib/python3.10/site-packages/xlrd/book.py</code>. I added some print statements to show variable values, when the problem occurs, just before the <code>asser</code> in line 1474:</p>
<pre><code>print("SST CMPRSD: nchars=%d pos=%d rawstrg=%r" % (nchars, pos, rawstrg))
print(f"pos={pos} datainx={datainx} datalen={datalen} ndatas={ndatas} _unused_i={_unused_i} nstrings={nstrings}")
</code></pre>
<p>Running the download and read script with these I see:</p>
<pre><code>SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=2 datalen=7930 ndatas=2 _unused_i=1657 nstrings=23799
</code></pre>
<p>If I wrapt the assertion in <code>try...except</code> the data frame loads, but I see that the problem occurs multiple times, always with the same <code>rawstrg</code>:</p>
<pre><code>SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=2 datalen=7930 ndatas=2 _unused_i=1657 nstrings=23799
_unused_i=1657 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=3 datalen=7930 ndatas=2 _unused_i=2424 nstrings=23799
_unused_i=2424 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=4 datalen=7930 ndatas=2 _unused_i=3191 nstrings=23799
_unused_i=3191 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=5 datalen=7930 ndatas=2 _unused_i=3958 nstrings=23799
_unused_i=3958 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=6 datalen=7930 ndatas=2 _unused_i=4725 nstrings=23799
_unused_i=4725 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=7 datalen=7930 ndatas=2 _unused_i=5492 nstrings=23799
_unused_i=5492 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=8 datalen=7930 ndatas=2 _unused_i=6259 nstrings=23799
_unused_i=6259 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=9 datalen=7930 ndatas=2 _unused_i=7026 nstrings=23799
_unused_i=7026 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=10 datalen=7930 ndatas=2 _unused_i=7793 nstrings=23799
_unused_i=7793 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=11 datalen=7930 ndatas=2 _unused_i=8560 nstrings=23799
_unused_i=8560 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=12 datalen=7930 ndatas=2 _unused_i=9327 nstrings=23799
_unused_i=9327 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=13 datalen=7930 ndatas=2 _unused_i=10094 nstrings=23799
_unused_i=10094 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=14 datalen=7930 ndatas=2 _unused_i=10861 nstrings=23799
_unused_i=10861 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=15 datalen=7930 ndatas=2 _unused_i=11628 nstrings=23799
_unused_i=11628 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=16 datalen=7930 ndatas=2 _unused_i=12395 nstrings=23799
_unused_i=12395 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=17 datalen=7930 ndatas=2 _unused_i=13162 nstrings=23799
_unused_i=13162 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=18 datalen=7930 ndatas=2 _unused_i=13929 nstrings=23799
_unused_i=13929 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=19 datalen=7930 ndatas=2 _unused_i=14696 nstrings=23799
_unused_i=14696 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=20 datalen=7930 ndatas=2 _unused_i=15463 nstrings=23799
_unused_i=15463 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=21 datalen=7930 ndatas=2 _unused_i=16230 nstrings=23799
_unused_i=16230 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=22 datalen=7930 ndatas=2 _unused_i=16997 nstrings=23799
_unused_i=16997 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=23 datalen=7930 ndatas=2 _unused_i=17764 nstrings=23799
_unused_i=17764 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=24 datalen=7930 ndatas=2 _unused_i=18531 nstrings=23799
_unused_i=18531 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=25 datalen=7930 ndatas=2 _unused_i=19298 nstrings=23799
_unused_i=19298 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=26 datalen=7930 ndatas=2 _unused_i=20065 nstrings=23799
_unused_i=20065 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=27 datalen=7930 ndatas=2 _unused_i=20832 nstrings=23799
_unused_i=20832 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=28 datalen=7930 ndatas=2 _unused_i=21599 nstrings=23799
_unused_i=21599 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=29 datalen=7930 ndatas=2 _unused_i=22366 nstrings=23799
_unused_i=22366 nstrings=23799
SST CMPRSD: nchars=14 pos=0 rawstrg=b'UD638423080002'
pos=0 datainx=30 datalen=7930 ndatas=2 _unused_i=23133 nstrings=23799
_unused_i=23133 nstrings=23799
!!! get_externsheet_local_range: refx=65535, not in range(1)
</code></pre>
<p>I'm not sure if I'm reading in all the data properly this way. I'd appreciate any hints on what to do about it. Is it a problem with the file itself or is this some special case <code>xlrd</code> is not handling properly?</p>
 

Latest posts

Top