XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'

Wed Sep 29 09:22:27 EDT 2021

On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 13.10, hongy... at gmail.com wrote: 
> > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote: 
> >> On 29/09/2021 10.22, hongy... at gmail.com wrote: 
> >>> I tried to convert a xls file into csv with the following command, but failed: 
> >>> 
> >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls 
> >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n' 
> >>> 
> >>> The above testing file is located at here [1]. 
> >>> 
> >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls 
> >>> 
> >>> Any hints for fixing this problem? 
> >> You need to delete the 13 first lines in the file 
> > 
> > Yes. After deleting the top 3 lines, the problem has been fixed. 
> > 
> >> or you see to that your code does first trim the data before start xml parse it. 
> > 
> > Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?
> You could do something like loading the XML into a string (myxmlstr)

How to do this operation? As you have seen, the file refused to be loaded at all.

> and then find the fist < in that string 
> 
> xmlstart = myxmlstr.find('<') 
> 
> xmlstr = myxmlstr[xmlstart:] 
> 
> then use the xmlstr in the xml parser, sure not as convenient as loading 
> the file directly to the xml parser. 
> 
> I don't say this is the best way of doing it, I'm sure some python wiz 
> here would have a smarter solution. 

Another very strange thing: I trimmed the first 3 lines in the original file and saved it into a new one named as  2021-2022-1-trimmed-top-3-lines.xls. [1]

Then I read the file with the following python script named as pandas-excel.py:

------
import pandas as pd

excel_file='2021-2022-1-trimmed-top-3-lines.xls'

#print(pd.ExcelFile(excel_file).sheet_names)

newpd=pd.read_excel(excel_file, sheet_name='Sheet1')

for i in newpd.index:
     if i >1:
         for j in newpd.columns:
             if int(j.split()[1]) > 2:
                 if not pd.isnull(newpd.loc[i][j]):
                     print(newpd.loc[i][j])
------

$ python pandas-excel.py | sort -u
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932 
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932 

OTOH, I also tried to read the file with in2csv as follows:

$ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \
  sed -re '/^$/d' | sort -u  | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' '  -f2-
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932 
智能网联汽车概论 [1-8]周 6-9节 45 汽车楼511汽车营销策划实训室 汽销1931 
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932 

As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated.

[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls

Regards,
HZ