XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
hongy...@gmail.com
hongyi.zhao at gmail.com
Wed Sep 29 09:22:27 EDT 2021
On Wednesday, September 29, 2021 at 8:12:08 PM UTC+8, J.O. Aho wrote:
> On 29/09/2021 13.10, hongy... at gmail.com wrote:
> > On Wednesday, September 29, 2021 at 5:40:58 PM UTC+8, J.O. Aho wrote:
> >> On 29/09/2021 10.22, hongy... at gmail.com wrote:
> >>> I tried to convert a xls file into csv with the following command, but failed:
> >>>
> >>> $ in2csv --sheet 'Sheet1' 2021-2022-1.xls
> >>> XLRDError: Unsupported format, or corrupt file: Expected BOF record; found b'\r\n\r\n\r\n\r\n'
> >>>
> >>> The above testing file is located at here [1].
> >>>
> >>> [1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1.xls
> >>>
> >>> Any hints for fixing this problem?
> >> You need to delete the 13 first lines in the file
> >
> > Yes. After deleting the top 3 lines, the problem has been fixed.
> >
> >> or you see to that your code does first trim the data before start xml parse it.
> >
> > Yes. I really want to do this trick programmatically, but how do I do it without manually editing the file?
> You could do something like loading the XML into a string (myxmlstr)
How to do this operation? As you have seen, the file refused to be loaded at all.
> and then find the fist < in that string
>
> xmlstart = myxmlstr.find('<')
>
> xmlstr = myxmlstr[xmlstart:]
>
> then use the xmlstr in the xml parser, sure not as convenient as loading
> the file directly to the xml parser.
>
> I don't say this is the best way of doing it, I'm sure some python wiz
> here would have a smarter solution.
Another very strange thing: I trimmed the first 3 lines in the original file and saved it into a new one named as 2021-2022-1-trimmed-top-3-lines.xls. [1]
Then I read the file with the following python script named as pandas-excel.py:
------
import pandas as pd
excel_file='2021-2022-1-trimmed-top-3-lines.xls'
#print(pd.ExcelFile(excel_file).sheet_names)
newpd=pd.read_excel(excel_file, sheet_name='Sheet1')
for i in newpd.index:
if i >1:
for j in newpd.columns:
if int(j.split()[1]) > 2:
if not pd.isnull(newpd.loc[i][j]):
print(newpd.loc[i][j])
------
$ python pandas-excel.py | sort -u
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932
OTOH, I also tried to read the file with in2csv as follows:
$ in2csv --sheet Sheet1 2021-2022-1-trimmed-top-3-lines.xls 2>/dev/null |tr ',' '\n' | \
sed -re '/^$/d' | sort -u | awk '{print length($0),$0}' | sort -k1n | tail -3 | cut -d ' ' -f2-
汽车实用英语 [1-8]周 1-4节 38 汽车楼413基础电气实训室II 汽修1932
智能网联汽车概论 [1-8]周 6-9节 45 汽车楼511汽车营销策划实训室 汽销1931
汽车车载网络系统的检测与修复 [1-12]周 1-4节 38 汽车楼416安全、舒适系统实训室 汽修1932
As you can see, the above two methods give different results. I'm very puzzled by this phenomenon. Any hints/tips/comments will be greatly appreciated.
[1] https://github.com/hongyi-zhao/temp/blob/master/2021-2022-1-trimmed-top-3-lines.xls
Regards,
HZ
More information about the Python-list
mailing list