UTF-16 or something else?

Skip Montanaro skip.montanaro at gmail.com
Tue Feb 9 09:53:50 EST 2021


I downloaded US hospital ICU capacity data this morning from this page:

https://healthdata.gov/dataset/covid-19-reported-patient-impact-and-hospital-capacity-facility

(The download link is about halfway down the page.)

When I tried to read it with my personal CSV tools without specifying an
encoding, they failed to recognize the first column, hospital_pk. That is
apparently because the file isn't simply ASCII or UTF-8: there are a few
bytes ahead of the "h". However, if I open the file using "utf-16" as the
encoding, Python complains there is no BOM. od(1) shows there is
*something* ahead of the first column name, but it's three bytes, not two:

% od -A x -t x1z -v < reported_hospital_capacity_admissions_facility_level_weekly_average_timeseries_20210207.csv | head
000000 *ef bb bf* 68 6f 73 70 69 74 61 6c 5f 70 6b 2c 63  >...hospital_pk,c<
000010 6f 6c 6c 65 63 74 69 6f 6e 5f 77 65 65 6b 2c 73  >ollection_week,s<
000020 74 61 74 65 2c 63 63 6e 2c 68 6f 73 70 69 74 61  >tate,ccn,hospita<
...

I'm opening the file like so:

inf = open(args[0], "r", encoding=encoding)

where encoding is passed on the command line. I know I can simply edit out
those bytes and probably be good-to-go, but I'd prefer not to. What should
I be passing for the encoding?
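
One possibility, offered as a sketch rather than a definitive answer: the
three bytes ef bb bf in the od output match the UTF-8 byte order mark
(codecs.BOM_UTF8 in Python's standard library), and Python's "utf-8-sig"
codec skips that mark when decoding. The helper name sniff_encoding and the
small stand-in file below are hypothetical, made up for illustration; the
stand-in reuses only the first three column names visible in the dump.

```python
import codecs
import csv
import os
import tempfile


def sniff_encoding(path, default="utf-8"):
    """Return "utf-8-sig" if the file starts with the UTF-8 BOM, else default.

    Hypothetical helper, not part of any existing tool.
    """
    with open(path, "rb") as f:
        head = f.read(len(codecs.BOM_UTF8))
    return "utf-8-sig" if head == codecs.BOM_UTF8 else default


# Build a small stand-in file that starts with the same three bytes (ef bb bf).
with tempfile.NamedTemporaryFile("wb", suffix=".csv", delete=False) as tmp:
    tmp.write(codecs.BOM_UTF8 + b"hospital_pk,collection_week,state\n")
    sample_path = tmp.name

# Pick the encoding from the leading bytes, then read the header row.
encoding = sniff_encoding(sample_path)
with open(sample_path, "r", encoding=encoding, newline="") as inf:
    header = next(csv.reader(inf))

# For comparison, read the same header with plain "utf-8".
with open(sample_path, "r", encoding="utf-8", newline="") as inf:
    header_plain = next(csv.reader(inf))

os.remove(sample_path)
print(encoding, header)
print(repr(header_plain[0]))
```

With plain "utf-8" the mark is decoded as the character U+FEFF and stays
attached to the first column name, which may be why the CSV tools did not
recognize hospital_pk.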

Skip, who thought everybody had effectively settled on utf-8 at this point,
but apparently not...

