[Tutor] Unknown encoded file types.
Sarfraaz Ahmed
sarfraaz at gmail.com
Sun Feb 7 06:55:36 EST 2021
On Sun, Feb 7, 2021 at 5:11 PM Cameron Simpson <cs at cskk.id.au> wrote:
> On 07Feb2021 20:55, Sean Murphy <mhysnm1964 at gmail.com> wrote:
> >This is what I was suspecting. Thanks for confirming. I have even tried
> >to decode the binary variable into UTF and it failed.
>
I had a similar encoding issue on my macOS machine yesterday and this
Stack Overflow answer helped me. I added my experience as a comment there
to help others who might face a similar issue.
https://stackoverflow.com/questions/18171739/unicodedecodeerror-when-reading-csv-file-in-pandas-with-python#comment116826345_18172249
You basically need to find out the encoding of each file and pass that
encoding when you open the file. The major work is in identifying the
encoding of all your files. Hopefully they all share the same encoding,
in which case you would have less work.
Hope that helps.
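
A minimal sketch of that approach, assuming you have a short list of
encodings your files are likely to use. The function name
read_text_guessing and the candidate list are my own illustration, not
something from the linked answer. It tries each candidate in turn and
reports which one decoded without error; a clean decode does not prove
the guess is right, so check the result by eye.

```python
from pathlib import Path

# Candidate encodings to try, in order. This list is an assumption; adjust
# it to what your files are likely to contain. latin-1 goes last because it
# accepts every possible byte value and therefore never fails.
CANDIDATES = ("utf-8", "cp1252", "latin-1")


def read_text_guessing(path, candidates=CANDIDATES):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    data = Path(path).read_bytes()
    for encoding in candidates:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue  # wrong guess, try the next candidate
    raise ValueError(f"none of {candidates} could decode {path}")
```

Once you know the encoding of a file, you can pass it directly, for
example open(path, encoding="cp1252") or, with pandas,
pd.read_csv(path, encoding="cp1252").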
>
> That just says it isn't UTF-8. If you were trying UTF-8.
>
> >I am thinking of trying
> >to work out how to clean the file to remove any text that doesn't fall
> >within the western languages.
>
> Is that necessary?
>
> >Far as I am aware, only European / English should be
> >present. More English than anything else.
>
> That leaves plenty of scope for non-ASCII bytes. What kind of criterion
> do you think would help you?
>
> >When using binary mode to load a text file, do all the encoding bytes
> >stay present in the file after the content of the file has been loaded?
> >Thus when you join the content from two files together, are you getting
> >the encoding information halfway through the joined text?
>
> There often aren't any "encoding bytes" to save/preserve. The text will
> simply have been transcribed in whatever encoding was in use. There
> aren't standard "markers" for this stuff, which is why an unknown file
> is guesswork.
>
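
To illustrate the point about joining files: bytes read in binary mode
are kept exactly as stored, so concatenating two files that use different
encodings gives a byte string that no single codec decodes correctly. A
small sketch, assuming the encoding of each piece is already known (the
helper name join_as_utf8 and the sample strings are made up for the
example):

```python
def join_as_utf8(parts):
    """Decode each (data, encoding) pair, join the text, re-encode as UTF-8."""
    text = "".join(data.decode(encoding) for data, encoding in parts)
    return text.encode("utf-8")


# Two in-memory "files" holding text in different encodings.
a = "café\n".encode("cp1252")
b = "naïve\n".encode("utf-8")

# Joining the raw bytes keeps both encodings side by side.
mixed = a + b
try:
    mixed.decode("utf-8")
    raw_join_is_valid_utf8 = True
except UnicodeDecodeError:
    raw_join_is_valid_utf8 = False

# Decoding each piece first gives one consistent result.
combined = join_as_utf8([(a, "cp1252"), (b, "utf-8")])
```

Here the raw join is not valid UTF-8, while combined decodes as UTF-8 to
the text of both pieces.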
> If the text commences with a BOM (FFFE or FEFF) it is probably UTF-16LE
> or UTF-16BE respectively. But otherwise you're on your own, falling back
> to libraries which guess from the leading data and the byte value
> distributions.
>
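
The BOM check can be sketched with the constants in the standard codecs
module. The function name sniff_bom is hypothetical, and the UTF-8 and
UTF-32 cases are my addition beyond the two UTF-16 marks mentioned above.
It only inspects the first few bytes and returns None when no BOM is
present, which is the common case.

```python
import codecs

# Longer marks are checked first: the UTF-32-LE BOM (FF FE 00 00) begins
# with the UTF-16-LE BOM (FF FE). A UTF-16-LE file whose first character
# is NUL would be misreported as UTF-32-LE; that is rare in real text.
_BOMS = (
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # bytes FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # bytes FE FF
)


def sniff_bom(data):
    """Return an encoding name if data starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None
```

When a UTF-16 or UTF-32 BOM is found, decoding with the generic "utf-16"
or "utf-32" codec reads the byte order from the BOM and strips it from
the result. For files without a BOM, the third-party chardet package
(chardet.detect(data)) is one of the guessing libraries of the kind
described above.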
> Cheers,
> Cameron Simpson <cs at cskk.id.au>
--
Jazaak Allahu Khair
-- Sarfraaz Ahmed