catch UnicodeDecodeError

Jaroslav Dobrek jaroslav.dobrek at gmail.com
Thu Jul 26 12:51:34 CEST 2012


> And the cool thing is: you can! :)
>
> In Python 2.6 and later, the new Py3 open() function is a bit more hidden,
> but it's still available:
>
>     from io import open
>
>     filename = "somefile.txt"
>     try:
>         with open(filename, encoding="utf-8") as f:
>             for line in f:
>                 process_line(line)  # actually, I'd use "process_file(f)"
>     except IOError, e:
>         print("Reading file %s failed: %s" % (filename, e))
>     except UnicodeDecodeError, e:
>         print("Some error occurred decoding file %s: %s" % (filename, e))

Thanks. I might use this in the future.

> > try:
> >     for line in f: # here text is decoded implicitly
> >        do_something()
> > except UnicodeDecodeError():
> >     do_something_different()
>
> > This isn't possible for syntactic reasons.
>
> Well, you'd normally want to leave out the parentheses after the exception
> type, but otherwise, that's perfectly valid Python code. That's how these
> things work.

You are right. Of course this is syntactically possible. I was too
rash, sorry. I confused it with some other construction I once tried.
I can't remember it right now.

But the code above (with the brackets) is semantically bad: the
exception is not caught.
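A minimal sketch of why the bracketed form fails (Python 3 syntax shown; the details differ slightly in Python 2, but the decode error escapes there too):

```python
# Why "except UnicodeDecodeError():" catches nothing: the parentheses
# *call* the exception class, so the except clause is handed an instance
# rather than a class (and UnicodeDecodeError() without its five required
# arguments does not even construct). Either way, the original decode
# error escapes, replaced by a TypeError.

def broken_handler():
    try:
        b"\xff".decode("utf-8")       # raises UnicodeDecodeError
    except UnicodeDecodeError():      # bug: should be "except UnicodeDecodeError:"
        return "caught"
    return "no error"

try:
    result = broken_handler()
except TypeError:
    result = "escaped as TypeError"
```

With the brackets removed, `broken_handler()` would return "caught" as expected.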


> > The problem is that vast majority of the thousands of files that I
> > process are correctly encoded. But then, suddenly, there is a bad
> > character in a new file. (This is so because most files today are
> > generated by people who don't know that there is such a thing as
> > encodings.) And then I need to rewrite my very complex program just
> > because of one single character in one single file.
>
> Why would that be the case? The places to change should be very local in
> your code.

This is the case in a program that has many different functions which
open and parse different types of files. When I read and parse a
directory with such different types of files, a program that uses

for line in f:

will not exit with any hint as to where the error occurred. It just
exits with a UnicodeDecodeError. That means I have to look at all
functions that have some variant of

for line in f:

in them. And it is not sufficient to replace the "for line in f" part:
I would have to transform many functions that work in terms of lines
into functions that work in terms of decoded bytes.

That is why I usually solve the problem by moving files around until I
find the bad file. Then I recode or repair the bad file manually.



More information about the Python-list mailing list