[Tutor] Logical error?

Danny Yoo dyoo at hashcollision.org
Sun May 4 06:03:46 CEST 2014


> ########################################################################
> for encoding in ('utf-8', 'utf-16', 'utf-32'):
>   for i in range(0x110000):
>     aChar = unichr(i)
>     try:
>       someBytes = aChar.encode(encoding)
>       if '\n' in someBytes:
>         print("%r contains a newline in its bytes encoded with %s" %
> (aChar, encoding))
>     except:
>       ## Normally, try/catches with an empty except is a bad idea.
>       ## Here, this is toy code, and we're just exploring.
>       pass
> ########################################################################


Gaa...  Sorry about the bad indenting.  Let me try that again.


####################################
for encoding in ('utf-8', 'utf-16', 'utf-32'):
    for i in range(0x110000):
        aChar = unichr(i)
        try:
            someBytes = aChar.encode(encoding)
            if '\n' in someBytes:
                print("%r contains a newline in its bytes encoded with %s"
                      % (aChar, encoding))
        except:
            ## Normally, try/catches with an empty except is a bad idea.
            ## Here, this is toy code, and we're just exploring.
            pass
####################################
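
The snippet above is Python 2 (unichr, and '\n' compared against a byte
string).  For anyone on Python 3, here is a rough port as a sketch, not
a definitive version.  The function name newline_byte_chars and its
limit parameter are made up for illustration, and I'm assuming the
explicit little-endian codecs ('utf-16-le', 'utf-32-le') so that no
byte order mark gets mixed into the bytes being checked.

```python
def newline_byte_chars(encoding, limit=0x110000):
    """Yield each character whose encoded bytes include 0x0A.

    Hypothetical helper: the name and the `limit` parameter are
    illustrative and not part of the original post.
    """
    for i in range(limit):
        ch = chr(i)  # Python 3 spelling of unichr(i)
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            # Lone surrogates (U+D800..U+DFFF) cannot be encoded in Python 3.
            continue
        if b"\n" in encoded:
            yield ch


# Scan only the first 0x200 code points to keep the run short.
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    hits = list(newline_byte_chars(encoding, limit=0x200))
    print(encoding, [hex(ord(c)) for c in hits])
```

In UTF-8 only the newline character itself shows up, since every byte
of a multi-byte UTF-8 sequence is 0x80 or higher.  In the UTF-16 and
UTF-32 runs, U+010A shows up as well.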



> Hopefully, this makes the point clearer: we must not try to decode
> individual lines.  By that time, the damage has been done: the act of
> trying to break the file into lines by looking naively at newline byte
> characters is invalid when certain characters can themselves have
> newline characters.

Confusing last sentence.  Let me try that again.  Breaking the file
into lines by naively looking for newline bytes is invalid because
certain characters, once encoded, themselves contain the newline byte.
We've got to open the file with the right encoding in play, so that
the bytes are decoded before anything tries to split them into lines.
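
To make that concrete, here is a small sketch that writes a UTF-16 file
and reads it back both ways.  The scratch file name and the sample text
are assumptions chosen for the demo; io.open is used because it accepts
an encoding argument on both Python 2 and Python 3.

```python
import io
import os
import tempfile

# U+010A encodes in UTF-16 to a byte pair that includes 0x0A,
# the same byte value as an ASCII newline.
text = u"\u010a first line\nsecond line\n"

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with io.open(path, "w", encoding="utf-16", newline="") as f:
    f.write(text)

# Wrong: split the raw bytes on the newline byte, then decode each piece.
with io.open(path, "rb") as f:
    raw_pieces = f.read().split(b"\n")

# Right: let the file object decode first, then split into lines.
with io.open(path, "r", encoding="utf-16") as f:
    decoded_lines = f.read().splitlines()

print(len(raw_pieces))     # more pieces than there are real lines
print(len(decoded_lines))  # the two lines that were written
```

The byte-level split cuts U+010A in half, so the pieces no longer line
up with character boundaries and cannot be decoded reliably one at a
time.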


Joel Spolsky's article on "The Absolute Minimum Every Software
Developer Absolutely, Positively Must Know About Unicode and Character
Sets (No Excuses!)" needs to be referenced.   :P

    http://www.joelonsoftware.com/articles/Unicode.html

