[Python-Dev] PEP 383: Non-decodable Bytes in System Character Interfaces
norseman at hughes.net
Fri May 1 01:02:51 CEST 2009
Martin v. Löwis wrote:
>>>> How do I get a printable Unicode version of these path strings if they
>>>> contain non-Unicode data?
>>> Define "printable". One way would be to use a regular expression,
>>> replacing all codes in a certain range with a question mark.
>> What I mean by printable is that the string must be valid unicode
>> that I can print to a UTF-8 console or place as text in a UTF-8
>> web page.
>> I think your PEP gives me a string that will not encode to the
>> valid UTF-8 that the world outside of Python likes. Did I get this
>> point wrong?
> You are right. However, if your *only* requirement is that it should
> be printable, then this is fairly underspecified. One way to get
> a printable string would be this function
> def printable_string(unprintable):
>     return ""
> This will always return a printable version of the input string...
No, it will not.
It will return an empty string, and that prints nothing on paper, on
screen or anywhere else. If you get the cases where none of the bytes
translate or print locally, then you get nothing out. Every name
collapses to the same result, so duplicate file names abound too.
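For what it is worth, here is a minimal sketch of the "replace a certain range with a question mark" idea quoted earlier. It assumes the undecodable bytes show up as the lone surrogates U+DC80..U+DCFF, which is what the error handler that shipped in Python 3.1 as "surrogateescape" produces. The name printable_name is made up for illustration and is not part of the PEP.

```python
import re

# Assumption: undecodable bytes 0x80..0xFF were smuggled into the string
# as lone surrogates U+DC80..U+DCFF by the "surrogateescape" handler.
_ESCAPED = re.compile("[\udc80-\udcff]")


def printable_name(name, placeholder="?"):
    """Return a copy of name that is safe to encode as strict UTF-8."""
    return _ESCAPED.sub(placeholder, name)


raw = b"caf\xe9.txt"                            # Latin-1 bytes, not valid UTF-8
name = raw.decode("utf-8", "surrogateescape")   # 'caf\udce9.txt'
shown = printable_name(name)
print(shown)                                    # caf?.txt
```

The result is lossy: two different broken names can map to the same printable one, so it is only good for display, not for opening the file again.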
>> In our application we are running Fedora with the assumption that the
>> filenames are UTF-8. When Windows systems FTP files to our system
>> the files are in CP-1251(?) and not valid UTF-8.
> That would be a bug in your FTP server, no? If you want all file names
> to be UTF-8, then your FTP server should arrange for that.
Which seems to be exactly what he's trying to do.
>> Having an algorithm that says if it's a string, no problem; if it's
>> a byte, deal with the exceptions, seems simple.
>> How do I do this detection with the PEP proposal?
If no one has an 'elegant' solution, toss the PEP and do what has to be
done. I find the classroom is seldom related to reality.
>> Do I end up using the byte interface and doing the utf-8 decode
> No, you should encode using the "strict" error handler, with the
> locale encoding. If the file name encodes successfully, it's correct,
> otherwise, it's broken.
Exactly his problem to solve. How does he fix the broken ones?
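The detection step suggested in the quoted text can be sketched roughly like this. It assumes the file name was decoded with the "surrogateescape" handler (as in Python 3.1 and later); is_clean_name and original_bytes are hypothetical helper names, and the default encoding is taken from sys.getfilesystemencoding() as a stand-in for "the locale encoding".

```python
import sys


def is_clean_name(name, encoding=None):
    """True if name encodes with the strict handler, False if it is broken."""
    encoding = encoding or sys.getfilesystemencoding()
    try:
        name.encode(encoding, "strict")
    except UnicodeEncodeError:
        return False
    return True


def original_bytes(name, encoding=None):
    """Recover the exact bytes the OS handed over, broken or not."""
    encoding = encoding or sys.getfilesystemencoding()
    return name.encode(encoding, "surrogateescape")


broken = b"caf\xe9".decode("utf-8", "surrogateescape")
print(is_clean_name("report.txt", "utf-8"))   # True
print(is_clean_name(broken, "utf-8"))         # False
print(original_bytes(broken, "utf-8"))        # b'caf\xe9'
```

This only tells you that a name is broken. What to do with it afterwards is still your problem.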
First: See if the sender(s) will use a different "font". :)
I would suggest you read raw bytes and handle the problem in
the usual logical way. (Translate what you can; if it looks readable,
keep it, otherwise send it back if possible.) If you have to keep a
junked-up name, try using a thesaurus or soundex (I know I spelled that
wrong) to help keep the meaning/sound of the file name. If the name is
one of those computer-generated gobbledygooks, build a translation
table to use for incoming names and for getting back to the original
bit patterns later. Your name won't be the same but ... Plug it into
that handy utility you just wrote and you can talk much more
effectively with the sender.
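A rough sketch of "read raw bytes and translate what you can": try a list of candidate encodings in order and remember which one worked. The candidate list is an assumption (the thread itself is unsure whether the sender uses CP-1251), and decode_name is a made-up name.

```python
def decode_name(raw, candidates=("utf-8", "cp1251")):
    """Return (text, encoding) for the first candidate that decodes raw.

    Single-byte code pages such as cp1251 accept almost any byte
    sequence, so a successful decode there is only a guess.
    """
    for enc in candidates:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    # Nothing fit: keep the ASCII part, mark the rest with U+FFFD.
    return raw.decode("ascii", "replace"), None


hello = "\u041f\u0440\u0438\u0432\u0435\u0442"      # Russian "Privet"
print(decode_name(hello.encode("utf-8"))[1])        # utf-8
print(decode_name(hello.encode("cp1251"))[1])       # cp1251
```

Put the strictest encoding first. UTF-8 rejects most text written in other encodings, so it makes a decent first filter.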
If you can get the code page specs (CP-1251 or whatever) you
can be well ahead of the game. There are programs out there that will
convert (for better or worse) between code pages. Some work in-line.
Watch out for Python's print function not being completely compatible
with reality. The high-bit bytes (0x80-0xFF) beyond 7-bit ASCII have
been in use for quite some time and are (or at least are supposed to
be) part of the page-to-page translation specs. You will probably need
to know (or make a close guess at) the 'from' language to get plausible
results. If the files are coming across the Pacific it might be a good
time to form a collaboration. (A case of: we agree that 'that' bit
pattern in your filename will become 'this' in ours. Reversal required,
as in A becomes C incoming and C becomes A outgoing.)
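The "A becomes C incoming and C becomes A outgoing" agreement could be sketched as a pair of byte translation tables. The function name build_tables and the sample mapping are hypothetical. The check that the mapping is a permutation is my own addition, because without it the outgoing table cannot always restore the original bytes.

```python
def build_tables(pairs):
    """Build (incoming, outgoing) tables from a {source_byte: target_byte} dict.

    The mapping must be a permutation of the bytes it mentions;
    otherwise a round trip could not restore every original byte.
    """
    src = bytes(pairs.keys())
    dst = bytes(pairs.values())
    if set(src) != set(dst) or len(set(dst)) != len(dst):
        raise ValueError("mapping must be a one-to-one permutation")
    return bytes.maketrans(src, dst), bytes.maketrans(dst, src)


# Hypothetical agreement: swap the bytes for "A" and "C".
incoming, outgoing = build_tables({0x41: 0x43, 0x43: 0x41})

stored = b"ABC".translate(incoming)
print(stored)                        # b'CBA'
print(stored.translate(outgoing))    # b'ABC'
```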
Note: Different machines store things differently. Intel stores the high
byte last (little-endian), Sun stores it first (big-endian). It can be
handy to know the machinery. Network transport programs are supposed to
send in Sun (big-endian) order; not all do.
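A small illustration of the byte order point, using only the standard struct module and the UTF-16 codecs. Byte order matters for multi-byte units such as UTF-16 code units. Single-byte code pages and UTF-8 are byte streams and are not affected.

```python
import struct
import sys

value = 0x0102
little = struct.pack("<H", value)    # Intel order: low byte first
big = struct.pack(">H", value)       # Sun order: high byte first
network = struct.pack("!H", value)   # network order, same as big-endian

print(sys.byteorder)                 # byte order of this machine
print(little, big, network)

# The same effect on text: one character, two possible byte layouts.
utf16_le = "A".encode("utf-16-le")   # b'A\x00'
utf16_be = "A".encode("utf-16-be")   # b'\x00A'
print(utf16_le, utf16_be)
```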