[Python-Dev] Filename as byte string in python 2.6 or 3.0?
Ulrich Eckhardt
eckhardt at satorlaser.com
Mon Sep 29 13:59:06 CEST 2008
On Monday 29 September 2008, M.-A. Lemburg wrote:
> On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> > 1. For POSIX platforms (using a byte string for the path):
> > Here, the first approach is to convert the path to Unicode, according to
> > the locale's CTYPE category. Hopefully, it will be UTF-8, but other
> > codepages should work too. If there is a segment (a byte sequence between two
> > path separators) where it doesn't work, it uses an ASCII mapping where
> > possible and codepoints from the "Private Use Area" (PUA) of Unicode for
> > the non-decodable bytes.
> > In order to pass this path to fopen(), each segment would be converted to
> > a byte string again, using the locale's CTYPE category except for
> > segments which use the PUA where it simply encodes the original bytes.
>
> I'm not sure how this would work. How would you map the private use
> code points back to bytes? Using a special codec that knows about
> these code points? How would the fopen() know to use that special
> codec instead of e.g. the UTF-8 codec?
Sorry, I wasn't clear enough. I'll try to explain further...
Let's assume we have a filename like this:
0xc2 0xa9 0x2f 0x7f
The first two bytes are the copyright sign encoded in UTF-8, followed by a
slash (0x2f, path separator) and a character encoded in an unknown codepage
(0x7f is the DEL control character, not printable ASCII!). The first thing to
do when receiving that path from the system would be to split it into segments.
Here we would get two of them, one with 0xc2 0xa9 and the other with 0x7f. This
relies on the separator (slash/0x2f) being rather universal. (Note: I'm not sure
about encodings like BIG5, i.e. ones that are neither UTF-8 nor derived from
ASCII.)
For each segment, we would apply the locale's CTYPE facet and get the Unicode
codepoint 0xa9 for the first segment, while the second one fails to convert.
So, for the second one, we simply check each byte to see whether it is valid,
printable ASCII (0x7f isn't). If it is, we emit the byte as the Unicode
codepoint with the same value. Otherwise, we map it into the PUA.
The PUA reserves 0xe000 to 0xf8ff for private use. I would simply encode the
byte 0x7f as 0xe07f, i.e. map it to the beginning of that range. In the end,
we would have the following Unicode codepoints:
0xa9, 0x2f, 0xe07f
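The decoding direction described above could be sketched roughly as follows. This is only an illustration under a few assumptions: the names `decode_segment`, `decode_path` and `PUA_BASE` are made up here, the locale's encoding is passed in as a parameter (defaulting to UTF-8) instead of being read from the CTYPE category, and the example uses 0xff as the undecodable byte because UTF-8 rejects it unconditionally.

```python
PUA_BASE = 0xE000  # assumed offset: byte b maps to codepoint 0xE000 + b


def decode_segment(segment: bytes, encoding: str = "utf-8") -> str:
    """Decode one path segment, escaping it per byte if decoding fails."""
    try:
        return segment.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: keep printable ASCII, map every other byte into the PUA.
        return "".join(
            chr(b) if 0x20 <= b <= 0x7E else chr(PUA_BASE + b)
            for b in segment
        )


def decode_path(path: bytes, encoding: str = "utf-8") -> str:
    """Split on the separator, decode each segment, and join again."""
    return "/".join(decode_segment(s, encoding) for s in path.split(b"/"))


# First segment is valid UTF-8, second one is not.
print([hex(ord(ch)) for ch in decode_path(b"\xc2\xa9/\xff")])
# prints ['0xa9', '0x2f', '0xe0ff']
```

Note that a segment is either decoded as a whole or escaped as a whole; the sketch never mixes both within one segment.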
When converting to a byte string for use with fopen(), we inspect the
supplied string again. If a segment contains elements of the PUA, we
reverse the mapping for those and emit the others in that segment as-is (they
are plain ASCII). For all other segments, we apply the CTYPE conversion.
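A minimal sketch of this reverse step, again with made-up names (`encode_segment`, `encode_path`, `PUA_BASE`, `PUA_LAST`) and with the locale's encoding passed in as a parameter rather than taken from the CTYPE category. It assumes that bytes were mapped to 0xE000 + byte, so only the first 256 PUA codepoints count as escapes.

```python
PUA_BASE = 0xE000  # assumed offset: byte b was mapped to 0xE000 + b
PUA_LAST = 0xE0FF  # only 256 codepoints are needed to escape single bytes


def _is_escape(ch: str) -> bool:
    return PUA_BASE <= ord(ch) <= PUA_LAST


def encode_segment(segment: str, encoding: str = "utf-8") -> bytes:
    """Encode one segment, undoing the PUA escaping where it was used."""
    if any(_is_escape(ch) for ch in segment):
        # Escaped segment: reverse the mapping, pass the ASCII rest through.
        # bytes() raises ValueError if some other character is above 0xFF,
        # which cannot happen for strings produced by the scheme itself.
        return bytes(
            ord(ch) - PUA_BASE if _is_escape(ch) else ord(ch)
            for ch in segment
        )
    return segment.encode(encoding)


def encode_path(path: str, encoding: str = "utf-8") -> bytes:
    """Split on the separator, encode each segment, and join again."""
    return b"/".join(encode_segment(s, encoding) for s in path.split("/"))


print(encode_path("\u00a9/\ue0ff"))  # prints b'\xc2\xa9/\xff'
```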
Notes:
* This effectively converts the current path representation (a string) into a
sequence of segments where each segment can either be a fully Unicode-capable
string or a raw byte string without any known interpretation. However,
instead of using an array for that, it uses a string, which is what most
people's code expects anyway.
* You could also work byte by byte instead of splitting the path into segments
first. I just assumed that a single segment will not contain valid UTF-8
sequences mixed with invalid ones. A path however can contain both correctly
and incorrectly encoded segments.
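For comparison, the byte-wise variant from the second note could be sketched with a custom codec error handler, so that only the undecodable bytes are escaped and valid sequences in the same segment are decoded normally. The handler name "pua-escape" and the helper names are invented for this sketch; `codecs.register_error` is the standard-library hook used to install it, and UTF-8 stands in for the locale's encoding.

```python
import codecs

PUA_BASE = 0xE000  # assumed offset: byte b maps to codepoint 0xE000 + b


def _pua_escape(exc):
    """Error handler: replace undecodable bytes with PUA codepoints."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    # Return the replacement text and the position to resume decoding at.
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


codecs.register_error("pua-escape", _pua_escape)


def decode_bytewise(path: bytes, encoding: str = "utf-8") -> str:
    """Decode the whole path at once, escaping only the bad bytes."""
    return path.decode(encoding, errors="pua-escape")


# Valid and invalid bytes mixed within the first segment.
print([hex(ord(ch)) for ch in decode_bytewise(b"\xc2\xa9\xff/\xff")])
# prints ['0xa9', '0xe0ff', '0x2f', '0xe0ff']
```

The trade-off is that the encoding direction can then no longer decide per segment; it has to handle escaped and regular characters side by side.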
> BTW: Private use areas in Unicode are meant for e.g. company specific
> code points. Using them for escaping purposes is likely to cause problems
> due to assignment clashes.
I'm not sure if the use I proposed is correct according to the intended use of
the PUA. I know that ideally no such string would escape from Python, i.e. it
should only be visible internally. I would guess that that is something the
PUA was intended for.
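The clash mentioned in the quoted objection can be made concrete with a small sketch. It assumes the escaping scheme maps byte b to 0xE000 + b and uses a hypothetical `encode_segment` helper for the reverse direction; the file name is one that legitimately contains the PUA codepoint 0xe0ff, correctly encoded in UTF-8.

```python
PUA_BASE, PUA_LAST = 0xE000, 0xE0FF  # assumed escape range


def encode_segment(segment: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical reverse mapping: PUA codepoints become raw bytes."""
    if any(PUA_BASE <= ord(ch) <= PUA_LAST for ch in segment):
        return bytes(
            ord(ch) - PUA_BASE if PUA_BASE <= ord(ch) <= PUA_LAST else ord(ch)
            for ch in segment
        )
    return segment.encode(encoding)


original = "\ue0ff".encode("utf-8")   # a name that really uses U+E0FF
text = original.decode("utf-8")       # decodes cleanly, nothing is escaped
round_tripped = encode_segment(text)  # but encoding sees an escape marker

print(original, round_tripped)  # prints b'\xee\x83\xbf' b'\xff'
```

The name no longer round-trips, so the scheme is only safe as long as no real file name uses codepoints from the escape range.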
Uli