[Python-Dev] Filename as byte string in python 2.6 or 3.0?

Mon Sep 29 13:59:06 CEST 2008

On Monday 29 September 2008, M.-A. Lemburg wrote:
> On 2008-09-29 12:50, Ulrich Eckhardt wrote:
> > 1. For POSIX platforms (using a byte string for the path):
> > Here, the first approach is to convert the path to Unicode, according to
> > the locale's CTYPE category. Hopefully, it will be UTF-8, but also
> > codepages should work. If there is a segment (a byte sequence between two
> > path separators) where it doesn't work, it uses an ASCII mapping where
> > possible and codepoints from the "Private Use Area" (PUA) of Unicode for
> > the non-decodable bytes.
> > In order to pass this path to fopen(), each segment would be converted to
> > a byte string again, using the locale's CTYPE category except for
> > segments which use the PUA where it simply encodes the original bytes.
>
> I'm not sure how this would work. How would you map the private use
> code points back to bytes ? Using a special codec that knows about
> these code points ? How would the fopen() know to use that special
> codec instead of e.g. the UTF-8 codec ?

Sorry, I wasn't clear enough. I'll try to explain further...

Let's assume we have a filename like this:

  0xc2 0xa9 0x2f 0x7f

The first two bytes are the copyright sign encoded in UTF-8, followed by a
slash (0x2f, the path separator) and a character encoded in an unknown
codepage (0x7f is the ASCII DEL control character, not a printable one). The
first step when receiving that path from the system would be to split it into
segments. Here we would get two of them, one with 0xc2 0xa9 and the other
with 0x7f. This relies on the separator (slash/0x2f) being rather universal
(Note: I'm not sure about encodings like BIG5, i.e. ones that are neither
UTF-8 nor derived from ASCII).

For each segment, we would apply the conversion for the locale's CTYPE
category and get the Unicode codepoint 0xa9 for the first segment, while the
second one fails to convert. So, for the second one, we check for each byte
whether it is valid and printable ASCII (0x7f isn't). If it is, we emit the
byte as the Unicode codepoint of the same value. Otherwise, we map it to the
PUA.

The PUA reserves 0xe000 to 0xf8ff for private uses. I would simply encode the 
byte 0x7f as 0xe07f, i.e. map it to the beginning of that range. Eventually, 
we would end up with the following Unicode codepoints:

  0xa9, 0x2f, 0xe07f
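
The decoding direction could be sketched roughly as follows. This is only an
illustration under assumptions: the names decode_segment and decode_path are
made up, and the locale's CTYPE encoding is passed in as a parameter that
defaults to UTF-8. Note that a lone 0x7f byte decodes without error in UTF-8,
so the usage example adds a 0xff byte to force the fallback.

```python
PUA_BASE = 0xE000  # start of the Private Use Area (0xe000..0xf8ff)


def decode_segment(segment: bytes, encoding: str = "utf-8") -> str:
    """Decode one path segment, falling back to the ASCII/PUA mapping."""
    try:
        return segment.decode(encoding)
    except UnicodeDecodeError:
        # Fallback: printable ASCII bytes keep their value, every other
        # byte is moved to PUA_BASE + byte.
        return "".join(
            chr(b) if 0x20 <= b < 0x7F else chr(PUA_BASE + b)
            for b in segment
        )


def decode_path(path: bytes, encoding: str = "utf-8") -> str:
    """Split on the separator and decode each segment independently."""
    return "/".join(decode_segment(s, encoding) for s in path.split(b"/"))


# The first segment decodes normally, the second one uses the fallback.
decoded = decode_path(b"\xc2\xa9/a\x7f\xff")
print([hex(ord(ch)) for ch in decoded])
```

In the fallback segment the byte 0x7f becomes 0xe07f, as described above,
while the printable 'a' is passed through unchanged.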

When converting to a byte string for use with fopen(), we inspect the
supplied string again, segment by segment. If a segment contains codepoints
from the PUA, we reverse the mapping for those and leave the other characters
in that segment as-is. For all other segments, we apply the CTYPE conversion.
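
A rough sketch of this reverse direction, again with made-up names
(encode_segment, encode_path) and with the CTYPE encoding assumed to be
UTF-8. It also assumes that only the first 256 PUA codepoints are used as
byte escapes, and it rejects escaped segments that contain other non-ASCII
characters, which is a choice of this sketch rather than part of the
proposal.

```python
PUA_BASE = 0xE000            # start of the Private Use Area
PUA_LAST = PUA_BASE + 0xFF   # assumption: one escape codepoint per byte value


def encode_segment(segment: str, encoding: str = "utf-8") -> bytes:
    """Encode one path segment, undoing the PUA mapping where present."""
    if not any(PUA_BASE <= ord(ch) <= PUA_LAST for ch in segment):
        return segment.encode(encoding)  # ordinary segment: CTYPE conversion
    out = bytearray()
    for ch in segment:
        cp = ord(ch)
        if PUA_BASE <= cp <= PUA_LAST:
            out.append(cp - PUA_BASE)    # restore the original raw byte
        elif cp < 0x80:
            out.append(cp)               # ASCII is left as-is
        else:
            raise ValueError("escaped segment contains non-ASCII text")
    return bytes(out)


def encode_path(path: str, encoding: str = "utf-8") -> bytes:
    """Encode each segment independently and rejoin with the separator."""
    return b"/".join(encode_segment(s, encoding) for s in path.split("/"))


# The codepoints 0xa9, 0x2f, 0xe07f from the example above.
print(encode_path("\xa9/\ue07f"))
```

For the codepoints from the example this yields the original four bytes
0xc2 0xa9 0x2f 0x7f again.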

Notes:
* This effectively converts the current path representation (a string) into a 
sequence of segments where each segment can either be a fully Unicode-capable 
string or a raw byte string without any known interpretation. However, 
instead of using an array for that, it uses a string, which is what most 
people's code expects anyway.
* You could also work byte by byte instead of splitting the path into
segments first. I just assumed that a single segment will not contain valid
UTF-8 sequences mixed with invalid ones. A path, however, can contain both
correctly and incorrectly encoded segments.
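
The byte-wise variant from the second note might look like the sketch below.
It uses Python's codecs.register_error() to install a decode error handler;
the handler name "puaescape" and the function names are invented for this
illustration, and UTF-8 again stands in for the locale's encoding. Unlike the
per-segment approach, only the bytes that actually fail to decode are moved
to the PUA.

```python
import codecs

PUA_BASE = 0xE000  # start of the Private Use Area


def _pua_escape(exc):
    """Map each undecodable byte to PUA_BASE + byte and resume after it."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc  # this sketch only handles decoding errors
    bad = exc.object[exc.start:exc.end]
    return "".join(chr(PUA_BASE + b) for b in bad), exc.end


# "puaescape" is a hypothetical handler name chosen for this example.
codecs.register_error("puaescape", _pua_escape)


def decode_bytewise(path: bytes, encoding: str = "utf-8") -> str:
    """Decode the whole path at once, escaping only the invalid bytes."""
    return path.decode(encoding, "puaescape")


decoded = decode_bytewise(b"\xc2\xa9/a\xff")
print([hex(ord(ch)) for ch in decoded])
```

Here valid and invalid sequences may be mixed freely, even within a single
segment.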

> BTW: Private use areas in Unicode are meant for e.g. company specific
> code points. Using them for escaping purposes is likely to cause problems
> due to assignment clashes.

I'm not sure if the use I proposed is correct according to the intended use of 
the PUA. I know that ideally no such string would escape from Python, i.e. it 
should only be visible internally. I would guess that that is something the 
PUA was intended for.
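
To make the possible clash concrete, here is a small sketch. It assumes
UTF-8 and the mapping byte -> 0xe000 + byte from above; the variable names
are only for illustration. A filename that legitimately contains the
codepoint 0xe07f decodes cleanly, but looks exactly like an escaped 0x7f
byte afterwards, so reversing the mapping would not give back the original
bytes.

```python
PUA_BASE = 0xE000  # start of the Private Use Area

# A filename that really contains the private-use codepoint 0xe07f,
# correctly encoded as UTF-8.
original = b"\xee\x81\xbf"

# It decodes without error, so no escaping takes place ...
text = original.decode("utf-8")

# ... but when converting back, the codepoint is indistinguishable from an
# escaped byte, and the reverse mapping produces a different filename.
reversed_bytes = bytes(ord(ch) - PUA_BASE for ch in text)

print(hex(ord(text)), original, reversed_bytes)
```

So either such filenames would have to be escaped on the way in as well, or
the clash has to be accepted as a limitation.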

Uli

-- 
Sator Laser GmbH
Managing Director: Thorsten Föcking, Amtsgericht Hamburg HR B62 932
