[Python-Dev] [GSoC] Porting on RPM3

Nick Coghlan ncoghlan at gmail.com
Tue Mar 22 21:46:17 CET 2011


On Tue, Mar 22, 2011 at 7:29 PM, Panu Matilainen
<pmatilai at laiskiainen.org> wrote:
> The bindings cannot go changing header contents to their liking, so any
> canonicalization would have to go into rpm proper, the build-side of things
> to be exact so the runtime doesn't have to care. Requiring rpm to fiddle
> with encodings + canonicalization for every single string it processes at
> runtime would require enormous changes throughout rpm, and presumably at a
> massive performance cost too.

Just a hint from our experience with APIs like os/email/urllib.parse:
you pretty much end up *needing* to have parallel bytes and str APIs
(including higher-level data structures that know how to encode and
decode themselves) to get this to work properly. The str APIs will
work 90% of the time, but you still need access to the raw bytes to
recover when the simple approach fails. One key choice to be made is
whether to go the brittle option (i.e. ASCII) for the implicit
decoding, or the permissive one (i.e. UTF-8 with surrogateescape). The
former punts on the complicated encoding issues (e.g. urllib.parse
does this, since correctly formed URLs are meant to be encoded into
pure ASCII). The latter works by default in more situations, but can
allow malformed data to escape the IO layer and cause problems in
other parts of the program (e.g. many of the os APIs do this, since
real-world applications often care more about round-tripping correctly
between different OS interfaces).
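
As a rough illustration of the two choices, here is a minimal sketch
using only the built-in bytes/str codecs. The TagValue class and the
sample byte string are hypothetical, made up for this example, and are
not part of rpm or its bindings.

```python
# Hypothetical header value: Latin-1 encoded, so neither ASCII nor valid UTF-8.
raw = b"caf\xe9.rpm"

# Brittle option: strict ASCII decoding fails as soon as a non-ASCII byte appears.
try:
    raw.decode("ascii")
    ascii_failed = False
except UnicodeDecodeError:
    ascii_failed = True

# Permissive option: undecodable bytes become lone surrogates (PEP 383),
# and encoding with the same error handler restores the original bytes.
text = raw.decode("utf-8", errors="surrogateescape")
round_tripped = text.encode("utf-8", errors="surrogateescape")

# The cost: the escaped data can blow up later, far away from the IO layer,
# when some other code encodes the string strictly.
try:
    text.encode("utf-8")
    strict_encode_failed = False
except UnicodeEncodeError:
    strict_encode_failed = True


class TagValue:
    """Hypothetical wrapper offering parallel bytes and str access."""

    def __init__(self, raw_bytes):
        self.raw = raw_bytes  # always available for recovery

    def as_str(self, encoding="utf-8", errors="surrogateescape"):
        # Decoding is done on request and never alters the stored bytes.
        return self.raw.decode(encoding, errors)


value = TagValue(raw)
recovered = value.raw.decode("latin-1")  # caller knows better, uses raw bytes
```

The wrapper never modifies the stored bytes, so a caller who knows the
real encoding can still go back to value.raw when the default decoding
gives the wrong answer.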

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia

