[Python-Dev] [GSoC] Porting on RPM3

Panu Matilainen pmatilai at laiskiainen.org
Tue Mar 22 10:29:25 CET 2011


On 03/22/2011 03:06 AM, David Malcolm wrote:
> [CCing Panu Matilainen, the maintainer of rpm, or, at least rpm 4.*,
> which is what all major distributions are using AIUI]
>
> On Mon, 2011-03-21 at 10:50 +0100, "Martin v. Löwis" wrote:
>> Am 21.03.2011 07:37, schrieb Prashant Kumar:
>>> Hello,
>>>      My name is Prashant Kumar and I've worked on porting a few Python
>>> libraries (distutils2, configobj) and I've been looking at the ideas
>>> list for GSoC for a project related to porting.
>>>
>>>      I came across [1]  and found it interesting. It mentions that some
>
> Hi Prashant!  Thanks for the interest.
>
> Panu: [1] is http://wiki.python.org/moin/RPMOnPython3 , a Google Summer
> of Code proposal to work on the Python 3 bindings to RPM.
>
>>> of the work has already been done; I would like to look at the code
>>> repository for the same, could someone provide me the link for the
>>> same?
>
>> Not so much the code but the person who did the porting. This was Dave
>> Malcolm (CC'ed); please get in touch with him. Please familiarize
>> yourself with the existing Python bindings (in the latest RPM 4 release
>> from rpm.org). You'll notice that this already has Python 3 support;
>> not sure whether that's the most recent code, though.
>
> Panu Matilainen also worked on the python 3 port of the librpm python
> bindings.
>
> For the rpm source code, see: http://rpm.org/wiki/GetSource  (the python
> bindings are in a subdirectory of the main source tree).
>
> My initial patchbomb landed on the mailing list here:
>    http://lists.rpm.org/pipermail/rpm-maint/2009-October/002528.html
> and Panu committed and fixed up the patches around then.
>
> My understanding is that the current status is that the bindings work,
> but all values that were formerly exposed to Python 2 as "str" are now
> exposed to Python 3 as "bytes", which would require changing all
> consumers of the code.

That's more or less where it stands.

> I believe Panu has also been working on a rewrite of the Python
> bindings, since the existing code is a little messy.
> Panu, am I remembering this correctly?

The python binding rewrite was abandoned (it just didn't work out for 
various reasons) and usable bits merged into the existing bindings.
So yes you're correct - there /was/ such a thing but any new work should 
go to the bindings that exist in the main rpm source tree.

> The idea is that these types are fundamentally string-like, but
> unfortunately rpm has always been a bit loose in its interpretation of
> the encoding of byte values in package files and package databases.
> There are millions of rpm files out there, and millions of rpm
> databases, and all of these are in _some_ encoding.  I have seen
> specfiles in which parts of the file were encoded in UTF-8 and other
> parts were encoded in Latin-1 (this broke one of my python scripts
> horribly).

More precisely, it's not being "a bit loose" about encoding, rpm simply 
doesn't know diddly about encodings and does not make any assumptions or 
interpretations about them. A string in rpm is just a sequence of 
arbitrary non-zero byte values terminated with \0.

> Martin and I discussed this last week at PyCon.  I believe the proposal
> that we came up with was:
>    - try to interpret bytes as UTF-8, using the "surrogateescape"
> mechanism, so that if it fails, we can at least preserve the exact bytes
> and round-trip

Right, based on a quick skim of the surrogateescape PEP, that seems like 
a reasonable approach (rpm is much like the traditional POSIX interfaces 
which simply do not deal with encoding at all).
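
As a minimal sketch of that idea in plain Python 3 (this is not the rpm
bindings themselves, and the sample byte string is made up for
illustration), decoding with the "surrogateescape" error handler keeps
undecodable bytes recoverable:

```python
# Hypothetical header value: valid UTF-8 followed by a stray Latin-1 byte.
raw = "caf\u00e9".encode("utf-8") + b" \xe9"

# Undecodable bytes become lone surrogates (U+DC80..U+DCFF) instead of
# raising UnicodeDecodeError.
text = raw.decode("utf-8", errors="surrogateescape")

# Encoding with the same handler restores the original bytes exactly.
roundtrip = text.encode("utf-8", errors="surrogateescape")

print(ascii(text))
print(roundtrip == raw)
```

The catch is that such a string cannot be encoded with the default
strict handler, so consumers that write the value out need to use
"surrogateescape" as well.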

> Ultimately, this does mean trying to impose some kind of encoding
> standard on rpm files and rpm databases, which I think would be a Good
> Thing, but is perhaps something of scope creep compared to what the
> proposal at [1] says.  See e.g. http://rpm.org/ticket/30

Note that any rpm-enforced encoding standard could only affect new 
packages, but rpm and the bindings need to be able to deal with all the 
junk out in the wild pretty much forever.

>
> Other ideas that occur:
>    - does rpmlint check for encoding yet?

IIRC rpmlint can (probably depending on config) check the encoding of 
the paths and the spec itself. However, this still doesn't guarantee that 
all the string data in the header is UTF-8, as practically any part of the 
data can come from macros, which are not encoding-aware either.

>    - what to do e.g. about canonicalization?  What happens if one rpm
> provides a feature named "café" (where the "é" is U+00E9) and another rpm
> requires a feature named "café" (where the "é" is U+0065 LATIN SMALL
> LETTER E + U+0301 COMBINING ACUTE ACCENT)?  IIRC we ruled that rpms in
> Fedora had to have ASCII names, and I'm guessing this applies to
> metadata, but we do allow UTF-8 filenames within package payloads
> (again, IIRC)

Ouch. Did I already mention that UTF and the encoding business makes my 
head hurt? I guess I didn't, can't think straight because by now I have 
that headache...

Anyway, pretty much all rules in this area are distro specific, as rpm 
doesn't enforce anything wrt encoding.

The bindings cannot go changing header contents to their liking, so any 
canonicalization would have to go into rpm proper, the build-side of 
things to be exact so the runtime doesn't have to care. Requiring rpm to 
fiddle with encodings + canonicalization for every single string it 
processes at runtime would require enormous changes throughout rpm, and 
presumably at a massive performance cost too.
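
For reference, the two spellings of "café" from the quoted example
compare unequal as Python 3 strings until they are normalized. A small
sketch using the standard unicodedata module (the choice of NFC is only
an assumption for illustration; nothing in rpm does this):

```python
import unicodedata

precomposed = "caf\u00e9"   # U+00E9 LATIN SMALL LETTER E WITH ACUTE
decomposed = "cafe\u0301"   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two strings are different code point sequences.
same_raw = precomposed == decomposed

# After NFC normalization both collapse to the precomposed form.
same_nfc = (unicodedata.normalize("NFC", precomposed)
            == unicodedata.normalize("NFC", decomposed))

# At the byte level the UTF-8 encodings differ too.
same_bytes = precomposed.encode("utf-8") == decomposed.encode("utf-8")

print(same_raw, same_nfc, same_bytes)
```

Since the two names differ at the byte level, a plain byte comparison
would treat them as different names unless something normalized them
at build time.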

	- Panu -


