[Python-Dev] [GSoC] Porting on RPM3
Panu Matilainen
pmatilai at laiskiainen.org
Tue Mar 22 10:29:25 CET 2011
On 03/22/2011 03:06 AM, David Malcolm wrote:
> [CCing Panu Matilainen, the maintainer of rpm, or, at least rpm 4.*,
> which is what all major distributions are using AIUI]
>
> On Mon, 2011-03-21 at 10:50 +0100, "Martin v. Löwis" wrote:
>> Am 21.03.2011 07:37, schrieb Prashant Kumar:
>>> Hello,
>>> My name is Prashant Kumar and I've worked on porting a few Python
>>> libraries (distutils2, configobj) and I've been looking at the ideas
>>> list for GSoC for a project related to porting.
>>>
>>> I came across [1] and found it interesting. It mentions that some
>
> Hi Prashant! Thanks for the interest.
>
> Panu: [1] is http://wiki.python.org/moin/RPMOnPython3 , a Google Summer
> of Code proposal to work on the Python 3 bindings to RPM.
>
>>> of the work has already been done; I would like to look at the code
>>> repository for the same, could someone provide me the link for the
>>> same?
>
>> Not so much the code but the person who did the porting. This was Dave
>> Malcolm (CC'ed); please get in touch with him. Please familiarize
>> yourself with the existing Python bindings (in the latest RPM 4 release
>> from rpm.org). You'll notice that this already has Python 3 support;
>> not sure whether that's the most recent code, though.
>
> Panu Matilainen also worked on the python 3 port of the librpm python
> bindings.
>
> For the rpm source code, see: http://rpm.org/wiki/GetSource (the python
> bindings are in a subdirectory of the main source tree).
>
> My initial patchbomb landed on the mailing list here:
> http://lists.rpm.org/pipermail/rpm-maint/2009-October/002528.html
> and Panu committed and fixed up the patches around then.
>
> My understanding is that the current status is that the bindings work,
> but all values that were formerly exposed to Python 2 as "str" are now
> exposed to Python 3 as "bytes", which would require changing all
> consumers of the code.
That's more or less where it stands.
> I believe Panu has also been working on a rewrite of the Python
> bindings, since the existing code is a little messy.
> Panu, am I remembering this correctly?
The python binding rewrite was abandoned (it just didn't work out for
various reasons) and usable bits merged into the existing bindings.
So yes, you're correct - there /was/ such a thing, but any new work should
go to the bindings that exist in the main rpm source tree.
> The idea is that these types are fundamentally string-like, but
> unfortunately rpm has always been a bit loose in its interpretation of
> the encoding of byte values in package files and package databases.
> There are millions of rpm files out there, and millions of rpm
> databases, and all of these are in _some_ encoding. I have seen
> specfiles in which parts of the file were encoded in UTF-8 and other
> parts were encoded in Latin-1 (this broke one of my python scripts
> horribly).
More precisely, rpm is not being "a bit loose" about encoding: it simply
doesn't know diddly about encodings and does not make any assumptions or
interpretations about them. A string in rpm is just a sequence of
arbitrary non-zero byte values terminated with \0.
> Martin and I discussed this last week at PyCon. I believe the proposal
> that we came up with was:
> - try to interpret bytes as UTF-8, using the "surrogateescape"
> mechanism, so that if it fails, we can at least preserve the exact bytes
> and round-trip
Right, based on a quick skim of the surrogateescape PEP, that seems like
a reasonable approach (rpm is much like the traditional POSIX interfaces,
which simply do not deal with encoding at all).
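
To illustrate the round-trip property being discussed, here is a minimal
Python 3 sketch. It is not what the rpm bindings do; the helper names
decode_header and encode_header and the sample value are hypothetical.
Only the "surrogateescape" error handler itself comes from Python.

```python
# Hypothetical header value: valid UTF-8 text followed by a stray
# Latin-1 byte (0xE9), similar to the mixed-encoding data seen in the wild.
raw = "café".encode("utf-8") + b" \xe9"


def decode_header(data: bytes) -> str:
    """Decode as UTF-8; undecodable bytes become lone surrogates."""
    return data.decode("utf-8", errors="surrogateescape")


def encode_header(text: str) -> bytes:
    """Reverse of decode_header; lone surrogates become the original bytes."""
    return text.encode("utf-8", errors="surrogateescape")


text = decode_header(raw)
roundtrip = encode_header(text)

# The invalid byte 0xE9 is carried as the lone surrogate U+DCE9, so the
# exact original bytes come back out even though the input was not UTF-8.
print(repr(text))
print(roundtrip == raw)
```

Note that a string containing such surrogates cannot be encoded with the
default strict handler, so consumers that write the value back out would
need to use the same error handler.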
> Ultimately, this does mean trying to impose some kind of encoding
> standard on rpm files and rpm databases, which I think would be a Good
> Thing, but is perhaps something of scope creep compared to what the
> proposal at [1] says. See e.g. http://rpm.org/ticket/30
Note that any rpm-forced encoding standard could only affect new
packages, but rpm and the bindings need to be able to deal with all the
junk out in the wild pretty much forever.
>
> Other ideas that occur:
> - does rpmlint check for encoding yet?
IIRC rpmlint can (probably depending on config) check the encoding of
the paths and the spec itself. However, this still doesn't guarantee that
all the string data in the header is UTF-8, as practically any part(s) of
the data can come from macros, which are not encoding-aware either.
> - what to do e.g. about canonicalization? What happens if one rpm
> provides a feature named "café" (where the "é" is U+00E9) and another rpm
> requires a feature named "café" (where the "é" is U+0065 LATIN SMALL
> LETTER E + U+0301 COMBINING ACUTE ACCENT)? IIRC we ruled that rpms in
> Fedora had to have ASCII names, and I'm guessing this applies to
> metadata, but we do allow UTF-8 filenames within package payloads
> (again, IIRC)
Ouch. Did I already mention that UTF and the encoding business makes my
head hurt? I guess I didn't, can't think straight because by now I have
that headache...
Anyway, pretty much all rules in this area are distro specific, as rpm
doesn't enforce anything wrt encoding.
The bindings cannot go changing header contents to their liking, so any
canonicalization would have to go into rpm proper, the build-side of
things to be exact so the runtime doesn't have to care. Requiring rpm to
fiddle with encodings + canonicalization for every single string it
processes at runtime would require enormous changes throughout rpm, and
presumably at a massive performance cost too.
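
To make the canonicalization question concrete, here is a small Python 3
sketch of the "café" case from the quoted text. It only shows why the two
spellings fail a byte-wise comparison and how Unicode normalization would
reconcile them; it assumes NFC as the target form, which is a choice made
for this example and not anything rpm does.

```python
import unicodedata

# The two spellings from the example: precomposed U+00E9 versus
# "e" (U+0065) followed by U+0301 COMBINING ACUTE ACCENT.
composed = "caf\u00e9"
decomposed = "cafe\u0301"

# They render the same but differ both as strings and as UTF-8 bytes,
# so a byte-wise provides/requires match would fail.
same_text = composed == decomposed
same_bytes = composed.encode("utf-8") == decomposed.encode("utf-8")

# Normalizing both sides to one form (NFC here, by assumption)
# makes them compare equal.
same_after_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(same_text, same_bytes, same_after_nfc)
```

Doing this once at build time, as suggested above, would let runtime
comparisons stay plain byte comparisons.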
- Panu -