[Distutils] Proposal: Restrict the characters in a project name

Daniel Holth dholth at gmail.com
Wed May 15 16:31:56 CEST 2013


How to avoid confusables.

These scripts are recommended for use in identifiers:
http://www.unicode.org/reports/tr31/#Table_Recommended_Scripts

This report details a confusables detection algorithm:
http://www.unicode.org/reports/tr39/#Confusable_Detection

And ICU implements it:
http://www.icu-project.org/apiref/icu4c/uspoof_8h.html (see also
PyICU).

The package index would enforce uniqueness of the "skeleton" of each
registered package name. The skeleton is just an internal normalization
based on confusability: if skeleton(identifier1) == skeleton(identifier2),
then identifier1 and identifier2 are confusable.
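
A rough sketch of that check, to make the idea concrete. It follows the
TR39 shape (NFD, map each character to its prototype, NFD again), but the
mapping table is a tiny hand-picked stand-in for the real confusables data
file, and the names _PROTOTYPES, confusable and register are made up for
illustration. A real index would more likely use ICU's spoof checker.

```python
import unicodedata

# ASSUMPTION: a tiny hand-picked stand-in for the TR39 confusables data.
# The real table is far larger; these entries are for illustration only.
_PROTOTYPES = {
    "0": "O",       # digit zero -> Latin capital O
    "1": "l",       # digit one -> Latin small L
    "I": "l",       # Latin capital I -> Latin small L
    "\u0430": "a",  # Cyrillic small a -> Latin small a
    "\u0435": "e",  # Cyrillic small ie -> Latin small e
    "\u043e": "o",  # Cyrillic small o -> Latin small o
}


def skeleton(identifier):
    """Return an internal normalization of identifier for comparison only."""
    decomposed = unicodedata.normalize("NFD", identifier)
    mapped = "".join(_PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(id1, id2):
    """Two identifiers are confusable when their skeletons are equal."""
    return skeleton(id1) == skeleton(id2)


def register(index, name):
    """Add name to index (a dict keyed by skeleton) or raise ValueError.

    Hypothetical helper: 'index' stands in for the package index's storage.
    """
    key = skeleton(name)
    existing = index.get(key)
    if existing is not None and existing != name:
        raise ValueError("%r is confusable with %r" % (name, existing))
    index[key] = name
```

The skeleton is never shown to users; it only serves as the uniqueness key,
so the registered name keeps its original spelling.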

The tooling could get away with a simpler rule like
re.sub(r"[^\w\d.]+", "_", distribution, flags=re.UNICODE)
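
A minimal sketch of that rule as a function (safe_name is a made-up name,
not an existing API). One thing to watch: the fourth positional argument of
re.sub is count, so the flag has to be passed as flags=. On Python 3, str
patterns already match Unicode word characters, so the flag only matters on
Python 2.

```python
import re

# \w already covers digits, so \d is not needed in the character class.
_UNSAFE = re.compile(r"[^\w.]+", re.UNICODE)


def safe_name(distribution):
    """Collapse each run of characters outside [\\w.] into one underscore."""
    return _UNSAFE.sub("_", distribution)


print(safe_name("Twisted Flow >= 1.0"))
print(safe_name("my-package"))
```

Under this rule a hyphen also becomes an underscore, and letters outside
ASCII are left alone.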

As a bonus on top of including the world's scripts, the confusables check
should be able to prevent people from exchanging zeroes for capital O.

On Wed, May 15, 2013 at 7:17 AM, Eric V. Smith <eric at trueblade.com> wrote:
> On 05/15/2013 07:10 AM, Donald Stufft wrote:
>>>>> Anyone want to run a scan over the PyPI package set to see
>>>>> how many packages would cause problems for a "[a-zA-Z0-9_.-]"
>>>>> only filter?
>>>>
>>>> See my previous email where I did queries against my local DB.
>>>> It's 225 total projects that wouldn't be allowed.
>>>
>>> Can you send the list of those projects?
>>>
>>> Eric.
>>>
>>
>> Here you go https://gist.github.com/dstufft/5583225 used a Python
>> oneliner and the PyPI API so others can reproduce easily if they
>> wish.
>
> Perfect. Thanks.
>
> It looks like space causes most of the issues. I'm not sure how
> "Twisted Flow >= 1.0" would be expected to parse.
>
> Eric.

