[Python-ideas] PEP 426, YAML in the stdlib and implementation discovery

Andrew Barnert abarnert at yahoo.com
Fri May 31 20:43:54 CEST 2013


From: Philipp A. <flying-sheep at web.de>

Sent: Friday, May 31, 2013 10:18 AM


> also YAML is standardized, so having a parser in the stdlib doesn’t 
> mean it’s a bad thing when development stalls, because a parsing 
> library doesn’t need to evolve (was it kenneth reitz who said the 
> stdlib is where projects go to die?)

But YAML is up to 1.2, and I believe most libraries (including PyYAML) only handle 1.1 so far. There are also known bugs in the 1.1 specification (e.g., "." matches the float syntax, but doesn't denote 0.0 or any other valid value) that each library has to work around. There are features of the standard, such as YAML<->XML bindings, that are still in early stages of design. Maybe a YAML-1.1-as-interpreted-by-the-majority-of-the-quasi-reference-implementations library doesn't need to evolve, but a YAML library does.

From: Philipp A. <flying-sheep at web.de>
Sent: Friday, May 31, 2013 9:35 AM


>    1. YAML in the stdlib
>The stdlib shouldn’t get more C code; that’s what I’ve gathered.
>So let’s put a pure-python implementation of YAML into the stdlib.

Are you suggesting importing PyYAML (in modified form, and without the libyaml-binding "fast" implementation) into the stdlib, or building a new one? If the former, have you talked to Kirill Simonov? If the latter, are you proposing to build it, or just suggesting that it would be nice if somebody did?

> Let’s base the 
> parser on generators, since generators are cool, easy to debug, and 
> allow us to emit and test the token stream (other than e.g. the HTML 
> parser we have)

Do you mean adding a load_iter() akin to load_all() except that it yields one document at a time, or a SAX-like API instead of a simple load()? 

Or do you just mean we should write a new implementation from scratch that's, e.g., built around a character stream generator feeding a token generator feeding an event generator feeding a document generator? Since YAML is parseable by simple recursive-descent, that isn't impossible, but most implementations make extensive use of peeking, which means you'd need to wrap each generator or add a lookahead stash at each level, which might destroy most of the benefits you're looking for. Also, writing a new implementation from scratch isn't exactly trivial. Look at https://bitbucket.org/xi/pyyaml/src/804623320ab2/lib3/yaml?at=default compared to http://hg.python.org/cpython/file/16fea8b0f8c4/Lib/json.
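To make the lookahead problem concrete, here's a minimal sketch of the kind of wrapper each generator stage would need (Peekable is an invented name, not anything PyYAML provides):

```python
class Peekable:
    """Wrap an iterator so a consumer can look at the next item without
    consuming it -- the one-item stash each pipeline stage would need."""
    _MISSING = object()

    def __init__(self, it):
        self._it = iter(it)
        self._stash = self._MISSING

    def peek(self, default=None):
        # Pull one item into the stash if we haven't already.
        if self._stash is self._MISSING:
            self._stash = next(self._it, self._MISSING)
        return default if self._stash is self._MISSING else self._stash

    def __iter__(self):
        return self

    def __next__(self):
        # Hand out the stashed item first, then fall through to the iterator.
        if self._stash is not self._MISSING:
            value, self._stash = self._stash, self._MISSING
            return value
        return next(self._it)
```

Every stage in the char -> token -> event -> document chain would need this wrapping, which is where the extra bookkeeping creeps back in.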

>    2. Implementation discovery
>People want fast parsing. That’s incompatible with a pure python implementation.
>So let’s define (or use, if there is one I’m not aware of) a discovery 
>mechanism that allows implementations of certain APIs to register 
>themselves as such.


This is an interesting idea, but I think it might be better served by first applying it to something that's already finished, instead of to vaporware. 

For example, the third-party lxml library provides an implementation of the ElementTree API. For some use cases, it's better than the stdlib one. So, a lot of programs start off with this:

    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET

Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
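A minimal sketch of what such a registry could look like (every name here is invented for illustration; nothing like this exists today):

```python
import importlib

# Hypothetical registry: API name -> preferred implementing module.
_registry = {}

def register(api_name, module_name):
    """Called by a third-party package to claim an API."""
    _registry[api_name] = module_name

def load_best(api_name, stdlib_name):
    """Import the registered implementation, falling back to the stdlib."""
    preferred = _registry.get(api_name, stdlib_name)
    try:
        return importlib.import_module(preferred)
    except ImportError:
        return importlib.import_module(stdlib_name)
```

With something like that in place, `load_best('etree', 'xml.etree.ElementTree')` would transparently hand you lxml.etree if lxml had registered itself, and the stdlib module otherwise.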

>Let “import yaml” use this mechanism to import a compatible 3rd party implementation 
>in preference to the stdlib one

>Let’s define a property of the implementation that tells the user which 
>implementation he’s using, and a way to select a specific implementation
>(Although that’s probably easily done by just not doing “import yaml”, 
>but “import std_yaml” or “import pyyaml2”)


There are a few examples of something similar, both in and out of the stdlib. For example:

The dbm module basically works like this: you can import dbm.ndbm, or you can just import dbm to get the best available implementation. That isn't done by hooking the import, but rather by providing a handful of wrapper functions that forward to the chosen implementation. Is that reasonable for YAML, or are there too many top-level functions or too much module-level global state or something?
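For reference, the dbm-style approach boils down to something like this (the module names are stand-ins; json plays the role of the always-available fallback, since there is no yaml in the stdlib to use):

```python
import importlib

# Candidate implementations, best first; dbm does roughly this with
# dbm.gnu / dbm.ndbm / dbm.dumb. 'some_fast_yaml' is made up.
_names = ['some_fast_yaml', 'json']

def _whichmodule():
    for name in _names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError('no available implementation')

def loads(s):
    # Top-level wrapper forwarding to the chosen implementation.
    return _whichmodule().loads(s)
```

Each public function needs a forwarding wrapper like loads() above, which is why the approach gets awkward if the API surface is large.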

BeautifulSoup uses a different mechanism: you import bs4, but when you construct a BeautifulSoup object, you can optionally specify a parser by name, or leave it unspecified to have it pick the best. I don't think that applies here, as you'll probably be mostly calling top-level functions, not constructing and using parser objects.

There's also been some discussion around how tulip/PEP3156 could allow a "default event loop", which could be either tulip's or provided by some external library like Twisted.

What all of these are missing is a way for an unknown third-party implementation to plug itself in as the new best. Of course you can always monkeypatch it at runtime (dbm._names.insert(0, __name__)), but you want to do it at _install_ time, which is a different story.
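The runtime monkeypatch is real, if fragile:

```python
import dbm

# Promote dbm.dumb to the front of dbm's candidate list; any
# subsequent dbm.open() will try it first. Fragile, because _names
# is a private detail of the dbm package.
dbm._names.insert(0, 'dbm.dumb')
```

Nothing equivalent survives a restart, which is exactly why an install-time hook would be needed.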

One further issue is that sometimes the system administrator (or end user) might want to affect the default choice for programs running on his machine. For example, lxml is built around libxml2. Mac OS X 10.6, some linux distros, etc. come with old or buggy versions of libxml2. You might want to install lxml anyway and make it the default for BeautifulSoup, but not for ElementTree, or vice-versa.


Finally, what happens if you install two modules on your system which both register as implementations of the same module?

>    3. Allow YAML to be used besides JSON as metadata like in PEP 426 
> (so including either pymeta.yaml or pymeta.json makes a valid package).
> I don’t propose that we exclusively use YAML, but only because I think 
> that PEP 426 shouldn’t be hindered from being implemented ASAP by 
> waiting for a new std-library to be ready.

Note that JSON is a strict subset of YAML 1.2, and not too far from a subset of 1.1. So, you could propose exclusive YAML, and make sticking within the JSON schema and syntax required for packages compatible with Python 3.3 and earlier, but optional for 3.4+ packages.
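To make the subset relationship concrete, here is the same hypothetical pymeta content twice — first in YAML's block syntax, then in the JSON form, which is itself a valid YAML document:

```yaml
# pymeta.yaml -- hypothetical example content
name: example
version: "1.0"
keywords: [spam, eggs]
```

```json
{"name": "example", "version": "1.0", "keywords": ["spam", "eggs"]}
```

A YAML 1.2 parser loads both to the same mapping, so "JSON-only for old-Python compatibility" is just a restriction on which spellings you allow, not a second format.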

>What do you think?
>Is there a reason for not including a YAML lib that I didn’t cover?
>Is there a reason JSON is used other than YAML not being in the stdlib?

