PEP 426, YAML in the stdlib and implementation discovery
Hi, reading PEP 426 <http://www.python.org/dev/peps/pep-0426/#switching-to-a-json-compatible-format>, I made a connection to a (IMHO) longstanding issue: YAML not being in the stdlib. I’m no big fan of JSON, because it’s so strict and verbose compared with YAML. I just think YAML is more pythonic, and a better choice for any kind of human-written data format.

So i devised 3 ideas:

1. *YAML in the stdlib*
The stdlib shouldn’t get more C code; that’s what I’ve gathered. So let’s put a pure-python implementation of YAML into the stdlib. Let’s also strictly define the API and make it secure-by-naming™. What i mean is: let’s use the safe load function that doesn’t instantiate user-defined classes (in PyYAML called “safe_load”) as the default load function “load”, and call the unsafe one by a longer, explicit name (e.g. “unsafe_load” or “extended_load” or something). Let’s base the parser on generators, since generators are cool, easy to debug, and allow us to emit and test the token stream (unlike e.g. the HTML parser we have). (a sketch of the naming idea is at the end of this mail)

2. *Implementation discovery*
People want fast parsing. That’s incompatible with a pure python implementation. So let’s define (or use, if there is one I’m not aware of) a discovery mechanism that allows implementations of certain APIs to register themselves as such. Let “import yaml” use this mechanism to import a compatible 3rd party implementation in preference to the stdlib one. Let’s define a property of the implementation that tells the user which implementation he’s using, and a way to select a specific implementation (although that’s probably easily done by just not doing “import yaml”, but “import std_yaml” or “import pyyaml2”).

3. *Allow YAML to be used besides JSON as metadata* like in PEP 426 (so including either pymeta.yaml or pymeta.json makes a valid package). I don’t propose that we exclusively use YAML, but only because I think that PEP 426 shouldn’t be hindered from being implemented ASAP by waiting for a new stdlib module to be ready.

What do you think? Is there a reason for not including a YAML lib that i didn’t cover? Is there a reason JSON is used other than YAML not being in the stdlib?
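A minimal sketch of the secure-by-naming idea (illustrative only: it wraps the third-party PyYAML package, and all names other than PyYAML's own safe_load/load are assumptions; the actual proposal is a new pure-python implementation):

    # sketch of the proposed naming: the short name is the safe one
    import yaml as _pyyaml  # PyYAML, third-party, used here only for illustration

    def load(stream):
        """Default loader: never instantiates user-defined classes."""
        return _pyyaml.safe_load(stream)

    def unsafe_load(stream):
        """Explicitly named escape hatch, for trusted input only."""
        return _pyyaml.load(stream, Loader=_pyyaml.Loader)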
On May 31, 2013 9:46 AM, "Philipp A." <flying-sheep@web.de> wrote:
I’m no big fan of JSON, because it’s so strict and verbose compared with YAML. I just think YAML is more pythonic, and a better choice for any kind of human-written data format.
Considering json values are Python literals and yaml isn't, I'd say you have the first part backwards. And as far as human-written data goes, strictness helps prevent errors.

But it doesn't have to be a competition. If there's value in having a standard yaml parser, or value in accepting yaml in specific cases, that value should stand by itself.

--- Bruce (from my phone)
json is both a subset of python literals (pyon, if you want) and of yaml, but stricter than both: it doesn’t allow typewriter apostrophes (') for strings like python, and doesn’t allow unicode, raw strings or byte literals. (yaml afaik knows all of those, albeit with different syntax than python. but it has them, allowing to easily represent more of python’s capabilities)

yaml reports errors and treats ambiguities as errors. with “strictness”, i mean the ways something can be written: both python and yaml allow for synonymous ways to write the same object, whereas JSON only accepts variable whitespace: hardly a source of error!

and of course there’s value in a stdlib yaml parser: as i said it’s much more human friendly, and even some python projects already use it for configuration because of this (i say “even”, because of course JSON being in the stdlib is a strong argument to use it in dependency-free projects). also YAML is standardized, so having a parser in the stdlib doesn’t mean it’s a bad thing when development stalls, because a parsing library doesn’t need to evolve (was it kenneth reitz who said the stdlib is where projects go to die?)

2013/5/31 Bruce Leban <bruce@leapyear.org>
On May 31, 2013 9:46 AM, "Philipp A." <flying-sheep@web.de> wrote:
I’m no big fan of JSON, because it’s so strict and verbose compared with YAML. I just think YAML is more pythonic, and a better choice for any kind of human-written data format.
Considering json values are Python literals and yaml isn't, I'd say you have the first part backwards. And as far as human-written data goes, strictness helps prevent errors.
But it doesn't have to be a competition. If there's value in having a standard yaml parser or value in accepting yaml in specific cases that value should stand by itself.
--- Bruce (from my phone)
On 2013-05-31, at 19:18 , Philipp A. wrote:
json is both subset of python literals (pyon, if you want) and yaml, but stricter than both: it doesn’t allow typewriter apostrophes (') for strings like python, and doesn’t allow unicode, raw strings or byte literals.
All JSON strings are unicode strings: JSON strings can embed any unicode character literally aside from the double quote and backslash, and they support \u escapes (identical to Python's). The only major difference is that JSON does not support \U escapes, so escapes alone can only name BMP characters directly; anything outside the BMP must be written literally or as a surrogate pair.
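A quick demonstration with the stdlib json module:

    import json
    json.loads('"\\u00e9"')         # 'é' (BMP escape, same form as Python's \u)
    json.loads('"\\ud83d\\ude00"')  # '😀' (non-BMP, written as a surrogate pair)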
(yaml afaik knows all of those, albeit with different syntax than python. but it has them, allowing to easily represent more of python’s capabilities)
YAML has no raw strings that I know of. It also has no byte literals; there is a working draft for a binary tag encoded in base64 [0]. Its failsafe schema only contains strings (unicode), same as JSON.
yaml reports errors and treats ambiguities as errors.
That is not correct, the spec notes:
A YAML processor may recover from syntax errors, possibly by ignoring certain parts of the input, but it must provide a mechanism for reporting such errors.
YAML implementations are absolutely free to resolve ambiguities on their own and not report any error by default, and the spec's "loading failure points" graph clearly indicates parsing a YAML document may yield a partial representation.
and of course there’s value in a stdlib yaml parser: as i said it’s much more human friendly, and even some python projects already use it for configuration because of this (i say “even”, because of course JSON being in the stdlib is a strong argument to use it in dependency-free projects). also YAML is standardized, so having a parser in the stdlib doesn’t mean it’s a bad thing when development stalls, because a parsing library doesn’t need to evolve (was it kenneth reitz who said the stdlib is where projects go to die?)
From: Masklinn <masklinn@masklinn.net> Sent: Friday, May 31, 2013 11:18 AM
On 2013-05-31, at 19:18 , Philipp A. wrote:
(yaml afaik knows all of those, albeit with different syntax than python. but it has them, allowing to easily represent more of python’s capabilities)
YAML has no rawstrings that I know of.
Single-quoted strings are basically raw strings. They're different from raw strings in the same way all YAML strings are different from Python strings—newlines and doubling the quotes to escape them—but they ignore escape sequences, which is the fundamental property of raw strings. See http://www.yaml.org/spec/1.2/spec.html#id2760844 for an example, and section 7.3.2 for specifics.
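For illustration, how that plays out in PyYAML (third-party, assumed installed; the stdlib has no yaml module):

    import yaml  # PyYAML

    yaml.safe_load(r"a: 'no \n escape here'")
    # -> {'a': 'no \\n escape here'}  (single-quoted: the backslash stays literal)

    yaml.safe_load('a: "an \\n escape"')
    # -> {'a': 'an \n escape'}        (double-quoted: \n becomes a real newline)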
It also has no byte literals, there is a working draft for a binary tag encoded in base64[0].
Section 10.4 explicitly says that "It is strongly recommended that [interoperable] schemas make as much use as possible of the YAML tag repository at http://yaml.org/type/. This repository provides recommended global tags for increasing the portability of YAML documents between different applications." Even though most of the tags in the repository are officially working drafts, they're effectively standards.
Its failsafe schema only contains strings (unicode), same as JSON.
But the spec doesn't recommend using the failsafe schema for most purposes. Section 10.3 says "[The core schema] is the recommended default schema that YAML processor should use unless instructed otherwise. It is also strongly recommended that other schemas should be based on it." Section 10.4 then implies that what most applications really should be using is something the spec doesn't quite name or define—the core schema plus all the tags from the repository. Note that all of this means you can't just say "use YAML" to specify anything; you have to say "use this YAML schema". So, if we were to follow the OP's suggestion of using YAML for metadata, it would have to be more specific than that.
2013/5/31 Guido van Rossum <guido@python.org>
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
Hm. What makes you think it's not being used widely enough? I suppose JSON is more popular, but it's not like YAML is dying. AFAIK there's a 3rd party yaml parser that we could incorporate with its authors' permission -- this would save people from a pretty common dependency (just like we did for JSON).
i think ruby created its own reality here: yaml wasn’t popular because it wasn’t in the stdlib, and became popular as soon as it was. its advantages as arguably the most human-writable serialization format helped here. at least that’s my interpretation.
On Fri, May 31, 2013 at 3:27 PM, Philipp A. <flying-sheep@web.de> wrote:
2013/5/31 Guido van Rossum <guido@python.org>
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
Hm. What makes you think it's not being used widely enough? I suppose JSON is more popular, but it's not like YAML is dying. AFAIK there's a 3rd party yaml parser that we could incorporate with its authors' permission -- this would save people from a pretty common dependency (just like we did for JSON).
i think ruby created its own reality here: yaml wasn’t popular because it wasn’t in the stdlib, and became popular as soon as it was. its advantages as arguably the most human-writable serialization format helped here. at least that’s my interpretation.
I kind of like the idea of a stdlib YAML. IIUC the format competes with XML and pickle in interesting ways, avoiding in-band signaling like "keys starting with a $" or whatever that sometimes happens with JSON.

It doesn't make sense as the serialization format for packaging metadata, though. *.dist-info/pymeta.json, to be parsed at install time to resolve dependency graphs, has no human readability/writability requirement, but must be fast, simple to implement, and available on Python 2.6+. At least for the 100k+ existing PyPI-hosted sdists, the metadata input format is a Python program called "setup.py". YAML could be a good alternative. Perhaps you could try doing a YAML version of Bento's configuration language?

Right now the thinking in packaging is that there will not be *a* standard setup.py replacement. Instead, we will define hooks so that your build tool can generate the static metadata and build your package, and the install tools will be able to interoperate from there.
Philipp A. <flying-sheep@...> writes:
Hi, reading PEP 426, I made a connection to a (IMHO) longstanding issue: YAML not being in the stdlib.
There have been security issues with YAML (which bit the Rails community not so long ago) because it allows the construction of arbitrary objects. So it may be that YAML is not the best format for scenarios where tools read YAML from untrusted sources.

The PEP defines the metadata format as a Python dictionary - the serialising of metadata to a specific file format seems a secondary consideration. It's quite possible that some of the packaging tools that use the new metadata will support different serialisation mechanisms, perhaps including YAML, but ISTM that having YAML in the stdlib is orthogonal to the PEP.

Do you have a specific YAML implementation in mind? I thought that the front-runner was PyYAML, but in my initial experiments with PyYAML and packaging metadata, I found bugs in the implementation (which I have reported on the PyYAML tracker) which made me switch to JSON.

Regards, Vinay Sajip
2013/5/31 Vinay Sajip <vinay_sajip@yahoo.co.uk>
There have been security issues with YAML (which bit the Rails community not so long ago) because it allows the construction of arbitrary objects. So it may be that YAML is not the best format for scenarios where tools read YAML from untrusted sources.
please read my post again: i specifically mention that issue and a possible solution. i’m just a little annoyed that you skipped that paragraph and attack a strawman now. but not too annoyed :)

The PEP defines the metadata format as a Python dictionary - the serialising of metadata to a specific file format seems a secondary consideration. It's quite possible that some of the packaging tools that use the new metadata will support different serialisation mechanisms, perhaps including YAML, but ISTM that having YAML in the stdlib is orthogonal to the PEP.
but in the future, package metadata won’t be specified in setup.py anymore, so we need a metadata file (like setup.cfg would have been for distutils2), and we write those by hand. the metadata involved corresponds exactly to the one mentioned here, so what do you think the format of that metadata file will be?

Do you have a specific YAML implementation in mind? I thought that the front-runner was PyYAML, but in my initial experiments with PyYAML and packaging metadata, I found bugs in the implementation (which I have reported on the PyYAML tracker) which made me switch to JSON.

i didn’t think of any, but i don’t think any available one would meet the proposed goals of a secure API (like i said in the paragraph you skipped) and a generator-based implementation/API.
regards, phil
On 31.05.2013 20:13, Philipp A. wrote:
2013/5/31 Vinay Sajip <vinay_sajip@yahoo.co.uk>
The PEP defines the metadata format as a Python dictionary - the serialising of metadata to a specific file format seems a secondary consideration. It's quite possible that some of the packaging tools that use the new metadata will support different serialisation mechanisms, perhaps including YAML, but ISTM that having YAML in the stdlib is orthogonal to the PEP.
but in the future, package metadata won’t be specified in the setup.py anymore, so we need a metadata file (like setup.cfg would have been for distutils2). and we write those per hand. the involved metadata corresponds exactly to the one mentioned here, so what do you think that the format of that metadata file will be?
Just as a data point: PEP 426 explicitly says "It is expected that these metadata files will be generated by build tools based on other input formats (such as setup.py) rather than being edited by hand."

Not sure where you got the idea from that anyone would write the JSON files by hand. The data will be extracted from the things you specify in setup.py at sdist or wheel build time and put into the JSON files. So that particular use case is not very likely to happen. That's not to say there aren't any use cases, it's just not going to be this one :-).

-- Marc-Andre Lemburg, eGenix.com
Philipp A. <flying-sheep@...> writes:
please read my post again: i specifically mention that issue and a possible solution. i’m just a little annoyed that you skipped that paragraph and attack a strawman now. but not too annoyed :)
I did read it, perhaps I should have been more clear. I didn't say the security issue was a show-stopper, just tagged it as a possible problem area. There are already yaml libraries out in the wild whose load() is the unsafe version, and a user may not necessarily be able to control (or even know) which yaml library is installed (e.g. distro package managers are conservative about adopting recent versions of libs).
i didn’t think of any, but i don’t think any available one would meet the proposed goals of a secure API (like i said in the paragraph you skipped)
It's chicken and egg. IMO it doesn't make sense to even think about YAML in the stdlib until there is a version outside the stdlib which has a reasonable level of adoption and battle-tested status. This is how JSON support came into the stdlib, for example. At the moment PyYAML seems to be the most mature, but from what I can see on its Trac, the most recent version (3.10 AFAIK) is still not ready. Regards, Vinay Sajip
Hello,

On Fri, 31 May 2013 18:35:43 +0200 "Philipp A." <flying-sheep@web.de> wrote:
What do you think?
Is there a reason for not including a YAML lib that i didn’t cover?
As for many topics, the #1 reason is that nobody proposed such an inclusion. By *proposing*, we do not mean just emitting the idea as you did (which is of course fine), but drafting a concrete proposal (in the form of a Python Enhancement Proposal), and promising - or finding someone else willing to promise - to handle maintenance and bugfixing of the new stdlib module within our development community (not eternally of course, but a couple of years would be nice so as to iron out most issues).

Regards, Antoine.
On Fri, May 31, 2013 at 12:35 PM, Philipp A. <flying-sheep@web.de> wrote:
Hi, reading PEP 426, I made a connection to a (IMHO) longstanding issue: YAML not being in the stdlib.
I’m no big fan of JSON, because it’s so strict and comparatively verbose compared with YAML. I just think YAML is more pythonic, and a better choice for any kind of human-written data format.
So i devised 3 ideas:
1. YAML in the stdlib

The stdlib shouldn’t get more C code; that’s what I’ve gathered. So let’s put a pure-python implementation of YAML into the stdlib. Let’s also strictly define the API and make it secure-by-naming™. What i mean is: let’s use the safe load function that doesn’t instantiate user-defined classes (in PyYAML called “safe_load”) as the default load function “load”, and call the unsafe one by a longer, explicit name (e.g. “unsafe_load” or “extended_load” or something). Let’s base the parser on generators, since generators are cool, easy to debug, and allow us to emit and test the token stream (unlike e.g. the HTML parser we have)
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
2. Implementation discovery

People want fast parsing. That’s incompatible with a pure python implementation. So let’s define (or use, if there is one I’m not aware of) a discovery mechanism that allows implementations of certain APIs to register themselves as such. Let “import yaml” use this mechanism to import a compatible 3rd party implementation in preference to the stdlib one. Let’s define a property of the implementation that tells the user which implementation he’s using, and a way to select a specific implementation (although that’s probably easily done by just not doing “import yaml”, but “import std_yaml” or “import pyyaml2”)
The standard practice is to place any accelerated code in something like _yaml and then in yaml.py do a ``from _yaml import *``.
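For reference, that pattern looks roughly like this (a sketch; the _yaml module and its contents are assumptions, mirroring how json/heapq/pickle handle their accelerators):

    # yaml.py -- pure-python reference implementation
    def safe_load(stream):
        ...  # pure-python parser

    # at the bottom of the module, prefer a C accelerator when available
    try:
        from _yaml import *  # hypothetical accelerated module
    except ImportError:
        pass                 # fall back to the pure-python definitions above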
3. Allow YAML to be used besides JSON as metadata like in PEP 426 (so including either pymeta.yaml or pymeta.json makes a valid package). I don’t propose that we exclusively use YAML, but only because I think that PEP 426 shouldn’t be hindered from being implemented ASAP by waiting for a new stdlib module to be ready.
But that then creates a possible position where just to read metadata you must have a 3rd-party library installed, and I view that as a non-starter.
What do you think?
While I appreciate what you are suggesting, I don't see it happening.
Is there a reason for not including a YAML lib that i didn’t cover?
Yes, see above.
Is there a reason JSON is used other than YAML not being in the stdlib?
It's simpler, it's Python syntax, it's faster to parse. If you don't like json and would rather specify metadata using YAML, I would write a tool that read YAML and then emitted the metadata.json file. That way you get to write your metadata in the format you want but without requiring YAML support in the stdlib. But making YAML a first-class citizen in all of this won't happen as long as YAML is not in the stdlib and that is not a viable option.
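Such a tool could be as small as this sketch (assuming the third-party PyYAML package; the pymeta.yaml input name is made up, pymeta.json is the PEP's):

    import json
    import yaml  # PyYAML, third-party

    with open('pymeta.yaml') as src:
        meta = yaml.safe_load(src)  # safe_load: no arbitrary object construction
    with open('pymeta.json', 'w') as dst:
        json.dump(meta, dst, sort_keys=True)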
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
Hm. What makes you think it's not being used widely enough? I suppose JSON is more popular, but it's not like YAML is dying. AFAIK there's a 3rd party yaml parser that we could incorporate with its authors' permission -- this would save people from a pretty common dependency (just like we did for JSON).
Is there a reason JSON is used other than YAML not being in the stdlib?
It's simpler, it's Python syntax, it's faster to parse.
I would warn strongly against the "JSON is Python syntax" meme. While you can usually read JSON with Python's eval(), *writing* it with repr() is a disaster because of JSON's requirement to use double string quotes. And as we know, eval() is unsafe, so the conclusion is that one should always use the json module, and never rely on the fact that it looks like Python (except that it makes the format easy to understand to humans familiar with Python). (I have no opinion on the use of YAML for metadata.) -- --Guido van Rossum (python.org/~guido)
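The mismatch in concrete terms:

    import json
    data = {'name': 'spam', 'flag': True, 'missing': None}
    repr(data)        # "{'name': 'spam', 'flag': True, 'missing': None}"
                      # not JSON: single quotes, True/None instead of true/null
    json.dumps(data)  # '{"name": "spam", "flag": true, "missing": null}'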
On Fri, May 31, 2013 at 2:43 PM, Guido van Rossum <guido@python.org> wrote:
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
Hm. What makes you think it's not being used widely enough?
In my purview it isn't. I mean I know App Engine uses it but I just don't come across it often enough to think that we should incorporate it. Heck, I think I may have suggested it years back when I came across a need, but at this moment in time I just don't have a feeling it's wide enough to want to maintain the code.
I suppose JSON is more popular, but it's not like YAML is dying. AFAIK there's a 3rd party yaml parser that we could incorporate with its authors' permission -- this would save people from a pretty common dependency (just like we did for JSON).
Sure, but the pure Python version is naively 5,500 lines, just 500 lines less than decimal.py. The json package is 1,100. Sure, it's difficult to get right since the format is hard to parse, which is an argument for its addition, but this would not be a small import of code. And the popularity vs. code size/complexity trade-off just isn't an easy sell to me. Obviously it's just a gut feeling but I just am not feeling it as worth the hassle. But we really won't know w/o asking python-dev about this.
Is there a reason JSON is used other than YAML not being in the stdlib?
It's simpler, it's Python syntax, it's faster to parse.
I would warn strongly against the "JSON is Python syntax" meme. While you can usually read JSON with Python's eval(), *writing* it with repr() is a disaster because of JSON's requirement to use double string quotes. And as we know, eval() is unsafe, so the conclusion is that one should always use the json module, and never rely on the fact that it looks like Python (except that it makes the format easy to understand to humans familiar with Python).
I'm talking purely from the perspective of writing it by hand, which is what sparked this conversation. There is no new format to really learn like with YAML: write a Python dict using double-quotes for strings, lowercase your singletons from Python, and you're basically there. -Brett
(I have no opinion on the use of YAML for metadata.)
-- --Guido van Rossum (python.org/~guido)
On Fri, 31 May 2013 16:11:10 -0400 Brett Cannon <brett@python.org> wrote:
On Fri, May 31, 2013 at 2:43 PM, Guido van Rossum <guido@python.org> wrote:
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
Hm. What makes you think it's not being used widely enough?
In my purview it isn't. I mean I know App Engine uses it but I just don't come across it often enough to think that we should incorporate it.
YAML is used by both Salt and Ansible, two configuration management engines (*) written in Python with growing popularity. http://docs.saltstack.com/topics/tutorials/starting_states.html#default-data... http://ansible.cc/docs/playbooks.html#playbook-language-example (*) in other words, Chef / Puppet contenders https://www.ohloh.net/p/compare?project_0=Chef&project_1=salt&project_2=Ansible
I suppose JSON is more popular, but it's not like YAML is dying. AFAIK there's a 3rd party yaml parser that we could incorporate with its authors' permission -- this would save people from a pretty common dependency (just like we did for JSON).
Sure, but the pure Python version is naively 5,500 lines, just 500 lines less than decimal.py. The json package is 1,100. Sure, it's difficult to get right since the format is hard to parse, which is an argument for its addition, but this would not be a small import of code.
I agree that YAML being on the complex side is a bit of a warning sign for stdlib inclusion.
I'm talking purely from the perspective of writing it by hand, which is what sparked this conversation. There is no new format to really learn like with YAML: write a Python dict using double-quotes for strings, lowercase your singletons from Python, and you're basically there.
But writing JSON by hand isn't really pleasant or convenient. It's ok for small lumps of data. Salt and Ansible don't (seem to) use YAML for anything complicated; they use it for the much friendlier user experience.

Regards, Antoine.
On Fri, May 31, 2013 at 9:43 PM, Guido van Rossum <guido@python.org> wrote:
On Fri, May 31, 2013 at 11:35 AM, Brett Cannon <brett@python.org> wrote:
Is there a reason JSON is used other than YAML not being in the stdlib?
It's simpler, it's Python syntax, it's faster to parse.
I would warn strongly against the "JSON is Python syntax" meme.
Another warning - this javascript tidbit broke my heart more times than I care to admit:
    var k = 'foo'
    var obj = {k: 'wonderful'}
    obj[k]      // undefined
    obj['k']    // "wonderful"
Yuval
Yuval Greenfield wrote:
It's simpler, it's Python syntax, it's faster to parse.
I would warn strongly against the "JSON is Python syntax" meme.
Another warning - this javascript tidbit broke my heart more times than I care to admit:
    var k = 'foo'
    var obj = {k: 'wonderful'}
    obj[k]      // undefined
    obj['k']    // "wonderful"
*sigh* brother! worse if instead of a generic k you have:

    var MY_CONFIGURATION_CONSTANT = 'foo';

fortunately JSON does not allow unquoted keys (and variables...), so at least this behaviour does not cause surprises. -- ZeD
2013/5/31 Brett Cannon <brett@python.org>
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
[citation needed] it’s omnipresent in the ruby community *because* it’s nicer than JSON and XML, and *because* the ruby stdlib has a parser (my interpretation, of course, but not an unlikely one, imho).

and again, to intercept the “unsafe” argument: naming the unsafe load function “load” creates human error. naming the safe one “load” prevents it. i’m sure of that, too: nobody may honestly say he didn’t know that “unsafe_load” is unsafe.
(Although that’s probably easily done by just not doing “import yaml”, but “import std_yaml” or “import pyyaml2”)
The standard practice is to place any accelerated code in something like _yaml and then in yaml.py do a ``from _yaml import *``.
that’s what i said. just that _name implies internal, implementation-specific, rapidly changing code, which doesn’t fit my vision of a strict API that “_yaml” and compatible implementations should implement. but maybe an infobox in the stdlib yaml documentation telling the user about it is sufficient.

But that then creates a possible position where just to read metadata you must have a 3rd-party library installed, and I view that as a non-starter.
that’s exactly why i presented those 3 ideas as one: they work together best (although the implementation discovery isn’t mandatory)

It's simpler, it's Python syntax, it's faster to parse.

wrong, wrong and irrelevant. it’s only “simpler” for certain definitions of “simple”. those definitions target compilers, not humans. python targets humans, not compilers. (that’s e.g. why it favors readability over speed) also JSON is NOT python syntax, not even a subset: it has true, false and null instead of True, False and None, and also there’s a little hack involving escaped newlines which breaks code based on this assumption in awesome ways ;)

But making YAML a first-class citizen in all of this won't happen as long as YAML is not in the stdlib and that is not a viable option.

says you.
On 05/31/2013 11:57 AM, Philipp A. wrote:
2013/5/31 Brett Cannon wrote:
The standard practice is to place any accelerated code in something like _yaml and then in yaml.py do a ``from _yaml import *``.
that’s what i said. just that _name implies internal, implementation -specific, rapidly changing code, [...]
_name implies implementation detail and private. Very little in the stdlib is rapidly changing. -- ~Ethan~
On Fri, May 31, 2013 at 2:57 PM, Philipp A. <flying-sheep@web.de> wrote:
2013/5/31 Brett Cannon <brett@python.org>
So yaml is not going to end up in the stdlib. The format is not used widely enough to warrant being added, nor is it worth having to maintain a parser for such a complicated format.
[citation needed]
OK, I claim it isn't as widely used as I think would warrant inclusion, you disagree. Asking for a citation could be thrown in either direction and on any point in this discussion and it comes off as aggressive.
it’s omnipresent in the ruby community *because* it’s nicer than JSON and XML, and *because* the ruby stdlib has a parser (my interpretation, of course, but not a unlikely one, imho).
That's fine, but that's not a reason to add it to Python's stdlib. Adding anything to the stdlib takes careful thought because the burden of maintenance is high for any code that lands there. From my POV a YAML module just isn't there.
and again, to intercept the “unsafe” argument: naming the unsafe load function “load” creates human error. naming the safe one “load” prevents it. i’m sure of that, too: nobody may honestly say he didn’t know that “unsafe_load” is unsafe.
(Although that’s probably easily done by just not doing “import yaml”, but “import std_yaml” or “import pyyaml2”)
The standard practice is to place any accelerated code in something like _yaml and then in yaml.py do a ``from _yaml import *``.
that’s what i said. just that _name implies internal, implementation-specific, rapidly changing code, which doesn’t fit my vision of a strict API that “_yaml” and compatible implementations should implement. but maybe an infobox in the stdlib yaml documentation telling the user about it is sufficient.
But that then creates a possible position where just to read metadata you must have a 3rd-party library installed, and I view that as a non-starter.
that’s exactly why i presented those 3 ideas as one: they work together best (although the implementation discovery isn’t mandatory)
It's simpler, it's Python syntax, it's faster to parse.
wrong, wrong and irrelevant.
It might be irrelevant to you, but it isn't irrelevant to everyone. Remember, this is for the stdlib which means its use needs to beyond just what packaging wants.
it’s only “simpler” for certain definitions of “simple”. those definitions target compilers, not humans. python targets humans, not compilers. (that’s e.g. why it favors readability over speed) also JSON is NOT python syntax, not even a subset: it has true, false and null instead of True, False and None,
For the purposes of what is being discussed here it is close enough (the PEP mentions the use of none once).
and also there’s a little hack involving escaped newlines which breaks code based on this assumption in awesome ways ;)
But making YAML a first-class citizen in all of this won't happen as long as YAML is not in the stdlib and that is not a viable option.
says you.
Yes, says me. It's my opinion and I am allowed to express it here. You are beginning to take this personally and become a bit hostile. Please take a moment to step back and realize this is just a discussion and just because I disagree with it doesn't mean I think that's bad or I think negatively of you, I just disagree with you.
2013/5/31 Brett Cannon <brett@python.org>
[citation needed]
OK, I claim it isn't as widely used as I think would warrant inclusion, you disagree. Asking for a citation could be thrown in either direction and on any point in this discussion and it comes off as aggressive.
it was a joke. you say “it’s not widely used” and i used wikipedia’s citation tag in order to say: “i’m not so sure about this”
wrong, wrong and irrelevant.
It might be irrelevant to you, but it isn't irrelevant to everyone. Remember, this is for the stdlib which means its use needs to beyond just what packaging wants.
you said JSON is faster to parse in the context of metadata files. that’s really irrelevant, because those metadata files are tiny. speed only matters if you have to do an operation very often in some way. that isn’t at all the case if your task is parsing one <100 lines metadata file. for every other use, the user would be able to decide if it was worth the very slightly slower parsing. the area where parsing speed is of essence is a use case for something like protocol buffers anyway, not JSON or YAML.

Yes, says me. It's my opinion and I am allowed to express it here. You are beginning to take this personally and become a bit hostile. Please take a moment to step back and realize this is just a discussion and just because I disagree with it doesn't mean I think that's bad or I think negatively of you, I just disagree with you.
i’m sorry if i came across like this. let me tell you that this was not at all my intention! best regards, philipp
From: Philipp A. <flying-sheep@web.de> Sent: Friday, May 31, 2013 10:18 AM
also YAML is standardized, so having a parser in the stdlib doesn’t mean it’s a bad thing when development stalls, because a parsing library doesn’t need to evolve (was it kenneth reitz who said the stdlib is where projects go to die?)
But YAML is up to 1.2, and I believe most libraries (including PyYAML) only handle 1.1 so far. There are also known bugs in the 1.1 specification (e.g., "." is a valid float literal, but doesn't specify 0.0 or any other valid value), that each library has to work around. There are features of the standard, such as YAML<->XML bindings, that are still in early stages of design. Maybe a YAML-1.1-as-interpreted-by-the-majority-of-the-quasi-reference-implementations library doesn't need to evolve, but a YAML library does.

From: Philipp A. <flying-sheep@web.de> Sent: Friday, May 31, 2013 9:35 AM
1. YAML in the stdlib

The stdlib shouldn’t get more C code; that’s what I’ve gathered. So let’s put a pure-python implementation of YAML into the stdlib.
Are you suggesting importing PyYAML (in modified form, and without the libyaml-binding "fast" implementation) into the stdlib, or building a new one? If the former, have you talked to Kirill Simonov? If the latter, are you proposing to build it, or just suggesting that it would be nice if somebody did?
Let’s base the parser on generators, since generators are cool, easy to debug, and allow us to emit and test the token stream (unlike e.g. the HTML parser we have)
Do you mean adding a load_iter() akin to load_all() except that it yields one document at a time, or a SAX-like API instead of a simple load()? Or do you just mean we should write a new implementation from scratch that's, e.g., built around a character stream generator feeding a token generator feeding an event generator feeding a document generator?

Since YAML is parseable by simple recursive-descent, that isn't impossible, but most implementations make extensive use of peeking, which means you'd need to wrap each generator or add a lookahead stash at each level, which might destroy most of the benefits you're looking for.

Also, writing a new implementation from scratch isn't exactly trivial. Look at https://bitbucket.org/xi/pyyaml/src/804623320ab2/lib3/yaml?at=default compared to http://hg.python.org/cpython/file/16fea8b0f8c4/Lib/json.
2. Implementation discovery

People want fast parsing. That’s incompatible with a pure python implementation. So let’s define (or use, if there is one I’m not aware of) a discovery mechanism that allows implementations of certain APIs to register themselves as such.
This is an interesting idea, but I think it might be better served by first applying it to something that's already finished, instead of to vaporware.

For example, the third-party lxml library provides an implementation of the ElementTree API. For some use cases, it's better than the stdlib one. So, a lot of programs start off with this:

    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET

Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
Let “import yaml” use this mechanism to import a compatible 3rd party implementation in preference to the stdlib one
Let’s define a property of the implementation that tells the user which implementation he’s using, and a way to select a specific implementation (Although that’s probably easily done by just not doing “import yaml”, but “import std_yaml” or “import pyyaml2”)
There are a few examples of something similar, both in and out of the stdlib. For example:

The dbm module basically works like this: you can import dbm.ndbm, or you can just import dbm to get the best available implementation. That isn't done by hooking the import, but rather by providing a handful of wrapper functions that forward to the chosen implementation. Is that reasonable for YAML, or are there too many top-level functions or too much module-level global state or something?

BeautifulSoup uses a different mechanism: you import bs4, but when you construct a BeautifulSoup object, you can optionally specify a parser by name, or leave it unspecified to have it pick the best. I don't think that applies here, as you'll probably be mostly calling top-level functions, not constructing and using parser objects.

There's also been some discussion around how tulip/PEP3156 could allow a "default event loop", which could be either tulip's or provided by some external library like Twisted.

What all of these are missing is a way for an unknown third-party implementation to plug themselves in as the new best. Of course you can always monkeypatch it at runtime (dbm._names.insert(0, __name__)), but you want to do it at _install_ time, which is a different story.

One further issue is that sometimes the system administrator (or end user) might want to affect the default choice for programs running on his machine. For example, lxml is built around libxml2. Mac OS X 10.6, some linux distros, etc. come with old or buggy versions of libxml2. You might want to install lxml anyway and make it the default for BeautifulSoup, but not for ElementTree, or vice-versa.

Finally, what happens if you install two modules on your system which both register as implementations of the same module?
3. Allow YAML to be used besides JSON as metadata like in PEP 426. (so including either pymeta.yaml or pymeta.json makes a valid package) I don’t propose that we exclusively use YAML, but only because I think that PEP 426 shouldn’t be hindered from being implemented ASAP by waiting for a new stdlib module to be ready.
Note that JSON is a strict subset of YAML 1.2, and not too far from a subset of 1.1. So, you could propose exclusive YAML, and make sticking within the JSON schema and syntax required for packages compatible with Python 3.3 and earlier, but optional for 3.4+ packages.
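Indeed, PyYAML (a YAML 1.1 implementation, assumed installed) already loads typical JSON documents unchanged:

    import yaml  # PyYAML, third-party
    yaml.safe_load('{"a": [1, 2], "b": null}')
    # -> {'a': [1, 2], 'b': None}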
What do you think? Is there a reason for not including a YAML lib that i didn’t cover? Is there a reason JSON is used other than YAML not being in the stdlib?
On 2013-05-31, at 20:43 , Andrew Barnert wrote:
For example, the third-party lxml library provides an implementation of the ElementTree API. For some use cases, it's better than the stdlib one. So, a lot of programs start off with this:
    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET
Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
That seems rife with potential issues and unintended side-effects e.g. while lxml does for the most part provide ET's API, it also extends it and I do not know if it can run ET's testsuite. It also doesn't handle ET's implementation details for obvious reasons[0].

So while a developer who will try lxml and fall back on ET will keep in mind to stay compatible and not use ET implementation details, one who expects ET and gets lxml on a client machine will likely be slightly disappointed in the system.

[0] and one of those implementation details can turn out deadly in unskilled hands: by default ElementTree will *not* remember namespace aliases and will rename most things from ns0 onwards:

    >>> ET.tostring(ET.fromstring('<foo xmlns:bar="noop"><bar:baz/></foo>'))
    '<foo><ns0:baz xmlns:ns0="noop" /></foo>'

whereas lxml *will* save parsed namespace aliases in its internal namespace map(s?) and reuse them:

    >>> lxml.tostring(lxml.fromstring('<foo xmlns:bar="noop"><bar:baz/></foo>'))
    '<foo xmlns:bar="noop"><bar:baz/></foo>'

if a developer expects a re-prefixed output and gets lxml's, things are going to blow up. Yeah it's bad to do that, but I've seen enough supposedly-XML-based software which cared less for the namespace itself than for the alias to know that it's way too common.
2013/5/31 Masklinn <masklinn@masklinn.net>
On 2013-05-31, at 20:43 , Andrew Barnert wrote:
    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET
Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
That seems rife with potential issues and unintended side-effects e.g. while lxml does for the most part provide ET's API, it also extends it and I do not know if it can run ET's testsuite. It also doesn't handle ET's implementation details for obvious reasons.
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard. unsure if and how to test the latter (surely running a testsuite when something wants to register isn’t practical – or is it?)
On 2013-05-31, at 21:39 , Philipp A. wrote:
2013/5/31 Masklinn <masklinn@masklinn.net>
On 2013-05-31, at 20:43 , Andrew Barnert wrote:
    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET
Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
That seems rife with potential issues and unintended side-effects e.g. while lxml does for the most part provide ET's API, it also extends it and I do not know if it can run ET's testsuite. It also doesn't handle ET's implementation details for obvious reasons.
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard.
But the issue here is that that's not sufficient, as I tried to point out: when somebody uses ElementTree they may be using more than just the API, they may well be relying on implementation details (e.g. namespace behavior or the _namespace_map). It might be bad, but it still happens. A lot. Hell, I've used _namespace_map in the past because I had to (wanted to transform a maven script and some script down the line, maybe maven itself, wanted exactly the `mvn` namespace alias). This will usually be safe, especially with old packages with low to no evolution. But if you start swapping things with "API-compatible" libraries, all bets are off.
unsure if and how to test the latter (surely running a testsuite when something wants to register isn’t practical – or is it?)
From: Masklinn <masklinn@masklinn.net> Sent: Friday, May 31, 2013 1:45 PM
On 2013-05-31, at 21:39 , Philipp A. wrote:
2013/5/31 Masklinn <masklinn@masklinn.net>
On 2013-05-31, at 20:43 , Andrew Barnert wrote:

    try:
        from lxml import etree as ET
    except ImportError:
        from xml.etree import ElementTree as ET

Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.
That seems rife with potential issues and unintended side-effects e.g. while lxml does for the most part provide ET's API, it also extends it and I do not know if it can run ET's testsuite. It also doesn't handle ET's implementation details for obvious reasons.
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard.
But that's not sufficient is the issue here, as I tried to point out when somebody uses ElementTree they may be using more than just the API, they may well be relying on implementation details (e.g. namespace behavior or the _namespace_map).
In Philipp A.'s original suggestion, he made it explicitly clear that it should be easy to pick the best-installed, but also easy to pick a specific implementation. And I brought up ElementTree specifically because the same is true there: some people just want the best-installed (the ones who are currently doing the try:/except ImportError:, or would be if they knew about it), others specifically want one or the other.

Maybe I didn't make things clear enough with my examples, so let's look at one of them, dbm, in more depth. Plenty of code just wants any implementation of the core API. For that case, you import dbm (or anydbm, in 2.x). But there's also code that specifically needs the traditional 100%-ndbm-compatible implementation. For that case, you import dbm.ndbm (or dbm, in 2.x). And there's code that specifically needs the extended functionality of gdbm, and is willing to deal with the possibility that it isn't available. For that case, you import dbm.gnu (or gdbm, in 2.x).

So, what would be the equivalent for ElementTree? Code that just wants the best implementation of the core API can import xml.etree (or xml.etree.AnyElementTree, if you like the 2.x naming better). Code that specifically needs the traditional implementation, e.g., to use _namespace_map, would import xml.etree.ElementTree. Code that specifically needs the extended functionality of lxml, and is willing to deal with the possibility that it isn't available, can import lxml.etree.

Notice that this is no change from the present day at all for the latter two cases; the only thing it changes is the first case, which goes from a try/import/except ImportError/import to a simple import. And all existing code would continue to work exactly the same way it always has. So, unless your worry is that it will be an attractive nuisance causing people to change their code to import xml.etree and try to use _namespace_map anyway and not know why they're getting NameErrors, I don't understand what you're arguing about.
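In today's stdlib, those three levels of dbm access look like this:

    import dbm                          # the "best available" wrapper
    db = dbm.open('example.db', 'c')    # picks whichever backend is installed
    db[b'key'] = b'value'
    db.close()

    import dbm.ndbm                     # specifically the ndbm-compatible module
    import dbm.gnu                      # specifically gdbm; ImportError if missing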
From: Philipp A. <flying-sheep@web.de> Sent: Friday, May 31, 2013 12:39 PM
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard.
unsure if and how to test the latter (surely running a testsuite when something wants to register isn’t practical – or is it?)
After sleeping on this question, I'm not sure the best-implementation wrapper really needs to be that dynamic. There are only so many YAML libraries out there, and the set doesn't change that rapidly. (And the same is true for dbm, ElementTree, etc.) So, maybe we just put a static list in yaml, like the one in dbm (http://hg.python.org/cpython/file/3.3/Lib/dbm/__init__.py#l41):

    _names = ['pyyaml', 'yaml.yaml']

If a new implementation reaches maturity, it goes through some process TBD, and in 3.5.2, we change that one line to _names = ['pyyaml', 'yayaml', 'yaml.yaml']. And that's how you "register" a new implementation.

The only real downside I can see is that some people might stick to 3.5.1 for a long time after 3.5.2 is released (maybe because it comes pre-installed with OS X 10.9 or RHEL 7.2 ESR or something), but still want to accept yayaml. If that's really an issue, someone could put an "anyyaml" backport project on PyPI that offered the latest registered name list to older versions of Python (maybe even including 2.7).

Is that good enough?
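A minimal sketch of that static-list lookup, mirroring dbm's approach (the module names are hypothetical):

    import importlib

    _names = ['pyyaml', 'yaml.yaml']  # updated by hand each release

    def _best_implementation():
        for name in _names:
            try:
                return importlib.import_module(name)
            except ImportError:
                continue
        raise ImportError('no usable YAML implementation found')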
On 1 Jun, 2013, at 3:18, Andrew Barnert <abarnert@yahoo.com> wrote:
From: Philipp A. <flying-sheep@web.de> Sent: Friday, May 31, 2013 12:39 PM
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard.
unsure if and how to test the latter (surely running a testsuite when something wants to register isn’t practical – or is it?)
After sleeping on this question, I'm not sure the best-implementation wrapper really needs to be that dynamic. There are only so many YAML libraries out there, and the set doesn't change that rapidly. (And the same is true for dbm, ElementTree, etc.)
So, maybe we just put a static list in yaml, like the one in dbm (http://hg.python.org/cpython/file/3.3/Lib/dbm/__init__.py#l41):

    _names = ['pyyaml', 'yaml.yaml']

If a new implementation reaches maturity, it goes through some process TBD, and in 3.5.2, we change that one line to _names = ['pyyaml', 'yayaml', 'yaml.yaml']. And that's how you "register" a new implementation.
The only real downside I can see is that some people might stick to 3.5.1 for a long time after 3.5.2 is released (maybe because it comes pre-installed with OS X 10.9 or RHEL 7.2 ESR or something), but still want to accept yayaml. If that's really an issue, someone could put an "anyyaml" backport project on PyPI that offered the latest registered name list to older versions of Python (maybe even including 2.7).
Is that good enough?
Please don't. I have no particular opinion on adding YAML support to the stdlib, but if support is added it should be usable on its own, and users shouldn't have to rely on 3rd-party libraries for serious use. That is, the stdlib version should be complete and fast enough (for some definition of fast enough).

Having the stdlib depend on "random" 3rd-party libraries is IMHO a code smell and makes debugging issues harder (why does my script that only uses the stdlib work on machine 1 but not on machine 2... oops, one of the machines has some 3rd-party yaml implementation that hides a bug in the stdlib even though I don't use it explicitly).

BTW. That doesn't mean the stdlib version should contain as many features as possible. Compare this with XML parsing: the xml.etree implementation is quite usable on its own, but sometimes you need more advanced XML features and then you can use lxml which has a similar API but a lot more features.

Ronald
On 06/04/2013 06:00 AM, Ronald Oussoren wrote:
On 1 Jun, 2013, at 3:18, Andrew Barnert <abarnert@yahoo.com> wrote:
From: Philipp A. <flying-sheep@web.de> Sent: Friday, May 31, 2013 12:39 PM
and that’s where my idea’s “strict API” comes into play: compatible implementations would *have to* pass a test suite and implement a certain API and comply with the standard.
unsure if and how to test the latter (surely running a testsuite when something wants to register isn’t practical – or is it?)
After sleeping on this question, I'm not sure the best-implementation wrapper really needs to be that dynamic. There are only so many YAML libraries out there, and the set doesn't change that rapidly. (And the same is true for dbm, ElementTree, etc.)
So, maybe we just put a static list in yaml, like the one in dbm (http://hg.python.org/cpython/file/3.3/Lib/dbm/__init__.py#l41):

    _names = ['pyyaml', 'yaml.yaml']

If a new implementation reaches maturity, it goes through some process TBD, and in 3.5.2, we change that one line to _names = ['pyyaml', 'yayaml', 'yaml.yaml']. And that's how you "register" a new implementation.
The only real downside I can see is that some people might stick to 3.5.1 for a long time after 3.5.2 is released (maybe because it comes pre-installed with OS X 10.9 or RHEL 7.2 ESR or something), but still want to accept yayaml. If that's really an issue, someone could put an "anyyaml" backport project on PyPI that offered the latest registered name list to older versions of Python (maybe even including 2.7).
Is that good enough?
Please don't. I have no particular opinion on adding YAML support to the stdlib, but if support is added it should be usable on its own, and users shouldn't have to rely on 3rd-party libraries for serious use. That is, the stdlib version should be complete and fast enough (for some definition of fast enough).
Having the stdlib depend on "random" 3rd-party libraries is IMHO a code smell and makes debugging issues harder (why does my script that only uses the stdlib work on machine 1 but not on machine 2... oops, one of the machines has some 3rd-party yaml implementation that hides a bug in the stdlib even though I don't use it explicitly).
BTW. That doesn't mean the stdlib version should contain as many features as possible. Compare this with XML parsing: the xml.etree implementation is quite usable on its own, but sometimes you need more advanced XML features and then you can use lxml which has a similar API but a lot more features.
I completely agree with Ronald here. I don't see the need for a complicated "yaml parser registry". If you want the one in the stdlib (if it ever exists), then use it. Otherwise, use a different one. As with XML parsing, this doesn't mean they all can't share an API. Eric.
2013/5/31 Andrew Barnert <abarnert@yahoo.com>
But YAML is up to 1.2, and I believe most libraries (including PyYAML) only handle 1.1 so far. There are also known bugs in the 1.1 specification (e.g., "." is a valid float literal, but doesn't specify 0.0 or any other valid value), that each library has to work around. There are features of the standard, such as YAML<->XML bindings, that are still in early stages of design. Maybe a YAML-1.1-as-interpreted-by-the-majority-of-the-quasi-reference-implementations library doesn't need to evolve, but a YAML library does.
afaik YAML 1.2 exists to clarify those mentioned bugs, since they have all been found and people needed a bug-free standard. also, could you mention a bug with a non-obvious solution? i don’t think any yaml implementation is going to interpret “.” as something other than 0.0

Are you suggesting importing PyYAML (in modified form, and without the libyaml-binding "fast" implementation) into the stdlib, or building a new one? If the former, have you talked to Kirill Simonov? If the latter, are you proposing to build it, or just suggesting that it would be nice if somebody did?
I don’t know: would it aid my argument if I had asked him or written my own? (I’ve done neither, because unless Guido says “I can see YAML in the stdlib”, it would be pointless IMHO.)

Do you mean adding a load_iter() akin to load_all(), except that it yields one document at a time, or a SAX-like API instead of a simple load()?
No, I meant that the lexer should be a generator (e.g. “[int(token) for token in YAMLLexer(open('myfile.yml')).lex()]”) and/or an API accepting incomplete YAML chunks and emitting tokens, like “for token in lexer.feed(stream.read())”. But what you said is also necessary for the format: lexing a long stream of documents coming in over the network doesn’t make sense any other way.
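A toy sketch of that feed()-style, generator-based lexing; the token shape and the YAMLLexer class are invented for illustration, and a real lexer would follow the YAML grammar:

class YAMLLexer:
    def __init__(self):
        self._buffer = ''

    def feed(self, chunk):
        # accept an incomplete YAML chunk; yield tokens as lines complete
        self._buffer += chunk
        while '\n' in self._buffer:
            line, self._buffer = self._buffer.split('\n', 1)
            stripped = line.lstrip(' ')
            if stripped:
                yield (len(line) - len(stripped), stripped)  # (indent, text)

lexer = YAMLLexer()
for token in lexer.feed('a: 1\nb: 2\n'):
    print(token)  # (0, 'a: 1') then (0, 'b: 2')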
Your registration mechanism would mean they don't have to do this; they just import from the stdlib, and if lxml is present and registered, it would be loaded instead.

Exactly.

There are a few examples of something similar, both in and out of the stdlib. For example:
The dbm module basically works like this: you can import dbm.ndbm, or you can just import dbm to get the best available implementation. That isn't done by hooking the import, but rather by providing a handful of wrapper functions that forward to the chosen implementation. Is that reasonable for YAML, or are there too many top-level functions or too much module-level global state or something?
I think so: as I said, we’d need to define an API. Since it’s “just” a serialization language, I think we could go with not much more than:

- load(fileobj_or_filename, safe=True) # maybe better than an unsafe_* variant for each load-like function
- load_iter(fileobj_or_filename, safe=True)
- loads(text, safe=True)
- loads_iter(text, safe=True)
- dump()
- dumps()
- YAMLLexer # with some methods and defined constructors
- YAMLParser # accepting tokens from the lexer
- YAMLTokens # one of the new, shiny enums

BeautifulSoup

[…]

tulip

Also nice ideas.
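Spelled out as a module skeleton, that API might look like the following; all signatures are proposals from this thread, not an existing library's interface:

from enum import Enum

class YAMLTokens(Enum):
    # illustrative subset; the real token set would follow the YAML spec
    SCALAR = 1
    MAPPING_START = 2
    MAPPING_END = 3

def load(fileobj_or_filename, safe=True):
    """Parse one document; safe=True refuses arbitrary object construction."""

def load_iter(fileobj_or_filename, safe=True):
    """Yield documents one at a time from a multi-document stream."""

def loads(text, safe=True):
    """Like load(), but parse from a string."""

def dump(obj, fileobj_or_filename, safe=True):
    """Serialize obj to YAML."""

def dumps(obj, safe=True):
    """Like dump(), but return a string."""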
What all of these are missing is a way for an unknown third-party implementation to plug itself in as the new best. Of course you can always monkeypatch it at runtime (dbm._names.insert(0, __name__)), but you want to do it at _install_ time, which is a different story.
One further issue is that sometimes the system administrator (or end user) might want to affect the default choice for programs running on his machine. For example, lxml is built around libxml2. Mac OS X 10.6, some linux distros, etc. come with old or buggy versions of libxml2. You might want to install lxml anyway and make it the default for BeautifulSoup, but not for ElementTree, or vice-versa.
Finally, what happens if you install two modules on your system which both register as implementations of the same module?
I think we can’t allow them to modify some system-global list, since everything would install itself as #1, so it would be pointless. I don’t know how to select one, but we should expose a system-wide way to configure the used one (like .pythonrc?), as well as a way to directly use a specific one from Python (as said above). Then it wouldn’t matter much, since the admin is only required to install one, or to configure the system to use the preferred one. The important things are IMHO to make the system discoverable and transparent, exposing the found implementations and the used one as well as we can.

Note that JSON is a strict subset of YAML 1.2, and not too far from a subset of 1.1. So, you could propose exclusive YAML, and make sticking within the JSON schema and syntax required for packages compatible with Python 3.3 and earlier, but optional for 3.4+ packages.

Yeah, pretty nice. But I don’t think a stdlib yaml can land before 3.5.
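One hypothetical shape for such a system-wide override: check an environment variable (the name below is invented) before falling back to the built-in preference list:

import importlib
import os

def select_implementation(names):
    # an admin or user can pin a backend; otherwise use the preference list
    forced = os.environ.get('PYTHONYAMLIMPL')  # hypothetical variable name
    candidates = ([forced] if forced else []) + list(names)
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError('no YAML implementation available')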
On Sat, Jun 1, 2013 at 2:35 AM, Philipp A. <flying-sheep@web.de> wrote:
Hi, reading PEP 426, I made a connection to a (IMHO) longstanding issue: YAML not being in the stdlib.
As MAL already noted, PEP 426 defines a data interchange format for automated tools. Build tools should not (and likely will not) use it as an input format (JSON is a human *readable* serialisation format, but it's a bit of a stretch to call it human *editable* - editing JSON files directly is in some respects even more painful than writing XML by hand). While having a YAML parser in the standard library is a defensible idea, it still wouldn't make any difference to the packaging metadata standards. Those are going to be strictly JSON, as it is a simpler format and there is no technical reason to allow YAML for files that should *never* be edited by hand.
From my perspective, at this point in time, you have 3 reasonable choices for storing application configuration data on disk:
* .ini syntax
  * easy for humans to read and write
  * stdlib support
  * can't handle structured data without awkward hacks
  * relatively simple
* JSON
  * easy for humans to read, hard for humans to write
  * stdlib support
  * handles structured data
  * relatively simple
* YAML
  * easy for humans to read and write
  * no stdlib support
  * handles structured data
  * relatively complex

(Historically, XML may also have been on that list, but in most new code, JSON or YAML will be a better choice wherever XML would have previously been suitable.)

YAML's complexity is the reason I prefer JSON as a data interchange format, but I still believe YAML fills a useful niche for more complex configuration files, where .ini syntax is too limited and JSON is too hard to edit by hand. Thus, for me, I answer "Is it worth supporting YAML in the standard library?" with a definite "Yes" (as a more complex configuration file format for the cases .ini syntax can't readily handle).

That answer leaves the key open questions as:

* whether or not there are any YAML implementations for Python that are sufficiently mature for one to be locked into the standard library's 18-24 month upgrade cycle and 6 month bugfix cycle
* whether including such a library would carry an acceptable increase in our risk of security vulnerabilities
* whether the authors/maintainers of any such library are prepared to accept the implications of standard library inclusion.

Given the mess that was the partial inclusion of PyXML, the explicit decision to disallow any future externally maintained libraries (see PEP 360) and the existing proposal to include a pip bootstrapping mechanism in Python 3.4 (see PEP 439), I have my doubts that Python 3.4 is the right time to be including a potentially volatile library, even if providing a YAML parser as an included battery is a good idea in the long run.

Cheers, Nick.

-- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
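For concreteness, here is the same tiny configuration in the three formats being compared. The .ini and JSON versions parse with today's stdlib; the YAML parse is left commented out since it would need a third-party library such as PyYAML:

import configparser
import json

INI_TEXT = """\
[server]
host = example.com
port = 8080
"""

JSON_TEXT = '{"server": {"host": "example.com", "port": 8080}}'

YAML_TEXT = """\
server:
  host: example.com
  port: 8080
"""

cp = configparser.ConfigParser()
cp.read_string(INI_TEXT)
assert cp['server']['port'] == '8080'  # .ini values are plain strings

assert json.loads(JSON_TEXT)['server']['port'] == 8080  # JSON keeps the int

# import yaml  # PyYAML or a hypothetical stdlib module
# assert yaml.safe_load(YAML_TEXT)['server']['port'] == 8080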
On Jun 1, 2013, at 7:04, Nick Coghlan <ncoghlan@gmail.com> wrote:
Given the mess that was the partial inclusion of PyXML, the explicit decision to disallow any future externally maintained libraries (see PEP 360) and the existing proposal to include a pip bootstrapping mechanism in Python 3.4 (see PEP 439), I have my doubts that Python 3.4 is the right time to be including a potentially volatile library, even if providing a YAML parser as an included battery is a good idea in the long run.
For the record, the OP (Philipp) was thinking 3.5 or later. I'm the one who made the assumption he wanted this in 3.4, and he immediately corrected me. Also, I get the impression that he wants to define a new API which doesn't match any of the existing libraries, and not add anything to the stdlib until one of the existing (fast) libraries has an adaptation to the new API, which means it's clearly not an immediate-term goal. He just wants to get some consensus that this is a good idea before starting the work of defining that API, finding or building a reference implementation in pure Python, and convincing the existing library authors to adapt to the standard API. If I've interpreted him wrong, I apologize.
On 01.06.13 18:42, Andrew Barnert wrote:
On Jun 1, 2013, at 7:04, Nick Coghlan ... wrote:
... I have my doubts that Python 3.4 is the right time to be including a potentially volatile library, even if providing a YAML parser as an included battery is a good idea in the long run.
For the record, the OP (Phillip) was thinking 3.5 or later. I'm the one who made the assumption he wanted this in 3.4, and he immediately corrected me.
At least I spotted no '3.4' in the first mail of this thread.
Also, I get the impression that he wants to define a new API which doesn't match any of the existing libraries, and not add anything to the stdlib until one of the existing (fast) libraries has an adaptation to the new API, which means it's clearly not an immediate-term goal. He just wants to get some consensus that this is a good idea before starting the work of defining that API, finding or building a reference implementation in pure Python, and convincing the existing library authors to adapt to the standard API. ...
That is what I also distilled (after removing some add-on topics). I am +1 both on designing a new API (which I am willing to support) and on nudging the existing libraries to adopt it **before** lobbying for stdlib inclusion. Personally, I also perceive YAML, as Nick put it, as sitting between .ini and JSON. All the best, Stefan
On 2 Jun 2013 02:43, "Andrew Barnert" <abarnert@yahoo.com> wrote:
On Jun 1, 2013, at 7:04, Nick Coghlan <ncoghlan@gmail.com> wrote:
Given the mess that was the partial inclusion of PyXML, the explicit decision to disallow any future externally maintained libraries (see PEP 360) and the existing proposal to include a pip bootstrapping mechanism in Python 3.4 (see PEP 439), I have my doubts that Python 3.4 is the right time to be including a potentially volatile library, even if providing a YAML parser as an included battery is a good idea in the long run.
For the record, the OP (Philipp) was thinking 3.5 or later. I'm the one who made the assumption he wanted this in 3.4, and he immediately corrected me.

Also, I get the impression that he wants to define a new API which doesn't match any of the existing libraries, and not add anything to the stdlib until one of the existing (fast) libraries has an adaptation to the new API, which means it's clearly not an immediate-term goal. He just wants to get some consensus that this is a good idea before starting the work of defining that API, finding or building a reference implementation in pure Python, and convincing the existing library authors to adapt to the standard API. If I've interpreted him wrong, I apologize.

Ah, I missed that. If the target time frame is 3.5 and the API design goals include "secure by default, full power of YAML when requested" then it sounds like a fine idea to try. Cheers, Nick.
On Sat, Jun 1, 2013 at 10:04 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
YAML's complexity is the reason I prefer JSON as a data interchange format, but I still believe YAML fills a useful niche for more complex configuration files where .ini syntax is too limited and JSON is too hard to edit by hand.
YAML, unlike JSON, is able to represent non-tree data (it can store any reference graph). At this point the only stdlib module that can do that is pickle, which would ordinarily be precluded by security concerns. So adding YAML would be excellent even as a data interchange format in the stdlib -- no stdlib module currently occupies this particular intersection of functionality (nominal security plus object graphs). I wrote a Python-Ideas post in the past about including something in-between json and pickle on the power spectrum; it didn't go over well, but I figured I'd mention this point again anyway. Sorry. -- Devin
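For example, YAML's anchors and aliases express shared references directly, and PyYAML (third-party) resolves them to one and the same Python object:

import yaml  # PyYAML, assumed installed; a stdlib yaml could do the same

doc = """
parent: &p
  name: root
children:
  - {name: a, parent: *p}
  - {name: b, parent: *p}
"""
data = yaml.safe_load(doc)
# both children share one parent object, not two equal copies
assert data['children'][0]['parent'] is data['children'][1]['parent']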
On Mon, Jun 3, 2013 at 10:16 AM, Devin Jeanpierre <jeanpierreda@gmail.com> wrote:
On Sat, Jun 1, 2013 at 10:04 AM, Nick Coghlan <ncoghlan@gmail.com> wrote:
YAML's complexity is the reason I prefer JSON as a data interchange format, but I still believe YAML fills a useful niche for more complex configuration files where .ini syntax is too limited and JSON is too hard to edit by hand.
YAML, unlike JSON, is able to represent non-tree data (it can store any reference graph). At this point the only stdlib module that can do that is pickle, which would ordinarily be precluded by security concerns. So adding YAML would be excellent even as a data interchange format in the stdlib -- no stdlib module currently occupies this particular intersection of functionality (nominal security plus object graphs).
Note that while JSON itself doesn't handle arbitrary reference graphs, people actually *store* complex graph structures in JSON all the time. It's just that there are no standard syntax/semantics for doing so (logging.dictConfig, for example, uses context dependent named references, while jsonschema uses $ref fields). Cheers, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
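A sketch of that named-reference idea in plain JSON; the "$ref" convention below is invented for illustration and is not jsonschema's actual resolution machinery:

import json

doc = json.loads("""
{"objects": {"p": {"name": "root"}},
 "children": [{"name": "a", "parent": {"$ref": "p"}},
              {"name": "b", "parent": {"$ref": "p"}}]}
""")

def resolve(node, table):
    # replace {"$ref": key} markers with the shared object from table
    if isinstance(node, dict):
        if set(node) == {"$ref"}:
            return table[node["$ref"]]
        return {k: resolve(v, table) for k, v in node.items()}
    if isinstance(node, list):
        return [resolve(v, table) for v in node]
    return node

children = resolve(doc["children"], doc["objects"])
# the graph is restored: both children share one parent object
assert children[0]["parent"] is children[1]["parent"]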
Nick Coghlan <ncoghlan@...> writes:
Note that while JSON itself doesn't handle arbitrary reference graphs, people actually *store* complex graph structures in JSON all the time. It's just that there are no standard syntax/semantics for doing so (logging.dictConfig, for example, uses context dependent named references, while jsonschema uses $ref fields).
A more generalised version of the dictConfig approach, using JSON-serialisable dictionaries to describe more complex object graphs (including cross-references), is described here: http://pymolurus.blogspot.co.uk/2013/04/using-dictionaries-to-configure-obje... Regards, Vinay Sajip
At this point I would argue against any new modules by default. How does the cost and liability of yaml in the standard library make up for such a minor benefit as excluding one little line from my requirements.txt?
On Tue, Jun 4, 2013 at 4:26 AM, Calvin Spealman <ironfroggy@gmail.com> wrote:
At this point I would argue against any new modules by default. How does the cost and liability of yaml in the standard library make up for such a minor benefit as excluding one little line from my requirements.txt?
Disagree. Lots of users don't have the luxury of running in an environment where internet dependencies can be automatically installed. For many (probably most non-developer) users, there is a *huge* gap between "is in the stdlib" and "requires internet access to install". -- --Guido van Rossum (python.org/~guido)
participants (19): Andrew Barnert, Antoine Pitrou, Brett Cannon, Bruce Leban, Calvin Spealman, Daniel Holth, Devin Jeanpierre, Eric V. Smith, Ethan Furman, Guido van Rossum, M.-A. Lemburg, Masklinn, Nick Coghlan, Philipp A., Ronald Oussoren, Stefan Drees, Vinay Sajip, Vito De Tullio, Yuval Greenfield