[Python-ideas] Stdlib YAML evolution (Was: PEP 426, YAML in the stdlib and implementation discovery)

Mon Jun 3 01:53:55 CEST 2013

From: anatoly techtonik <techtonik at gmail.com>
Sent: Sunday, June 2, 2013 11:23 AM

>FWIW, I am +1 for the ability to read YAML-based configs in Python
>without dependencies, but waiting for several years is hard.

With all due respect, I don't think you've read even a one-sentence description of YAML, so your entire post is nonsense.

The first sentence of the abstract says, "YAML… is a…data serialization language designed around the common native data types of agile programming languages." So, your idea that we shouldn't use it for serialization, and shouldn't map it to native Python data types, is ridiculous.

You specifically suggest mapping YAML to XML so we can treat it as a structured document. From the "Relation to XML" section: "YAML is primarily a data serialization language. XML was… designed to support structured documentation."

You suggest that we shouldn't build all of YAML, just some bare-minimum subset that's good enough to get started. JSON is already _more_ than a bare-minimum subset of YAML, so we're already done.
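To illustrate that point, here is a minimal sketch, assuming YAML 1.2 (which was designed so that JSON documents are also valid YAML). The config text and the `load_config` name are made up for this example; only the stdlib `json` module is real.

```python
import json

# A made-up package config written in JSON syntax. Assuming YAML 1.2,
# a YAML parser would accept this same text unchanged.
CONFIG_TEXT = """
{
  "name": "example",
  "version": "1.0",
  "requires": ["a", "b"],
  "zip_safe": true
}
"""


def load_config(text):
    """Parse the JSON subset of YAML using nothing but the stdlib."""
    return json.loads(text)


config = load_config(CONFIG_TEXT)
print(config["requires"])
```

So anyone who only needs dependency-free config reading can write their configs in that subset today.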

But you'd also like some data-driven way to extend this. YAML has already designed exactly that. Once you have the core schema, you can add new types, and the syntax for those types is data-driven. (The semantics are really only defined in hand-wavy English and probably require code to implement, but I'm not sure how you expect your proposal to be any different, unless you're proposing something like XML Schema.) So, either the necessary subset of YAML you want is the entire spec, or you want to do an equal amount of work building something just as complex but not actually YAML.
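A rough sketch of why the semantics end up in code: the document can name a type with a tag, but something still has to turn the tagged text into an object. Everything here is hypothetical (the `!date` and `!fraction` tag names, the `CONSTRUCTORS` registry, `register` and `construct`); it is not the API of any existing YAML library.

```python
from datetime import date
from fractions import Fraction

# Hypothetical registry: tag name (data) -> constructor function (code).
CONSTRUCTORS = {}


def register(tag):
    """Decorator that associates a tag name with a constructor."""
    def decorator(func):
        CONSTRUCTORS[tag] = func
        return func
    return decorator


@register("!date")
def construct_date(text):
    # date.fromisoformat needs Python 3.7 or later.
    return date.fromisoformat(text)


@register("!fraction")
def construct_fraction(text):
    return Fraction(text)


def construct(tag, text):
    """Build a native object from a tagged scalar, or fail loudly."""
    if tag not in CONSTRUCTORS:
        raise ValueError("no constructor registered for tag %r" % (tag,))
    return CONSTRUCTORS[tag](text)


print(construct("!date", "2013-06-02"))
print(construct("!fraction", "3/4"))
```

The tag names are data, but each new type still costs a function like `construct_date`, which is the part no spec can write for you.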

The idea of building a useful subset of YAML isn't a bad one. But the way to do that is to go through the features of YAML that JSON doesn't have, and decide which ones you want. For example, YAML with the core schema, but no aliases, no plain strings, and no explicit tags is basically JSON with indented block structure, raw strings, and useful synonyms for key constants (so you can write True instead of true). You could even carefully import a few useful definitions from the type library as long as they're unambiguous (e.g., timestamp). That gives you most of the advantages of YAML that don't bring any safety risks, and its output would be interpretable as full YAML, and it might be a little easier to implement than the full spec. But that has very little to do with your proposal. In particular, leaving out the data-driven features of YAML is what makes it safe and simple.
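For what it's worth, the "synonyms for key constants, but no plain strings" part of such a subset could look something like the sketch below. It is a simplification under my own assumptions: it handles only the null and boolean spellings of the YAML 1.2 core schema plus decimal integers and floats (the core schema also has octal, hex, infinity and NaN forms), and `resolve_plain_scalar` is a made-up name.

```python
import re

# Null and boolean spellings from the YAML 1.2 core schema.
_CONSTANTS = {
    "null": None, "Null": None, "NULL": None, "~": None,
    "true": True, "True": True, "TRUE": True,
    "false": False, "False": False, "FALSE": False,
}

# Decimal numbers only; a deliberate simplification of the core schema.
_INT = re.compile(r"[-+]?[0-9]+\Z")
_FLOAT = re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?\Z")


def resolve_plain_scalar(text):
    """Resolve an unquoted scalar, rejecting anything string-like."""
    if text in _CONSTANTS:
        return _CONSTANTS[text]
    if _INT.match(text):
        return int(text)
    if _FLOAT.match(text):
        return float(text)
    # In this subset strings must be quoted, so a bare word is an error
    # rather than something to guess at.
    raise ValueError("plain strings are not allowed: %r" % (text,))


print(resolve_plain_scalar("True"))
print(resolve_plain_scalar("42"))
```

Rejecting bare words outright is what removes the guessing: an unquoted value is either one of a short list of constants, a number, or an error.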

Meanwhile, I think what you actually want is XSLT processors to convert YAML to and from XML. Fortunately, the YAML community is already working on that at http://www.yaml.org/xml. Then you don't need any new Python code at all; just convert your YAML to XML and use whichever XML library (in the stdlib or not), and you're done.

>Maybe try an alternative data driven development process as opposed
>to traditional PEP based all-or-nothing style to speed up the process?
>It is possible to make users happy incrementally and keep development
>fun without sacrificing too much on the Zen side. If open source is 
>about scratching your own itches, then the most effective way to
>implement a spec would be to allow people to add support for their own
>flavor without disrupting the work of others.
>
>For some reason I think most people don't need full YAML
>speccy, especially if final implementation will be slow and heavy.
>
>So instead of:
>    import yaml
>I'd start with something more generic and primitive:
>    from datatrans import yamlish
>
>Where `datatrans` is a data transformation framework taking care
>of the usual text parsing (data partitioning), partition mapping (structure
>transformation) and conversion (binary to string etc.), trying to be as fast
>and lightweight as possible, leaving a vast field for future optimizations on
>the algorithmic level. `yamlish` is an implementation which is not vastly
>optimized (datatrans to yamlish is like RPython to PyPy) and can be
>easily extended to cover more YAML (hopefully). Hence the name:
>it doesn't pretend to parse YAML; it parses some supported subset,
>which is improved over time by different parties (if datatrans is done right
>to provide readable (maintainable + extensible) code for implementation).
>
>There is an existing package called `yamlish` on PyPI - I am not talking
>about it - it is PyYAML based, which is not an option for now as I see it.
>So I stole its name. Sorry. This PyPI package was used to parse TAP
>format, which is, again, a subset. Subset...
>
>It appears that YAML is good for humans for its subsets. It leaves an
>impression (maybe it's just an illusion) that development work for subset
>support can also be partitioned. If `datatrans` "done right" is possible, it
>will allow incremental addition of new YAML features as the need for
>them arises (new data examples are added). Or it can help to build
>parsers for YAML subsets that are intentionally limited to make them
>performance efficient.
>
>Because `datatrans` is a package isolating parsing, mapping and
>conversion parts of the process to make it modular and extensible, it can
>serve as a reference point for various kinds of (scientific) papers including
>the ones that prove that such a data transformation framework is impossible.
>As for the `yamlish` submodule, the first important paper covering it will be a
>reference matrix of supported features.
>
>While it all sounds too complicated, driving development by data and real
>short term user needs (as opposed to designing everything upfront) will
>make the process more attractive. In data driven development, there are not
>many things that can break - you either continue parsing previous data or
>not. The output from the parsing process may change over time, but it may
>be controlled by configuring the last step of the data transformation phase.
>
>`Parsing AppEngine config file` or `reading package meta data`
>are good starting points. Once the package meta data subset is parsed, it is
>done and won't break. The implementation for meta data parsing may mature
>in the distutils package, for AppEngine in its SDK, and be merged in either of
>those, with patches for `datatrans` sent to the stdlib. The only question is how
>to design the output format for the parse stage. I am not sure everything should
>be convertible into Python objects using the "best fit" technique.
>
>I will be pretty comfortable if the target format is not native Python objects
>at all. More than that - I would even insist on avoiding conversion to native
>Python objects from the start. The ideal output for the first version should be
>a generic tree structure with defined names for YAML elements. A tree that can
>be represented as XML where these names are tags. The tree can
>therefore be traversed and selected using XPath/jQuery syntax.
>
>It will take several years for the implementation to mature and in the end there
>will be plenty of backward compatibility matters with the API, formatting
>and serializing. So the first thing I'd do is [abandon serialization]. From the
>point of view of my self-proclaimed data transformation theory, the input and
>output formats are data. If the output format is not human readable - as with
>some linked Python data structures in memory - it wastes time and hinders
>development. Serializing Python is a problem of a different level, which is an
>example of a binary, abstract, memory-only output format - a lot of properties
>that you don't want to deal with while working with data.
>
>To summarize:
>
>1. full spec support is not a goal
>2. data driven (using real world examples/stories)
>3. incremental (one example/story at a time)
>4. research based (beautiful ideas vs ugly real world limitations)
>5. maintainable (code is easy to read and understand the structure)
>6. extensible (easy to find out the place to be modified)
>7. core "generic tree" data structure as an intermediate format and
>    "yaml tree" data structure as a final format from parsing process
>
>P.S. I am willing to work on this "data transformation theory" stuff and
>       prototype implementation, because it is generally useful in many
>       areas. But I need support.
>-- 
>anatoly t.