
FWIW, I am +1 for the ability to read YAML based configs in Python without dependencies, but waiting for several years is hard. Maybe try an alternative, data driven development process, as opposed to the traditional PEP based all-or-nothing style, to speed things up? It is possible to make users happy incrementally and keep development fun without sacrificing too much on the Zen side. If open source is about scratching your own itches, then the most effective way to implement a spec is to let people add support for their own flavor without disrupting the work of others. For some reason I think most people don't need the full YAML spec, especially if the final implementation will be slow and heavy.

So instead of:

    import yaml

I'd start with something more generic and primitive:

    from datatrans import yamlish

Here `datatrans` is a data transformation framework that takes care of the usual text parsing (data partitioning), partition mapping (structure transformation) and conversion (binary to string etc.), trying to be as fast and lightweight as possible and leaving a wide field for future optimizations at the algorithmic level. `yamlish` is an implementation that is not heavily optimized (datatrans to yamlish is like RPython to PyPy) and can be easily extended to cover more of YAML (hopefully). Hence the name: it doesn't pretend to parse YAML, it parses some supported subset, which is improved over time by different parties (provided datatrans is done right and yields readable, i.e. maintainable and extensible, implementation code).

There is an existing package called `yamlish` on PyPI. I am not talking about it: it is PyYAML based, which is not an option for now as I see it. So I stole its name, sorry. That PyPI package was used to parse the TAP format, which is, again, a subset.

Subset... It appears that YAML is good for humans precisely because of its subsets. It leaves the impression (maybe it's just an illusion) that development work for subset support can also be partitioned. If a `datatrans` "done right" is possible, it will allow incremental addition of new YAML features as the need for them arises (as new data examples are added). Or it can help build parsers for YAML subsets that are intentionally limited to keep them fast. Because `datatrans` isolates the parsing, mapping and conversion parts of the process, making it modular and extensible, it can serve as a reference point for various kinds of (scientific) papers, including ones that prove that such a data transformation framework is impossible. As for the `yamlish` submodule, the first important paper covering it would be a reference matrix of supported features.

While it all sounds too complicated, driving development by data and real, short term user needs (as opposed to designing everything upfront) will make the process more attractive. In data driven development there are not many things that can break: you either continue to parse the previous data or you don't. The output of the parsing process may change over time, but that can be controlled by configuring the last step of the data transformation phase.

`Parsing AppEngine config file` or `reading package meta data` are good starting points. Once the package meta data subset is parsed, it is done and won't break. The implementation for meta data parsing may mature in the distutils package, the one for AppEngine in its SDK, and then be merged into either of those, with patches for `datatrans` going to the stdlib. The only question left is to design the output format for the parse stage.
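To make the "subset first" idea a bit more concrete, here is a runnable toy that handles only what simple package meta data or an app.yaml needs: flat `key: value` pairs and one level of `- item` lists, refusing everything else. All names in it are invented for illustration; this is not the proposed `datatrans` API, just a sketch of the incremental approach.

    def parse_yamlish(text):
        """Parse a tiny YAML subset into a list of [key, value] entries.

        Values are plain strings or lists of strings; anything outside the
        subset raises ValueError instead of guessing.
        """
        entries = []
        for lineno, line in enumerate(text.splitlines(), 1):
            if not line.strip() or line.lstrip().startswith('#'):
                continue  # skip blank lines and comments
            if line.lstrip().startswith('- '):
                # list item belonging to the previous "key:" line
                if not entries or not isinstance(entries[-1][1], list):
                    raise ValueError('stray list item on line %d' % lineno)
                entries[-1][1].append(line.lstrip()[2:].strip())
            elif ':' in line and not line.startswith((' ', '\t')):
                key, _, value = line.partition(':')
                value = value.strip()
                # an empty value opens a list, otherwise it is a scalar
                entries.append([key.strip(), value if value else []])
            else:
                raise ValueError('unsupported construct on line %d: %r'
                                 % (lineno, line))
        return entries

    meta = parse_yamlish(
        "name: example\n"
        "version: 0.1\n"
        "requires:\n"
        "  - setuptools\n"
    )
    # -> [['name', 'example'], ['version', '0.1'], ['requires', ['setuptools']]]

Nothing more than that is needed for the first data examples, and a new construct only gets added when a real config file demands it.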
I am not sure everything should be convertible into Python objects using the "best fit" technique. I would be quite comfortable if the target format were not native Python objects at all. More than that, I would even insist on avoiding conversion to native Python objects from the start. The ideal output for the first version should be a generic tree structure with defined names for YAML elements - a tree that can be represented as XML, where these names become tags. The tree can then be traversed and selected using XPath/jQuery-style syntax.

It will take several years for the implementation to mature, and in the end there will be plenty of backward compatibility concerns around the API, formatting and serializing. So the first thing I'd do is [abandon serialization]. From the point of view of my self-proclaimed data transformation theory, the input and output formats are data. If the output format is not human readable - such as linked Python data structures in memory - it wastes time and hinders development. Serializing Python objects is a problem on a different level; it is an example of a binary, abstract, memory-only output format - a lot of properties that you don't want to deal with while working with data.

To summarize:

1. full spec support is not a goal
2. data driven (using real world examples/stories)
3. incremental (one example/story at a time)
4. research based (beautiful ideas vs ugly real world limitations)
5. maintainable (code is easy to read and its structure easy to understand)
6. extensible (easy to find the place that needs to be modified)
7. a core "generic tree" data structure as the intermediate format and a "yaml tree" data structure as the final format of the parsing process

P.S. I am willing to work on this "data transformation theory" stuff and a prototype implementation, because it is generally useful in many areas. But I need support.
--
anatoly t.
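P.P.S. A minimal sketch of what the "generic tree" output from point 7 could look like, using the stdlib ElementTree so the tree stays human readable as XML and can be selected with its limited XPath support. The element names (yaml, map, entry, seq, scalar) are just one vocabulary invented here for illustration, not part of any spec or agreed format.

    import xml.etree.ElementTree as ET

    # Hand-built tree for the meta data example above; a real parser would
    # emit something like this instead of "best fit" native Python objects.
    root = ET.Element('yaml')
    mapping = ET.SubElement(root, 'map')

    entry = ET.SubElement(mapping, 'entry', key='name')
    ET.SubElement(entry, 'scalar').text = 'example'

    entry = ET.SubElement(mapping, 'entry', key='requires')
    seq = ET.SubElement(entry, 'seq')
    ET.SubElement(seq, 'scalar').text = 'setuptools'

    # Human readable intermediate format:
    print(ET.tostring(root, encoding='unicode'))

    # XPath-style selection instead of attribute access on Python objects:
    for node in root.findall(".//entry[@key='requires']/seq/scalar"):
        print(node.text)   # setuptools

Whether ElementTree is the right container is a separate question; the point is only that the intermediate format stays inspectable, and serialization of native Python objects stays out of the picture.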