Stdlib YAML evolution (Was: PEP 426, YAML in the stdlib and implementation discovery)
FWIW, I am +1 on the ability to read YAML-based configs in Python without dependencies, but waiting for several years is hard. Maybe try an alternative, data-driven development process as opposed to the traditional all-or-nothing PEP style, to speed things up? It is possible to make users happy incrementally and keep development fun without sacrificing too much on the Zen side. If open source is about scratching your own itches, then the most effective way to implement a spec would be to allow people to add support for their own flavor without disrupting the work of others.

For some reason I think most people don't need the full YAML spec, especially if the final implementation will be slow and heavy. So instead of:

    import yaml

I'd start with something more generic and primitive:

    from datatrans import yamlish

Here `datatrans` is a data transformation framework taking care of the usual text parsing (data partitioning), partition mapping (structure transformation) and conversion (binary to string etc.), trying to be as fast and lightweight as possible and leaving a vast field for future optimization at the algorithmic level. `yamlish` is an implementation that is not heavily optimized (datatrans is to yamlish as RPython is to PyPy) and can (hopefully) be easily extended to cover more of YAML. Hence the name: it doesn't pretend to parse YAML - it parses some supported subset, which is improved over time by different parties (if datatrans is done right, providing readable - maintainable and extensible - code for the implementation).

There is an existing package called `yamlish` on PyPI - I am not talking about it - it is PyYAML based, which is not an option for now as I see it. So I stole its name. Sorry. That PyPI package was used to parse the TAP format, which is, again, a subset.

It appears that YAML is good for humans because of its subsets. That leaves the impression (maybe it's just an illusion) that development work on subset support can also be partitioned. If a `datatrans` "done right" is possible, it will allow incremental addition of new YAML features as the need for them arises (as new data examples are added). Or it can help to build parsers for YAML subsets that are intentionally limited to make them performance efficient.

Because `datatrans` is a package isolating the parsing, mapping and conversion parts of the process to make it modular and extensible, it can serve as a reference point for various kinds of (scientific) papers, including ones that prove such a data transformation framework is impossible. As for the `yamlish` submodule, the first important paper covering it would be a reference matrix of supported features.

While it all sounds too complicated, driving development by data and real short-term user needs (as opposed to designing everything upfront) will make the process more attractive. In data-driven development there are not many things that can break - you either continue parsing previous data or you don't. The output of the parsing process may change over time, but that can be controlled by configuring the last step of the data transformation phase.

"Parsing an AppEngine config file" or "reading package metadata" are good starting points. Once the package metadata subset is parsed, it is done and won't break. The implementation for metadata parsing may mature in the distutils package, the AppEngine one in its SDK, and either of those can be merged later, sending patches for `datatrans` to the stdlib. The only open question is how to design the output format of the parse stage.
I am not sure everything should be convertible into Python objects using the "best fit" technique. I will be pretty comfortable if the target format is not native Python objects at all. More than that - I will even insist on avoiding conversion to native Python objects from the start. The ideal output for the first version would be a generic tree structure with defined names for YAML elements - a tree that could be represented as XML with these names as tags, and could therefore be traversed and queried with XPath/jQuery-style syntax.

It will take several years for an implementation to mature, and in the end there will be plenty of backward compatibility concerns around the API, formatting and serializing. So the first thing I'd do is [abandon serialization]. From the point of view of my self-proclaimed data transformation theory, the input and output formats are data. If the output format is not human readable - like linked Python data structures in memory - it wastes time and hinders development. Serializing Python is a problem of a different level: it is an example of a binary, abstract, memory-only output format - a lot of properties you don't want to deal with while working with data.

To summarize:
1. full spec support is not a goal
2. data driven (using real-world examples/stories)
3. incremental (one example/story at a time)
4. research based (beautiful ideas vs ugly real-world limitations)
5. maintainable (code is easy to read and its structure easy to understand)
6. extensible (it is easy to find the place to modify)
7. a core "generic tree" data structure as the intermediate format and a "yaml tree" data structure as the final format of the parsing process

P.S. I am willing to work on this "data transformation theory" stuff and a prototype implementation, because it is generally useful in many areas. But I need support.
--
anatoly t.
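To make the proposed output concrete, here is a minimal sketch of such a "generic tree" - every name in it is hypothetical (no `datatrans` package exists; `Node` and its kinds are made up purely for illustration):

    # Hypothetical "generic tree" intermediate format: every YAML element
    # becomes an untyped node, never a native Python object.
    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Node:
        kind: str                    # "map", "seq" or "scalar"
        name: Optional[str] = None   # mapping key, usable as an XML tag
        value: Optional[str] = None  # raw text for scalars, left unconverted
        children: List["Node"] = field(default_factory=list)

    # "timeout: 30" parses to a tree, not to {'timeout': 30}:
    doc = Node("map", children=[Node("scalar", name="timeout", value="30")])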
I would definitely like to have a YAML library--even one with restricted function--in the standard library. Anatoly's suggestion of 'yamlish' is fine. I have myself had to write a 'miniyaml.py' as part of one work project. We have some configuration files in a bit of software I wrote/maintain; these files are very much human edited, and their structure fits a subset of YAML very well. There are many things you can do in YAML that we do not need to do.

Using JSON, or INI format, or XML, or other options supported in the Python standard library would have been possible, but in all cases the configuration files would have been far less readable than the way I did it. Within the particular environment I'm supporting, it is possible but slightly cumbersome to install a 3rd party library like PyYAML. So I wrote <40 lines of code to support the subset we need, while keeping the API compatible.

There's nothing special about my implementation, and it's certainly not ready for inclusion in the stdlib, but the fact that I needed to write it is suggestive to me. Below is the docstring for my small module. Basically, ANY reasonable subset of YAML supported in the stdlib would also be a drop-in replacement for my trivial code. Well, I suppose that if its API were different from that of PyYAML, some small change might be needed, but it would certainly support my simple use case.
This module provides an implementation for a small subset of YAML
Only the constructs needed for parsing an "invariants" file are supported here. The only supported API function of the PyYaml library is 'load_all()'. However, within that restriction, the result returned by 'miniyaml.load_all()'--if loading a string that this module can parse--is intended to be identical to that returned by 'yaml.load_all()'.
The intended use of this module is with an import like:
import miniyaml as yaml
In the presence of an actual PyYaml installation, this can simply be instead:
import yaml
The parsed subset used in invariants files looks like the below. If multiline comments (with line breaks) are needed, the usual YAML construct is used. Each invariant block is a new YAML "document":
    Invariant: some_python_construct(anton.leaf.parameter) is something_else
    Comment: This describes the invariant in more human readable terms
    ---
    Invariant: isMultiple(DT, some.type.of.interval)
    Comment: |
      The interval should really be a multiple of timestep
      because of equation:
        sigma = epsilon^2 + (3*foo)^5
      And that's how it works.
    ---
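For illustration, a hedged sketch of what a `load_all()` for exactly this subset might look like - this is not David's actual module, just a minimal stand-in that handles flat "key: value" pairs, "|" literal blocks and "---" document separators:

    import re

    def load_all(text):
        # Split on '---' document separators, as in the example above.
        for chunk in re.split(r'^---\s*$', text, flags=re.MULTILINE):
            doc, lines, i = {}, chunk.splitlines(), 0
            while i < len(lines):
                line = lines[i]
                i += 1
                if not line.strip():
                    continue
                key, _, value = line.partition(':')
                key, value = key.strip(), value.strip()
                if value == '|':
                    # Gather the indented literal block verbatim, then dedent.
                    block = []
                    while i < len(lines) and (not lines[i].strip()
                                              or lines[i].startswith(' ')):
                        block.append(lines[i])
                        i += 1
                    indent = min((len(l) - len(l.lstrip())
                                  for l in block if l.strip()), default=0)
                    value = '\n'.join(l[indent:] for l in block).strip('\n') + '\n'
                doc[key] = value
            if doc:
                yield doc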
David Mertz writes:
I would definitely like to have a YAML library--even one with restricted function--in the standard library.
Different use cases (and users) will stick at different restrictions. This would be endlessly debatable. I think the only restriction that really makes sense is the load vs. load_unsafe restriction (and that should be a user decision; the "unsafe" features should be available to users who want them).
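PyYAML's existing API already draws exactly this line; a short sketch (assuming PyYAML is installed):

    import yaml

    doc = "timeout: 30\nhosts:\n  - alpha\n  - beta\n"

    # Safe: only ever constructs plain Python types (dict, list, str, int...).
    data = yaml.safe_load(doc)

    # Unsafe: the full loader resolves tags like !!python/object, which can
    # build arbitrary objects - so it should be an explicit user opt-in.
    data = yaml.load(doc, Loader=yaml.Loader)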
On Mon, Jun 3, 2013 at 5:56 PM, Philipp A. <flying-sheep@web.de> wrote:
it is PyYAML based, which is not an option for now as I see it.
can you please elaborate on why this is the case? did Kirill Simonov say “no”?
We don't need a full YAML spec implementation for the package metadata format in distutils (or for other configs either). Therefore... A full YAML implementation like PyYAML may be good in general, but for this simple case it is huge, unsafe, slow and C-based, which means the parsing logic is not translatable (e.g. with PythonJS) and cannot be optimized by future platform-dependent GPU and CPU cache tweaks of the PyPy JIT.
--
anatoly t.
On Mon, Jun 3, 2013 at 3:59 AM, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Different use cases (and users) will stick at different restrictions. This would be endlessly debatable. I think the only restriction that really makes sense is the load vs. load_unsafe restriction (and that should be a user decision; the "unsafe" features should be available to users who want them).
Short version of the previous letter: "yamlish" is only for simple nested human-editable data, such as config files. The format is based on widely popular "organic" examples found on the internet and provided in the previous letter:

http://tmuxp.readthedocs.org/en/latest/examples.html
http://code.google.com/p/rietveld/source/browse/app.yaml
https://github.com/agschwender/pilbox/blob/master/provisioning/playbook.yml
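A made-up illustration (not taken from those projects) of the kind of data meant here - nothing but nested mappings, sequences and plain scalars:

    session_name: dev
    windows:
      - window_name: editor
        panes:
          - vim
          - git status
      - window_name: server
        panes:
          - python -m http.server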
On 13 November 2013 17:15, anatoly techtonik <techtonik@gmail.com> wrote:
Short version of the previous letter: "yamlish" is only for simple nested human-editable data, such as config files. The format is based on widely popular "organic" examples found on the internet and provided in the previous letter:
1. Inventing a new data format (your "yamlish" format) is probably a bad idea. There are enough already.

2. Putting support for a newly designed format directly into the stdlib is *definitely* a bad idea.

Write a module and put it on PyPI. If it's useful, people will use it. They will help you iron out the design of the new format - it may evolve into "full" YAML or into JSON, in which case you've learned something about why those formats made the compromises they did, or it will evolve into a popular new format, at which point it might be worth proposing that the module is ready to be included in the stdlib. Or it will not be sufficiently popular, in which case you have at least solved your personal problem.

If you are expecting someone else to do this, I think the general message from this thread is that nobody else is interested enough to take this on, so it isn't going to happen, sorry.
Paul
On Wed, Nov 13, 2013 at 9:47 PM, Paul Moore <p.f.moore@gmail.com> wrote:
1. Inventing a new data format (your "yamlish" format) is probably a bad idea. There are enough already. 2. Putting support for a newly designed format directly into the stdlib is *definitely* a bad idea.
It is not a new format. It is a YAML subset - limited, but fully readable and parseable as YAML. If you read it with a YAML parser and then save it back immediately, you get the same result.
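That round-trip claim is easy to state as a check (a sketch assuming PyYAML; `doc` stands for any document in the subset):

    import yaml

    def roundtrips(doc):
        # Load the subset document, dump it, load it again: the claim is
        # that the data survives the round trip unchanged.
        data = yaml.safe_load(doc)
        return yaml.safe_load(yaml.safe_dump(data)) == data

    assert roundtrips("a: 1\nb:\n  - x\n  - y\n")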
Write a module and put it on PyPI. If it's useful, people will use it. They will help you iron out the design of the new format - it may evolve into "full" YAML or into JSON, in which case you've learned something about why those formats made the compromises they did, or it will evolve into a popular new format, at which point it might be worth proposing that the module is ready to be included in the stdlib. Or it will not be sufficiently popular, in which case you have at least solved your personal problem.
Can you criticize the idea more thoroughly, instead of sending me somewhere I will definitely fail? I've written a few parsers in my life, all by hand - I don't have much experience with flex/yacc kinds of tools, which is why I asked if anybody knows a good framework for such stuff, and why I asked if there is a comparison site for Python parser frameworks similar to what http://todomvc.com/ does for MV* frameworks. Yes/No/Don't Care/Ignore. I don't believe that among python-ideas subscribers there are no people with experience in different parser frameworks. It would also make a good Google Code-in or GSoC project.
If you are expecting someone else to do this, I think the general message from this thread is that nobody else is interested enough to take this on, so it isn't going to happen, sorry.
The python-ideas list is useless without people looking for an exercise that is worth implementing regardless of who gave the idea. A good idea without an implementation is the same zero as a bad idea with one. Communicating and defending ideas alone is already a time-consuming, hard and thankless process. Even if you come up with an implementation for something like hexdump, it will likely be rewritten without any credit if you don't accept the CLA, which nobody cares to explain even on python-legal.
On Wed, Nov 13, 2013 at 12:34 PM, anatoly techtonik <techtonik@gmail.com> wrote:
I've written a few parsers in my life, all by hand - I don't have much experience with flex/yacc kinds of tools, which is why I asked if anybody knows a good framework for such stuff.
The wiki has a decent listing of parsing libraries, though I don't know how up to date it is: https://wiki.python.org/moin/LanguageParsing

FWIW, I like the idea of a library for a safe subset of YAML. Paul is right that it should live on the cheeseshop a while. You might see if anyone on python-list has interest in collaborating with you on such a project.

-eric

p.s. I appreciate your follow-up email. Your original proposal was long and hard to follow; the idea I like here was lost in there, as evidenced by some of the responses you got. It was much clearer today. Also, those links you gave are nice concrete examples. I'd recommend sticking to that formula of a brief, focused proposal supported by examples. The examples help to mitigate the communication breakdowns.
On 13/11/2013 20:45, Eric Snow wrote:
On Wed, Nov 13, 2013 at 12:34 PM, anatoly techtonik <techtonik@gmail.com> wrote:
I've written a few parsers in my life, all by hand - I don't have much experience with flex/yacc kinds of tools, which is why I asked if anybody knows a good framework for such stuff.
The wiki has a decent listing of parsing libraries, though I don't know how up to date it is.
There's also Ned Batchelder's site, http://nedbatchelder.com/text/python-parsers.html, last updated 29th December 2012.

--
"Python is the second best programming language in the world. But the best has yet to be invented." - Christian Tismer

Mark Lawrence
On Wed, Nov 13, 2013 at 10:34:34PM +0300, anatoly techtonik wrote:
I've written a few parsers in my life, all by hand - I don't have much experience with flex/yacc kinds of tools, which is why I asked if anybody knows a good framework for such stuff.
Try PyParsing. -- Steven
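For a taste, a flat key/value grammar takes only a few lines in pyparsing (a sketch assuming the third-party pyparsing package; the grammar is illustrative and nowhere near YAML):

    from pyparsing import Group, OneOrMore, Suppress, Word, alphanums, restOfLine

    key = Word(alphanums + "_-")
    entry = Group(key + Suppress(":") + restOfLine)
    config = OneOrMore(entry)

    result = config.parseString("host: example.com\nport: 8080\n")
    print({k: v.strip() for k, v in result})  # {'host': 'example.com', 'port': '8080'}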
From: anatoly techtonik <techtonik@gmail.com> Sent: Sunday, June 2, 2013 11:23 AM
FWIW, I am +1 on for the ability to read YAML based configs Python without dependencies, but waiting for several years is hard.
With all due respect, I don't think you've read even a one-sentence description of YAML, so your entire post is nonsense.

The first sentence of the abstract says, "YAML… is a…data serialization language designed around the common native data types of agile programming languages." So your idea that we shouldn't use it for serialization, and shouldn't map it to native Python data types, is ridiculous.

You specifically suggest mapping YAML to XML so we can treat it as a structured document. From the "Relation to XML" section: "YAML is primarily a data serialization language. XML was… designed to support structured documentation."

You suggest that we shouldn't build all of YAML, just some bare-minimum subset that's good enough to get started. JSON is already _more_ than a bare-minimum subset of YAML, so we're already done. But you'd also like some data-driven way to extend this. YAML has already designed exactly that. Once you have the core schema, you can add new types, and the syntax for those types is data-driven (although the semantics are really only defined in hand-wavy English and probably require code to implement, but I'm not sure how you expect your proposal to be any different, unless you're proposing something like XML Schema). So either the necessary subset of YAML you want is the entire spec, or you want to do an equal amount of work building something just as complex but not actually YAML.

The idea of building a useful subset of YAML isn't a bad one. But the way to do that is to go through the features of YAML that JSON doesn't have and decide which ones you want. For example, YAML with the core schema, but no aliases, no plain strings, and no explicit tags, is basically JSON with indented block structure, raw strings, and useful synonyms for key constants (so you can write True instead of true). You could even carefully import a few useful definitions from the type library as long as they're unambiguous (e.g., timestamp). That gives you most of the advantages of YAML that don't bring any safety risks, its output would be interpretable as full YAML, and it might be a little easier to implement than the full spec. But that has very little to do with your proposal. In particular, leaving out the data-driven features of YAML is what makes it safe and simple.

Meanwhile, I think what you actually want is XSLT processors to convert YAML to and from XML. Fortunately, the YAML community is already working on that at http://www.yaml.org/xml. Then you don't need any new Python code at all; just convert your YAML to XML, use whichever XML library (in the stdlib or not), and you're done.
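Part of that restricted profile can already be expressed with PyYAML itself - a sketch (assuming PyYAML; `NoAliasDumper` is a common idiom, not a PyYAML class):

    import yaml

    # SafeLoader rejects the dangerous python-object tags, which covers the
    # "no explicit tags" part of the safe subset described above.
    data = yaml.load("retries: 3\nflags: [a, b]\n", Loader=yaml.SafeLoader)

    # A dumper that never emits anchors/aliases keeps output in the plain
    # subset even when objects are shared.
    class NoAliasDumper(yaml.SafeDumper):
        def ignore_aliases(self, data):
            return True

    shared = ["x"]
    print(yaml.dump({"a": shared, "b": shared}, Dumper=NoAliasDumper))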
On Mon, Jun 3, 2013 at 2:53 AM, Andrew Barnert <abarnert@yahoo.com> wrote:
From: anatoly techtonik <techtonik@gmail.com> Sent: Sunday, June 2, 2013 11:23 AM
FWIW, I am +1 on for the ability to read YAML based configs Python without dependencies, but waiting for several years is hard.
With all due respect, I don't think you've read even a one-sentence description of YAML, so your entire post is nonsense.
I'll try to clarify my post so that it is clear for you. Please ask if something is unclear. You're right: I don't read specifications before using things. What do I personally need from YAML? These are examples of files I use daily:

http://tmuxp.readthedocs.org/en/latest/examples.html
http://code.google.com/p/rietveld/source/browse/app.yaml
https://github.com/agschwender/pilbox/blob/master/provisioning/playbook.yml
http://pastebin.com/RG7g260k (OpenXcom save format)
The first sentence of the abstract says, "YAML… is a…data serialization language designed around the common native data types of agile programming languages." So, your idea that we shouldn't use it for serialization, and shouldn't map it to native Python data types, is ridiculous.
I don't really care about the abstract. I am a complaining user, not the smart guy who wrote the spec. So my thinking is the following:

1. None of the examples above is a persistence format for serialized native computer-language data types. They are just nested mappings and lists - a strictly two-dimensional tree data structure, even the OpenXcom one. It is YAML, or as I said, a subset of YAML, and that's why I deliberately called this format "yamlish".

2. Regardless of any desire to use this proposal as an opportunity to see the full YAML 1.2 spec implemented in the Python stdlib, I am going to resist. I need to work with a *safe data format* which is "human friendly", and I put *safe format* over *serialization format*.
You specifically suggest mapping YAML to XML so we can treat it as a structured document. From the "Relation to XML" section: "YAML is primarily a data serialization language. XML was… designed to support structured documentation."
Where? Oh, do you mean this one: "The ideal output for the first version should be generic tree structure with defined names for YAML elements. The tree that can be represented as XML where these names are tags." It is not about a "structured document", it is about a "structured data format". A "tree that can be represented as XML" is not an "XML tree" - XML here is just an example of a structured nested format that everybody is aware of. What I want to say is that this "tree structure" should be plain, and a 1:1 mapping to XML is a necessary and sufficient requirement.
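The 1:1 mapping asserted here can be sketched with nothing but the stdlib (the element names and the `kind` attribute are made up for illustration):

    import xml.etree.ElementTree as ET

    def to_tree(data, tag="root"):
        # Map nested dicts/lists/scalars onto a generic XML-style tree.
        if isinstance(data, dict):
            node = ET.Element(tag, kind="map")
            for key, value in data.items():
                node.append(to_tree(value, tag=key))
        elif isinstance(data, list):
            node = ET.Element(tag, kind="seq")
            for value in data:
                node.append(to_tree(value, tag="item"))
        else:
            node = ET.Element(tag, kind="scalar")
            node.text = str(data)
        return node

    tree = to_tree({"hosts": ["alpha", "beta"], "timeout": 30})
    # ElementTree supports a limited XPath subset for traversal:
    print([e.text for e in tree.findall("./hosts/item")])  # ['alpha', 'beta']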
You suggest that we shouldn't build all of YAML, just some bare-minimum subset that's good enough to get started. JSON is already _more_ than a bare-minimum subset of YAML, so we're already done.
I didn't know that JSON was compatible with YAML. Still, I am not sure I understand how the argument that JSON is already a subset of YAML makes a minimal implementation of YAML "done". The module name - "yamlish" - defines its purpose, which my poor language skills can verbalize as: "provide support for parsing and writing files in formats that are subsets of YAML, used to store generic user-editable, not Python-specific, declarative data, such as configurations, save files, settings etc.". Because I am not a CS major, I can't describe exactly what is common between the examples I provided, or how these examples differ from the usual programming-language objects serialized into YAML. I feel that these examples are "yamlish", and I would very much appreciate it if somebody could come up with a proper *definition* of the characteristics of the simple data formats (which are still YAML) that give this feeling. Such a definition would greatly help to keep this moving in the right direction.
But you'd also like some data-driven way to extend this. YAML has already designed exactly that. Once you have the core schema, you can add new types, and the syntax for those types is data-driven (although the semantics are really only defined in hand-wavy English and probably require code to implement, but I'm not sure how you expect your proposal to be any different, unless you're proposing something like XML Schema). So, either the necessary subset of YAML you want is the entire spec, or you want to do an equal amount of work building something just as complex but not actually YAML.
No, it is not data-driven support for extension in the "yamlish" format. It is a data-driven process of writing the parser for "yamlish": you get one example, parse it, get output, write a test; get another, parse it, get output, run the previous test. The "yamlish" format is only for common, human-understandable data files. Perhaps expanding on the idea of the "yamlish" format with the development process and with details of my "own data transformation theory" was not a good idea, but it was the only chance to find the motivation to write the stuff down. =) Sorry for the overload; let me clarify things a little.

I proposed a process for extending the "yamlish" parser to cover more backward-compatible "yamlish" data formats. There is no mechanism to support conflicting formats, or formats that change the output for existing stuff. That's it. There is no additional API for full YAML, so no complexity is involved in maintaining and supporting extra features or the full YAML spec.

The `datatrans` framework I was speaking about is a possible implementation of a lib to transform 2D structures between different formats. You know, the data transformation process is all the same at some level; at the level above I can even say that everything we do in CS is just data transformation. It is not related to the "yamlish" format definition. The only important thing is that "datatrans" enables many input and output formats that can be represented as a 2D annotated (or generic) tree. It is not related to "yamlish".
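The process described here amounts to a golden-file regression suite - a sketch under stated assumptions (the examples/ layout and the `load()` parser under test are hypothetical):

    import json, pathlib, unittest

    class YamlishExamples(unittest.TestCase):
        # Hypothetical layout: examples/foo.yaml paired with examples/foo.json
        # holding the expected output, added one real-world case at a time.
        def test_examples(self):
            for src in sorted(pathlib.Path("examples").glob("*.yaml")):
                expected = json.loads(src.with_suffix(".json").read_text())
                self.assertEqual(load(src.read_text()), expected)  # hypothetical load()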
The idea of building a useful subset of YAML isn't a bad one. But the way to do that is to go through the features of YAML that JSON doesn't have, and decide which ones you want. For example, YAML with the core schema, but no aliases, no plain strings, and no explicit tags is basically JSON with indented block structure, raw strings, and useful synonyms for key constants (so you can write True instead of true). You could even carefully import a few useful definitions from the type library as long as they're unambiguous (e.g., timestamp). That gives you most of the advantages of YAML that don't bring any safety risks, and its output would be interpretable as full YAML, and it might be a little easier to implement than the full spec. But that has very little to do with your proposal. In particular, leaving out the data-driven features of YAML is what makes it safe and simple.
Now I feel that we are basically thinking about the same things - simplicity and safety. I didn't read the spec, so I don't know what is in the core YAML schema; here you know much better than I do what needs to be filtered out. My thought was to use examples to see what should be filtered out, because iterating over the spec will bring up many more "useful" features that people with forward thinking might want, but which may be harmful to keeping this small and simple. I really like YAML's brevity compared to JSON and other structured data formats (the tmuxp examples page is a good one). Support for an indented data format is also natural for an indented language. But it is hard to get a format right and not spoil it with overengineering.

About safety: I believe these "data-driven features of YAML" are the point of confusion. I recall that the YAML spec provides some declarative mechanism for extensions. That is not what I mean. My data-driven approach is just "don't design anything upfront; use existing, widely used data examples as the spec for the data that needs to be parsed". And yes - I don't need this YAML extensibility feature, which I too believe makes YAML unsafe. I need YAML as a format for indented data in a text file. Nothing more. YAML without the "extra processing" that leads to potential hacks and execution of unwanted code.

I just want to make sure the data format is safe. Currently, the Python stdlib lacks a safe serialization format - the docs are bleeding red with warnings without specifying any alternatives. I like to call it "yamlish", because if it is named YAML, people will demand dynamism and OOPy "constructor/destructor" tricks, and sooner or later the module's users will be pwned, as happened with other serialization modules before. Therefore I don't want "serialization as a feature", but I don't mind "serialization as a side effect" if it is compatible with a good, intuitive API AND improves speed without sacrificing _clarity_ and safety. _Clarity_ here is the understanding, at all times, that "there is no way the 'yamlish' format can be unsafe".
Meanwhile, I think what you actually want is XSLT processors to convert YAML to and from XML. Fortunately, the YAML community is already working on that at http://www.yaml.org/xml. Then you don't need any new Python code at all; just convert your YAML to XML and use whichever XML library (in the stdlib or not), and you're done.
XSLT, that declarative Turing-complete language - I am fed up with it. Complexity and performance ruin the beautiful theory. I think Turing-completeness is a trap: solving its gestalts gives a good feeling when you learn it, but it has nothing to do with real-world problems. XSLT processors hog memory AND are slow at the same time. XSLT debugging is impossible, because the process is obscure. I guess it is also easily exploitable for DoS. XSLT? Not anymore, thanks. XML has only one advantage over all other formats - auto-discoverable validation schemas. That's why it is still so popular. FWIW.

Right now Python doesn't have any safe native type for structured data - only linked objects and references. Some time ago I tried to introduce a solution for handling structured data by proposing "2D" (two-dimensional) terminology with a generic tree as the base type. But the post became too complicated, lacking pictures, and I was unable to support the communication. I don't want this idea to find its rest in the mailing list archives, so if you know how to write such a minimal (and safe) parser (and fast) in Python (and maintainable), please tell me. If an additional parser language is inevitable, maybe somebody knows of a comparison site for parser frameworks, similar to what http://todomvc.com/ does for MV* frameworks.
participants (9)
- anatoly techtonik
- Andrew Barnert
- David Mertz
- Eric Snow
- Mark Lawrence
- Paul Moore
- Philipp A.
- Stephen J. Turnbull
- Steven D'Aprano