On Sun, Feb 6, 2022 at 9:42 PM Chris Angelico <rosuav@gmail.com> wrote:
> As for dataclasses, this is what I mean by "code" vs "data" -- if you know when you are writing the code exactly what keys (fields, etc.) you expect, and you want to be able to work with that data model as code (e.g. attribute access, maybe some methods), then you do:
>
> In [10]: @dataclass
>     ...: class Stream:
>     ...:     codec_type: str
>     ...:     width: int
>     ...:     height: int
>
> And if you have that data in a dict (say, from JSON), then you can extract it like this:
>
> In [11]: stream_info = {'codec_type': 'video',
>     ...:                'width': 1024,
>     ...:                'height': 768,
>     ...:                }
>
> In [12]: stream = Stream(**stream_info)
>
> In [13]: stream
> Out[13]: Stream(codec_type='video', width=1024, height=768)
>
> That only works if your dict is in exactly the right form, but that would be the case anyway.

One very *very* important aspect of a huge number of JSON-based
protocols is that they absolutely will not break if new elements are
added. In other words, I look at the things I'm interested in, but
those streams also have a ton of other information (frame rate,
metadata, pixel format), which could get augmented at any time, and I
should just happily ignore the parts I'm not looking for. Making that
work with dataclasses (a) is even more boilerplate, and (b) would
obscure the relationship between the dataclass and the JSON schema.

I believe some folks have asked for the ability for **kwargs to be tacked on to the dataclass-generated __init__ -- I don't know if it will happen, but that would address this use case.

Not sure what you mean by "obscure the relationship between the dataclass and the JSON schema."

I guess you mean that the dataclass will then accept non-schema-conforming JSON, but if you don't want it to do that, then don't allow that. For my part, in an application where I'm doing a lot of JSON <-> dataclass conversion, I explicitly add an "extra_data" field, so I can capture anything in the JSON that doesn't have a "proper" place.
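
A minimal sketch of that "extra_data" idea, assuming a hand-written from_dict() classmethod (a hypothetical helper, not something dataclasses generates) and a made-up "pix_fmt" key standing in for the fields the code doesn't care about:

```python
from dataclasses import dataclass, field, fields
from typing import Any


@dataclass
class Stream:
    codec_type: str
    width: int
    height: int
    # Catch-all for keys that have no declared field.
    extra_data: dict[str, Any] = field(default_factory=dict)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Stream":
        # Names of the declared fields, minus the catch-all itself.
        known = {f.name for f in fields(cls)} - {"extra_data"}
        declared = {k: v for k, v in data.items() if k in known}
        extra = {k: v for k, v in data.items() if k not in known}
        return cls(**declared, extra_data=extra)


stream_info = {
    "codec_type": "video",
    "width": 1024,
    "height": 768,
    "pix_fmt": "yuv420p",  # illustrative extra key
}
stream = Stream.from_dict(stream_info)
```

With this, new keys showing up in the JSON land in stream.extra_data instead of raising a TypeError in __init__, while a missing required key still fails loudly.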

After I posted, I realized that dataclasses are probably not the simplest solution -- but SimpleNamespace could be:

In [9]: stream_info = {'codec_type': 'video',
   ...:                'width': 1024,
   ...:                'height': 768,
   ...:                }

In [10]: stream = types.SimpleNamespace(**stream_info)

In [11]: stream.codec_type
Out[11]: 'video'

In [12]: stream.height
Out[12]: 768


In [13]: stream.width
Out[13]: 1024
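
For nested JSON, the same trick can be applied at parse time through the object_hook argument of json.loads(), which is called for every decoded JSON object. A small sketch, assuming a made-up document shape with a top-level "streams" list:

```python
import json
from types import SimpleNamespace

# Illustrative JSON; the real layout depends on the tool producing it.
raw = '{"streams": [{"codec_type": "video", "width": 1024, "height": 768}]}'

# Every JSON object becomes a SimpleNamespace, so nesting works too.
info = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

first = info.streams[0]
pixels = first.width * first.height
```

Unknown keys are simply carried along as extra attributes, so nothing breaks when the producer adds fields.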


In any case, if you don't like how dataclasses or SimpleNamespace do it, then write your own custom class / converter -- I don't see the need for it to be a language feature.

I'm not sure what you mean here about code vs data. What is the
difference that you're drawing? Ultimately, I need to read a
particular data structure and find the interesting parts of it. It's
not about code. The only code is "iterate over info->streams, look at
the codec_type, width, height, perform arithmetic on videos".

The distinction I'm trying to draw (and I did say it was a fuzzy one in Python) is that data are things you can store in variables -- e.g. the keys of a dict can be hard coded (known at code-writing time) or stored in a variable.

Code is things like variable and attribute names that have to be known at code-writing time (barring metaprogramming techniques, get/setattr, etc).

In this case, we are looking to auto-extract variables from a dict -- you can't even start to write that code unless you know what the keys in the dict are -- if that's the case, then you know (at least part of) the schema, and you can use dataclasses, etc, and get your code.
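
A small illustration of that distinction, reusing the stream_info dict from earlier (the variable names are only for illustration):

```python
from types import SimpleNamespace

stream_info = {"codec_type": "video", "width": 1024, "height": 768}

# Data: the key is an ordinary value and can live in a variable.
key = "width"
from_variable = stream_info[key]

# Code: the attribute name has to be spelled out in the source.
stream = SimpleNamespace(**stream_info)
from_attribute = stream.width

# getattr() is the escape hatch that treats an attribute name as data again.
from_getattr = getattr(stream, key)
```

All three lookups return the same value; the difference is only in when the name has to be known.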

I've worked with systems (the netcdf4 library for example, if you want an obscure one :-) ) that essentially auto-translate keys in a dict to object attributes. It seems pretty nifty at first:

ds = Dataset("the_file.nc")
ds.sea_surface_temp.units

But it ends up just making things harder -- you need to poke into the file to see what names will be there, it's actually harder to introspect (can't just look at .keys() ) -- and things really go to heck if the keys in your data don't follow Python variable naming rules:

In [14]: stream_info = {'codec-type': 'video',
    ...:                'width': 1024,
    ...:                'height': 768,
    ...:                }

In [15]: stream = types.SimpleNamespace(**stream_info)

In [16]: stream.codec-type
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-16-26291ce709a5> in <module>
----> 1 stream.codec-type

AttributeError: 'types.SimpleNamespace' object has no attribute 'codec'

In [17]: stream.codec_type
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-17-6225e2eacec1> in <module>
----> 1 stream.codec_type

AttributeError: 'types.SimpleNamespace' object has no attribute 'codec_type'

In [18]: getattr(stream, 'codec-type')
Out[18]: 'video'
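
One way to catch this up front is to check the keys before turning them into attributes. A rough sketch using str.isidentifier() and keyword.iskeyword(); the helper names and the extra "class" key are made up for illustration:

```python
import keyword


def is_attribute_friendly(key):
    """True if key can be written after a dot in source code."""
    return (
        isinstance(key, str)
        and key.isidentifier()
        and not keyword.iskeyword(key)
    )


def awkward_keys(data):
    """Return the keys that would only be reachable through getattr()."""
    return [k for k in data if not is_attribute_friendly(k)]


stream_info = {
    "codec-type": "video",  # hyphen: not a valid identifier
    "width": 1024,
    "height": 768,
    "class": "main",        # valid identifier, but a reserved word
}
bad = awkward_keys(stream_info)
```

Here bad comes out as ['codec-type', 'class'], which is exactly the set of names that would need getattr() or a renaming step.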


So: I say, keep your data in dicts, and if you want to load a code object with that data, do it in a clearly defined way.

Again, in a quick script, maybe it'd be helpful occasionally, mostly saving some typing (all those darn square brackets and quotes)[*] -- but I don't think that's worth a language feature.

-CHB 

[*] I'm not being facetious here -- I write a lot of quick scripts, and DO find typing:
this.that
a lot easier than
this['that']

But I don't think it's worth a language feature

And now that I've thought about it -- maybe I'll start using the SimpleNamespace trick in some of those quick scripts.

-CHB

--
Christopher Barker, PhD (Chris)

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython