On Tue, Sep 15, 2020 at 7:30 PM Christopher Barker <pythonchb@gmail.com> wrote:
On Tue, Sep 15, 2020 at 9:09 AM Wes Turner <wes.turner@gmail.com> wrote:
json.load and json.dump already default to UTF8 and already have parameters for json loading and dumping.

yes, of course.

json.loads and json.dumps exist only because there was no way to distinguish between a string containing JSON and a file path string.
(They probably should've been .loadstr and .dumpstr, but it's too late for that now)

I think they exist because that was the pickle API from years ago -- though maybe that's why the pickle API had them. Though I think you have it a bit backwards -- you can't pass a path into load/dump for that reason. If loads/dumps were created because that distinction couldn't be made, then load/dump would have accepted a string path back in the day.

TBH, I think it would be great to just have .load and .dump read the file with standard params when a path-like object (one implementing `__fspath__`, i.e. os.PathLike) is passed, but the suggested disadvantage of this is:


  > The default encoding is platform dependent (whatever locale.getpreferredencoding() returns), but any text encoding supported by Python can be used. See the codecs module for the list of supported encodings.

that's not a reason at all -- the reason is that some folks think overloading a function like this is bad API design. And it's been the way it's been for a long time, so probably better to add a new function(s), rather than extend the API of an existing one.

.load - reads a file object
.loadf - opens the file for you, given a str path or a path-like object (one with `__fspath__`), and reads it
.loads - reads from a string-like object

or

.load - reads a file object or creates a file object from a path or an obj.__path__ and closes it after reading
.loads - reads from a string-like object

For backwards compatibility (without a check for `sys.version_info[:2]` or `hasattr(json, 'loadf')`), handling the file yourself (e.g. with a context manager) will still be the way it's done.
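For concreteness, here is a minimal sketch of what such a function could look like. The name `loadf` and the UTF-8 default are assumptions from this thread, not an existing API:

```python
import json
import os

def loadf(path, *, encoding="utf-8", **kw):
    """Hypothetical helper: open a str path or os.PathLike and json.load it."""
    # os.fspath() accepts a str, bytes, or any object implementing __fspath__
    with open(os.fspath(path), "r", encoding=encoding) as f:
        return json.load(f, **kw)
```

A `dumpf` counterpart would be the mirror image, opening the path in write mode before calling json.dump.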
 
 
- .load and .dump don't default to UTF8?
  AFAIU, they do default to UTF-8. Or do they currently default to locale.getpreferredencoding() rather than the UTF-8 that the JSON specs call for?
  encoding= was removed from .loads and was never accepted by json.load or json.dump.

I think dump defaults to UTF-8. But load is a bit odd (and not that well documented).

It appears to accept a file-like object whose read() method returns either a str or a bytes object. If str, the decoding has already been done; if bytes, I assume it's using UTF-8.

This, by the way, should be better documented.

I agree: https://github.com/python/cpython/blob/master/Lib/json/__init__.py
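To illustrate the current behavior (Python 3.6+): json.load accepts a file-like object whose read() returns either str or bytes, and for bytes the encoding is auto-detected among the spec encodings:

```python
import io
import json

payload = '{"x": 1}'

# Text-mode file-like object: read() returns str, already decoded
assert json.load(io.StringIO(payload)) == {"x": 1}

# Binary file-like object: read() returns bytes; json auto-detects
# UTF-8/UTF-16/UTF-32 from the BOM / byte pattern
assert json.load(io.BytesIO(payload.encode("utf-8"))) == {"x": 1}
assert json.load(io.BytesIO(payload.encode("utf-16"))) == {"x": 1}
```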
 
 
- .load and .dump would also need to accept an encoding= parameter, for non-spec data and for users who don't want to keep handling the file themselves
  - pickle.load has an encoding= parameter

.loads doesn't now, so I don't see why they would need to with the proposed change. You can always encode/decode ahead of time however you want, either in the file-like object or by passing decoded str to .loads/dumps.

pickle.loads does accept an encoding= parameter; and that's the API we were matching.

Handling the file object will continue to be the backwards-compatible way to do it.
 
 
- Should we be using open(pth, 'rb') and open(pth, 'wb')? (Binary mode)

No, I think that's clear. In fact, you can't currently dump to a binary file:

In [26]: json.dump(obj, open('tiny-enc.json', 'wb'))                                              
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-26-02e9bcd47a3e> in <module>
----> 1 json.dump(obj, open('tiny-enc.json', 'wb'))

~/miniconda3/envs/py3/lib/python3.8/json/__init__.py in dump(obj, fp, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
    178     # a debuggability cost
    179     for chunk in iterable:
--> 180         fp.write(chunk)
    181
    182

TypeError: a bytes-like object is required, not 'str'
 
That's the beauty of Python 3's text model :-)

JSON Specs:

  > JSON text SHALL be encoded in UTF-8, UTF-16, or UTF-32. The default encoding is UTF-8,

So THAT is interesting. But the current implementation does not directly support anything but UTF-8, and I think it's fine that that still be the case. If anyone is using the other two, it's an esoteric case, and they can encode/decode by hand.

The Python JSON implementation should support the full JSON spec (including UTF-8, UTF-16, and UTF-32) and should default to UTF-8.
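Worth noting: for bytes input, the module already sniffs the spec encodings. json.detect_encoding() (used internally by loads since Python 3.6) distinguishes UTF-8, UTF-16, and UTF-32, so a round-trip through any of the three already works today:

```python
import json

obj = {"text": "héllo"}
for enc in ("utf-8", "utf-16", "utf-32"):
    data = json.dumps(obj).encode(enc)
    # detect_encoding() sniffs the BOM / byte pattern of the payload
    assert json.detect_encoding(data) == enc
    assert json.loads(data) == obj
```

What's missing is this support at the file level, i.e. in json.load itself when given a path rather than an already-decoded text stream.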
 

> So, could we just have .load and .dump accept a path-like and an encoding= parameter (because they need to be able to specify UTF-8 / UTF-16 / UTF-32 anyway)?

These are separate questions, but I'll say:

Yes, it could take a path-like. But I think there was not much support for that in this discussion.

A path str or a path-like. Is there any reason not to support path-like objects with this API as well?
 

No -- there is no need for encoding parameter -- the other two options are rare and can be done by hand.

There is a need for an encoding parameter in order to support the full JSON spec. Whether creating a new .loadf or just extending .load is the solution, the method should accept an encoding parameter.
 

BTW: .dumps() dumps to, well, a string, so it's not assuming any encoding. A user can encode it any way they want when passing it along.

This, in fact, is all very Python3 text model compatible -- the encoding/decoding should happen as close to IO as possible. 
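A quick illustration of that model: dumps produces a str with no encoding applied, and the encode step happens only at the I/O boundary:

```python
import json

obj = {"name": "café"}
text = json.dumps(obj, ensure_ascii=False)  # still a str -- no encoding yet
assert isinstance(text, str)

# Encode only at the I/O boundary, in whatever encoding the consumer needs:
payload = text.encode("utf-8")
assert json.loads(payload) == obj
```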

Is there precedent for handling the file for the user in any other stdlib functions?
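For what it's worth, there is some precedent: a few stdlib parsers accept either a path or a file object. xml.etree.ElementTree.parse is one example (its documented signature takes "a filename or file object"):

```python
import io
import xml.etree.ElementTree as ET

# A file-like object works...
tree = ET.parse(io.StringIO("<root><child/></root>"))
assert tree.getroot().tag == "root"

# ...and ET.parse("data.xml") would equally accept a path string.
```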

Extending the pickle and marshal APIs should also occur with this PR if accepted.
 

If there were no backward compatibility options, and it were me, I would only use strings in/out of the json module, but I think that ship has sailed.

The obj.__json__ protocol threads discussed various ways to implement customizable serialization of object graphs containing complex types to JSON/JSON5 and/or JSON-LD (which, BTW, supports complex types like complex fractions).
 

Anyway -- if anyone wants to push for overloading .load()/dump(), rather than making two new loadf() and dumpf() functions, then speak now -- that will take more discussion, and maybe a PEP.

I don't see why one or the other would need a PEP so long as the new functionality is backward-compatible?
 

-CHB



--
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython