[Web-SIG] Time for a JSON parser in the standard library?

Fri Mar 21 01:43:30 CET 2008

On Thu, Mar 20, 2008 at 6:48 PM, John Millikin <jmillikin at gmail.com> wrote:
>  This is fantastic. My knowledge of other JSON modules was based mainly
>  on the comparison page from json.org, and yours is much more complete
>  and informative.

Well, that's understandable.  What's on json.org is great for giving somebody
an easy-to-understand overview (ever looked at YAML?); but it wasn't until
Douglas wrote the RFC that there was anything detailed enough to base
an implementation upon without a lot of guesswork.  And I still had to
poke around in the ECMAScript, IEEE 754, and Unicode standards to
fill in some gaps.  And JSON is the *easy* format!

And even then, we're not just talking about a JSON parser.  We're all
doing more than that; we're mapping Python to JSON.  And there is
no definitive spec for that.  Just look at my numbers tests; there are
a lot of differences in how numeric mappings are done, and yet many
of them are arguably "correct" while still doing things differently.
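
To make the "many correct answers" point concrete, here is a small
sketch using the json module from a current Python 3 standard library
(an assumption on my part; it is not one of the modules compared in
this thread).  The same JSON text maps to binary floats by default, or
to Decimal values when asked, and both are defensible:

```python
import json
from decimal import Decimal

text = "[1, 1.0, 0.1, 1e2]"

# Default mapping: JSON numbers with a fraction or exponent become
# Python floats; plain integers become Python ints.
as_floats = json.loads(text)

# Alternative mapping: hand every non-integer literal to Decimal,
# which preserves the decimal digits exactly as written.
as_decimals = json.loads(text, parse_float=Decimal)

float_types = [type(v).__name__ for v in as_floats]
decimal_types = [type(v).__name__ for v in as_decimals]
```

Neither result is wrong as far as the RFC is concerned; it says
nothing about which Python type a number should become.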

>  You could try adding a section to the numbers area about
>  Arabic/Chinese/whatever numbers, such as U+0661. These are not allowed
>  in JSON, but are accepted by parsers that use \d regex patterns with
>  the re.UNICODE flag set.

Good call.  I forgot all about that possibility, probably because
I'm not using regexes.
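
For anyone who is using regexes, the trap looks roughly like this.
This is a sketch in Python 3 syntax, where \d on a str pattern is
already Unicode-aware (the equivalent of setting re.UNICODE in 2.x);
the pattern names are made up for illustration:

```python
import re

# A naive JSON integer pattern.  With Unicode matching, \d accepts any
# Unicode decimal digit, not just ASCII 0-9.
LOOSE_INT = re.compile(r"-?\d+")

# re.ASCII restricts \d to [0-9]; writing [0-9] explicitly works too.
STRICT_INT = re.compile(r"-?\d+", re.ASCII)

arabic_one = "\u0661"  # ARABIC-INDIC DIGIT ONE

loose_match = LOOSE_INT.fullmatch(arabic_one) is not None
strict_match = STRICT_INT.fullmatch(arabic_one) is not None
ascii_match = STRICT_INT.fullmatch("-12") is not None
```

Only the second pattern agrees with the JSON grammar, which allows
nothing but the ASCII digits 0 through 9 in a number.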

>  For strings, I would like to suggest that escaping "/" to "\\/" be
>  considered the norm, with deviations from this marked on the table.
>  This is to protect against foolish website authors including JSON
>  directly using a <script> tag.

Well, for that I'm not too willing to state any preference, even though
I understand the <script> reasoning.  The JSON spec basically says
that any of these three representations can be used
  /
  \/
  \u002f
without favoring any one of them.  So in my mind they are all
equally "good".  And I've shown which one each of the modules
produces (although I did fail to document which ones let you
set/change that behavior).
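
As a quick check of the equivalence, the sketch below decodes all
three forms with the json module from a current Python 3 standard
library (my choice for illustration, not one of the modules in the
comparison).  The replace() step at the end is just one possible way
to get the <script>-safe form; it is not an option of that module:

```python
import json

# The three spellings of a solidus inside a JSON string.
forms = ['"/"', '"\\/"', '"\\u002f"']
decoded = [json.loads(form) for form in forms]

# This encoder emits a bare "/" by default.
encoded = json.dumps("</script>")

# Post-process the JSON text so "</" never appears literally.  The
# result is still valid JSON and decodes to the same string.
script_safe = encoded.replace("</", "<\\/")
round_trip = json.loads(script_safe)
```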

But point noted.

>  I think the RFC allows inclusion of U+000A (newline) in strings
>  without escaping -- at least, it is not in the range of characters
>  requiring escaping.

From RFC 4627, section 2.5:

   "... All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F)."

So the way I read that is that raw newlines are not allowed.
But then according to that U+2028 (Line Separator) would be
allowed; and frankly I don't think I even got that right.

I should double-check on the JavaScript spec.
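
Here is how that reading plays out in one decoder, again using the
json module from a current Python 3 standard library as a stand-in
(an assumption; other modules may behave differently).  The accepts()
helper is only for illustration:

```python
import json

def accepts(text):
    """Return True if json.loads parses text without raising."""
    try:
        json.loads(text)
        return True
    except ValueError:
        return False

raw_newline = '"a\nb"'          # literal U+000A inside the string
escaped_newline = '"a\\nb"'     # the two characters \ and n
line_separator = '"a\u2028b"'   # literal U+2028, outside U+0000-U+001F

raw_ok = accepts(raw_newline)
escaped_ok = accepts(escaped_newline)
separator_ok = accepts(line_separator)
```

The raw newline is rejected and the escaped one is accepted, while a
raw U+2028 passes because it falls outside the control range named in
section 2.5.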

>  You remark in the page on Unicode that "[encoding] is not a concern
>  for any of the UTF-[8,16,32] encodings, but it could be if you wanted
>  ISO-8859-4 for example". However, the RFC specifies that all JSON text
>  is to be encoded using a Unicode encoding.

That's true:
   "JSON text SHALL be encoded in Unicode.  The default encoding
    is UTF-8."
For whatever "encoded in Unicode" actually means.

I'll note more clearly that doing ISO-8859-* is probably stepping outside
of the JSON spec.  But then, what about trying to put a U+0001D140
character into a UCS-2 encoded output?
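
To show why that character is awkward, the sketch below (Python 3,
standard library only, my own illustration) encodes it two ways.
It lies outside the Basic Multilingual Plane, so it cannot fit in a
single 16-bit unit and has to be written as a surrogate pair:

```python
import json

char = "\U0001D140"  # a code point above U+FFFF

# With the default ensure_ascii=True, the encoder writes the character
# as a \u-escaped UTF-16 surrogate pair.
escaped = json.dumps(char)

# UTF-16 needs two 16-bit units (4 bytes) for the same character; a
# strict UCS-2 encoder has no way to represent it at all.
utf16 = char.encode("utf-16-be")

round_trip = json.loads(escaped)
```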

>  I hope you don't mind if I "borrow" your test cases to use in the
>  jsonlib unit test suite.

Absolutely do so.  I was hoping in fact to be able to make a
generic test suite that everybody could use but I can only
work so fast.

I'm actually interested in seeing people doing JSON tests for
other languages getting on board too; because I think the Python
camp is way ahead in RFC conformance (except perhaps for
the JavaScript guys).  And the whole point of JSON is to be
language neutral; so everybody needs to do it right.

>  And finally, I apologize for slandering demjson. I was not aware that
>  it had a "strict" mode, and constructed my opinion of it based on its
>  behavior in "loose" mode.

No problem.  The default mode is "loose" anyway, so you'd
have to pay attention to realize it could be strict.

>  >  I do think though that if this is targeted for Python 3, that
>  >  none of the modules really works well.  We should really
>  >  design an interface that uses the bytes type rather
>  >  than str for pushing around encoded JSON data.
>  >
>  This seems like a module that would be easy enough to have in both 2.6
>  and 3. The py3k version could have additional enhancements, such as
>  detecting appropriate serialization formats based on ABCs, but even a
>  limited version in 2.6 would be more helpful than none at all.

Not my call.  Certainly though what goes in 3.0 (if anything) I'd like
to be the best that we as a community can put together.
And I'm not sure that any of the existing implementations,
mine included, are good enough yet for me.  But neither are
we that far away.  I was actually pleasantly surprised that
several of the modules I tested did as well as they did.
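
On the bytes question quoted above, this is roughly the shape of
interface I have in mind.  It is only a sketch: dumps_bytes and
loads_bytes are hypothetical names, it is built on the json module
from a current Python 3 standard library, and it hard-wires UTF-8
because that is the RFC's default encoding:

```python
import json

def dumps_bytes(obj):
    """Serialize obj to UTF-8 encoded JSON, returned as bytes."""
    # ensure_ascii=False keeps non-ASCII characters as themselves,
    # so the encoding step below is where they become bytes.
    return json.dumps(obj, ensure_ascii=False).encode("utf-8")

def loads_bytes(data):
    """Parse UTF-8 encoded JSON supplied as bytes."""
    if not isinstance(data, (bytes, bytearray)):
        raise TypeError("encoded JSON must be bytes, not str")
    return json.loads(bytes(data).decode("utf-8"))

payload = dumps_bytes({"name": "caf\u00e9"})
restored = loads_bytes(payload)
```

The point is that encoded JSON travels as bytes, and str only ever
holds decoded text.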
-- 
Deron Meranda