python 2.7 and unicode (one more time)

Rustom Mody rustompmody at gmail.com
Sat Nov 22 16:57:22 CET 2014


On Saturday, November 22, 2014 8:14:15 PM UTC+5:30, Roy Smith wrote:
>  Marko Rauhamaa wrote:
> 
> > Steven D'Aprano:
> > 
> > > You haven't given any good reason for objecting to calling Unicode
> > > strings by what they are. Maybe you think that it is an implementation
> > > detail, and that some version of Python might suddenly and without
> > > warning change to only supporting KOI8-R strings or GB2312 strings? If
> > > so, you are badly mistaken. The fact that Python strings are Unicode
> > > is not an implementation detail, it is part of the language semantics.
> > 
> > To me, repeating the word Unicode everywhere is giving the (in and of
> > itself impressive) standard too primary a status. While understanding
> > how Unicode, IEEE-754, 2's complement, mark-and-sweep etc work is very
> > useful and occasionally can be taken explicit advantage of, those really
> > are mundane techniques to implement abstractions.
> > 
> > Python's strings exist (primarily) so you can express utterances in a
> > human language, aka plain text. They don't exist to express Unicode code
> > points. That would be putting the cart before the horse.
> > 
> > > "Rectangular door" makes perfect sense, and in a world where there are
> > > dozens of legacy non-rectangular doors, it would be very sensible to
> > > specify the kind of door.
> > 
> > It makes sense, and yet, I've never heard anyone talk about rectangular
> > doors even though I use numerous doors every day. Why is it, then, that
> > people feel the constant need to add the "Unicode" epithet to Python's
> > strings, which -- according to its own specification -- are just
> > strings?
> > 
> > 
> > Marko
> 
> There's an old joke to the effect that the fields of study which are 
> confident that they're really doing science (e.g. chemistry, biology, 
> physics, astronomy, etc) don't put the word "science" in their names.  
> It's only the fields of study that are less confident about their status 
> as sciences (computer science, behavioral science, political science, 
> etc) that feel the need to explicitly say "science".  As if repeating it 
> enough times makes it true.  I wonder if something of the same thing 
> applies here?  <ducking and running>
> 
> Somewhat more seriously, the IEEE-754 point is quite apropos.  Back when 
> 754 first came out, there were lots of different floating point 
> implementations.  Machines that used 754 touted it in their sales 
> literature and mentioned it all over their documentation.  These days, 
> 754 is so ubiquitous, nobody even thinks to mention it, in the same way 
> nobody bothers to mention 2's complement integers.  I suspect that some 
> day, the same thing will happen with Unicode.  For that matter, we will 
> eventually get to the point where when people say, "just plain text", 
> they will mean Unicode, in the same way that "just plain text" today 
> really means ASCII (and the text/plain MIME type will become a 
> historical curiosity).

Yes, this was my point also -- encodings in general, and Unicode in
particular, are a mess (as of 2014).  Maybe in a few years the dust
will settle.  Then saying 'Unicode' will become redundant.
But until then, while we have a rather leaky abstraction, having
sealing liquid on the hands is preferable to sewage in the house.
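
To make the "leaky abstraction" concrete, here is a minimal sketch of one
place where the leak shows up: the same text becomes different bytes under
different encodings, and decoding with the wrong codec silently produces
garbage instead of raising an error. The sample word and the variable names
are my own illustration, not anything from the thread; the __future__
imports are only there so the snippet behaves the same on 2.7 and 3.x.

```python
from __future__ import print_function, unicode_literals

# Five code points of text; \u00ef is "i with diaeresis" (illustrative sample).
text = "na\u00efve"

# The same text, serialized under two different encodings.
utf8_bytes = text.encode("utf-8")      # the non-ASCII letter takes two bytes
latin1_bytes = text.encode("latin-1")  # the non-ASCII letter takes one byte

print(len(text), len(utf8_bytes), len(latin1_bytes))

# Decoding UTF-8 bytes as Latin-1 raises no error; it just yields different text.
mojibake = utf8_bytes.decode("latin-1")
print(mojibake == text)

# Only the matching codec round-trips back to the original text.
print(utf8_bytes.decode("utf-8") == text)
```

As long as bytes arrive without a reliable label saying which encoding they
are in, somebody has to know or guess, and that is why the word "Unicode"
still earns its keep.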


