Re: a little parsing challenge ☺

Xah Lee xahlee at gmail.com
Tue Jul 19 04:58:15 EDT 2011


On Jul 18, 7:07 pm, Billy Mays <no... at nohow.com> wrote:
> On 7/18/2011 7:56 PM, Steven D'Aprano wrote:
>
> > Billy Mays wrote:
>
> >> On 07/17/2011 03:47 AM, Xah Lee wrote:
> >>> 2011-07-16
>
> >> I gave it a shot.  It doesn't do any of the Unicode delims, because
> >> let's face it, Unicode is for goobers.
>
> > Goobers... that would be one of those new-fangled slang terms that the young
> > kids today use to mean its opposite, like "bad", "wicked" and "sick",
> > correct?
>
> > I mention it only because some people might mistakenly interpret your words
> > as a childish and feeble insult against the 98% of the world who want or
> > need more than the 127 characters of ASCII, rather than understand you
> > meant it as a sign of the utmost respect for the richness and diversity of
> > human beings and their languages, cultures, maths and sciences.
>
> TL;DR version: international character sets are a problem, and Unicode
> is not the answer to that problem.
>
> For as long as I have used Python (which I admit has only been 3
> years), Unicode has never appeared to be implemented correctly.  I'm
> probably repeating old arguments here, but whatever.
>
> Unicode is a mess.  When someone says ASCII, you know they can only
> mean characters 0-127.  When someone says Unicode, do they mean actual
> Unicode code points (and are those stored 2 bytes wide or 4?), or
> UTF-32, or UTF-16, or UTF-8?  When using the 'u' typecode with the
> array module, the docs don't even tell you whether it's 2 bytes wide
> or 4.  Which is it?  I'm sure all of these questions can be answered,
> but the problem is that now I have to ask every one of them whenever I
> want to use strings.
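
(for what it's worth, you can at least ask the interpreter at runtime.
a minimal sketch in Python 3, assuming the pre-3.3 narrow/wide build
distinction:)

    import array, sys

    # the 'u' itemsize depends on how the interpreter was built:
    # 2 bytes on a "narrow" build, 4 bytes on a "wide" build
    print(array.array('u').itemsize)   # 2 or 4
    print(sys.maxunicode)              # 65535 (narrow) or 1114111 (wide)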
>
> Secondly, Python doesn't do Unicode exception handling correctly (but
> I suspect it's a broader problem with languages in general).  A good
> example of this is UTF-8, where certain byte values can never appear
> in valid data (such as 0xC0, 0xC1, 0xF5, 0xF6, 0xF7, 0xF8, ..., 0xFF,
> but you already knew that, as well as everyone else who wants to use
> strings for some reason).
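
(easy to verify; a quick sketch in Python 3, and the same idea works
with 8-bit strings in Python 2:)

    # 0xC0 can never start a well-formed UTF-8 sequence
    try:
        b'\xc0\xaf'.decode('utf-8')
    except UnicodeDecodeError as e:
        print(e)   # can't decode byte 0xc0 in position 0: invalid start byte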
>
> When embedding Python in a long-running application where user input
> is received, it is very easy to make a mistake that brings down the
> whole program.  If any user string isn't properly wrapped in
> try/except, a user can craft a malformed byte sequence that the UTF-8
> decoder will choke on.  A single-byte encoding that assigns all 256
> values, like Latin-1, doesn't have this problem, since every byte
> decodes to something.
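
(if you'd rather not wrap every decode in try/except, you can pick an
error handler up front; a minimal sketch:)

    data = b'hello \xf5 world'                # one malformed byte
    print(data.decode('utf-8', 'replace'))    # hello \ufffd world
    print(data.decode('utf-8', 'ignore'))     # hello  world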
>
> Another (this must have been a good laugh amongst the UniDevs)
> 'feature' of Unicode is the zero-width space (code point U+200B,
> encoded in UTF-8 as 0xE2 0x80 0x8B).  Any string can masquerade as any
> other string by placing a few of these in it.  Any word filters you
> might have are now defeated by some cheesy Unicode nonsense character.
> Can you just check for these characters and strip them out?  Yes.
> Should you have to?  I would say no.
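
(concretely, the masquerade and its one-line fix:)

    s = 'bad\u200bword'                  # renders exactly like 'badword'
    print(s == 'badword')                          # False
    print(s.replace('\u200b', '') == 'badword')    # True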
>
> Does it get better?  Of course!  International character sets used for
> domain names use yet another scheme (Punycode).  Are the following two
> domain names the same: tést.com, xn--tst-bma.com?  Who knows!
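
(Python's built-in 'idna' codec will at least answer that one; the two
spellings do name the same host:)

    print('tést.com'.encode('idna'))           # b'xn--tst-bma.com'
    print(b'xn--tst-bma.com'.decode('idna'))   # 'tést.com'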
>
> I suppose I can gloss over the pains of using Unicode in C, with
> every string needing to be length-prefixed since 0x00 is a valid code
> point (the byte 0x00 in UTF-8, 0x0000 in the 2-byte encodings), or
> else suffering the O(n) lookup time of strlen and concatenation
> operations.
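
(seen from Python, but the bytes are what a C program would have to
store:)

    s = 'a\x00b'
    print(len(s))             # 3, U+0000 is an ordinary code point
    print(s.encode('utf-8'))  # b'a\x00b', a raw 0x00 byte, fatal to strlen()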
>
> Can it get even better?  Yep.  We also now need a Byte Order Mark
> (BOM) to determine the endianness of our characters.  Are they
> little-endian or big-endian?  (Or perhaps one of the two possible
> middle-endian orderings?)  Who knows?  String processing with Unicode
> is unpleasant, to say the least.  I suppose that's what we get when
> things are designed by committee.
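
(the codecs module shows the mechanics; a small sketch:)

    import codecs

    print('A'.encode('utf-16-le'))   # b'A\x00', byte order fixed, no BOM
    print('A'.encode('utf-16-be'))   # b'\x00A'
    raw = 'A'.encode('utf-16')       # platform byte order, BOM prepended
    print(raw[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE))   # True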
>
> But hey!  The great thing about standards is that there are so many to
> choose from.
>
> --
> Bill

might check out my take

〈Xah's Unicode Tutorial〉
http://xahlee.org/Periodic_dosage_dir/unicode.html

especially good for emacs users.

if you grew up with English, Unicode might seem complex or difficult
due to unfamiliarity.

but for Asian people, when you don't have an alphabet, it's kinda
strange to think that a byte is a char. The notion simply doesn't exist
and is impossible to establish. There were many encodings for Chinese
before Unicode. Even today, Unicode isn't the norm in Taiwan or China.
Taiwan uses Big5, China uses GB18030, which contains all the chars of
Unicode.
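
for example, the same two chars (中文, "Chinese writing") in the three
encodings; GB18030 is a superset that can round-trip any Unicode code
point:

    s = '中文'
    print(s.encode('big5'))      # b'\xa4\xa4\xa4\xe5'
    print(s.encode('gb18030'))   # b'\xd6\xd0\xce\xc4'
    print(s.encode('utf-8'))     # b'\xe4\xb8\xad\xe6\x96\x87'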

~8 years ago I thought it'd be great if China adopted Unicode sometime
in the future... so that we'd all have just one charset to deal with.
But that's never gonna happen. On the contrary, I'm now thinking
there's a possibility that the world adopts GB18030 someday. lol. If
you go to alexa.com for traffic rankings, a good percentage of the top
sites are Chinese these days, more and more as I've observed since the
mid 2000s.

by the way, here's what these matching pairs are used for.

‹single french quote›
«double french quote»

the 〈〉 《》 are Chinese brackets used for book titles and the like (CD,
TV program, show titles, etc.).
the 「」 『』 are traditional Chinese quotes, like English's ‘single
curly’ and “double curly”.
the 【】 〖〗 〔〕 and a few others are variant brackets, similar to
English's () {} [].
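
to see exactly which code points these are, Python's unicodedata can
name them (a minimal sketch):

    import unicodedata

    for ch in '‹«〈《「『【〖〔':
        print('U+%04X %s' % (ord(ch), unicodedata.name(ch)))
    # U+2039 SINGLE LEFT-POINTING ANGLE QUOTATION MARK
    # U+00AB LEFT-POINTING DOUBLE ANGLE QUOTATION MARK
    # U+3008 LEFT ANGLE BRACKET ...and so on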

 Xah


