[I18n-sig] Unicode surrogates: just say no!
Tom Emerson
tree@basistech.com
Tue, 26 Jun 2001 10:39:51 -0400
Martin v. Loewis writes:
> > Martin has hinted at a solution requiring even less memory per string
> > object, but I don't know for sure what he is thinking of. All I can
> > imagine is a single flag saying "this string contains no surrogates".
>
> That was my original idea. I later thought that having a count of
> surrogate pairs would be better, since it allows len() to be computed
> in constant time. Indexing would be linear time only for strings
> containing surrogates, and constant time otherwise.
Just so I understand: the codec will set this flag/length when it
transcodes to the internal representation?
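To make sure we are talking about the same thing, here is a rough sketch of how I read the counting idea. It is only an illustration in Python, not a proposal for the actual C implementation: the class name U16String and its attributes are made up, str.encode() stands in for the codec, and it assumes well-formed input (no lone surrogates).

```python
HIGH_FIRST, HIGH_LAST = 0xD800, 0xDBFF  # high (leading) surrogate range


def is_high(unit):
    """True if a 16-bit code unit is a high (leading) surrogate."""
    return HIGH_FIRST <= unit <= HIGH_LAST


class U16String:
    """Hypothetical UTF-16 string that caches its surrogate-pair count."""

    def __init__(self, text):
        # Stand-in for the codec: transcode to 16-bit code units and
        # count the surrogate pairs once, at construction time.
        data = text.encode("utf-16-le")
        self.units = [int.from_bytes(data[i:i + 2], "little")
                      for i in range(0, len(data), 2)]
        self.pairs = sum(1 for u in self.units if is_high(u))

    def __len__(self):
        # Constant time: each pair uses two code units for one character.
        return len(self.units) - self.pairs

    def __getitem__(self, index):
        # Negative indices and slices are left out to keep the sketch short.
        if not 0 <= index < len(self):
            raise IndexError("character index out of range")
        if self.pairs == 0:
            # No surrogates: code unit index equals character index.
            return chr(self.units[index])
        # Surrogates present: walk from the start, skipping pairs as a unit.
        pos = 0
        for _ in range(index):
            pos += 2 if is_high(self.units[pos]) else 1
        unit = self.units[pos]
        if is_high(unit):
            low = self.units[pos + 1]
            return chr(0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
        return chr(unit)
```

With this, U16String("a\U00010400b") holds four code units but reports a length of 3, and only strings with a non-zero pair count pay for the linear scan when indexed.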
> [on sre]
> > There are two parts to this: the internal
> > engine needs to realize that e.g. "." and certain "[...]" sets may
> > match a surrogate pair, and the indices returned by e.g. the span()
> > method of match objects should be translated to character indices as
> > expected by the applications.
>
> For character classes, it may be acceptable to require that they
> contain only BMP characters; span would use the conversion macros, and
> . would need special casing. I agree this is terrible, but it could work.
UTR #18 describes the impact of surrogates on regular expressions.
http://www.unicode.org/unicode/reports/tr18/#Surrogates
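As for translating the indices returned by span(), a minimal sketch of the conversion might look like the following. The helper names are hypothetical and are not the conversion macros Martin refers to. The sketch assumes the engine reports offsets in 16-bit code units and that those offsets never fall in the middle of a surrogate pair.

```python
def to_units(text):
    """Return the UTF-16 code units of text as a list of ints."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little")
            for i in range(0, len(data), 2)]


def unit_to_char_index(units, unit_index):
    """Convert a code unit offset into a character offset.

    Every character starts with a unit that is not a low (trailing)
    surrogate, so counting those units before the offset gives the
    number of characters that precede it. This is a linear scan.
    """
    return sum(1 for u in units[:unit_index] if not 0xDC00 <= u <= 0xDFFF)


def translate_span(units, span):
    """Translate a (start, end) span from code units to characters."""
    start, end = span
    return unit_to_char_index(units, start), unit_to_char_index(units, end)
```

For the code units of "a\U00010400b", a match on the final "b" that the engine reports as (3, 4) translates to (2, 3) in character indices.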
> Still, exploiting the platform's wchar_t might avoid copies in some
> cases (I'm thinking of my iconv codec in particular), so that would
> give a speed-up.
Excellent point.
-tree
--
Tom Emerson Basis Technology Corp.
Sr. Sinostringologist http://www.basistech.com
"Beware the lollipop of mediocrity: lick it once and you suck forever"