RE Module Performance

wxjmfauth at gmail.com wxjmfauth at gmail.com
Thu Jul 25 11:27:42 CEST 2013


On Wednesday, 24 July 2013 16:47:36 UTC+2, Michael Torrie wrote:
> On 07/24/2013 07:40 AM, wxjmfauth at gmail.com wrote:
> > Sorry, you are not understanding Unicode. What is a Unicode
> > Transformation Format (UTF), what is the goal of a UTF and
> > why it is important for an implementation to work with a UTF.
>
> Really?  Enlighten me.
>
> Personally, I would never use UTF as a representation *in memory* for a
> unicode string if it were up to me.  Why?  Because UTF characters are
> not uniform in byte width so accessing positions within the string is
> terribly slow and has to always be done by starting at the beginning of
> the string.  That's at minimum O(n) compared to FSR's O(1).  Surely you
> understand this.  Do you dispute this fact?
>
> UTF is a great choice for interchange, though, and indeed that's what it
> was designed for.
>
> Are you calling for UTF to be adopted as the internal, in-memory
> representation of unicode?  Or would you simply settle for UCS-4?
> Please be clear here.  What are you saying?
>
> > Short example. Writing an editor with something like the
> > FSR is simply impossible (properly).
>
> How? FSR is just an implementation detail.  It could be UCS-4 and it
> would also work.
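
For readers following along, the indexing cost described in the quoted
text can be sketched as below. This is only an illustration, not code
from CPython or from either poster: the function name utf8_char_at is
hypothetical, and the sketch assumes the input is valid UTF-8.

```python
def utf8_char_at(data: bytes, index: int) -> str:
    """Return the character at code point position `index` in UTF-8 bytes.

    Hypothetical helper: it has to walk from the start of the buffer,
    because each character occupies 1 to 4 bytes. Assumes valid UTF-8.
    """
    pos = 0      # current byte offset
    count = 0    # code points passed so far
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:              # 0xxxxxxx: one byte
            width = 1
        elif lead >> 5 == 0b110:     # 110xxxxx: two bytes
            width = 2
        elif lead >> 4 == 0b1110:    # 1110xxxx: three bytes
            width = 3
        else:                        # 11110xxx: four bytes
            width = 4
        if count == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        count += 1
    raise IndexError("code point index out of range")


text = "a\u00e9\u20ac\U0001F600z"
data = text.encode("utf-8")
# Each lookup rescans from byte 0; str indexing needs no such scan.
print([utf8_char_at(data, i) == text[i] for i in range(len(text))])
```

The loop is linear in the index, which is the O(n) mentioned above; a
fixed-width representation computes the offset directly.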

---------

A coding scheme works with a unique set of characters (the repertoire),
and the implementation (the programming) works with a unique set
of encoded code points. The critical step is the path
{unique set of characters} <--> {unique set of encoded code points}


Fact: there is no other way to do it properly. (This explains
why we have to live today with all these coding schemes, and
also why so many coding schemes had to be created.)

How to understand it? With a sheet of paper and a pencil.

In the byte string world, this step is a no-op.

In Unicode, it is exactly the purpose of a "utf" to achieve this
step. "utf": a confusing name covering at the same time the
process and the result of the process.
A "utf chunk", a series of bits (not bytes), intrinsically holds
the information about the character it represents.
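
A small sketch of that round trip, using only Python's built-in
codecs. The helper name code_units is hypothetical, and the three
sample characters are arbitrary choices for illustration.

```python
def code_units(char: str, encoding: str) -> str:
    """Show the encoded form of one character as groups of 8 bits."""
    raw = char.encode(encoding)
    return " ".join(f"{byte:08b}" for byte in raw)


for ch in ("a", "\u00e9", "\u20ac"):
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        raw = ch.encode(enc)
        # The bits alone are enough to get the character back.
        assert raw.decode(enc) == ch
        print(f"U+{ord(ch):04X} {enc:10} {code_units(ch, enc)}")
```

Each utf produces a different bit pattern for the same character, and
each pattern decodes back to that character and no other.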

Other "exotic" coding schemes like iso6937 or "CID-fonts" work
in the same way.

"Unicode" with the help of "utf(s)" does not differ from the basic
rule.

-----

ucs-2: ucs-2 is a perfectly and correctly working coding scheme.
ucs-2 is not different from the other coding schemes and does
not behave differently (cp... or iso-... or ...). It only
covers a smaller repertoire.

-----

utf32: as I have pointed out many times, you are already using it
(maybe without knowing it). Where? In fonts (OpenType technology),
rendering engines, pdf files. Why? Because there is no better
way to do it.

------

The Unicode table (its construction) is a problem per se.
It is not a technical problem, but a very important "linguistic
aspect" of Unicode.
See https://groups.google.com/forum/#!topic/comp.lang.python/XkTKE7U8CS0

------

If you do not understand my "editor" analogy, here is another
proposed exercise: build/create a flexible iso-8859-X coding
scheme. You will quickly understand where the bottleneck
is.
Two working ways:
- stupidly, with an editor and your fingers.
- lazily, with a sheet of paper and your head.


----

About my benchmarks: no offense, but you do not understand them,
because you do not understand what this FSR does or how characters
are coded. It is something of a vicious circle.

Conceptually, this FSR spends its time solving the
problem it creates itself, with plenty of side effects.

-----

There is a clear difference between FSR and ucs-4/utf32.
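
One visible part of that difference is storage width. The sketch below
assumes CPython 3.3 or later (where the FSR is the string
implementation) and relies on sys.getsizeof, so the numbers are an
implementation detail and not a language guarantee. The helper name
bytes_per_char is hypothetical.

```python
import sys


def bytes_per_char(ch: str, n: int = 1000) -> int:
    """Estimate per-character storage of a str made only of `ch`.

    Compares two strings of the same kind so that the fixed object
    header cancels out. CPython-specific.
    """
    shorter = ch * n
    longer = ch * (2 * n)
    return (sys.getsizeof(longer) - sys.getsizeof(shorter)) // n


for ch in ("a", "\u00e9", "\u20ac", "\U0001F600"):
    fsr = bytes_per_char(ch)
    utf32 = len(ch.encode("utf-32-le"))
    print(f"U+{ord(ch):04X}: FSR {fsr} byte(s)/char, utf-32 {utf32} bytes/char")
```

Under the FSR the width depends on the largest code point in the
string, while utf-32 always uses four bytes per character.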

-----

See also:
http://www.unicode.org/reports/tr17/

(To my mind, quite "dry" and not easy to understand on
a first reading.)


jmf




