RE Module Performance

Steven D'Aprano steve+comp.lang.python at pearwood.info
Sat Jul 27 02:28:56 EDT 2013


On Fri, 26 Jul 2013 08:46:58 -0700, wxjmfauth wrote:

> BTW, I'm pleased to read "sequence of bits" and not bytes. Again, utf
> transformers are producing sequence of bits, call Unicode Transformation
> Units, with lengths of 8/16/32 *bits*, from there the names utf8/16/32.
> UCS transformers are (were) producing bytes, from there the names
> ucs-2/4.


Not only does your distinction between bits and bytes make no practical 
difference on nearly all hardware in common use today[1], but the Unicode 
Consortium disagrees with you, and defines UTF in terms of bytes:

"A Unicode transformation format (UTF) is an algorithmic mapping from 
every Unicode code point (except surrogate code points) to a unique byte 
sequence."

http://www.unicode.org/faq/utf_bom.html#gen2
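
To make that concrete, here is a small sketch of my own (not taken from the 
FAQ) that encodes a single code point with each UTF and looks at the result. 
The helper name utf_bytes and the choice of the euro sign are just 
assumptions for illustration; the little-endian codec names are used so 
that no BOM is prepended.

```python
# Encode one code point with each UTF and inspect the resulting bytes.
# utf_bytes is a hypothetical helper written for this illustration.
def utf_bytes(ch):
    """Return a dict mapping codec name to the encoded bytes of ch."""
    codecs = ("utf-8", "utf-16-le", "utf-32-le")
    return {name: ch.encode(name) for name in codecs}

encoded = utf_bytes("\u20ac")  # EURO SIGN, U+20AC
for name, data in encoded.items():
    # Each result is a bytes object: a sequence of 8-bit values.
    print(name, len(data), data.hex())
```

Whatever the size of the code unit (8, 16 or 32 bits), what comes out of 
str.encode is a bytes object, so in practice you are always handling a 
byte sequence.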




[1] There may still be some old supercomputers in use where a byte is more 
than 8 bits, but they're unlikely to support Unicode.

-- 
Steven


