[Python-3000] characters data type

Guido van Rossum guido at python.org
Wed May 3 18:40:59 CEST 2006


On 5/2/06, Fredrik Lundh <fredrik at pythonware.com> wrote:
> I'm still thinking that it might be a good idea to (optionally) delay
> decoding of strings until you're actually doing something that needs access
> to the individual characters, though.  (UTF-8 to UTF-8 shuffling is an
> increasingly common use case).

This seems a reasonable alternative abstraction that could be built on
top of bytes and (unicode) strings. Are you thinking of a situation
where you know that it's UTF-8? Or are you also thinking of doing this
for arbitrary encodings? Without knowing the encoding it's hard to
know where the boundaries between characters are, so you can't safely
split the input into chunks if you may later attempt to decode one of
those chunks.
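
To illustrate the chunk-boundary problem, here is a minimal sketch in
present-day Python 3 (not the Python available when this message was
written). The sample string and the split position are assumptions chosen
for the demonstration; codecs.getincrementaldecoder is a standard-library
call. It shows a split that lands inside a multi-byte UTF-8 sequence, and
how knowing the encoding lets a decoder carry the partial character over
to the next chunk.

```python
import codecs

# Sample data (an assumption for illustration): two characters in this
# string each occupy two bytes in UTF-8.
data = "na\u00efve caf\u00e9".encode("utf-8")

# Split in the middle of the two-byte sequence at byte offsets 2 and 3.
head, tail = data[:3], data[3:]

# Decoding the first chunk on its own fails, because it ends with an
# incomplete character.
try:
    head.decode("utf-8")
    split_ok = True
except UnicodeDecodeError:
    split_ok = False

# When the encoding is known, an incremental decoder buffers the
# incomplete sequence until the rest of it arrives.
decoder = codecs.getincrementaldecoder("utf-8")()
text = decoder.decode(head) + decoder.decode(tail, final=True)
```

Without the encoding there is no way to tell that offset 3 is a bad place
to cut, which is the difficulty described above.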

There is of course nothing to stop you from copying a UTF-8 file in
binary mode -- but you seem to be after something more. Perhaps you
could elaborate on an example and explain some of your assumptions (e.g.
are you only talking about UTF-8)?
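
For concreteness, one possible shape of the delayed-decoding idea is
sketched below in present-day Python 3. LazyText and all of its methods
are hypothetical names invented for this illustration; nothing like it is
proposed in this thread or exists in the standard library. The sketch
assumes the encoding is known up front, and it passes bytes through
unchanged (and unvalidated) when the requested output encoding matches.

```python
import codecs


class LazyText:
    """Hypothetical wrapper: keeps encoded bytes and decodes only on demand."""

    def __init__(self, raw, encoding="utf-8"):
        self.raw = raw            # the original encoded bytes
        self.encoding = encoding  # must be known in advance
        self._text = None         # decoded form, filled in lazily

    @property
    def text(self):
        # Decode once, the first time character-level access is needed.
        if self._text is None:
            self._text = self.raw.decode(self.encoding)
        return self._text

    def __len__(self):
        return len(self.text)

    def __getitem__(self, index):
        return self.text[index]

    def encode(self, encoding="utf-8"):
        # Same-encoding "shuffling": hand back the bytes without decoding.
        # Note that this skips validation of the stored bytes.
        if codecs.lookup(encoding).name == codecs.lookup(self.encoding).name:
            return self.raw
        return self.text.encode(encoding)


s = LazyText("h\u00e9llo".encode("utf-8"))
out = s.encode("utf-8")                 # pass-through, no decode yet
decoded_before = s._text is not None    # still undecoded at this point
second_char = s[1]                      # character access triggers decoding
```

Whether skipping validation on the pass-through path is acceptable is
exactly the kind of assumption that would need to be spelled out.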

--
--Guido van Rossum (home page: http://www.python.org/~guido/)

