(Simple?) Unicode Question

Thu Aug 27 12:49:36 EDT 2009

On Thu, 2009-08-27 at 22:09 +0530, Shashank Singh wrote:
> Hi All!
> 
> I have a very simple (and probably stupid) question eluding me.
> When exactly is the char-set information needed?
> 
> To make my question clear consider reading a file.
> While reading a file, all I get is basically an array of bytes.
> 
> Now suppose a file has 10 bytes in it (all is data, no metadata,
> forget the BOM and stuff for a little while). I read it into an array
> of 10 bytes, replace, say, the 2nd byte and write all the bytes back
> to a new file.
> 
> Do I need the character encoding mumbo jumbo anywhere in this?
> 
> Further, does anything, except a printing device, need to know the
> encoding of a piece of "text"? I mean, as long as we are not trying
> to get a symbolic representation of a "text" or get the "i"th
> character of it, all we need to do is to carry the intended encoding
> as auxiliary information to the data stored as a byte array.

If you are just reading and writing bytes then you are just reading and
writing bytes.  You only need to worry about Unicode, etc. when you
start treating a series of bytes as TEXT (e.g. how many *characters* are
in this byte array).*
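
To illustrate the bytes-vs-characters distinction, here is a minimal
sketch.  It assumes Python 3 (where the text type is "str" rather than
"unicode"), and the sample string and variable names are made up for
the example.  The same byte array gives different character counts
depending on which encoding you assume:

```python
# Hypothetical sample text: two non-ASCII characters, written as escapes
# so the example does not depend on this file's own encoding.
text = "na\u00efve caf\u00e9"

# Encode to get the raw bytes, as they would sit in a file.
data = text.encode("utf-8")

# Counting bytes needs no encoding knowledge at all.
byte_count = len(data)

# Counting characters requires decoding, and the answer depends on
# which encoding is assumed for the very same bytes.
chars_as_utf8 = len(data.decode("utf-8"))
chars_as_latin1 = len(data.decode("latin-1"))

print(byte_count, chars_as_utf8, chars_as_latin1)
```

The byte count is fixed, but "how many characters" has no answer until
an encoding is chosen.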

This is no different, IMO, than treating a file as a byte stream vs. as
an image.  You don't need to worry about resolution, palette, bit depth,
etc. if you are only treating it as a stream of bytes.  The only
difference between the two is that in Python "unicode" is a built-in
type and "image" isn't ;)

* Just make sure that, if you are manipulating byte streams independent
of their textual representation, you open files in binary mode, for
example.
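
A rough sketch of the read/replace/write case from the question,
assuming Python 3.  The helper name replace_second_byte, the file names
and the replacement value 0xFF are all hypothetical, and the input file
is created in a temporary directory only so the example is
self-contained:

```python
import os
import tempfile


def replace_second_byte(src_path, dst_path, new_byte):
    """Copy src_path to dst_path, overwriting the byte at index 1."""
    # "rb" reads raw bytes: no decoding and no newline translation.
    with open(src_path, "rb") as src:
        data = bytearray(src.read())
    data[1] = new_byte
    # "wb" writes the bytes back exactly as they are.
    with open(dst_path, "wb") as dst:
        dst.write(data)


# Build a hypothetical 10-byte input file (values 0..9).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, "in.bin")
dst_path = os.path.join(workdir, "out.bin")
with open(src_path, "wb") as f:
    f.write(bytes(range(10)))

replace_second_byte(src_path, dst_path, 0xFF)

with open(dst_path, "rb") as f:
    result = f.read()
print(list(result))
```

No encoding appears anywhere here; it would only come in if the bytes
were decoded into text.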

-a