[Tutor] converting string to text

Marc Tompkins marc.tompkins at gmail.com
Wed Jul 10 19:19:02 CEST 2013


On Wed, Jul 10, 2013 at 3:45 AM, Dave Angel <davea at davea.name> wrote:

>
> Get rid of the BOM from the data file, and it'll work fine.  You don't
> specify what version of Python you're using, so I have to guess.  But
> there's a utf-8-encoded BOM at the beginning of that file, and
> that's not numeric.  Best would be to change the way you generate that
> file, and don't put in a BOM for utf-8.
>
> BOMs are markers that are put at the beginning of certain encodings of
> files to distinguish between BE and LE encodings.  But since your file is
> utf-8, a BOM is unnecessary and confusing.


Just jumping in to translate a bit of jargon...

BOM stands for Byte Order Mark
(http://www.opentag.com/xfaq_enc.htm#enc_bom).
BE stands for "big-endian", and LE stands for "little-endian".

Since the first digital computers were built, there have been two schools
of thought as to how multi-byte numbers should be stored:  with the "most
significant" byte first, or the "least significant" byte first.  The two
schools are called "big-endian" and "little-endian", after a famous
controversy in "Gulliver's Travels".  The BOM is a short sequence of bytes
at the beginning of Unicode text that tells the reader whether the rest of
the text is big-endian or little-endian.  UTF-8 is defined as a sequence of
single bytes, so it has no byte order to mark and a BOM is not actually
needed.
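
To make that concrete, here is a small sketch (assuming Python 3; the
sample text "A" is only an illustration, not something from the original
question).  It encodes the same character in both byte orders and shows
the BOM constants from the standard codecs module:

```python
import codecs

text = "A"  # U+0041, chosen only as an example

# UTF-16 uses two bytes for this character, so byte order matters.
big = text.encode("utf-16-be")     # most significant byte first: b'\x00A'
little = text.encode("utf-16-le")  # least significant byte first: b'A\x00'

# The BOM is the character U+FEFF written in the chosen byte order.
bom_be = codecs.BOM_UTF16_BE       # b'\xfe\xff'
bom_le = codecs.BOM_UTF16_LE       # b'\xff\xfe'

# A reader using the generic "utf-16" codec looks at the BOM to pick the order.
decoded_be = (bom_be + big).decode("utf-16")
decoded_le = (bom_le + little).decode("utf-16")

# UTF-8 has only one possible byte sequence; the "BOM" is just a signature.
utf8 = text.encode("utf-8")
utf8_with_sig = text.encode("utf-8-sig")
```

Both decoded_be and decoded_le come back as the original text, and
utf8_with_sig is simply codecs.BOM_UTF8 followed by the plain utf-8 bytes.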



> It may even be illegal, but I'm not sure about that.
>

No, it's not illegal; the Unicode standard permits a BOM at the start of
utf-8 data, though it neither requires nor recommends one.  Some tools
write it anyway - so now even utf-8 comes in two flavors (with and without
BOM)!
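
If changing how the file is generated isn't an option, one way to cope
with both flavors is Python's "utf-8-sig" codec, which drops a leading BOM
if there is one.  A minimal sketch, assuming Python 3; the sample bytes and
the parse_numbers helper are made up for illustration and are not from the
original question:

```python
import codecs

# Hypothetical file contents: a utf-8 BOM followed by two numeric lines.
raw = codecs.BOM_UTF8 + b"42\n17\n"

# Plain "utf-8" keeps the BOM as the character U+FEFF at the start.
plain = raw.decode("utf-8")

# "utf-8-sig" strips a leading BOM, and also accepts data without one.
clean = raw.decode("utf-8-sig")


def parse_numbers(text):
    """Convert each non-empty line of text to an int (hypothetical helper)."""
    return [int(line) for line in text.splitlines() if line.strip()]


numbers = parse_numbers(clean)

# With the BOM left in place, the first line is not numeric.
try:
    parse_numbers(plain)
    bom_broke_parsing = False
except ValueError:
    bom_broke_parsing = True
```

The same codec name works when reading a file directly, e.g.
open(path, encoding="utf-8-sig").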