string to unicode

Mon Aug 15 15:37:18 EDT 2011

Chris Angelico wrote:

> On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.ziff at gmail.com> wrote:
>> if I am using the standard csv library to read contents of a csv file
>> which contains Unicode strings (short example:
>> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
>> decode or encode to transform this string type into a python unicode
>> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
>> this from the file? Perhaps I need to open the file with particular
>> attributes?
> 
> Start here:
> 
> http://www.joelonsoftware.com/articles/Unicode.html
> 
> The CSV file, being stored on disk, cannot contain Unicode strings; it
> can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
> etc), then you can decode it using that. If you don't, your best bet
> is to ask the origin of the file; failing that, check the first few
> bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
> probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
> encodings of the BOM). There may be other clues, too, but normally
> it's best to get the encoding separately from the data rather than try
> to decode it from the data itself.
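
The BOM check described above can be sketched in a few lines.  This is
only an illustration, not a general detector: the names BOMS and
sniff_bom are made up for this example, and only the three BOMs named
in the quoted text are handled.

```python
import codecs

# BOM -> encoding name, for the three cases named in the quoted text.
# Caveat: b'\xff\xfe' is also how the UTF-32LE BOM begins, which is one
# reason a BOM match only makes the encoding "probable".
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def sniff_bom(data):
    """Return (encoding, bom_length) for a leading BOM, else (None, 0)."""
    for bom, encoding in BOMS:
        if data.startswith(bom):
            return encoding, len(bom)
    return None, 0


# The sample bytes from the question, prefixed with a UTF-8 BOM.
raw = codecs.BOM_UTF8 + b'\xe8\x9f\x92\xe8\x9b\x87'
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)  # skip the BOM, decode the rest
```

If there is no BOM, sniff_bom() returns (None, 0) and the encoding has
to come from somewhere else, as the quoted text says.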

As this problem really is not a new one, there are several more – if I may 
say so – pythonic approaches:

<http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file>

chardet worked for me when I improved Billy Mays' "matching brackets" 
checker (the test file was UTF-8-encoded).  Watch for word-wrap:

-----------------------------------------------------------------------
# encoding: utf-8
'''
Created on 2011-07-18

@author: Thomas 'PointedEars' Lahn <PointedEars at web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6 at myhashismyemail.com>
in <news:j01ph6$knt$1 at speranza.aioe.org> 
'''
import sys, os, chardet

pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

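if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            file_path = os.path.join(dirpath, name)

            # Read the raw bytes once so that the same data can be used
            # for both encoding detection and the bracket check.
            with open(file_path, 'rb') as f:
                data = f.read()

            # chardet guesses from the raw bytes; 'encoding' is None
            # when it cannot make a guess (e.g. for an empty file).
            encoding = chardet.detect(data)['encoding']
            if encoding is None:
                print('%s: encoding not detected' % name)
                continue

            stack = []        # open brackets seen so far: (char, line, col)
            first_bad = None  # first offending bracket, if any

            # Decode the whole file before splitting it into lines, so
            # multi-byte encodings are not cut in the middle of a unit.
            text = data.decode(encoding)
            for line_no, line in enumerate(text.split(u'\n'), 1):
                for col, c in enumerate(line, 1):
                    if c not in valid:
                        continue
                    if c not in pairs:
                        stack.append((c, line_no, col))  # opening bracket
                    elif stack and stack[-1][0] == pairs[c]:
                        stack.pop()                      # matching closer
                    elif first_bad is None:
                        first_bad = (c, line_no, col)    # stray closer

            # Brackets still open at the end of the file are errors, too.
            if first_bad is None and stack:
                first_bad = stack[0]

            print('%s: %s' % (name, "good" if first_bad is None
                              else "bad '%s' at %s:%s" % first_bad))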
-----------------------------------------------------------------------
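
To tie this back to the original CSV question: once the encoding is
known (or guessed, e.g. taken from the 'encoding' entry that
chardet.detect() returns), decoding is a single step.  A minimal sketch,
assuming Python 3, where the csv module works on text; read_csv_rows and
the demo file are made-up names for this example.

```python
import csv
import io
import os
import tempfile


def read_csv_rows(path, encoding):
    """Yield each CSV row as a list of text (unicode) strings."""
    # newline='' is what the csv module documentation asks for.
    with io.open(path, encoding=encoding, newline='') as f:
        for row in csv.reader(f):
            yield row


# Demo: write the sample bytes from the question to a temporary file,
# then read them back as decoded text.
fd, demo_path = tempfile.mkstemp(suffix='.csv')
with os.fdopen(fd, 'wb') as f:
    f.write(b'\xe8\x9f\x92\xe8\x9b\x87,python\r\n')

rows = list(read_csv_rows(demo_path, 'utf-8'))
os.remove(demo_path)
```

In Python 2 the csv module reads byte strings, so there one would keep
the file in binary mode and call .decode(encoding) on each cell instead.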

HTH

-- 
PointedEars

Bitte keine Kopien per E-Mail. / Please do not Cc: me.