string to unicode
Thomas 'PointedEars' Lahn
PointedEars at web.de
Mon Aug 15 15:37:18 EDT 2011
Chris Angelico wrote:
> On Mon, Aug 15, 2011 at 4:20 PM, Artie Ziff <artie.ziff at gmail.com> wrote:
>> if I am using the standard csv library to read contents of a csv file
>> which contains Unicode strings (short example:
>> '\xe8\x9f\x92\xe8\x9b\x87'), how do I use a python Unicode method such as
>> decode or encode to transform this string type into a python unicode
>> type? Must I know the encoding (byte groupings) of the Unicode? Can I get
>> this from the file? Perhaps I need to open the file with particular
>> attributes?
>
> Start here:
>
> http://www.joelonsoftware.com/articles/Unicode.html
>
> The CSV file, being stored on disk, cannot contain Unicode strings; it
> can only contain bytes. If you know the encoding (eg UTF-8, UCS-2,
> etc), then you can decode it using that. If you don't, your best bet
> is to ask the origin of the file; failing that, check the first few
> bytes - if it's "\xFF\xFE" or "\xFE\xFF" or "\xEF\xBB\xBF", then it's
> probably UTF-16LE, UTF-16BE, or UTF-8, respectively (those being the
> encodings of the BOM). There may be other clues, too, but normally
> it's best to get the encoding separately from the data rather than try
> to decode it from the data itself.
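
The BOM check Chris describes can be sketched roughly as follows. This is only an illustration, not something from the quoted post: the names `BOMS`, `sniff_bom` and `decode_with_bom` are made up here, and falling back to UTF-8 when no BOM is found is an assumption. Note also that a UTF-32LE BOM starts with the same two bytes as the UTF-16LE one, so a match is a hint rather than proof.

```python
import codecs

# Known BOM byte sequences and the codec each one suggests.
BOMS = [
    (codecs.BOM_UTF8, 'utf-8'),          # b'\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, 'utf-16-le'),  # b'\xff\xfe'
    (codecs.BOM_UTF16_BE, 'utf-16-be'),  # b'\xfe\xff'
]


def sniff_bom(data):
    """Return (codec name, BOM length), or (None, 0) if no known BOM leads data."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, default='utf-8'):
    """Decode bytes, using the BOM if present and `default` (an assumption) otherwise."""
    name, bom_len = sniff_bom(data)
    # Skip the BOM itself so it does not end up in the decoded text.
    return data[bom_len:].decode(name or default)
```

Applied to the byte string from the question, `decode_with_bom(b'\xe8\x9f\x92\xe8\x9b\x87')` finds no BOM and decodes the six bytes as UTF-8 into two characters.
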
As this problem really is not a new one, there are several more – if I may
say so – pythonic approaches:
<http://stackoverflow.com/questions/436220/python-is-there-a-way-to-determine-the-encoding-of-text-file>
While improving Billy Mays' "matching brackets" checker, I found that chardet
worked for me (the test file was UTF-8-encoded). Watch for word-wrap:
-----------------------------------------------------------------------
# encoding: utf-8
'''
Created on 2011-07-18
@author: Thomas 'PointedEars' Lahn <PointedEars at web.de>, based on an idea of
Billy Mays <81282ed9a88799d21e77957df2d84bd6514d9af6 at myhashismyemail.com>
in <news:j01ph6$knt$1 at speranza.aioe.org>
'''
import sys, os, chardet

# Maps each closing bracket to its opening counterpart.
pairs = {u'}': u'{', u')': u'(', u']': u'[',
         u'”': u'“', u'›': u'‹', u'»': u'«',
         u'】': u'【', u'〉': u'〈', u'》': u'《',
         u'」': u'「', u'』': u'『'}
valid = set(v for pair in pairs.items() for v in pair)

if __name__ == '__main__':
    for dirpath, dirnames, filenames in os.walk(sys.argv[1]):
        for name in filenames:
            # Stack of (opener, line_no, col); the sentinel entry keeps
            # stack[-1] valid while no bracket is open.
            stack = [(u' ', 0, 0)]
            first_bad = None
            file_path = os.path.join(dirpath, name)
            with open(file_path, 'rb') as f:
                data = f.read()

            # Detect on the raw bytes, then decode once.  chardet reports
            # None if it cannot guess (e.g. for an empty file), so fall
            # back to UTF-8 and replace undecodable bytes.
            encoding = chardet.detect(data)['encoding'] or 'utf-8'
            text = data.decode(encoding, 'replace')

            chars = ((c, line_no, col)
                     for line_no, line in enumerate(text.splitlines(), 1)
                     for col, c in enumerate(line, 1) if c in valid)
            for c, line_no, col in chars:
                if c in pairs:
                    if stack[-1][0] == pairs[c]:
                        stack.pop()
                    elif first_bad is None:
                        first_bad = (c, line_no, col)
                else:
                    stack.append((c, line_no, col))

            # An opener that was never closed is an error, too.
            if first_bad is None and len(stack) > 1:
                first_bad = stack[1]

            print('%s: %s' % (name, "good" if first_bad is None
                              else "bad '%s' at %s:%s" % first_bad))
-----------------------------------------------------------------------
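
As for the csv part of the original question, here is a minimal sketch, assuming Python 3 and that the encoding is already known (UTF-8 is assumed as the default). The helper names `rows_from_bytes` and `read_csv_rows` are hypothetical; only `csv.reader`, `io.StringIO` and `open` are standard library calls.

```python
import csv
import io


def rows_from_bytes(data, encoding='utf-8'):
    """Decode raw CSV bytes and return the rows as lists of text strings."""
    text = data.decode(encoding)
    # newline='' leaves line endings untouched so the csv module can
    # handle them itself, including newlines inside quoted fields.
    return list(csv.reader(io.StringIO(text, newline='')))


def read_csv_rows(path, encoding='utf-8'):
    """Same as above, but let open() do the decoding while reading the file."""
    with open(path, 'r', encoding=encoding, newline='') as f:
        return list(csv.reader(f))
```

For instance, `rows_from_bytes(b'id,word\r\n1,\xe8\x9f\x92\xe8\x9b\x87\r\n')` returns two rows, and the second cell of the second row is a two-character text string rather than six bytes. With Python 2's csv module the reader yields byte strings instead, so each cell would have to be decoded separately, e.g. `cell.decode('utf-8')`.
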
HTH
--
PointedEars
Bitte keine Kopien per E-Mail. / Please do not Cc: me.