Unicode codepoints
Chris Angelico
rosuav at gmail.com
Wed Jun 22 00:00:22 EDT 2011
On Wed, Jun 22, 2011 at 1:37 PM, Saul Spatz <saul.spatz at gmail.com> wrote:
> Hi,
>
> I'm just starting to learn a bit about Unicode. I want to be able to read a utf-8 encoded file, and print out the codepoints it encodes. After many false starts, here's a script that seems to work, but it strikes me as awfully awkward and unpythonic. Have you a better way?
Once you have your data as a Unicode string (and you seem to be using
Python 3, so 's' will be a Unicode string), wouldn't a list of its
codepoints simply be this?
    for c in s:
        print('U+'+hex(ord(c))[2:])
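If you'd rather have the conventional zero-padded, upper-case form (U+0061 rather than U+61), a format string does it in one step. A minimal sketch; format_codepoints is just a name I've made up for illustration:

```python
def format_codepoints(s):
    """Return a list of 'U+XXXX' labels, one per character in s."""
    # %04X pads to at least four hex digits and upper-cases them
    return ['U+%04X' % ord(c) for c in s]

# 'a', e-acute and the euro sign, written as escapes so the source
# file's own encoding doesn't matter
print(' '.join(format_codepoints('a\u00e9\u20ac')))
# prints: U+0061 U+00E9 U+20AC
```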
But if you do need the codePoints() function, I'd do it as a generator.
> def codePoints(s):
>     ''' return a list of the Unicode codepoints in the string s '''
>     skip = False
>     for k, c in enumerate(s):
>         if skip:
>             skip = False
>             yield ord(s[k-1:k+1])
>             continue
>         if not 0xd800 <= ord(c) <= 0xdfff:
>             yield ord(c)
>         else:
>             skip = True
Your main function doesn't even have to change - it's iterating over
the list, so it may as well iterate over the generator instead.
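One caveat on the generator: ord() only accepts a two-character surrogate pair on narrow builds, as far as I know, so here's a rough sketch that combines the pair arithmetically instead. The name code_points and the pass-through handling of unpaired surrogates are my own assumptions, not anything from your script:

```python
def code_points(s):
    """Yield the code points in s, combining UTF-16 surrogate pairs."""
    high = None  # a high surrogate waiting for its partner, if any
    for c in s:
        n = ord(c)
        if high is not None:
            if 0xDC00 <= n <= 0xDFFF:
                # valid pair: rebuild the supplementary-plane code point
                yield 0x10000 + ((high - 0xD800) << 10) + (n - 0xDC00)
                high = None
                continue
            # unpaired high surrogate: pass it through unchanged
            yield high
            high = None
        if 0xD800 <= n <= 0xDBFF:
            high = n
        else:
            yield n
    if high is not None:
        # string ended on an unpaired high surrogate
        yield high

# D83D DE00 is the surrogate pair for U+1F600
for cp in code_points('a\ud83d\ude00b'):
    print('U+%04X' % cp)
```

It's used exactly like the original: anything that looped over the list can loop over the generator.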
But I don't really understand what codePoints() does. Is it expecting
the parameter to be a string of bytes or of Unicode characters?
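To show why the distinction matters, here's a small sketch (assuming Python 3; the variable names are only for illustration). Iterating over bytes gives you the encoded byte values, while iterating over the decoded string gives you the codepoints:

```python
# The same two characters, 'h' and e-acute, as UTF-8 bytes and as text
data = 'h\u00e9'.encode('utf-8')   # bytes object, three bytes long
text = data.decode('utf-8')        # str object, two characters long

byte_values = list(data)               # iterating bytes yields ints
codepoints = [ord(c) for c in text]    # iterating str yields characters

print(byte_values)   # [104, 195, 169]
print(codepoints)    # [104, 233]
```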
Chris Angelico