converting to and from octal escaped UTF-8
MonkeeSage
MonkeeSage at gmail.com
Tue Dec 4 08:40:29 EST 2007
On Dec 3, 8:10 am, Michael Goerz <answer... at 8439.e4ward.com> wrote:
> MonkeeSage wrote:
> > On Dec 3, 1:31 am, MonkeeSage <MonkeeS... at gmail.com> wrote:
> >> On Dec 2, 11:46 pm, Michael Spencer <m... at telcopartners.com> wrote:
>
> >>> Michael Goerz wrote:
> >>>> Hi,
> >>>> I am writing unicode strings into a special text file that requires
> >>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
> >>>> For example, the letter "Í" (latin capital I with acute, code point 205)
> >>>> would come out as "\303\215".
> >>>> I will also have to read back from the file later on and convert the
> >>>> escaped characters back into a unicode string.
> >>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> >>>> vice versa?
> >>> Perhaps something along the lines of:
> >>> >>> def encode(source):
> >>> ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> >>> ...
> >>> >>> def decode(encoded):
> >>> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> >>> ...     return bytes.decode('utf8')
> >>> ...
> >>> >>> encode(u"Í")
> >>> '\\303\\215'
> >>> >>> print decode(_)
> >>> Í
> >>> HTH
> >>> Michael
> >> Nice one. :) If I might suggest a slight variation to handle cases
> >> where the "encoded" string contains plain text as well as octal
> >> escapes...
>
> >> def decode(encoded):
> >>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
> >>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >>     return encoded.decode('utf8')
>
> >> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> >> as well as "adf\\303\\215adf".
>
> >> Regards,
> >> Jordan
>
> > err...
>
> > def decode(encoded):
> >     for octc in re.findall(r'\\(\d{3})', encoded):
> >         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >     return encoded.decode('utf8')
>
> Great suggestions from both of you! I came up with my "final" solution
> based on them. It encodes only non-ascii and non-printable characters, and
> stays in unicode strings for both input and output. Low ascii values now
> also encode into a 3-digit octal sequence, so that decode can catch them
> properly.
>
> Thanks a lot,
> Michael
>
> ____________
>
> import re
>
> def encode(source):
>     encoded = ""
>     for character in source:
>         # escape control characters and everything outside 7-bit ascii
>         if (ord(character) < 32) or (ord(character) > 127):
>             for byte in character.encode('utf8'):
>                 encoded += ("\%03o" % ord(byte))
>         else:
>             encoded += character
>     return encoded.decode('utf-8')
>
> def decode(encoded):
>     decoded = encoded.encode('utf-8')
>     for octc in re.findall(r'\\(\d{3})', decoded):
>         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return decoded.decode('utf8')
>
> orig = u"blaÍblub" + chr(10)
> enc = encode(orig)
> dec = decode(enc)
> print orig
> print enc
> print dec
An optimization: in decode(), store the matches as keys in a dict, so you
only do the string replacement once for each unique escape sequence:
def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')
Untested...
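
Going one step further, the replacement could probably be done in a single
pass with re.sub and a callback, which avoids rescanning the string once per
escape. The sketch below is an adaptation, not the code from this thread: it
assumes Python 3 (where str is unicode and bytes is a separate type), it
narrows the pattern to octal values that fit in one byte, and it also escapes
DEL (127), which is my own choice. The name _OCTAL_ESCAPE is made up for
illustration.

```python
import re

# Assumption: only escapes that fit in one byte (\000 to \377) are matched.
_OCTAL_ESCAPE = re.compile(rb'\\([0-3][0-7]{2})')

def encode(source):
    """Escape control and non-ascii characters as octal UTF-8 bytes."""
    parts = []
    for character in source:
        # Assumption: DEL (127) is treated as non-printable as well.
        if ord(character) < 32 or ord(character) > 126:
            # In Python 3, iterating over bytes yields ints.
            parts.extend('\\%03o' % byte for byte in character.encode('utf-8'))
        else:
            parts.append(character)
    return ''.join(parts)

def decode(encoded):
    """Turn each octal escape back into its byte, then decode as UTF-8."""
    raw = _OCTAL_ESCAPE.sub(
        lambda match: bytes([int(match.group(1), 8)]),
        encoded.encode('utf-8'),
    )
    return raw.decode('utf-8')

orig = "blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print(repr(enc))
print(dec == orig)
```

Like the versions above, this does not escape the backslash itself, so a
literal backslash followed by three octal digits in the original text would
be turned into a byte on the way back.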
Regards,
Jordan