converting to and from octal escaped UTF-8

MonkeeSage MonkeeSage at gmail.com
Tue Dec 4 08:40:29 EST 2007


On Dec 3, 8:10 am, Michael Goerz <answer... at 8439.e4ward.com> wrote:
> MonkeeSage wrote:
> > On Dec 3, 1:31 am, MonkeeSage <MonkeeS... at gmail.com> wrote:
> >> On Dec 2, 11:46 pm, Michael Spencer <m... at telcopartners.com> wrote:
>
> >>> Michael Goerz wrote:
> >>>> Hi,
> >>>> I am writing unicode strings into a special text file that requires
> >>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
> >>>> For example, the letter "Í" (latin capital I with acute, code point 205)
> >>>> would come out as "\303\215".
> >>>> I will also have to read back from the file later on and convert the
> >>>> escaped characters back into a unicode string.
> >>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> >>>> vice versa?
> >>> Perhaps something along the lines of:
> >>>   >>> def encode(source):
> >>>   ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> >>>   ...
> >>>   >>> def decode(encoded):
> >>>   ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> >>>   ...     return bytes.decode('utf8')
> >>>   ...
> >>>   >>> encode(u"Í")
> >>>   '\\303\\215'
> >>>   >>> print decode(_)
> >>>   Í
> >>> HTH
> >>> Michael
> >> Nice one. :) If I might suggest a slight variation to handle cases
> >> where the "encoded" string contains plain text as well as octal
> >> escapes...
>
> >> def decode(encoded):
> >>   for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
> >>     encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >>   return encoded.decode('utf8')
>
> >> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> >> as well as "adf\\303\\215adf".
>
> >> Regards,
> >> Jordan
>
> > err...
>
> > def decode(encoded):
> >   for octc in re.findall(r'\\(\d{3})', encoded):
> >     encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >   return encoded.decode('utf8')
>
> Great suggestions from both of you! I came up with my "final" solution
> based on them. It encodes only non-ascii and non-printables, and stays
> in unicode strings for both input and output. Also, low ascii values now
> encode into a 3-digit octal sequence also, so that decode can catch them
> properly.
>
> Thanks a lot,
> Michael
>
> ____________
>
> import re
>
> def encode(source):
>     encoded = ""
>     for character in source:
>         if (ord(character) < 32) or (ord(character) > 126):
>             for byte in character.encode('utf8'):
>                 encoded += ("\%03o" % ord(byte))
>         else:
>             encoded += character
>     return encoded.decode('utf-8')
>
> def decode(encoded):
>     decoded = encoded.encode('utf-8')
>     for octc in re.findall(r'\\(\d{3})', decoded):
>         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return decoded.decode('utf8')
>
> orig = u"blaÍblub" + chr(10)
> enc  = encode(orig)
> dec  = decode(enc)
> print orig
> print enc
> print dec
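
For anyone reading this on Python 3, here is a rough sketch of the same encode() idea. It is an adaptation, not Michael's code: it assumes str input and output, and treats everything outside printable ASCII (32 through 126) as needing an escape. Iterating over a bytes object already yields ints there, so the ord() call on each byte goes away.

```python
def encode(source):
    """Escape non-printable and non-ASCII characters as \\ooo octal UTF-8 bytes."""
    pieces = []
    for character in source:
        if 32 <= ord(character) <= 126:
            # Printable ASCII passes through unchanged.
            pieces.append(character)
        else:
            # Iterating over bytes yields ints in Python 3, so no ord() is needed.
            for byte in character.encode('utf-8'):
                pieces.append('\\%03o' % byte)
    return ''.join(pieces)

print(encode('blaÍblub\n'))  # prints bla\303\215blub\012
```

Like the version above, this leaves a literal backslash in the input unescaped, so text that already contains something like \101 will not survive a round trip.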

An optimization: in decode(), store the matches as keys in a dict, so
that the string replacement runs only once for each unique escape
sequence rather than once for every occurrence...

def decode(encoded):
  decoded = encoded.encode('utf-8')
  matches = {}
  for octc in re.findall(r'\\(\d{3})', decoded):
    matches[octc] = None
  for octc in matches:
    decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
  return decoded.decode('utf8')

Untested...
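
A different way to get the same effect, sketched here for Python 3 and not taken from the code above: let re.sub do a single pass with a callback, so every escape is replaced exactly once and no bookkeeping of unique matches is needed. The name OCTAL_ESCAPE is made up for this example, and the pattern is narrowed to [0-3][0-7]{2} on the assumption that only values that fit in one byte (000 to 377) should be treated as escapes.

```python
import re

# Three octal digits whose value fits in one byte (000-377).
OCTAL_ESCAPE = re.compile(rb'\\([0-3][0-7]{2})')

def decode(encoded):
    """Turn \\ooo octal escapes back into bytes, then decode them as UTF-8."""
    raw = encoded.encode('utf-8')
    # Each match is replaced once, left to right; the output is never rescanned.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode('utf-8')

print(decode('bla\\303\\215blub\\012'))
```

One side effect of the single pass is that a decoded backslash (\134) cannot combine with the digits after it to form a new escape, which repeated replace() calls could do depending on the order of the matches.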

Regards,
Jordan


