converting to and from octal escaped UTF-8

MonkeeSage MonkeeSage at gmail.com
Mon Dec 3 02:37:57 EST 2007


On Dec 3, 1:31 am, MonkeeSage <MonkeeS... at gmail.com> wrote:
> On Dec 2, 11:46 pm, Michael Spencer <m... at telcopartners.com> wrote:
>
>
>
> > Michael Goerz wrote:
> > > Hi,
>
> > > I am writing unicode strings into a special text file that requires
> > > non-ASCII characters to be written as octal-escaped UTF-8 codes.
>
> > > For example, the letter "Í" (latin capital I with acute, code point 205)
> > > would come out as "\303\215".
>
> > > I will also have to read back from the file later on and convert the
> > > escaped characters back into a unicode string.
>
> > > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> > > vice versa?
>
> > Perhaps something along the lines of:
>
> >   >>> def encode(source):
> >   ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> >   ...
> >   >>> def decode(encoded):
> >   ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> >   ...     return bytes.decode('utf8')
> >   ...
> >   >>> encode(u"Í")
> >   '\\303\\215'
> >   >>> print decode(_)
> >   Í
>
> > HTH
> > Michael
>
> Nice one. :) If I might suggest a slight variation to handle cases
> where the "encoded" string contains plain text as well as octal
> escapes...
>
> def decode(encoded):
>   for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>     encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>   return encoded.decode('utf8')
>
> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> as well as "adf\\303\\215adf".
>
> Regards,
> Jordan

err...

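import re

def decode(encoded):
  # swap each backslash + three-digit escape for the byte it names
  for octc in re.findall(r'\\(\d{3})', encoded):
    encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
  # the result is a UTF-8 byte string (Python 2 str)
  return encoded.decode('utf8')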
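
For anyone on Python 3, where str and bytes are separate types, here is a rough sketch of the same round trip. It is only one way to do it, under a few assumptions of mine: plain ASCII is written through unchanged, every other UTF-8 byte becomes a backslash plus three octal digits, and the text never contains a literal backslash followed by three octal digits (that case would be ambiguous on decode). The function names just mirror the ones used earlier in the thread.

```python
import re

def encode(source):
    # ASCII bytes pass through; every other UTF-8 byte becomes \ooo (octal).
    return "".join(
        chr(b) if b < 0x80 else "\\%03o" % b
        for b in source.encode("utf-8")
    )

def decode(encoded):
    # Assumes the escaped text itself is pure ASCII.
    # Replace each \ooo escape (000-377) with the byte it names in a
    # single pass, then decode the resulting bytes as UTF-8.
    raw = re.sub(
        rb"\\([0-3][0-7]{2})",
        lambda m: bytes([int(m.group(1), 8)]),
        encoded.encode("ascii"),
    )
    return raw.decode("utf-8")

print(encode("adf\u00cdadf"))        # adf\303\215adf
print(decode(r"adf\303\215adf"))     # adfÍadf
```

Using re.sub with a callback handles all the escapes in one pass instead of one str.replace per match, and limiting the first digit to 0-3 keeps every escape within a single byte.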


