Translating unicode data

Tue Mar 24 12:06:07 EDT 2009

On Mar 23, 4:16 pm, Peter Otten <__pete... at web.de> wrote:
> CaptainMcCrank wrote:
> > I'm struggling with a problem analyzing large amounts of unicode data
> > in an http wireshark capture.
> > I've solved the problem with the interpreter, but I'm not sure how to
> > do this in an automated fashion.
>
> > I'd like to grab a line from a text file & translate the unicode
> > sections of it to ascii.  So, for example
> > I'd like to take
> > "\u003cb\u003eMar 17\u003c/b\u003e"
>
> > and turn it into
>
> > "<b>Mar 17</b>"
>
> > I can handle this from the interpreter as follows:
>
> > >>> import unicodedata
> > >>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
> > >>> print mystring
> > <b>Mar 17</b>
>
> > But I don't know what I need to do to automate this!  The data that is
> > in the quotes from line 2 will have to come from a variable.  I am
> > unable to figure out how to do this using a variable rather than a
> > literal string.
>
> If wireshark uses the same escape codes as python you can use str.decode()
> or open the file with codecs.open():
>
> >>> s = "\u003cb\u003eMar 17\u003c/b\u003e"
> >>> s
> '\\u003cb\\u003eMar 17\\u003c/b\\u003e'
> >>> s.decode("unicode-escape")
> u'<b>Mar 17</b>'
>
> >>> open("tmp.txt", "w").write(s)
> >>> import codecs
> >>> f = codecs.open("tmp.txt", "r", encoding="unicode-escape")
> >>> f.read()
> u'<b>Mar 17</b>'
>
> Peter

This is a workable solution!  Thank you Peter!
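
For anyone reading this later on Python 3, where str has no .decode()
method, here is a minimal sketch of the same idea applied to a variable
rather than a literal. It assumes the capture text uses Python-style
\uXXXX escapes, as Peter's answer does. The function name decode_escapes
is made up for illustration and is not part of any library. Note that the
unicode_escape codec also interprets other backslash sequences (such as a
literal backslash followed by n), which may or may not be what you want
for wireshark output.

```python
def decode_escapes(text):
    """Turn literal \\uXXXX sequences in text into the characters they name.

    Assumption: text uses Python-style escapes. Any non-ASCII characters
    already present are converted to escapes first (backslashreplace) so
    they survive the round trip through the unicode_escape codec.
    """
    raw = text.encode("ascii", "backslashreplace")
    return raw.decode("unicode_escape")


# The value comes from a variable here, e.g. a line read from a file.
line = "\\u003cb\\u003eMar 17\\u003c/b\\u003e"
print(decode_escapes(line))
```

To process a whole capture, the same function can be called on each line
while iterating over the open file.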