Translating unicode data
Peter Otten
__peter__ at web.de
Mon Mar 23 19:16:04 EDT 2009
CaptainMcCrank wrote:
> I'm struggling with a problem analyzing large amounts of unicode data
> in an http wireshark capture.
> I've solved the problem with the interpreter, but I'm not sure how to
> do this in an automated fashion.
>
> I'd like to grab a line from a text file & translate the unicode
> sections of it to ascii. So, for example
> I'd like to take
> "\u003cb\u003eMar 17\u003c/b\u003e"
>
> and turn it into
>
> "<b>Mar 17</b>"
>
> I can handle this from the interpreter as follows:
>
> >>> import unicodedata
> >>> mystring = u"\u003cb\u003eMar 17\u003c/b\u003e"
> >>> print mystring
> <b>Mar 17</b>
> >>>
>
> But I don't know what I need to do to automate this! The data that is
> in the quotes from line 2 will have to come from a variable. I am
> unable to figure out how to do this using a variable rather than a
> literal string.
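If Wireshark uses the same escape codes as Python, you can decode the string
with str.decode("unicode-escape") or open the file with codecs.open() and that
encoding. In a plain (non-unicode) string literal, \u is not an escape, so s
below holds the same literal backslash sequences you would read from the file: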
>>> s = "\u003cb\u003eMar 17\u003c/b\u003e"
>>> s
'\\u003cb\\u003eMar 17\\u003c/b\\u003e'
>>> s.decode("unicode-escape")
u'<b>Mar 17</b>'
>>> open("tmp.txt", "w").write(s)
>>> import codecs
>>> f = codecs.open("tmp.txt", "r", encoding="unicode-escape")
>>> f.read()
u'<b>Mar 17</b>'
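The session above is Python 2. On Python 3, where str has no decode()
method, a rough equivalent might look like the sketch below. The helper names
decode_escapes and decode_file are made up for illustration, and the capture
text is assumed to be UTF-8 with well-formed escapes. The unicode_escape codec
interprets every backslash escape (\n, \x41 and so on), not just \uXXXX, and
raises UnicodeDecodeError on a malformed one.

```python
def decode_escapes(text):
    """Turn literal \\uXXXX sequences in a str into the characters they name."""
    # Python 3 str has no .decode(), so go through bytes first.
    # backslashreplace rewrites any character outside Latin-1 as an escape,
    # which the unicode_escape codec then turns back into that character.
    return text.encode("latin-1", "backslashreplace").decode("unicode_escape")


def decode_file(path):
    """Yield each line of the file at `path` with its escapes decoded.

    Assumption: the file is UTF-8 text, one record per line.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            yield decode_escapes(line.rstrip("\n"))


if __name__ == "__main__":
    # A raw literal stands in for a line read from the capture file.
    line = r"\u003cb\u003eMar 17\u003c/b\u003e"
    print(decode_escapes(line))
```

Here the data comes from a variable rather than a literal, so the same call
works on each line as it is read, e.g. "for text in decode_file(path): ...".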
Peter