[BangPypers] Fwd: Handling unicode characters in xml.dom

Anand Balachandran Pillai abpillai at gmail.com
Thu Mar 20 07:29:51 CET 2008


There seems to be quite a bit of confusion when it comes to Python
and encodings. The following PEP discusses Python and Unicode
and gives some insights.

http://www.python.org/dev/peps/pep-0100/

With py3k this confusion should reduce considerably, since it unifies
the str and unicode types into a single Unicode str type and uses
a separate type, "bytes", for any encoded (binary) data.

http://docs.python.org/dev/3.0/whatsnew/3.0.html
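
To make the str/bytes split concrete, here is a minimal sketch in Python 3
syntax. The sample string and the variable names are only assumptions for
illustration.

```python
# Python 3: text lives in str (Unicode), encoded data lives in bytes.
text = "caf\u00e9"                # str with four code points
data = text.encode("utf-8")      # bytes; the accented letter takes two bytes
roundtrip = data.decode("utf-8")  # back to the same str

# Mixing the two types is an explicit error, not a silent conversion.
try:
    text + data
    mixed_ok = True
except TypeError:
    mixed_ok = False

print(len(text), len(data), roundtrip == text, mixed_ok)
```

Python 2 would instead attempt an implicit ascii conversion at the
"text + data" step, which is where much of the confusion comes from.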

--Anand

On Thu, Mar 20, 2008 at 10:57 AM, Gurpreet Sachdeva
<gurpreet.sachdeva at gmail.com> wrote:
> Thanks Anand for your help. Forwarding your post to the group.
>
> Regards,
> Gurpreet Singh
>
>
>
> ---------- Forwarded message ----------
> From: Anand Balachandran Pillai <abpillai at gmail.com>
>  Date: Wed, Mar 19, 2008 at 11:48 PM
> Subject: Re: [BangPypers] Handling unicode characters in xml.dom
> To: Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com>
>
>
> Hi Gurpreet,
>
>      The problem is that you have some junk characters in the file
>  (probably bytes of Japanese text, since the original file seems to be
>  Japanese), which show up as Ctrl characters when read as ascii. When the
>  parser tries to parse the file, it hits the first Ctrl character (^S),
>  which XML 1.0 does not allow in a document, and produces a
>  "not well-formed (invalid token)" error.
>
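
To see that error in isolation, here is a small sketch in Python 3 syntax.
The sample document and the helper name try_parse are made up for the demo;
only the raw Ctrl-S byte (0x13) is taken from the description above.

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Hypothetical sample: a tiny document with a raw Ctrl-S (0x13) byte in its text.
bad_xml = b'<?xml version="1.0" encoding="UTF-8"?><doc>abc\x13def</doc>'
good_xml = b'<?xml version="1.0" encoding="UTF-8"?><doc>abcdef</doc>'

def try_parse(data):
    """Return None if data parses, otherwise the parser's error message."""
    try:
        minidom.parseString(data)
        return None
    except ExpatError as exc:
        return str(exc)

bad_error = try_parse(bad_xml)
good_error = try_parse(good_xml)
print(bad_error)
print(good_error)
```

The first call should report a "not well-formed (invalid token)" message,
while the same document without the control byte parses cleanly.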
>    The way to solve this is to decode the file and encode it again in an
>  encoding other than ascii. I tried iso-8859-1 decoding and unicode-escape
>  encoding, and it works. For this you need the codecs module, since
>  default file objects in Python only write byte strings and fall back to
>  ascii when handed unicode text.
>
>  Here is the full code...
>  ---------------------------------------------
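>  import codecs
>  import xml.dom.minidom as mdom
>
>  # Read the raw bytes; binary mode avoids any implicit decoding.
>  with open('problem.xml', 'rb') as src:
>      data = src.read()
>
>  # Decode the bytes as iso-8859-1 and write them out unicode-escaped.
>  f = open('problem2.xml', 'wb')
>  e = codecs.EncodedFile(f, 'iso-8859-1', 'unicode-escape')
>  e.write(data)
>  e.close()
>
>  # The escaping turned each CRLF into the literal text \r\n; undo that.
>  with open('problem2.xml', 'rb') as f:
>      data = f.read()
>  data = b'\n'.join(data.split(b'\\r\\n'))
>  with open('problem2.xml', 'wb') as f:
>      f.write(data)
>
>  print(mdom.parse('problem2.xml'))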
>  --------------------------------------------------
>
>  The unicode-escape encoding rewrites the problem characters as hex
>  escapes (such as \x13), but it also escapes each CRLF line ending to the
>  literal text "\r\n". So we replace these with a real "\n" again by
>  splitting the data and joining it.
>
>  The modified file is saved in problem2.xml .
>
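
The same decode/encode round trip can also be sketched in memory, without
the intermediate file. This is Python 3 syntax; the function name
escape_for_parser and the sample bytes are hypothetical, and the sketch
assumes CRLF line endings as in the file discussed above.

```python
from xml.dom import minidom

def escape_for_parser(raw):
    """Decode bytes as iso-8859-1, re-encode with unicode-escape,
    then turn the escaped CRLF sequences back into real newlines."""
    escaped = raw.decode("iso-8859-1").encode("unicode-escape")
    return escaped.replace(b"\\r\\n", b"\n")

# Hypothetical sample with a Ctrl-S byte and one non-ASCII byte in the text.
raw = (b'<?xml version="1.0" encoding="UTF-8"?>\r\n'
       b'<doc>abc\x13\xe3def</doc>\r\n')

clean = escape_for_parser(raw)
doc = minidom.parseString(clean)
text = doc.documentElement.firstChild.data
print(text)
```

Note that the parsed text then holds the escape sequences as literal
characters (backslash, x, 1, 3 and so on), not the original characters, so
this gets the file through the parser rather than recovering the Japanese
text.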
>  Btw, can you forward this to the list? I am on a slow connection, so I am
>  using the html interface to gmail, and address completion is missing.
>
>  HTH,
>
>  --Anand
>
>
>
>  On 3/19/08, Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com> wrote:
>  > Hi Anand,
>  >
>  > Please find attached the xml file that contains the garbage characters.
>  > Is there a way we can handle them?
>  >
>  > Thanks for your help.
>  > Gurpreet
>  >
>  > On Tue, Mar 18, 2008 at 1:22 PM, Anand Balachandran Pillai <
>  > abpillai at gmail.com> wrote:
>  >
>  > > Is the garbage character data or attribute data ?
>  > >
>  > > Character data is like <elem>text</elem> and attribute
>  > > data is like <elem attr="value" />
>  > >
>  > > Can you paste the relevant part of the XML file here or if it is
>  > > small enough, the complete XML file ? Send it directly to me
>  > > since the list removes attachments.
>  > >
>  > > --Anand
>  > >
>  > > On Tue, Mar 18, 2008 at 11:05 AM, Gurpreet Sachdeva
>  > > <gurpreet.sachdeva at gmail.com> wrote:
>  > > > <?xml version="1.0" encoding="UTF-8"?>
>  > > >
>  > > > Still the problem exists.
>  > > >
>  > > > - Gurpreet
>  > > >
>  > > >
>  > > >
>  > > > On Tue, Mar 18, 2008 at 10:44 AM, Anand Balachandran Pillai
>  > > > <abpillai at gmail.com> wrote:
>  > > >
>  > > > > What is the encoding of your XML file ? i.e. in the
>  > > > > string '<?xml version="1.0" encoding="<encoding>"?>',
>  > > > > what is <encoding> ?
>  > > > >
>  > > > > Make sure it is an encoding like utf-8 or iso-8859-1
>  > > > > that matches the file, which can help the parser
>  > > > > decode the garbage chars.
>  > > > >
>  > > > > --Anand
>  > > > >
>  > > > >
>  > > > >
>  > > > >
>  > > > >
>  > > > > On Tue, Mar 18, 2008 at 10:38 AM, Gurpreet Sachdeva
>  > > > > <gurpreet.sachdeva at gmail.com> wrote:
>  > > > > > Hi,
>  > > > > >
>  > > > > > Any idea how to handle the unicode characters in an xml file
>  > > > > > while parsing it?
>  > > > > >
>  > > > > > This is what I am doing:
>  > > > > >
>  > > > > > from xml.dom import minidom
>  > > > > >
>  > > > > > xmlObj = minidom.parse(fileobj)
>  > > > > >
>  > > > > > And the script throws an error because of some special characters
>  > > > > > ['f (3gpÕ¡¤ë'] present in the xml file. Any suggestions/pointers
>  > > > > > would be appreciated.
>  > > > > >
>  > > > > > Thanks and Regards,
>  > > > > > Gurpreet Singh
>  > > > > > _______________________________________________
>  > > > > > BangPypers mailing list
>  > > > > > BangPypers at python.org
>  > > > > > http://mail.python.org/mailman/listinfo/bangpypers
>  > > > > >
>  > > > > >
>  > > > >
>  > > > >
>  > > > >
>  > > > > --
>  > > > > -Anand
>  > > > >
>  > > >
>  > > >
>  > > >
>  > > > --
>  > > > Thanks and Regards,
>  > > > Gurpreet Singh
>  > > >
>  > > >
>  > >
>  > >
>  > >
>  > > --
>  > > -Anand
>  > >
>  >
>  >
>  >
>  > --
>  > Thanks and Regards,
>  > Gurpreet Singh
>  >
>  --
>  -Anand
>
>
>
> --
>
> Thanks and Regards,
> Gurpreet Singh
>
>



-- 
-Anand

