[BangPypers] Fwd: Handling unicode characters in xml.dom
Anand Balachandran Pillai
abpillai at gmail.com
Thu Mar 20 07:29:51 CET 2008
There seems to be quite a bit of confusion when it comes to Python
and encodings. The following PEP discusses Python and Unicode
and gives some insights.
http://www.python.org/dev/peps/pep-0100/
With py3k this confusion should reduce considerably, since it unifies
the str and unicode types into a single Unicode str type and uses
a separate type, "bytes", for any encoded (binary) data.
http://docs.python.org/dev/3.0/whatsnew/3.0.html
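As a small illustration of that split, here is a minimal sketch for
Python 3. The sample string is made up for the example and is not taken
from the file discussed in this thread.

```python
# Python 3: text (str) and encoded data (bytes) are distinct types.
text = "caf\u00e9"                # str: a sequence of Unicode code points
encoded = text.encode("utf-8")    # bytes: the UTF-8 encoding of that text

print(type(text).__name__, len(text))        # 4 characters
print(type(encoded).__name__, len(encoded))  # 5 bytes, the accented letter takes two

# Going back to text requires naming the encoding explicitly.
assert encoded.decode("utf-8") == text

# Mixing the two raises TypeError instead of silently guessing an encoding.
mixing_error = None
try:
    text + encoded
except TypeError as exc:
    mixing_error = exc
    print("cannot mix str and bytes:", exc)
```

The point is that every conversion between text and bytes names its
encoding, so there is no implicit ascii step to trip over.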
--Anand
On Thu, Mar 20, 2008 at 10:57 AM, Gurpreet Sachdeva
<gurpreet.sachdeva at gmail.com> wrote:
> Thanks Anand for your help. Forwarding your post to the group.
>
> Regards,
> Gurpreet Singh
>
>
>
> ---------- Forwarded message ----------
> From: Anand Balachandran Pillai <abpillai at gmail.com>
> Date: Wed, Mar 19, 2008 at 11:48 PM
> Subject: Re: [BangPypers] Handling unicode characters in xml.dom
> To: Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com>
>
>
> Hi Gurpreet,
>
> The problem is that you have some junk characters in the file (mostly
> Japanese unicode, since the original file seems to be Japanese), which
> appear as Ctrl characters when the file is read as ascii. When the parser
> tries to parse the file, it hits the first Ctrl character (^S), which is
> not a legal character in XML 1.0, and produces a "not well-formed token"
> error.
>
> The way to solve this is to decode the file and encode it again in an
> encoding other than ascii. I tried iso-8859-1 decoding and unicode-escape
> encoding and it works. For this you need the codecs module, since a plain
> file object in Python 2 does no transcoding of its own and simply writes
> the bytes it is given.
>
> Here is the full code...
> ---------------------------------------------
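> import codecs
> import xml.dom.minidom as mdom
>
> # Read the original file as a raw byte string (Python 2).
> data = open('problem.xml').read()
> f = open('problem2.xml', 'w')
>
> # Data written through e is decoded as iso-8859-1 and stored in the
> # file using the unicode-escape encoding.
> e = codecs.EncodedFile(f, 'iso-8859-1', 'unicode-escape')
> e.write(data)
> e.close()
>
> # Turn the escaped line endings back into real newlines.
> data = open('problem2.xml').read()
> data = '\n'.join(data.split("\\r\\n"))
> out = open('problem2.xml', 'w')
> out.write(data)
> out.close()
>
> print mdom.parse('problem2.xml')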
> --------------------------------------------------
>
> The unicode-escape encoding converts the control and non-ascii characters
> to backslash escapes built from their hex values (such as \x13), but it
> also escapes line endings to the literal text "\r\n". So we replace
> these escapes with a real "\n" by splitting the data and joining it again.
>
> The modified file is saved in problem2.xml.
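For comparison, a rough Python 3 sketch of the same idea. The sample bytes
are a made-up stand-in for problem.xml, and escape_char is a hypothetical
helper. It uses a regular expression instead of the unicode-escape codec,
which is a swapped-in technique chosen so that line endings are left alone
and need no repair afterwards.

```python
import re
import xml.dom.minidom as mdom

# Hypothetical sample standing in for problem.xml:
# Latin-1 range bytes plus a stray ^S (0x13) inside the element text.
raw = b'<?xml version="1.0" encoding="UTF-8"?>\r\n<doc>3gp\x13\xd5\xa1</doc>\r\n'

# iso-8859-1 maps every byte to exactly one code point, so this cannot fail.
text = raw.decode("iso-8859-1")

def escape_char(match):
    # Write the character as a visible backslash escape such as \x13.
    return "\\x%02x" % ord(match.group())

# Escape control characters (except tab, LF, CR), DEL and everything non-ascii.
cleaned = re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\xff]", escape_char, text)

# What is left is pure ascii, which is also valid under the declared UTF-8.
doc = mdom.parseString(cleaned.encode("ascii"))
print(doc.documentElement.firstChild.data)
```

Like the original script, this keeps the junk characters visible as escapes
rather than recovering the intended Japanese text. Recovering the text would
need the file's real encoding, which is not known here.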
>
> Btw, can you forward this to the list? I am on a slow connection, hence
> using the html interface to gmail, and hence address completion is missing.
>
> HTH,
>
> --Anand
>
>
>
> On 3/19/08, Gurpreet Sachdeva <gurpreet.sachdeva at gmail.com> wrote:
> > Hi Anand,
> >
> > Please find attached the xml file that contains the garbage characters.
> > Is there a way we can handle them?
> >
> > Thanks for your help.
> > Gurpreet
> >
> > On Tue, Mar 18, 2008 at 1:22 PM, Anand Balachandran Pillai <
> > abpillai at gmail.com> wrote:
> >
> > > Is the garbage CDATA or attribute data ?
> > >
> > > CDATA is like <elem>text</elem> and attribute
> > > is <elem attr="value" />
> > >
> > > > Can you paste the relevant part of the XML file here or, if it is
> > > > small enough, the complete XML file? Send it directly to me
> > > > since the list removes attachments.
> > >
> > > --Anand
> > >
> > > On Tue, Mar 18, 2008 at 11:05 AM, Gurpreet Sachdeva
> > > <gurpreet.sachdeva at gmail.com> wrote:
> > > > <?xml version="1.0" encoding="UTF-8"?>
> > > >
> > > > Still the problem exists.
> > > >
> > > > - Gurpreet
> > > >
> > > >
> > > >
> > > > On Tue, Mar 18, 2008 at 10:44 AM, Anand Balachandran Pillai
> > > > <abpillai at gmail.com> wrote:
> > > >
> > > > > > What is the encoding of your XML file? i.e. in the
> > > > > > string '<?xml version="1.0" encoding="<encoding>"?>',
> > > > > > what is <encoding>?
> > > > > >
> > > > > > Make sure it is an encoding like utf-8 or iso-8859-1
> > > > > > that matches the file, so the parser can decode the
> > > > > > garbage chars.
> > > > >
> > > > > --Anand
> > > > >
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > On Tue, Mar 18, 2008 at 10:38 AM, Gurpreet Sachdeva
> > > > > <gurpreet.sachdeva at gmail.com> wrote:
> > > > > > Hi,
> > > > > >
> > > > > > > Any idea how to handle the unicode characters existing in an xml
> > > > > > > file while parsing it.
> > > > > >
> > > > > > This is what I am doing:
> > > > > >
> > > > > > from xml.dom import minidom
> > > > > >
> > > > > > xmlObj = minidom.parse(fileobj)
> > > > > >
> > > > > > > And the script throws an error because of some special characters
> > > > > > > ['f (3gpÕ¡¤ë'] present in the xml file. Any suggestion/pointers
> > > > > > > would be appreciated
> > > > > >
> > > > > > Thanks and Regards,
> > > > > > Gurpreet Singh
> > > > > > _______________________________________________
> > > > > > BangPypers mailing list
> > > > > > BangPypers at python.org
> > > > > > http://mail.python.org/mailman/listinfo/bangpypers
> > > > > >
> > > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > -Anand
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > Thanks and Regards,
> > > > Gurpreet Singh
> > > >
> > > >
> > >
> > >
> > >
> > > --
> > > -Anand
> > >
> >
> >
> >
> > --
> > Thanks and Regards,
> > Gurpreet Singh
> >
> --
> -Anand
>
>
>
> --
>
> Thanks and Regards,
> Gurpreet Singh
>
>
--
-Anand