python & xml question
jnana4 at DELETEhotmailCAPS.com
Sun Aug 4 00:07:16 CEST 2002
"Martin v. Loewis" <martin at v.loewis.de> wrote in message
news:m3u1mbtz9h.fsf at mira.informatik.hu-berlin.de...
> "jano" <jnana4 at DELETEhotmailCAPS.com> writes:
> > > Do you have a DOCTYPE declaration in the document? That might be the
> > > easiest approach: add a DOCTYPE that declares mdash; the parser should
> > > then replace it automatically.
> > Are you asking if there is an associated DTD? There is, and it does
> > declare the mdash entity and what it should be replaced with, like so:
> > <!ENTITY mdash "&#151;">
> I'm really asking whether this declaration is in the internal or in
> the external DTD subset.
The declaration is in the external DTD subset.
> However, I'm also surprised that you declare mdash as &#151;: This
> character is a control character, END OF GUARDED AREA (EPA), and
> I don't know why you would associate that with the name mdash...
> That your operating system uses byte 151 to represent EM DASH in a
> certain code page is irrelevant for XML, XML is based on Unicode, not
> code page 1252.
I used &#151; because the XML is destined to be HTML, and #151, as far as I
know, is the only representation that works in all browsers. I see now
though that I should be using the Unicode representation and translating for
a browser at some later point, if necessary.
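To make the distinction concrete, here is a minimal sketch. It assumes a
current Python 3 interpreter rather than the Python 2 used in this thread, and
the variable names are only illustrative. It shows that the number 151 means
different things as a Unicode code point and as a cp1252 byte:

```python
import unicodedata

# The character reference &#151; names Unicode code point 151 (U+0097),
# which is a C1 control character, not a dash.
ref_151 = chr(151)
print(hex(ord(ref_151)), unicodedata.category(ref_151))

# Byte 151 (0x97) only becomes an em dash when decoded as code page 1252.
from_cp1252 = bytes([151]).decode("cp1252")
print(hex(ord(from_cp1252)), unicodedata.name(from_cp1252))

# The Unicode code point for EM DASH is 8212 (U+2014), so the XML
# character reference should be &#8212;.
em_dash = chr(8212)
print(from_cp1252 == em_dash)
```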
> > File "quoteHandler.py", line 17, in characters
> > print characters
> > UnicodeError: ASCII encoding error: ordinal not in range(128)
> > Is this saying that &#8212; is outside the UTF-8 range?
> No. 8212 *is* the Unicode number for EM DASH. The error message just
> means that you are trying to convert a Unicode string into ASCII (as a
> side effect of the print statement), and that ASCII does not support
> the EM DASH. Try
> print characters.encode("cp1252")
> instead, if your terminal uses that character set.
Great. This works now. I am just trying to parse an existing XML file that
I have and print it to the console, which I thought would be pretty simple,
but I should have done more preparatory work first. Anyway, it works now.
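For anyone reading this later, the encoding step can be sketched as follows.
This assumes Python 3, where the failure is raised as UnicodeEncodeError (a
subclass of UnicodeError) and print() normally encodes for the terminal by
itself. The sample string is made up for illustration:

```python
# A string containing EM DASH (U+2014), as the parser would deliver it.
text = "before\u2014after"

# ASCII has no EM DASH, so encoding to ASCII fails.
try:
    text.encode("ascii")
    ascii_error = None
except UnicodeEncodeError as exc:
    ascii_error = exc
print(ascii_error)

# cp1252 represents EM DASH as the single byte 0x97 (decimal 151).
cp1252_bytes = text.encode("cp1252")

# UTF-8 represents it as three bytes.
utf8_bytes = text.encode("utf-8")

# If the target really must be ASCII, an error handler avoids the exception.
fallback = text.encode("ascii", errors="replace")

print(cp1252_bytes, utf8_bytes, fallback)
```

Which encoding is right depends on what the terminal expects, so cp1252 here
is just the one suggested above, not a general recommendation.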
> > Ah, I'm using PyXML 0.6.5 under Cygwin, because I couldn't get the later
> > versions to work under Cygwin. Could this be a source of my problems?
> I'd say there are several problems at work. The traceback you report says
> so I would say that you are *not* using PyXML at all (first problem).
> With that version, you will have problems processing entity references
> in the SAX application, unless they are in the internal subset (second
> problem).
I will move the DTD into the instance, for now. I thought that I was using
PyXML (I thought the expatreader was called from something in PyXML), but I
am pretty new to Python and XML and, as you can see, I am quite confused.
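A rough sketch of what moving the declaration into the instance looks like,
assuming Python 3 and the standard library's xml.sax (not PyXML). The <quote>
element, the sample text, and the TextCollector class are invented for the
example. The entity is declared in the internal subset with the Unicode
number for EM DASH, and the parser expands &mdash; before calling
characters():

```python
import xml.sax

# The entity is declared in the internal DTD subset, inside the document.
DOCUMENT = b"""<?xml version="1.0"?>
<!DOCTYPE quote [
  <!ENTITY mdash "&#8212;">
]>
<quote>Wait&mdash;what?</quote>
"""


class TextCollector(xml.sax.ContentHandler):
    """Collects character data; the parser may deliver it in several chunks."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def characters(self, content):
        self.chunks.append(content)


handler = TextCollector()
xml.sax.parseString(DOCUMENT, handler)
text = "".join(handler.chunks)

# Encode explicitly so the result is visible on any terminal.
print(text.encode("utf-8"))
```

Collecting the chunks and joining them at the end avoids relying on
characters() being called exactly once per text node.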
> You seem to have a misunderstanding of how character references work
> in XML, and how they are (not) related to your operating system's
> encoding (third problem).
I was using &#151; not because of my operating system's encoding, but
because I was foolishly encoding special characters in the way that I
thought they would ultimately end up in a browser.
Anyway, thanks a million. Your help has been invaluable. What I have now
is working, and I see several areas that I need to do some research on (like
character encodings and Unicode, etc.).