Bug in htmlentitydefs.py with Python 3.0?

André andre.roberge at gmail.com
Wed Dec 26 20:14:56 EST 2007


On Dec 26, 8:53 pm, "Martin v. Löwis" <mar... at v.loewis.de> wrote:
> > Without an additional parser, I was getting the following error
> > message:
> [...]
> > xml.parsers.expat.ExpatError: undefined entity é: line 401, column 11
>
> To understand that problem better, it would have been helpful to see
> what line 401, column 11 of the input file actually says. AFAICT,
> it must have been something like "&é;" which would be really puzzling
> to have in an XML file (usually, people restrict themselves to ASCII
> for entity names).


No, that one was &eacute; (testing with my own name, which appeared in
a file).


>
> >             for entity in ent:
> >                 if entity not in parser.entity:
> >                     parser.entity[entity] = ent[entity]
>
> This looks fine to me.
>
> > The output was "wrong".  For example, one of the tests I used was to
> > process a copy of the main dict of htmlentitydefs.py inside an html page.  A
> > few of the characters came out OK, but I got things like:
>
> > '&#913;':    0x0391, # greek capital letter alpha, U+0391
>
> Why do you think this is wrong?

Sorry, that was just cut-and-pasted from the browser (not the source);
in the source of the processed html page, it is
'&amp;#913;':    0x0391, # greek capital letter alpha, U+0391

i.e. the "&" was transformed into "&amp;" in a number of places (all
places above ASCII 127, I believe).
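
To illustrate the difference between what the browser shows and what the source contains, here is a small sketch using the html module from a released Python 3 (not the code from my program): escaping a character reference a second time turns its "&" into "&amp;", so a browser then displays the reference as literal text instead of the letter.

```python
import html

reference = "&#913;"              # numeric character reference for U+0391
escaped = html.escape(reference)  # the "&" is escaped a second time

print(escaped)                    # the page source after double escaping
print(html.unescape(reference))   # what a browser renders for the reference
print(html.unescape(escaped))     # what a browser renders after double escaping
```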


Here are a few more lines extracted from the html file that was
processed:
=============
    'Â':    0x00c2, # latin capital letter A with circumflex, U+00C2 ISOlat1
    'À':   0x00c0, # latin capital letter A with grave = latin capital letter A grave, U+00C0 ISOlat1
    '&amp;#913;':    0x0391, # greek capital letter alpha, U+0391
    'Å':    0x00c5, # latin capital letter A with ring above = latin capital letter A ring, U+00C5 ISOlat1
    'Ã':   0x00c3, # latin capital letter A with tilde, U+00C3 ISOlat1
    'Ä':     0x00c4, # latin capital letter A with diaeresis, U+00C4 ISOlat1
    '&amp;#914;':     0x0392, # greek capital letter beta, U+0392
    'Ç':   0x00c7, # latin capital letter C with cedilla, U+00C7 ISOlat1
    '&amp;#935;':      0x03a7, # greek capital letter chi, U+03A7
    '&amp;#8225;':   0x2021, # double dagger, U+2021 ISOpub
    '&amp;#916;':    0x0394, # greek capital letter delta, U+0394 ISOgrk3
  ============



>
> > When using my modified version, I got the following (which may not be
> > transmitted properly by email...)
> >     'Α':    0x0391, # greek capital letter alpha, U+0391
>
> > It does look like a Greek capital letter alpha here.
>
> Sure, however, your first version ALSO has the Greek capital letter
> alpha there; it is just spelled as &#913; (which *is* a valid spelling
> for that letter in XML).

Agreed that it would be... However that was not how it was
transformed, see above; sorry if I was not clear about what was
happening  (I should not have cut-and-pasted from the browser window).
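
One way such double escaping can arise, sketched here under an assumption: that the replacement text handed to the tree for an entity is itself a string like "&#913;" rather than the character. The table values actually used are not shown in this thread, and the helper name serialize is made up. ElementTree treats element text as plain characters, so it escapes the "&" when writing the page back out.

```python
import xml.etree.ElementTree as ET


def serialize(text, encoding="unicode"):
    """Serialize a <p> element whose text content is the given string."""
    element = ET.Element("p")
    element.text = text
    return ET.tostring(element, encoding=encoding)


# Assumption: the entity was replaced by a reference *string*.
as_reference_string = serialize("&#913;")
# The entity was replaced by the character itself.
as_character = serialize("\u0391")
# With an ASCII output encoding, the serializer emits the reference itself.
as_ascii_bytes = serialize("\u0391", encoding="us-ascii")

print(as_reference_string)
print(as_character)
print(as_ascii_bytes)
```

In the first case the output contains "&amp;#913;", which matches the lines extracted above; in the other two the letter survives, either as itself or as a single character reference.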

>
> > I hope the above is of some help.
>
> Thanks; I now think that htmlentitydefs is just as fine as it always
> was - I don't see any problem here.
>

You may well be right that the problem lies elsewhere.  But since
making the change I mentioned "fixed" the problem at my end, I figured
this was where the problem was located, and thought I should at least
report it here.

Regards,
André

> Regards,
> Martin



