Unexpected behaviour with HTMLParser...

Just Another Victim of the Ambient Morality ihatespam at hotmail.com
Tue Oct 9 18:10:59 EDT 2007


"Diez B. Roggisch" <deets at nospam.web.de> wrote in message 
news:5n2avjFfh6h8U1 at mid.uni-berlin.de...
> Just Another Victim of the Ambient Morality schrieb:
>>     HTMLParser is behaving in, what I find to be, strange ways and I 
>> would like to better understand what it is doing and why.
>>
>>     First, it doesn't appear to translate HTML escape characters.  I 
>> don't know the actual terminology but things like &amp; don't get 
>> translated into & as one would like.  Furthermore, not only does 
>> HTMLParser not translate it properly, it seems to omit it altogether! 
>> This prevents me from even doing the translation myself, so I can't even 
>> work around the issue.
>>     Why is it doing this?  Is there some mode I need to set?  Can anyone 
>> else duplicate this behaviour?  Is it a bug?
>
> Without code, that's hard to determine. But you are aware of e.g.
>
> handle_entityref(name)
> handle_charref(ref)
>
> ?

    Actually, I am not aware of these methods but I will certainly look into 
them!
    I was hoping that the issue would be known or simple before I committed 
to posting code, something that is, to my chagrin, not easily done with my 
news client...
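
    For reference, here is a minimal sketch of how those two callbacks 
behave.  It assumes a current Python 3, where the class lives in 
html.parser and takes a convert_charrefs argument; turning that off 
approximates the older behaviour discussed in this thread.  EventLogger and 
the sample markup are made up for illustration.

```python
from html.parser import HTMLParser


class EventLogger(HTMLParser):
    """Hypothetical helper: records each callback so the order is visible."""

    def __init__(self):
        # Assumption: Python 3.4+. With convert_charrefs=False, references
        # are reported through their own callbacks, not through handle_data().
        super().__init__(convert_charrefs=False)
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("start", tag))

    def handle_endtag(self, tag):
        self.events.append(("end", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_entityref(self, name):
        # A named reference such as &amp; arrives here as "amp".
        self.events.append(("entityref", name))

    def handle_charref(self, name):
        # A numeric reference such as &#62; arrives here as "62".
        self.events.append(("charref", name))


logger = EventLogger()
logger.feed("<p>fish &amp; chips &#62; salad</p>")
logger.close()
for event in logger.events:
    print(event)
```

    In this mode the references are not dropped; they simply never pass 
through handle_data(), which is why a subclass that only overrides 
handle_data() appears to lose them.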


>>     Secondly, HTMLParser often calls handle_data() consecutively, without 
>> any calls to handle_starttag() in between.  I did not expect this.  In 
>> HTML, you either have text or you have tags.  Why split up my text into 
>> successive handle_data() calls?  This makes no sense to me.  At the very 
>> least, it does this in response to text with &amp; like escape sequences 
>> (or whatever they're called), so that it may successively avoid those 
>> translations.
>
> That's the way XML/HTML is defined - there is no guarantee that you get 
> text as whole. If you must, you can collect the snippets yourself, and on 
> the next end-tag deliver them as whole.

    I think there's some miscommunication, here.
    You can't mean "That's the way XML/HTML is defined" because those format 
specifications say nothing about how the format must be parsed.  As far as I 
can tell, you either meant to say that that's the way HTMLParser is 
specified or you're referring to how text in XML/HTML can be broken up by 
tags, in which case I've already addressed that in my post.  I expected to 
see handle_starttag() calls in between calls to handle_data().
    Unless I'm missing something, it simply makes no sense to break up 
contiguous text into multiple handle_data() calls...
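
    The "collect the snippets yourself" suggestion quoted above could look 
roughly like the following.  This is only a sketch under the same 
assumption of a current Python 3 with convert_charrefs=False; TextCollector 
and its attributes are hypothetical names, and html.unescape() is used here 
as one convenient way to translate a single reference.

```python
from html import unescape
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Hypothetical helper: joins text fragments between tags into runs."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self._chunks = []  # fragments of the text run currently being built
        self.runs = []     # completed runs of contiguous text

    def _flush(self):
        # Deliver whatever has been collected since the last tag.
        if self._chunks:
            self.runs.append("".join(self._chunks))
            self._chunks = []

    def handle_starttag(self, tag, attrs):
        self._flush()

    def handle_endtag(self, tag):
        self._flush()

    def handle_data(self, data):
        self._chunks.append(data)

    def handle_entityref(self, name):
        # Rebuild the reference and translate it, e.g. "amp" -> "&".
        self._chunks.append(unescape("&" + name + ";"))

    def handle_charref(self, name):
        # Works for decimal ("60") and hexadecimal ("x3C") forms alike.
        self._chunks.append(unescape("&#" + name + ";"))

    def close(self):
        super().close()
        self._flush()  # pick up any text after the last tag


collector = TextCollector()
collector.feed("<p>fish &amp; chips</p><p>1 &#60; 2</p>")
collector.close()
print(collector.runs)
```

    The buffer here only ever holds the text between two tags, so it is 
bounded by the longest run of text in the document.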


>>     Again, why is it doing this?  Is there some mode I need to set?  Can 
>> anyone else duplicate this behaviour?  Is it a bug?
>
> No. It's the way it is, because it would require buffering with unlimited 
> capacity to ensure this property.

    It depends on what you mean by "unlimited capacity."  Is it so bad to 
buffer with as much memory as you have? ...or, at least, have a setting for 
such operation?  Moreover, you know that you'll never have to buffer more 
than there is HTML, so you hardly need "unlimited capacity..."  For 
instance, I believe Xerces does this translation for you 'cause, really, why 
wouldn't you want it to?
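
    For comparison, current Python 3 releases offer this as a setting: the 
convert_charrefs argument of html.parser.HTMLParser, which defaults to True 
from Python 3.5 onwards.  A small sketch, assuming such a version and a 
document passed in a single feed() call; DataLogger is a made-up name.

```python
from html.parser import HTMLParser


class DataLogger(HTMLParser):
    """Hypothetical helper: records every string passed to handle_data()."""

    def __init__(self):
        # Assumption: Python 3.5+, where convert_charrefs defaults to True.
        super().__init__()
        self.data = []

    def handle_data(self, data):
        self.data.append(data)


logger = DataLogger()
logger.feed("<p>fish &amp; chips</p>")
logger.close()
print(logger.data)
```

    Here the reference is translated and the text arrives in one 
handle_data() call.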


>>     These are serious problems for me and I would greatly appreciate a 
>> deeper understanding of these issues.
>
> HTH, and read the docs.

    This does help, thank you.  I have obviously read the docs, since I can 
use HTMLParser enough to find this behaviour.  I don't find the docs to be 
very explanatory (perhaps I'm reading the wrong docs) and I think they 
assume you already know a lot about HTML and parsing, which may be necessary 
assumptions but are not necessarily true...






