[Tutor] Clarified: Best way to alter sections of a string which match dictionary keys?

SSokolow from_python_tutor at SSokolow.com
Sat Jan 3 23:14:52 EST 2004


Karl Pflästerer wrote:

>On 3 Jan 2004, SSokolow <- from_python_tutor at SSokolow.com wrote:
>
>>Your reply is confusing me but as I understand it, there are three
>>problems with this:
>
>I didn't want to confuse you.
>
>[...]
>
>>2. What do you mean safer? The situation may not apply to this
>
>The regexp isn't 100% safe against badly written (or broken) HTML.  A
>match starts with `<a', followed by some attributes, and somewhere among
>them is `href='.  I'm not absolutely sure at the moment (I would have to
>reread the W3C docs) how much the syntax may vary.  Furthermore you need
>to cope with both HTML and XHTML; the latter should be the smaller
>problem as it is much stricter, but HTML may vary a lot.  That's because
>a lot of people don't read the W3C docs.  But I think you need to cope
>with spaces between `href=' and the following attribute value.  Also
>the quotes can be single or double quotes (they should be double).
>
>That's not the biggest problem; all of this can be handled with a
>regexp.  But if you had the (pathological) case that somebody writes
>   <a ....> <a       </a> ..   </a>
>a regexp will fail.  But maybe that never happens, or only once in a
>million.  If you can live with that, fine.
>
>[...]
>
>> I also forgot to mention that the variable string does not hold the
>> entire file. This is run for each chunk of data as it's received from
>> the server. (I don't know how to build content-layer filtering into
>> the proxy code I'm extending, so I hooked it in at the content layer.)
>> Testing has shown that some links lie across chunk boundaries like
>> this:
>>
>>[continued from previous chunk]is some link text</a>
>>.
>>.
>>.
>><a href="whatever">This is th[continued in next chunk]
>>
>>and I don't know if the HTML parser might stumble on an unclosed <a>
>>tag pair.
>
>With that the parser can cope very well.  You would just have to change
>the code a bit, but that should be possible.
>
>But if speed matters I think the simple regexp solution might suffice.
>
>[...]
>
>I think the problem is interesting, so post here if you learn more (but
>please with as many facts as possible).
>
>   Karl
>
Of the tens of thousands of links on the site, only maybe ten or twenty 
are not generated by scripts. All of the HTML data is either generated 
on the fly by PHP scripts (comment boards), or saved to HTML files 
whenever a change is made. Therefore I feel safe in using the Regexp.
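
For reference, here is a minimal sketch of a regexp that tolerates the variations Karl lists (spaces around the `=' after href, single or double quotes, upper-case tags). It is not the regexp from my actual code; the names LINK_RE and find_links are made up for this illustration, and it still gives up on the pathological nested case.

```python
import re

# Hypothetical pattern, for illustration only:
#   group 2 = the href value, group 3 = the link text.
# (["']) remembers which quote opened the value so \1 can require
# the same quote to close it.
LINK_RE = re.compile(
    r"""<a\b[^>]*?\bhref\s*=\s*(["'])(.*?)\1[^>]*>(.*?)</a\s*>""",
    re.IGNORECASE | re.DOTALL,
)


def find_links(html):
    """Return a list of (href, link_text) pairs found in html."""
    return [(m.group(2), m.group(3)) for m in LINK_RE.finditer(html)]


if __name__ == "__main__":
    sample = "<a href=\"a.html\">One</a> <A class=x HREF = 'b.html'>Two</A>"
    for href, text in find_links(sample):
        print(href, text)
```

Because the link text is matched non-greedily, the pattern stops at the first closing tag, so a nested or unclosed <a> inside another link would be paired up wrongly. On script-generated pages like these that should be rare.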

I'll send notice when I make the completed program available on my 
website and/or send the rest of this code once I've implemented the rest 
of the features if you (or anybody else) are interested.
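
On the links that lie across chunk boundaries: the following is only a rough sketch of how the parser route Karl mentions might look, assuming Python 3's html.parser module rather than the proxy code itself. LinkCollector and collect_links are hypothetical names. The idea is that feed() can be called once per chunk and the parser holds back an incomplete tag until the next chunk arrives.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (href, text) pairs from <a> elements fed in arbitrary chunks."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently open, if any
        self._text = []     # pieces of link text seen so far

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Link text may arrive in several pieces when it spans chunks.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text)))
            self._href = None


def collect_links(chunks):
    """Feed each chunk to one parser instance and return the links found."""
    parser = LinkCollector()
    for chunk in chunks:
        parser.feed(chunk)
    parser.close()
    return parser.links
```

The important part is that a single parser instance lives for the whole response, so its state carries over from one chunk to the next.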

Stephan Sokolow