[omaha] Tidy Help

Mike Hostetler mike at hostetlerhome.com
Sun May 18 00:49:16 CEST 2008


The problem is that we have a lot of HTML files in varying degrees
of bad.  They are all bad, just not equally bad.  So we are using both
Tidy and BeautifulSoup: Tidy to normalize the files into something
sane, and then BeautifulSoup to parse out what we want.  Using the
Soup by itself is slow and gives varying results.  Using Tidy first
and then the Soup is faster and gives us more consistent results.
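
A rough sketch of the two-stage approach described above, not the
actual script.  It assumes pytidylib (tidy_document) as the Tidy
binding and bs4 as the BeautifulSoup package, neither of which is
named in this thread.  normalize, extract_links, and the sample markup
are made-up names for illustration.  If either library is missing, the
sketch passes the HTML through untouched or falls back to the standard
library's HTMLParser, purely so that it still runs.

```python
"""Sketch: normalize messy HTML with Tidy, then extract links with BeautifulSoup."""
from html.parser import HTMLParser

try:
    # Assumed binding: pytidylib, which also needs the native libtidy library.
    from tidylib import tidy_document
except (ImportError, OSError):
    tidy_document = None

try:
    # Assumed package: bs4 (BeautifulSoup 4).
    from bs4 import BeautifulSoup
except ImportError:
    BeautifulSoup = None


def _squash(text):
    """Collapse runs of whitespace so wrapped link text compares cleanly."""
    return " ".join(text.split())


class _LinkCollector(HTMLParser):
    """Stdlib stand-in, used only when BeautifulSoup is not installed."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, _squash("".join(self._text))))
            self._href = None


def normalize(html):
    """Stage 1: let Tidy repair the markup; pass it through if Tidy is unavailable."""
    if tidy_document is None:
        return html
    try:
        cleaned, _warnings = tidy_document(html, options={"output-xhtml": 1})
    except OSError:  # pytidylib is installed but libtidy could not be loaded
        return html
    return cleaned or html


def extract_links(html):
    """Stage 2: pull (href, link text) pairs out of the normalized markup."""
    cleaned = normalize(html)
    if BeautifulSoup is not None:
        soup = BeautifulSoup(cleaned, "html.parser")
        return [(a["href"], _squash(a.get_text()))
                for a in soup.find_all("a", href=True)]
    collector = _LinkCollector()
    collector.feed(cleaned)
    collector.close()
    return collector.links


if __name__ == "__main__":
    # Hypothetical sample: uppercase tags, unclosed <li> and <p> elements.
    sample = ('<UL><LI><A HREF="/q1.html">Q1 report</A>'
              '<LI><a href="/q2.html">Q2 report</a></UL><p>unclosed paragraph')
    for href, text in extract_links(sample):
        print(href, text)
```

Keeping the two stages as separate functions makes it easy to time
the Soup with and without the Tidy pass on the same set of files.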


On May 17, 2008, at 12:15 PM, Jeff Hinrichs - DM&T wrote:

> Yes,
>
> http://www.crummy.com/software/BeautifulSoup/documentation.html#Printing%20a%20Document
>
> On Sat, May 17, 2008 at 9:10 AM, Burch Kealey  
> <bkealey at mail.unomaha.edu> wrote:
>>
>>   I don't know. Does BS clean up bad HTML for parsing?
>>
>>   Burch T. Kealey, PhD.
>>   RH-CBA 408-N
>>   University of Nebraska at Omaha
>>   6000 Dodge Street
>>   Omaha Nebraska  68104
>>   402-554-3571
>> _______________________________________________
>> Omaha Python Users Group mailing list
>> Omaha at python.org
>> http://mail.python.org/mailman/listinfo/omaha
>> http://www.OmahaPython.org
>>
>
>
>
> -- 
> Jeff Hinrichs
> Dundee Media & Technology, Inc
> jeffh at dundeemt.com
> 402.218.1473
> web: www.dundeemt.com
> blog: inre.dundeemt.com

Mike Hostetler
mike at hostetlerhome.com
http://mike.hostetlerhome.com