How do I automate the removal of all non-ascii characters from my code?

Stefan Behnel stefan_ml at
Mon Sep 12 04:43:51 EDT 2011

Alec Taylor, 12.09.2011 10:33:
> from creole import html2creole
> from BeautifulSoup import BeautifulSoup
> VALID_TAGS = ['strong', 'em', 'p', 'ul', 'li', 'br', 'b', 'i', 'a', 'h1', 'h2']
> def sanitize_html(value):
>     soup = BeautifulSoup(value)
>     for tag in soup.findAll(True):
>         if tag.name not in VALID_TAGS:
>             tag.hidden = True
>     return soup.renderContents()
> html2creole(u(sanitize_html('''<h1
> style="margin-left:76.8px;margin-right:0;text-indent:0;">Abstract</h1>
>     <p class="Standard"
> style="margin-left:76.8px;margin-right:0;text-indent:0;">
> [more stuff here]
> """))


I'm not sure what you are trying to say with the above code, but if it's 
the code that fails for you with the exception you posted, I would guess 
that the problem is in the "[more stuff here]" part, which likely contains 
a non-ASCII character. Note that you didn't declare the source file 
encoding above. Do as Gary told you.
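
To make that concrete, here is a minimal sketch that is not from the thread. The first line is a PEP 263 source encoding declaration, which assumes the file is actually saved as UTF-8. The helper names find_non_ascii and strip_non_ascii are hypothetical, made up for illustration: the first reports where non-ASCII characters sit so you can fix them by hand, the second simply drops them, which may well lose characters you wanted to keep.

```python
# -*- coding: utf-8 -*-
# The line above declares the source file encoding (PEP 263).
# It must be the first or second line of the file.

# Hypothetical helpers for illustration; they are not part of any library.


def find_non_ascii(source_text):
    """Return a (line_number, column, character) tuple for each non-ASCII character."""
    hits = []
    for line_number, line in enumerate(source_text.splitlines(), start=1):
        for column, char in enumerate(line, start=1):
            if ord(char) > 127:
                hits.append((line_number, column, char))
    return hits


def strip_non_ascii(source_text):
    """Drop every character outside the ASCII range (lossy)."""
    return "".join(char for char in source_text if ord(char) < 128)


# Sample input with one non-ASCII character, written as an escape so that
# this example itself stays pure ASCII.
sample = 'x = 1\ny = "caf\u00e9"\n'

for line_number, column, char in find_non_ascii(sample):
    print("line %d, column %d: %r" % (line_number, column, char))
```

Locating the characters first is probably the safer route: it tells you whether the offending character is a stray paste artifact you can delete or real text that needs the encoding declaration instead.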
