Unicode -> String problem

Tue Jul 10 21:51:55 EDT 2001

Jay Parlar <jparlar at home.com> wrote in message news:<mailman.994770857.26075.python-list at python.org>...
> The reason that I asked for code to remove the non-convertible unicode characters is because I thought that the '\xa0'
> that I was seeing was a result of the unicode. In further testing, it seems that it's not. When I download the page 
> 'http://www.ign.com', and parse it using the HTMLParser method above, one of the characters/words returned to me is
> '\xa0'. Now, I have no idea where this is coming from. I suppose I'll have to start looking at the source for HTMLParser, as 
> well as the formatter and writer I used. It's not just '\xa0' either, one page (which I can't seem to remember now) returned a 
> '\x95' as one of the characters/words. 

Possibly you might attain enlightenment if you compared the original
HTML with the output from HTMLParser --- I know next to nothing about
HTML but I'm willing to bet that such a comparison might just bring up
correspondences like &nbsp; -> \xa0 and &#149; -> \x95. BTW, the url
you gave has 22 occurrences of &#149; ...
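
One rough way to do that comparison is to tally the entity and character
references in the raw HTML before it goes anywhere near the parser. The
sketch below is only an illustration: it is written in present-day
Python 3 syntax, and the function name count_entities and the sample
string are made up for the example, not taken from the page in question.

```python
import re
from collections import Counter

def count_entities(html_source):
    """Tally every entity or character reference (e.g. &nbsp;, &#149;) in raw HTML."""
    # &name; or &#digits; (the optional '#' covers numeric references)
    return Counter(re.findall(r"&#?\w+;", html_source))

# Hypothetical snippet standing in for a downloaded page
sample = "<p>News&nbsp;&#149;&nbsp;Reviews &#149; Guides &amp; more</p>"
counts = count_entities(sample)

for ref, n in counts.most_common():
    print(ref, n)
```

If the references that show up here line up with the odd bytes in the
parser output, the parser is just doing its job of translating them.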

Then you can proceed at one of two levels:
(1) follow Paul Prescod's suggestion to use the ASCII codec with the
"ignore" option -- this will strip out all non-ASCII characters
(2) use u.encode("cp1252") to give you native Windows stuff in an
8-bit string. However you have to be prepared to handle things like a
"no-break space" (\xa0) and a "bullet" (\x95) and whatever else may be
in the input; time to buy a book on HTML, maybe.
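
The two options can be sketched side by side. This is a minimal
illustration in Python 3 syntax, where a plain str is already Unicode
and encode() returns bytes (in the Python of this thread you would start
from a u"..." object instead). The sample string is invented for the
example.

```python
# Sample text containing a no-break space (U+00A0) and a bullet (U+2022)
text = "Python\xa0rocks \u2022 really"

# Option (1): ASCII codec with "ignore" silently drops anything non-ASCII
ascii_only = text.encode("ascii", "ignore")

# Option (2): cp1252 keeps both characters as single Windows bytes
cp1252_bytes = text.encode("cp1252")

print(ascii_only)    # the no-break space is gone, so two words run together
print(cp1252_bytes)  # contains \xa0 and \x95
```

Note the side effect of option (1): dropping the no-break space glues
"Python" and "rocks" into one word, which matters if you are counting
words afterwards.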

<:-)>
If you get exceptions with "cp1252", it probably means that you have
strayed too far into the forest and encountered a web page produced by
some pesky furriner varmint -- "Oooh, Grandma, what enormous ordinals
you have!"
</:-)>
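
For the record, here is one possible way to find out which characters
are the troublemakers before giving up on "cp1252". It is a sketch in
Python 3 syntax; the helper name unencodable and the sample text are
hypothetical, and the "replace" fallback is just one choice among the
standard error handlers.

```python
def unencodable(text, codec="cp1252"):
    """Return the distinct characters in text that codec cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(codec)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad

# Hypothetical page text with two characters outside cp1252
page_text = "Tokyo \u6771\u4eac"

problem_chars = unencodable(page_text)

# Fallback: substitute '?' for whatever the codec cannot handle
fallback = page_text.encode("cp1252", "replace")

print(problem_chars)
print(fallback)
```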

Oh and call me silly, but a brief peek at an IE5 cache gives me the
impression that what is there is plain ol' HTML, just like you get
from a direct download. You may like to ask exactly what functionality
the C++ kit is providing ...

> One technique I'm using that might be causing the problem is: Using an re to remove a lot of punctuation, punctuation that 
> would cause the word "Python" not to be matched with "Python." (notice the period). Actually, if anyone has a better 
> solution to that, I'd be more than willing to listen. 

That is not causing your current symptom. It is extremely likely to
cause other possibly unwanted symptoms -- like "alt.foo.bar" being
treated as the word "altfoobar". I'm sure various people will give you
one-line Python solutions to the symptoms. However it looks like you
need to start from the top down -- What is the definition of a "word"?
What tools/techniques are there to tokenise text into "words" and
"non-words"?
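
As a starting point only, here is one possible definition of a "word",
expressed as a regular expression instead of a punctuation-stripping
pass: a run of word characters, optionally continued by a single dot,
apostrophe or hyphen that is immediately followed by more word
characters. That definition is an assumption made for this sketch, not
the answer to the question above; the function name tokenise is likewise
invented, and the code uses Python 3 syntax.

```python
import re

# A "word" here: word characters, with internal '.', "'" or '-' allowed
# only when more word characters follow (so trailing punctuation is dropped).
WORD_RE = re.compile(r"\w+(?:[.'-]\w+)*")

def tokenise(text):
    """Split text into words under the definition encoded in WORD_RE."""
    return WORD_RE.findall(text)

words = tokenise("I like Python. See alt.foo.bar for more.")
print(words)

# A no-break space is not a word character, so it acts as a separator
print(tokenise("Python\xa0rocks"))
```

With this pattern "Python." yields "Python" while "alt.foo.bar" stays in
one piece. Whether that is what you want depends entirely on how you
answer the definition question first.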