Unicode -> String problem
machin_john_888 at hotmail.com
Wed Jul 11 03:51:55 CEST 2001
Jay Parlar <jparlar at home.com> wrote in message news:<mailman.994770857.26075.python-list at python.org>...
> The reason that I asked for code to remove the non-convertible unicode characters is because I thought that the '\xa0'
> that I was seeing was a result of the unicode. In further testing, it seems that it's not. When I download the page
> 'http://www.ign.com', and parse it using the HTMLParser method above, one of the characters/words returned to me is
> '\xa0'. Now, I have no idea where this is coming from. I suppose I'll have to start looking at the source for HTMLParser, as
> well as the formatter and writer I used. It's not just '\xa0' either; one page (which I can't seem to remember now) returned a
> '\x95' as one of the characters/words.
Possibly you might attain enlightenment if you compared the original
HTML with the output from HTMLParser --- I know next to nothing about
HTML but I'm willing to bet that such a comparison might just bring up
correspondences like &nbsp; -> \xa0 and &#149; -> \x95. BTW, the url
you gave has 22 occurrences of &#149; ...
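A rough sketch of that comparison, in present-day Python 3 rather than the Python of this thread. The sample markup is made up, and it assumes the characters in question arrive as the character references &nbsp; and &#149;. Note that html.unescape follows the HTML5 rules and turns &#149; into U+2022 (the bullet), which is what cp1252 stores as the byte \x95; the old HTMLParser simply produced chr(149).

```python
import html

# Hypothetical fragment standing in for the downloaded page.
raw = "News&nbsp;&#149;&nbsp;Reviews &#149; Previews"

# Count how often each character reference occurs in the raw markup.
counts = {ref: raw.count(ref) for ref in ("&nbsp;", "&#149;")}

# Decode the references the way an HTML5-aware parser would.
decoded = html.unescape(raw)

# Collect the decoded characters that fall outside ASCII.
non_ascii = sorted({ch for ch in decoded if ord(ch) > 127})

print(counts)
print([hex(ord(ch)) for ch in non_ascii])
```

Running the same count against the real page source should show whether the stray characters line up with the references in the markup.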
Then you can proceed at one of two levels:
(1) follow Paul Prescod's suggestion to use the ASCII codec with the
"ignore" option -- this will strip out all non-ASCII characters
(2) use u.encode("cp1252") to give you native Windows stuff in an
8-bit string. However you have to be prepared to handle things like a
"no-break space" (\xa0) and a "bullet" (\x95) and whatever else may be
in the input; time to buy a book on HTML, maybe.
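The two options might look like this in present-day Python 3, where encode() returns a bytes object instead of an 8-bit string. The sample text is invented for illustration; treat this as a sketch, not a recipe.

```python
# Hypothetical text containing a no-break space and a bullet.
text = "Python\u00a0rocks \u2022 really"

# Option (1): the ASCII codec with "ignore" silently drops non-ASCII
# characters. Note that dropping the no-break space glues two words together.
ascii_only = text.encode("ascii", "ignore")

# Option (2): cp1252 keeps both characters as single bytes.
windows = text.encode("cp1252")

# One possible follow-up for option (2): treat the no-break space as a space.
normalised = windows.replace(b"\xa0", b" ")

print(ascii_only)
print(windows)
print(normalised)
```

Option (1) is simpler but lossy; option (2) preserves the characters and leaves it to you to decide what each one should mean.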
If you get exceptions with "cp1252", it probably means that you have
strayed too far into the forest and encountered a web page produced by
some pesky furriner varmint -- "Oooh, Grandma, what enormous ordinals you have!"
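If those exceptions do turn up, one way to see which characters are to blame is sketched below (Python 3; to_cp1252 is a made-up helper name, and falling back to "replace" is just one possible policy).

```python
def to_cp1252(text):
    """Return (encoded bytes, list of characters cp1252 cannot represent)."""
    try:
        return text.encode("cp1252"), []
    except UnicodeEncodeError:
        # Find every character that has no cp1252 equivalent.
        bad = []
        for ch in text:
            try:
                ch.encode("cp1252")
            except UnicodeEncodeError:
                bad.append(ch)
        # Fall back to "replace", which substitutes "?" for each of them.
        return text.encode("cp1252", "replace"), bad


print(to_cp1252("caf\u00e9"))
print(to_cp1252("\u4e2d\u6587 ok"))
```

Logging the offending characters instead of discarding them quietly makes it easier to tell a stray symbol from a page in a different script altogether.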
Oh and call me silly, but a brief peek at an IE5 cache gives me the
impression that what is there is plain ol' HTML, just like you get
from a direct download. You may like to ask exactly what functionality
the C++ kit is providing ...
> One technique I'm using that might be causing the problem is: Using an re to remove a lot of punctuation, punctuation that
> would cause the word "Python" not to be matched with "Python." (notice the period). Actually, if anyone has a better
> solution to that, I'd be more than willing to listen.
That is not causing your current symptom. It is extremely likely to
cause other possibly unwanted symptoms -- like "alt.foo.bar" being
treated as the word "altfoobar". I'm sure various people will give you
one-line Python solutions to the symptoms. However it looks like you
need to start from the top down -- What is the definition of a "word"?
What tools/techniques are there to tokenise text into "words" and
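As one illustration of starting from a definition, the sketch below assumes (and this is only an assumption) that a "word" is a run of ASCII letters or digits, optionally joined by single internal '.', '-' or "'" characters. Under that definition trailing punctuation is dropped but "alt.foo.bar" survives intact. WORD_RE and tokenise are hypothetical names.

```python
import re

# Assumed definition of a word: alphanumeric runs, optionally joined by a
# single internal '.', '-' or "'" (so trailing punctuation is not included).
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")


def tokenise(text):
    """Return the list of words found in text, in order of appearance."""
    return WORD_RE.findall(text)


print(tokenise("I like Python. See alt.foo.bar, don't you?"))
```

A different definition of "word" (accented letters, hyphenation rules, and so on) would need a different pattern, which is rather the point of asking the question first.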