Unicode -> String problem

Wed Jul 11 01:55:46 EDT 2001

> 
> Possibly you might attain enlightenment if you compared the original
> HTML with the output from HTMLParser --- I know next to nothing about
> HTML but I'm willing to bet that such a comparison might just bring up
> correspondences like &nbsp; -> \xa0 and · -> \x95. BTW, the url
> you gave has 22 occurrences of · ...

I'm definitely planning on trying that tomorrow. My initial tests with HTMLParser were (unfortunately, I guess) on relatively simple pages, with no complex JScript or whatever silly things people are putting on their pages today, so I started out assuming that it worked exactly as it should, with no interference from me.

> Then you can proceed at one of two levels:
> (1) follow Paul Prescod's suggestion to use the ASCII codec with the
> "ignore" option -- this will strip out all non-ASCII characters
> (2) use u.encode("cp1252") to give you native Windows stuff in an
> 8-bit string. However you have to be prepared to handle things like a
> "no-break space" (\xa0) and a "bullet" (\x95) and whatever else may be
> in the input; time to buy a book on HTML, maybe.
> 
> <:-)>
> If you get exceptions with "cp1252", it probably means that you have
> strayed too far into the forest and encountered a web page produced by
> some pesky furriner varmint -- "Oooh, Grandma, what enormous ordinals
> you have!"
> </:-)>
Hehe, I know what that's all about. Being an Opera user, I have a better view than most people of how poorly, or how far from W3C standards, most webpages are written. IE is a 20 meg download (at least) because it needs so much machinery to deal with garbage webpages. Opera is a 2 meg download because it sticks to standards.
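
For reference, here is a minimal sketch of the two options quoted above. It assumes present-day Python 3 rather than the Python of this thread, and the sample string and variable names are made up for illustration. It only shows what each codec call does to a no-break space and a bullet.

```python
# Hypothetical sample text: a no-break space (U+00A0) and a bullet (U+2022).
text = "price:\u00a0100 \u2022 item"

# Option 1: the ASCII codec with "ignore" silently drops every non-ASCII character.
ascii_bytes = text.encode("ascii", "ignore")

# Option 2: cp1252 keeps them as single Windows bytes (\xa0 and \x95).
cp1252_bytes = text.encode("cp1252")

print(ascii_bytes)
print(cp1252_bytes)
```

With option 2, any character that has no cp1252 equivalent raises UnicodeEncodeError unless an error handler such as "ignore" or "replace" is passed as well.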

> Oh and call me silly, but a brief peek at an IE5 cache gives me the
> impression that what is there is plain ol' HTML, just like you get
> from a direct download. You may like to ask exactly what functionality
> the C++ kit is providing ...
I can't give you the reason right now why the cache is being returned in Unicode format, but I believe it had something to do with a buffer size or some such 
nonsense. I know that the author of the C++ code that retrieves that cache is viewing this thread, so Bill, if you want to jump in, I'd welcome it. No one seems to 
believe that Unicode is being returned :)

> That is not causing your current symptom. It is extremely likely to
> cause other possibly unwanted symptoms -- like "alt.foo.bar" being
> treated as the word "altfoobar". I'm sure various people will give you
> one-line Python solutions to the symptoms. However it looks like you
> need to start from the top down -- What is the definition of a "word"?
>  What tools/techniques are there to tokenise text into "words" and
> "non-words"?
I've actually put a lot of thought into this part, so "alt.foo.bar" won't be turned into "altfoobar". The English language is a terrible bitch-goddess, especially when it comes to its use on the internet, but for the most part, my code to pull out words is working as well as can be expected. I won't bother you with the RE that strips out punctuation, but take my word that it seems to be doing what needs to be done.
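
To illustrate the kind of tokenising being discussed, here is a rough sketch. It is not the RE mentioned above, which isn't shown in this thread. The pattern, the `WORD_RE` name and the `extract_words` helper are all hypothetical, and it assumes plain ASCII words.

```python
import re

# Hypothetical pattern: runs of letters/digits, optionally joined by a
# single internal '.', '-' or apostrophe, so "alt.foo.bar" stays one token.
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:[.'-][A-Za-z0-9]+)*")

def extract_words(text):
    """Return the word-like tokens in text, dropping surrounding punctuation."""
    return WORD_RE.findall(text)

print(extract_words("See alt.foo.bar, it's well-known."))
```

Trailing punctuation is dropped because a joiner only counts when it is followed by another letter or digit. Whether "alt.foo.bar" should be one word or three is exactly the definitional question raised in the quote.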

The main thing is that we're working on, at most, a pre-alpha version of the product here. This is an academic project, and the current goal is to put together a working version to test out some theories that the professor who commissioned the project has. If not everything works perfectly (i.e. unicode -> string, exactly pulling out words, etc.), then at this stage it's not a huge problem. However, the more "quirks" I can remove now, the better it will be for all those involved.

I guess I'll just repeat (rephrase) my question from before: If anyone has seen an easier way to do what I'm doing (properly parse a webpage, any webpage, and pull 
each word from it, or, even better, give me the text of the page as a list), I'd much appreciate it. If that doesn't happen though, then I'll take all the suggestions from 
this thread and try them all out.
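
One possible shape for "give me the text of the page as a list", sketched with the `html.parser` module from present-day Python 3. That module is an assumption and may differ from the HTMLParser used in this thread. The `TextExtractor` class and `page_words` function are hypothetical names, and splitting on whitespace is only a stand-in for a real word definition.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)

def page_words(html):
    """Return the page text as a list of whitespace-separated tokens."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()

print(page_words("<p>Hello <b>big</b> world</p><script>var x = 1;</script>"))
```

Badly formed pages may still need extra handling. This only covers the simple case where tags are reasonably well balanced.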

Jay P.