Unicode -> String problem

Tue Jul 10 07:45:25 EDT 2001

> >  
> > I haven't found anything that will explicitly do what I want, namely,
> > completely remove any unconvertible unicode characters. I have to be
> > able to parse this text afterwards, using a lot of Python's string
> > functions,
> [snip]
> 
> Are you completely sure that the offending characters are totally
> meaningless in your application? u'\u00A0' is a no-break space; seems
> stripping that out but leaving normal spaces (u'\u0020') might not be
> a good idea. What other non-ASCII characters do you have?
> 
> Are you sure it's Unicode? Is \xA0 exactly what you are seeing, or are
> you seeing \u00A0 and telling us it's \xA0 ??
> 
> Which "lot of Python's string functions" do you plan to use? Note that
> 8-bit strings and Unicode strings support a large number of same-name
> as-close-to-same-functionality-as-possible methods -- see section
> 2.1.5.1 of the Python Library Reference manual. Also the re module
> supports the same functions and methods on both types.

Frankly, I'm starting to wonder about all of this myself :) I'll give a brief description of exactly what's going on; maybe
that will offer more insight.

My task is to create an HTML parser that will pull the full text from HTML documents and count the number of occurrences of
each word, as well as the first position it appeared in. I'm using HTMLParser(AbstractFormatter(DumbWriter(cStringIO))) for
the HTML parsing, and then a variety of other methods to count the occurrences of each word. There are two possible ways
that my code will be told what HTML to parse: either it is given a URL, and I have to download the page using urlretrieve
(or urlopen; I don't think the difference is important), or it is given the full HTML of a page, retrieved from IE's cache.
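
For readers on a current Python, here is a minimal sketch of the text-extraction step. It is not the
HTMLParser(AbstractFormatter(DumbWriter(...))) chain described above: it assumes Python 3 and swaps in the standard
html.parser module. The names TextExtractor and extract_text are hypothetical, made up for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text content of a page, skipping <script> and <style>."""

    def __init__(self):
        # convert_charrefs=True turns entities such as &nbsp; into characters.
        super().__init__(convert_charrefs=True)
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text(html_source):
    """Return the visible text of html_source with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html_source)
    parser.close()
    # str.split() with no arguments also splits on the no-break space.
    return " ".join("".join(parser.chunks).split())


sample = "<html><head><style>p{}</style></head><body><p>Hello&nbsp;<b>world</b>.</p></body></html>"
page_text = extract_text(sample)
```

The whitespace collapse on the last line of extract_text is a design choice for word counting; it discards the layout
that a formatter/writer pair would try to preserve.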

Now, whenever I'm given HTML from IE's cache, it is unicode. There is no doubt about that. When my colleague originally
wrote the cache retrieval code, he had to implement it as unicode in C++. This code is the reason I needed to convert from
unicode to string; otherwise, I was getting "ASCII encoding error: ordinal not in range(128)" errors.
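
As a rough illustration of that error and of the usual ways around it, here is a small sketch. It assumes Python 3
syntax (where the unicode type is simply str), and the sample text is made up, not taken from the cached pages.

```python
# Made-up sample containing a no-break space (U+00A0) and a bullet (U+2022).
text = "Python\u00a0rocks \u2022 ok"

# A strict ASCII conversion fails with the "ordinal not in range(128)" error.
try:
    text.encode("ascii")
    strict_error = None
except UnicodeEncodeError as exc:
    strict_error = str(exc)

# "ignore" silently drops every character that has no ASCII equivalent.
dropped = text.encode("ascii", "ignore").decode("ascii")

# "replace" substitutes a question mark, so the loss stays visible.
replaced = text.encode("ascii", "replace").decode("ascii")

# Mapping the no-break space to a plain space first keeps words separated.
spaced = text.replace("\u00a0", " ").encode("ascii", "ignore").decode("ascii")
```

Note that "ignore" glues "Python" and "rocks" together, which is the concern raised in the quoted reply about stripping
the no-break space outright.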

The reason I asked for code to remove the unconvertible unicode characters is that I thought the '\xa0' I was seeing
was a result of the unicode. On further testing, it seems that it's not. When I download the page 'http://www.ign.com' and
parse it using the HTMLParser method above, one of the characters/words returned to me is '\xa0'. I have no idea where this
is coming from. I suppose I'll have to start looking at the source for HTMLParser, as well as the formatter and writer I
used. It's not just '\xa0' either: one page (which I can't seem to remember now) returned a '\x95' as one of the
characters/words.
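
One possible explanation, offered as a guess and not confirmed against these pages or the parser source: the HTML
entity &nbsp; is the no-break space U+00A0, which is the single byte 0xA0 in Latin-1, and the byte 0x95 is a bullet in the
Windows-1252 code page that many pages of this era were written in. The sketch below, in Python 3 syntax, only
demonstrates those two mappings.

```python
import html

# The &nbsp; entity decodes to the no-break space, U+00A0.
nbsp = html.unescape("&nbsp;")

# In Latin-1 that character is the single byte 0xA0.
nbsp_latin1 = nbsp.encode("latin-1")

# In Windows-1252, the byte 0x95 is the bullet character, U+2022.
bullet = b"\x95".decode("cp1252")
```

If that guess is right, both characters are ordinary page content (a spacer and a list bullet) and not damage from the
unicode conversion.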

One technique I'm using that might be causing the problem: I use an re to remove a lot of punctuation, the kind that
would otherwise cause the word "Python" not to be matched with "Python." (notice the period). Actually, if anyone has a
better solution to that, I'd be more than willing to listen.
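
One alternative to stripping punctuation is to match the words themselves and ignore everything in between. This is
only a sketch, in Python 3 syntax; count_words is a hypothetical helper, and "position" is assumed to mean the index of the
word within the text, which may not be the definition the real task needs.

```python
import re


def count_words(text):
    """Map each word to [occurrence count, index of its first appearance]."""
    stats = {}
    # \w+ matches runs of letters, digits and underscores, so periods, commas,
    # no-break spaces and bullets never become part of (or count as) a word.
    for position, match in enumerate(re.finditer(r"\w+", text)):
        word = match.group()
        if word in stats:
            stats[word][0] += 1
        else:
            stats[word] = [1, position]
    return stats


stats = count_words("Python is fun.\u00a0I like Python.")
```

Here stats["Python"] is [2, 0]: both "Python" and "Python." are counted as the same word, first seen at word index 0.
The match is case-sensitive; lowercasing the text first would be a separate decision.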

Now, if I'm making some obvious mistake here, I ask your forgiveness, what with my total time with Python thus far 
amounting to no more than 1.5 months. However, if you or anyone else could offer any additional insight into this, it would 
be much appreciated.

Jay P. 

PS. I have to leave for a meeting right now, so I have no time to double-check this. Please excuse any silly
grammar/spelling errors.