[Tutor] pythonic ascii decoding!

Mon Jul 31 11:56:22 EDT 2017

On 07/31/2017 09:39 AM, bruce wrote:
> Hi guys.
> 
> Testing getting data from a number of different US based/targeted
> websites. So the input data source for the most part, will be "ascii".
> I'm getting a few "weird" chars every now and then, and as far as I can
> tell, they should be utf-8.
> 
> However, the following hasn't always worked:
>     s=str(s).decode('utf-8').strip()
> 
> So, is there a quick/dirty approach I can use to simply strip out the
> "non-ascii" chars? I know this might not be the "best/pythonic" way,
> and that it might result in loss of some data/chars, but I can live
> with it for now.
> 
> thoughts/comments ??

It's easy enough to toss chars if you don't care what's being tossed,
which sounds like your case, something like:

''.join(i for i in s if ord(i) < 128)

but there's actually lots to think about here (I'm sure others will jump in):

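- Python 2's str is a byte string (with ascii as the default codec for
implicit conversions), while Python 3's str is unicode text, so there may
be some excitement with the use of ord() depending on how the string is
passed around (iterating over Python 3 bytes yields ints, and ord() on an
int raises TypeError)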
- websites will usually tell you their encoding (the charset in the HTTP
Content-Type header, or an HTML meta tag), which you could and probably
should make use of
- web scraping with Python is a pretty well developed field, perhaps you
might want to use one of the existing projects? (https://scrapy.org/ is
pretty famous, certainly not the only one)
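
To tie the first two points together, here is a rough Python 3 sketch, not
a definitive recipe. The helper names charset_from_content_type and
to_ascii are made up for illustration, and falling back to utf-8 when no
charset is declared is an assumption, not something the site guarantees.
It decodes bytes with the declared encoding first, then throws away
whatever is not ascii:

```python
import re


def charset_from_content_type(content_type, default="utf-8"):
    """Return the charset named in a Content-Type header value.

    Hypothetical helper: falls back to `default` (an assumption) when the
    header is missing or does not declare a charset.
    """
    match = re.search(r'charset="?([\w.:-]+)', content_type or "", re.IGNORECASE)
    return match.group(1) if match else default


def to_ascii(raw, encoding="utf-8"):
    """Decode bytes with `encoding`, then drop every non-ascii character."""
    if isinstance(raw, bytes):
        # Undecodable bytes become U+FFFD, which is removed in the next step.
        text = raw.decode(encoding, errors="replace")
    else:
        text = raw
    # errors="ignore" silently discards anything outside the ascii range.
    return text.encode("ascii", errors="ignore").decode("ascii").strip()


if __name__ == "__main__":
    header = "text/html; charset=ISO-8859-1"
    body = b"caf\xe9 au lait "
    print(to_ascii(body, charset_from_content_type(header)))
```

Like the one-liner above, this loses data (accented letters simply vanish),
and an encoding name Python does not recognise will raise LookupError, so
treat it as quick/dirty rather than robust.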