[Tutor] pythonic ascii decoding!
mats at wichmann.us
Mon Jul 31 11:56:22 EDT 2017
On 07/31/2017 09:39 AM, bruce wrote:
> Hi guys.
> Testing getting data from a number of different US based/targeted
> websites. So the input data source for the most part, will be "ascii".
> I'm getting a few "weird" chars every now and then, and as far as I can
> tell, they should be utf-8.
> However, the following hasn't always worked:
> So, is there a quick/dirty approach I can use to simply strip out the
> "non-ascii" chars? I know this might not be the "best/pythonic" way,
> and that it might result in loss of some data/chars, but I can live
> with it for now.
> thoughts/comments ??
It's easy enough to toss chars if you don't care what's being tossed,
which sounds like your case, something like:
''.join(i for i in s if ord(i) < 128)
but there's actually lots to think about here (I'm sure others will jump in)
- Python 2 strings default to ASCII, Python 3 to Unicode; there may be
some excitement with the use of ord() depending on how the string is
- websites will tell you their encoding, which you could and probably
should make use of
- web scraping with Python is a pretty well developed field, perhaps you
might want to use one of the existing projects? (https://scrapy.org/ is
pretty famous, certainly not the only one)
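
To make the first two points a bit more concrete, here is a rough sketch,
assuming Python 3 and that you already have the page as raw bytes. The helper
names (strip_non_ascii, decode_page) and the declared_charset argument are made
up for illustration; declared_charset stands for whatever charset you pulled
out of the site's Content-Type header or meta tag.

```python
# Sketch only: both helper names are invented for this example.

def strip_non_ascii(text):
    """Drop every character outside the ASCII range (lossy on purpose)."""
    return text.encode("ascii", errors="ignore").decode("ascii")


def decode_page(raw, declared_charset=None):
    """Decode raw bytes with the charset the site declared, else UTF-8.

    Bytes that cannot be decoded become U+FFFD instead of raising.
    """
    charset = declared_charset or "utf-8"
    try:
        return raw.decode(charset, errors="replace")
    except LookupError:
        # The server named a charset Python does not know: fall back to UTF-8.
        return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    raw = "caf\u00e9 \u2013 menu".encode("utf-8")  # stand-in for downloaded bytes
    text = decode_page(raw, "utf-8")
    print(strip_non_ascii(text))
```

Decoding with the declared charset first means you only throw characters away
as a deliberate last step, rather than because the bytes were misread.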