Stripping ASCII codes when parsing
fairwinds at eastlink.ca
Mon Oct 17 19:50:31 CEST 2005
This is very nice :-) Thank you Tony. I think this will be the way to
go. My concern at the moment is where it will be best to convert to
unicode. After this step the data will go into a dict, through a few
processes, and then into a database. Because the input source has no
explicit encoding, I believe I will have to assume ISO-8859-1, but it
could well be cp1252 for the most part (because the format says no
ASCII chars 0-30 but allows chars 128-254, and because most users are
on Windows). I am thinking of stripping these characters and validating
the text first, then converting to unicode (utf-8) so that it is
unicode in the dict. Then when I perform these other processes the data
should be uniform, and it will go into the database as unicode. I think
this should be ok.
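
A minimal sketch of the order described above (strip, then decode once so that everything downstream is text). It assumes present-day Python 3, which postdates this thread. The name clean_field and the choice of cp1252 with an ISO-8859-1 fallback are hypothetical, not part of the format in question.

```python
# Bytes 0-30 plus the pipe character (124), as the format advises.
# Note that this also removes tab, newline and carriage return.
DELETE_BYTES = bytes(range(31)) + b"|"


def clean_field(raw: bytes, encoding: str = "cp1252") -> str:
    """Strip unwanted bytes, then decode to text (hypothetical helper)."""
    # With a table of None, bytes.translate only deletes the listed bytes.
    stripped = raw.translate(None, DELETE_BYTES)
    try:
        return stripped.decode(encoding)
    except UnicodeDecodeError:
        # cp1252 leaves a few byte values undefined (0x81, for example);
        # ISO-8859-1 maps every byte, so this fallback cannot fail.
        return stripped.decode("iso-8859-1")
```

Decoding at this single point means the dict, the later processes and the database only ever see text, and the guess about the source encoding lives in one place.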
On Monday, October 17, 2005, at 01:48 PM, Tony Nelson wrote:
> In article <mailman.2153.1129538807.509.python-list at python.org>,
> David Pratt <fairwinds at eastlink.ca> wrote:
>> I am working with a text format that advises stripping any ASCII
>> characters (0 - 30), and also the ASCII pipe character (124), from
>> the data as part of parsing. I think many of these characters are
>> from a different time. Since I have never seen most of them in text,
>> I am not sure how these first 30 control characters are all
>> represented (other than, say, tab (\t), newline (\n), and carriage
>> return (\r)). What should I do to remove these characters if they are
>> ever encountered? Many thanks.
> Most of those characters are hard to see.
> Represent arbitrary characters in a string in hex: "\x00\x01\x02" or
> with chr(n).
> If you just want to remove some characters, look into "".translate().
> nullxlate = "".join([chr(n) for n in xrange(256)])
> delchars = nullxlate[:31] + chr(124)
> outputstr = inputstr.translate(nullxlate, delchars)
> *firstname*nlsnews at georgea*lastname*.com
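
The quoted snippet is Python 2 (xrange, and the two-argument form of translate on byte strings). For text strings in Python 3, a rough equivalent might look like the sketch below. The names DELETE_MAP and strip_chars are made up for illustration.

```python
# str.translate takes one mapping from code point to replacement;
# a value of None means "delete this character".
DELETE_MAP = dict.fromkeys(list(range(31)) + [124])


def strip_chars(inputstr: str) -> str:
    """Remove characters 0-30 and the pipe (124) from a text string."""
    return inputstr.translate(DELETE_MAP)
```

As in the original, the range stops at 30, so character 31 (unit separator) is left in place.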