cleaning up an ASCII file?

Nick Matzke matzke at berkeley.edu
Wed Jun 10 22:22:27 EDT 2009


Looks like this was a solution:

1. Use the unescape function from effbot.org to convert HTML/XML entities to 
Unicode:
http://effbot.org/zone/re-sub.htm#unescape-html


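2. Normalize the Unicode to NFKD with unicodedata, then encode it to ASCII 
and ignore whatever does not fit. This keeps the base letter of an accented 
character and drops the accent:


import unicodedata

ascii_content2 = unescape(line)  # effbot's unescape from step 1

# Python 2: decompose, then drop everything that is not ASCII
ascii_content = unicodedata.normalize('NFKD', unicode(ascii_content2)).encode('ascii', 'ignore')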


The original string "line" triggered the error; ascii_content does not.
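
For anyone reading this on Python 3, the same two steps can be sketched with 
the standard library only. This is a rough equivalent, not the code I ran: 
html.unescape stands in for the effbot function, and the name to_ascii is 
made up for the example.

```python
import html
import unicodedata


def to_ascii(text):
    """Unescape HTML/XML entities, then reduce the result to plain ASCII.

    Hypothetical helper: Python 3 stand-in for the two steps above.
    """
    # Step 1: "&eacute;" or "&#252;" become the actual Unicode characters.
    unescaped = html.unescape(text)
    # Step 2: NFKD splits an accented character into base letter + combining mark.
    decomposed = unicodedata.normalize("NFKD", unescaped)
    # Encoding with errors="ignore" silently drops everything that is not ASCII.
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("Caf&eacute; M&#252;ller"))  # prints: Cafe Muller
```

Note that this is lossy. Characters with no NFKD decomposition simply 
disappear (the German sharp s in "Straße" leaves "Strae"), so it only suits 
cases where an approximate ASCII match is good enough.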

Cheers!
Nick

PS: "asciiDammit" is also fun to look at




John Machin wrote:
> On Jun 11, 6:09 am, Nick Matzke <mat... at berkeley.edu> wrote:
>> Hi all,
>>
>> So I'm parsing an XML file returned from a database.  However, the
>> database entries have occasional non-ASCII characters, and this is
>> crashing my parsers.
> 
> So fix your parsers. google("unicode"). Deleting stuff that you don't
> understand is an "interesting" approach to academic research :-(
> 
> Care to divulge what "crash" means? e.g. the full traceback and error
> message, plus what version of python on what platform, what version of
> ElementTree or other XML software you are using ...
> 
>> Center for Theoretical Evolutionary Genomics
> 
> If your .sig evolves much more, it will consume all available
> bandwidth in the known universe and then some ;-)

-- 
====================================================
Nicholas J. Matzke
Ph.D. Candidate, Graduate Student Researcher
Huelsenbeck Lab
Center for Theoretical Evolutionary Genomics
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab websites:
http://ib.berkeley.edu/people/lab_detail.php?lab=54
http://fisher.berkeley.edu/cteg/hlab.html
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140

-----------------------------------------------------
"[W]hen people thought the earth was flat, they were wrong. When people 
thought the earth was spherical, they were wrong. But if you think that 
thinking the earth is spherical is just as wrong as thinking the earth 
is flat, then your view is wronger than both of them put together."

Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 
14(1), 35-44. Fall 1989.
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm
====================================================


