cleaning up an ASCII file?

Vlastimil Brom vlastimil.brom at gmail.com
Wed Jun 10 16:39:04 EDT 2009


2009/6/10 Nick Matzke <matzke at berkeley.edu>:
> Hi all,
>
> So I'm parsing an XML file returned from a database.  However, the database
> entries have occasional non-ASCII characters, and this is crashing my
> parsers.
>
> Is there some handy function out there that will schlep through a file like
> this, and do something like fix the characters that it can recognize, and
> delete those that it can't?  Basically, like the BBEdit "convert to ASCII"
> menu option under "Text".
>
> I googled some on this, but nothing obvious came up that wasn't specific to
> fixing one or a few characters.
>
> Thanks!
> Nick

Hi,
Depending on the circumstances, there are probably more sophisticated
ways (what exactly does "fix the characters" mean?), but do you perhaps
mean something like:
>>> u"aáčbüêcîßd".encode("ascii", "ignore")
'abcd'
? It might be important to ensure that you don't lose any useful
information; where are the unexpected characters coming from, and
could the problem perhaps be fixed at that source?
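
If "fix" means "keep the base letter and drop only the accent", one
possible approach (a rough sketch, not something tested against your
data) is to decompose the text with unicodedata.normalize() before
encoding. The helper name to_ascii is made up for this illustration, and
the code is written for Python 3, where encode() returns bytes, hence the
extra decode(). Characters without an ASCII-compatible decomposition,
such as "ß", are still dropped silently.

```python
import unicodedata


def to_ascii(text):
    """Return an ASCII-only approximation of `text`.

    Hypothetical helper for illustration: NFKD normalization splits
    accented letters into a base letter plus combining marks, and the
    "ignore" error handler then discards everything outside ASCII.
    """
    decomposed = unicodedata.normalize("NFKD", text)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("aáčbüêcîßd"))  # prints: aacbuecid
```

For a whole file you would first have to decode it using its real
encoding (which you need to find out; guessing wrong produces garbage
before this step even runs), apply the function, and write the result
back.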

hth,
  vbr


