Scientists, I am looking for examples of obscure text file formats. These files are often generated by scientific instruments to be read by their proprietary software. Or they might come from a program not intended to be machine readable. These examples are for some motivation slides for my SciPy conference talk [1] on Parsita. If you have an example, please send it my way. If you don't want to send a whole file, enough text to fill a slide is sufficient. Thanks so much! David Hagen [1] https://www.scipy2021.scipy.org/schedule
Hi David, are you interested in issues with text encoding (especially accented characters), or only in data localisation? Is it only "text", or binary data as well? One of the instruments I use generates binary files for which there is no tool suited to present-day use, so I had to reverse-engineer the format myself. Yours, Olivier.
On Sat, 26 Jun 2021 08:31:59 -0400 David Hagen <david@drhagen.com> wrote:
> Scientists,
> I am looking for examples of obscure text file formats. These files are often generated by scientific instruments to be read by their proprietary software. Or they might come from a program not intended to be machine readable. These examples are for some motivation slides for my SciPy conference talk [1] on Parsita.
> If you have an example, please send it my way. If you don't want to send a whole file, enough text to fill a slide is sufficient.
> Thanks so much!
> David Hagen
> [1] https://www.scipy2021.scipy.org/schedule
> _______________________________________________
> SciPy-User mailing list
> SciPy-User@python.org
> https://mail.python.org/mailman/listinfo/scipy-user
-- Olivier Crouzet, PhD http://olivier.ghostinthemachine.space /Maître de Conférences/ @LLING - Laboratoire de Linguistique de Nantes UMR6310 CNRS / Université de Nantes
> are you interested in issues with text encoding
No, just in (presumably ASCII) text that someone might want to parse into a Python object. Like a JCAMP-DX file [1], if there were not already a JCAMP-DX parser on PyPI [2].
> Is it only "text" or binary data as well
Only text is useful for my immediate purpose, because I want to show it on a slide. However, Parsita can be used to write parsers for byte strings as well.
[1] http://www.chm.bris.ac.uk/~paulmay/temp/pcc/jcamp.htm
[2] https://pypi.org/project/jcamp/
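For readers who have not seen the format David mentions, here is a minimal sketch of reading a JCAMP-DX-style file with only the Python standard library. It is not Parsita code and not the `jcamp` package from PyPI. The sample text is made up for illustration, and the names `SAMPLE`, `LDR`, and `parse_jcamp_like` are hypothetical. The sketch assumes plain whitespace-separated numbers in the data block; real JCAMP-DX files may use compressed number encodings that it does not handle.

```python
import re

# Made-up sample in the style of a JCAMP-DX file (not from a real instrument).
SAMPLE = """\
##TITLE= Example spectrum
##JCAMP-DX= 4.24
##DATA TYPE= INFRARED SPECTRUM
##XUNITS= 1/CM
##YUNITS= TRANSMITTANCE
##XYDATA= (X++(Y..Y))
400.0 0.91 0.90 0.88
403.0 0.85 0.83 0.80
##END=
"""

# A labeled data record looks like "##LABEL= value".
LDR = re.compile(r"^##(?P<label>[^=]*)=\s*(?P<value>.*)$")


def parse_jcamp_like(text):
    """Return (header dict, list of numeric rows) from JCAMP-DX-style text."""
    header = {}
    rows = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        match = LDR.match(line)
        if match:
            label = match.group("label").strip().upper()
            if label == "END":
                break  # stop at the end-of-block record
            header[label] = match.group("value").strip()
        else:
            # Assumption: data lines hold plain whitespace-separated floats.
            rows.append([float(token) for token in line.split()])
    return header, rows


header, rows = parse_jcamp_like(SAMPLE)
print(header["TITLE"], len(rows))
```

Each data row here starts with an X value followed by several Y values, which is what the `(X++(Y..Y))` marker in the sample indicates.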
Hi David,
The JCAMP files look pretty simple to me, with a well-formatted header. If a new instrument came with such a file, I would think "great, I don't have to spend hours writing speciality code to handle its output". Is the idea to be able to actually parse files into a usable data structure, including extracting data from the header?
Some of the "obscure" data formats I deal with come from custom, complex instruments (beamlines) around the world with home-built data collection systems. They aren't proprietary or deliberately obfuscated; it just turns out that there is a lot of variety. There have been efforts to standardize even the ASCII-only data files. Examples of real files are at https://github.com/xraypy/xraylarch/tree/master/examples/xafsdata/beamlines
Just to be clear, parsing those headers at least enough to get a sensible guess for the name of each column would be an important part of the goal. Getting "just" the table of numbers is not a problem. I have solutions for that, but I'd be interested to see what you might come up with.
If you are looking for a real challenge, the CIF format for Crystallographic Information (see https://www.iucr.org/resources/cif) would almost certainly provide one. It uses a primitive ASCII encoding for multiple tables, basically a flat file where YAML, JSON, or even XML or SQLite3 would (now) make much more sense. For people dealing with atomic structures of crystals, this format is not obscure. There are several existing parsers, including in Python, and many software tools work with this format. A real example would look like http://rruff.geo.arizona.edu/AMS/CIF_text_files/07779_cif.txt with many more examples at http://rruff.geo.arizona.edu/AMS/amcsd.php and https://www.crystallography.net/cod/
--Matt
-- --Matt Newville <newville at cars.uchicago.edu> 630-327-7411
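The goal Matt describes, reading a header well enough to guess a name for each column, can be sketched roughly as follows. This is only an illustration with the Python standard library. The sample layout ('#'-prefixed header lines, "key: value" metadata, and a final comment line listing the column labels) is an assumption made for the example and is not the format of any particular beamline. The names `SAMPLE` and `read_columns` are hypothetical.

```python
# Hypothetical beamline-style file: '#' header lines, the last of which
# lists the column labels. The values are made up.
SAMPLE = """\
# Beamline: hypothetical-01
# Element: Cu
# Edge: K
#------------------------
# energy  i0  itrans
8800.0  105432.0  53211.0
8805.0  105120.0  52987.0
"""


def read_columns(text):
    """Return (metadata dict, column labels, numeric rows)."""
    header_lines = []
    data = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("#"):
            header_lines.append(stripped.lstrip("#").strip())
        else:
            data.append([float(token) for token in stripped.split()])

    # Collect "key: value" header entries as metadata.
    metadata = {}
    for entry in header_lines:
        if ":" in entry:
            key, _, value = entry.partition(":")
            metadata[key.strip()] = value.strip()

    # Guess the labels: walking up from the bottom of the header, take the
    # first line with as many words as there are data columns.
    ncols = len(data[0]) if data else 0
    labels = None
    for entry in reversed(header_lines):
        words = entry.split()
        if ":" not in entry and len(words) == ncols:
            labels = words
            break
    if labels is None:
        # Fall back to generic names when no plausible label line exists.
        labels = [f"col{i + 1}" for i in range(ncols)]
    return metadata, labels, data


metadata, labels, data = read_columns(SAMPLE)
print(labels)
```

The heuristic is deliberately crude. Real files vary in comment characters, label separators, and units embedded in labels, which is the variety Matt refers to.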
Hi David,
At the top of https://docs.astropy.org/en/stable/io/ascii/index.html you will find a number of formats that are common, or at least used, in astronomy. Several of these are not easy for a machine to read. CDS stands out as a format that is both quite difficult to fully machine parse and widely used via a popular astronomical catalog data server (http://webviz.u-strasbg.fr/viz-bin/VizieR). Another tricky one is QDP (see the example in https://docs.astropy.org/en/latest/_modules/astropy/io/ascii/qdp.html#QDP). This is more obscure but is still being output as a data product for a NASA mission.
Cheers,
Tom
PDB, for protein structures and biomolecules, is meant to be machine readable, but it is a mess, with lots of special cases and variations. The common pain points I experienced were variations (for example, a given atom may have several alternate positions) and the hierarchical nature of the molecule, which is not well represented in the file format (proteins are composed of chains, which are made of amino acids, which are formed by atoms).
http://www.wwpdb.org/documentation/file-format-content/format33/v3.3.html
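The two pain points above, alternate positions and the chain / residue / atom hierarchy, can be illustrated with a small sketch. It reads ATOM records by fixed column positions, following the column layout in the linked v3.3 format guide, and rebuilds the hierarchy as nested dictionaries. The coordinates are made up, the names `SAMPLE`, `parse_atom`, and `build_hierarchy` are hypothetical, and only a subset of the fields is read.

```python
# Four made-up ATOM records in PDB fixed-column layout. Atom CA of residue
# 10 in chain A has two alternate locations, A and B.
SAMPLE = """\
ATOM      1  N   SER A  10      11.104   6.134  -6.504  1.00 15.20
ATOM      2  CA ASER A  10      11.639   6.071  -5.147  0.60 14.80
ATOM      3  CA BSER A  10      11.702   6.210  -5.090  0.40 15.10
ATOM      4  N   GLY B   1      20.154  10.003   2.471  1.00 20.00
"""


def parse_atom(line):
    """Slice one ATOM/HETATM record by column position (0-based slices)."""
    return {
        "serial": int(line[6:11]),
        "name": line[12:16].strip(),
        "altloc": line[16].strip(),  # empty string when there is no altloc
        "resname": line[17:20].strip(),
        "chain": line[21],
        "resseq": int(line[22:26]),
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
        "occupancy": float(line[54:60]),
    }


def build_hierarchy(text):
    """Group atoms as chain -> (resseq, resname) -> atom name -> [records]."""
    chains = {}
    for line in text.splitlines():
        if not line.startswith(("ATOM", "HETATM")):
            continue  # skip every other record type in this sketch
        atom = parse_atom(line)
        residues = chains.setdefault(atom["chain"], {})
        atoms = residues.setdefault((atom["resseq"], atom["resname"]), {})
        # Alternate locations share an atom name, so keep a list per name.
        atoms.setdefault(atom["name"], []).append(atom)
    return chains


chains = build_hierarchy(SAMPLE)
print(sorted(chains))
```

Keeping a list per atom name is one possible way to hold alternate positions. A real reader would also need to handle insertion codes, multiple models, and the many other record types.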
SerialEM is used on many TEM microscopes. It has the .idoc and .log files. I'm not sure if it is obscure enough, but if you are looking for random data sets it may help.
idoc: http://internal.connectomes.utah.edu/RC2/TEM/1445/TEM/1445.idoc
log: http://internal.connectomes.utah.edu/RC2/TEM/1445/TEM/1445.log
Context, if needed, and many links to other .idoc and .log examples. The stats and plots are all generated from data contained in the .idoc and .log files.
Have Fun,
James
-----Original Message-----
From: SciPy-User <scipy-user-bounces+james.r.anderson=utah.edu@python.org> On Behalf Of David Hagen
Sent: Saturday, June 26, 2021 5:32 AM
To: SciPy Users List <scipy-user@python.org>
Subject: [SciPy-User] Request for obscure text formats
Argh, forgot to link context site... do over!
SerialEM is used on many TEM microscopes. It has the .idoc and .log files. I'm not sure if it is obscure enough, but if you are looking for random data sets it may help.
idoc: http://internal.connectomes.utah.edu/RC2/TEM/1445/TEM/1445.idoc
log: http://internal.connectomes.utah.edu/RC2/TEM/1445/TEM/1445.log
Context, if needed, and many links to other .idoc and .log examples. The stats and plots are all generated from data contained in the .idoc and .log files.
http://storage1.connectomes.utah.edu/RC2/VolumeReport.html
Have Fun,
James
meshio [0] reads many (unstructured, particularly finite element) mesh file formats. Some of these are binary but many are ASCII or contain ASCII variants. Gmsh's MSH formats (versions 2.2 & 4.1) [1] have provided various challenges over the years.
[0] https://pypi.org/project/meshio
[1] https://gmsh.info/doc/texinfo/gmsh.html#File-formats
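As a rough illustration of the ASCII MSH 2.2 layout mentioned above, the sketch below splits a file into its `$Section` ... `$EndSection` blocks and reads the nodes and elements. It uses only the Python standard library and is not meshio's implementation. The tiny one-triangle mesh is made up, and the names `SAMPLE`, `split_sections`, and `parse_msh22` are hypothetical. Binary files and the 4.1 layout are not handled.

```python
# Made-up one-triangle mesh in ASCII MSH 2.2 layout.
SAMPLE = """\
$MeshFormat
2.2 0 8
$EndMeshFormat
$Nodes
3
1 0.0 0.0 0.0
2 1.0 0.0 0.0
3 0.0 1.0 0.0
$EndNodes
$Elements
1
1 2 2 0 1 1 2 3
$EndElements
"""


def split_sections(text):
    """Map each section name to the list of lines between $Name and $EndName."""
    sections = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("$End"):
            name = None
        elif line.startswith("$"):
            name = line[1:]
            sections[name] = []
        elif name is not None:
            sections[name].append(line)
    return sections


def parse_msh22(text):
    """Return (version string, node dict, element list) for ASCII MSH 2.2."""
    sections = split_sections(text)
    version, file_type, _data_size = sections["MeshFormat"][0].split()
    if file_type != "0":
        raise ValueError("this sketch only reads ASCII files (file-type 0)")

    node_lines = sections["Nodes"]
    nodes = {}
    for line in node_lines[1:1 + int(node_lines[0])]:
        tag, x, y, z = line.split()
        nodes[int(tag)] = (float(x), float(y), float(z))

    element_lines = sections["Elements"]
    elements = []
    for line in element_lines[1:1 + int(element_lines[0])]:
        fields = [int(field) for field in line.split()]
        number, element_type, ntags = fields[:3]
        elements.append({
            "number": number,
            "type": element_type,  # 2 is a 3-node triangle
            "tags": fields[3:3 + ntags],
            "nodes": fields[3 + ntags:],
        })
    return version, nodes, elements


version, nodes, elements = parse_msh22(SAMPLE)
print(version, len(nodes), len(elements))
```

Each element line carries its own tag count, so the node list starts at a position that depends on the line itself. That is one reason a plain table reader is not enough for this format.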
Participants (7)
- Aldcroft, Thomas
- David Hagen
- David Menéndez Hurtado
- G. D. McBain
- James Anderson
- Matt Newville
- Olivier Crouzet