Here's something more interesting than my shlex idea.

os.path is, pretty much, the Python FS toolbox, along with shutil. But, there's one feature missing: check if a file is binary. It isn't hard, see http://code.activestate.com/recipes/173220/. But, writing 50 lines of code for a more common task isn't really Python-ish.

So... What if os.path had a binary checker that works just like isfile:

    os.path.isbinary('/nothingness/is/eternal') # Returns boolean

It's a thought...

-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
On Wed, Jul 31, 2013 at 10:40:03AM -0500, Ryan <rymg19@gmail.com> wrote:
What is a binary file? Would Russian text in koi8-r encoding be considered binary? What about utf-16? UTF16-encoded files have many zero characters. UTF32-encoded have even more. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
On 31 juil. 2013, at 18:02, Oleg Broytman <phd@phdru.name> wrote:
And the recipe linked is worse than that: even with no nul byte, if more than 30% of the file's bytes aren't ASCII it considers the file binary. Files in iso-8859 parts 5 to 8 (Cyrillic, Arabic, Greek and Hebrew) are pretty much guaranteed to be inferred as binary. Part 11 (Thai) as well. UTF-8 for any non-Latin script will also be considered binary, as the high bit is always set when encoding codepoints outside the ASCII range.
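To make the 30% point concrete, here is a minimal sketch. `non_ascii_fraction` is a hypothetical helper, and the 0.30 cutoff is a simplified stand-in for the rule described above, not a copy of the recipe's code.

```python
def non_ascii_fraction(data: bytes) -> float:
    """Return the share of bytes with the high bit set (0.0 for empty input)."""
    if not data:
        return 0.0
    return sum(1 for byte in data if byte >= 0x80) / len(data)


# Assumed cutoff, mirroring the 30% figure quoted above.
THRESHOLD = 0.30

samples = {
    "English, ASCII": "Hello, world".encode("ascii"),
    "Russian, UTF-8": "Привет, мир".encode("utf-8"),
    "Russian, KOI8-R": "Привет, мир".encode("koi8-r"),
}

for label, data in samples.items():
    fraction = non_ascii_fraction(data)
    verdict = "binary" if fraction > THRESHOLD else "text"
    print(f"{label}: {fraction:.0%} non-ASCII -> guessed {verdict}")
```

Under this simplified rule, both Russian samples land well above the cutoff even though they are ordinary text.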
On Wed, Jul 31, 2013 at 8:40 AM, Ryan <rymg19@gmail.com> wrote:
Some time ago I put on a gas mask and dove into the Perl source code to figure out how its "is binary" and "is text" operators work: http://eli.thegreenplace.net/2011/10/19/perls-guess-if-file-is-text-or-binar... I would recommend against including such a simplistic heuristic in the Python stdlib. Eli
On Jul 31, 2013 12:22 PM, "Eli Bendersky" <eliben@gmail.com> wrote:
there's one feature missing: check if a file is binary. It isn't hard, see http://code.activestate.com/recipes/173220/. But, writing 50 lines of code for a more common task isn't really Python-ish.

problem it tries to solve) so difficult is that binary files may contain what is considered to be large amounts of text, and text files may contain pieces of binary data. For example, consider a Windows executable file: much of the data in such a file is considered binary data, but there are defined sections where strings and text resources are stored. Any heuristic algorithm like the one mentioned will be insufficient in such cases. Although I can't think offhand of a situation where the opposite may be true (binary data embedded in what is considered to be a text file), I'm pretty sure such a situation exists.
Some time ago I put on a gas mask and dove into the Perl source code to figure out how its "is binary" and "is text" operators work: http://eli.thegreenplace.net/2011/10/19/perls-guess-if-file-is-text-or-binar... I would recommend against including such a simplistic heuristic in the Python stdlib.
1. The link I provided wasn't how I wanted it to be. I was using it as an example to show it wasn't impossible.
2. You yourself stated it doesn't work on UTF-8 files. If you wanted one that worked on all text files, it wouldn't work right.
3. Did no one get the 'nothingness/is/eternal' joke?

So...although that is a nice piece of code, an os.path implementation would probably be more complete and foolproof.

Eli Bendersky <eliben@gmail.com> wrote:
On Jul 31, 2013, at 12:03, Ryan <rymg19@gmail.com> wrote:
So...although that is a nice piece of code, an os.path implementation would probably be more complete and foolproof.
And because there is no foolproof, or even remotely close to foolproof, way to do it, there can be no os.path implementation.
On 7/31/2013 3:03 PM, Ryan wrote:
1.The link I provided wasn't how I wanted it to be.
And there is no 'one way' that will satisfy everyone, or even most people, as they will have different use cases for 'istext'.
I was using it as an example to show it wasn't impossible.
It is obviously possible to apply any arbitrary predicate to any object within its input domain. No one has claimed otherwise that I know of.
2.You yourself stated it doesn't work on UTF-8 files. If you wanted one that worked on all text files, it wouldn't work right.
The problem is that the problem is ill-defined. Every file is (or can be viewed as) a sequence of binary bytes. Every file can be interpreted as a text file encoded with any of the encodings (like at least some latin-1 encodings, and the IBM PC Graphics encoding) that give a character meaning to every byte. So, to be strict, every file is both binary and text. Python allows us to open any file as either binary or text (with some encoding, with latin-1 one of the possible choices).

The pragmatic question is 'Is this file *likely intended* to be interpreted as text, given that the creator is a member of our *local culture*?' For the function you referenced, the 'local culture' is 'closed Western European'. For 'closed American', the threshold of allowed non-ascii text and control chars should be more like 0 or 1%. For many cultures, the referenced function is nonsensical.

For an open global context, istext would have to try all standard text encodings and, for those that worked, apply the grammar rules of the languages that normally are encoded with that encoding.

-- Terry Jan Reedy
I just realized I misexpressed myself...again. I meant ASCII or binary, not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary. And, the '/nothingness/is/eternal' is a quote from Xemnas in Kingdom Hearts. I was hoping someone would pick it up. Terry Reedy <tjreedy@udel.edu> wrote:
On Jul 31, 2013 8:26 PM, "Ryan" <rymg19@gmail.com> wrote:
I just realized I misexpressed myself...again. I meant ASCII or binary, not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary.
Even so, that raises the question, "Why ASCII? Why not Unicode, or any of the other hundreds of text formats out there?" If this is something to be included in the standard library, a collection used by people from all around the world, the backgrounds of its users should be taken into consideration.
And, the '/nothingness/is/eternal' is a quote from Xemnas in Kingdom Hearts. I was hoping someone would pick it up.
On 01/08/13 10:25, Ryan wrote:
I just realized I misexpressed myself...again. I meant ASCII or binary, not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary.
Still can't be done reliably, but even if it could, what's so special about ASCII? Should we have dozens of such functions: isascii, isbig5, iskoi8u, iskoi8r, and so on?

The concept of "isbinary" is fundamentally flawed. The concept of "try to guess what sort of data a file might plausibly contain" is not flawed, but is a much, much bigger problem than is suitable for a simple os.path function.

-- Steven
What about something like this: https://github.com/ahupp/python-magic And, I explained the joke in my last post. Steven D'Aprano <steve@pearwood.info> wrote:
From: Ryan <rymg19@gmail.com> Sent: Wednesday, July 31, 2013 7:45 PM
What about something like this:
That particular wrapper is just a ctypes wrapper around libmagic.so/.dylib, and not a very portable one (e.g., it's got the path /usr/local/Cellar/libmagic/5.10/ hardcoded into it… which will only work for Mac Homebrew users who are 4 versions/19 months out of date…). Also, note that libmagic already comes with very similar Python ctypes-based bindings.

However, there are a half-dozen other wrappers around libmagic on PyPI, and it's pretty trivial to create a new one. The tricky bit is where to get the libmagic code and data files from. If you want to make it usable on most platforms, you'd need to add the file source distribution to the Python source, build it with Python, and statically link it into a module, à la zlib or sqlite. And, unlike those modules, you'd also need to include a data file (magic.mgc) with the binary distribution.

I slapped together a quick&dirty wrapper to see what the costs are. It adds 640KB to the 14MB source distribution, 300KB to the 91MB binary (64-bit Mac framework build), and under 10 seconds to the build process. There'd be a bit of an extra maintenance burden in tracking updates (the most recent two updates were 21 Mar 2013 and 22 Feb 2013). The code and data are BSD-licensed, which shouldn't be a problem. The library is very portable: "./configure --enable-static; make" worked even on Windows.

So, is it worth adding to Python? I don't know. But it seems at least feasible.
On Wed, Jul 31, 2013 at 10:11 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Still can't be done reliably, but even if it could, what's so special about ASCII?
Lots of things are special about ASCII. It is a 7-bit subset of pretty much every modern encoding scheme. Being 7-bit, it can be fairly reliably distinguished from most binary formats. The same is true of UTF-8. It is very unlikely that a binary dump of a double array makes valid UTF-8 text, and vice versa: UTF-8 text interpreted as a list of doubles is unlikely to produce numbers that are in a reasonable range. I would not mind seeing an "istext()" function somewhere in the stdlib that would only recognize ASCII and UTF-8 as text.
From: Alexander Belopolsky <alexander.belopolsky@gmail.com> Sent: Wednesday, July 31, 2013 7:57 PM
On Wed, Jul 31, 2013 at 10:11 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Still can't be done reliably, but even if it could, what's so special about ASCII?
Lots of things are special about ASCII. It is a 7-bit subset of pretty much every modern encoding scheme. Being 7-bit, it can be fairly reliably distinguished from most binary formats. The same is true of UTF-8. It is very unlikely that a binary dump of a double array makes valid UTF-8 text, and vice versa: UTF-8 text interpreted as a list of doubles is unlikely to produce numbers that are in a reasonable range.
I would not mind seeing an "istext()" function somewhere in the stdlib that would only recognize ASCII and UTF-8 as text.
Plenty of files in popular charsets are actually perfectly valid UTF-8, but garbage when read that way. This and its converse are probably the most common cause of mojibake problems people have today. (I don't know if you can search Stack Overflow for problems with "Ã" in the description, but if you can, it'll be illuminating.) Do you really want a function that sorts half your Latin-1 files into "UTF-8 text files" that are unreadable garbage and the other half into "binary files"?

Also, while ASCII is much simpler and more robust to detect, it's not nearly as useful as it used to be. We don't have to deal with 7-bit data channels very often nowadays… and when you do, do you really want to treat pickle format 0 or base-64 or RTF as "text"? Meanwhile, text-processing code that only handles ASCII is generally considered broken.

Anyway, if you want that "istext()" function, it's trivial to write it yourself:

    def istext(b):
        try:
            b.decode('utf-8')
        except UnicodeDecodeError:
            return False
        else:
            return True

(There's no reason to try 'ascii', because any ASCII-decodable text is also UTF-8-decodable.)

And really, since you're usually going to do something like this:

    if istext(b):
        dotextstuff(b)
    else:
        dobinarystuff(b)

… you're probably better off following EAFP and just doing this:

    try:
        dotextstuff(b)
    except UnicodeDecodeError:
        dobinarystuff(b)
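As a small illustration of the Latin-1/UTF-8 confusion described above, the sketch below reuses the istext() idea from this message. The sample strings are made up for the example; only standard str/bytes methods are used.

```python
def istext(b: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 (same idea as in the message above)."""
    try:
        b.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# UTF-8 bytes read with the wrong codec: the classic mojibake.
utf8_bytes = "é".encode("utf-8")         # b'\xc3\xa9'
mojibake = utf8_bytes.decode("latin-1")
print(mojibake)                          # prints: Ã©

# The converse: Latin-1 files. Some happen to be valid UTF-8, some do not.
latin1_valid_utf8 = "Ã©".encode("latin-1")      # b'\xc3\xa9', decodes as UTF-8
latin1_invalid_utf8 = "café".encode("latin-1")  # b'caf\xe9', does not
print(istext(latin1_valid_utf8), istext(latin1_invalid_utf8))
```

Both Latin-1 samples are legitimate text in their own encoding, yet a UTF-8-only check accepts one (and would show it as a different character) and rejects the other.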
Andrew Barnert writes:
Plenty of files in popular charsets are actually perfectly valid UTF-8,
FVO "popular charset" in {ASCII} or "plenty of files" in "len(file) < 1KB", yes. Otherwise, see below.
but garbage when read that way. This and its converse are
The converse *is* a problem, because the ISO 8859 family (and even more so the Windows 125x family) basically use up all the bytes.
probably the most common cause of mojibake problems people have today.
Actually the most common cause in my experience is Apache or MUA configuration of a default charset and/or fallback to Latin-1 for files actually written in UTF-8, combined with conformant browsers and MUAs that respect transport-level defaults or protocol defaults rather than try to detect the charset. Viz:
(I don't know if you can search Stack Overflow for problems with "Ã" in the description, but if you can, it'll be illuminating.)
But:
Yes, indeedy! Just because those algorithms exist doesn't mean it's a good idea to use them (outside of some interactive applications like text editors where the user can look at the mojibake and tell the editor either the right encoding or to try another guess).
On 01/08/13 05:03, Ryan wrote:
1.The link I provided wasn't how I wanted it to be. I was using it as an example to show it wasn't impossible.
But it *is* impossible even in principle to tell the difference between "text" and "binary", since both text and binary files are made up of the same bytes. Whether something is text or binary depends in part on the intention of the reader. E.g. a text file containing the ASCII string "Greetings and salutations Ryan\r\n" is bit-for-bit identical with a binary file containing four C doubles:

    1.6937577544703708e+190
    2.6890193974129695e+161
    9.083672029092351e+223
    2.9908963169274674e-260

So any such "is binary" function cannot determine whether a file actually is binary or not. The best it can do is "might be text".

That perhaps leads to a less bad (although maybe not actually good) idea: a function which takes an encoding and tries to determine whether or not the contents of the file could be text in that encoding. But really, file type guessing is too complex to be a simple function like "isbinary" or even "maybetext".
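The double example can be reproduced with the standard struct module. This is a sketch under one assumption not stated in the message: that the doubles are little-endian IEEE 754 values, as on typical x86 machines.

```python
import math
import struct

text = "Greetings and salutations Ryan\r\n"
data = text.encode("ascii")  # 32 bytes, i.e. exactly four 8-byte doubles

# Assumption: little-endian byte order ("<"); "4d" means four C doubles.
values = struct.unpack("<4d", data)

for value in values:
    print(repr(value))

# Packing the doubles again yields the original text bytes.
print(struct.pack("<4d", *values) == data)
```

The same 32 bytes are a greeting or four floating-point numbers depending only on how the reader chooses to interpret them.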
3.Did no one get the 'nothingness/is/eternal' joke?
Not me. -- Steven
On Wed, Jul 31, 2013 at 4:40 PM, Ryan <rymg19@gmail.com> wrote:
Going right back to the beginning here. Suppose this were deemed useful. Why should it be in os.path? Nothing else there, as far as I know, looks at the *contents* of a file. Everything's looking at directory entries, sometimes not even that (eg os.path.basename is pure string manipulation). I should be able to getctime() on a file even without permission to read it. I can't see whether it's binary or text without read permission. This sounds more like a job for a file-like object, maybe a subclass of file that reads (and buffers) the first 512 bytes, guesses whether it's text or binary, and then watches everything that goes through after that and revises its guess later on. And then the question becomes: How useful would that be? But mainly, I think it's only going to cause problems to have a potentially expensive operation stuck away with the very cheap operations in os.path. ChrisA
On Wed, Jul 31, 2013 at 5:42 PM, Chris Angelico <rosuav@gmail.com> wrote:
Something like:

    if fh.read(512).isprintable():
        do_the_ascii_stuff(fh)
    else:
        do_the_bin_stuff(fh)

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
That's a pretty good idea. Or it could be like this:

    if fh.printable():

It would have an optional argument: the number of bytes to read in. Default is 512. So, if we wanted 1024 bytes instead of 512:

    if fh.printable(1024):

David Mertz <mertz@gnosis.cx> wrote:
I actually got something simple working. It's only been tested on Android:

    import re

    def isbinary(fpath):
        with open(fpath, 'r') as f:
            data = re.sub(r'(^(\'|\")|(\'|\")$)', '', repr(f.read()).replace('\\n', '\n'))
            binchars = re.findall(r'\\x[0123456789abcdef]{2}', data)
            per = (float(len(binchars)) / float(len(data))) * 100
            if int(per) == 0:
                return True
            else:
                return False

Ryan <rymg19@gmail.com> wrote:
well, the only remotely valid thing to do is to test if the input data is decodable with any of the encodings python knows. if we do some arbitrary threshold, we only get bugs like fedora’s current release name “Schrödinger's Cat” being considered “not text”. i’d never write code like this<https://github.com/hwoarang/libreport/blob/master/src/lib/problem_data.c#L27...> . PS: why do people still convert stuff to float? we live in python3 world, where 1/2 is 0.5
On 2013-08-12, at 15:42 , Philipp A. wrote:
well, the only remotely valid thing to do is to test if the input data is decodable with any of the encodings python knows.
Most iso-8859 parts can decode any byte (and thus any byte sequence). Parts 3, 6, 7, 8 and 11 are the only ones not to be defined across all of the [128, 255] range (they're ascii extensions so the [0, 127] range is identical to ascii in all iso-8859 parts)
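A quick way to see this in Python is to try decoding all 256 byte values with a given codec. `decodes_every_byte` is a hypothetical helper written for this note; the codec names are the ones Python's standard library uses for ISO 8859-1 and ISO 8859-6.

```python
def decodes_every_byte(encoding: str) -> bool:
    """Return True if the codec assigns a character to each of the 256 byte values."""
    all_bytes = bytes(range(256))
    try:
        all_bytes.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


for name in ("latin-1", "iso8859_6"):
    print(name, decodes_every_byte(name))
```

Because a codec such as latin-1 accepts every byte sequence, "decodable with some known encoding" is true of any file at all, so it cannot separate text from binary on its own.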
On Mon, Aug 12, 2013, at 10:32, Mathias Panzenböck wrote:
It depends on precisely what is meant by "iso-8859 parts" - and the same with any other character in 0-32 or 127-159 (there is nothing special about the null byte in this regard). But it's typical to think of "iso-8859" encodings as being more like IANA ISO-8859-1, which combines ISO/IEC 8859-1 with the control character definitions from ISO 6429.
On Mon, Aug 12, 2013 at 3:32 PM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
I've often used the presence of a NUL in the data as a simple heuristic for "binary file", though only in places where it won't matter (for instance, showing file size in bytes rather than line count - if a binary file happens to have no \0 and its number of \n gets counted, big deal). Otherwise, not worth the hassle of finding out. ChrisA
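The NUL heuristic is a one-liner; the sketch below (with a hypothetical `looks_binary` helper) also shows the UTF-16 caveat raised earlier in the thread.

```python
def looks_binary(data: bytes) -> bool:
    """Cheap guess: treat data as binary if it contains a NUL byte."""
    return b"\x00" in data


print(looks_binary("plain ASCII text\n".encode("ascii")))  # no NUL byte
print(looks_binary("Привет".encode("koi8-r")))             # no NUL byte
print(looks_binary("hello".encode("utf-16-le")))           # every other byte is NUL
```

UTF-16 text trips the check, which is why it only suits places where a wrong guess does no harm.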
On Thu, Aug 1, 2013 at 2:47 PM, MRAB <python@mrabarnett.plus.com> wrote:
Doesn't that seem like a bug:

    -----
    Help on method_descriptor:

    isprintable(...)
        S.isprintable() -> bool

        Return True if all characters in S are considered
        printable in repr() or S is empty, False otherwise.
    -----

In what sense is "\n" "not printable in repr()"?!
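The behaviour in question is easy to check. In the sketch below the sample strings are made up; `str.isprintable()` and `repr()` are the only things exercised.

```python
samples = ["plain text", "two\nlines", "tab\tseparated", ""]

# Map each sample to what str.isprintable() reports for it.
results = {sample: sample.isprintable() for sample in samples}

for sample, printable in results.items():
    # repr() shows "\n" and "\t" as escape sequences rather than emitting them
    # directly, which is the sense of "printable in repr()" in the docstring.
    print(repr(sample), printable)
```

Any chunk of a text file that contains a line break therefore reports False, so isprintable() on the first 512 characters would classify most ordinary text files as non-text.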
I see... This works:

    import re

    with open('test.xml', 'r') as f:
        print re.sub(r'(^(\'|\")|(\'|\")$)', '', repr(f.read()).replace('\\n', '\n'))

But, repr can still print binary characters. Opening libexpat.so shows all sorts of crazy characters like \x00.

random832@fastmail.us wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
data:image/s3,"s3://crabby-images/22d89/22d89c5ecab2a98313d3033bdfc2cc2777a2e265" alt=""
On Wed, Jul 31, 2013 at 10:40:03AM -0500, Ryan <rymg19@gmail.com> wrote:
What is a binary file? Would Russian text in koi8-r encoding be considered binary? What about utf-16? UTF16-encoded files have many zero characters. UTF32-encoded have even more. Oleg. -- Oleg Broytman http://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.
data:image/s3,"s3://crabby-images/4217a/4217a515224212b2ea36411402cf9d76744a5025" alt=""
On 31 juil. 2013, at 18:02, Oleg Broytman <phd@phdru.name> wrote:
And the recipe linked is worse than that: even with no nul byte, if more than 30% of the files's bytes aren't ASCII it considers the file binary. Files in iso-8859 parts 5 to 8 (Cyrillic, Arabic, Greek and Hebrew) are pretty much guaranteed to be inferred as binary. Part 11 (Thai) as well. UTF-8 for any non-Latin script will also be considered binary as the high bit is always set when encoding codepoints outside the ASCII range.
data:image/s3,"s3://crabby-images/33866/338662e5c8c36c53d24ab18f081cc3f7f9ce8b18" alt=""
On Wed, Jul 31, 2013 at 8:40 AM, Ryan <rymg19@gmail.com> wrote:
Some time ago I put on a gas mask and dove into the Perl source code to figure out how its "is binary" and "is text" operators work: http://eli.thegreenplace.net/2011/10/19/perls-guess-if-file-is-text-or-binar... I would recommend against including such a simplistic heuristic in the Python stdlib. Eli
data:image/s3,"s3://crabby-images/7a5f7/7a5f75552015b74c8f6e42eb350ba66b43c9a474" alt=""
On Jul 31, 2013 12:22 PM, "Eli Bendersky" <eliben@gmail.com> wrote:
there's one feature missing: check if a file is binary. It isn't hard, see http://code.activestate.com/recipes/173220/. But, writing 50 lines of code for a more common task isn't really Python-ish. problem it tries to solve) so so difficult is that binary files may contain what is considered to be large amounts of text, and text files may contain pieces of binary data. For example, consider a windows executable file - Much of the data in such a file is considered binary data, but there are defined sections where strings and text resources are stored. Any heuristic algorithm like the one mentioned will be insufficient in such cases. Although I can't think of a situation off hand where the opposite may be true (binary data embedded in what is considered to be a text file) I'm pretty sure such a situation exists.
Some time ago I put on a gas mask and dove into the Perl source code to
figure out how its "is binary" and "is text" operators work: http://eli.thegreenplace.net/2011/10/19/perls-guess-if-file-is-text-or-binar...
I would recommend against including such a simplistic heuristic in the
Python stdlib.
data:image/s3,"s3://crabby-images/52bd8/52bd80b85ad23b22cd55e442f406b4f3ee8efd9f" alt=""
1.The link I provided wasn't how I wanted it to be. I was using it as an example to show it wasn't impossible. 2.You yourself stated it doesn't work on UTF-8 files. If you wanted one that worked on all text files, it wouldn't work right. 3.Did no one get the 'nothingness/is/eternal' joke? So...although that is a nice piece of code, an os.path implementation would probably be more complete and foolproof. Eli Bendersky <eliben@gmail.com> wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
On Jul 31, 2013, at 12:03, Ryan <rymg19@gmail.com> wrote:
So...although that is a nice piece of code, an os.path implementation would probably be more complete and foolproof.
And because there is no foolproof, or even remotely close to foolproof, way to do it, there can be no os.path implementation.
data:image/s3,"s3://crabby-images/e2594/e259423d3f20857071589262f2cb6e7688fbc5bf" alt=""
On 7/31/2013 3:03 PM, Ryan wrote:
1.The link I provided wasn't how I wanted it to be.
And there is no 'one way' that will satisfy everyone, or every most people, as they will have different use cases for 'istext'.
I was using it as an example to show it wasn't impossible.
It is obviously possible to apply any arbitrary predicate to any object within its input domain. No one has claimed otherwise that I know of.
2.You yourself stated it doesn't work on UTF-8 files. If you wanted one that worked on all text files, it wouldn't work right.
The problem is that the problem is ill-defined. Every file is (or can be viewed as) a sequence of binary bytes. Every file can be interpreted as a text file encoded with any of the encodings (like at least some latin-1 encodings, and the IBM PC Graphics encoding) that give a character meaning to every byte. So, to be strict, every file is both binary and text. Python allows us to open any file as either binary or text (with some encoding, with latin-1 one of the possible choices). The pragmatic question is 'Is this file 'likely' *intended* to be interpreted as text, given that the creator is a member of our *local culture*. For the function you referenced, the 'local culture' is 'closed Western European'. For 'closed American', the threshold of allowed non-ascii text and control chars should be more like 0 or 1%. For many cultures, the referenced function is nonsensical. For an open global context, istext would have to try all standard text encodings and for those that worked, apply the grammar rules of the languages that normally are encoded with that encoding. -- Terry Jan Reedy
data:image/s3,"s3://crabby-images/52bd8/52bd80b85ad23b22cd55e442f406b4f3ee8efd9f" alt=""
I just realized I misexpressed myself...again. I meant ASCII or binary, not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary. And, the '/nothingness/is/eternal' is a quote from Xemnas in Kingdom Hearts. I was hoping someone would pick it up. Terry Reedy <tjreedy@udel.edu> wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
data:image/s3,"s3://crabby-images/7a5f7/7a5f75552015b74c8f6e42eb350ba66b43c9a474" alt=""
On Jul 31, 2013 8:26 PM, "Ryan" <rymg19@gmail.com> wrote:
I just realized I misexpressed myself...again. I meant ASCII or binary,
not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary. Even so, that raises the question, "Why ASCII? why not Unicode, or any of the other hundreds of text formats out there?" If this is something to be included into the standard library, a collection used by people from all around the world, some forethought into the backgrounds of it's users should be taken into consideration.
And, the '/nothingness/is/eternal' is a quote from Xemnas in Kingdom
Hearts. I was hoping someone would pick it up.
data:image/s3,"s3://crabby-images/6a9ad/6a9ad89a7f4504fbd33d703f493bf92e3c0cc9a9" alt=""
On 01/08/13 10:25, Ryan wrote:
I just realized I misexpressed myself...again. I meant ASCII or binary, not text or binary. Kind of like the old FTP programs. The implementation would determine if it was ASCII or binary.
Still can't be done reliably, but even if it could, what's so special about ASCII? Should we have dozens of such functions? isascii isbig5 iskoi8u iskoi8r and so on? The concept of "isbinary" is fundamentally flawed. The concept of "try to guess what sort of data a file might plausibly contain" is not flawed, but is a much, much bigger problem than is suitable for a simple os.path function. -- Steven
data:image/s3,"s3://crabby-images/52bd8/52bd80b85ad23b22cd55e442f406b4f3ee8efd9f" alt=""
What about something like this: https://github.com/ahupp/python-magic And, I explained the joke in my last post. Steven D'Aprano <steve@pearwood.info> wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
data:image/s3,"s3://crabby-images/d224a/d224ab3da731972caafa44e7a54f4f72b0b77e81" alt=""
From: Ryan <rymg19@gmail.com> Sent: Wednesday, July 31, 2013 7:45 PM
What about something like this:
That particular wrapper is just a ctypes wrapper libmagic.so/.dylib, and not a very portable one (e.g., it's got the path /usr/local/Cellar/libmagic/5.10/ hardcoded into it… which will only work for Mac Homebrew users who are 4 versions/19 months out of date…). Also, note that libmagic already comes with very similar Python ctypes-based bindings. However, there are a half-dozen other wrappers around libmagic on PyPI, and it's pretty trivial to create a new one. The tricky bit is where to get the libmagic code and data files from. If you want to make it usable on most platforms, you'd need to add the file source distribution to the Python source, build it with Python, and statically link it into a module, ala zlib or sqlite. And, unlike those modules, you'd also need to include a data file (magic.mgc) with the binary distribution. I slapped together a quick&dirty wrapper to see what the costs are. It adds 640KB to the 14MB source distribution, 300KB to the 91MB binary (64-bit Mac framework build), and under 10 seconds to the build process. There'd be a bit of an extra maintenance burden in tracking updates (the most recent two updates were 21 Mar 2013 and 22 Feb 2013). The code and data are BSD-licensed, which shouldn't be a problem. The library is very portable: "./configure --enable-static; make" worked even on Windows. So, is it worth adding to Python? I don't know. But it seems at least feasible.
data:image/s3,"s3://crabby-images/69c89/69c89f17a2d4745383b8cc58f8ceebca52d78bb7" alt=""
On Wed, Jul 31, 2013 at 10:11 PM, Steven D'Aprano <steve@pearwood.info>wrote:
Still can't be done reliably, but even if it could, what's so special about ASCII?
Lots of things are special about ASCII. It is a 7-bit subset of pretty much every modern encoding scheme. Being 7-bit, it can be fairly reliably distinguished from most binary formats. Same is true about UTF-8. It is very unlikely that a binary dump of a double array make a valid UTF-8 text and vice versa - UTF-8 text interpreted as a list of doubles is unlikely to produce numbers that are in a reasonable range. I would not mind seeing an "istext()" function somewhere in the stdlib that would only recognize ASCII and UTF-8 as text.
From: Alexander Belopolsky <alexander.belopolsky@gmail.com> Sent: Wednesday, July 31, 2013 7:57 PM
On Wed, Jul 31, 2013 at 10:11 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Still can't be done reliably, but even if it could, what's so special about ASCII?
Lots of things are special about ASCII. It is a 7-bit subset of pretty much every modern encoding scheme. Being 7-bit, it can be fairly reliably distinguished from most binary formats. The same is true of UTF-8. It is very unlikely that a binary dump of a double array would make valid UTF-8 text and vice versa - UTF-8 text interpreted as a list of doubles is unlikely to produce numbers that are in a reasonable range.
I would not mind seeing an "istext()" function somewhere in the stdlib that would only recognize ASCII and UTF-8 as text.
Plenty of files in popular charsets are actually perfectly valid UTF-8, but garbage when read that way. This and its converse are probably the most common cause of mojibake problems people have today. (I don't know if you can search Stack Overflow for problems with "Ã" in the description, but if you can, it'll be illuminating.) Do you really want a function that sorts half your Latin-1 files into "UTF-8 text files" that are unreadable garbage and the other half into "binary files"?

Also, while ASCII is much simpler and more robust to detect, it's not nearly as useful as it used to be. We don't have to deal with 7-bit data channels very often nowadays… and when you do, do you really want to treat pickle format 0 or base-64 or RTF as "text"? Meanwhile, text-processing code that only handles ASCII is generally considered broken.

Anyway, if you want that "istext()" function, it's trivial to write it yourself:

    def istext(b):
        try:
            b.decode('utf-8')
        except UnicodeDecodeError:
            return False
        else:
            return True

(There's no reason to try 'ascii', because any ASCII-decodable text is also UTF-8-decodable.)

And really, since you're usually going to do something like this:

    if istext(b):
        dotextstuff(b)
    else:
        dobinarystuff(b)

… you're probably better off following EAFP and just doing this:

    try:
        dotextstuff(b)
    except UnicodeDecodeError:
        dobinarystuff(b)
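The "Ã" symptom mentioned in this message is easy to reproduce. Here is a minimal sketch; the sample string is an assumption picked for illustration and is not taken from the thread. It shows UTF-8 bytes being misread as Latin-1, and then checks whether the Latin-1 bytes for the same sample happen to be valid UTF-8:

```python
# Illustrative sample; the string itself is an assumption, not from the thread.
original = "é"

# UTF-8 bytes misread as Latin-1: every byte decodes, but the result is mojibake.
utf8_bytes = original.encode("utf-8")
mojibake = utf8_bytes.decode("latin-1")
print(repr(mojibake))

# The other direction: try to read the Latin-1 bytes of the same sample as UTF-8.
latin1_bytes = original.encode("latin-1")
try:
    latin1_bytes.decode("utf-8")
    converse_valid = True
except UnicodeDecodeError:
    converse_valid = False
print(converse_valid)
```

Neither direction raises a warning that the wrong charset was chosen; the first simply produces wrong characters, which is why a decode-based check can only say "might be text in this encoding".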
Andrew Barnert writes:
Plenty of files in popular charsets are actually perfectly valid UTF-8,
FVO "popular charset" in {ASCII} or "plenty of files" in "len(file) < 1KB", yes. Otherwise, see below.
but garbage when read that way. This and its converse are
The converse *is* a problem, because the ISO 8859 family (and even more so the Windows 125x family) basically use up all the bytes.
probably the most common cause of mojibake problems people have today.
Actually the most common cause in my experience is Apache or MUA configuration of a default charset and/or fallback to Latin-1 for files actually written in UTF-8, combined with conformant browsers and MUAs that respect transport-level defaults or protocol defaults rather than try to detect the charset. Viz:
(I don't know if you can search Stack Overflow for problems with "Ã" in the description, but if you can, it'll be illuminating.)
But:
Yes, indeedy! Just because those algorithms exist doesn't mean it's a good idea to use them (outside of some interactive applications like text editors where the user can look at the mojibake and tell the editor either the right encoding or to try another guess).
On 01/08/13 05:03, Ryan wrote:
1. The link I provided wasn't how I wanted it to be. I was using it as an example to show it wasn't impossible.
But it *is* impossible even in principle to tell the difference between "text" and "binary", since both text and binary files are made up of the same bytes. Whether something is text or binary depends in part on the intention of the reader. E.g. a text file containing the ASCII string "Greetings and salutations Ryan\r\n" is bit-for-bit identical with a binary file containing four C doubles:

    1.6937577544703708e+190
    2.6890193974129695e+161
    9.083672029092351e+223
    2.9908963169274674e-260

So any such "is binary" function cannot determine whether a file actually is binary or not. The best it can do is "might be text". That perhaps leads to a less bad (although maybe not actually good) idea, a function which takes an encoding and tries to determine whether or not the contents of the file could be text in that encoding.

But really, file type guessing is too complex to be a simple function like "isbinary" or even "maybetext".
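The double-precision reading of that greeting can be checked with the struct module. This is a small sketch that assumes little-endian IEEE 754 doubles (the byte order is an assumption; the message does not state it):

```python
import struct

text = b"Greetings and salutations Ryan\r\n"   # 32 bytes, i.e. four 8-byte doubles

# Assumption: little-endian byte order ("<"); "4d" means four C doubles.
doubles = struct.unpack("<4d", text)
for value in doubles:
    print(repr(value))

# Packing the same four doubles reproduces the original bytes exactly.
roundtrip = struct.pack("<4d", *doubles)
print(roundtrip == text)
```

The same 32 bytes are a plausible line of ASCII and a plausible array of finite floats, and nothing in the bytes says which reading was intended.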
3. Did no one get the 'nothingness/is/eternal' joke?
Not me. -- Steven
On Wed, Jul 31, 2013 at 4:40 PM, Ryan <rymg19@gmail.com> wrote:
Going right back to the beginning here. Suppose this were deemed useful. Why should it be in os.path? Nothing else there, as far as I know, looks at the *contents* of a file. Everything's looking at directory entries, sometimes not even that (eg os.path.basename is pure string manipulation). I should be able to getctime() on a file even without permission to read it. I can't see whether it's binary or text without read permission. This sounds more like a job for a file-like object, maybe a subclass of file that reads (and buffers) the first 512 bytes, guesses whether it's text or binary, and then watches everything that goes through after that and revises its guess later on. And then the question becomes: How useful would that be? But mainly, I think it's only going to cause problems to have a potentially expensive operation stuck away with the very cheap operations in os.path. ChrisA
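For what it's worth, the file-like object described above might look roughly like this. It is only a sketch: the class name SniffingReader, the maybe_text attribute, and the "no NUL byte seen so far" rule are all assumptions made for illustration, not anything proposed in the thread.

```python
import io


class SniffingReader:
    """Hypothetical wrapper that guesses text/binary from the bytes seen so far."""

    def __init__(self, raw, sample_size=512):
        self._raw = raw
        # Buffer the first bytes so later read() calls still return them.
        self._head = raw.read(sample_size)
        # Assumed heuristic: a NUL byte means "probably binary".
        self.maybe_text = b"\0" not in self._head

    def read(self, size=-1):
        if size is None or size < 0:
            chunk = self._head + self._raw.read()
            self._head = b""
        elif self._head:
            # Serve buffered bytes first (may return fewer than `size` bytes).
            chunk, self._head = self._head[:size], self._head[size:]
        else:
            chunk = self._raw.read(size)
        # Revise the initial guess as more data goes through.
        if b"\0" in chunk:
            self.maybe_text = False
        return chunk


reader = SniffingReader(io.BytesIO(b"a" * 600 + b"\0" + b"b"))
print(reader.maybe_text)   # guess based on the first 512 bytes only
data = reader.read()
print(reader.maybe_text)   # guess after the whole stream has been read
```

Note that the constructor already has to read from the file, which is the point made above: this needs read permission and real I/O, unlike the directory-entry lookups in os.path.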
On Wed, Jul 31, 2013 at 5:42 PM, Chris Angelico <rosuav@gmail.com> wrote:
Something like:

    if fh.read(512).isprintable():
        do_the_ascii_stuff(fh)
    else:
        do_the_bin_stuff(fh)

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
That's a pretty good idea. Or it could be like this:

    if fh.printable():

It would have an optional argument: the number of bytes to read in. Default is 512. So, if we wanted 1024 bytes instead of 512:

    if fh.printable(1024):

David Mertz <mertz@gnosis.cx> wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
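I actually got something simple working. It's only been tested on Android:

    import re

    def isbinary(fpath):
        with open(fpath, 'r') as f:
            # Strip the surrounding quotes from repr() and restore real newlines
            data = re.sub(r'(^(\'|\")|(\'|\")$)', '', repr(f.read()).replace('\\n', '\n'))
        # Count the \xNN escapes that repr() emitted for non-text bytes
        binchars = re.findall(r'\\x[0123456789abcdef]{2}', data)
        per = (float(len(binchars)) / float(len(data))) * 100
        # Binary if at least 1% of the characters are \xNN escapes
        # (the version as first posted returned True when there were none)
        return int(per) != 0

Ryan <rymg19@gmail.com> wrote: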
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
well, the only remotely valid thing to do is to test if the input data is decodable with any of the encodings python knows. if we do some arbitrary threshold, we only get bugs like fedora’s current release name “Schrödinger's Cat” being considered “not text”. i’d never write code like this <https://github.com/hwoarang/libreport/blob/master/src/lib/problem_data.c#L27...>.

PS: why do people still convert stuff to float? we live in python3 world, where 1/2 is 0.5
On 2013-08-12, at 15:42 , Philipp A. wrote:
well, the only remotely valid thing to do is to test if the input data is decodable with any of the encodings python knows.
Most iso-8859 parts can decode any byte (and thus any byte sequence). Parts 3, 6, 7, 8 and 11 are the only ones not to be defined across all of the [128, 255] range (they're ASCII extensions, so the [0, 127] range is identical to ASCII in all iso-8859 parts).
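One way to check this against the codecs that ship with Python is to try every high byte in each one. A rough sketch; the helper name decodes_every_byte is made up for illustration, and the results describe Python's codec tables, which need not match the standards documents byte for byte:

```python
ALL_HIGH_BYTES = bytes(range(128, 256))


def decodes_every_byte(encoding):
    """Return True if each byte 128..255 decodes on its own in `encoding`."""
    for value in ALL_HIGH_BYTES:
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            return False
    return True


for part in range(1, 17):
    name = "iso8859_%d" % part
    try:
        print(name, decodes_every_byte(name))
    except LookupError:
        # Raised when Python has no codec under that name.
        print(name, "codec not available")
```

Any codec that prints True here will "successfully" decode arbitrary binary data, so "decodable with some known encoding" cannot serve as a text test on its own.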
On Mon, Aug 12, 2013, at 10:32, Mathias Panzenböck wrote:
It depends on precisely what is meant by "iso-8859 parts" - and the same with any other character in 0-32 or 127-159 (there is nothing special about the null byte in this regard). But it's typical to think of "iso-8859" encodings as being more like IANA ISO-8859-1, which combines ISO/IEC 8859-1 with the control character definitions from ISO 6429.
On Mon, Aug 12, 2013 at 3:32 PM, Mathias Panzenböck <grosser.meister.morti@gmx.net> wrote:
I've often used the presence of a NUL in the data as a simple heuristic for "binary file", though only in places where it won't matter (for instance, showing file size in bytes rather than line count - if a binary file happens to have no \0 and its number of \n gets counted, big deal). Otherwise, not worth the hassle of finding out. ChrisA
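A minimal sketch of that NUL heuristic, for the "doesn't really matter if it's wrong" cases described above. The function names and the 8192-byte sample size are assumptions chosen for the example:

```python
def looks_binary(sample):
    """Crude heuristic: treat any NUL byte in the sample as a sign of binary data."""
    return b"\0" in sample


def file_looks_binary(path, sample_size=8192):
    """Apply the heuristic to the first `sample_size` bytes of a file."""
    with open(path, "rb") as f:
        return looks_binary(f.read(sample_size))


print(looks_binary(b"plain ASCII text\n"))
print(looks_binary(b"\x89PNG\r\n\x1a\n\x00\x00\x00\rIHDR"))
# UTF-16 text contains NUL bytes too, so it is reported as binary here.
print(looks_binary("hello".encode("utf-16")))
```

The last line is the UTF-16 objection raised at the start of the thread: the heuristic is cheap, but it misclassifies perfectly good text in some encodings.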
On Thu, Aug 1, 2013 at 2:47 PM, MRAB <python@mrabarnett.plus.com> wrote:
Doesn't that seem like a bug:

-----
Help on method_descriptor:

isprintable(...)
    S.isprintable() -> bool

    Return True if all characters in S are considered printable
    in repr() or S is empty, False otherwise.
-----

In what sense is "\n" "not printable in repr()"?!

-- Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
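On the question just above: "printable in repr()" means that repr() shows the character as itself and not as an escape sequence, and a newline is shown as the escape \n. A short sketch of that behaviour, plus a hypothetical helper (printable_or_whitespace is an invented name) that tolerates the usual whitespace controls:

```python
def printable_or_whitespace(text):
    """Hypothetical helper: like str.isprintable(), but tolerate tab, newline and CR."""
    return all(ch.isprintable() or ch in "\t\n\r" for ch in text)


print(repr("\n"))                              # repr() emits an escape for newline
print("hello".isprintable())
print("hello\n".isprintable())                 # the newline makes this False
print(printable_or_whitespace("hello\n"))
print(printable_or_whitespace("hello\x00"))
```

Note also that str.isprintable() is a str method; in Python 3 the bytes returned by reading a file opened in binary mode have no such method, so the data would have to be decoded first.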
I see... This works:

    import re

    with open('test.xml', 'r') as f:
        print re.sub(r'(^(\'|\")|(\'|\")$)', '', repr(f.read()).replace('\\n', '\n'))

But, repr can still print binary characters. Opening libexpat.so shows all sorts of crazy characters like \x00.

random832@fastmail.us wrote:
-- Sent from my Android phone with K-9 Mail. Please excuse my brevity.
participants (18)
- Alexander Belopolsky
- Andrew Barnert
- Antoine Pitrou
- Chris Angelico
- Chris Kaynor
- Clay Sweetser
- David Mertz
- Eli Bendersky
- Masklinn
- Mathias Panzenböck
- MRAB
- Oleg Broytman
- Philipp A.
- random832@fastmail.us
- Ryan
- Stephen J. Turnbull
- Steven D'Aprano
- Terry Reedy