[Python-Dev] Bytes path support
Glenn Linderman
v+python at g.nevcal.com
Fri Aug 22 22:17:44 CEST 2014
On 8/22/2014 11:50 AM, Oleg Broytman wrote:
> On Fri, Aug 22, 2014 at 10:09:21AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>> On 8/22/2014 9:52 AM, Oleg Broytman wrote:
>>> On Fri, Aug 22, 2014 at 09:37:13AM -0700, Glenn Linderman <v+python at g.nevcal.com> wrote:
>>>> On 8/22/2014 8:51 AM, Oleg Broytman wrote:
>>>>> What encoding does a text file (HTML, to be precise) have if it
>>>>> contains text in utf-8, ads in cp1251 (ad blocks were included from
>>>>> different files) and comments in koi8-r?
>>>>> Well, I must admit the HTML was rather an exception, but having a
>>>>> text file with some strange characters (binary strings, or paragraphs
>>>>> in different encodings) is not that exceptional.
>>>> That's not a text file. That's a binary file containing (hopefully
>>>> delimited, and documented) sections of encoded text in different
>>>> encodings.
>>> Allow me to disagree. For me, this is a text file which I can (and
>>> do) view with a pager, edit with a text editor, list on a console,
>>> search with grep and so on. If it is not a text file by strict Python3
>>> standards then these standards are too strict for me. Either I find a
>>> simple workaround in Python3 to work with such texts or find a different
>>> tool. I cannot avoid such files because my reality is much more complex
>>> than strict text/binary dichotomy in Python3.
>> I was not declaring your file not to be a "text file" from any
>> definition obtained from Python3 documentation, just from a common
>> sense definition of "text file".
> And in my opinion those files are perfect text. The files consist of
> lines separated by EOL characters (not necessarily the EOL characters of
> my OS, because it could be a text file produced on a different OS), lines
> consist of words, and words of characters.
Until you know or can deduce the encoding of a file, it is binary. If it
contains text in multiple different embedded encodings, it is still
binary, in my opinion. These are just opinions and naming conventions: if
you call it text, you have a different definition of text file than I do.
>
>> Looking at it from Python3, though, it is clear that when opening a
>> file in "text" mode, an encoding may be specified or will be
>> assumed. That is one encoding, applying to the whole file, not 3
>> encodings, with declarations on when to switch between them. So I
>> think, in general, Python3 assumes or defines a definition of text
>> file that matches my "common sense" definition.
> I don't have problems with Python3 text. I have problems with Python3
> trying to get rid of byte strings and treating bytes as strict non-text.
Python3 is not trying to get rid of byte strings. But to some extent, it
does want to treat bytes as non-text: bytes can hold encoded text, but
they are not text until decoded. Some processing can be done on encoded
text, but in many cases it has to be done differently than the same
processing on decoded text.
One difference is that the mapping from byte values to characters varies
from encoding to encoding, so if the processing requires understanding
the characters, then the encoding must be known.
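A minimal sketch of that difference, using an arbitrary sample word (the
word and the variable names are only illustrative): the same six bytes
decode without error under both cp1251 and koi8-r, but to different text.

```python
# The same bytes read under two different Cyrillic encodings.
raw = "привет".encode("cp1251")    # six bytes, one per character

as_cp1251 = raw.decode("cp1251")   # the intended reading
as_koi8r = raw.decode("koi8-r")    # decodes without error, but to other characters

print(len(raw))                    # 6
print(as_cp1251 == as_koi8r)       # False: same bytes, different text
```

Neither decode raises, so nothing in the bytes themselves says which
reading is the right one.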
On the other hand, if it suffices to detect blocks of opaque text
delimited by a known set of delimiter codes (EOL: CR, LF, or combinations
thereof), then that can be done relatively easily on binary data, as long
as the encoding doesn't have data puns, where a multibyte encoded
character contains the delimiter's code as one of its bytes.
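As a rough sketch of this, assuming the mixed content uses only
ASCII-compatible encodings (utf-8, cp1251, koi8-r), where the byte 0x0A
can only mean LF: the sample data and names below are made up, and a real
file would be opened with open(path, "rb") instead of io.BytesIO. The last
lines show one data pun, in UTF-16.

```python
import io

# Three lines, each in a different ASCII-compatible encoding, joined by LF.
data = b"\n".join([
    "строка".encode("utf-8"),
    "строка".encode("cp1251"),
    "строка".encode("koi8-r"),
]) + b"\n"

# A binary stream iterates by splitting on the byte 0x0A only; nothing is decoded.
lines = [line.rstrip(b"\r\n") for line in io.BytesIO(data)]
print(len(lines))                              # 3

# Each opaque block can be decoded later, once its encoding is known.
print(lines[1].decode("cp1251") == "строка")   # True

# A data pun: U+010A encoded as UTF-16-LE contains the LF byte 0x0A,
# so splitting UTF-16 data on b"\n" would cut this character in half.
pun = "\u010a".encode("utf-16-le")
print(pun)                                     # b'\n\x01'
```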
>> On the other hand, Python3 provides various facilities for working
>> with such files.
>>
>> The first I'll mention is the one that follows from my description
>> of what your file really is: Python3 allows opening files in binary
>> mode, and then decoding various sections of it using whatever
>> encoding you like, using the bytes.decode() operation on various
>> sections of the file. Determination of which sections are in which
>> encodings is beyond the scope of this description of the technique,
>> and is application dependent.
> This is perhaps the most promising approach. If I can open a text
> file in binary mode, iterate it line by line, split every line of
> non-ascii bytes with .split() and process them that'd satisfy my needs.
>    But still there are dragons. If I read a filename from such a file I
> read it as bytes, not str, so I can only use low-level APIs to
> manipulate those filenames. Pity.
If the file names are in an unknown encoding, both in the directory and
in the file listing, then unless you can deduce the encoding, you are
limited to the file APIs that support bytes: the low-level ones, yes. If
you can deduce the encoding, then you are freed from that limitation.
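A small sketch of the bytes-in, bytes-out behaviour, assuming a POSIX
filesystem that accepts arbitrary bytes in names (typical on Linux; other
platforms may reject or transform such names). The file name is a made-up
example.

```python
import os
import tempfile

# A cp1251-encoded name that is not valid UTF-8 (hypothetical example name).
raw_name = "музыка.mp3".encode("cp1251")

with tempfile.TemporaryDirectory() as tmp:
    tmp_b = os.fsencode(tmp)                   # directory path as bytes
    with open(os.path.join(tmp_b, raw_name), "wb"):
        pass                                   # create an empty file

    # Passing bytes in gets bytes back; no decoding is attempted.
    listed = os.listdir(tmp_b)
    found = os.path.exists(os.path.join(tmp_b, raw_name))

print(listed == [raw_name], found)             # True True

# Once the encoding has been deduced, the name can be decoded to str.
readable = listed[0].decode("cp1251")
```

os.listdir, os.path.join, os.path.exists and open all accept bytes paths,
so the name never has to be interpreted unless you want to display it.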
>    Let's see a perfectly normal situation I am quite often in. A person
> sent me a directory full of MP3 files. The transport doesn't matter; it
> could be FTP, or rsync, or a zip file sent by email, or bittorrent. What
> matters is that filenames and content are in alien encodings. Most often
> it's cp1251 (the encoding used in Russian Windows) but can be koi8 or
> utf8. There is a playlist among the files -- a text file that lists MP3
> files, every file on a single line; usually with full paths
> ("C:\Audio\some.mp3").
> Now I want to read filenames from the file and process the filenames
> (strip paths) and files (verify existence of files, or renumber the files
> or extract ID3 tags [Russian ID3 tags, whatever ID3 standard says, are
> also in cp1251 of utf-8 encoding]...whatever).
"cp1251 of utf-8 encoding" is nonsensical. Either it is cp1251 or it is
utf-8, but it is not both. Maybe you meant "or" instead of "of".
> I don't know the encoding
> of the playlist but I know it corresponds to the encoding of filenames
> so I can expect those files exist on my filesystem; they have strangely
> looking unreadable names but they exist.
> Just a small example of why I do want to process filenames from a
> text file in an alien encoding. Without knowing the encoding in advance.
An interesting example, for sure. Life will be easier when everyone
converts to Unicode and UTF-8.
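In the meantime, the playlist case can probably be handled entirely in
bytes. A hedged sketch, assuming the playlist uses Windows-style paths and
an ASCII-compatible encoding such as cp1251, koi8-r or utf-8, so that the
backslash byte is never part of another character. The helper name
playlist_names and the sample data are hypothetical.

```python
import ntpath

def playlist_names(playlist):
    """Return the bare file names listed in a playlist, still as bytes."""
    names = []
    for line in playlist.splitlines():     # splits on CR, LF and CRLF only
        line = line.strip()
        if line:
            # ntpath applies Windows path rules and accepts bytes.
            names.append(ntpath.basename(line))
    return names

# Sample playlist: full Windows paths, one track name encoded in cp1251.
tracks = ["песня.mp3".encode("cp1251"), b"some.mp3"]
playlist = b"\r\n".join(b"C:\\Audio\\" + t for t in tracks) + b"\r\n"

present = {tracks[0]}                      # pretend only the first file is on disk
names = playlist_names(playlist)
missing = [n for n in names if n not in present]
print(missing)                             # [b'some.mp3']
```

In real use, the set of present names could come from os.listdir called
with a bytes path, which returns the names as bytes without decoding them.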
>
>> The second is to specify an error handler, that, like you, is
>> trained to recognize the other encodings and convert them
>> appropriately. I'm not aware that such an error handler has been or
>> could be written, myself not having your training.
>>
>> The third is to specify the UTF-8 with the surrogate escape error
>> handler. This allows non-UTF-8 codes to be loaded into memory. You,
>> or algorithms as smart as you, could perhaps be developed to detect
>> and manipulate the resulting "lone surrogate" codes in meaningful
>> ways, or could simply allow them to ride along without
>> interpretation, and be emitted as the original, into other files.
> Yes, these are different workarounds.
>
> Oleg.
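For what it is worth, a minimal sketch of the third approach quoted above,
with made-up sample data: decoding as UTF-8 with the surrogateescape error
handler keeps the non-UTF-8 bytes as lone surrogates, and encoding with the
same handler emits the original bytes again.

```python
# One UTF-8 line followed by one cp1251 line (illustrative data).
raw = "utf-8: да\n".encode("utf-8") + "cp1251: да\n".encode("cp1251")

# Bytes that are not valid UTF-8 become lone surrogates U+DC80..U+DCFF.
text = raw.decode("utf-8", errors="surrogateescape")
lone = [ch for ch in text if "\udc80" <= ch <= "\udcff"]
print(len(lone))                           # 2: the two cp1251 bytes

# The escaped bytes can be recovered and interpreted separately...
recovered = bytes(ord(ch) - 0xDC00 for ch in lone)
print(recovered.decode("cp1251") == "да")  # True

# ...or simply ride along and be emitted unchanged.
round_trip = text.encode("utf-8", errors="surrogateescape")
print(round_trip == raw)                   # True
```

The same handler can be given to open(), for example
open(path, encoding="utf-8", errors="surrogateescape"), for both reading
and writing.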