Python 3.2 has some deadly infection
wxjmfauth at gmail.com
Fri Jun 6 11:48:01 EDT 2014
On Friday, 6 June 2014 17:44:57 UTC+2, wxjm... at gmail.com wrote:
> On Friday, 6 June 2014 17:25:47 UTC+2, Chris Angelico wrote:
> > On Fri, Jun 6, 2014 at 11:24 PM, Ethan Furman <ethan at stoneleaf.us> wrote:
> > > On 06/05/2014 11:30 AM, Marko Rauhamaa wrote:
> > >>
> > >> How text is represented is very different from whether text is a
> > >> fundamental data type. A fundamental text file is such that ordinary
> > >> operating system facilities can't see inside the black box (that is,
> > >> they are *not* encoded as far as the applications go).
> > >
> > > Of course they are. It may be an ASCII-encoding of some flavor or other, or
> > > something really (to me) strange -- but an encoding is most assuredly in
> > > effect.
> >
> > Allow me to explain what I think Marko's getting at here.
> >
> > In most file systems, a file exists on the disk as a set of sectors of
> > data, plus some metadata including the file's actual size. When you
> > ask the OS to read you that file, it goes to the disk, reads those
> > sectors, truncates the data to the real size, and gives you those
> > bytes.
> >
> > It's possible to mount a file as a directory, in which case the
> > physical representation is very different, but the file still appears
> > the same. In that case, the OS goes reading some part of the file,
> > maybe decompresses it, and gives it to you. Same difference. These
> > files still contain bytes.
> >
> > A "fundamental text file" would be one where, instead of reading and
> > writing bytes, you read and write Unicode text. Since the hard disk
> > still works with sectors and bytes, it'll still be stored as such, but
> > that's an implementation detail; and you could format your disk UTF-8
> > or UTF-16 or FSR or anything you like, and the only difference you'd
> > see is performance.
> >
> > This could certainly be done, in theory. I don't know how well it'd
> > fit with any of the popular OSes of today, but it could be done. And
> > these files would not have an encoding; their on-platter
> > representations would, but that's purely implementation - the text
> > that you wrote out and the text that you read in are the same text,
> > and there's been no encoding visible.
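
The distinction drawn above can be seen from Python as things stand today: the OS hands back bytes, and any text view comes from decoding in the application. A minimal sketch, assuming a throwaway temporary file and UTF-8 as the chosen encoding (both are illustrative choices of mine, not something taken from the thread):

```python
import os
import tempfile

text = "Gödel"

# Create a scratch file; the path is temporary and purely illustrative.
fd, path = tempfile.mkstemp()
try:
    # Writing text requires picking an encoding; the OS only stores bytes.
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        f.write(text)

    # Binary mode: the bytes exactly as the OS returns them.
    with open(path, "rb") as f:
        raw = f.read()

    # Text mode: Python decodes those same bytes back into a str.
    with open(path, "r", encoding="utf-8") as f:
        decoded = f.read()
finally:
    os.remove(path)

print(raw)                     # the encoded bytes on disk
print(len(raw), len(decoded))  # byte count vs. code point count
```

With a "fundamental text file" as described in the quoted text, the `rb` view would simply not be available to the application; only the decoded view would be.
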
> ----------
>
> Of the three, you can already eliminate one.
> That is not good news.
>
> sys.getsizeof('Gödel'.encode('utf-8'))
> 23
> sys.getsizeof('Gödel'.encode('utf-16-le'))
> 27
> sys.getsizeof('Gödel')
> 42
> os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
> ['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
> 61
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
> 79
> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
> 100
>
> jmf
Sorry, wrong copy/paste. Here is the session again:
>>> import os, sys
>>> sys.getsizeof('Gödel'.encode('utf-8'))
23
>>> sys.getsizeof('Gödel'.encode('utf-16-le'))
27
>>> sys.getsizeof('Gödel')
42
>>> os.listdir(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
['a.txt', 'kk.bat', 'kk.cmd', 'kk.py', '__pycache__']
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-8'))
61
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe'.encode('utf-16-le'))
79
>>> sys.getsizeof(r'D:\jm\Москва\Zürich\Αθήνα\œdipe')
100
>>>
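
One caveat when reading these numbers: sys.getsizeof reports the size of the whole Python object, fixed header included, not only the character data. A minimal sketch that separates the two for the encoded forms, assuming CPython; the helper name payload_and_overhead is hypothetical, and the overhead it reports is an implementation detail that differs between versions and builds:

```python
import sys


def payload_and_overhead(text, encoding):
    """Return (payload_bytes, object_overhead) for text encoded to bytes.

    payload_bytes is the length of the encoded data itself.
    object_overhead is whatever sys.getsizeof reports beyond that for
    the bytes object (a CPython implementation detail).
    """
    data = text.encode(encoding)
    payload = len(data)
    overhead = sys.getsizeof(data) - payload
    return payload, overhead


for enc in ("utf-8", "utf-16-le"):
    payload, overhead = payload_and_overhead("Gödel", enc)
    print(enc, "payload:", payload, "object overhead:", overhead)
```

For a string this short, the fixed per-object cost is larger than the text itself, so the totals say more about object headers than about the encodings being compared.
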
jmf