PEP 263 comments

Fri Mar 1 21:26:34 EST 2002

On 01 Mar 2002 11:04:06 +0100, Martin von Loewis <loewis at informatik.hu-berlin.de> wrote:

>bokr at oz.net (Bengt Richter) writes:
[...]
>
>> Perhaps we could just use a file to contain extra file metadata,
>> letting a file of metadata govern other files it names in the same
>> directory as itself. Probably a dot file in *nix.
>
>Nice idea, different PEP.
>
>> For PEP 263 purposes, it would only need to be a text file with file
>> names tab delimited from keyword=encoding-info, with the first line(s)
>> perhaps with a glob pattern for a compact way of specifying encoding
>> for a lot of files in a directory at once.
>
>I don't think the file encoding information should be stored in a
>different file; the risk of the two files becoming disassociated is
>just too big to be acceptable.
>
Well, there's a risk of a symlink becoming dissociated from its file too,
but if you are using the mechanism, it's quickly apparent when it breaks.

I agree there is a dissociation danger, but when an error pops up, it will
be easy to add the misbehaving file's name to the local meta-data directory file.
Encoding detection tools also could do a one-time scan of a directory and validate
the metadata, or at least warn.
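
A minimal sketch of such a one-time scan, in present-day Python 3. The names
check_file, scan and the declared mapping (file name -> encoding) are made up
for illustration. Note the limits: a clean decode does not prove the declared
encoding is right (Latin-1 decodes any byte string), so this can only warn
about declarations that are certainly wrong.

```python
import os
import tempfile

def check_file(path, encoding):
    """Return None if the file decodes with the declared encoding, else a reason."""
    with open(path, "rb") as f:
        data = f.read()
    try:
        data.decode(encoding)
    except (UnicodeDecodeError, LookupError) as exc:
        # LookupError covers an encoding name that Python does not know.
        return str(exc)
    return None

def scan(directory, declared):
    """Check every declared file once; return {name: reason} for the failures."""
    problems = {}
    for name, encoding in declared.items():
        path = os.path.join(directory, name)
        if not os.path.exists(path):
            problems[name] = "listed in metadata but file is missing"
            continue
        reason = check_file(path, encoding)
        if reason is not None:
            problems[name] = reason
    return problems

with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "good.txt"), "wb") as f:
        f.write("caf\u00e9".encode("utf-8"))
    with open(os.path.join(d, "bad.txt"), "wb") as f:
        f.write(b"caf\xe9")  # Latin-1 bytes, not valid UTF-8
    declared = {"good.txt": "utf-8", "bad.txt": "utf-8", "gone.txt": "utf-8"}
    problems = scan(d, declared)
    for name, reason in sorted(problems.items()):
        print("warning:", name, "-", reason)
```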

If desired, you can hide the actual data file by renaming it to a hashed
or other alias name, using the metadata entry to show the original name
and its symlink-like location option to point to the renamed file. That
forces a tool either to fail to find file_orig_name outright, or to look in
the directory file, where it will find

    file_orig_name  encoding=UTF-8 location=./hashedname

or find no entry at all. Either way, the meta-data would not be silently ignored.
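
To make the lookup concrete, here is a rough sketch of how a tool might read
such a directory file, in present-day Python 3. It assumes one entry per line,
a file name or glob pattern first, then whitespace-separated key=value fields
as in the entry above. parse_meta and lookup are hypothetical names, and the
rule that an exact name beats a glob pattern is my own assumption.

```python
import fnmatch

def parse_meta(text):
    """Parse lines of the form: name-or-glob, then key=value fields."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        name, *fields = line.split()
        attrs = dict(field.split("=", 1) for field in fields)
        entries.append((name, attrs))
    return entries

def lookup(entries, filename):
    """Return the attributes for filename; exact names win over glob patterns."""
    for name, attrs in entries:
        if name == filename:
            return attrs
    for pattern, attrs in entries:
        if fnmatch.fnmatchcase(filename, pattern):
            return attrs
    return None  # no entry: the caller should complain rather than guess

SAMPLE = (
    "*.py\tencoding=ISO-8859-1\n"
    "file_orig_name\tencoding=UTF-8 location=./hashedname\n"
)

entries = parse_meta(SAMPLE)
print(lookup(entries, "file_orig_name"))
print(lookup(entries, "spam.py"))
print(lookup(entries, "README"))
```

File names containing whitespace would need a stricter rule (split on the
first tab only, say); the sketch ignores that.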

But the best way to keep metadata is with actual file system support, as Paul
Prescod mentioned privately (and which I was about to go into when I decided my
post was getting too long ;-)

What I want is a universal file typing metadata prefix with codes issued
through a registry system that assigns numbers in a way that provides for both
common de facto standards and private company proprietary file types.

The prefix would be copied with any copy of the data file, but it would be excluded
from the range of normal seek operations.

If the prefix contained a location symlink, that would be all that was copied by
default. Data-verifiable links could contain md5 hashes of the data they link to.
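
A data-verifiable link could be as simple as the following sketch (Python 3,
standard hashlib). The dictionary layout and the names make_link and
verify_link are invented for illustration only.

```python
import hashlib

def make_link(location, data):
    """Build a link record that remembers the md5 of the data it points to."""
    return {"location": location, "md5": hashlib.md5(data).hexdigest()}

def verify_link(link, data):
    """True if data still matches the hash stored in the link."""
    return hashlib.md5(data).hexdigest() == link["md5"]

link = make_link("./hashedname", b"hello")
print(verify_link(link, b"hello"))   # unchanged data
print(verify_link(link, b"hellO"))   # data was modified after linking
```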
The reason for the location link option is to be able to wrap legacy files and
whole file systems without modifying them, while still being able to integrate
them into the new file system. I like the idea of doing the prefix in UTF-8 so that
a local system can wrap a foreign system with local file names in the "symlinks".

Think of it as meta-data-enhanced UTF-8-encoded super-symlinks, with the principal
purpose of carrying universal-file-type-code & encoding-id in the meta-data, and
the option of an absent location-link meaning data follows immediately in
the same container as the super-symlink.

Note that not touching the linked file's data would allow you to enhance
the meta-data to handle compound encoding formats that do not allow
embedding -*- "cookies" -*-, such as zip, tgz, pgp, bin64, etc.

You could create a new file with the metadata as a prefix with data
immediately following (probably on a block boundary), but this would
be nicest with file system support.
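
Without file system support, a user-level approximation might look like this
sketch (Python 3). Everything about the layout is an assumption of mine: a
512-byte block, key=value lines in UTF-8, a NUL terminator, and NUL padding up
to the block boundary. PrefixedFile is a hypothetical wrapper that keeps the
prefix out of the range of seek().

```python
import io

BLOCK = 512  # assumed block size

def wrap(meta, data):
    """Return prefix + data, with the prefix padded to a block boundary."""
    text = "".join(f"{key}={value}\n" for key, value in meta.items())
    header = text.encode("utf-8") + b"\0"      # NUL marks the end of the prefix
    padded = -(-len(header) // BLOCK) * BLOCK  # round up to whole blocks
    return header.ljust(padded, b"\0") + data

class PrefixedFile:
    """Expose only the data part; offsets are relative to the data start."""

    def __init__(self, raw):
        self._raw = raw
        buf = b""
        while b"\0" not in buf:
            chunk = raw.read(BLOCK)
            if not chunk:
                raise ValueError("no metadata prefix terminator found")
            buf += chunk
        text = buf[:buf.index(b"\0")].decode("utf-8")
        self.meta = dict(line.split("=", 1) for line in text.splitlines())
        self._start = len(buf)  # whole blocks were read, so data begins here
        raw.seek(self._start)

    def seek(self, offset):
        self._raw.seek(self._start + offset)

    def tell(self):
        return self._raw.tell() - self._start

    def read(self, size=-1):
        return self._raw.read(size)

blob = wrap({"encoding": "EBCDIC", "type": "text"}, b"payload")
f = PrefixedFile(io.BytesIO(blob))
print(f.meta)
f.seek(0)
print(f.read())
```

Copying the blob copies the prefix along with the data, while a program using
the wrapper sees offset 0 as the first data byte.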

>> To provide international encoding for file-associated info, like
>> a local dialect/special characters name etc., in a system whose
>> native file naming is more restricted, perhaps this directory of
>> file attributes could be standardized to UTF-8 for its own encoding.
>
>We are not talking about file names here, but about file contents.
>
Right, I was 'introducing an optional side benefit'[1] of the main idea.
IOW, you could e.g. have optional meta-data containing a Sanskrit string
for whatever purpose, irrespective of the type or encoding declared for
the file that the meta-data was associated with. If the meta-data file
were always UTF-8 encoded, there would be no restriction on *its*
content, though the associated data file might be EBCDIC or whatever.

Whether you wanted to support file annotations, or special-language display names
or whatever would be a design decision above the super-symlink general
infrastructure I am describing.

[1] I guess that's a bad habit when trying to communicate a main idea ;-/

>> There are some changes as to legality checks, apparently,
>> as of last May. I'm wondering if this affects PEP 263
>> and/or the unicode implementation in Python.
>
>That doesn't affect this PEP; as for the Unicode 3.1 conformance, I
>believe the current CVS implements UTF-8 correctly.
>
I'll take your word for it ;-)

BTW, if my display font is Lucida Console, will I be able to see the infinity
sign displayed the way the 'A' is in the following?
 >>> u'\u0041'
 u'A'
 >>> u'\u221e'
 u'\u221e'

Will the following work?

 >>> print u'\u0041'
 A
 >>> print u'\u221e'

 Traceback (most recent call last):
   File "<stdin>", line 1, in ?
 UnicodeError: ASCII encoding error: ordinal not in range(128)
 >>>

And if I redirect the output to a file, what will be in the file?
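
For comparison, a sketch in present-day Python 3 spelling (not the 2002
interpreter shown above): what lands in the file depends on the encoding
attached to the output stream. The helper write_as is a made-up name that
simulates a redirected stream with an in-memory buffer.

```python
import io

INFINITY = "\u221e"

def write_as(text, encoding):
    """Write text through a stream with the given encoding; return the bytes."""
    buf = io.BytesIO()
    stream = io.TextIOWrapper(buf, encoding=encoding)
    stream.write(text)
    stream.flush()
    return buf.getvalue()

# With a UTF-8 stream the character becomes a three-byte sequence.
print(write_as(INFINITY, "utf-8"))

# With an ASCII stream the write fails, much like the traceback above.
try:
    write_as(INFINITY, "ascii")
except UnicodeEncodeError as exc:
    print("ascii cannot encode it:", exc)
```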

Regards,
Bengt Richter