[Python-Dev] PEP 385: the eol-type issue

Wed Aug 5 19:43:57 CEST 2009

On approximately 8/5/2009 4:28 AM, came the following characters from 
the keyboard of Dirkjan Ochtman:
> On Wed, Aug 5, 2009 at 13:19, Mark Hammond<mhammond at skippinet.com.au> wrote:
>> Configuring on each clone would certainly be sub-optimal, so the proposal is
>> this configuration be stored in a versioned file in the repo.
> 
> Even if we do that, enabling hg extensions will still need to be done
> locally -- although it can be done per-user/box instead of per-clone.

On approximately 8/5/2009 9:24 AM, came the following characters from 
the keyboard of Paul Moore:
 > 2) This behaviour is something needed for Python only. I've no issue
 > with enabling win32text globally, but I'd want to be clear that it is
 > a no-op unless specifically requested (ie, something like
 > **=cleverencode is *not* used in the absence of an explicit set of
 > rules). That may well be the case, but I had the impression that
 > win32text tried to be "automatic", so I'd like to verify it.

Depending on [Windows] users to configure their installation of 
Mercurial to work with the Python repository is lame; it will lead to 
new Windows contributors getting beaten up at check-in time, and make them 
less likely to want to contribute even the work they have already done 
(with wrong EOL), much less to start future contributions, 
because some Unix Python hacker will be nasty about "Didn't you RTFM?" 
(maybe not at first, but eventually).

If the configuration settings have to be different per project for 
Windows developers using Mercurial for multiple projects, then that is 
also lame... Windows developers would have to keep changing their 
configurations, or (implied in above discussion) remember to recreate 
settings for each new clone or branch or whatever of the Python project. 
This is also error-prone, and leads to the above problem in a different way.

I have read this whole discussion, but want to step back and look at it 
from a theoretical viewpoint.  A good solution would have the following 
characteristics:

INSTALLATION) The developer should install the [D]VCS (for this 
discussion, Mercurial, present or future version), and attempt to access 
a repository (for this discussion, the Python repository, converted and 
configured for the chosen [D]VCS).  The resultant environment should 
automatically be configured to work properly. If any [D]VCS extensions 
are required for the project, they should be automatically installed and 
configured, or the user given explicit instructions on how to do so, as 
a one-time installation step, that adversely affects no other projects 
for which the [D]VCS is used by that or other users of the present 
installation.  See below for what "properly" means.

EOL CONFIGURATION) Each file, when added to the repository, should have 
a repository setting that indicates what the appropriate EOL type is for 
that file.  The values I have heard are \n only, \r\n, platform-native, 
and binary.  I haven't heard \r only in this discussion, but have heard 
it in other similar discussions, and it may be a useful setting for 
Mercurial to have, if the feature must be newly implemented there.  I 
believe there are also systems that use RS to separate lines, and 
perhaps other things (and are there new Unicode control characters that 
could be used for line endings?), so it might be good to leave a few 
unassigned values in such a setting.  I don't think any setting should 
be created to allow mixed line ending usage within a file, except 
binary.  A per-repository default for this setting should be available 
to avoid burdening the user when creating the typical type of file.
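
As a rough illustration of the EOL CONFIGURATION idea, here is a 
minimal Python sketch of a per-file EOL setting with a per-repository 
default.  The names Eol, RULES, DEFAULT and eol_for, and the example 
patterns, are all hypothetical; this is not Mercurial's configuration 
format, just one way the lookup could work.

```python
import fnmatch
from enum import Enum


class Eol(Enum):
    """The four EOL settings mentioned above (more could be added later)."""
    LF = "\n"
    CRLF = "\r\n"
    NATIVE = "native"
    BINARY = "binary"


# Hypothetical versioned rule table: the first matching pattern wins.
RULES = [
    ("*.bat", Eol.CRLF),
    ("*.sh", Eol.LF),
    ("*.png", Eol.BINARY),
]

# Hypothetical per-repository default for files that match no rule.
DEFAULT = Eol.NATIVE


def eol_for(path, rules=RULES, default=DEFAULT):
    """Return the EOL setting for a path, or the repository default."""
    # Match on the file name only, accepting either path separator.
    name = path.replace("\\", "/").rsplit("/", 1)[-1]
    for pattern, eol in rules:
        if fnmatch.fnmatchcase(name, pattern):
            return eol
    return default
```

With these assumed rules, eol_for("PCbuild/build.bat") gives Eol.CRLF, 
while a file such as "Lib/os.py" falls through to the default.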

ENCODING CONFIGURATION) Each file, when created, should have a 
repository setting that declares its character repertoire and encoding, 
and if it is a Unicode UTF encoding, whether or not it should have a 
leading BOM.  In my opinion, all source code files should use a Unicode 
encoding, the exception being for test files that help test encoding 
support in internationalized environments.  But the feature supports 
other people's opinions too.  A per-repository default for this setting 
should be available to avoid burdening the user when creating the 
typical type of file.
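
The ENCODING CONFIGURATION setting could be modelled along the 
following lines.  This is only a sketch: the FileEncoding class and its 
to_repo_bytes method are hypothetical names, UTF-8 without a BOM is an 
assumed default, and only the UTF-8 BOM is handled.

```python
import codecs
from dataclasses import dataclass


@dataclass(frozen=True)
class FileEncoding:
    """Hypothetical per-file setting: encoding plus a BOM flag."""
    encoding: str = "utf-8"   # assumed repository default
    bom: bool = False         # whether a leading BOM is wanted

    def to_repo_bytes(self, text):
        """Encode text for storage, failing if it is out of repertoire."""
        # Strict encoding raises UnicodeEncodeError for characters the
        # declared encoding cannot represent.
        data = text.encode(self.encoding)
        # This sketch only knows how to add a BOM for UTF-8.
        if self.bom and codecs.lookup(self.encoding).name == "utf-8":
            data = codecs.BOM_UTF8 + data
        return data
```

Declaring a narrower encoding (say "ascii") for a file then makes any 
attempt to store characters outside that repertoire fail loudly.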

CHECKOUT) Check-outs should be sensitive to the user's local environment 
(platform and locale settings), and non-binary files should be converted 
from the repository format to the local encoding and platform-specific 
line endings.  Settings to override the line endings should be 
optionally available for users whose tools understand other line 
endings, and prefer them over the native line endings.  If the 
characters used within a file cannot be converted losslessly to the 
encoding specified by the locale settings, then the check-out of that 
file should fail.  A special override might be useful for using a lossy 
transformation for a read-only view of the file, at user request.
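
A minimal sketch of the CHECKOUT conversion, in Python.  It assumes the 
repository stores text with "\n" line endings, and the function name 
checkout and its parameters are hypothetical, not any real [D]VCS API.

```python
import os


def checkout(repo_bytes, repo_encoding="utf-8", local_encoding="utf-8",
             local_eol=os.linesep, lossy=False):
    """Convert repository bytes to the local encoding and line endings.

    Assumes the repository form uses "\n" line endings.  Raises
    UnicodeEncodeError if the conversion would lose characters, unless
    lossy=True is requested (intended for a read-only view).
    """
    text = repo_bytes.decode(repo_encoding)
    # Replace the repository line ending with the local (or overridden) one.
    text = text.replace("\n", local_eol)
    errors = "replace" if lossy else "strict"
    return text.encode(local_encoding, errors)
```

The local_eol parameter plays the role of the user override for tools 
that prefer non-native line endings, and lossy=True corresponds to the 
special read-only override.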

CHECKIN) Check-ins, even local check-ins to local clones or branches, 
should automatically convert encodings and line endings from the 
platform and locale setting to the encoding and line ending specified by 
the repository for that file.  If the characters in the modified file 
cannot be transformed losslessly to the repository repertoire and 
encoding, the check-in should be prevented.
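
The CHECKIN step could be sketched as follows.  Again the function name 
checkin and its parameters are hypothetical, and rejecting files with 
mixed line endings is my own assumption about how the validation might 
be done, following the remark above that mixed endings belong only in 
binary files.

```python
import re

# Matches any of the three common line endings, "\r\n" first.
_ANY_EOL = re.compile(r"\r\n|\r|\n")


def checkin(local_bytes, local_encoding, repo_encoding="utf-8",
            repo_eol="\n"):
    """Convert a local text file to the repository encoding and EOL.

    Raises ValueError for mixed line endings, and UnicodeEncodeError if
    the text does not fit the repository repertoire and encoding.
    """
    text = local_bytes.decode(local_encoding)
    # Validate incoming data: a text file should use one EOL style only.
    if len(set(_ANY_EOL.findall(text))) > 1:
        raise ValueError("mixed line endings; check-in refused")
    # Normalize whatever EOL was used locally to the repository's EOL.
    text = _ANY_EOL.sub(lambda match: repo_eol, text)
    # Strict encoding prevents a lossy check-in.
    return text.encode(repo_encoding)
```

Either exception would be the point at which the check-in is prevented 
and the contributor is told what is wrong with the file.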

The CHECKIN behavior should be a requirement of any useful [D]VCS, 
regardless of whether any of the other capabilities are present.

Even if none of the existing tools can reach the above flexibility, the 
problems that result from using tools that do not have such flexibility 
should be understood in terms of their specific deficiencies compared to 
the theoretical model.

I can think of only one other solution that properly handles the 
problems (which is punting, really): to require the development 
environment to support the repertoire, encoding, and line endings of the 
repository.  Doing this in a cross-platform manner is hard, because the 
tool sets (editors, compilers, databases, etc.) tend to support the 
platform-native convention better than the non-native conventions.  It 
sounds like Mercurial's win32text extension is one form of this sort of 
requirement.  CHECKIN should be a requirement even in this case, to 
validate the incoming data file.  Basic software design requires 
validation of incoming data.

I have no clue how many of these characteristics are implemented by 
Mercurial, or by any other VCS or DVCS.  I've been 7 years away from 
using SCCS, CVS, and Clearcase, and none of them had such features then.  
I've not used the modern crop of VCSes (git, svn, hg, bazaar) except a 
little in passing, and I haven't read any documentation or attempted to 
set up a project myself in any of them.

If none of the existing tools can reach the above flexibility, then 
there will be problems that result, and understanding what the problems 
are, and coming up with documented workarounds, processes, and auxiliary 
tools on each platform/environment to cure or prevent them, would seem 
to be necessary to support the use of such tools.

Since Mercurial is the presently chosen DVCS for Python to migrate to, 
I'd be delighted to learn how close it comes to the theoretical model, 
and I'm sure someone out there knows.  When I have some time, I'll 
attempt to figure that out by reading the Mercurial documentation... I 
have a personal (Python, cross-platform) project that is in need of a 
DVCS soon, and so I'm watching this discussion with much interest, to 
know whether I should also choose Mercurial, or should choose something 
that is closer to the theoretical solution outlined above (if there is 
something that is, or appears to be more likely to reach it sooner).

-- 
Glenn -- http://nevcal.com/
===========================
A protocol is complete when there is nothing left to remove.
-- Stuart Cheshire, Apple Computer, regarding Zero Configuration Networking