Hi.
[Mark Hammond]
> The point isn't about my suffering as such. The point is more that
> python-dev owns a tiny amount of the code out there, and I don't believe we
> should put Python's users through this.
>
> Sure - I would be happy to "upgrade" all the win32all code, no problem. I
> am also happy to live in the bleeding edge and take some pain that will
> cause.
>
> The issue is simply the user base, and giving Python a reputation of not
> being able to painlessly upgrade even dot revisions.
I agree with all this.
[As I imagined, the explicit-syntax proposal did not catch on and
would require a lot of discussion.]
[GvR]
> > Another way is to use special rules
> > (similar to those for class defs), e.g. having
> >
> > <frag>
> > y=3
> > def f():
> >     exec "y=2"
> >     def g():
> >         return y
> >     return g()
> >
> > print f()
> > </frag>
> >
> > # print 3.
> >
> > Is that confusing for users? Maybe they will more naturally expect 2
> > as the outcome (given nested scopes).
>
> This seems the best compromise to me. It will lead to the least
> broken code, because this is the behavior that we had before nested
> scopes! It is also quite easy to implement given the current
> implementation, I believe.
>
> Maybe we could introduce a warning rather than an error for this
> situation though, because even if this behavior is clearly documented,
> it will still be confusing to some, so it is better if we outlaw it in
> some future version.
>
Yes, this is easy to implement, but more confusing situations can arise:
<frag>
y=3
def f():
    y=9
    exec "y=2"
    def g():
        return y
    return y,g()
print f()
</frag>
What should this print? Unlike class def scopes, this situation does not
lead to a canonical solution.
or
<frag>
def f():
    from foo import *
    def g():
        return y
    return g()
print f()
</frag>
[Mark Hammond]
> > This probably won't be a very popular suggestion, but how about pulling
> > nested scopes (I assume they are at the root of the problem)
> > until this can be solved cleanly?
>
> Agreed. While I think nested scopes are kinda cool, I have lived without
> them, and really without missing them, for years. At the moment the cure
> appears worse then the symptoms in at least a few cases. If nothing else,
> it compromises the elegant simplicity of Python that drew me here in the
> first place!
>
> Assuming that people really _do_ want this feature, IMO the bar should be
> raised so there are _zero_ backward compatibility issues.
I'm not saying anything about pulling nested scopes (I don't think my opinion
can change things in this respect), but I must insist that without explicit
syntax, raising the bar IMO either has too high an implementation cost
(both performance and complexity) or creates confusion.
[Andrew Kuchling]
> >Assuming that people really _do_ want this feature, IMO the bar should be
> >raised so there are _zero_ backward compatibility issues.
>
> Even at the cost of additional implementation complexity? At the cost
> of having to learn "scopes are nested, unless you do these two things
> in which case they're not"?
>
> Let's not waffle. If nested scopes are worth doing, they're worth
> breaking code. Either leave exec and from..import illegal, or back
> out nested scopes, or think of some better solution, but let's not
> introduce complicated backward compatibility hacks.
IMO breaking code would be OK if we issue warnings today and have
nested scopes issue errors tomorrow. But this is simply a statement
about principles and about the impression it creates.
IMO import * in an inner scope should end up being an error;
I'm not sure about 'exec'.
We will need a final BDFL statement.
regards, Samuele Pedroni.
PEP: 0???
Title: Support for System Upgrades
Version: $Revision: 0.0 $
Author: mal(a)lemburg.com (Marc-André Lemburg)
Status: Draft
Type: Standards Track
Python-Version: 2.3
Created: 19-Jul-2001
Post-History:
Abstract
This PEP proposes strategies to allow the Python standard library
to be upgraded in parts without having to reinstall the complete
distribution or having to wait for a new patch level release.
Problem
Python currently does not allow overriding modules or packages in
the standard library per default. Even though this is possible by
defining a PYTHONPATH environment variable (the paths defined in
this variable are prepended to the Python standard library path),
there is no standard way of achieving this without changing the
configuration.
Since Python's standard library is starting to host packages which
are also available separately, e.g. the distutils, email and PyXML
packages, which can also be installed independently of the Python
distribution, it is desirable to have an option to upgrade these
packages without having to wait for a new patch level release of
the Python interpreter to bring along the changes.
Proposed Solutions
This PEP proposes two different but not necessarily conflicting
solutions:
1. Adding a new standard search path to sys.path:
$stdlibpath/system-packages just before the $stdlibpath
entry. This complements the already existing entry for site
add-ons $stdlibpath/site-packages which is appended to the
sys.path at interpreter startup time.
To make use of this new standard location, distutils will need
to grow support for installing certain packages in
$stdlibpath/system-packages rather than the standard location
for third-party packages $stdlibpath/site-packages.
2. Tweaking distutils to install directly into $stdlibpath for the
system upgrades rather than into $stdlibpath/site-packages.
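For illustration of solution 1, the resulting sys.path manipulation
could look roughly like the sketch below. This is only a sketch under
assumptions of my own (the exact startup mechanics, e.g. whether
site.py or the interpreter core adds the entry, are not specified by
this PEP):

import sys, os

# Directory holding the standard library, e.g. /usr/lib/python2.3
stdlibpath = os.path.dirname(os.__file__)
systempackages = os.path.join(stdlibpath, "system-packages")

# Place system-packages just before the stdlib entry so that
# upgraded packages shadow their stdlib counterparts.
if systempackages not in sys.path:
    try:
        sys.path.insert(sys.path.index(stdlibpath), systempackages)
    except ValueError:
        # stdlib path not found verbatim in sys.path; fall back
        sys.path.append(systempackages)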
The first solution has a few advantages over the second:
* upgrades can be easily identified (just look in
$stdlibpath/system-packages)
* upgrades can be deinstalled without affecting the rest
of the interpreter installation
* modules can be virtually removed from packages; this is
due to the way Python imports packages: once it finds the
top-level package directory it stays in this directory for
all subsequent package submodule imports
* the approach has an overall much cleaner design than the
hackish install on top of an existing installation approach
The only advantages of the second approach are that the Python
interpreter does not have to be changed and that it works with
older Python versions.
Both solutions require changes to distutils. These changes can
also be implemented by package authors, but it would be better to
define a standard way of switching on the proposed behaviour.
Scope
Solution 1: Python 2.3 and up
Solution 2: all Python versions supported by distutils
Credits
None
References
None
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
End:
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
_______________________________________________________________________
eGenix.com -- Makers of the Python mx Extensions: mxDateTime,mxODBC,...
Python Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
I've uploaded my logging module, the proposed implementation for PEP 282,
for committer review, to the SourceForge patch manager:
http://sourceforge.net/tracker/index.php?func=detail&aid=578494&group_id=5470&atid=305470
I've assigned it to Mark Hammond as (a) he had posted some comments to Trent
Mick's original PEP posting, and (b) Barry Warsaw advised not assigning to
PythonLabs people on account of their current workload.
The file logging.py is (apart from some test scripts) all that's supposed to
go into Python 2.3. The file logging-0.4.6.tar.gz contains the module, an
updated version of the PEP (which I mailed to Barry Warsaw on 26th June),
numerous test/example scripts, TeX documentation etc. You can also refer to
http://www.red-dove.com/python_logging.html
Here's hoping for a speedy review :-)
Regards,
Vinay Sajip
I see no references to HAVE_CONFIG_H in the source code (except one
#undef in readline.c), yet we #define it on the command line. Is that
still necessary?
--Guido van Rossum (home page: http://www.python.org/~guido/)
Here's another new PEP.
Bye,
Walter Dörwald
----------------------------------------------------------------------
PEP: 293
Title: Codec Error Handling Callbacks
Version: $Revision: 1.1 $
Last-Modified: $Date: 2002/06/19 03:22:11 $
Author: Walter Dörwald
Status: Draft
Type: Standards Track
Created: 18-Jun-2002
Python-Version: 2.3
Post-History:
Abstract
This PEP aims at extending Python's fixed codec error handling
schemes with a more flexible callback based approach.
Python currently uses fixed error handling schemes for codec
errors. This PEP describes a mechanism which allows Python to
use function callbacks as error handlers. With these more
flexible error handlers it is possible to add new functionality to
existing codecs by e.g. providing fallback solutions or different
encodings for cases where the standard codec mapping does not
apply.
Specification
Currently the set of codec error handling algorithms is fixed to
either "strict", "replace" or "ignore" and the semantics of these
algorithms is implemented separately for each codec.
The proposed patch will make the set of error handling algorithms
extensible through a codec error handler registry which maps
handler names to handler functions. This registry consists of the
following two C functions:
int PyCodec_RegisterError(const char *name, PyObject *error)
PyObject *PyCodec_LookupError(const char *name)
and their Python counterparts
codecs.register_error(name, error)
codecs.lookup_error(name)
PyCodec_LookupError raises a LookupError if no callback function
has been registered under this name.
Similar to the encoding name registry there is no way of
unregistering callback functions or iterating through the
available functions.
The callback functions will be used in the following way by the
codecs: when the codec encounters an encoding/decoding error, the
callback function is looked up by name, the information about the
error is stored in an exception object and the callback is called
with this object. The callback returns information about how to
proceed (or raises an exception).
For encoding, the exception object will look like this:
class UnicodeEncodeError(UnicodeError):
    def __init__(self, encoding, object, start, end, reason):
        UnicodeError.__init__(self,
            ("encoding '%s' can't encode characters " +
             "in positions %d-%d: %s") % (encoding,
                 start, end-1, reason))
        self.encoding = encoding
        self.object = object
        self.start = start
        self.end = end
        self.reason = reason
This type will be implemented in C with the appropriate setter and
getter methods for the attributes, which have the following
meaning:
* encoding: The name of the encoding;
* object: The original unicode object for which encode() has
been called;
* start: The position of the first unencodable character;
* end: (The position of the last unencodable character)+1 (or
the length of object, if all characters from start to the end
of object are unencodable);
* reason: The reason why object[start:end] couldn't be encoded.
If object has consecutive unencodable characters, the encoder
should collect those characters for one call to the callback if
those characters can't be encoded for the same reason. The
encoder is not required to implement this behaviour and may call
the callback for every single character, but it is strongly
suggested that the collecting method be implemented.
The callback must not modify the exception object. If the
callback does not raise an exception (either the one passed in, or
a different one), it must return a tuple:
(replacement, newpos)
replacement is a unicode object that the encoder will encode and
emit instead of the unencodable object[start:end] part, newpos
specifies a new position within object, where (after encoding the
replacement) the encoder will continue encoding.
If the replacement string itself contains an unencodable character
the encoder raises the exception object (but may set a different
reason string before raising).
Should further encoding errors occur, the encoder is allowed to
reuse the exception object for the next call to the callback.
Furthermore the encoder is allowed to cache the result of
codecs.lookup_error.
If the callback does not know how to handle the exception, it must
raise a TypeError.
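To illustrate the protocol, a user-defined handler might be registered
and used like this (a sketch assuming the proposed patch; the handler
name "hexreplace" is invented for the example):

import codecs

def hexreplace(exc):
    # Replace each unencodable character with a \xNN text escape.
    if isinstance(exc, UnicodeEncodeError):
        s = u"".join([u"\\x%02x" % ord(c)
                      for c in exc.object[exc.start:exc.end]])
        return (s, exc.end)
    raise TypeError("can't handle %s" % exc.__class__.__name__)

codecs.register_error("hexreplace", hexreplace)

print u"sp\u00e4m".encode("ascii", "hexreplace")    # prints: sp\xe4m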
Decoding works similarly to encoding, with the following differences:
The exception class is named UnicodeDecodeError and the attribute
object is the original 8bit string that the decoder is currently
decoding.
The decoder will call the callback with those bytes that
constitute one undecodable sequence, even if there is more than
one undecodable sequence that is undecodable for the same reason
directly after the first one. E.g. for the "unicode-escape"
encoding, when decoding the illegal string "\\u00\\u01x", the
callback will be called twice (once for "\\u00" and once for
"\\u01"). This is done to be able to generate the correct number
of replacement characters.
The replacement returned from the callback is a unicode object
that will be emitted by the decoder as-is without further
processing instead of the undecodable object[start:end] part.
There is a third API that uses the old strict/ignore/replace error
handling scheme:
PyUnicode_TranslateCharmap/unicode.translate
The proposed patch will enhance PyUnicode_TranslateCharmap, so
that it also supports the callback registry. This has the
additional side effect that PyUnicode_TranslateCharmap will
support multi-character replacement strings (see SF feature
request #403100 [1]).
For PyUnicode_TranslateCharmap the exception class will be named
UnicodeTranslateError. PyUnicode_TranslateCharmap will collect
all consecutive untranslatable characters (i.e. those that map to
None) and call the callback with them. The replacement returned
from the callback is a unicode object that will be put in the
translated result as-is, without further processing.
All encoders and decoders are allowed to implement the callback
functionality themselves, if they recognize the callback name
(i.e. if it is a system callback like "strict", "replace" and
"ignore"). The proposed patch will add two additional system
callback names: "backslashreplace" and "xmlcharrefreplace", which
can be used for encoding and translating and which will also be
implemented in-place for all encoders and
PyUnicode_TranslateCharmap.
The Python equivalent of these five callbacks will look like this:
def strict(exc):
    raise exc

def ignore(exc):
    if isinstance(exc, UnicodeError):
        return (u"", exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def replace(exc):
    if isinstance(exc, UnicodeEncodeError):
        return ((exc.end-exc.start)*u"?", exc.end)
    elif isinstance(exc, UnicodeDecodeError):
        return (u"\ufffd", exc.end)
    elif isinstance(exc, UnicodeTranslateError):
        return ((exc.end-exc.start)*u"\ufffd", exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def backslashreplace(exc):
    if isinstance(exc,
                  (UnicodeEncodeError, UnicodeTranslateError)):
        s = u""
        for c in exc.object[exc.start:exc.end]:
            if ord(c) <= 0xff:
                s += u"\\x%02x" % ord(c)
            elif ord(c) <= 0xffff:
                s += u"\\u%04x" % ord(c)
            else:
                s += u"\\U%08x" % ord(c)
        return (s, exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)

def xmlcharrefreplace(exc):
    if isinstance(exc,
                  (UnicodeEncodeError, UnicodeTranslateError)):
        s = u""
        for c in exc.object[exc.start:exc.end]:
            s += u"&#%d;" % ord(c)
        return (s, exc.end)
    else:
        raise TypeError("can't handle %s" % exc.__class__.__name__)
These five callback handlers will also be accessible to Python as
codecs.strict_error, codecs.ignore_error, codecs.replace_error,
codecs.backslashreplace_error and codecs.xmlcharrefreplace_error.
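Assuming the proposed patch, the new system handlers could then be
selected directly via the errors argument of encode(), e.g.:

print u"a\u00e4\u20ac".encode("ascii", "xmlcharrefreplace")
# prints: a&#228;&#8364;
print u"a\u00e4\u20ac".encode("ascii", "backslashreplace")
# prints: a\xe4\u20ac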
Rationale
Most legacy encodings do not support the full range of Unicode
characters. For these cases many high level protocols support a
way of escaping a Unicode character (e.g. Python itself supports
the \x, \u and \U convention, XML supports character references
via &#xxx; etc.).
When implementing such an encoding algorithm, a problem with the
current implementation of the encode method of Unicode objects
becomes apparent: For determining which characters are unencodable
by a certain encoding, every single character has to be tried,
because encode does not provide any information about the location
of the error(s), so
# (1)
us = u"xxx"
s = us.encode(encoding)
has to be replaced by
# (2)
us = u"xxx"
v = []
for c in us:
    try:
        v.append(c.encode(encoding))
    except UnicodeError:
        v.append("&#%d;" % ord(c))
s = "".join(v)
This slows down encoding dramatically as now the loop through the
string is done in Python code and no longer in C code.
Furthermore this solution poses problems with stateful encodings.
For example UTF-16 uses a Byte Order Mark at the start of the
encoded byte string to specify the byte order. Using (2) with
UTF-16 results in an 8-bit string with a BOM between every
character.
To work around this problem, a stream writer - which keeps state
between calls to the encoding function - has to be used:
# (3)
us = u"xxx"
import codecs, cStringIO as StringIO
writer = codecs.getwriter(encoding)
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"&#%d;" % ord(c))
s = v.getvalue()
To compare the speed of (1) and (3) the following test script has
been used:
# (4)
import time
us = u"äa"*1000000
encoding = "ascii"
import codecs, cStringIO as StringIO
t1 = time.time()
s1 = us.encode(encoding, "replace")
t2 = time.time()
writer = codecs.getwriter(encoding)
v = StringIO.StringIO()
uv = writer(v)
for c in us:
    try:
        uv.write(c)
    except UnicodeError:
        uv.write(u"?")
s2 = v.getvalue()
t3 = time.time()
assert(s1==s2)
print "1:", t2-t1
print "2:", t3-t2
print "factor:", (t3-t2)/(t2-t1)
On Linux this gives the following output (with Python 2.3a0):
1: 0.274321913719
2: 51.1284689903
factor: 186.381278466
i.e. (3) is 180 times slower than (1).
Callbacks must be stateless, because as soon as a callback is
registered it is available globally and can be called by multiple
encode() calls. To be able to use stateful callbacks, the errors
parameter for encode/decode/translate would have to be changed
from char * to PyObject *, so that the callback could be used
directly, without the need to register the callback globally. As
this requires changes to lots of C prototypes, this approach was
rejected.
Currently all encoding/decoding functions have arguments
const Py_UNICODE *p, int size
or
const char *p, int size
to specify the unicode characters/8bit characters to be
encoded/decoded. So in case of an error the codec has to create a
new unicode or str object from these parameters and store it in
the exception object. The callers of these encoding/decoding
functions extract these parameters from str/unicode objects
themselves most of the time, so it could speed up error handling
if these objects were passed directly. As this again requires
changes to many C functions, this approach has been rejected.
Implementation Notes
A sample implementation is available as SourceForge patch #432401
[2]. The current version of this patch differs from the
specification in the following way:
* The error information is passed from the codec to the callback
not as an exception object, but as a tuple, which has an
additional entry state, which can be used for additional
information the codec might want to pass to the callback.
* There are two separate registries (one for
encoding/translating and one for decoding)
The class codecs.StreamReaderWriter uses the errors parameter for
both reading and writing. To be more flexible this should
probably be changed to two separate parameters for reading and
writing.
The errors parameter of PyUnicode_TranslateCharmap is not
available to Python, which makes testing of the new functionality
of PyUnicode_TranslateCharmap impossible with Python scripts. The
patch should add an optional argument errors to unicode.translate
to expose the functionality and make testing possible.
Codecs that do something other than encoding/decoding from/to
unicode and want to use the new machinery can define their own
exception classes, and the strict handler will automatically work
with them. The other predefined error handlers are unicode specific
and expect to get a Unicode(Encode|Decode|Translate)Error
exception object, so they won't work.
Backwards Compatibility
The semantics of unicode.encode with errors="replace" has changed:
The old version always stored a ? character in the output string
even if no character was mapped to ? in the mapping. With the
proposed patch, the replacement string from the callback
will again be looked up in the mapping dictionary. But as all
supported encodings are ASCII based, and thus map ? to ?, this
should not be a problem in practice.
References
[1] SF feature request #403100
"Multicharacter replacements in PyUnicode_TranslateCharmap"
http://www.python.org/sf/403100
[2] SF patch #432401 "unicode encoding error callbacks"
http://www.python.org/sf/432401
Copyright
This document has been placed in the public domain.
Local Variables:
mode: indented-text
indent-tabs-mode: nil
sentence-end-double-space: t
fill-column: 70
End:
I often find myself needing priority queues in python, and I've finally
broken down and written a simple implementation. Previously I've used
sorted lists (via bisect) to get the job done, but the heap code
significantly improves performance. There are C-based implementations,
but compiling in an extension often isn't worth the effort. I'm
including the code here for everyone's amusement.
Any chance something like this could make it into the standard python
library? It would save a lot of time for lazy people like myself. :-)
Cheers,
-Kevin
def heappush(heap, item):
    pos = len(heap)
    heap.append(None)
    while pos:
        parentpos = (pos - 1) / 2
        parent = heap[parentpos]
        if item <= parent:
            break
        heap[pos] = parent
        pos = parentpos
    heap[pos] = item

def heappop(heap):
    endpos = len(heap) - 1
    if endpos <= 0:
        return heap.pop()
    returnitem = heap[0]
    item = heap.pop()
    pos = 0
    while 1:
        child2pos = (pos + 1) * 2
        child1pos = child2pos - 1
        if child2pos < endpos:
            child1 = heap[child1pos]
            child2 = heap[child2pos]
            if item >= child1 and item >= child2:
                break
            if child1 > child2:
                heap[pos] = child1
                pos = child1pos
                continue
            heap[pos] = child2
            pos = child2pos
            continue
        if child1pos < endpos:
            child1 = heap[child1pos]
            if child1 > item:
                heap[pos] = child1
                pos = child1pos
        break
    heap[pos] = item
    return returnitem
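A minimal usage sketch of the functions above (note that, as written,
they implement a max-heap, so heappop returns the largest remaining
item):

heap = []
for x in [5, 1, 9, 3, 7]:
    heappush(heap, x)

while heap:
    print heappop(heap),    # prints: 9 7 5 3 1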
--
------------------------------------------------------------------------
| Kevin O'Connor "BTW, IMHO we need a FAQ for |
| kevin(a)koconnor.net 'IMHO', 'FAQ', 'BTW', etc. !" |
------------------------------------------------------------------------
Kevin Altis kindly forwarded a 1.5MB XML database with about 6600 company
records. A record looks like this after running his script to turn them
into Python dicts:
{'Address': '395 Page Mill Road\nPalo Alto, CA 94306',
'Company': 'Agilent Technologies Inc.',
'Exchange': 'NYSE',
'NumberOfEmployees': '41,000',
'Phone': '(650) 752-5000',
'Profile': 'http://biz.yahoo.com/p/a/a.html',
'Symbol': 'A',
'Web': 'http://www.agilent.com'}
It appears to me that the XML file is maintained by hand, in order of ticker
symbol. But people make mistakes when alphabetizing by hand, and there are
37 indices i such that
data[i]['Symbol'] > data[i+1]['Symbol']
So it's "almost sorted" by that measure, with a few dozen glitches -- and
msort should be able to exploit this! I think this is an important case of
real-life behavior. The proper order of Yahoo profile URLs is also strongly
correlated with ticker symbol, while both the company name and web address
look weakly correlated, so there's hope that msort can get some benefit on
those too.
Here are runs sorting on all field names, building a DSU tuple list to sort
via
values = [(x.get(fieldname), x) for x in data]
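Concretely, the timing harness looked roughly like the following
sketch (a hypothetical reconstruction, not the actual script; "msort"
here stands for the experimental sort from the patch, and time.clock()
for the timer):

import time

def time_sort(data, fieldname, sortfunc, repeat=5):
    # Build the DSU list as described above, sort a fresh copy each
    # time, and report the elapsed times in milliseconds.
    values = [(x.get(fieldname), x) for x in data]
    times = []
    for i in range(repeat):
        copy = values[:]
        t0 = time.clock()
        sortfunc(copy)            # e.g. list.sort, or the msort patch
        times.append((time.clock() - t0) * 1000.0)
    times.sort()
    return times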
Each field sort was run 5 times under sort, and under msort. So 5 times are
reported for each sort, reported in milliseconds, and listed from quickest
to slowest:
Sorting on field 'Address' -- 6589 records
via sort: 43.03 43.35 43.37 43.54 44.14
via msort: 45.15 45.16 45.25 45.26 45.30
Sorting on field 'Company' -- 6635 records
via sort: 40.41 40.55 40.61 42.36 42.63
via msort: 30.68 30.80 30.87 30.99 31.10
Sorting on field 'Exchange' -- 6579 records
via sort: 565.28 565.49 566.70 567.12 567.45
via msort: 573.29 573.61 574.55 575.34 576.46
Sorting on field 'NumberOfEmployees' -- 6531 records
via sort: 120.15 120.24 120.26 120.31 122.58
via msort: 134.25 134.29 134.50 134.74 135.09
Sorting on field 'Phone' -- 6589 records
via sort: 53.76 53.80 53.81 53.82 56.03
via msort: 56.05 56.10 56.19 56.21 56.86
Sorting on field 'Profile' -- 6635 records
via sort: 58.66 58.71 58.84 59.02 59.50
via msort: 8.74 8.81 8.98 8.99 8.99
Sorting on field 'Symbol' -- 6635 records
via sort: 39.92 40.11 40.19 40.38 40.62
via msort: 6.49 6.52 6.53 6.72 6.73
Sorting on field 'Web' -- 6632 records
via sort: 47.23 47.29 47.36 47.45 47.45
via msort: 37.12 37.27 37.33 37.42 37.89
So the hopes are realized: msort gets huge benefit from the nearly-sorted
Symbol field, also huge benefit from the correlated Profile field, and
highly significant benefit from the weakly correlated Company and Web
fields. K00L!
The Exchange field sort is so bloody slow because there are few distinct
Exchange values, and whenever there's a tie on those the tuple comparison
routine tries to break it by comparing the dicts. Note that I warned about
this kind of thing a week or two ago, in the context of trying to implement
priority queues by storing and comparing (priority, object) tuples -- it can
be a timing disaster if priorities are ever equal.
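To make the tie-breaking cost concrete, here is a tiny hypothetical
illustration (the second record is invented): when the first tuple
items compare equal, tuple comparison falls through to comparing the
dicts themselves:

a = ('NYSE', {'Company': 'Agilent Technologies Inc.'})
b = ('NYSE', {'Company': 'Some Other Co.'})
# The 'NYSE' keys tie, so a < b ends up comparing the two dicts,
# which is expensive and meaningless for sorting purposes.
print a < b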
The other fields (Phone, etc) are in essentially random order, and msort is
systematically a bit slower on all of those. Note that these are all string
comparisons. I don't think it's a coincidence that msort went from a major
speedup on the Python-Dev task, to a significant slowdown, when I removed
all duplicate lines and shuffled the corpus first.
Only part of this can be accounted for by # of comparisons. On a given
random input, msort() may do fewer or more comparisons than sort(), but
doing many trials suggests that sort() has a small edge in # of compares on
random data, on the order of 1 or 2%. This is easy to believe, since msort
does a few things it *knows* won't repay the cost if the order happens to be
random. These are well worth it, since they're what allow msort to get huge
wins when the data isn't random.
But that's not enough to account for things like the >10% higher runtime in
the NumberOfEmployees sort. I can't reproduce this magnitude of systematic
slowdown when doing random sorts on floats or ints, so I conclude it has
something to do with string compares. Unlike int and float compares, a
string compare takes variable time, depending on how long the common prefix
is. I'm not aware of specific research on this topic, but it's plausible to
me that partitioning may be more effective than merging at reducing the
number of comparisons specifically involving "nearly equal" elements.
Indeed, the fastest string-sorting methods I know of move heaven and earth
to avoid redundant prefix comparisons, and do so by partitioning.
Turns out <wink> that doesn't account for it, though. Here are the total
number of comparisons (first number on each line) done for each sort, and
the sum across all string compares of the number of common prefix characters
(second number on each line):
Sorting on field 'Address' -- 6589 records
via sort: 76188 132328
via msort: 76736 131081
Sorting on field 'Company' -- 6635 records
via sort: 76288 113270
via msort: 56013 113270
Sorting on field 'Exchange' -- 6579 records
via sort: 34851 207185
via msort: 37457 168402
Sorting on field 'NumberOfEmployees' -- 6531 records
via sort: 76167 116322
via msort: 76554 112436
Sorting on field 'Phone' -- 6589 records
via sort: 75972 278188
via msort: 76741 276012
Sorting on field 'Profile' -- 6635 records
via sort: 76049 1922016
via msort: 8750 233452
Sorting on field 'Symbol' -- 6635 records
via sort: 76073 73243
via msort: 8724 16424
Sorting on field 'Web' -- 6632 records
via sort: 76207 863837
via msort: 58811 666852
Contrary to prediction, msort always got the smaller "# of equal prefix
characters" total, even in the Exchange case, where it did nearly 10% more
total comparisons.
Python's string compare goes fastest if the first two characters aren't the
same, so maybe sort() gets a systematic advantage there? Nope. Turns out
msort() gets out early 17577 times on that basis when doing
NumberOfEmployees, but sort() only gets out early 15984 times.
I conclude that msort is at worst only a tiny bit slower when doing
NumberOfEmployees, and possibly a little faster. The only measure that
doesn't agree with that conclusion is time.clock() -- but time is such a
subjective thing I won't let that bother me <wink>.
I keep running into the problem that there is no reliable way to introspect
about whether a type supports multi-pass iterability (in the sense that an
input stream might support only a single pass, but a list supports multiple
passes). I suppose you could check for __getitem__, but that wouldn't cover
linked lists, for example.
Can anyone channel Guido's intent for me? Is this an oversight or a
deliberate design decision? Is there an interface for checking
multi-pass-ability that I've missed?
TIA,
Dave
Back in February, there was a thread in comp.lang.python (and, I
think, also on Python-Dev) that asked whether the following behavior:
>>> 'abcde'.split('')
Traceback (most recent call last):
File "<stdin>", line 1, in ?
ValueError: empty separator
was a bug or a feature. The prevailing opinion at the time seemed
to be that there was not a sensible, unique way of defining this
operation, so rejecting it was a feature.
That answer didn't bother me particularly at the time, but since then
I have learned a new fact (or perhaps an old fact that I didn't notice
at the time) that has changed my mind: Section 4.2.4 of the library
reference says that the 'split' method of a regular expression object
is defined as
Identical to the split() function, using the compiled pattern.
This claim does not appear to be correct:
>>> import re
>>> re.compile('').split('abcde')
['abcde']
This result differs from the result of using the string split method.
In other words, the documentation doesn't match the actual behavior,
so the status quo is broken.
It seems to me that there are four reasonable courses of action:
1) Do nothing -- the problem is too trivial to worry about.
2) Change string split (and its documentation) to match regexp split.
3) Change regexp split (and its documentation) to match string split.
4) Change both string split and regexp split to do something else :-)
My first impulse was to argue that (4) is right, and that the behavior
should be as follows
>>> 'abcde'.split('')
['a', 'b', 'c', 'd', 'e']
>>> import re
>>> re.compile('').split('abcde')
['a', 'b', 'c', 'd', 'e']
When this discussion came up last time, I think there was an objection
that s.split('') was ambiguous: What argument is there in favor of
'abcde'.split('') being ['a', 'b', 'c', 'd', 'e'] instead of, say,
['', 'a', 'b', 'c', 'd', 'e', ''] or, for that matter, ['', 'a', '',
'b', '', 'c', '', 'd', '', 'e', '']?
I made the counterargument that one could disambiguate by adding the
rule that no element of the result could be equal to the delimiter.
Therefore, if s is a string, s.split('') cannot contain any empty
strings.
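In other words (a sketch of the behavior I'm arguing for, not of any
existing implementation):

def split_empty(s):
    # Splitting on '' yields the individual characters; no element of
    # the result equals the (empty) delimiter.
    return [c for c in s]

print split_empty('abcde')    # prints: ['a', 'b', 'c', 'd', 'e']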
However, looking at the behavior of regular expression splitting more
closely, I become more confused. Can someone explain the following
behavior to me?
>>> re.compile('a|(x?)').split('abracadabra')
['', None, 'br', None, 'c', None, 'd', None, 'br', None, '']
An enormous amount of research has been done on sorting since the last time
I wrote a sort for Python. Major developments have been in two areas:
1. Adaptive sorting. Sorting algorithms are usually tested on random
data, but in real life you almost never see random data. Python's
sort tries to catch some common cases of near-order via special-
casing. The literature has since defined more than 15 formal
measures of disorder, and developed algorithms provably optimal in
the face of one or more of them. But this is O() optimality, and
theoreticians aren't much concerned about how big the constant
factor is. Some researchers are up front about this, and toward
the end of one paper with "practical" in its title, the author was
overjoyed to report that an implementation was only twice as slow
as a naive quicksort <wink>.
2. Pushing the worst-case number of comparisons closer to the
information-theoretic limit (ceiling(log2(N!))).
I don't care much about #2 -- in experiments conducted when it was new, I
measured the # of comparisons our samplesort hybrid did on random inputs,
and it was never more than 2% over the theoretical lower bound, and
typically closer. As N grows large, the expected case provably converges to
the theoretical lower bound. There remains a vanishingly small chance for a
bad case, but nobody has reported one, and at the time I gave up trying to
construct one.
Back on Earth, among Python users the most frequent complaint I've heard is
that list.sort() isn't stable. Alex is always quick to trot out the
appropriate DSU (Decorate Sort Undecorate) pattern then, but the extra
memory burden for that can be major (a new 2-tuple per list element costs
about 32 bytes, then 4 more bytes for a pointer to it in a list, and 12 more
bytes that don't go away to hold each non-small index).
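For concreteness, a sketch of that DSU pattern (the index i breaks
ties so equal keys keep their original relative order; the helper name
is made up here):

def stable_sort_by(seq, keyfunc):
    # Decorate with (key, original index, item), sort, undecorate.
    decorated = [(keyfunc(seq[i]), i, seq[i]) for i in range(len(seq))]
    decorated.sort()
    return [item for (key, i, item) in decorated]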
After reading all those papers, I couldn't resist taking a crack at a new
algorithm that might be practical, and have something you might call a
non-recursive adaptive stable natural mergesort / binary insertion sort
hybrid. In playing with it so far, it has two bad aspects compared to our
samplesort hybrid:
+ It may require temp memory, up to 2*N bytes worst case (one pointer each
for no more than half the array elements).
+ It gets *some* benefit for arrays with many equal elements, but not
nearly as much as I was able to hack samplesort to get. In effect,
partitioning is very good at moving equal elements close to each other
quickly, but merging leaves them spread across any number of runs.
This is especially irksome because we're sticking to Py_LT for
comparisons, so can't even detect a==b without comparing a and b twice
(and then it's a deduction from that not a < b and not b < a). Given
the relatively huge cost of comparisons, it's a timing disaster to do
that (compare twice) unless it falls out naturally. It was fairly
natural to do so in samplesort, but not at all in this sort.
It also has good aspects:
+ It's stable (items that compare equal retain their relative order, so,
e.g., if you sort first on zip code, and a second time on name, people
with the same name still appear in order of increasing zip code; this
is important in apps that, e.g., refine the results of queries based
on user input).
+ The code is much simpler than samplesort's (but I think I can
fix that <wink>).
+ It gets benefit out of more kinds of patterns, and without lumpy
special-casing (a natural mergesort has to identify ascending and
descending runs regardless, and then the algorithm builds on just
that).
+ Despite that I haven't micro-optimized it, in the random case it's
almost as fast as the samplesort hybrid. In fact, it might
have been a bit faster had I run tests yesterday (the samplesort
hybrid got sped up by 1-2% last night). This one surprised me the
most, because at the time I wrote the samplesort hybrid, I tried
several ways of coding mergesorts and couldn't make it as fast.
+ It has no bad cases (O(N log N) is worst case; N-1 compares is best).
Here are some typical timings, taken from Python's sortperf.py, over
identical lists of floats:
Key:
*sort: random data
\sort: descending data
/sort: ascending data
3sort: ascending data but with 3 random exchanges
~sort: many duplicates
=sort: all equal
!sort: worst case scenario
That last one was a worst case for the last quicksort Python had before it
grew the samplesort, and it was a very bad case for that. By sheer
coincidence, turns out it's an exceptionally good case for the experimental
sort:
samplesort
 i    2**i  *sort  \sort  /sort  3sort  ~sort  =sort  !sort
15   32768   0.13   0.01   0.01   0.10   0.04   0.01   0.11
16   65536   0.24   0.02   0.02   0.23   0.08   0.02   0.24
17  131072   0.54   0.05   0.04   0.49   0.18   0.04   0.53
18  262144   1.18   0.09   0.09   1.08   0.37   0.09   1.16
19  524288   2.58   0.19   0.18   2.34   0.76   0.17   2.52
20 1048576   5.58   0.37   0.36   5.12   1.54   0.35   5.46

timsort
15   32768   0.16   0.01   0.02   0.05   0.14   0.01   0.02
16   65536   0.24   0.02   0.02   0.06   0.19   0.02   0.04
17  131072   0.55   0.04   0.04   0.13   0.42   0.04   0.09
18  262144   1.19   0.09   0.09   0.25   0.91   0.09   0.18
19  524288   2.60   0.18   0.18   0.46   1.97   0.18   0.37
20 1048576   5.61   0.37   0.35   1.00   4.26   0.35   0.74
If it weren't for the ~sort column, I'd seriously suggest replacing the
samplesort with this. 2*N extra bytes isn't as bad as it might sound, given
that, in the absence of massive object duplication, each list element
consumes at least 12 bytes (type pointer, refcount and value) + 4 bytes for
the list pointer. Add 'em all up and that's a 13% worst-case temp memory
overhead.