[Datetime-SIG] PEP-431/495

Fri Aug 28 18:12:10 CEST 2015

[Tim]
> ...
> Inside pytz you already know everything that can be known about
> transitions.  You don't "poke and hope" to do that, you do a binary
> search, right?  You find the zoneinfo record for the time of interest,
> and compare that to the transitions on either side to deduce whether
> there's a fold or gap in play.

I should try to flesh that out.  I don't work with tzfiles (I'm on
Windows - nobody gives a shit about history on Windows ;-) ).  My
understanding is that you have a sorted listed of transition times, in
UTC.  I'll call that list `tt` (transition times).  Also a parallel
list of (at least) total UTC offsets.  I'll call that list `ofs`.

`fold` should be 1 if and only if the UTC time `u` is the later of
times ambiguous in the local zone.  Ambiguous times exist if and only
if a transition decreases the total UTC offset.  There are some before
the transition ("at the end of its life"), and some after the
transition ("at the start of its life").  fold is 1 only in the latter
case (the later of ambiguous times).

So first do a binary seach on `tt` to find the largest transition time
<= u.  Call the index `i`.  You must already have code to do that,
yes?

Then:

    if i == 0:
        # First entry in the file - there is no
        # transition _to_ this.
        return 0
    # How much did the total UTC offset change?
    delta = ofs[i] - ofs[i-1]
    if delta >= 0:
        # The offset didn't change, or the
        # offset increased so this a gap (not
        # fold) case.
        return 0
    # delta < 0, so the offset decreased.  All
    # and only the times from tt[i] up to (but
    # not including) tt[i] - delta ("-" because
    # delta is negative) are the later of
    # ambiguous times.
    return int(u - tt[i] < -delta)

    # or int(tt[i] - u > delta) to save a "-"
    # at the cost of making it incomprehensible

All of that may be wrong ;-)  The _point_ is that the base ideas are
so straightforward that if the code works in practice for some
non-trivial case, it probably works for almost all cases.  I don't
know enough about tzfiles to be sure of anything.  In particular, a
`delta` of 0 _may_ mean there's some kind of more-or-less artificial
(having nothing to do with offsets) entry in the file, and it's
necessary to search back for the closest preceding offset that's _not_
the same as ofs[i].  For example, maybe just the name of the zone
changed?  If so, then runtime search could be avoided by pre-computing
such stuff when loading the tzfile to begin with.  For that matter, a
parallel `delta` list could be precomputed once-&-for-all.  This
_could_ be made to run very fast.   Getting the right answer is minor
compared to running fast ;-)

Then compare those 6 lines of code to the incredible gyrations any
implementation of mktime() goes through to get is_dst right (if you
can find a mktime() implementation that always does get it right).