[Python-ideas] Exploring the 'strview' concept further

Thu Dec 8 00:51:45 CET 2011

On Thu, Dec 8, 2011 at 1:39 AM, Antoine Pitrou <solipsis at pitrou.net> wrote:
>> Alternatively, if codecs were
>> classified to explicitly indicate when they implemented stateless
>> fixed width encodings, then strview could simply be restricted to only
>> working with that subset of possible encodings.
>
> From a usability POV this seems undesirable. On the other hand, if
> complete decoding is required, calling str() is just as cheap.

Yeah, that's kind of where I'm going with this. For stateless
encodings, views make sense - it's really just a memory view with a
particular way of interpreting the individual characters and providing
the string API rather than the bytes one. For the multitude of
ASCII-compatible single-byte encodings and the various fixed-width
encodings, that could be very useful. With some fiddling, you could
support BOM and signature encodings, too (just by offsetting your view
a bit and adjusting your interpretation of the individual code
points).
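
To make the stateless case concrete, here's a rough sketch of what such
a view might look like for a single-byte codec. The FixedWidthView name
and its methods are made up for illustration (this is not a proposed
API), and it assumes a bytes-like buffer and only handles Latin-1,
where each byte value is also the code point:

```python
class FixedWidthView:
    """Hypothetical read-only text view over Latin-1 encoded bytes."""

    def __init__(self, data):
        # memoryview references the underlying buffer without copying it
        self._view = memoryview(data)

    def __len__(self):
        return len(self._view)

    def __getitem__(self, index):
        if isinstance(index, slice):
            # Slicing a memoryview shares the buffer, so this stays a view
            return FixedWidthView(self._view[index])
        # One byte is one code point in Latin-1, so indexing is O(1)
        return chr(self._view[index])

    def __str__(self):
        # Full decoding only happens when a real string is requested
        return self._view.tobytes().decode('latin-1')


view = FixedWidthView(b'caf\xe9 au lait')
```

Here view[3] gives the one-character str for U+00E9 without decoding
anything else, and view[0:4] is another view over the same buffer.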

But for the fully general case of stateful encodings (including all
variable width encodings) it is basically impossible to do O(1)
indexing (which is the whole reason the Unicode model is the way it
is). Especially once PEP 393 is in place, you rapidly reach a point of
diminishing returns where converting the whole shebang to Unicode code
points and working directly on the code point array is the right
answer (and, if it isn't, you're clearly doing something sufficiently
sophisticated that you're going to be OK with rolling your own tools
to deal with the problem).
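
As an illustration of why variable width encodings defeat O(1)
indexing, here's a minimal sketch that finds the byte offset of the Nth
code point in UTF-8 data. The utf8_offset helper is hypothetical, and
it assumes well-formed UTF-8 and a non-negative index. Without a side
table there's no way to skip ahead; every preceding byte has to be
examined:

```python
def utf8_offset(data, index):
    """Return the byte offset of code point number `index` in UTF-8 `data`."""
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes look like 0b10xxxxxx; any other byte
        # starts a new code point
        if (byte & 0xC0) != 0x80:
            if seen == index:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


data = 'na\xefve caf\xe9'.encode('utf-8')
```

The 'v' is code point 3 but lives at byte offset 4, because the
preceding U+00EF takes two bytes.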

In those terms, I'm actually wondering if it might be appropriate to
extract some of the tools I created for the urllib.parse case and
publish them via the string module.

1. Provide a string.Text ABC (why *did* we put UserString in
collections, anyway?)
2. Provide a "coerce_to_str" helper:

    def coerce_to_str(*args, encoding, errors='strict'):
        # Decodes any non-text args to str and returns them along with
        # a result coercion function:
        #   - a no-op for text inputs
        #   - an encoding function otherwise
        # False inputs (including None) are all coerced to the empty string.
        # "Text" is the string.Text ABC proposed in item 1.
        args_are_text = isinstance(args[0], Text)
        if args_are_text:
            def _encode_result(obj):
                return obj
        else:
            def _encode_result(obj):
                return obj.encode(encoding, errors)
        def _decode(obj):
            if not obj:
                return ''
            if isinstance(obj, Text):
                return str(obj)
            return obj.decode(encoding, errors)
        def _decode_args(args):
            return tuple(map(_decode, args))
        for arg in args[1:]:
            # We special-case False values to support the relatively common
            # use of None and the empty string as default arguments
            if arg and args_are_text != isinstance(arg, Text):
                raise TypeError("Cannot mix text and non-text arguments")
        return _decode_args(args) + (_encode_result,)

Note the special-casing of None would be sufficient to support
arbitrary defaults in binary/text polymorphic APIs:

    def f(a, b=None):
        a_str, b_str, _coerce_result = coerce_to_str(a, b, encoding='utf-8')
        if b is None:
            b_str = "Default text"

Cheers,
Nick.

>> 7. The new type would similarly support the full string API, returning
>> actual string objects rather than any kind of view.
>
> Even for slicing?

If we restricted strview to stateless encodings, then slicing could
also return views. There wouldn't be any point in returning a view for
iteration or indexing, though: the view object would be bigger than
any single-character string. In fact, we could probably figure out a
cutoff whereby real strings are returned for sufficiently small
slices, too.
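
A minimal sketch of that cutoff idea, using a plain memoryview as a
stand-in for the view type. The slice_text name and the CUTOFF value
are invented for the example (a real threshold would need to be
measured), and it assumes a bytes-like buffer in a single-byte
encoding:

```python
CUTOFF = 32  # arbitrary threshold in characters, not a measured value


def slice_text(buf, start, stop, encoding='latin-1'):
    """Return a real str for short slices and a view for long ones."""
    piece = memoryview(buf)[start:stop]
    if len(piece) <= CUTOFF:
        # Small slice: copying is cheaper than keeping a view object alive
        return piece.tobytes().decode(encoding)
    # Large slice: keep sharing the original buffer
    return piece


short = slice_text(b'a' * 100, 0, 5)
long_ = slice_text(b'a' * 100, 0, 50)
```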

Cheers,
Nick.

-- 
Nick Coghlan   |   ncoghlan at gmail.com   |   Brisbane, Australia