[Tutor] use of __new__

Steven D'Aprano steve at pearwood.info
Fri Mar 12 01:53:16 CET 2010


On Fri, 12 Mar 2010 06:03:35 am spir wrote:
> Hello,
>
> I need a custom unicode subtype (with additional methods). This will
> not be directly used by the user, instead it is just for internal
> purpose. I would like the type to be able to cope with either a byte
> str or a unicode str as argument. In the first case, it needs to be 
> first decoded. I cannot do it in __init__ because unicode will first 
> try to decode it as ascii, which fails in the general case. 

Are you aware that you can pass an explicit encoding to unicode?

>>> print unicode('cdef', 'utf-16')
摣晥
>>> help(unicode)

Help on class unicode in module __builtin__:

class unicode(basestring)
 |  unicode(string [, encoding[, errors]]) -> object
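
To make the encoding and errors arguments concrete, here is a small sketch.
It assumes UTF-8 input; the alias text_type is my own invention so that the
same lines run on Python 2 (where it is unicode) and Python 3 (where it is
str). It is an illustration, not part of your code.

```python
# text_type is a made-up alias: unicode on Python 2, str on Python 3.
text_type = type(u'')

raw = b'caf\xc3\xa9'  # the UTF-8 bytes for u'caf\xe9'

# Passing an explicit encoding decodes the bytes correctly.
decoded = text_type(raw, 'utf-8')

# The same bytes are not valid ASCII, so strict decoding fails...
try:
    text_type(raw, 'ascii')
except UnicodeDecodeError:
    strict_failed = True
else:
    strict_failed = False

# ...unless the errors argument asks for replacement characters instead.
lossy = text_type(raw, 'ascii', 'replace')
```

With 'replace', each undecodable byte becomes U+FFFD rather than raising.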



> So, I 
> must have my own __new__. The issue is the object (self) is then a
> unicode one instead of my own type.
>
> class Unicode(unicode):
>     Unicode.FORMAT = "utf8"
>     def __new__(self, text, format=None):
>         # text can be str or unicode
>         format = Unicode.FORMAT if format is None else format
>         if isinstance(text,str):
>             text = text.decode(format)
>         return text
>     .......
>
> x = Unicode("abc")	# --> unicode, not Unicode

That's because you return a unicode object :) Python doesn't magically 
convert the result of __new__ into your class; in fact, Python 
specifically allows __new__ to return something else. That's fairly 
unusual, but it does come in handy.
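
A minimal sketch of that rule, using two made-up classes (Plain and
Impostor are hypothetical names, not anything from your code). When __new__
returns an instance of the class, __init__ is called on it; when it returns
anything else, the caller simply gets that object and __init__ is skipped.

```python
class Plain(object):
    """__new__ returns an instance of cls, so __init__ runs afterwards."""
    def __new__(cls, value):
        return super(Plain, cls).__new__(cls)

    def __init__(self, value):
        self.value = value


class Impostor(object):
    """__new__ returns an object that is not an instance of cls."""
    def __new__(cls, value):
        return str(value)  # handed back to the caller unchanged

    def __init__(self, value):
        # Never called, because __new__ did not return an Impostor.
        raise AssertionError("unreachable")


a = Plain(42)
b = Impostor(42)
```

Here a is a Plain with a.value set, while b is an ordinary str.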

"format" is not a good name to use. The accepted term is "encoding". You 
should also try to match the function signature of the built-in unicode 
object, including the no-argument form unicode() -> u''.

Writing Unicode.FORMAT in the definition of Unicode can't work, because 
the name Unicode isn't bound until the class body has finished executing:

>>> class Unicode(unicode):
...     Unicode.FORMAT = 'abc'
...
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 2, in Unicode
NameError: name 'Unicode' is not defined

So it looks like you've posted something slightly different from what 
you are actually running.
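
For comparison, a short sketch of what does work (Broken and Working are
made-up names for illustration). Inside the class body a bare name becomes
a class attribute; the class's own name is only usable once the class
statement has finished, which includes code inside methods, since that
runs later.

```python
# Referring to the class by name inside its own body fails.
try:
    class Broken(object):
        Broken.FORMAT = "utf8"
except NameError as exc:
    error = str(exc)


class Working(object):
    FORMAT = "utf8"  # a bare name in the body becomes a class attribute

    def get_format(self):
        # Fine: the name is looked up when the method is called,
        # by which time the class exists.
        return Working.FORMAT


# Also fine: the class statement has completed by now.
Working.OTHER = "latin-1"
```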

I have tried to match the behaviour of the built-in unicode as closely 
as I am able. See here:
http://docs.python.org/library/functions.html#unicode


class Unicode(unicode):
    """Unicode(string [, encoding[, errors]]) -> object

    Special Unicode class that has all sorts of wonderful 
    methods missing from the built-in unicode class.
    """
    _ENCODING = "utf8"
    _ERRORS = "strict"
    def __new__(cls, string='', encoding=None, errors=None):
        # If either encoding or errors is specified, then always
        # attempt decoding of the first argument.
        if (encoding, errors) != (None, None):
            if encoding is None: encoding = cls._ENCODING
            if errors is None: errors = cls._ERRORS
            # Pass cls, not Unicode, so subclasses get their own type.
            obj = super(Unicode, cls).__new__(
                  cls, string, encoding, errors)
        else:  # Never attempt decoding.
            obj = super(Unicode, cls).__new__(cls, string)
        assert isinstance(obj, Unicode)
        return obj


>>> Unicode()
u''
>>> Unicode('abc')
u'abc'
>>> Unicode('cdef', 'utf-16')
u'\u6463\u6665'
>>> Unicode(u'abcd')
u'abcd'
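
If you really do want byte strings decoded with a default encoding whenever
no encoding is given, which is what your original __new__ was aiming for,
something along these lines might do. This is only a sketch and it
deliberately departs from the built-in's behaviour: the names Text,
text_type and shout are hypothetical, and unlike unicode() it quietly
ignores an encoding passed along with text that is already decoded. The
text_type alias lets it run on both Python 2 and Python 3.

```python
text_type = type(u'')  # unicode on Python 2, str on Python 3


class Text(text_type):
    """Hypothetical variant: byte strings are always decoded."""
    _ENCODING = "utf-8"
    _ERRORS = "strict"

    def __new__(cls, string=u'', encoding=None, errors=None):
        if isinstance(string, bytes):
            # Byte input: decode it, filling in the class defaults.
            if encoding is None:
                encoding = cls._ENCODING
            if errors is None:
                errors = cls._ERRORS
            string = string.decode(encoding, errors)
        # Passing cls makes the result an instance of Text
        # (or of whichever subclass was called).
        return super(Text, cls).__new__(cls, string)

    def shout(self):
        """Example extra method; re-wraps the result explicitly."""
        return type(self)(self.upper())


t = Text(b'caf\xc3\xa9')
loud = t.shout()
```

Note that inherited methods such as upper() return a plain text object, not
a Text, which is why shout() wraps the result again.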





-- 
Steven D'Aprano

