
"""
sys.intern(b'12121212')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: must be str, not bytes
"""

I wonder why.

--
Jesús Cea Avión
jcea@jcea.es - http://www.jcea.es/
Twitter: @jcea
jabber / xmpp:jcea@jabber.org
"Things are not so easy"
"My name is Dump, Core Dump"
"Love is placing your happiness in the happiness of another" - Leibniz

2013/9/20 Jesus Cea <jcea@jcea.es>:
> """
> sys.intern(b'12121212')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: must be str, not bytes
> """
>
> I wonder why.

Interned strings optimize dictionary lookups. In Python 3, most dictionaries use str keys (e.g. the __dict__ of classes).

What would be the use case for interned bytes objects?

Victor
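
A minimal sketch of the point about dictionary lookups, assuming CPython 3. It shows that sys.intern() hands back one shared object for equal strings built at run time, and that bytes is rejected. The variable names are illustrative and do not come from the thread.

```python
import sys

# Build two equal strings at run time so the compiler cannot merge them.
parts = ["inter", "ned_", "key"]
a = "".join(parts)
b = "".join(parts)

# The two strings are equal in value.
equal_value = (a == b)

# sys.intern() maps both to a single canonical object, so a dict lookup
# can succeed on an identity check without comparing characters.
ia = sys.intern(a)
ib = sys.intern(b)
same_object = ia is ib

# bytes is rejected, as in the traceback that started the thread.
try:
    sys.intern(b"12121212")
    bytes_error = None
except TypeError as exc:
    bytes_error = exc

print(equal_value, same_object, type(bytes_error).__name__)
```

The identity check is what makes interned keys cheap to look up: the dictionary can stop at "same object" instead of falling back to a full string comparison.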

On 20/09/13 14:04, Victor Stinner wrote:
> What would be the use case for interned bytes objects?

Performance and memory. Pickle sizes (my particular issue now).

--
Jesús Cea Avión

On Fri, 20 Sep 2013 15:14:37 +0200, Jesus Cea <jcea@jcea.es> wrote:
> On 20/09/13 14:04, Victor Stinner wrote:
> > What would be the use case for interned bytes objects?
>
> Performance and memory. Pickle sizes (my particular issue now).

sys.intern is an internal interpreter optimization and should be orthogonal to pickling. If pickle can't detect already-seen bytes objects, then you may file an improvement request on the bug tracker.

Regards

Antoine.

On 20/09/13 15:31, Antoine Pitrou wrote:
> sys.intern is an internal interpreter optimization and should be orthogonal to pickling. If pickle can't detect already-seen bytes objects, then you may file an improvement request on the bug tracker.

Understood. Thanks for the clarification.

Pickle correctly handles "same object" references, but not "different objects but equivalent". That is the issue.

But for most uses this is not a problem, and implementing this redundancy removal looks like a performance cost that few users would benefit from but everybody would pay. Pickle is already slow enough now :).

--
Jesús Cea Avión

On Fri, 20 Sep 2013 15:36:45 +0200, Jesus Cea <jcea@jcea.es> wrote:
> On 20/09/13 15:31, Antoine Pitrou wrote:
> > sys.intern is an internal interpreter optimization and should be orthogonal to pickling. If pickle can't detect already-seen bytes objects, then you may file an improvement request on the bug tracker.
>
> Understood. Thanks for the clarification.
>
> Pickle correctly handles "same object" references, but not "different objects but equivalent". That is the issue.

Ah, well, in that case the issue is not in pickle, it's in your code. pickle doesn't try to guess if "equal" is really functionally equivalent to "identical".

Regards

Antoine.
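
The "identical" versus "equal" distinction can be sketched as follows, assuming CPython 3 and the standard pickle module. The payload and list sizes are made up for illustration; the sizes printed will vary with the pickle protocol.

```python
import pickle

payload = b"x" * 100

# One bytes object referenced 1000 times: the pickler memoizes it by
# identity and writes short back-references after the first copy.
shared = [payload] * 1000

# 1000 equal but distinct bytes objects: each one is written in full.
distinct = [bytes(bytearray(payload)) for _ in range(1000)]

shared_size = len(pickle.dumps(shared, protocol=pickle.HIGHEST_PROTOCOL))
distinct_size = len(pickle.dumps(distinct, protocol=pickle.HIGHEST_PROTOCOL))

print(shared_size, distinct_size)

# Identity is preserved on load as well: every element is one object.
loaded = pickle.loads(pickle.dumps(shared))
same_after_load = all(item is loaded[0] for item in loaded)
```

Both lists compare equal, yet only the first one pickles compactly, which is the behaviour described in the messages above.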

Well, the pickler should memoize bytes objects if you have lots of the same one in a pickle...

2013/9/20 Jesus Cea <jcea@jcea.es>:
> On 20/09/13 14:04, Victor Stinner wrote:
> > What would be the use case for interned bytes objects?
>
> Performance and memory. Pickle sizes (my particular issue now).
>
> _______________________________________________
> Python-Dev mailing list
> Python-Dev@python.org
> https://mail.python.org/mailman/listinfo/python-dev

--
Regards,
Benjamin

On 20/09/13 15:33, Benjamin Peterson wrote:
> Well, the pickler should memoize bytes objects if you have lots of the same one in a pickle...

Only if they are the very same object, not different bytes objects with the same value. Pickle doesn't do "a==b" but "id(a)==id(b)".

Yes, I know that "a==b" would break mutable objects. It is just an example.

I don't want to pursue that path. Pickle performance is already appallingly slow.

In my project, I will do the redundancy removal in my own way, as explained in another message in this thread.

Example:

* Original pickle: 14416284 bytes

* Pickle with "interned" strings: 3004880 bytes (quite an improvement, but this is particular to my case, since I have a lot of string duplication here. The pickle also loads a bit faster)

* Pickle including an extra dictionary of "interned" strings, created using the "interned.setdefault(object,object)" pattern: 5126587 bytes. Sniff.

Could I do this more compactly?

--
Jesús Cea Avión

Jesus Cea, 20.09.2013 15:46:
> On 20/09/13 15:33, Benjamin Peterson wrote:
> > Well, the pickler should memoize bytes objects if you have lots of the same one in a pickle...
>
> Only if they are the very same object, not different bytes objects with the same value. Pickle doesn't do "a==b" but "id(a)==id(b)".
>
> Yes, I know that "a==b" would break mutable objects. It is just an example.
>
> I don't want to pursue that path. Pickle performance is already appallingly slow.
>
> In my project, I will do the redundancy removal in my own way, as explained in another message in this thread.
>
> Example:
>
> * Original pickle: 14416284 bytes
>
> * Pickle with "interned" strings: 3004880 bytes (quite an improvement, but this is particular to my case, since I have a lot of string duplication here. The pickle also loads a bit faster)
>
> * Pickle including an extra dictionary of "interned" strings, created using the "interned.setdefault(object,object)" pattern: 5126587 bytes. Sniff.
>
> Could I do this more compactly?

ISTM that what you are looking for is a compression-like pattern that efficiently encodes repeated literals (i.e. constants of safe types) in the pickle. That could be achieved by extending the pickle protocol to include backreferences to earlier objects, I guess (I'm not all that familiar with the internals of the pickle format). Any of the well-known compression algorithms that are capable of handling streaming data would apply here. Assuming you don't want to simply send the pickle output through gzip & friends, that is...

It also seems to me that python-dev isn't the right place to discuss this. python-ideas seems more appropriate for now.

Stefan
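
The "gzip & friends" option could look roughly like this. It is a sketch that assumes the standard zlib module (the same DEFLATE algorithm gzip uses) and an invented, highly redundant data set, so the ratio it prints says nothing about the pickle sizes quoted in the thread.

```python
import pickle
import zlib

# Many equal but distinct bytes objects: the redundant case that the
# pickler's identity-based memo does not collapse.
records = [
    bytes(bytearray(b"customer-record-")) + str(i % 10).encode("ascii")
    for i in range(5000)
]

raw = pickle.dumps(records, protocol=pickle.HIGHEST_PROTOCOL)
packed = zlib.compress(raw)

print(len(raw), len(packed))

# Round trip: decompress first, then unpickle.
restored = pickle.loads(zlib.decompress(packed))
```

Compression only shrinks the stored bytes. After loading, the equal values are still separate objects in memory, so it does not replace deduplication when memory use is the concern.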

On Fri, 20 Sep 2013 13:19:24 +0200, Jesus Cea <jcea@jcea.es> wrote:
> """
> sys.intern(b'12121212')
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
> TypeError: must be str, not bytes
> """
>
> I wonder why.

From http://docs.python.org/3.3/library/sys.html#sys.intern

"""sys.intern(string)
Enter string in the table of “interned” strings and return the interned string [...]"""

In Python 3 context, "string" means "str".

Regards

Antoine.

On 20/09/13 14:15, Antoine Pitrou wrote:
> From http://docs.python.org/3.3/library/sys.html#sys.intern
>
> """sys.intern(string)
> Enter string in the table of “interned” strings and return the interned string [...]"""
>
> In Python 3 context, "string" means "str".

I read that, Antoine. In fact, I read the manual and thought it was a mistake carried over from the 2.x documentation. I tried it just in case before reporting the "documentation mistake", and I was surprised it was actually true :-).

I know that intern is used for performance reasons internally to the interpreter. But I am thinking about memory usage optimizations. For instance, I have a pickle that is 14MB in size; after "interning" the strings in it (there is a lot of redundancy), the new size is only 3MB and it loads faster. I can do that because most of the data in the pickle are strings; I could NOT do it if I used bytes.

I could do a manual "intern" for hashable objects using an "object:object" dictionary (that would work for integers too), but I wonder if extending the builtin "sys.intern" would be something to consider.

Anyway, this pattern is easy enough:

Instead of

    object = sys.intern(object)

I could do

    interned = dict()
    ...
    object = interned.setdefault(object, object)

--
Jesús Cea Avión
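
The setdefault pattern above can be wrapped in a small helper. This is a sketch only: intern_any and the sample data are hypothetical names chosen for illustration, not code from the project discussed in the thread.

```python
import pickle

# Maps each distinct hashable value to the first object seen with that value.
interned = {}

def intern_any(obj):
    """Return one canonical object for each distinct hashable value."""
    return interned.setdefault(obj, obj)

# Equal but distinct bytes objects, built at run time.
values = [bytes(bytearray(b"repeated-value")) for _ in range(1000)]
deduped = [intern_any(v) for v in values]

# Every element now refers to the same object ...
all_same = all(v is deduped[0] for v in deduped)

# ... so the pickler's identity-based memo can collapse the repeats.
before = len(pickle.dumps(values))
after = len(pickle.dumps(deduped))
print(before, after)
```

Two caveats apply to this approach. The dictionary holds a strong reference to every value, so nothing is released until the dictionary itself is dropped. Dictionary keys also compare by equality, so 1, 1.0 and True would all collapse to whichever was stored first.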

On Fri, 20 Sep 2013 15:33:05 +0200, Jesus Cea <jcea@jcea.es> wrote:
> On 20/09/13 14:15, Antoine Pitrou wrote:
> > From http://docs.python.org/3.3/library/sys.html#sys.intern
> >
> > """sys.intern(string)
> > Enter string in the table of “interned” strings and return the interned string [...]"""
> >
> > In Python 3 context, "string" means "str".
>
> I read that, Antoine. In fact, I read the manual and thought it was a mistake carried over from the 2.x documentation. I tried it just in case before reporting the "documentation mistake", and I was surprised it was actually true :-).
>
> I know that intern is used for performance reasons internally to the interpreter. But I am thinking about memory usage optimizations. For instance, I have a pickle that is 14MB in size; after "interning" the strings in it (there is a lot of redundancy), the new size is only 3MB and it loads faster. I can do that because most of the data in the pickle are strings; I could NOT do it if I used bytes.
>
> I could do a manual "intern" for hashable objects using an "object:object" dictionary (that would work for integers too), but I wonder if extending the builtin "sys.intern" would be something to consider.
>
> Anyway, this pattern is easy enough:
>
> Instead of
>
>     object = sys.intern(object)
>
> I could do
>
>     interned = dict()
>     ...
>     object = interned.setdefault(object, object)

Yes. The main difference is that sys.intern() will remove the interned strings when every external reference vanishes. It requires either weakref'ability (which both str and bytes lack) or special cooperation from the object destructor (which is why sys.intern() is restricted to str instead of working with arbitrary objects).

Regards

Antoine.
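
The lack of weakref support mentioned above can be checked directly. A minimal sketch, assuming CPython 3; supports_weakref and Node are illustrative names only.

```python
import weakref

def supports_weakref(obj):
    """Report whether a weak reference to obj can be created."""
    try:
        weakref.ref(obj)
    except TypeError:
        return False
    return True

class Node:
    """Ordinary user-defined class; its instances accept weak references."""

str_ok = supports_weakref("some text")
bytes_ok = supports_weakref(b"some bytes")
node_ok = supports_weakref(Node())

print(str_ok, bytes_ok, node_ok)
```

Because str and bytes instances cannot be weakly referenced, a pure-Python interning table cannot notice when its entries become unused. That is the gap sys.intern() fills for str with help from the destructor.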

On 20/09/13 15:44, Antoine Pitrou wrote:
> Yes. The main difference is that sys.intern() will remove the interned strings when every external reference vanishes. It requires either weakref'ability (which both str and bytes lack) or special cooperation from the object destructor (which is why sys.intern() is restricted to str instead of working with arbitrary objects).

Great comment. Thanks.

Why don't str/bytes support weakrefs, besides memory use?

--
Jesús Cea Avión

2013/9/20 Jesus Cea <jcea@jcea.es>:
> On 20/09/13 15:44, Antoine Pitrou wrote:
> > Yes. The main difference is that sys.intern() will remove the interned strings when every external reference vanishes. It requires either weakref'ability (which both str and bytes lack) or special cooperation from the object destructor (which is why sys.intern() is restricted to str instead of working with arbitrary objects).
>
> Great comment. Thanks.
>
> Why don't str/bytes support weakrefs, besides memory use?

Is increased memory use for every str/bytes object not a good enough reason?

--
Regards,
Benjamin

On Fri, Sep 20, 2013 at 9:54 AM, Jesus Cea <jcea@jcea.es> wrote:
> Why don't str/bytes support weakrefs, besides memory use?

The typical use case for weakrefs is to break reference cycles, but str and bytes can't *be* part of a reference cycle, so outside of interning-like use cases, there's no need for weakref support there.

On 09/20/2013 06:50 PM, PJ Eby wrote:
> On Fri, Sep 20, 2013 at 9:54 AM, Jesus Cea <jcea@jcea.es> wrote:
> > Why don't str/bytes support weakrefs, besides memory use?
>
> The typical use case for weakrefs is to break reference cycles,

Another typical use case, and the prime reason why languages without reference counting tend to introduce weak references, is managing object caches with automatic disposal of otherwise unused items. Such a cache is rarely necessary for primitive objects, so Python's choice to save the memory that weakref support would cost is quite justified.

However, if one wanted to implement their own sys.intern(), the inability to create weak references to strings would become a problem. This is one reason why sys.intern() directly fiddles with reference counts instead of reusing the weakref machinery. (The other, of course, being that intern predates weakrefs by many years.)
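
The cache-with-automatic-disposal use case might be sketched like this, using weakref.WeakValueDictionary from the standard library. Resource and get_resource are hypothetical names; the cached values must be instances of a weak-referenceable class, which is exactly what str and bytes are not.

```python
import gc
import weakref

class Resource:
    """Stand-in for an expensive object worth caching."""
    def __init__(self, name):
        self.name = name

# Entries disappear automatically once nothing else references the value.
cache = weakref.WeakValueDictionary()

def get_resource(name):
    """Return the cached Resource for name, creating it on first use."""
    res = cache.get(name)
    if res is None:
        res = Resource(name)
        cache[name] = res
    return res

first = get_resource("config")
second = get_resource("config")
shared = first is second

# Drop the last strong references; gc.collect() covers implementations
# that do not free objects immediately via reference counting.
del first, second
gc.collect()
evicted = "config" not in cache

print(shared, evicted)
```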

participants (7)
- Antoine Pitrou
- Benjamin Peterson
- Hrvoje Niksic
- Jesus Cea
- PJ Eby
- Stefan Behnel
- Victor Stinner