RE: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable. However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
-----Original Message-----
From: python-checkins-admin@python.org [mailto:python-checkins-admin@python.org] On Behalf Of M.-A. Lemburg
Sent: Saturday, April 20, 2002 11:26 AM
To: loewis@sourceforge.net
Cc: python-checkins@python.org
Subject: Re: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
loewis@sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory usw-pr-cvs1:/tmp/cvs-serv30961

Modified Files:
	unicodeobject.c
Log Message:
Patch #495401: Count number of required bytes for encoding UTF-8 before allocating the target buffer.
Martin, please back out this change again. We have discussed this quite a few times and I am against using your strategy, since it introduces a performance hit that is not justified by the gained advantage of (temporarily) using less memory.
Your timings also show this, so I wonder why you checked in this patch, e.g. from the patch log:

"""
For the current CVS (unicodeobject.c 2.136: MAL's change to use a variable overalloc), I get
10 spaces                 20.060
100 spaces                 2.600
200 spaces                 2.030
1000 spaces                0.930
10000 spaces               0.690
10 spaces, 3 bytes        23.520
100 spaces, 3 bytes        3.730
200 spaces, 3 bytes        2.470
1000 spaces, 3 bytes       0.980
10000 spaces, 3 bytes      0.690
30 bytes                  24.800
300 bytes                  5.220
600 bytes                  3.830
3000 bytes                 2.480
30000 bytes                2.230
With unicode3.diff (that's the one you checked in), I get
10 spaces                 19.940
100 spaces                 3.260
200 spaces                 2.340
1000 spaces                1.650
10000 spaces               1.450
10 spaces, 3 bytes        21.420
100 spaces, 3 bytes        3.410
200 spaces, 3 bytes        2.420
1000 spaces, 3 bytes       1.660
10000 spaces, 3 bytes      1.450
30 bytes                  22.260
300 bytes                  5.830
600 bytes                  4.700
3000 bytes                 3.740
30000 bytes                3.540
"""
The only case where your patch is faster is for very short strings, and then only by a few percent, whereas for all longer strings you get worse timings, e.g. 3.74 seconds compared to 2.48 seconds -- that's a 50% increase in run-time!
Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                    http://www.egenix.com/
Python Software:                         http://www.egenix.com/files/python/
_______________________________________________
Python-checkins mailing list
Python-checkins@python.org
http://mail.python.org/mailman/listinfo/python-checkins
Tim Peters
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable.
Exactly. I think the most recent version is worse than the one we had before.
However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
I found the overallocation strategy that this code had just so embarrassing: a single three-byte character will cause an overallocation of three times the memory, which means that the final realloc will certainly truncate lots of bytes. As a result, we are at the mercy of the realloc implementation here: if realloc copies memory when shrinking buffers (as pymalloc might some day do), performance will get worse. Since this appears to be religious, I'm backing the patch out.

Regards,
Martin
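For readers following along, the strategy Martin checked in can be sketched as follows. This is only an illustration in Python of the count-then-allocate idea -- the real code is C in unicodeobject.c -- showing the two passes and the bounds assert the thread keeps coming back to:

```python
def utf8_length(s):
    """First pass: count exactly how many UTF-8 bytes s will need."""
    n = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            n += 1
        elif cp < 0x800:
            n += 2
        elif cp < 0x10000:
            n += 3
        else:
            n += 4
    return n

def encode_utf8_counted(s):
    """Second pass: fill an exactly-sized buffer; no realloc needed."""
    out = bytearray(utf8_length(s))
    i = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out[i] = cp
            i += 1
        elif cp < 0x800:
            out[i] = 0xC0 | (cp >> 6)
            out[i + 1] = 0x80 | (cp & 0x3F)
            i += 2
        elif cp < 0x10000:
            out[i] = 0xE0 | (cp >> 12)
            out[i + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 2] = 0x80 | (cp & 0x3F)
            i += 3
        else:
            out[i] = 0xF0 | (cp >> 18)
            out[i + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[i + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 3] = 0x80 | (cp & 0x3F)
            i += 4
    assert i == len(out)  # counting pass and filling pass must agree
    return bytes(out)
```

The cost being argued about is that every character is examined twice (once to count, once to encode), which is where the slowdown on long strings comes from.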
[martin@v.loewis.de]
... I found the overallocation strategy that this code had just so embarrassing: a single three-byte character will cause an overallocation of three times the memory, which means that the final realloc will certainly truncate lots of bytes.
Well, Python routinely over-allocates string space all over the place, and it doesn't really cost more to throw away a million bytes at the end than to throw away five bytes. That is, the cost of a realloc shrink is at worst proportional to the number of retained bytes; the number of bytes thrown away doesn't matter. For that reason, the thing that puzzles me more about the current state of the code is why it doesn't overallocate by a factor of 4 from the start, and skip all the delicate, repeated "oops! I still didn't get enough memory" logic. It's almost certainly going to do a shrinking realloc at the end anyway.
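Tim's alternative -- overallocate by the worst-case factor up front, then do a single shrinking realloc at the end -- can be sketched the same way (again an illustrative Python sketch of C-level logic, using `ch.encode` as a stand-in for the inlined encoding):

```python
def encode_utf8_overalloc(s):
    # Worst case for any single code point is 4 UTF-8 bytes, so a buffer
    # of len(s) * 4 bytes can never be overrun -- no repeated
    # "oops! I still didn't get enough memory" logic is needed.
    out = bytearray(len(s) * 4)
    i = 0
    for ch in s:
        encoded = ch.encode("utf-8")  # stand-in for the inlined C encoding
        out[i:i + len(encoded)] = encoded
        i += len(encoded)
        assert i <= len(out)  # the bounds assert the thread is arguing for
    del out[i:]  # the single shrinking "realloc" at the end
    return bytes(out)
```

Only one pass over the characters, at the price of temporarily holding up to four bytes per character.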
As a result, we are at the mercy of the realloc implementation here: If realloc copies memory (such as Pymalloc might some day) when shrinking buffers, performance will get worse.
Since pymalloc only handles small blocks by itself, such copies would be of short contiguous memory regions that are almost certainly in L1 cache (since the encoder function just finished crawling over them). For that reason I doubt the speed hit would be monstrous.
Since this appears to be religious, I'm backing the patch out.
It needn't be religious <wink>: with sufficient effort we can quantify the tradeoff between the time pymalloc's realloc would take to move shrinking blocks, and the space wasted by realloc leaving them in place. I mentioned before that I want to give pymalloc a new entry point for realloc callers who specifically do or don't want a shrinking realloc to copy memory. For string storage I *suspect* it's better to move the memory (typically exactly one shrinking realloc, after which the string storage is immutable until death), while for most other kinds of storage it's better to leave the block alone (mutable memory that may shrink and grow repeatedly over time).
For string storage I *suspect* it's better to move the memory (typically exactly one shrinking realloc, after which the string storage is immutable until death)
OTOH, for many string objects, death comes quickly, e.g. if they get concatenated to some other string. It is difficult to anticipate how long any given string will live - wasted memory appears to be best measured in bytes*seconds. Regards, Martin
"Martin v. Loewis" wrote:
Since this appears to be religious, I'm backing the patch out.
Thanks. Please note that there's nothing religious about this: if you come up with a counting version of the codec which has similar performance or is even faster, I'd have no objections at all to your checkin. BTW, you could easily add your codec to the Python codec registry as e.g. 'utf8alt'. People could then enable it on memory-starved machines by editing the codec alias list in encodings/aliases.py.

--
Marc-Andre Lemburg
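MAL's suggestion of a separately registered codec could look roughly like this. The name 'utf8alt' is his hypothetical suggestion; in this sketch it simply delegates to the built-in UTF-8 codec, where a real version would plug in the counting encoder instead:

```python
import codecs

def _utf8alt_search(name):
    # Search function for the hypothetical 'utf8alt' codec; a real
    # version would supply the count-before-allocate encoder here.
    if name == "utf8alt":
        utf8 = codecs.lookup("utf-8")
        return codecs.CodecInfo(encode=utf8.encode,
                                decode=utf8.decode,
                                name="utf8alt")
    return None

codecs.register(_utf8alt_search)
```

After registration, `u"caf\u00e9".encode("utf8alt")` works like any other codec name.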
Tim Peters wrote:
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable.
Are you sure you're talking about the latest version I checked in? I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
So, why not just add the assert to my original version?
-----Original Message-----
From: python-checkins-admin@python.org [mailto:python-checkins-admin@python.org] On Behalf Of M.-A. Lemburg
Sent: Saturday, April 20, 2002 11:26 AM
To: loewis@sourceforge.net
Cc: python-checkins@python.org
Subject: Re: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
loewis@sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory usw-pr-cvs1:/tmp/cvs-serv30961

Modified Files:
	unicodeobject.c
Log Message:
Patch #495401: Count number of required bytes for encoding UTF-8 before allocating the target buffer.
Martin, please back out this change again. We have discussed this quite a few times and I am against using your strategy, since it introduces a performance hit that is not justified by the gained advantage of (temporarily) using less memory.
Your timings also show this, so I wonder why you checked in this patch, e.g. from the patch log:

"""
For the current CVS (unicodeobject.c 2.136: MAL's change to use a variable overalloc), I get
10 spaces                 20.060
100 spaces                 2.600
200 spaces                 2.030
1000 spaces                0.930
10000 spaces               0.690
10 spaces, 3 bytes        23.520
100 spaces, 3 bytes        3.730
200 spaces, 3 bytes        2.470
1000 spaces, 3 bytes       0.980
10000 spaces, 3 bytes      0.690
30 bytes                  24.800
300 bytes                  5.220
600 bytes                  3.830
3000 bytes                 2.480
30000 bytes                2.230
With unicode3.diff (that's the one you checked in), I get
10 spaces                 19.940
100 spaces                 3.260
200 spaces                 2.340
1000 spaces                1.650
10000 spaces               1.450
10 spaces, 3 bytes        21.420
100 spaces, 3 bytes        3.410
200 spaces, 3 bytes        2.420
1000 spaces, 3 bytes       1.660
10000 spaces, 3 bytes      1.450
30 bytes                  22.260
300 bytes                  5.830
600 bytes                  4.700
3000 bytes                 3.740
30000 bytes                3.540
"""
The only case where your patch is faster is for very short strings, and then only by a few percent, whereas for all longer strings you get worse timings, e.g. 3.74 seconds compared to 2.48 seconds -- that's a 50% increase in run-time!
Thanks,
--
Marc-Andre Lemburg
--
Marc-Andre Lemburg
[Tim]
... behavior, I viewed the version Martin replaced as unacceptable.
[M.-A. Lemburg]
Are you sure you're talking about the latest version I checked in?
Calling the version Martin checked in N, I'm talking about versions N-3, N-2, and N-1. N-3 and N-2 were unacceptable because they wrote out of bounds. N-1 ("the version Martin replaced") was unacceptable because it still didn't assert that it wasn't writing out of bounds. I asked repeatedly in the bug reports opened against N-3 and N-2 that asserts be added. If that had been done in version N-2, at least Barry, Martin, you and I wouldn't have spent additional hours chasing down what turned out to be more out-of-bounds writes (a debug-build run would have triggered an assert directly in the flawed code).
I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
You were pretty sure about N-2 too.
... So, why not just add the assert to my original version?
I don't know why you didn't <wink>. Martin backed out version N, so we're back to N-1, except I see Martin added a crucial assert for you. I added some more since then.
Tim Peters wrote:
[Tim]
... behavior, I viewed the version Martin replaced as unacceptable.
[M.-A. Lemburg]
Are you sure you're talking about the latest version I checked in?
Calling the version Martin checked in N, I'm talking about versions N-3, N-2, and N-1. N-3 and N-2 were unacceptable because they wrote out of bounds. N-1 ("the version Martin replaced") was unacceptable because it still didn't assert that it wasn't writing out of bounds. I asked repeatedly in the bug reports opened against N-3 and N-2 that asserts be added. If that had been done in version N-2, at least Barry, Martin, you and I wouldn't have spent additional hours chasing down what turned out to be more out-of-bounds writes (a debug-build run would have triggered an assert directly in the flawed code).
Tim, I don't get it... why all the fuss about some missing asserts?
I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
You were pretty sure about N-2 too, and the more hours it takes to make tricky code correct, the more suspect that code is. As I most recently implored, in a comment on Barry's bug report against N-2: What I do care about is that there weren't (and still aren't) asserts *verifying* that this delicate code isn't spilling over the allocated bounds.
About timing, last time we went around on this, the "measure once, cut once" version of the code was significantly slower in my timing tests too. I don't care so much if the code is tricky, but the trickier the code the more asserts are required.
You checked in N-1 (and N-2) without responding to comments like that, and we're all paying for it. You realize asserts go away in the release build, right? They don't cost anything in production mode, and they save our ass in debug mode.
I must have missed Barry's post, sorry. I didn't leave out the asserts for any reason -- just didn't think about using them.
... So, why not just add the assert to my original version?
I don't know why you didn't <wink>. Martin backed out version N, so we're back to N-1, except I see Martin added a crucial assert for you. I added some more since then.
Thanks.

--
Marc-Andre Lemburg
[MAL]
Tim, I don't get it... why all the fuss about some missing asserts?
They were requested repeatedly, and would have saved everyone hours of debugging time. You didn't add asserts when requested, and didn't oppose them either -- they just got ignored. The last two times this routine was implicated in memory corruption, you assumed it was a pymalloc bug, and Martin had to do all the work of identifying the true cause and then convincing you of it. The first time that was understandable, but the second time it wasn't; that the third version of the code still didn't try to catch overwrites was over the edge.
I must have missed Barry's post, sorry.
Barry filed the latest bug report, and you closed it. The memory corruption showed up as a bad UTF-8 translation in Mailman: http://www.python.org/sf/541828
I didn't leave out the asserts for any reason -- just didn't think about using them.
You would have, had you read the comments on the bug reports you closed, so I'm not sure what to make of that. I'll be happy if you use asserts in the future, when the code is getting too tricky to be obviously correct; that shouldn't need to be a protracted battle.
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial? In any case, I wasn't looking for a fight when requesting to back out Martin's changes.

Tim Peters wrote:
[MAL]
Tim, I don't get it... why all the fuss about some missing asserts?
They were requested repeatedly, and would have saved everyone hours of debugging time. You didn't add asserts when requested, and didn't oppose them either -- they just got ignored.
Tim, I wasn't aware that you were requesting that I add these. I remember that you mentioned something about using assert()s in the Python core many months ago, after a discussion with Guido, but it wasn't clear to me that you wanted me to add assert()s to the codecs. The reason I wasn't adding assert()s is simply that I normally don't use them in C programming. This example shows that it would probably have been better to do so, and I'll try to remember this in the future.
The last two times this routine was implicated in memory corruption, you assumed it was a pymalloc bug, and Martin had to do all the work of identifying the true cause and then convincing you of it. The first time that was understandable, but the second time it wasn't; that the third version of the code still didn't try to catch overwrites was over the edge.
True, the code was complicated due to the many different paths the codec could take. In the last version, I ripped out all the counting code because of the troubles it caused previously. In any case, before spending too much time understanding the code, why didn't you just ask for a better explanation? I would have been the last to deny that request.
I must have missed Barry's post, sorry.
Barry filed the latest bug report, and you closed it. The memory corruption showed up as a bad UTF-8 translation in Mailman:
True, but Martin and you continued discussing the bug after it had been closed. I hadn't read those messages, because I thought they were part of the usual SF duplication of message delivery. Sorry about that.
I didn't leave out the asserts for any reason -- just didn't think about using them.
You would have, had you read the comments on the bug reports you closed, so I'm not sure what to make of that. I'll be happy if you use asserts in the future, when the code is getting too tricky to be obviously correct; that shouldn't need to be a protracted battle.
Will do. Again, there was no fight intended.

assert(peace),
--
Marc-Andre Lemburg
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
The latter definitely plays a role -- when I was going to IETF meetings back around 94/95, character set issues were always good for a few big fights, and I think the people in Asia are still not all in agreement. Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.) For a feature that few of the developers ever need to use for themselves (I believe everyone with CVS commit privileges needs at most Latin-1 for their own language :-), I can understand that makes it a touchy issue. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thursday, April 25, 2002, at 08:59, Guido van Rossum wrote:
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial?
[...]
Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.)
Another thing that bothers me is that it retroactively changed the interpretation of other Python objects. For me it's perfectly logical that a character string is a character string, unless there's a very good reason to treat it differently (a framebuffer scanline, a binary blob, etc). And so if I have an API OpenFileWithUnicodeName() that accepts a unicode filename, I expect that if I pass an 8-bit filename it will be converted on the fly. Other people focus on different sets of APIs, however, and think there's nothing more logical than interpreting the string object as a binary buffer containing UTF-16 values or what-have-you.

Scanlines or binary blobs are hardly ever mixed with filenames, so there wasn't an issue before unicode raised its pretty/ugly head.

(Of course it could be argued that unicode has demonstrated a design flaw in Python, namely that a single data-type was used to store both binary data of unknown interpretation and character arrays, and that there's now little more to be done about that.)
--
- Jack Jansen
Jack Jansen wrote:
(of course it could be argued that unicode has demonstrated a design flaw in Python, namely that a single data-type was used to store both binary data of unknown interpretation and character arrays, and that there's now little more to be done about that).
This is probably the most significant problem with the usage of strings -- and there's not much we can do about it, since it's hard-wired into the programmer's mind... most other languages have the same problem. In the long run, I think that the Unicode type should be used for all text data, and strings for binary data of unknown encoding.

--
Marc-Andre Lemburg
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1.
This has nothing to do with a particular encoding; it's about meta data: much like you can use numbers to represent e.g. money or date/time values, you always need to store this meta information somewhere, and object types are the ideal carrier for such meta data. It's an option which everybody working in the application space should consider. If you start working this way now, it'll pay off a few years ahead, regardless of whether it is useful for you now or not.

--
Marc-Andre Lemburg
Guido van Rossum
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1.
I agree with MAL: the run will just be *very* long. Regards, Martin
mal> In the long run, I think that the Unicode type should be used for
mal> all text data and strings for binary data of unknown encoding.

In that case, I would suggest you call "string" something else.

Skip
Skip Montanaro wrote:
mal> In the long run, I think that the Unicode type should be used for
mal> all text data and strings for binary data of unknown encoding.
In that case, I would suggest you call "string" something else.
That would call for world revolution and again put Unicode in a bad light. I'd rather not ;-)

--
Marc-Andre Lemburg
Guido van Rossum wrote:
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial?
The latter definitely plays a role -- when I was going to IETF meetings back around 94/95, character set issues were always good for a few big fights, and I think the people in Asia are still not all in agreement.
Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.)
True; and it was hard enough to get it mostly to a working compromise.
For a feature that few of the developers ever need to use for themselves (I believe everyone with CVS commit privileges needs at most Latin-1 for their own language :-), I can understand that makes it a touchy issue.
Very true indeed. Still, I think Unicode gives us a chance of "fixing" the problem we currently have with strings: Unicode, unlike strings, is only usable for text data, and that makes it ideal as the standard type for text -- we'll never convince people to make a distinction between text and binary data in strings, so offering them Unicode as an alternative is a good strategy, IMHO.

--
Marc-Andre Lemburg
True; and it was hard enough to get it mostly to a working compromise.
Let me add that I very much appreciate your heroic efforts there!!!
Very true indeed. Still, I think Unicode gives us a chance of "fixing" the problem we currently have with strings: Unicode, unlike strings, is only usable for text data, and that makes it ideal as the standard type for text -- we'll never convince people to make a distinction between text and binary data in strings, so offering them Unicode as an alternative is a good strategy, IMHO.
It's a long way before we're there, though -- we'd have to overhaul the I/O system entirely, and that takes a lot of time, not just because of the effort but also because it won't be fully compatible. Also, once 8-bit strings are used for binary data only, I wonder if they shouldn't be more like Java's byte arrays -- i.e. mutable. And they don't need a literal notation. That's another major language change. :-( --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido> Also, once 8-bit strings are used for binary data only, I wonder
Guido> if they shouldn't be more like Java's byte arrays --
Guido> i.e. mutable. And they don't need a literal notation. That's
Guido> another major language change. :-(

How so? In theory, all the ways you write string constructors today would eventually map to Unicode objects. I'm thinking of string literals and constructor functions. That can be handled with the usual "warn for awhile" mechanism. The array module could be used to manipulate mutable arrays of 8-bit data. While permeating the Python innards with Unicode objects would be a major change, I don't see any big syntactic changes -- or is that not what worries you?

Skip
How so? In theory, all the ways you write string constructors today would eventually map to Unicode objects. I'm thinking of string literals and constructor functions. That can be handled with the usual "warn for awhile" mechanism. The array module could be used to manipulate mutable arrays of 8-bit data. While permeating the Python innards with Unicode objects would be a major change I don't see any big syntactic changes - or is that not what worries you?
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work. --Guido van Rossum (home page: http://www.python.org/~guido/)
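Skip's point about the array module -- mutable 8-bit storage without any string literal notation -- already works today; a small sketch:

```python
from array import array

# Mutable 8-bit data, no string literal involved.
buf = array("B", b"hello")   # "B": unsigned bytes
buf[0] = ord("H")            # in-place mutation -- impossible with a string
buf.extend(b", world")
assert buf.tobytes() == b"Hello, world"
```

This is essentially the Java byte[] model Guido mentions, minus the type distinction in the I/O layer.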
On Friday, April 26, 2002, at 06:26, Guido van Rossum wrote:
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work.
It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.

If textstrings and blobs become wholly different objects, this shouldn't create too many problems [see below], except for code that opens a file in binary mode and (partially) reads the resulting file expecting text. But this code would need revisiting anyway if the normal textstring became unicode.

[here's below] To my surprise I think that having blobs and textstrings be unrelated objects creates fewer problems than having the one be a subtype of the other. At least, every time I try to do the subtyping in my head I flip back and forth between textstrings-are-a-subtype-of-general-binary-buffers and binary-buffers-are-a-special-case-of-python-strings every couple of seconds. I think having them both be subtypes of a common base type (basestring) might work, but I'm not sure.
--
- Jack Jansen
"JJ" == Jack Jansen
writes:
JJ> [here's below] To my surprise I think that having blobs and JJ> textstrings be unrelated objects creates less problems than JJ> having the one be a subtype of the other. While I think there's a lot of common operations you might do on blobs and textstrings, I don't see much of a need for them to be related via class hierarchy (although we might find it to be expedient for backwards compatibility). -Barry
[Guido]
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work.
[Jack]
It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.
They might even return different kinds of objects -- arguably, binary files don't need readline() etc., and text files may not need read(n) (though the arg-less variant is handy). If only I had the time to reinvent the I/O library...
If textstrings and blobs become wholly different objects this shouldn't create too many problems [see below], except for code that opens a file in binary mode and (partially) reads the resulting file expecting text. But this code would need revisiting anyway if the normal textstring would become unicode.
Yeah, that's usually just stubborn Unix users who don't believe in the distinction between binary and text mode. :-) Anyway, the proper way to convert between blobs and textstrings would be encodings. That's how Java does it.
[here's below] To my surprise I think that having blobs and textstrings be unrelated objects creates less problems than having the one be a subtype of the other. At least, every time I try to do the subtyping in my head I flip back and forth between textstrings-are-a-subtype-of-general-binary-buffers and binary-buffers-are-a-special-case-of-python-strings every couple of seconds. I think having them both be subtypes of a common base type (basestring) might work, but I'm not sure.
I think they don't need anything in common (apart their sequence-ness). I think Java's byte[] vs. String distinction is about right. --Guido van Rossum (home page: http://www.python.org/~guido/)
Jack Jansen writes:

It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.
That can't work: to get a text string, you need to know the encoding.
Regards, Martin
In my ideal rewrite of the I/O subsystem, an encoding is specified (or a site-or-app-specific default encoding used) when a file is opened. --Guido van Rossum (home page: http://www.python.org/~guido/)
[M.-A. Lemburg]
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it.
Huh -- I thought I was the only one who noticed this <wink>.
I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
Unicode had nothing to do with my yelling in this thread. I've got very low tolerance for memory corruption, regardless of source. When it happens once I'm on high alert, when it happens twice in the same place I go postal. Had this been in dictobject.c or boolobject.c, I would have been just as unhappy.
Now that the memory corruption is thought to be solved, and verified in the debug build regardless, *now* I'll get cranky about foreigners and their lameass character sets <wink>.
On the technical issues remaining, I don't know how to judge the tradeoff between memory use and speed here. If you do, and pymalloc can help in some way, I'll be happy to help.
Tim Peters wrote:
[M.-A. Lemburg]
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it.
Huh -- I thought I was the only one who noticed this <wink>.
Naa, it's occurred to me several times in the past. Unicode seems to trigger some memory corruption in Brain 2.2 which results in spilling out huge amounts of adrenalin and causes the blood pressure to reach record highs ;-)
I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
Unicode had nothing to do with my yelling in this thread. I've got very low tolerance for memory corruption, regardless of source. When it happens once I'm on high alert, when it happens twice in the same place I go postal. Had this been in dictobject.c or boolobject.c, I would have been just as unhappy.
Now that the memory corruption is thought to be solved, and verified in the debug build regardless, *now* I'll get cranky about foreigners and their lameass character sets <wink>.
Good to know.
On the technical issues remaining, I don't know how to judge the tradeoff between memory use and speed here. If you do, and pymalloc can help in some way, I'll be happy to help.
First of all, UTF-8 is probably the most common Unicode encoding used today and will certainly become *the* standard encoding within the next few years. So speed matters a lot in this particular corner of the Unicode implementation.
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...), so overallocation doesn't actually cause bundles of memory chips to heat up. This makes overallocation ideal for the case where you don't know the exact size in advance but where you can estimate a reasonable upper bound.
Now with pymalloc the situation is a bit different for smaller sized memory areas (larger chunks are handed off to the system malloc() which uses the above strategy). As Martin's benchmark showed, the counting strategy is faster for small chunks, and this is probably due to the fact that pymalloc manages these.
Since pymalloc cannot know that an algorithm wants to use overallocation as its memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
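[Editor's note: the overallocate-then-shrink strategy under discussion can be sketched as below. This is a simplified illustration, not the actual unicodeobject.c code: it encodes an array of UCS-2 code units (no surrogate handling), overallocates for the worst case of 3 UTF-8 bytes per unit, and gives the untouched excess back with a shrinking realloc() at the end. The function name is made up.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Encode UCS-2 code units as UTF-8, overallocating for the worst case
 * (3 bytes per unit) and shrinking to the exact size at the end. */
static char *utf8_encode_overalloc(const unsigned short *s, size_t len,
                                   size_t *out_len)
{
    char *buf = malloc(len * 3 + 1);   /* worst-case upper bound */
    char *p = buf;
    size_t i;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned short ch = s[i];
        if (ch < 0x80)                          /* 1-byte sequence */
            *p++ = (char)ch;
        else if (ch < 0x800) {                  /* 2-byte sequence */
            *p++ = (char)(0xC0 | (ch >> 6));
            *p++ = (char)(0x80 | (ch & 0x3F));
        }
        else {                                  /* 3-byte sequence */
            *p++ = (char)(0xE0 | (ch >> 12));
            *p++ = (char)(0x80 | ((ch >> 6) & 0x3F));
            *p++ = (char)(0x80 | (ch & 0x3F));
        }
    }
    *p = '\0';
    *out_len = (size_t)(p - buf);
    /* Return the untouched excess to the allocator; if the shrinking
     * realloc() fails (it shouldn't), keep the original block. */
    {
        char *shrunk = realloc(buf, *out_len + 1);
        return shrunk ? shrunk : buf;
    }
}
```

The point of contention in the thread is exactly the `malloc(len * 3 + 1)` line: it never touches more memory than the result needs, but it does *request* the worst case up front.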
"M.-A. Lemburg"
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...),
That is not true. Each malloc implementation I know will always iterate through the freelist to find an appropriately-sized chunk, and go to the OS if it doesn't find one. Now, the *OS* might implement such allocations as a no-op, but on most hardware, it can do so only in units of memory pages (e.g. 4k). Most strings are smaller than a page, so if you allocate lots of strings, every page allocated from the system will be used as well (at least to fill in the string header). With overallocation, you will indeed overallocate real pages, and consume real memory.
This makes overallocation ideal for the case where you don't know the exact size in advance but where you can estimate a reasonable upper bound.
No. The overallocation has a real cost in terms of memory consumption.
As Martin's benchmark showed, the counting strategy is faster for small chunks and this is probably due to the fact that pymalloc manages these.
I doubt this claim.
Since pymalloc cannot know that an algorithm wants to use overallocation as memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
That won't help. Regards, Martin
[MAL]
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...),
[Martin]
... Most strings are smaller than a page, so if you allocate lots of strings, every page allocated from the system will be used as well (at least to fill in the string header). With overallocation, you will indeed overallocate real pages, and consume real memory.
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on. The only memory *touched* by him is the only memory he actually needs in the end. Indeed, if strings are always smaller than a page, the aggregate overallocation at any point in a single-threaded program will be at worst a few pages total (overallocation is never more than a factor of 4, and the excess is always given back untouched).
As Martin's benchmark showed, the counting strategy is faster for small chunks and this is probably due to the fact that pymalloc manages these.
I doubt this claim.
If you run the test both with and without pymalloc enabled, the results should speak for themselves. I have not, but suspect the difference has more to do with the fact that caches like small memory areas better than large ones, especially when you crawl over a memory area twice. MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string. pymalloc handles blocks that small under either scheme, and indeed your current scheme is getting a speed benefit from the fact that pymalloc currently refuses to copy the data to a smaller block when there's a shrinking realloc at the end.
Since pymalloc cannot know that an algorithm wants to use overallocation as memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
Possibly, yes. The speed-versus-memory tradeoffs still need quantifying, though. pymalloc's current small-block realloc strategy favors speed, and going to the system malloc instead would be slower. You haven't yet told me that's what you actually want <wink>.
Tim Peters
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on.
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory, so the extra bytes in a small string will be wasted for the lifetime of the string object - that still could cause significant memory wastage.
MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string.
[I guess this is a terminology, not a math problem: a 40 character Unicode string has already 80 bytes; the UTF-8 of it can have up to 160 bytes]. Regards, Martin
[Tim]
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on.
[Martin]
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory,
PyMem_Realloc does whatever the system realloc does -- PyMem_Realloc doesn't go thru pymalloc today (except in a PYMALLOC_DEBUG build). Doesn't matter, though, since strings use the PyObject_{Malloc, Free, Realloc} family today, and that does use pymalloc. OTOH, there's no reason PyObject_Realloc *has* to hang on to all small-block memory on a shrinking realloc, and there's no reason pymalloc couldn't grow another realloc entry point specifying what the caller wants a shrinking realloc to do. These things are all easy to change, but I don't know what's truly desirable. Note another subtlety: I expect you brought up PyMem_Realloc because unicodeobject.c uses the PyMem_XYZ family for managing the PyUnicodeObject.str member today. That means it normally never uses pymalloc at all, except to allocate fixed-size PyUnicodeObject structs (which use the PyObject_XYZ memory family). I don't know whether that's the best idea, but that's how it is today. pymalloc gets into this because PyUnicode_EncodeUTF8 returns a plain string object, and the latter uses pymalloc today.
so the extra bytes in a small string will be wasted for the lifetime of the string object - that still could cause significant memory wastage.
It could. Python generally aims to optimize the expected case, not jump thru hoops to avoid worst cases (else we wouldn't use dicts at all <wink>). But I don't know what the expected case is here, and given how often I use Unicode in my own work it could be I'll never have a clue. Note that the expected uses of Unicode strings makes no difference to PyUnicode_EncodeUTF8: what counts there is the expected lifetimes and sizes of the "plain" utf8-encoded PyStringObjects it computes. Indeed, pymalloc has almost no implications for Unicode beyond the encode-as-a-plain-string functions (unless unicodeobject.c is changed to manage the PyUnicodeObject.str member using pymalloc too, as plain strings do today).
MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string.
[I guess this is a terminology, not a math problem:
Nope! Turns out it was an hallucination problem <wink>.
a 40 character Unicode string has already 80 bytes; the UTF-8 of it can have up to 160 bytes].
You're right, of course. The conclusion doesn't change, though: that's still in the range of block sizes pymalloc handles (and will remain so unless I reduce pymalloc's small-object threshold below what's needed for pymalloc to handle small dicts on its own -- which I'm unlikely to do).
[Tim]
The conclusion doesn't change, though: that's still in the range of block sizes pymalloc handles (and will remain so unless I reduce pymalloc's small-object threshold below what's needed for pymalloc to handle small dicts on its own -- which I'm unlikely to do).
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block? (I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.) --Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido]
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block?
These are independent questions, and I don't know how to answer either unless you give me a test program that prints the value of the function you're trying to minimize <0.7 wink>. The Unicode object currently uses quite an elaborate free list, caching both PyUnicodeObject structs (which currently use pymalloc), and their str members (which currently do not). Whether the str member uses pymalloc really doesn't have anything to do with what the UTF8 encoder function does (it returns plain strings, and those already use pymalloc today -- and it's not entirely clear whether they should either!). Counting the bytes in the UTF8 decoder could work well, independent of that: if the result is known to fit in a pymalloc block, just do it; as soon as it's known that it won't, overallocate with assurance that the system realloc will give back everything that isn't used. In the latter case I believe the code could be made much simpler, by doing a factor-of-4 overallocation from the start (it currently tries 2, then 3, then 4, with a bunch of embedded-in-the-loops tests to prevent overwrites; I'm not sure why it bothers with this staggered scheme, since it's going to touch exactly as much memory as it actually needs regardless, and give all the rest back untouched).
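[Editor's note: the counting pass proposed here is cheap to sketch. The following computes exactly how many UTF-8 bytes a UCS-2 string will need, so the result buffer can be allocated once at the right size; surrogate pairs are ignored, as they were on narrow builds of the era, and the function name is invented for the sketch.]

```c
#include <assert.h>
#include <stddef.h>

/* First pass of the count-then-allocate strategy: return the exact
 * number of UTF-8 bytes needed to encode 'len' UCS-2 code units.
 * 1 byte below U+0080, 2 bytes below U+0800, 3 bytes otherwise. */
static size_t utf8_byte_count(const unsigned short *s, size_t len)
{
    size_t n = 0, i;
    for (i = 0; i < len; i++) {
        unsigned short ch = s[i];
        n += (ch < 0x80) ? 1 : (ch < 0x800) ? 2 : 3;
    }
    return n;
}
```

Guido's proposal amounts to: run this loop (and allocate exactly `n` bytes) whenever the worst case would fit in a pymalloc block, and fall back to overallocation for larger strings.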
(I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.)
For a start, yes. I'd stick a "Py_" in front of that symbol and expose it then. The cutoff test would also have to take into account the size of the result's PyStringObject header (the whole stringobject enchilada counts against the threshold).
Guido van Rossum
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block? (I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.)
Given my measurements, that would make sense. I suspect that counting small strings is quite efficient, so that the overhead of iterating over the string twice hides in the noise of additional invocations. Regards, Martin
Tim Peters
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory,
PyMem_Realloc does whatever the system realloc does -- PyMem_Realloc doesn't go thru pymalloc today (except in a PYMALLOC_DEBUG build). Doesn't matter, though, since strings use the PyObject_{Malloc, Free, Realloc} family today, and that does use pymalloc.
That's what I mean (I'm *really* confused about memory family APIs, ever since everything changed :-)
OTOH, there's no reason PyObject_Realloc *has* to hang on to all small-block memory on a shrinking realloc, and there's no reason pymalloc couldn't grow another realloc entry point specifying what the caller wants a shrinking realloc to do. These things are all easy to change, but I don't know what's truly desirable.
Neither do I. Establishing whether releasing the extra memory is worth the effort would require knowing whether the object will be long-lived; neither pymalloc nor its caller is able to tell.
Note another subtlety: I expect you brought up PyMem_Realloc because unicodeobject.c uses the PyMem_XYZ family for managing the PyUnicodeObject.str member today.
No, because I assumed PyMem_Realloc was a synonym for PyObject_Realloc.
That means it normally never uses pymalloc at all, except to allocate fixed-size PyUnicodeObject structs (which use the PyObject_XYZ memory family). I don't know whether that's the best idea, but that's how it is today.
I do think that the Unicode data should be managed by pymalloc as well. Of course, DecodeUTF8 would then raise the same issue: decoding UTF-8 doesn't know how many characters you'll get, either. This currently does not try to be clever, but allocates enough memory for the worst case. Regards, Martin
Yet another idea: decode into a fixed-size stack-allocated buffer. If it fits in that buffer, use PyString_FromStringAndSize(). Otherwise, do the overallocation thing. --Guido van Rossum (home page: http://www.python.org/~guido/)
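[Editor's note: Guido's stack-buffer trick looks roughly like this. It is a schematic with a made-up buffer size and helper names, not the actual SourceForge patch; the toy `encode_into` stands in for the real codec loop and simply doubles every 'x', returning the number of bytes needed while writing at most `cap` of them.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LOCAL_BUF_SIZE 256

/* Stand-in for the real encoding loop: doubles every 'x'.  Returns the
 * number of output bytes needed, writing at most 'cap' of them, so the
 * caller can detect that the stack buffer overflowed. */
static size_t encode_into(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        size_t k = (*src == 'x') ? 2 : 1;
        size_t j;
        for (j = 0; j < k; j++) {
            if (n < cap)
                dst[n] = *src;
            n++;
        }
    }
    return n;
}

/* Encode into a fixed-size stack buffer first; only when the result
 * doesn't fit, redo the work into an exact-sized heap buffer.  No
 * overallocation, and the common (small) case does a single malloc
 * of exactly the right size. */
static char *encode_with_stack_buffer(const char *src, size_t *out_len)
{
    char local[LOCAL_BUF_SIZE];
    size_t n = encode_into(src, local, sizeof(local));
    char *result = malloc(n + 1);

    if (result == NULL)
        return NULL;
    if (n <= sizeof(local))
        memcpy(result, local, n);       /* common case: it fit */
    else
        encode_into(src, result, n);    /* rare case: encode again */
    result[n] = '\0';
    *out_len = n;
    return result;
}
```

The tradeoff is that oversized inputs pay for two encoding passes, which is essentially the count-then-allocate cost; small inputs pay for one pass plus a memcpy.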
[Guido]
Yet another idea: decode into a fixed-size stack-allocated buffer. If it fits in that buffer, use PyString_FromStringAndSize(). Otherwise, do the overallocation thing.
How come we can't use a version of C that pulls this trick for us automatically <wink>? We do end up there a lot. Here's a patch that does it for PyUnicode_EncodeUTF8: http://www.python.org/sf/549375
[martin@v.loewis.de]
That's what I mean (I'm *really* confused about memory family APIs, ever since everything changed :-)
Here's the in-depth course:

    PyMem_xyz    calls the platform malloc/realloc/free (fiddled for
                 x-platform uniformity in NULL and 0 handling)
    PyObject_xyz calls pymalloc's malloc/realloc/free

and instead of a dozen layers of indirection we've now got crushingly straightforward WYSIWYG preprocessor blocks like:

#ifdef WITH_PYMALLOC
#ifdef PYMALLOC_DEBUG
#define PyObject_MALLOC   _PyObject_DebugMalloc
#define PyObject_Malloc   _PyObject_DebugMalloc
#define PyObject_REALLOC  _PyObject_DebugRealloc
#define PyObject_Realloc  _PyObject_DebugRealloc
#define PyObject_FREE     _PyObject_DebugFree
#define PyObject_Free     _PyObject_DebugFree
#else   /* WITH_PYMALLOC && ! PYMALLOC_DEBUG */
#define PyObject_MALLOC   PyObject_Malloc
#define PyObject_REALLOC  PyObject_Realloc
#define PyObject_FREE     PyObject_Free
#endif
#else   /* ! WITH_PYMALLOC */
#define PyObject_MALLOC   PyMem_MALLOC
#define PyObject_REALLOC  PyMem_REALLOC
#define PyObject_FREE     PyMem_FREE
#endif  /* WITH_PYMALLOC */

#define PyObject_Del      PyObject_Free
#define PyObject_DEL      PyObject_FREE

/* for source compatibility with 2.2 */
#define _PyObject_Del     PyObject_Free

All the names you love are still there, it's just that most of them are redundant now <wink>.
... I do think that the Unicode data should be managed by pymalloc as well.
Well, that largely depends on how big these suckers are. Calling PyObject_XYZ adds real overhead if pymalloc can't handle the requested size: all the overhead of the system routines, + the overhead of pymalloc figuring out it can't handle it. I expect it's also not good to mix pymalloc with custom free lists: you hold on to one object from a pymalloc pool, and it prevents the entire pool from getting recycled for another size class. So if you want to investigate using pymalloc more heavily for Unicode objects, I suggest two things: 1. Get rid of the Unicode-specific free list. 2. Change the object layout to embed the str member storage, just as PyStringObject does. #1 is pretty localized, but #2 would require changing a lot of code.
Of course, DecodeUTF8 would then raise the same issue: decoding UTF-8 doesn't know how many characters you'll get, either. This currently does not try to be clever, but allocates enough memory for the worst case.
I just put a patch up on SourceForge that's *less* clever, but shouldn't waste any memory in the end. I expect you'll be happy with it, or rant inconsolably. It's all the same to me <wink>.
Tim Peters
I just put a patch up on SourceForge that's *less* clever, but shouldn't waste any memory in the end. I expect you'll be happy with it, or rant inconsolably. It's all the same to me <wink>.
Not that you care, but I'm happy with it :-) Martin
"MAL" == M
writes:
MAL> I must have missed Barry post, sorry. I didn't leave out the MAL> asserts for any reason -- just didn't think about using them. I think Tim's referring just to the bug report about the regression in s.encode('utf8') that I reported seeing in the email package test. BTW, did you add a separate unittest for this specific failure? -Barry
participants (7)
- barry@zope.com
- Guido van Rossum
- Jack Jansen
- M.-A. Lemburg
- martin@v.loewis.de
- Skip Montanaro
- Tim Peters