RE: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable. However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
-----Original Message-----
From: python-checkins-admin@python.org [mailto:python-checkins-admin@python.org] On Behalf Of M.-A. Lemburg
Sent: Saturday, April 20, 2002 11:26 AM
To: loewis@sourceforge.net
Cc: python-checkins@python.org
Subject: Re: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
loewis@sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory usw-pr-cvs1:/tmp/cvs-serv30961

Modified Files:
	unicodeobject.c
Log Message:
Patch #495401: Count number of required bytes for encoding UTF-8 before allocating the target buffer.
Martin, please back out this change again. We have discussed this quite a few times and I am against using your strategy, since it introduces a performance hit that is not justified by the gained advantage of (temporarily) using less memory.
Your timings also show this, so I wonder why you checked in this patch, e.g. from the patch log:

"""
For the current CVS (unicodeobject.c 2.136: MAL's change to use a variable overalloc), I get
10 spaces                 20.060
100 spaces                 2.600
200 spaces                 2.030
1000 spaces                0.930
10000 spaces               0.690
10 spaces, 3 bytes        23.520
100 spaces, 3 bytes        3.730
200 spaces, 3 bytes        2.470
1000 spaces, 3 bytes       0.980
10000 spaces, 3 bytes      0.690
30 bytes                  24.800
300 bytes                  5.220
600 bytes                  3.830
3000 bytes                 2.480
30000 bytes                2.230
With unicode3.diff (that's the one you checked in), I get
10 spaces                 19.940
100 spaces                 3.260
200 spaces                 2.340
1000 spaces                1.650
10000 spaces               1.450
10 spaces, 3 bytes        21.420
100 spaces, 3 bytes        3.410
200 spaces, 3 bytes        2.420
1000 spaces, 3 bytes       1.660
10000 spaces, 3 bytes      1.450
30 bytes                  22.260
300 bytes                  5.830
600 bytes                  4.700
3000 bytes                 3.740
30000 bytes                3.540
"""
The only case where your patch is faster is for very short strings, and then only by a few percent, whereas for all longer strings you get worse timings, e.g. 3.74 seconds compared to 2.48 seconds -- that's a 50% increase in run-time!
Thanks,
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting:                    http://www.egenix.com/
Python Software:                         http://www.egenix.com/files/python/
_______________________________________________
Python-checkins mailing list
Python-checkins@python.org
http://mail.python.org/mailman/listinfo/python-checkins
Tim Peters
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable.
Exactly. I think the most recent version is worse than the one we had before.
However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
I found the overallocation strategy that this code had just so embarrassing: a single three-byte character will cause an overallocation of three times the memory, which means that the final realloc will certainly truncate lots of bytes. As a result, we are at the mercy of the realloc implementation here: if realloc copies memory when shrinking buffers (as pymalloc might some day do), performance will get worse. Since this appears to be religious, I'm backing the patch out.

Regards,
Martin
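For readers following along, the strategy Martin checked in can be sketched as follows. This is only an illustration in Python of the count-then-allocate idea -- the real code is C in unicodeobject.c -- showing the two passes and the bounds assert the thread keeps coming back to:

```python
def utf8_length(s):
    """First pass: count exactly how many UTF-8 bytes s will need."""
    n = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            n += 1
        elif cp < 0x800:
            n += 2
        elif cp < 0x10000:
            n += 3
        else:
            n += 4
    return n

def encode_utf8_counted(s):
    """Second pass: fill an exactly-sized buffer; no realloc needed."""
    out = bytearray(utf8_length(s))
    i = 0
    for ch in s:
        cp = ord(ch)
        if cp < 0x80:
            out[i] = cp
            i += 1
        elif cp < 0x800:
            out[i] = 0xC0 | (cp >> 6)
            out[i + 1] = 0x80 | (cp & 0x3F)
            i += 2
        elif cp < 0x10000:
            out[i] = 0xE0 | (cp >> 12)
            out[i + 1] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 2] = 0x80 | (cp & 0x3F)
            i += 3
        else:
            out[i] = 0xF0 | (cp >> 18)
            out[i + 1] = 0x80 | ((cp >> 12) & 0x3F)
            out[i + 2] = 0x80 | ((cp >> 6) & 0x3F)
            out[i + 3] = 0x80 | (cp & 0x3F)
            i += 4
    assert i == len(out)  # counting pass and filling pass must agree
    return bytes(out)
```

The cost being argued about is that every character is examined twice (once to count, once to encode), which is where the slowdown on long strings comes from.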
[martin@v.loewis.de]
... I found the overallocation strategy that this code had just so embarrassing: a single three-byte character will cause an overallocation of three times the memory, which means that the final realloc will certainly truncate lots of bytes.
Well, Python routinely over-allocates string space all over the place, and it doesn't really cost more to throw away a million bytes at the end than to throw away five bytes. That is, the cost of a realloc shrink is at worst proportional to the number of retained bytes; the number of bytes thrown away doesn't matter. For that reason, the thing that puzzles me more about the current state of the code is why it doesn't overallocate by a factor of 4 from the start, and skip all the delicate, repeated "oops! I still didn't get enough memory" logic. It's almost certainly going to do a shrinking realloc at the end anyway.
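Tim's alternative -- overallocate by the worst-case factor up front, then do a single shrinking realloc at the end -- can be sketched the same way (again an illustrative Python sketch of C-level logic, using `ch.encode` as a stand-in for the inlined encoding):

```python
def encode_utf8_overalloc(s):
    # Worst case for any single code point is 4 UTF-8 bytes, so a buffer
    # of len(s) * 4 bytes can never be overrun -- no repeated
    # "oops! I still didn't get enough memory" logic is needed.
    out = bytearray(len(s) * 4)
    i = 0
    for ch in s:
        encoded = ch.encode("utf-8")  # stand-in for the inlined C encoding
        out[i:i + len(encoded)] = encoded
        i += len(encoded)
        assert i <= len(out)  # the bounds assert the thread is arguing for
    del out[i:]  # the single shrinking "realloc" at the end
    return bytes(out)
```

Only one pass over the characters, at the price of temporarily holding up to four bytes per character.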
As a result, we are at the mercy of the realloc implementation here: If realloc copies memory (such as Pymalloc might some day) when shrinking buffers, performance will get worse.
Since pymalloc only handles small blocks by itself, such copies would be of short contiguous memory regions that are almost certainly in L1 cache (since the encoder function just finished crawling over them). For that reason I doubt the speed hit would be monstrous.
Since this appears to be religious, I'm backing the patch out.
It needn't be religious <wink>: with sufficient effort we can quantify the tradeoff between the time pymalloc's realloc would take to move shrinking blocks, and the space wasted by realloc leaving them in place. I mentioned before that I want to give pymalloc a new entry point for realloc callers who specifically do or don't want a shrinking realloc to copy memory. For string storage I *suspect* it's better to move the memory (typically exactly one shrinking realloc, after which the string storage is immutable until death), while for most other kinds of storage it's better to leave the block alone (mutable memory that may shrink and grow repeatedly over time).
For string storage I *suspect* it's better to move the memory (typically exactly one shrinking realloc, after which the string storage is immutable until death)
OTOH, for many string objects, death comes quickly, e.g. if they get concatenated to some other string. It is difficult to anticipate how long any given string will live - wasted memory appears to be best measured in bytes*seconds. Regards, Martin
"Martin v. Loewis" wrote:
Since this appears to be religious, I'm backing the patch out.
Thanks. Please note that there's nothing religious about this: if you come up with a counting version of the codec which has similar performance or is even faster, I'd have no objections at all to your checkin. BTW, you could easily add your codec to the Python codec registry as e.g. 'utf8alt'. People could then enable it on memory-starved machines by editing the codec alias list in encodings/aliases.py.

--
Marc-Andre Lemburg
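MAL's suggestion of a separately registered codec could look roughly like this. The name 'utf8alt' is his hypothetical suggestion; in this sketch it simply delegates to the built-in UTF-8 codec, where a real version would plug in the counting encoder instead:

```python
import codecs

def _utf8alt_search(name):
    # Search function for the hypothetical 'utf8alt' codec; a real
    # version would supply the count-before-allocate encoder here.
    if name == "utf8alt":
        utf8 = codecs.lookup("utf-8")
        return codecs.CodecInfo(encode=utf8.encode,
                                decode=utf8.decode,
                                name="utf8alt")
    return None

codecs.register(_utf8alt_search)
```

After registration, `u"caf\u00e9".encode("utf8alt")` works like any other codec name.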
Tim Peters wrote:
I expect Martin checked in this change because of the unhappy hours he spent determining that the previous two versions of this function wrote beyond the memory they allocated. Since the most recent version still didn't bother to assert that it wasn't writing out of bounds, I can't blame Martin for checking in a version that does so assert; since I spent hours on this too, and this function has a repeated history of bad memory behavior, I viewed the version Martin replaced as unacceptable.
Are you sure you're talking about the latest version I checked in? I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
However, the slowdown on large strings isn't attractive, and the previous version could easily enough have asserted its memory correctness.
So, why not just add the assert to my original version?
-----Original Message-----
From: python-checkins-admin@python.org [mailto:python-checkins-admin@python.org] On Behalf Of M.-A. Lemburg
Sent: Saturday, April 20, 2002 11:26 AM
To: loewis@sourceforge.net
Cc: python-checkins@python.org
Subject: Re: [Python-checkins] python/dist/src/Objects unicodeobject.c,2.139,2.140
loewis@sourceforge.net wrote:
Update of /cvsroot/python/python/dist/src/Objects
In directory usw-pr-cvs1:/tmp/cvs-serv30961

Modified Files:
	unicodeobject.c
Log Message:
Patch #495401: Count number of required bytes for encoding UTF-8 before allocating the target buffer.
Martin, please back out this change again. We have discussed this quite a few times and I am against using your strategy, since it introduces a performance hit that is not justified by the gained advantage of (temporarily) using less memory.
Your timings also show this, so I wonder why you checked in this patch, e.g. from the patch log:

"""
For the current CVS (unicodeobject.c 2.136: MAL's change to use a variable overalloc), I get
10 spaces                 20.060
100 spaces                 2.600
200 spaces                 2.030
1000 spaces                0.930
10000 spaces               0.690
10 spaces, 3 bytes        23.520
100 spaces, 3 bytes        3.730
200 spaces, 3 bytes        2.470
1000 spaces, 3 bytes       0.980
10000 spaces, 3 bytes      0.690
30 bytes                  24.800
300 bytes                  5.220
600 bytes                  3.830
3000 bytes                 2.480
30000 bytes                2.230
With unicode3.diff (that's the one you checked in), I get
10 spaces                 19.940
100 spaces                 3.260
200 spaces                 2.340
1000 spaces                1.650
10000 spaces               1.450
10 spaces, 3 bytes        21.420
100 spaces, 3 bytes        3.410
200 spaces, 3 bytes        2.420
1000 spaces, 3 bytes       1.660
10000 spaces, 3 bytes      1.450
30 bytes                  22.260
300 bytes                  5.830
600 bytes                  4.700
3000 bytes                 3.740
30000 bytes                3.540
"""
The only case where your patch is faster is for very short strings, and then only by a few percent, whereas for all longer strings you get worse timings, e.g. 3.74 seconds compared to 2.48 seconds -- that's a 50% increase in run-time!
Thanks,
--
Marc-Andre Lemburg
--
Marc-Andre Lemburg
[Tim]
... behavior, I viewed the version Martin replaced as unacceptable.
[M.-A. Lemburg]
Are you sure you're talking about the latest version I checked in?
Calling the version Martin checked in N, I'm talking about versions N-3, N-2, and N-1. N-3 and N-2 were unacceptable because they wrote out of bounds. N-1 ("the version Martin replaced") was unacceptable because it still didn't assert that it wasn't writing out of bounds. I asked repeatedly in the bug reports opened against N-3 and N-2 that asserts be added. If that had been done in version N-2, at least Barry, Martin, you and I wouldn't have spent additional hours chasing down what turned out to be more out-of-bounds writes (a debug-build run would have triggered an assert directly in the flawed code).
I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
You were pretty sure about N-2 too.
... So, why not just add the assert to my original version?
I don't know why you didn't <wink>. Martin backed out version N, so we're back to N-1, except I see Martin added a crucial assert for you. I added some more since then.
Tim Peters wrote:
[Tim]
... behavior, I viewed the version Martin replaced as unacceptable.
[M.-A. Lemburg]
Are you sure you're talking about the latest version I checked in?
Calling the version Martin checked in N, I'm talking about versions N-3, N-2, and N-1. N-3 and N-2 were unacceptable because they wrote out of bounds. N-1 ("the version Martin replaced") was unacceptable because it still didn't assert that it wasn't writing out of bounds. I asked repeatedly in the bug reports opened against N-3 and N-2 that asserts be added. If that had been done in version N-2, at least Barry, Martin, you and I wouldn't have spent additional hours chasing down what turned out to be more out-of-bounds writes (a debug-build run would have triggered an assert directly in the flawed code).
Tim, I don't get it... why all the fuss about some missing asserts?
I spent hours on this too, and I'm pretty sure I have fixed the buffer overruns now.
You were pretty sure about N-2 too, and the more hours it takes to make tricky code correct, the more suspect that code is. As I most recently implored, in a comment on Barry's bug report against N-2: What I do care about is that there weren't (and still aren't) asserts *verifying* that this delicate code isn't spilling over the allocated bounds.
About timing, last time we went around on this, the "measure once, cut once" version of the code was significantly slower in my timing tests too. I don't care so much if the code is tricky, but the trickier the code the more asserts are required.
You checked in N-1 (and N-2) without responding to comments like that, and we're all paying for it. You realize asserts go away in the release build, right? They don't cost anything in production mode, and they save our ass in debug mode.
I must have missed Barry's post, sorry. I didn't leave out the asserts for any reason -- just didn't think about using them.
... So, why not just add the assert to my original version?
I don't know why you didn't <wink>. Martin backed out version N, so we're back to N-1, except I see Martin added a crucial assert for you. I added some more since then.
Thanks.

--
Marc-Andre Lemburg
[MAL]
Tim, I don't get it... why all the fuss about some missing asserts?
They were requested repeatedly, and would have saved everyone hours of debugging time. You didn't add asserts when requested, and didn't oppose them either -- they just got ignored. The last two times this routine was implicated in memory corruption, you assumed it was a pymalloc bug, and Martin had to do all the work of identifying the true cause and then convincing you of it. The first time that was understandable, but the second time it wasn't; that the third version of the code still didn't try to catch overwrites was over the edge.
I must have missed Barry's post, sorry.
Barry filed the latest bug report, and you closed it. The memory corruption showed up as a bad UTF-8 translation in Mailman: http://www.python.org/sf/541828
I didn't leave out the asserts for any reason -- just didn't think about using them.
You would have, had you read the comments on the bug reports you closed, so I'm not sure what to make of that. I'll be happy if you use asserts in the future, when the code is getting too tricky to be obviously correct; that shouldn't need to be a protracted battle.
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial? In any case, I wasn't looking for a fight when requesting to back out Martin's changes.

Tim Peters wrote:
[MAL]
Tim, I don't get it... why all the fuss about some missing asserts?
They were requested repeatedly, and would have saved everyone hours of debugging time. You didn't add asserts when requested, and didn't oppose them either -- they just got ignored.
Tim, I wasn't aware that you were requesting that I add these. I remember that you mentioned something about using assert()s in the Python core many months ago, after a discussion with Guido, but it wasn't clear to me that you wanted me to add assert()s to the codecs. The reason I wasn't adding assert()s is simply that I normally don't use them in C programming. This example shows that it would probably have been better to do so, and I'll try to remember this in the future.
The last two times this routine was implicated in memory corruption, you assumed it was a pymalloc bug, and Martin had to do all the work of identifying the true cause and then convincing you of it. The first time that was understandable, but the second time it wasn't; that the third version of the code still didn't try to catch overwrites was over the edge.
True, the code was complicated due to the many different paths the codec could take. In the last version, I ripped out all the counting code because of the troubles it caused previously. In any case, before spending too much time understanding the code, why didn't you just ask for a better explanation? I would have been the last to deny that request.
I must have missed Barry's post, sorry.
Barry filed the latest bug report, and you closed it. The memory corruption showed up as a bad UTF-8 translation in Mailman:
True, but Martin and you continued discussing the bug after it had been closed. I hadn't read those messages, because I thought they were part of the usual SF duplication of message delivery. Sorry about that.
I didn't leave out the asserts for any reason -- just didn't think about using them.
You would have, had you read the comments on the bug reports you closed, so I'm not sure what to make of that. I'll be happy if you use asserts in the future, when the code is getting too tricky to be obviously correct; that shouldn't need to be a protracted battle.
Will do. Again, there was no fight intended.

assert(peace),
--
Marc-Andre Lemburg
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
The latter definitely plays a role -- when I was going to IETF meetings back around 94/95, character set issues were always good for a few big fights, and I think the people in Asia are still not all in agreement. Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.) For a feature that few of the developers ever need to use for themselves (I believe everyone with CVS commit privileges needs at most Latin-1 for their own language :-), I can understand that makes it a touchy issue. --Guido van Rossum (home page: http://www.python.org/~guido/)
On Thursday, April 25, 2002, at 08:59, Guido van Rossum wrote:
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial?
[...]
Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.)
Another thing that bothers me is that it retroactively changed the interpretation of other Python objects. For me it's perfectly logical that a character string is a character string, unless there's a very good reason to treat it differently (a framebuffer scanline, a binary blob, etc). And so if I have an API OpenFileWithUnicodeName() that accepts a unicode filename, I expect that if I pass an 8-bit filename it will be converted on the fly. Other people focus on different sets of APIs, however, and think there's nothing more logical than interpreting the string object as a binary buffer containing UTF-16 values or what-have-you.

Scanlines or binary blobs are hardly ever mixed with filenames, so there wasn't an issue before unicode raised its pretty/ugly head.

(Of course it could be argued that unicode has demonstrated a design flaw in Python, namely that a single data-type was used to store both binary data of unknown interpretation and character arrays, and that there's now little more to be done about that.)
--
- Jack Jansen
Jack Jansen wrote:
(of course it could be argued that unicode has demonstrated a design flaw in Python, namely that a single data-type was used to store both binary data of unknown interpretation and character arrays, and that there's now little more to be done about that).
This is probably the most significant problem with the usage of strings -- and there's not much we can do about it, since it's hard-wired into the programmer's mind... most other languages have the same problem. In the long run, I think that the Unicode type should be used for all text data, and strings for binary data of unknown encoding.

--
Marc-Andre Lemburg
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1. --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido van Rossum wrote:
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1.
This has nothing to do with a particular encoding; it's about meta data: much like you can use numbers to represent e.g. money or date/time values, you always need to store this meta information somewhere, and object types are the ideal carrier for such meta data. It's an option which everybody working in the application space should consider. If you start working this way now, it'll pay off a few years ahead, regardless of whether it is useful for you now or not.

--
Marc-Andre Lemburg
Guido van Rossum
In the long run, I think that the Unicode type should be used for all text data and strings for binary data of unknown encoding.
I think you're underestimating the tenacity of ASCII and Latin-1.
I agree with MAL: the run will just be *very* long. Regards, Martin
mal> In the long run, I think that the Unicode type should be used for
mal> all text data and strings for binary data of unknown encoding.

In that case, I would suggest you call "string" something else.

Skip
Skip Montanaro wrote:
mal> In the long run, I think that the Unicode type should be used for
mal> all text data and strings for binary data of unknown encoding.
In that case, I would suggest you call "string" something else.
That would call for world revolution and again put Unicode in a bad light. I'd rather not ;-)

--
Marc-Andre Lemburg
Guido van Rossum wrote:
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it. I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved, or is Unicode inherently controversial?
The latter definitely plays a role -- when I was going to IETF meetings back around 94/95, character set issues were always good for a few big fights, and I think the people in Asia are still not all in agreement.
Another issue is that adding Unicode was probably the most invasive set of changes ever made to the Python code base. It has complicated many parts of the code, and added at least a proportional share of bugs. (I found 166 source files in CVS containing some variation on the string "unicode", and 110 bug reports mentioning "unicode" in the SF bug tracker.)
True; and it was hard enough to get it mostly to a working compromise.
For a feature that few of the developers ever need to use for themselves (I believe everyone with CVS commit privileges needs at most Latin-1 for their own language :-), I can understand that makes it a touchy issue.
Very true indeed. Still, I think Unicode gives us a chance of "fixing" the problem we currently have with strings: Unicode, unlike strings, is only usable for text data, and that makes it ideal as the standard type for text -- we'll never convince people to make a distinction between text and binary data in strings, so offering them Unicode as an alternative is a good strategy, IMHO.

--
Marc-Andre Lemburg
True; and it was hard enough to get it mostly to a working compromise.
Let me add that I very much appreciate your heroic efforts there!!!
Very true indeed. Still, I think Unicode gives us a chance of "fixing" the problem we currently have with strings: Unicode, unlike strings, is only usable for text data, and that makes it ideal as the standard type for text -- we'll never convince people to make a distinction between text and binary data in strings, so offering them Unicode as an alternative is a good strategy, IMHO.
It's a long way before we're there, though -- we'd have to overhaul the I/O system entirely, and that takes a lot of time, not just because of the effort but also because it won't be fully compatible. Also, once 8-bit strings are used for binary data only, I wonder if they shouldn't be more like Java's byte arrays -- i.e. mutable. And they don't need a literal notation. That's another major language change. :-( --Guido van Rossum (home page: http://www.python.org/~guido/)
Guido> Also, once 8-bit strings are used for binary data only, I wonder
Guido> if they shouldn't be more like Java's byte arrays --
Guido> i.e. mutable. And they don't need a literal notation. That's
Guido> another major language change. :-(

How so? In theory, all the ways you write string constructors today would eventually map to Unicode objects. I'm thinking of string literals and constructor functions. That can be handled with the usual "warn for awhile" mechanism. The array module could be used to manipulate mutable arrays of 8-bit data. While permeating the Python innards with Unicode objects would be a major change, I don't see any big syntactic changes -- or is that not what worries you?

Skip
How so? In theory, all the ways you write string constructors today would eventually map to Unicode objects. I'm thinking of string literals and constructor functions. That can be handled with the usual "warn for awhile" mechanism. The array module could be used to manipulate mutable arrays of 8-bit data. While permeating the Python innards with Unicode objects would be a major change I don't see any big syntactic changes - or is that not what worries you?
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work. --Guido van Rossum (home page: http://www.python.org/~guido/)
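Skip's point about the array module -- mutable 8-bit storage without any string literal notation -- already works today; a small sketch:

```python
from array import array

# Mutable 8-bit data, no string literal involved.
buf = array("B", b"hello")   # "B": unsigned bytes
buf[0] = ord("H")            # in-place mutation -- impossible with a string
buf.extend(b", world")
assert buf.tobytes() == b"Hello, world"
```

This is essentially the Java byte[] model Guido mentions, minus the type distinction in the I/O layer.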
On Friday, April 26, 2002, at 06:26, Guido van Rossum wrote:
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work.
It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.

If textstrings and blobs become wholly different objects, this shouldn't create too many problems [see below], except for code that opens a file in binary mode and (partially) reads the resulting file expecting text. But this code would need revisiting anyway if the normal textstring became unicode.

[here's below] To my surprise I think that having blobs and textstrings be unrelated objects creates fewer problems than having the one be a subtype of the other. At least, every time I try to do the subtyping in my head I flip back and forth between textstrings-are-a-subtype-of-general-binary-buffers and binary-buffers-are-a-special-case-of-python-strings every couple of seconds. I think having them both be subtypes of a common base type (basestring) might work, but I'm not sure.
--
- Jack Jansen
"JJ" == Jack Jansen
writes:
JJ> [here's below] To my surprise I think that having blobs and JJ> textstrings be unrelated objects creates less problems than JJ> having the one be a subtype of the other. While I think there's a lot of common operations you might do on blobs and textstrings, I don't see much of a need for them to be related via class hierarchy (although we might find it to be expedient for backwards compatibility). -Barry
[Guido]
No syntactic changes, no. But the way we do things would become significantly different. And think of binary I/O vs. textual I/O -- currently, file.read() returns a string. Code dealing with binary files will look significantly different, and old code won't work.
[Jack]
It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.
They might even return different kinds of objects -- arguably, binary files don't need readline() etc., and text files may not need read(n) (though the arg-less variant is handy). If only I had the time to reinvent the I/O library...
If textstrings and blobs become wholly different objects this shouldn't create too many problems [see below], except for code that opens a file in binary mode and (partially) reads the resulting file expecting text. But this code would need revisiting anyway if the normal textstring would become unicode.
Yeah, that's usually just stubborn Unix users who don't believe in the distinction between binary and text mode. :-) Anyway, the proper way to convert between blobs and textstrings would be encodings. That's how Java does it.
[here's below] To my surprise I think that having blobs and textstrings be unrelated objects creates less problems than having the one be a subtype of the other. At least, every time I try to do the subtyping in my head I flip back and forth between textstrings-are-a-subtype-of-general-binary-buffers and binary-buffers-are-a-special-case-of-python-strings every couple of seconds. I think having them both be subtypes of a common base type (basestring) might work, but I'm not sure.
I think they don't need anything in common (apart their sequence-ness). I think Java's byte[] vs. String distinction is about right. --Guido van Rossum (home page: http://www.python.org/~guido/)
Jack Jansen writes:

It could be argued that open(..., 'r').read() returns a text string and open(..., 'rb').read() returns a binary blob.
That can't work: to get a text string, you need to know the encoding.
Regards, Martin
In my ideal rewrite of the I/O subsystem, an encoding is specified (or a site-or-app-specific default encoding used) when a file is opened. --Guido van Rossum (home page: http://www.python.org/~guido/)
[M.-A. Lemburg]
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it.
Huh -- I thought I was the only one who noticed this <wink>.
I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
Unicode had nothing to do with my yelling in this thread. I've got very low tolerance for memory corruption, regardless of source. When it happens once I'm on high alert, when it happens twice in the same place I go postal. Had this been in dictobject.c or boolobject.c, I would have been just as unhappy.
Now that the memory corruption is thought to be solved, and verified in the debug build regardless, *now* I'll get cranky about foreigners and their lameass character sets <wink>.
On the technical issues remaining, I don't know how to judge the tradeoff between memory use and speed here. If you do, and pymalloc can help in some way, I'll be happy to help.
Tim Peters wrote:
[M.-A. Lemburg]
I don't know why it is, but Unicode always seems to unnecessarily heat up any discussion involving it.
Huh -- I thought I was the only one who noticed this <wink>.
Naa, it's occurred to me several times in the past. Unicode seems to trigger some memory corruption in Brain 2.2 which results in spilling out huge amounts of adrenalin and causes the blood pressure to reach record highs ;-)
I would really like to know what is causing this: is it a religious issue, does it have to do with the people involved or is Unicode inherently controversial ?
Unicode had nothing to do with my yelling in this thread. I've got very low tolerance for memory corruption, regardless of source. When it happens once I'm on high alert, when it happens twice in the same place I go postal. Had this been in dictobject.c or boolobject.c, I would have been just as unhappy.
Now that the memory corruption is thought to be solved, and verified in the debug build regardless, *now* I'll get cranky about foreigners and their lameass character sets <wink>.
Good to know.
On the technical issues remaining, I don't know how to judge the tradeoff between memory use and speed here. If you do, and pymalloc can help in some way, I'll be happy to help.
First of all, UTF-8 is probably the most common Unicode encoding used today and will certainly become *the* standard encoding within the next few years. So speed matters a lot in this particular corner of the Unicode implementation.
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...), so overallocation doesn't actually cause bundles of memory chips to heat up. This makes overallocation ideal for the case where you don't know the exact size in advance but where you can estimate a reasonable upper bound.
Now with pymalloc the situation is a bit different for smaller sized memory areas (larger chunks are handed off to the system malloc() which uses the above strategy). As Martin's benchmark showed, the counting strategy is faster for small chunks, and this is probably due to the fact that pymalloc manages these.
Since pymalloc cannot know that an algorithm wants to use overallocation as its memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
--
Marc-Andre Lemburg
CEO eGenix.com Software GmbH
______________________________________________________________________
Company & Consulting: http://www.egenix.com/
Python Software: http://www.egenix.com/files/python/
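[Editor's note: the overallocate-then-shrink strategy under discussion can be sketched as below. This is a simplified illustration, not the actual unicodeobject.c code: it encodes an array of UCS-2 code units (no surrogate handling), overallocates for the worst case of 3 UTF-8 bytes per unit, and gives the untouched excess back with a shrinking realloc() at the end. The function name is made up.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Encode UCS-2 code units as UTF-8, overallocating for the worst case
 * (3 bytes per unit) and shrinking to the exact size at the end. */
static char *utf8_encode_overalloc(const unsigned short *s, size_t len,
                                   size_t *out_len)
{
    char *buf = malloc(len * 3 + 1);   /* worst-case upper bound */
    char *p = buf;
    size_t i;

    if (buf == NULL)
        return NULL;
    for (i = 0; i < len; i++) {
        unsigned short ch = s[i];
        if (ch < 0x80)                          /* 1-byte sequence */
            *p++ = (char)ch;
        else if (ch < 0x800) {                  /* 2-byte sequence */
            *p++ = (char)(0xC0 | (ch >> 6));
            *p++ = (char)(0x80 | (ch & 0x3F));
        }
        else {                                  /* 3-byte sequence */
            *p++ = (char)(0xE0 | (ch >> 12));
            *p++ = (char)(0x80 | ((ch >> 6) & 0x3F));
            *p++ = (char)(0x80 | (ch & 0x3F));
        }
    }
    *p = '\0';
    *out_len = (size_t)(p - buf);
    /* Return the untouched excess to the allocator; if the shrinking
     * realloc() fails (it shouldn't), keep the original block. */
    {
        char *shrunk = realloc(buf, *out_len + 1);
        return shrunk ? shrunk : buf;
    }
}
```

The point of contention in the thread is exactly the `malloc(len * 3 + 1)` line: it never touches more memory than the result needs, but it does *request* the worst case up front.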
"M.-A. Lemburg"
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...),
That is not true. Each malloc implementation I know will always iterate through the freelist to find an appropriately-sized chunk, and go to the OS if it doesn't find one. Now, the *OS* might implement such allocations as a no-op, but on most hardware, it can do so only in units of memory pages (e.g. 4k). Most strings are smaller than a page, so if you allocate lots of strings, every page allocated from the system will be used as well (at least to fill in the string header). With overallocation, you will indeed overallocate real pages, and consume real memory.
This makes overallocation ideal for the case where you don't know the exact size in advance but where you can estimate a reasonable upper bound.
No. The overallocation has a real cost in terms of memory consumption.
As Martin's benchmark showed, the counting strategy is faster for small chunks and this is probably due to the fact that pymalloc manages these.
I doubt this claim.
Since pymalloc cannot know that an algorithm wants to use overallocation as memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
That won't help. Regards, Martin
[MAL]
The standard reasoning behind using overallocation for memory management is that typical modern malloc()s don't really allocate the memory until it is used (you know this anyway...),
[Martin]
... Most strings are smaller than a page, so if you allocate lots of strings, every page allocated from the system will be used as well (at least to fill in the string header). With overallocation, you will indeed overallocate real pages, and consume real memory.
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on. The only memory *touched* by him is the only memory he actually needs in the end. Indeed, if strings are always smaller than a page, the aggregate overallocation at any point in a single-threaded program will be at worst a few pages total (overallocation is never more than a factor of 4, and the excess is always given back untouched).
As Martin's benchmark showed, the counting strategy is faster for small chunks and this is probably due to the fact that pymalloc manages these.
I doubt this claim.
If you run the test both with and without pymalloc enabled, the results should speak for themselves. I have not, but suspect the difference has more to do with the fact that caches like small memory areas better than large ones, especially when you crawl over a memory area twice. MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string. pymalloc handles blocks that small under either scheme, and indeed your current scheme is getting a speed benefit from the fact that pymalloc currently refuses to copy the data to a smaller block when there's a shrinking realloc at the end.
Since pymalloc cannot know that an algorithm wants to use overallocation as memory allocation strategy, it would probably help to find a way to tell pymalloc about this fact. It could then either redirect the request to the system malloc() or use a different malloc strategy for these chunks.
Possibly, yes. The speed-versus-memory tradeoffs still need quantifying, though. pymalloc's current small-block realloc strategy favors speed, and going to the system malloc instead would be slower. You haven't yet told me that's what you actually want <wink>.
Tim Peters
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on.
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory, so the extra bytes in a small string will be wasted for the lifetime of the string object - that still could cause significant memory wastage.
MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string.
[I guess this is a terminology, not a math problem: a 40 character Unicode string has already 80 bytes; the UTF-8 of it can have up to 160 bytes]. Regards, Martin
[Tim]
But Marc-Andre uses realloc at the end to return the excess. The excess bytes will get reused (and some returned yet again) by the next overallocation, and so on.
[Martin]
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory,
PyMem_Realloc does whatever the system realloc does -- PyMem_Realloc doesn't go thru pymalloc today (except in a PYMALLOC_DEBUG build). Doesn't matter, though, since strings use the PyObject_{Malloc, Free, Realloc} family today, and that does use pymalloc. OTOH, there's no reason PyObject_Realloc *has* to hang on to all small-block memory on a shrinking realloc, and there's no reason pymalloc couldn't grow another realloc entry point specifying what the caller wants a shrinking realloc to do. These things are all easy to change, but I don't know what's truly desirable. Note another subtlety: I expect you brought up PyMem_Realloc because unicodeobject.c uses the PyMem_XYZ family for managing the PyUnicodeObject.str member today. That means it normally never uses pymalloc at all, except to allocate fixed-size PyUnicodeObject structs (which use the PyObject_XYZ memory family). I don't know whether that's the best idea, but that's how it is today. pymalloc gets into this because PyUnicode_EncodeUTF8 returns a plain string object, and the latter uses pymalloc today.
so the extra bytes in a small string will be wasted for the lifetime of the string object - that still could cause significant memory wastage.
It could. Python generally aims to optimize the expected case, not jump thru hoops to avoid worst cases (else we wouldn't use dicts at all <wink>). But I don't know what the expected case is here, and given how often I use Unicode in my own work it could be I'll never have a clue. Note that the expected uses of Unicode strings makes no difference to PyUnicode_EncodeUTF8: what counts there is the expected lifetimes and sizes of the "plain" utf8-encoded PyStringObjects it computes. Indeed, pymalloc has almost no implications for Unicode beyond the encode-as-a-plain-string functions (unless unicodeobject.c is changed to manage the PyUnicodeObject.str member using pymalloc too, as plain strings do today).
MAL, you should keep in mind that pymalloc is also managing the small chunks in your scheme: when you're fiddling with a 40-character Unicode string, an overallocation "by a factor of 4" only amounts to an 80-character UTF8 string.
[I guess this is a terminology, not a math problem:
Nope! Turns out it was an hallucination problem <wink>.
a 40 character Unicode string has already 80 bytes; the UTF-8 of it can have up to 160 bytes].
You're right, of course. The conclusion doesn't change, though: that's still in the range of block sizes pymalloc handles (and will remain so unless I reduce pymalloc's small-object threshold below what's needed for pymalloc to handle small dicts on its own -- which I'm unlikely to do).
[Tim]
The conclusion doesn't change, though: that's still in the range of block sizes pymalloc handles (and will remain so unless I reduce pymalloc's small-object threshold below what's needed for pymalloc to handle small dicts on its own -- which I'm unlikely to do).
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block? (I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.) --Guido van Rossum (home page: http://www.python.org/~guido/)
[Guido]
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block?
These are independent questions, and I don't know how to answer either unless you give me a test program that prints the value of the function you're trying to minimize <0.7 wink>. The Unicode object currently uses quite an elaborate free list, caching both PyUnicodeObject structs (which currently use pymalloc), and their str members (which currently do not). Whether the str member uses pymalloc really doesn't have anything to do with what the UTF8 encoder function does (it returns plain strings, and those already use pymalloc today -- and it's not entirely clear whether they should either!). Counting the bytes in the UTF8 decoder could work well, independent of that: if the result is known to fit in a pymalloc block, just do it; as soon as it's known that it won't, overallocate with assurance that the system realloc will give back everything that isn't used. In the latter case I believe the code could be made much simpler, by doing a factor-of-4 overallocation from the start (it currently tries 2, then 3, then 4, with a bunch of embedded-in-the-loops tests to prevent overwrites; I'm not sure why it bothers with this staggered scheme, since it's going to touch exactly as much memory as it actually needs regardless, and give all the rest back untouched).
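[Editor's note: the counting pass proposed here is cheap to sketch. The following computes exactly how many UTF-8 bytes a UCS-2 string will need, so the result buffer can be allocated once at the right size; surrogate pairs are ignored, as they were on narrow builds of the era, and the function name is invented for the sketch.]

```c
#include <assert.h>
#include <stddef.h>

/* First pass of the count-then-allocate strategy: return the exact
 * number of UTF-8 bytes needed to encode 'len' UCS-2 code units.
 * 1 byte below U+0080, 2 bytes below U+0800, 3 bytes otherwise. */
static size_t utf8_byte_count(const unsigned short *s, size_t len)
{
    size_t n = 0, i;
    for (i = 0; i < len; i++) {
        unsigned short ch = s[i];
        n += (ch < 0x80) ? 1 : (ch < 0x800) ? 2 : 3;
    }
    return n;
}
```

Guido's proposal amounts to: run this loop (and allocate exactly `n` bytes) whenever the worst case would fit in a pymalloc block, and fall back to overallocation for larger strings.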
(I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.)
For a start, yes. I'd stick a "Py_" in front of that symbol and expose it then. The cutoff test would also have to take into account the size of the result's PyStringObject header (the whole stringobject enchilada counts against the threshold).
Guido van Rossum
Would it make sense to change the Unicode object to use pymalloc, and to change the UTF-8 codec to count the bytes if the shortest possible output would fit in a pymalloc block? (I guess this means that the length of the Unicode string should be less than SMALL_REQUEST_THRESHOLD - currently 256.)
Given my measurements, that would make sense. I suspect that counting small strings is quite efficient, so that the overhead of iterating over the string twice hides in the noise of additional invocations. Regards, Martin
Tim Peters
Right. I confused this with the fact that PyMem_Realloc won't return the excess memory,
PyMem_Realloc does whatever the system realloc does -- PyMem_Realloc doesn't go thru pymalloc today (except in a PYMALLOC_DEBUG build). Doesn't matter, though, since strings use the PyObject_{Malloc, Free, Realloc} family today, and that does use pymalloc.
That's what I mean (I'm *really* confused about memory family APIs, ever since everything changed :-)
OTOH, there's no reason PyObject_Realloc *has* to hang on to all small-block memory on a shrinking realloc, and there's no reason pymalloc couldn't grow another realloc entry point specifying what the caller wants a shrinking realloc to do. These things are all easy to change, but I don't know what's truly desirable.
Neither do I. Establishing whether releasing the extra memory is worth the effort would require knowing whether the object will be long-lived; neither pymalloc nor its caller is able to tell.
Note another subtlety: I expect you brought up PyMem_Realloc because unicodeobject.c uses the PyMem_XYZ family for managing the PyUnicodeObject.str member today.
No, because I assumed PyMem_Realloc was a synonym for PyObject_Realloc.
That means it normally never uses pymalloc at all, except to allocate fixed-size PyUnicodeObject structs (which use the PyObject_XYZ memory family). I don't know whether that's the best idea, but that's how it is today.
I do think that the Unicode data should be managed by pymalloc as well. Of course, DecodeUTF8 would then raise the same issue: decoding UTF-8 doesn't know how many characters you'll get, either. This currently does not try to be clever, but allocates enough memory for the worst case. Regards, Martin
Yet another idea: decode into a fixed-size stack-allocated buffer. If it fits in that buffer, use PyString_FromStringAndSize(). Otherwise, do the overallocation thing. --Guido van Rossum (home page: http://www.python.org/~guido/)
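[Editor's note: Guido's stack-buffer trick looks roughly like this. It is a schematic with a made-up buffer size and helper names, not the actual SourceForge patch; the toy `encode_into` stands in for the real codec loop and simply doubles every 'x', returning the number of bytes needed while writing at most `cap` of them.]

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define LOCAL_BUF_SIZE 256

/* Stand-in for the real encoding loop: doubles every 'x'.  Returns the
 * number of output bytes needed, writing at most 'cap' of them, so the
 * caller can detect that the stack buffer overflowed. */
static size_t encode_into(const char *src, char *dst, size_t cap)
{
    size_t n = 0;
    for (; *src != '\0'; src++) {
        size_t k = (*src == 'x') ? 2 : 1;
        size_t j;
        for (j = 0; j < k; j++) {
            if (n < cap)
                dst[n] = *src;
            n++;
        }
    }
    return n;
}

/* Encode into a fixed-size stack buffer first; only when the result
 * doesn't fit, redo the work into an exact-sized heap buffer.  No
 * overallocation, and the common (small) case does a single malloc
 * of exactly the right size. */
static char *encode_with_stack_buffer(const char *src, size_t *out_len)
{
    char local[LOCAL_BUF_SIZE];
    size_t n = encode_into(src, local, sizeof(local));
    char *result = malloc(n + 1);

    if (result == NULL)
        return NULL;
    if (n <= sizeof(local))
        memcpy(result, local, n);       /* common case: it fit */
    else
        encode_into(src, result, n);    /* rare case: encode again */
    result[n] = '\0';
    *out_len = n;
    return result;
}
```

The tradeoff is that oversized inputs pay for two encoding passes, which is essentially the count-then-allocate cost; small inputs pay for one pass plus a memcpy.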
[Guido]
Yet another idea: decode into a fixed-size stack-allocated buffer. If it fits in that buffer, use PyString_FromStringAndSize(). Otherwise, do the overallocation thing.
How come we can't use a version of C that pulls this trick for us automatically <wink>? We do end up there a lot. Here's a patch that does it for PyUnicode_EncodeUTF8: http://www.python.org/sf/549375
[martin@v.loewis.de]
That's what I mean (I'm *really* confused about memory family APIs, ever since everything changed :-)
Here's the in-depth course:

    PyMem_xyz    calls the platform malloc/realloc/free (fiddled for
                 x-platform uniformity in NULL and 0 handling)
    PyObject_xyz calls pymalloc's malloc/realloc/free

and instead of a dozen layers of indirection we've now got crushingly straightforward WYSIWYG preprocessor blocks like:

#ifdef WITH_PYMALLOC
#ifdef PYMALLOC_DEBUG
#define PyObject_MALLOC   _PyObject_DebugMalloc
#define PyObject_Malloc   _PyObject_DebugMalloc
#define PyObject_REALLOC  _PyObject_DebugRealloc
#define PyObject_Realloc  _PyObject_DebugRealloc
#define PyObject_FREE     _PyObject_DebugFree
#define PyObject_Free     _PyObject_DebugFree
#else   /* WITH_PYMALLOC && ! PYMALLOC_DEBUG */
#define PyObject_MALLOC   PyObject_Malloc
#define PyObject_REALLOC  PyObject_Realloc
#define PyObject_FREE     PyObject_Free
#endif
#else   /* ! WITH_PYMALLOC */
#define PyObject_MALLOC   PyMem_MALLOC
#define PyObject_REALLOC  PyMem_REALLOC
#define PyObject_FREE     PyMem_FREE
#endif  /* WITH_PYMALLOC */

#define PyObject_Del      PyObject_Free
#define PyObject_DEL      PyObject_FREE

/* for source compatibility with 2.2 */
#define _PyObject_Del     PyObject_Free

All the names you love are still there, it's just that most of them are redundant now <wink>.
... I do think that the Unicode data should be managed by pymalloc as well.
Well, that largely depends on how big these suckers are. Calling PyObject_XYZ adds real overhead if pymalloc can't handle the requested size: all the overhead of the system routines, + the overhead of pymalloc figuring out it can't handle it. I expect it's also not good to mix pymalloc with custom free lists: you hold on to one object from a pymalloc pool, and it prevents the entire pool from getting recycled for another size class. So if you want to investigate using pymalloc more heavily for Unicode objects, I suggest two things: 1. Get rid of the Unicode-specific free list. 2. Change the object layout to embed the str member storage, just as PyStringObject does. #1 is pretty localized, but #2 would require changing a lot of code.
Of course, DecodeUTF8 would then raise the same issue: decoding UTF-8 doesn't know how many characters you'll get, either. This currently does not try to be clever, but allocates enough memory for the worst case.
I just put a patch up on SourceForge that's *less* clever, but shouldn't waste any memory in the end. I expect you'll be happy with it, or rant inconsolably. It's all the same to me <wink>.
Tim Peters
I just put a patch up on SourceForge that's *less* clever, but shouldn't waste any memory in the end. I expect you'll be happy with it, or rant inconsolably. It's all the same to me <wink>.
Not that you care, but I'm happy with it :-) Martin
"MAL" == M
writes:
MAL> I must have missed Barry post, sorry. I didn't leave out the MAL> asserts for any reason -- just didn't think about using them. I think Tim's referring just to the bug report about the regression in s.encode('utf8') that I reported seeing in the email package test. BTW, did you add a separate unittest for this specific failure? -Barry
participants (7)
- barry@zope.com
- Guido van Rossum
- Jack Jansen
- M.-A. Lemburg
- martin@v.loewis.de
- Skip Montanaro
- Tim Peters