Mutable chars objects

Something that I have wanted in Python for a long time is something like the Java StringBuffer class - a mutable buffer, with string-like methods, that holds characters instead of bytes.

I do a lot of stuff with parsing, and it's often convenient to build up long strings of text one character at a time. Doing this with strings in Python is obviously not the way to go, since each time you append a character you have to construct a new string object. Doing it with lists is better, except that you still have to pay the overhead of the dynamic typing information for each character.

Also, unlike a list or an array, you'd ideally want something that has string-like methods, such as upper() and so on. Calling str(buffer) should create a string of the contents of the buffer, not generate a repr() of the object, which is what would happen if you call str() on a list or array. Passing this buffer to 'print' should also just print the characters. Similarly, you ought to be able to do comparisons between the mutable buffer and a real string; slices of the buffer should be strings, not lists, and so on. In other words - it ought to act pretty much like STL strings.

Also, the class ought to be optimized for single-character appending; it should be smart enough to grow memory in the right-sized chunks. And no, there's no particular reason why the memory needs to be contiguous, although it could be.

Originally, I had thought that such a class might be called 'characters' (to correspond with 'bytes' in Python 3000), but it could just as easily be called strbuffer or something else.

-- Talin
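To make the wish list above concrete, here is a minimal sketch of the interface in present-day Python. The class name CharBuffer and its append() method are hypothetical, not an existing or proposed type. It stores one Python string per character in a list, so it still pays the per-character object overhead complained about above; it only illustrates the behaviour (str(), print, comparison with strings, string slices, borrowed string methods), not the memory layout.

```python
class CharBuffer:
    """Hypothetical mutable character buffer; illustration only."""

    def __init__(self, initial=""):
        # One list entry per character; list.append is amortized O(1).
        self._chars = list(initial)

    def append(self, text):
        # Accepts a single character or a longer string.
        self._chars.extend(text)

    def __len__(self):
        return len(self._chars)

    def __str__(self):
        # str(buf) and print(buf) yield the contents, not a repr().
        return "".join(self._chars)

    def __repr__(self):
        return f"CharBuffer({str(self)!r})"

    def __getitem__(self, index):
        # Both single items and slices come back as real strings.
        result = self._chars[index]
        return result if isinstance(index, int) else "".join(result)

    def __setitem__(self, index, value):
        if isinstance(index, slice):
            self._chars[index] = list(value)
        elif len(value) == 1:
            self._chars[index] = value
        else:
            raise ValueError("expected a single character")

    def __eq__(self, other):
        # Compare by contents against strings and other buffers.
        if isinstance(other, (str, CharBuffer)):
            return str(self) == str(other)
        return NotImplemented

    def __getattr__(self, name):
        # Reached only when normal lookup fails: borrow str methods
        # (upper, find, split, ...) by applying them to a snapshot.
        if name.startswith("_"):
            raise AttributeError(name)
        return getattr(str(self), name)


buf = CharBuffer()
for ch in "hello":
    buf.append(ch)
buf[0] = "J"
print(buf)  # prints: Jello
```

With this sketch, buf.upper() returns the string "JELLO", buf[1:3] returns the string "el", and buf == "Jello" is true. Every borrowed method builds a full string copy first, which is exactly the cost a native implementation would avoid.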

Talin <talin@acm.org> wrote:
8-bit ASCII characters, or compile-time specified unicode characters (16 or 32 bit)?

If all you wanted was mutable characters, there's array.array('c'); it's smart about appending. The lack of string methods kind of kills it though.

I was pushing for string views oh, about 7 months ago for very similar reasons; it would be *really* nice to be able to add string methods to anything that provided the buffer interface. Never mind that if it offered a multi-byte buffer view (like the extended buffer interface that will be coming in Py3k), you could treat arbitrary data as if it were strings - an array of 16 bit ints would be the same as 8 bit ints, the same as 8 bit characters, the same as 32 bit ints, etc.

I guess I was 7 months too early in my proposal.

- Josiah
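As a rough illustration of the two ideas in this reply, using what Python 3 eventually shipped rather than the string views proposed here: the 'c' typecode is Python 2 only, so the sketch below substitutes bytearray as the appendable byte buffer with string-like methods, and uses memoryview.cast() to look at an array of 16-bit integers as plain bytes. The variable names are made up for the example.

```python
from array import array

# A mutable byte buffer that grows efficiently on append and still
# offers string-like methods (upper, find, split, ...).
buf = bytearray()
for byte in b"spam":
    buf.append(byte)          # append takes one integer in range(256)
shouted = buf.upper()         # returns a new bytearray

# The buffer interface lets the same memory be viewed with a
# different item size: 'H' items (unsigned 16-bit) seen as bytes.
samples = array("H", [0x4142, 0x4344])
view = memoryview(samples)
as_bytes = view.cast("B")

# One 'H' item covers samples.itemsize bytes in the byte view.
assert len(as_bytes) == len(samples) * samples.itemsize
```

The byte order inside as_bytes depends on the platform, so the sketch only checks lengths. Note that memoryview itself has no string methods, which is the gap the string view idea was aimed at.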

On 3/11/07, Talin <talin@acm.org> wrote:
I do a lot of stuff with parsing, and it's often convenient to build up long strings of text one character at a time.
Could you be more specific about this? When I write a parser it always starts with either

    _token_re = re.compile(r'''(?x) ...15 lines omitted... ''')

or

    import yapps2 # wheeee!

I've never had much luck hand-coding a lexer in straight-up Python. Not only is it slow, I feel like Python's syntax is working against me - no switch statement, no do-while. (This is not a complaint! It's all right. I should be using a parser-generator anyway.)

Josiah mentioned array.array('c'). There's also array.array('u'), which is an array of Py_UNICODEs. You can add the string methods in a subclass for a quick prototype.

-j
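For readers who have not seen the regex-first style mentioned above, a small sketch follows. The token grammar is invented for illustration and is not the pattern elided in the message; tokenize() is likewise a hypothetical helper. It uses verbose mode (?x) so the pattern can carry whitespace and comments, and match.lastgroup to find out which alternative matched.

```python
import re

# Hypothetical token grammar; each alternative is a named group.
_token_re = re.compile(r"""(?x)
    (?P<number> \d+ )            # integer literal
  | (?P<name>   [A-Za-z_]\w* )   # identifier
  | (?P<op>     [-+*/=()] )      # one-character operator
  | (?P<space>  \s+ )            # whitespace is dropped by tokenize
""")


def tokenize(text):
    """Yield (kind, value) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        match = _token_re.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character {text[pos]!r} at {pos}")
        pos = match.end()
        if match.lastgroup != "space":
            yield match.lastgroup, match.group()


tokens = list(tokenize("x = 42 + y1"))
```

Here tokens ends up as [("name", "x"), ("op", "="), ("number", "42"), ("op", "+"), ("name", "y1")]. The scanning loop runs inside the regex engine, so no text is assembled one character at a time in Python code.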

"Jason Orendorff" <jason.orendorff@gmail.com> wrote:
I don't believe he's talking about parsing in the language lexer sense; I believe he is talking about perhaps URL parsing (breaking it down into its component parts), "unmarshaling" (think pickle, marshal, etc.), or possibly even configuration files.
Kind-of, but it's horribly slow. The point of string views that I mentioned is that you get all of the benefits of the underlying C implementation; from speed to "it's already been implemented and debugged". - Josiah

Participants (3): Jason Orendorff, Josiah Carlson, Talin