How can I create customized classes that have similar properties as 'str'?
Steven D'Aprano
steve at REMOVE-THIS-cybersource.com.au
Sat Nov 24 16:59:50 EST 2007
On Sat, 24 Nov 2007 03:44:59 -0800, Licheng Fang wrote:
> On Nov 24, 7:05 pm, Bjoern Schliessmann <usenet-
> mail-0306.20.chr0n... at spamgourmet.com> wrote:
>> Licheng Fang wrote:
>> > I find myself frequently in need of classes like this for two
>> > reasons. First, it's efficient in memory.
>>
>> Are you using millions of objects, or MB size objects? Otherwise, this
>> is no argument.
>
> Yes, millions.
Oh noes!!! Not millions of words!!!! That's like, oh, a few tens of
megabytes!!!!1! How will a PC with one or two gigabytes of RAM cope?????

Tens of megabytes is not a lot of data.

If the average word size is ten characters, then one million words take
ten million bytes, or a little shy of ten megabytes. Even if you are
using four-byte characters, you've got 40 MB, still a moderate amount of
data on a modern system.
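
The arithmetic above can be checked with a short sketch. The helper name
text_size_mb and its parameters are made up for illustration, and the
figure covers raw character data only, not Python's per-object overhead.

```python
def text_size_mb(n_words, avg_chars=10, bytes_per_char=1):
    """Raw character data in megabytes (2**20 bytes), ignoring object overhead."""
    return n_words * avg_chars * bytes_per_char / 2**20

# One million ten-character words, one byte per character.
print(round(text_size_mb(1_000_000), 2))                    # 9.54
# The same words stored with four-byte characters.
print(round(text_size_mb(1_000_000, bytes_per_char=4), 2))  # 38.15
```

Measured in binary megabytes, the four-byte case comes to about 38 MB,
which is the "40 MB" above when counted in units of a million bytes.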
> In my natural language processing tasks, I almost always
> need to define patterns, identify their occurrences in a huge data set,
> and count them. Say, I have a big text file, consisting of millions of
> words, and I want to count the frequency of trigrams:
>
> trigrams([1,2,3,4,5]) == [(1,2,3),(2,3,4),(3,4,5)]
>
> I can save the counts in a dict D1. Later, I may want to recount the
> trigrams, with some minor modifications, say, doing it on every other
> line of the input file, and the counts are saved in dict D2. Problem is,
> D1 and D2 have almost the same set of keys (trigrams of the text), yet
> the keys in D2 are new instances, even though these keys probably have
> already been inserted into D1. So I end up with unnecessary duplicates
> of keys. And this can be a great waste of memory with huge input data.
All these keys will almost certainly add up to only a few hundred
megabytes, which is a sizeable amount of data but not an excessive one.
This really sounds to me like a case of premature optimization. I think
you are wasting your time solving a non-problem.
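
For reference, the trigram counting described in the quoted message could
be sketched as below. Only the name trigrams and its expected result come
from the quoted example; count_trigrams and the use of
collections.Counter are assumptions made for illustration, not the
original poster's code.

```python
from collections import Counter

def trigrams(items):
    """Return every run of three consecutive items as a tuple."""
    return list(zip(items, items[1:], items[2:]))

def count_trigrams(words):
    """Map each trigram to the number of times it occurs."""
    return Counter(trigrams(words))

words = "a b c a b c".split()
counts = count_trigrams(words)
print(trigrams([1, 2, 3, 4, 5]))   # [(1, 2, 3), (2, 3, 4), (3, 4, 5)]
print(counts[("a", "b", "c")])     # 2
```

Two such counters built from the same text hold equal keys, and equality
and hashing are all a dict lookup depends on.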
[snip]
> Wow, I didn't know this. But how exactly does Python manage these
> strings? My interpreter gave me these results:
>
> >>> a = 'this'
> >>> b = 'this'
> >>> a is b
> True
> >>> a = 'this is confusing'
> >>> b = 'this is confusing'
> >>> a is b
> False
It's an implementation detail. You shouldn't use identity testing unless
you actually care that two names refer to the same object, not because
you want to save a few bytes. That's poor design: it's fragile,
complicated, and defeats the purpose of using a high-level language like
Python.
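
A minimal sketch of the difference, assuming Python 3 (where interning is
exposed as sys.intern; Python 2 spelled it as the builtin intern). The
strings are built at run time so that the result does not depend on how
the compiler treats equal literals.

```python
import sys

# Build the strings at run time so the compiler cannot merge equal literals.
a = " ".join(["this", "is", "confusing"])
b = " ".join(["this", "is", "confusing"])

# Equality compares contents and behaves the same on every implementation.
print(a == b)   # True

# Whether `a is b` holds is up to the implementation, so it is not tested here.

# sys.intern returns one canonical object per distinct string value.
a_shared = sys.intern(a)
b_shared = sys.intern(b)
print(a_shared is b_shared)   # True
```

Explicit interning at least makes the sharing deliberate instead of
depending on what the interpreter happens to do, but comparisons should
still use ==.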
--
Steven.