Storing a big amount of path names

Paulo da Silva p_s_d_a_s_i_l_v_a_ns at netcabo.pt
Thu Feb 11 23:45:57 EST 2016


At 04:23 on 12-02-2016, Chris Angelico wrote:
> On Fri, Feb 12, 2016 at 3:15 PM, Paulo da Silva
> <p_s_d_a_s_i_l_v_a_ns at netcabo.pt> wrote:
>> At 03:49 on 12-02-2016, Chris Angelico wrote:
>>> On Fri, Feb 12, 2016 at 2:13 PM, MRAB <python at mrabarnett.plus.com> wrote:
>>>> Apart from all of the other answers that have been given:
>>>>
>> ...
>>>
>>> Simpler to let the language do that for you:
>>>
>>>>>> import sys
>>>>>> p1 = sys.intern('foo/bar')
>>>>>> p2 = sys.intern('foo/bar')
>>>>>> id(p1), id(p2)
>>> (139621017266528, 139621017266528)
>>>
>>
>> I didn't know about id or sys.intern :-)
>> I need to look at them ...
>>
>> As I understand it, in the MyFile class I can do
>>
>> self.dirname=sys.intern(dirname) # dirname passed as arg to the __init__
>>
>> and the character string doesn't get repeated.
>> Is this correct?
> 
> Correct. Two equal strings, passed to sys.intern(), will come back as
> identical strings, which means they use the same memory. You can have
> a million references to the same string and it takes up no additional
> memory.
I have been playing with this and found that it is not always true!
For example:

In [1]: def f(s):
   ...:     print(id(sys.intern(s)))
   ...:

In [2]: import sys

In [3]: f("12345")
139805480756480

In [4]: f("12345")
139805480755640

In [5]: f("12345")
139805480756480

In [6]: f("12345")
139805480756480

In [7]: f("12345")
139805480750864

I think a dict, as MRAB suggested, is needed.
At the end of the store process I may delete the dict.

> 
> But I reiterate: Don't even bother with this unless you know your
> program is running short of memory.

Yes, it is.
This is part of a previous post (sets of equal files) and I need lots of
memory for performance reasons. I only have 2G in this computer.

I had already implemented a solution using two dicts: one maps each
dirname to an int handler, and the other maps the handler back to the
dirname. At the end I deleted the first one, because I only need to get
the dirname from the handler. But I thought there should be a better choice.
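
Roughly, the two-dict scheme looks like the sketch below. All names and the
example paths are made up for illustration; this is not my actual code:

```python
# Hypothetical names; a sketch of the two-dict scheme described above.
dir_to_handle = {}   # dirname -> int handler, only needed while storing
handle_to_dir = {}   # int handler -> dirname, kept for later lookups

def handle_for(dirname):
    """Return the int handler for dirname, allocating a new one if needed."""
    handle = dir_to_handle.get(dirname)
    if handle is None:
        handle = len(dir_to_handle)      # next unused integer
        dir_to_handle[dirname] = handle
        handle_to_dir[handle] = dirname
    return handle

h1 = handle_for("/tmp/a")
h2 = handle_for("/tmp/b")
h3 = handle_for("/tmp/a")   # same directory, so the same handler comes back

# After the store phase the forward mapping is no longer needed.
del dir_to_handle

print(handle_to_dir[h3])
```

Each file object then stores only a small int instead of its own copy of
the directory string.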

Thanks
Paulo


