Mailman 3 Use lazy loading with hashtable in python gettext module - Python-ideas

Oleg Broytman

December 2018

1:14 p.m.

Hi! On Tue, Dec 18, 2018 at 10:10:51AM +0100, Serge Ballesta via Python-ideas <python-ideas@python.org> wrote:

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel ( http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

I use gettext quite extensively. I use Python's msgfmt to generate .mo files. I also use Django's compilemessage; I don't know what it uses internally, it could be an independent implementation or Python's msgfmt.

...

I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying

# TODO: # - Lazy loading of .mo files. Currently the entire catalog is loaded into # memory, but that's probably bad for large translated programs. Instead, # the lexical sort of original strings in GNU .mo files should be exploited # to do binary searches and lazy initializations. Or you might want to use # the undocumented double-hash algorithm for .mo files with hash tables, but # you'll need to study the GNU gettext code to do this.

I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard.

That's interesting!

...

The undocumented features required for implementation are: - the version number can safely stay to 0 when processing Python code - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~ high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found)

For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings.

Remaining problems: - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it?

You should ask a lawyer and I am not. But my understanding is that you can borrow ideas from a GPL-protected code without contaminating your code with GPL. You cannot copy code -- that makes your code GPL'd.

...

- the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better

There is the third option -- open and close the file. I'd prefer the option as file descriptors are precious resources limited in supply. There is a twist though. The file could be replaced while closed so you have to find a way to verify the was replaced and reread the has table from it. Perhaps checking timestamp of the file (date/time of the last modification) is enough.

...

- I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) - it would be an important change, and I think that options should be allow to choose between an eager or lazy access

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end.

Well, I mus admit my .po/.mo aren't that big. The biggest .po is 60k, its corresponding .mo is only 30k bytes. I don't know if using the hash table gives me improvement. Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Reply

Sign in to reply online Use email software

Barry Scott

10:09 p.m.

...

On 18 Dec 2018, at 09:10, Serge Ballesta via Python-ideas <python-ideas@python.org> wrote:

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel (http://babel.pocoo.org/en/latest/ <http://babel.pocoo.org/en/latest/>), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

snip

...

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end

I think about it this way. Based on the largest project I have worked on that was internationalised into 14 languages the British English text translated to American English (en-US) created a 350KiB MO file. The largest mo file was for Thai (th-TH) at 680KiB. Is it worth the complexity of the hash code to save that memory? Will the hash code improve the load time? We never noticed the load time and we reloaded the MO on ever web page access. As for FDs it uses 1 and on my linux system I have 1.6M to play with. Barry

Reply

Sign in to reply online Use email software

Serge Ballesta

5:06 p.m.

Hi all, The feed back on my initial mail convinced me that it was important to allow the current behaviour of eagerly loading the whole catalog, and that keeping the files opened should also be optional. All that lead to this proposal: Features: ======== The gettext module should be allowed to load lazily the catalogs from mo file. This lazy load should be optional and make use of the hash tables from mo files when they are present or revert to a binary search. The translation strings should be cached for better performances. API changes: ============ 3 functions from the gettext module will have 2 new optional parameter named caching, and keepopen: gettext.bindtextdomain(domain, localedir=None) would become gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False) gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None) would become gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None, caching=None, keepopen=False) gettext.install(domain, localedir=None, codeset=None, names=None) would become gettext.install(domain, localedir=None, codeset=None, names=None, caching=None, keepopen=False) The new caching parameter could receive the following values: caching=None: revert to the previour eager loading of the full catalog. It will be the default to allow previous application to see no change caching=1: lazy loading with unlimited cache caching=n where n is a positive (>=0) integer value: lazy loading with a LRU cache limited to n strings The keepopen parameter would be a boolean: keepopen=False (default): the mo file is only opened before loading a translation string and closed immediately after - it is also opened once when the GNUTranslation class is initialized to load the file description keepopen=True: the mo file is kept open during the lifetime of the GNUTranslation object. This parameter is ignored and not used if caching is None Implementation: ============== The current GNUTranslation class loads the content of the mo file to build a dictionnary where the original strings are the keys and the translated keys the values. Plural forms use a special processing: the key is a 2 tuple (singular original string, order), and the value is the corresponding translated string - order=0 is normally for the singular translated string. The proposed implementation would simply replace this dictionary with a special mapping subclass when caching is not None. That subclass would use same keys as the original directory and would: - first search in its cache - if not found in cache and if the hashtable has not a zero size search the original string by hash - if not found in cache and if the hashtable has a zero size, search the original string with a binary search algorithm. - if a string is found, it should feed the LRU cache, eventually throwing away the oldest entry (entries) That should allow to implement the new feature with minimal refactoring for the gettext module. Le 18/12/2018 à 10:10, Serge Ballesta via Python-ideas a écrit :

...

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel (http://babel.pocoo.org/en/latest/), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying

# TODO: # - Lazy loading of .mo files. Currently the entire catalog is loaded into # memory, but that's probably bad for large translated programs. Instead, # the lexical sort of original strings in GNU .mo files should be exploited # to do binary searches and lazy initializations. Or you might want to use # the undocumented double-hash algorithm for .mo files with hash tables, but # you'll need to study the GNU gettext code to do this.

I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard.

The undocumented features required for implementation are: - the version number can safely stay to 0 when processing Python code - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found)

For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings.

Remaining problems: - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it? - the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better - I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) - it would be an important change, and I think that options should be allow to choose between an eager or lazy access

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end.

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Reply

Sign in to reply online Use email software

Oleg Broytman

December 2018

1:14 p.m.

Hi! On Tue, Dec 18, 2018 at 10:10:51AM +0100, Serge Ballesta via Python-ideas <python-ideas@python.org> wrote:

...

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel ( http://babel.pocoo.org/en/latest/ ), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

I use gettext quite extensively. I use Python's msgfmt to generate .mo files. I also use Django's compilemessage; I don't know what it uses internally, it could be an independent implementation or Python's msgfmt.

...

I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying

# TODO: # - Lazy loading of .mo files. Currently the entire catalog is loaded into # memory, but that's probably bad for large translated programs. Instead, # the lexical sort of original strings in GNU .mo files should be exploited # to do binary searches and lazy initializations. Or you might want to use # the undocumented double-hash algorithm for .mo files with hash tables, but # you'll need to study the GNU gettext code to do this.

I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard.

That's interesting!

...

The undocumented features required for implementation are: - the version number can safely stay to 0 when processing Python code - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~ high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found)

For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings.

Remaining problems: - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it?

You should ask a lawyer and I am not. But my understanding is that you can borrow ideas from a GPL-protected code without contaminating your code with GPL. You cannot copy code -- that makes your code GPL'd.

...

- the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better

There is the third option -- open and close the file. I'd prefer the option as file descriptors are precious resources limited in supply. There is a twist though. The file could be replaced while closed so you have to find a way to verify the was replaced and reread the has table from it. Perhaps checking timestamp of the file (date/time of the last modification) is enough.

...

- I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) - it would be an important change, and I think that options should be allow to choose between an eager or lazy access

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end.

Well, I mus admit my .po/.mo aren't that big. The biggest .po is 60k, its corresponding .mo is only 30k bytes. I don't know if using the hash table gives me improvement. Oleg. -- Oleg Broytman https://phdru.name/ phd@phdru.name Programmers don't die, they just GOSUB without RETURN.

Reply

Sign in to reply online Use email software

Barry Scott

10:09 p.m.

...

On 18 Dec 2018, at 09:10, Serge Ballesta via Python-ideas <python-ideas@python.org> wrote:

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel (http://babel.pocoo.org/en/latest/ <http://babel.pocoo.org/en/latest/>), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

snip

...

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end

I think about it this way. Based on the largest project I have worked on that was internationalised into 14 languages the British English text translated to American English (en-US) created a 350KiB MO file. The largest mo file was for Thai (th-TH) at 680KiB. Is it worth the complexity of the hash code to save that memory? Will the hash code improve the load time? We never noticed the load time and we reloaded the MO on ever web page access. As for FDs it uses 1 and on my linux system I have 1.6M to play with. Barry

Reply

Sign in to reply online Use email software

Serge Ballesta

5:06 p.m.

Hi all, The feed back on my initial mail convinced me that it was important to allow the current behaviour of eagerly loading the whole catalog, and that keeping the files opened should also be optional. All that lead to this proposal: Features: ======== The gettext module should be allowed to load lazily the catalogs from mo file. This lazy load should be optional and make use of the hash tables from mo files when they are present or revert to a binary search. The translation strings should be cached for better performances. API changes: ============ 3 functions from the gettext module will have 2 new optional parameter named caching, and keepopen: gettext.bindtextdomain(domain, localedir=None) would become gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False) gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None) would become gettext.translation(domain, localedir=None, languages=None, class_=None, fallback=False, codeset=None, caching=None, keepopen=False) gettext.install(domain, localedir=None, codeset=None, names=None) would become gettext.install(domain, localedir=None, codeset=None, names=None, caching=None, keepopen=False) The new caching parameter could receive the following values: caching=None: revert to the previour eager loading of the full catalog. It will be the default to allow previous application to see no change caching=1: lazy loading with unlimited cache caching=n where n is a positive (>=0) integer value: lazy loading with a LRU cache limited to n strings The keepopen parameter would be a boolean: keepopen=False (default): the mo file is only opened before loading a translation string and closed immediately after - it is also opened once when the GNUTranslation class is initialized to load the file description keepopen=True: the mo file is kept open during the lifetime of the GNUTranslation object. This parameter is ignored and not used if caching is None Implementation: ============== The current GNUTranslation class loads the content of the mo file to build a dictionnary where the original strings are the keys and the translated keys the values. Plural forms use a special processing: the key is a 2 tuple (singular original string, order), and the value is the corresponding translated string - order=0 is normally for the singular translated string. The proposed implementation would simply replace this dictionary with a special mapping subclass when caching is not None. That subclass would use same keys as the original directory and would: - first search in its cache - if not found in cache and if the hashtable has not a zero size search the original string by hash - if not found in cache and if the hashtable has a zero size, search the original string with a binary search algorithm. - if a string is found, it should feed the LRU cache, eventually throwing away the oldest entry (entries) That should allow to implement the new feature with minimal refactoring for the gettext module. Le 18/12/2018 à 10:10, Serge Ballesta via Python-ideas a écrit :

...

In a project of mine, I have used the gettext module from Python Standard Library. I have found that several tools could be used to generate the Machine Object (mo) file from the source Portable Object (one): pybabel (http://babel.pocoo.org/en/latest/), msgfmt.py from Python tools or the original msgfmt from GNU gettext.

I could find that only the original msgfmt was able to generate a hashtable, and that anyway the Python gettext module loaded everything in memory and did not use it. But I also find a TODO note saying

# TODO: # - Lazy loading of .mo files. Currently the entire catalog is loaded into # memory, but that's probably bad for large translated programs. Instead, # the lexical sort of original strings in GNU .mo files should be exploited # to do binary searches and lazy initializations. Or you might want to use # the undocumented double-hash algorithm for .mo files with hash tables, but # you'll need to study the GNU gettext code to do this.

I have studied GNU gettext code and found that implemententing the hashing algorithm in Python would not be that hard.

The undocumented features required for implementation are: - the version number can safely stay to 0 when processing Python code - the size of the hash table is the first odd prime greater than or equal to 4 * n / 3 where n is the number of strings - the first hashing function uses a variant of PJW hash function described in https://en.wikipedia.org/wiki/PJW_hash_function, where the line h = h & ~high is replaced with h = h ^ high, and using 32 bits integers. The index in the table in the result of the function modulus the size of the hash table - when there is a conflict (the slot given by the first hashing function is already used by another string) the following is used: - let h be the result of the PJW variant hash function and size be the size of the hash table, an increment value is set to 1 +( h % (size -2)) - that increment is repeatedly added to the index in the hash table (modulus the table size) until an empty slot is found (or the correct original string is found)

For now, my (alpha) code is able to generate in pure Python the same mo file that GNU msgfmt generates, and use the hashtable to access the strings.

Remaining problems: - I had to read GPL copyrighted code to find the undocumented features. I have of course wrote my own code from scratch, but may I use an Apache Free License 2.1 on it? - the current code for gettext loads everything from the mo file and immediately closes it. My own code keeps the file opened to be able to access it with the mmap module. There could be use case where first option is better - I should either rely on the current way (load everything in memory) or implement a binary search algo for the case where the hash table is not present (it is of course optional) - it would be an important change, and I think that options should be allow to choose between an eager or lazy access

Before going further, I would like to know whether implementing lazy access through the hash table that way seems to be a interesting improvement or a dead end.

_______________________________________________ Python-ideas mailing list Python-ideas@python.org https://mail.python.org/mailman/listinfo/python-ideas Code of Conduct: http://python.org/psf/codeofconduct/

Reply

Sign in to reply online Use email software

Use lazy loading with hashtable in python gettext module

s-ball＠laposte.net

Oleg Broytman

Serge Ballesta

Barry Scott

Serge Ballesta

Serge Ballesta

Oleg Broytman

Serge Ballesta

Barry Scott

Serge Ballesta

Serge Ballesta

tags

participants (4)