[New-bugs-announce] [issue35628] Allow lazy loading of translations in gettext.

s-ball report at bugs.python.org
Mon Dec 31 08:14:02 EST 2018


New submission from s-ball <s-ball at laposte.net>:

When working on i18n, I realized that msgfmt.py did not generate any hash table. One step further, I realized that gettext.py would not have used it anyway, because it unconditionally loads the whole translation file and contains the following TODO message:

TODO:
- Lazy loading of .mo files.  Currently the entire catalog is loaded into
memory, but that's probably bad for large translated programs.  Instead,
the lexical sort of original strings in GNU .mo files should be exploited
to do binary searches and lazy initializations.  Or you might want to use
the undocumented double-hash algorithm for .mo files with hash tables, but
you'll need to study the GNU gettext code to do this.

I have studied the code and found that it should not be too complex to implement in pure Python. I have posted a message on python-ideas about it, and here are my conclusions:

Features:
========
The gettext module should be able to load catalogs lazily from mo
files. This lazy loading should be optional and should use the hash
tables from mo files when they are present, or fall back to a binary
search. The translated strings should be cached for better performance.

API changes:
============
Three functions from the gettext module would get two new optional
parameters named caching and keepopen:

gettext.bindtextdomain(domain, localedir=None) would become
gettext.bindtextdomain(domain, localedir=None, caching=None, keepopen=False)

gettext.translation(domain, localedir=None, languages=None, class_=None, 
fallback=False, codeset=None) would become
gettext.translation(domain, localedir=None, languages=None, class_=None, 
fallback=False, codeset=None, caching=None, keepopen=False)

gettext.install(domain, localedir=None, codeset=None, names=None) would 
become
gettext.install(domain, localedir=None, codeset=None, names=None, 
caching=None, keepopen=False)

The new caching parameter could receive the following values:
caching=None: revert to the previous eager loading of the full catalog.
It would be the default, so that existing applications see no change
caching=1: lazy loading with unlimited cache
caching=n where n is a positive (>=0) integer value: lazy loading with an
LRU cache limited to n strings

The keepopen parameter would be a boolean:
keepopen=False (default): the mo file is only opened before loading a
translation string and closed immediately after. It is also opened once
when the GNUTranslations class is initialized, to load the file description
keepopen=True: the mo file is kept open during the lifetime of the
GNUTranslations object.
This parameter is ignored if caching is None

Implementation:
==============
The current GNUTranslations class loads the content of the mo file to
build a dictionary where the original strings are the keys and the
translated strings the values. Plural forms get special processing: the
key is a 2-tuple (singular original string, order), and the value is the
corresponding translated string; order=0 is normally for the singular
translated string.
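
As an illustration of that shape, here is a minimal sketch of such a
dictionary and of a plural lookup against it. The sample strings, the
"n > 1" plural rule and the helper names are assumptions made for the
example; the real rule comes from the Plural-Forms header of the catalog.

```python
# Shape of the eagerly loaded catalog described above (hypothetical data).
catalog = {
    "Open": "Ouvrir",                  # plain message: str key
    ("%d file", 0): "%d fichier",      # plural message: (singular, order) key
    ("%d file", 1): "%d fichiers",
}


def plural_index(n):
    # Assumed plural rule "n > 1"; real rules are read from the catalog.
    return int(n > 1)


def ngettext(singular, plural, n):
    # Fall back to the untranslated strings when the key is missing.
    default = singular if n == 1 else plural
    return catalog.get((singular, plural_index(n)), default)
```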

The proposed implementation would simply replace this dictionary with a
special mapping subclass when caching is not None. That subclass would
use the same keys as the original dictionary and would:
- first search in its cache
- if not found in cache and the hash table has a non-zero size, search
the original string by hash
- if not found in cache and the hash table has a zero size, search the
original string with a binary search algorithm
- if a string is found, feed it to the LRU cache, possibly evicting the
oldest entry (or entries)

That should make it possible to implement the new feature with minimal
refactoring of the gettext module.
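
The lookup order listed above could be sketched roughly as follows. This
is only an illustration under several assumptions: LazyCatalog and
build_mo are hypothetical names, the whole file is held in a bytes object
instead of being read on demand, only little-endian files and plain
(non-plural) keys are handled, and the hash step is left out so that
every cache miss goes through the binary search. maxsize=None stands for
an unlimited cache; how the caching argument would map onto it is not
decided here.

```python
import struct
from collections import OrderedDict

MO_MAGIC = 0x950412DE  # magic number of a GNU .mo file


def build_mo(messages):
    """Build a minimal little-endian .mo image (no hash table) for the demo."""
    msgids = sorted(messages)                  # .mo files keep msgids sorted
    strings = msgids + [messages[m] for m in msgids]
    offset = 28 + 8 * len(strings)             # header + both index tables
    index, blob = b"", b""
    for s in strings:
        index += struct.pack("<2I", len(s), offset)
        blob += s + b"\0"
        offset += len(s) + 1
    n = len(msgids)
    header = struct.pack("<7I", MO_MAGIC, 0, n, 28, 28 + 8 * n, 0, 0)
    return header + index + blob


class LazyCatalog:
    """Hypothetical lazy mapping over a .mo image held in `data`."""

    def __init__(self, data, maxsize=None):
        magic, _rev, self.count, self.orig_off, self.trans_off = (
            struct.unpack_from("<5I", data, 0))
        if magic != MO_MAGIC:
            raise ValueError("not a little-endian .mo file")
        self.data = data
        self.maxsize = maxsize                 # None: unlimited cache
        self.cache = OrderedDict()

    def _string(self, table_off, i):
        # Each index entry is a (length, offset) pair of 32-bit integers.
        length, offset = struct.unpack_from("<2I", self.data, table_off + 8 * i)
        return self.data[offset:offset + length]

    def _search(self, key):
        # Binary search over the lexically sorted original strings.
        lo, hi = 0, self.count
        while lo < hi:
            mid = (lo + hi) // 2
            candidate = self._string(self.orig_off, mid)
            if candidate == key:
                return self._string(self.trans_off, mid)
            if candidate < key:
                lo = mid + 1
            else:
                hi = mid
        return None

    def __getitem__(self, msgid):
        if msgid in self.cache:                # 1. cache hit
            self.cache.move_to_end(msgid)
            return self.cache[msgid]
        raw = self._search(msgid.encode("utf-8"))   # 2. search the file image
        if raw is None:
            raise KeyError(msgid)
        value = raw.decode("utf-8")
        self.cache[msgid] = value              # 3. feed the LRU cache
        if self.maxsize is not None and len(self.cache) > self.maxsize:
            self.cache.popitem(last=False)     # evict the oldest entry
        return value
```

With maxsize=2, looking up three different strings leaves only the two
most recently used ones in the cache.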

But I also propose to change msgfmt.py to build the hash table. IMHO, this functionality should live in the standard library, probably as a submodule of gettext, to allow various Python projects (pybabel, django) to use it directly instead of developing their own.
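
For illustration, building and probing such a hash table might look like the sketch below. It follows the double-hashing scheme as it is usually described for GNU .mo files (a PJW-style hash on 32 bits, a prime table size, slots storing index + 1 with 0 meaning empty), but those details are assumptions that would have to be checked against the GNU gettext sources before writing real files. The names hashpjw, next_prime, probe, build_hash_table and lookup are hypothetical.

```python
def hashpjw(data):
    """PJW-style hash truncated to 32 bits (assumed to match GNU gettext)."""
    hval = 0
    for byte in data:
        hval = ((hval << 4) + byte) & 0xFFFFFFFF
        g = hval & 0xF0000000
        if g:
            hval ^= g >> 24
            hval ^= g
    return hval


def next_prime(n):
    """Return the smallest prime >= n (trial division, fine for a sketch)."""
    def is_prime(k):
        return k > 1 and all(k % d for d in range(2, int(k ** 0.5) + 1))
    while not is_prime(n):
        n += 1
    return n


def probe(hval, size):
    """Yield the slots visited by double hashing for this hash value."""
    slot = hval % size
    incr = 1 + hval % (size - 2)       # size is prime, so all slots get visited
    while True:
        yield slot
        slot = (slot + incr) % size


def build_hash_table(msgids):
    """Return a table where value i + 1 points at msgids[i]; 0 means empty."""
    size = max(3, next_prime(len(msgids) * 4 // 3))
    table = [0] * size
    for index, msgid in enumerate(msgids):
        for slot in probe(hashpjw(msgid), size):
            if table[slot] == 0:
                table[slot] = index + 1
                break
    return table


def lookup(table, msgids, key):
    """Return the index of key in msgids, or None if it is absent."""
    for slot in probe(hashpjw(key), len(table)):
        entry = table[slot]
        if entry == 0:
            return None                # empty slot: key is not in the table
        if msgids[entry - 1] == key:
            return entry - 1
```

The table is always larger than the number of strings, so at least one slot stays empty and a lookup for a missing string terminates.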

I will probably submit a PR in a while, but it will require some time to propose a full implementation with correct test coverage.

----------
components: Library (Lib)
messages: 332815
nosy: s-ball
priority: normal
severity: normal
status: open
title: Allow lazy loading of translations in gettext.
type: enhancement
versions: Python 3.8

_______________________________________
Python tracker <report at bugs.python.org>
<https://bugs.python.org/issue35628>
_______________________________________

