Re: [Patches] Translating doc strings
Martin von Loewis wrote:
To simplify usage of Python for people who don't speak English well, I'd like to start a project translating the doc strings in the Python library. This has two aspects:
1. there must be a simple way to access the translated doc strings
Do you plan to use GNU gettext here? (This would cause the translated version of Python to fall under the GPL, AFAIK.) I'd propose to use the existing doc-strings as keys to a translation mapping. This ensures that existing doc-strings remain intact and that the actual translation is done at query time, e.g. by using a help() built-in function.
2. there must be actual translations to the various native languages of Python users.
Since the second task is much more complicated, I submit a snapshot of this project, namely, a message catalog of the doc strings in the Python libraries, taken from the CVS; along with a snapshot of the German translations. I intend to complete the German translations in the coming weeks, but I want to give other translators a chance to also start working on that.
Please note that extracting the docstrings was not straightforward; I've used François Pinard's most excellent po-utils 0.5 as a starting point, and enhanced it with the capability to recognize __doc__[] in C code, so that I would also get (most of) the doc strings in C modules. I plan to update this catalog a few times before Python 1.6 is released, so that translators can update their translations.
A key point is finding translators. I propose to use the infrastructure of the GNU translation project for that: there are established teams for all major languages, and an infrastructure (also maintained by François) where notifications about new catalogs are automatically distributed to the teams. That should not stop volunteers who don't currently participate in the GNU translation project from translating; however, they should announce that they plan to work on translating these messages, to avoid duplication of work.
This will only work iff the translations can be submitted via the usual "post to patches with disclaimer" method... aren't the GNU people interested in putting the translations under the GPL?
Another matter is where the catalogs should live in the Python source tree. I propose to have a Misc/po directory, which will contain both the PO catalogs and the binary .mo objects; only the latter will be installed during the installation process.
Please let me know what you think, in particular, whether I can submit the catalog to the translation teams.
Regards, Martin
Attachment: doc.tgz (type: application/octet-stream, encoding: base64)
-- Marc-Andre Lemburg ______________________________________________________________________ Business: http://www.lemburg.com/ Python Pages: http://www.lemburg.com/python/
Do you plan to use GNU gettext here? (This would cause the translated version of Python to fall under the GPL, AFAIK.)
No, I plan to use the Python gettext module, which is currently being integrated into Python. It will either use the system's gettext library, or read mo files using pure Python.
I'd propose to use the existing doc-strings as keys to a translation mapping. This ensures that existing doc-strings remain intact and that the actual translation is done at query time, e.g. by using a help() built-in function.
This is more or less what I've planned. I'd propose to call the function doc, with an interface like
    doc(time.time)
    time() -> Gleitkommazahl

    Gib die aktuelle Zeit in Sekunden seit Beginn der Epoche zurück.
    Sekundenbruchteile sind vorhanden, falls die Systemuhr sie bereitstellt.

It won't use a dictionary, though, but the underlying gettext query mechanism. Exact naming and parameters are certainly subject to discussion; my proposal would be

    doc(object, doprint=1, translate=1)

so that users save quite some typing over

    print time.time.__doc__
    time() -> floating point number

    Return the current time in seconds since the Epoch.
    Fractions of a second may be present if the system clock provides them.
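For readers following along today, the proposed doc(object, doprint=1, translate=1) interface can be sketched roughly as below. This is only an illustration in modern Python 3, not the code discussed in the thread: the catalog domain name "python-docstrings" and the extra `translation` parameter are assumptions made for the example, and the original docstring is used as the gettext lookup key, as suggested above.

```python
import gettext

# Hypothetical catalog domain; the thread does not name one.
DOCSTRING_DOMAIN = "python-docstrings"


def doc(obj, doprint=True, translate=True, translation=None):
    """Print (or return) the doc string of obj, translated if possible.

    `translation` is an illustrative extra parameter: any object with a
    gettext() method. If omitted, the catalog is looked up through the
    standard gettext machinery; fallback=True means a missing catalog
    simply yields the untranslated English text.
    """
    text = getattr(obj, "__doc__", None) or ""
    if translate and text:
        if translation is None:
            translation = gettext.translation(DOCSTRING_DOMAIN, fallback=True)
        # The original doc string itself is the lookup key.
        text = translation.gettext(text)
    if doprint:
        print(text)
        return None
    return text
```

Because the English doc string stays in the source and serves as the key, an object with no translation available still shows its original documentation.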
This will only work iff the translations can be submitted via the usual "post to patches with disclaimer" method... aren't the GNU people interested in putting the translations under the GPL?
Is it really necessary to have the translations posted to patches@python.org? Or would it be sufficient if translators express their disclaimer in some other way? I don't think the translation teams are "the GNU people"; the translators agreed to assign their copyright to the FSF for the translations they did. I'd assume at least some of them would also accept retaining the copyright, or assigning it to the Python Consortium (or whoever else wants it). It's more that the Python distributor would need to make suggestions about what the copyright on translations should be; I'm sure that could be clearly communicated to the translators.
Regards, Martin
"MvL" == Martin von Loewis <loewis@informatik.hu-berlin.de> writes:
MvL> No, I plan to use the Python gettext module, which is
MvL> currently being integrated into Python. It will either use
MvL> the system's gettext library, or read mo files using pure
MvL> Python.

pygettext is in Tools/i18n, and I've been working with James Henstridge and Peter Funk on getting a standard gettext module integrated into the core. A few other things have bumped that down on my list, but it's still there.

We'll still need xgettext to scan the C code. Also, marking Python module docstrings is a bit problematic. I've resorted to Something Really Ugly:

-------------------- snip snip --------------------
try:
    import fintl
    _ = fintl.gettext
except ImportError:
    def _(s): return s

__doc__ = _("""pygettext -- Python equivalent of xgettext(1) ...""")
-------------------- snip snip --------------------

Yuck.

-Barry
Hi, [Barry A. Warsaw]:
pygettext is in Tools/i18n, and I've been working with James Henstridge and Peter Funk on getting a standard gettext module integrated into the core. A few other things have bumped that down on my list, but it's still there.
I will try to make some progress. Currently I'm not sure how to define a class 'Translator' ... I'm open to suggestions. James has also made some interesting points.
We'll still need xgettext to scan the C code. Also, marking Python module docstrings is a bit problematic. I've resorted to Something Really Ugly:
-------------------- snip snip --------------------
try:
    import fintl
    _ = fintl.gettext
except ImportError:
    def _(s): return s

__doc__ = _("""pygettext -- Python equivalent of xgettext(1) ...""")
-------------------- snip snip --------------------
Yuck.
I agree: this is really ugly. Since doc-strings are something special, I don't think we should travel further down this road. I believe we should use a special doc-string extraction tool (possibly built on top of ping's 'inspect.py'?), which will then create a .pot file solely out of __doc__ strings.
Regards, Peter.
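A docstring-only extractor of the kind suggested here might look roughly like the following in modern Python 3. It is a sketch under stated assumptions, not the tool discussed in the thread: it uses the standard `ast` module rather than inspect.py or xpot, the function names are invented for the example, and the output is a bare-bones PO template without the usual header entry or source references.

```python
import ast


def extract_docstrings(source):
    """Return the doc strings of the module, classes and functions in source."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            # clean=False keeps the text exactly as __doc__ would hold it,
            # so it can later serve as the lookup key.
            text = ast.get_docstring(node, clean=False)
            if text:
                found.append(text)
    return found


def po_quote(text):
    """Quote a string for a PO file (backslash, quote and newline escapes)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return '"' + escaped + '"'


def to_pot(docstrings):
    """Render doc strings as PO template entries with empty translations."""
    lines = []
    seen = set()
    for text in docstrings:
        if text in seen:  # each msgid may appear only once in a catalog
            continue
        seen.add(text)
        lines.append("msgid " + po_quote(text))
        lines.append('msgstr ""')
        lines.append("")
    return "\n".join(lines)
```

Ordinary string literals in the code are ignored, so only documentation ends up in the template and the source needs no `_()` markup around `__doc__`.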
I agree: this is really ugly. Since doc-strings are something special, I don't think we should travel further down this road. I believe we should use a special doc-string extraction tool (possibly built on top of ping's 'inspect.py'?), which will then create a .pot file solely out of __doc__ strings.
I agree. Again, I'd like to advertise François Pinard's xpot, which can extract doc strings from both Python code and C code. Regards, Martin
On Fri, 2 Jun 2000, Peter Funk wrote:
I agree: this is really ugly. Since doc-strings are something special, I don't think we should travel further down this road. I believe we should use a special doc-string extraction tool (possibly built on top of ping's 'inspect.py'?), which will then create a .pot file solely out of __doc__ strings.
Getting __doc__ strings is pretty easy (inspect.py is one possibility). But presumably we want to get all the strings, don't we? That should be trivial with tokenize, right?

---- getstrings.py -----
import sys, tokenize

strings = []
def tokeneater(type, token, start, end, line):
    if type == tokenize.STRING:
        strings.append(eval(token))

file = open(sys.argv[1])
tokenize.tokenize(file.readline, tokeneater)
print strings
------------------------

% ./getstrings.py /usr/local/lib/python1.5/calendar.py
['calendar.error', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday',
 'Saturday', 'Sunday', 'Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun', '',
 'January', 'February', 'March', 'April', 'May', 'June', 'July', 'August',
 'September', 'October', 'November', 'December', ' ', 'Jan', 'Feb', 'Mar',
 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec',
 'bad month number', ' ', ' ', '', '', ' ', ' ', '\012', '\012', '\012',
 ' ', '', '', ' ']

Am I missing something?

-- ?!ng
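The script above targets the Python 1.5 tokenize interface (a callback-based tokenize.tokenize and the print statement). A present-day rendering of the same idea might look like the sketch below; it is an assumption-laden adaptation rather than the author's code: it takes source text instead of a file name, uses ast.literal_eval in place of eval, and skips anything that does not evaluate to a plain str.

```python
import ast
import io
import tokenize


def get_strings(source):
    """Collect the value of every plain string literal in Python source text."""
    strings = []
    readline = io.StringIO(source).readline
    for tok in tokenize.generate_tokens(readline):
        if tok.type != tokenize.STRING:
            continue
        try:
            value = ast.literal_eval(tok.string)
        except (ValueError, SyntaxError):
            # e.g. f-strings on interpreters that still tokenize them as STRING
            continue
        if isinstance(value, str):  # ignore bytes literals
            strings.append(value)
    return strings
```

For example, `get_strings(open("calendar.py").read())` would return every string literal in that file, translatable or not.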
But presumably we want to get all the strings, don't we?
Certainly not. For example, in ftplib, the string "anonymous" should not be translated - it would end up as "anonym" in German, and that would not be accepted by FTP servers. In general, great care is needed to select translatable strings. For example, the GNU ls program was localized to print the month names in German. Pretty safe, eh? Now, the emacs dired mode wouldn't recognize any file names in the list output anymore, because it had a regular expression to detect the various fields, which involved an alternative list for all the month names... Regards, Martin
We'll still need xgettext to scan the C code.
Please have a look at my lib.pot; I've been using xpot to extract the C doc strings, which aren't currently marked up in the Python source. As for module docstrings: xpot doesn't recognize them either, but I think it could be improved to do so. However, that would substantially increase the size of the catalogs, so I'd recommend adding them only when the translators are done with the first round of translation. Having the full set of distutils doc strings in the catalog is already a substantial amount of text to translate...
Regards, Martin
On 02 June 2000, Martin von Loewis said:
Having the full set of distutils doc strings in the catalog is already a substantial amount of text to translate...
Most of those docstrings in the Distutils are not really for public consumption; they're there so that Distutils developers can remember (or learn) what the heck such-and-such a method is supposed to do. Also, they're a moving target; things are still changing in the Distutils, and trying to keep on top of translating internal docstrings would be a hopeless and frustrating task. Greg
participants (6)
- bwarsaw@python.org
- Greg Ward
- Ka-Ping Yee
- M.-A. Lemburg
- Martin von Loewis
- pf@artcom-gmbh.de