Use of HTMLparser to change language

Vlastimil Brom vlastimil.brom at gmail.com
Fri Mar 20 13:42:08 CET 2009


2009/3/20 pranav <pranny at gmail.com>:
> Greetings All,
>
> I have huge number of HTML files, all in english. I also have their
> counterpart files in Spanish. The non english files have their look
> and feel a little different than their english counterpart.
>
> My task is to make sure that the English HTML files contain the
> Spanish text, with retaining the English look and feel.
>
...
>
> Pranny

Hi, I guess this task probably cannot be solved fully automatically
unless the HTML has some exact, regular structure, which doesn't seem
likely.
If you prefer to work with the static sources, you can try to
identify the differences between the markup of the English and Spanish
pages.

e.g. using BeautifulSoup http://www.crummy.com/software/BeautifulSoup/
or, at least approximately, with regular expressions,
e.g. (with html_source being the page source as a string):
import re
tags_only_source = re.findall(r"<[^>]+>", html_source)
This should return the list of tags for simple markup (neglecting
nesting, commented-out code, strings containing tag-like text ...)
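
A minimal sketch of the regular-expression approach above. The sample
markup and the helper name extract_tags are made up for illustration;
a real script would read the page source from a file instead.

```python
import re

# Hypothetical sample page, standing in for the contents of one HTML file.
html_source = (
    '<html><body><h1 class="title">Hello</h1>'
    '<p>Some <b>bold</b> text.</p></body></html>'
)

def extract_tags(source):
    """Return all tags in document order, dropping the text between them.

    Approximate only: comments, scripts, or attribute values that
    contain '>' will confuse this pattern.
    """
    return re.findall(r"<[^>]+>", source)

tags_only_source = extract_tags(html_source)
print(tags_only_source)
```

The result is a flat list of tag strings, which is a convenient input
for a line-by-line comparison of two pages.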

The difflib library could then help in identifying the differences in
the markup, cf.:
http://docs.python.org/library/difflib.html

>>> import difflib
>>> for difference in difflib.ndiff("abcadefsdf", "abQcadsdfAA"): print(difference)
...
  a
  b
+ Q
  c
  a
  d
- e
- f
  s
  d
  f
+ A
+ A

(The sample strings passed to ndiff here can be replaced by the lists
of strings returned by findall() above.)
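
Roughly, combining findall() and ndiff could look like the sketch
below. The two HTML fragments are invented stand-ins for an English
page and its Spanish counterpart, and the variable names are only
illustrative.

```python
import difflib
import re

# Hypothetical fragments; real input would be the contents of the two files.
english_html = '<div class="main"><h1>Welcome</h1><p>Hello</p></div>'
spanish_html = '<div><h1>Bienvenido</h1><p><font size="2">Hola</font></p></div>'

tag_pattern = re.compile(r"<[^>]+>")
english_tags = tag_pattern.findall(english_html)
spanish_tags = tag_pattern.findall(spanish_html)

# Keep only the entries ndiff marks as removed ("- ") or added ("+ ");
# unchanged entries and "? " hint lines are skipped.
markup_differences = [
    line for line in difflib.ndiff(english_tags, spanish_tags)
    if line.startswith(("- ", "+ "))
]

for line in markup_differences:
    print(line)
```

Entries starting with "- " occur only in the English markup, those
starting with "+ " only in the Spanish one.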

If you are lucky and the differences are rather small and regular, you
can then try to modify the markup in the Spanish pages to be more
similar to the English ones,
again possibly using BeautifulSoup or even re.sub(...)
(saving the modified sources as new files in some other directory, of course).
(The opposite - taking the English markup and filling it with the
Spanish text - would be more tricky, I guess.)
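
A rough sketch of the re.sub() route, under some loud assumptions: the
two substitution rules, the function names, and the UTF-8 encoding are
all invented for illustration; the real rules would have to come from
the differences actually found between the pages.

```python
import re
from pathlib import Path

def normalize_markup(html_source):
    """Apply a few regular substitutions (purely illustrative rules)."""
    # Assumption: <font ...> wrappers appear only in the Spanish pages.
    html_source = re.sub(r"</?font\b[^>]*>", "", html_source)
    # Assumption: the English pages use <div class="main"> for bare <div>.
    html_source = re.sub(r"<div>", '<div class="main">', html_source)
    return html_source

def convert_directory(source_dir, target_dir):
    """Write modified copies of all *.html files into target_dir.

    The original files are left untouched. UTF-8 is assumed here;
    older pages may well use another encoding such as latin-1.
    """
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for path in sorted(Path(source_dir).glob("*.html")):
        text = path.read_text(encoding="utf-8")
        (target / path.name).write_text(normalize_markup(text), encoding="utf-8")
```

It could then be called as, e.g., convert_directory("spanish",
"spanish_modified"), where both directory names are hypothetical.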

However, all that is likely to help with only part of the task,
which will almost certainly still require more or less "manual" work.
Someone more experienced can probably propose a more effective
approach...

hth,
  vbr


