[Tutor] Please help replace Long place-names with short place-names in many files with Python

mc UJI mcptrad at gmail.com
Sun Sep 1 22:19:19 CEST 2013


Dear Pythonistas,

I am totally new to Python. This means I know the basics. And by basics I
mean the very, very basics.

I have a problem with which I need help.

In short, I need to:

a) Open many files (in a dir) with an .html extension
b) Find long place-names (Austria)
c) Replace them with short place-names (AT)
d) Only in the context of the xml tags (<birth_place country=".*?"> and
<constituency country=".*?"/>)

At length:

I have many xml files, each containing one day of speeches at the European
Parliament. Each file has some xml labels for the session metadata
and then the speakers' (MPs') interventions. These interventions consist of
my metadata and text. I include here a sample of two speeches (Please pay
attention to xml labels <birth_place country=".*?"> and <constituency
country=".*?"/> ):

****************************************************************************************
                                       SAMPLE OF INTERVENTIONS
<intervention id='in12'>
<speaker>
<name>Knapman, Roger</name>
<birth_date>19440220</birth_date>
<birth_place country="United Kingdom">Crediton</birth_place>
<status>NA</status>
<gender>male</gender>
<institution>
<io>
<eu body="EP"/>
</io>
</institution>
<constituency country="United Kingdom"/>
<affiliation>
<national_party>UK Independence Party</national_party>
<ep group="IND-DEM"/>
</affiliation>
<post>on behalf of the group</post>
</speaker>
<speech id='sp15' language="EN">
<p id='pa108'><s id='se408'>Mr President, Mr Juncker's speech was made with
all the passion that a civil servant is likely to raise.</s></p>
<p id='pa109'><s id='se409'>Mr Juncker, you say that the Stability and
Growth Pact will be your top priority, but your past statements serve to
illustrate only the inconsistencies.</s> <s id='se410'>Whilst I acknowledge
that you played a key role in negotiating the pact's original rules, you
recently said that the credibility of the pact had been buried and that the
pact was dead.</s> <s id='se411'>Is that still your opinion?</s></p>
<p id='pa110'><s id='se412'>You also said that you have a window of
opportunity to cut a quick deal on the EU budget, including the British
rebate of some EUR 4 billion a year.</s> <s id='se413'>Is that so, Mr
Juncker?</s> <s id='se414'>The rebate took <italics>five years</italics> to
negotiate.</s> <s id='se415'>If your comments are true and you can cut a
deal by June, then Mr Blair must have agreed in principle to surrender the
rebate.</s> <s id='se416'>Is that the case?</s> <s id='se417'>With whom in
the British Government precisely are you negotiating?</s> <s
id='se418'>Will the British electorate know about this at the time of the
British general election, probably in May?</s></p>
<p id='pa111'><s id='se419'>Finally, the UK Independence Party, and in
particular my colleague Mr Farage, has drawn attention to the criminal
activities of more than one Commissioner.</s> <s id='se420'>More details
will follow shortly and regularly.</s> <s id='se421'>Are you to be tainted
by association with them, or will you be expressing your concerns and the
pressing need for change?</s></p>
</speech>
</intervention>

<intervention id='in13'>
<speaker>
<name>Angelilli, Roberta</name>
<birth_date>19650201</birth_date>
<birth_place country="Italy">Roma</birth_place>
<status>NA</status>
<gender>female</gender>
<institution>
<io>
<eu body="EP"/>
</io>
</institution>
<constituency country="Italy"/>
<affiliation>
<national_party>Alleanza nazionale</national_party>
<ep group="UEN"/>
</affiliation>
<post>on behalf of the group</post>
</speaker>
<speech id='sp16' language="IT">
<p id='pa112'><s id='se422'>Mr President, the Luxembourg Presidency’s
programme is packed with crucial issues for the future of Europe, including
the priorities on the economic front: the Lisbon strategy, reform of the
Stability Pact and approval of the financial perspective up to 2013.</s></p>
<p id='pa113'><s id='se423'>My first point is that it will soon be time for
the mid-term review of the level of implementation of the Lisbon
strategy.</s> <s id='se424'>To give it a greater chance of success, the
programme needs to make the individual Member States responsible for
achieving the targets that were set.</s> <s id='se425'>To that end, I
consider the proposal to specify an individual at national level to be
responsible for putting the strategy into practice to be a very useful
idea.</s></p>
<p id='pa114'><s id='se426'>Secondly, with regard to the review of the
Stability Pact, it has also been emphasised this morning that a reform is
needed which can propose a more flexible interpretation of the Pact during
times of recession, without bypassing the Maastricht criteria and without
giving up the commitment to reduce the debt.</s> <s id='se427'>I am also
convinced that steps could be taken to exclude certain specific types of
investment from the calculation of the deficit in order to give a new boost
to Europe’s growth and competitiveness.</s></p>
<p id='pa115'><s id='se428'>Thirdly, I hope that we can really succeed in
approving the financial perspective up to 2013 by June, so that the
resources can be used to the full from the very beginning of the period in
question.</s> <s id='se429'>I especially hope that the proposals – the
Council’s and the Commission’s proposals on those important topics – are
adequately discussed in advance by Parliament which, let us recall, is the
only European institution that directly represents the sovereignty of the
people.</s></p>
<p id='pa116'><s id='se430'>Lastly, I hope that a European civil protection
agency will at last be set up during the Luxembourg Presidency so that
natural disasters can be dealt with in an appropriate manner, with
particular emphasis on prevention.</s></p>
</speech>
</intervention>

                              END OF SAMPLE OF INTERVENTIONS
*************************************************************************************


Now, as you see, the labels

<birth_place country=".*?"> and <constituency country=".*?"/>

have long place-names. For instance:

<birth_place country="United Kingdom"> and <constituency country="United
Kingdom"/>

But I would like short place-names (GB instead of United Kingdom, for
instance)
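
A minimal sketch of what such a tag-restricted replacement might look like with a regular expression. The names SHORT_NAMES, COUNTRY_ATTR and shorten_countries are made up for illustration, only two countries are filled in, and it assumes that country is the first attribute of both tags, as in the sample above:

```python
import re

# Illustrative subset only; the real dictionary would hold every country.
SHORT_NAMES = {"United Kingdom": "GB", "Italy": "IT"}

# Match country="..." only when it directly follows <birth_place or
# <constituency, so the same words in the speech text are never touched.
COUNTRY_ATTR = re.compile(
    r'(<(?:birth_place|constituency)\s+country=")([^"]*)(")'
)

def _shorten(match):
    # Collapse runs of whitespace (including a line break inside the value).
    long_name = " ".join(match.group(2).split())
    # Names missing from the dictionary are left exactly as they were.
    short = SHORT_NAMES.get(long_name, match.group(2))
    return match.group(1) + short + match.group(3)

def shorten_countries(text):
    """Return text with known long country names replaced in the two tags."""
    return COUNTRY_ATTR.sub(_shorten, text)

sample = ('<birth_place country="United Kingdom">Crediton</birth_place>\n'
          '<constituency country="United Kingdom"/>')
print(shorten_countries(sample))
```

Because the pattern is anchored on the tag name, a sentence such as "the United Kingdom voted" inside a <s> element would stay unchanged.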

The long names I have are those of all the member states of the European Union.

************************************************************************************
LIST OF LONG PLACE-NAMES AND EQUIVALENT SHORT PLACE-NAMES

Austria = AT
Belgium = BE
Bulgaria = BG
Croatia = HR
Cyprus = CY
Czech Republic = CS
Denmark = DK
Estonia = EE
Finland = FI
France = FR
Germany = DE
Greece = GR
Hungary = HU
Ireland = IE
Italy = IT
Latvia = LV
Lithuania = LT
Luxembourg = LU
Malta = MT
Netherlands = NL
Poland = PL
Portugal = PT
Romania = RO
Slovakia = SK
Slovenia = SI
Spain = ES
Sweden = SE
United Kingdom = GB

*************************************************************************************
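
A list in this "Long = SHORT" form could probably be turned into a Python dictionary without retyping it. The following is only a sketch: parse_mapping and MAPPING_TEXT are hypothetical names, and only three of the lines are pasted in to keep it short.

```python
# A few lines copied from the list above; the full list would go here.
MAPPING_TEXT = """
Austria = AT
Czech Republic = CS
United Kingdom = GB
"""

def parse_mapping(text):
    """Build {long_name: short_name} from lines shaped like 'Austria = AT'."""
    mapping = {}
    for line in text.splitlines():
        if "=" not in line:
            continue  # skip blank lines and anything that is not a pair
        long_name, short_name = line.split("=", 1)
        mapping[long_name.strip()] = short_name.strip()
    return mapping

short_names = parse_mapping(MAPPING_TEXT)
print(short_names["Czech Republic"])
```

Keeping the pairs in one block of text means the list can be corrected in a single place if a code turns out to be wrong.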

TO SUM UP

I am in despair at this point. Is there a way to use Python (dictionaries
and regular expressions, or whatever is suitable) to:

a) Open many files with an .html extension
b) Find long place-names (Austria)
c) Replace them with short place-names (AT)
d) Only in the context of the xml tags mentioned above.
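
For step a), one possible shape for the file handling is sketched below. It is an assumption-laden outline, not a tested solution for these files: rewrite_files is a made-up name, the files are assumed to be UTF-8, and transform stands for whatever function does the actual replacement on a string.

```python
from pathlib import Path

def rewrite_files(directory, transform, pattern="*.html", encoding="utf-8"):
    """Apply transform (str -> str) to each matching file in directory.

    Returns the list of paths whose content actually changed.
    The UTF-8 default is an assumption about how the files are stored.
    """
    changed = []
    for path in sorted(Path(directory).glob(pattern)):
        # Read and write bytes so the original line endings are preserved.
        original = path.read_bytes().decode(encoding)
        updated = transform(original)
        if updated != original:
            path.write_bytes(updated.encode(encoding))
            changed.append(path)
    return changed

if __name__ == "__main__":
    # Hypothetical usage: "speeches" is a placeholder directory name, and the
    # lambda is a stand-in for a real tag-aware replacement function.
    for path in rewrite_files("speeches", lambda s: s.replace(
            '<constituency country="Austria"/>',
            '<constituency country="AT"/>')):
        print("updated", path)
```

Since the files are overwritten in place, it would be wise to run this on a copy of the directory first.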

Please, I NEED YOUR HELP

Many thanks for your patience.

María