[Tutor] Please help replace Long place-names with short place-names in many files with Python

Joel Goldstick joel.goldstick at gmail.com
Mon Sep 2 01:25:38 CEST 2013


On Sun, Sep 1, 2013 at 4:19 PM, mc UJI <mcptrad at gmail.com> wrote:
> Dear Pythonistas,
>
> I am totally new to Python. This means I know the basics. And by basics I
> mean the very, very basics.
>
> I have a problem with which I need help.
>
> In short, I need to:
>
> a) Open many files (in a dir) with an .html extension

Do you know how to do this?  It's not hard.  Google 'python reading all
files in a directory' to get started.
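
As a rough starting point, something like the sketch below would collect
every .html file in one directory and read each into a string.  The
function name read_html_files and the UTF-8 encoding are my assumptions,
not something from your description, so adjust them to your files.

```python
import glob
import os


def read_html_files(directory):
    """Return a dict mapping each .html path in `directory` to its text."""
    contents = {}
    # glob expands the wildcard; sorted() just makes the order predictable
    for path in sorted(glob.glob(os.path.join(directory, "*.html"))):
        # Assumption: the files are UTF-8 encoded
        with open(path, encoding="utf-8") as f:
            contents[path] = f.read()
    return contents
```

Files with any other extension in that directory are simply skipped.
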
> b) Find long place-names (Austria)
> c) Replace them by short place-names (AT)

In Python, strings are objects and they have a replace method  (look
here:  http://docs.python.org/2/library/string.html#string.replace)

If you figure out how to read each file into a string (see above), you
just call the string's replace method once for each long-short country
pair.  Then write the result back to the file.
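
A minimal sketch of that idea, assuming UTF-8 files; COUNTRY_CODES,
shorten_countries and rewrite_file are names I made up for illustration,
and the dictionary holds only a few pairs from your list.  Note that a
plain replace changes every occurrence in the file, including any inside
the speech text.

```python
# Hypothetical subset of the long-name -> short-name list from the post
COUNTRY_CODES = {"Austria": "AT", "Italy": "IT", "United Kingdom": "GB"}


def shorten_countries(text, codes=COUNTRY_CODES):
    """Replace every occurrence of each long name with its short name."""
    for long_name, short_name in codes.items():
        text = text.replace(long_name, short_name)
    return text


def rewrite_file(path, codes=COUNTRY_CODES):
    """Read one file, replace the names, and write it back in place."""
    # Assumption: the file is UTF-8 encoded
    with open(path, encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(shorten_countries(text, codes))
```

Work on a copy of your files while testing, since rewrite_file overwrites
the original.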

Try first to write this code.  If you apply yourself, it will take you
maybe an hour to get something working.
Then come back with your code and tell us what went wrong.


> d) In the context of the xml tags (<birth_place country=".*?"> and
> <constituency country=".*?"/>)
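
To limit the change to those two tags, a regular expression with a
replacement function is one option.  This is only a sketch: the pattern
assumes the attribute is written exactly as country="..." with double
quotes, directly after the tag name, and COUNTRY_CODES, TAG_PATTERN and
shorten_in_tags are names I invented, with only a few pairs from your list.

```python
import re

# Hypothetical subset of the long-name -> short-name list from the post
COUNTRY_CODES = {"Austria": "AT", "Italy": "IT", "United Kingdom": "GB"}

# Group 1: tag start up to the opening quote; group 2: the country value;
# group 3: the closing quote
TAG_PATTERN = re.compile(r'(<(?:birth_place|constituency) country=")([^"]*)(")')


def shorten_in_tags(text, codes=COUNTRY_CODES):
    """Replace long country names only inside the two country attributes."""
    def swap(match):
        name = match.group(2)
        # Names missing from the dictionary are left unchanged
        return match.group(1) + codes.get(name, name) + match.group(3)
    return TAG_PATTERN.sub(swap, text)
```

Country names that appear in the speech text itself are not touched,
because the pattern only matches inside the two opening tags.
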
>
> At length:
>
> I have many xml files, each containing a day of speeches at the European
> Parliament. Each file has some xml labels for the session metadata
> and then the speakers' (MPs') interventions. These interventions consist of my
> metadata and text. I include here a sample of two speeches (please pay
> attention to xml labels <birth_place country=".*?"> and <constituency
> country=".*?"/>):
>
> ****************************************************************************************
>                                        SAMPLE OF INTERVENTIONS
> <intervention id='in12'>
> <speaker>
> <name>Knapman, Roger</name>
> <birth_date>19440220</birth_date>
> <birth_place country="United Kingdom">Crediton</birth_place>
> <status>NA</status>
> <gender>male</gender>
> <institution>
> <io>
> <eu body="EP"/>
> </io>
> </institution>
> <constituency country="United Kingdom"/>
> <affiliation>
> <national_party>UK Independence Party</national_party>
> <ep group="IND-DEM"/>
> </affiliation>
> <post>on behalf of the group</post>
> </speaker>
> <speech id='sp15' language="EN">
> <p id='pa108'><s id='se408'>Mr President, Mr Juncker's speech was made with
> all the passion that a civil servant is likely to raise.</s></p>
> <p id='pa109'><s id='se409'>Mr Juncker, you say that the Stability and
> Growth Pact will be your top priority, but your past statements serve to
> illustrate only the inconsistencies.</s> <s id='se410'>Whilst I acknowledge
> that you played a key role in negotiating the pact's original rules, you
> recently said that the credibility of the pact had been buried and that the
> pact was dead.</s> <s id='se411'>Is that still your opinion?</s></p>
> <p id='pa110'><s id='se412'>You also said that you have a window of
> opportunity to cut a quick deal on the EU budget, including the British
> rebate of some EUR 4 billion a year.</s> <s id='se413'>Is that so, Mr
> Juncker?</s> <s id='se414'>The rebate took <italics>five years</italics> to
> negotiate.</s> <s id='se415'>If your comments are true and you can cut a
> deal by June, then Mr Blair must have agreed in principle to surrender the
> rebate.</s> <s id='se416'>Is that the case?</s> <s id='se417'>With whom in
> the British Government precisely are you negotiating?</s> <s id='se418'>Will
> the British electorate know about this at the time of the British general
> election, probably in May?</s></p>
> <p id='pa111'><s id='se419'>Finally, the UK Independence Party, and in
> particular my colleague Mr Farage, has drawn attention to the criminal
> activities of more than one Commissioner.</s> <s id='se420'>More details
> will follow shortly and regularly.</s> <s id='se421'>Are you to be tainted
> by association with them, or will you be expressing your concerns and the
> pressing need for change?</s></p>
> </speech>
> </intervention>
>
> <intervention id='in13'>
> <speaker>
> <name>Angelilli, Roberta</name>
> <birth_date>19650201</birth_date>
> <birth_place country="Italy">Roma</birth_place>
> <status>NA</status>
> <gender>female</gender>
> <institution>
> <io>
> <eu body="EP"/>
> </io>
> </institution>
> <constituency country="Italy"/>
> <affiliation>
> <national_party>Alleanza nazionale</national_party>
> <ep group="UEN"/>
> </affiliation>
> <post>on behalf of the group</post>
> </speaker>
> <speech id='sp16' language="IT">
> <p id='pa112'><s id='se422'>Mr President, the Luxembourg Presidency’s
> programme is packed with crucial issues for the future of Europe, including
> the priorities on the economic front: the Lisbon strategy, reform of the
> Stability Pact and approval of the financial perspective up to 2013.</s></p>
> <p id='pa113'><s id='se423'>My first point is that it will soon be time for
> the mid-term review of the level of implementation of the Lisbon
> strategy.</s> <s id='se424'>To give it a greater chance of success, the
> programme needs to make the individual Member States responsible for
> achieving the targets that were set.</s> <s id='se425'>To that end, I
> consider the proposal to specify an individual at national level to be
> responsible for putting the strategy into practice to be a very useful
> idea.</s></p>
> <p id='pa114'><s id='se426'>Secondly, with regard to the review of the
> Stability Pact, it has also been emphasised this morning that a reform is
> needed which can propose a more flexible interpretation of the Pact during
> times of recession, without bypassing the Maastricht criteria and without
> giving up the commitment to reduce the debt.</s> <s id='se427'>I am also
> convinced that steps could be taken to exclude certain specific types of
> investment from the calculation of the deficit in order to give a new boost
> to Europe’s growth and competitiveness.</s></p>
> <p id='pa115'><s id='se428'>Thirdly, I hope that we can really succeed in
> approving the financial perspective up to 2013 by June, so that the
> resources can be used to the full from the very beginning of the period in
> question.</s> <s id='se429'>I especially hope that the proposals – the
> Council’s and the Commission’s proposals on those important topics – are
> adequately discussed in advance by Parliament which, let us recall, is the
> only European institution that directly represents the sovereignty of the
> people.</s></p>
> <p id='pa116'><s id='se430'>Lastly, I hope that a European civil protection
> agency will at last be set up during the Luxembourg Presidency so that
> natural disasters can be dealt with in an appropriate manner, with
> particular emphasis on prevention.</s></p>
> </speech>
> </intervention>
>
>                               END OF SAMPLE OF INTERVENTIONS
> *************************************************************************************
>
> Now, as you see, the labels
>
> <birth_place country=".*?"> and <constituency country=".*?"/>
>
> have long place-names. For instance:
>
> <birth_place country="United Kingdom"> and <constituency country="United
> Kingdom"/>
>
> But I would like short place-names (UK instead of United Kingdom, for
> instance)
>
> The long-names I have are all the members of the European Union.
>
> ************************************************************************************
> LIST OF LONG PLACE-NAMES AND EQUIVALENT SHORT PLACE-NAMES
>
> Austria = AT
> Belgium = BE
> Bulgaria = BG
> Croatia = HR
> Cyprus = CY
> Czech Republic = CS
> Denmark = DK
> Estonia = EE
> Finland = FI
> France = FR
> Germany = DE
> Greece = GR
> Hungary = HU
> Ireland = IE
> Italy = IT
> Latvia = LV
> Lithuania = LT
> Luxembourg = LU
> Malta = MT
> Netherlands = NL
> Poland = PL
> Portugal = PT
> Romania = RO
> Slovakia = SK
> Slovenia = SI
> Spain = ES
> Sweden = SE
> United Kingdom = GB
>
> *************************************************************************************
>
> TO SUM UP
>
> I am in despair at this point. Is there a way to use Python (dictionaries
> and regular expressions or whatever is suitable) to:
>
> a) Open many files with an .html extension
> b) Find long place-names (Austria)
> c) Replace them by short place-names (AT)
> d) In the context of the xml tags mentioned above.
>
> Please, I NEED YOUR HELP
>
> Many thanks for your patience.
>
> María
>



-- 
Joel Goldstick
http://joelgoldstick.com

