Hello Spir, Alan, and Paul,<br><br>Thank you for your help. I have been working on the file, but I still have a problem getting it to do what I want. As a reminder, this is what I have:<br><br><br>#!/usr/bin/python<br>import sys<br><br>tags = {<br>'noun-prop': 'noun_prop null null'.split(),<br>
'case_def_gen': 'case_def gen null'.split(),<br>'dem_pron_f': 'dem_pron f null'.split(),<br>'case_def_acc': 'case_def acc null'.split(),<br>}<br><br><br>TAB = '\t'<br>
<br><br>def newlyTaggedWord(line):<br> (word,tag) = line.rstrip('\n').split(TAB) # drop the newline, then separate word and tag<br> new_tags = tags[tag] # read in dict<br> tagging = TAB.join(new_tags) # join with TABs<br>
 return word + TAB + tagging # formatted result<br><br>def replaceTagging(source_name, target_name):<br> target_file = open(target_name, "w")<br> # replacement loop<br> for line in open(source_name, "r"):<br>
  new_line = newlyTaggedWord(line) + '\n'<br>  target_file.write(new_line)<br> <br> target_file.close()<br><br>if __name__ == "__main__":<br> source_name = sys.argv[1]<br> target_name = sys.argv[2]<br>
 replaceTagging(source_name, target_name)<br><br><br><br><div class="gmail_quote">On Mon, May 4, 2009 at 12:38 PM, <span dir="ltr"><<a href="mailto:tutor-request@python.org">tutor-request@python.org</a>></span> wrote:<br>
<blockquote class="gmail_quote" style="border-left: 1px solid rgb(204, 204, 204); margin: 0pt 0pt 0pt 0.8ex; padding-left: 1ex;">
Today's Topics:<br>
<br>
1. Re: Iterating over a long list with regular expressions and<br>
changing each item? (Paul McGuire)<br>
2. Advanced String Search using operators AND, OR etc.. (Alex Feddor)<br>
3. Re: Encode problem (Pablo P. F. de Faria)<br>
4. Re: Encode problem (Pablo P. F. de Faria)<br>
5. Re: Advanced String Search using operators AND, OR etc..<br>
(vince spicer)<br>
<br>
<br>
----------------------------------------------------------------------<br>
<br>
Message: 1<br>
Date: Mon, 4 May 2009 11:17:53 -0500<br>
From: "Paul McGuire" <<a href="mailto:ptmcg@austin.rr.com">ptmcg@austin.rr.com</a>><br>
Subject: Re: [Tutor] Iterating over a long list with regular<br>
expressions and changing each item?<br>
To: <<a href="mailto:tutor@python.org">tutor@python.org</a>><br>
Message-ID: <99B447F3C7EF4996AA2ED683F1EE6DB6@AWA2><br>
Content-Type: text/plain; charset="us-ascii"<br>
<br>
Original:<br>
'case_def_gen':['case_def','gen','null'],<br>
'nsuff_fem_pl':['nsuff','null', 'null'],<br>
'abbrev': ['abbrev, null, null'],<br>
'adj': ['adj, null, null'],<br>
'adv': ['adv, null, null'],}<br>
<br>
Note the values for 'abbrev', 'adj' and 'adv' are not lists, but strings<br>
containing comma-separated lists.<br>
<br>
Should be:<br>
'case_def_gen':['case_def','gen','null'],<br>
'nsuff_fem_pl':['nsuff','null', 'null'],<br>
'abbrev': ['abbrev', 'null', 'null'],<br>
'adj': ['adj', 'null', 'null'],<br>
'adv': ['adv', 'null', 'null'],}<br>
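<br>
A minimal sketch of why the difference matters, assuming the values are later joined with tabs as in the original script (the names wrong and right are only for illustration):<br>

```python
TAB = "\t"

# One string that merely contains commas: a list with a single element.
wrong = ['abbrev, null, null']
# Three separate strings: the form the tab-joining code expects.
right = ['abbrev', 'null', 'null']

# Joining a one-element list leaves the commas in place and adds no tabs.
print(len(wrong), repr(TAB.join(wrong)))
# Joining three elements produces the tab-separated subtags.
print(len(right), repr(TAB.join(right)))
```

With the first form the output line still contains commas and no tabs, which is easy to miss when reading the file by eye.<br>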
<br>
For much of my own code, I find lists of string literals to be tedious to<br>
enter, and easy to drop a ' character. This style is a little easier on the<br>
eyes, and harder to screw up.<br>
<br>
'case_def_gen':'case_def gen null'.split(),<br>
'nsuff_fem_pl':'nsuff null null'.split(),<br>
'abbrev': 'abbrev null null'.split(),<br>
'adj': 'adj null null'.split(),<br>
'adv': 'adv null null'.split(),}<br>
<br>
Since all that your code does at runtime with the value strings is<br>
"\t".join() them, then you might as well initialize the dict with these<br>
computed values, for at least some small gain in runtime performance:<br>
<br>
T = lambda s : "\t".join(s.split())<br>
'case_def_gen' : T('case_def gen null'),<br>
'nsuff_fem_pl' : T('nsuff null null'),<br>
'abbrev' : T('abbrev null null'),<br>
'adj' : T('adj null null'),<br>
'adv' : T('adv null null'),}<br>
del T<br>
<br>
(Yes, I know PEP8 says *not* to add spaces to line up assignments or other<br>
related values, but I think there are isolated cases where it does help to<br>
see what's going on. You could even write this as:<br>
<br>
T = lambda s : "\t".join(s.split())<br>
'case_def_gen' : T('case_def gen null'),<br>
'nsuff_fem_pl' : T('nsuff null null'),<br>
'abbrev' : T('abbrev null null'),<br>
'adj' : T('adj null null'),<br>
'adv' : T('adv null null'),}<br>
del T<br>
<br>
and the extra spaces help you to see the individual subtags more easily,<br>
with no change in the resulting values since split() splits on multiple<br>
whitespace the same as a single space.)<br>
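<br>
As a rough sketch of the precomputed-dict idea, assuming input lines of the form word TAB tag: the helper retag and the sample word "kitab" are made up for illustration, and T is written as a small function rather than a lambda.<br>

```python
def T(s):
    """Split on any run of whitespace, then re-join with single tabs."""
    return "\t".join(s.split())

# Extra spaces are only there to line the subtags up; split() ignores them.
tags = {
    'case_def_gen': T('case_def gen   null'),
    'nsuff_fem_pl': T('nsuff    null  null'),
    'abbrev':       T('abbrev   null  null'),
}

def retag(line):
    """Replace the single tag on a 'word TAB tag' line with its three subtags."""
    word, tag = line.rstrip("\n").split("\t")
    return word + "\t" + tags[tag]

print(repr(retag("kitab\tcase_def_gen\n")))
```

Because the values are already tab-joined strings, the per-line work is reduced to one dictionary lookup and one concatenation.<br>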
<br>
Of course you could simply code as:<br>
<br>
'case_def_gen' : 'case_def\tgen\tnull',<br>
'nsuff_fem_pl' : 'nsuff\tnull\tnull',<br>
'abbrev' : 'abbrev\tnull\tnull',<br>
'adj' : 'adj\tnull\tnull',<br>
'adv' : 'adv\tnull\tnull',}<br>
<br>
But I think readability definitely suffers here, so I would probably go with<br>
the penultimate version.<br>
<br>
-- Paul<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 2<br>
Date: Mon, 4 May 2009 14:45:06 +0200<br>
From: Alex Feddor <<a href="mailto:alex.feddor@gmail.com">alex.feddor@gmail.com</a>><br>
Subject: [Tutor] Advanced String Search using operators AND, OR etc..<br>
To: <a href="mailto:tutor@python.org">tutor@python.org</a><br>
Message-ID:<br>
<<a href="mailto:5bf184e30905040545i78bc75b8ic78eabf44a55aa20@mail.gmail.com">5bf184e30905040545i78bc75b8ic78eabf44a55aa20@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="iso-8859-1"<br>
<br>
Hi<br>
<br>
I am looking for a method that enables advanced text string searches. Neither<br>
string.find() nor the re module seems to support what I am looking for. The<br>
idea is as follows:<br>
<br>
Text ="FDA meeting was successful. New drug is approved for whole sale<br>
distribution!"<br>
<br>
I would like to scan the text using AND and OR operators and get -1 or<br>
another value if the search terms are not found in the text.<br>
Example 01:<br>
search criteria: "FDA" AND ( "approve*" OR "supported")<br>
The catch is that in the Text variable the words FDA and approve are not next to<br>
one another (other words are in between).<br>
Example 02:<br>
search criteria: "Ben"<br>
The catch is that the code should find only the exact word Ben, not words<br>
whose first three letters are Ben, such as Benquick, Benseek etc. Only Ben is<br>
the right word we are looking for.<br>
<br>
I would really appreciate your advice, a code sample, or links showing how the above can<br>
be achieved! If possible, I would appreciate a solution that uses a free-of-charge<br>
module.<br>
<br>
Cheers, Alex<br>
PS:<br>
A few months ago I discovered Python. I am amazed at everything that can be done<br>
with it. Really cool programming language.<br>
<br>
------------------------------<br>
<br>
Message: 3<br>
Date: Mon, 4 May 2009 11:09:25 -0300<br>
From: "Pablo P. F. de Faria" <<a href="mailto:pablofaria@gmail.com">pablofaria@gmail.com</a>><br>
Subject: Re: [Tutor] Encode problem<br>
To: Kent Johnson <<a href="mailto:kent37@tds.net">kent37@tds.net</a>><br>
Cc: *tutor python <<a href="mailto:tutor@python.org">tutor@python.org</a>><br>
Message-ID:<br>
<<a href="mailto:3ea81d4c0905040709m78a45d11j2037943380817297@mail.gmail.com">3ea81d4c0905040709m78a45d11j2037943380817297@mail.gmail.com</a>><br>
Content-Type: text/plain; charset=ISO-8859-1<br>
<br>
Thanks, Kent, but that doesn't solve my problem. In fact, I need<br>
ConfigParser to work with non-ASCII characters, since my App may run<br>
in "latin-1" environments (folder and file names). I must find out why<br>
the str() function in the module ConfigParser doesn't use the encoding<br>
defined for the application (# -*- coding: utf-8 -*-). The rest of the<br>
application works properly with utf-8, except for ConfigParser. What I<br>
found out is that ConfigParser seems to make use of the configuration<br>
in site.py (which is set to 'ascii'), instead of the configuration<br>
defined for the App (if I change . But it is very problematic to<br>
have to change site.py on every computer... So I wonder if there is a<br>
way to override the settings in site.py only for my App.<br>
<br>
2009/5/1 Kent Johnson <<a href="mailto:kent37@tds.net">kent37@tds.net</a>>:<br>
> On Fri, May 1, 2009 at 4:54 PM, Pablo P. F. de Faria<br>
> <<a href="mailto:pablofaria@gmail.com">pablofaria@gmail.com</a>> wrote:<br>
>> Hi, Kent.<br>
>><br>
>> The stack trace is:<br>
>><br>
>> Traceback (most recent call last):<br>
>>   File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in OnClose<br>
>>     self.SavePreferences()<br>
>>   File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1068,<br>
>> in SavePreferences<br>
>>     self.cfg.set(u'File Settings',u'Recent files',<br>
>> unicode(",".join(self.recent_files)))<br>
>> UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position<br>
>> 12: ordinal not in range(128)<br>
>><br>
>> The "unicode" function actually doesn't make any difference... The<br>
>> content of the string being saved is "/home/pablo/Área de<br>
>> Trabalho/teste.xml".<br>
><br>
> OK, this error is in your code, not the ConfigParser. The problem is with<br>
> ",".join(self.recent_files)<br>
><br>
> Are the entries in self.recent_files unicode strings? If so, then I<br>
> think the join is trying to convert to a string using the default<br>
> codec. Try<br>
><br>
> self.cfg.set('File Settings','Recent files',<br>
> ','.join(name.encode('utf-8') for name in self.recent_files))<br>
><br>
> Looking at the ConfigParser.write() code, it wants the values to be<br>
> strings or convertible to strings by calling str(), so non-ascii<br>
> unicode values will be a problem there. I would use plain strings for<br>
> all the interaction with ConfigParser and convert to Unicode yourself.<br>
><br>
> Kent<br>
><br>
> PS Please Reply All to reply to the list.<br>
><br>
<br>
<br>
<br>
--<br>
---------------------------------<br>
"We are all in the gutter, but some of us are looking at the stars."<br>
(Oscar Wilde)<br>
---------------------------------<br>
Pablo Faria<br>
Master's student in Language Acquisition - IEL/Unicamp<br>
FAPESP technical fellow on the project Padrões Rítmicos e Mudança Lingüística<br>
(19) 3521-1570<br>
<a href="http://www.tycho.iel.unicamp.br/%7Epablofaria/" target="_blank">http://www.tycho.iel.unicamp.br/~pablofaria/</a><br>
<a href="mailto:pablofaria@gmail.com">pablofaria@gmail.com</a><br>
<br>
<br>
------------------------------<br>
<br>
Message: 4<br>
Date: Mon, 4 May 2009 11:11:58 -0300<br>
From: "Pablo P. F. de Faria" <<a href="mailto:pablofaria@gmail.com">pablofaria@gmail.com</a>><br>
Subject: Re: [Tutor] Encode problem<br>
To: Kent Johnson <<a href="mailto:kent37@tds.net">kent37@tds.net</a>><br>
Cc: *tutor python <<a href="mailto:tutor@python.org">tutor@python.org</a>><br>
Message-ID:<br>
<<a href="mailto:3ea81d4c0905040711p62376925n26fb93a8955fefe4@mail.gmail.com">3ea81d4c0905040711p62376925n26fb93a8955fefe4@mail.gmail.com</a>><br>
Content-Type: text/plain; charset=ISO-8859-1<br>
<br>
Here is the traceback, after the last change you suggested:<br>
<br>
Traceback (most recent call last):<br>
File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1057, in OnClose<br>
self.SavePreferences()<br>
File "/home/pablo/workspace/E-Dictor/src/MainFrame.py", line 1069,<br>
in SavePreferences<br>
self.cfg.write(codecs.open(self.properties_file,'w','utf-8'))<br>
File "/usr/lib/python2.5/ConfigParser.py", line 373, in write<br>
(key, str(value).replace('\n', '\n\t')))<br>
File "/usr/lib/python2.5/codecs.py", line 638, in write<br>
return self.writer.write(data)<br>
File "/usr/lib/python2.5/codecs.py", line 303, in write<br>
data, consumed = self.encode(object, self.errors)<br>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position<br>
27: ordinal not in range(128)<br>
<br>
So, in "str(value)" the content is a folder name with an accented character (Á).<br>
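<br>
The failure itself can be shown in isolation. This is only a sketch of the underlying decode error, not a reproduction of ConfigParser under Python 2.5. It assumes the path was stored as UTF-8 bytes, which is consistent with the byte 0xc3 reported at position 12 in the first traceback.<br>

```python
# The path from the traceback, written with an escape so the source stays ASCII.
path = u"/home/pablo/\u00c1rea de Trabalho/teste.xml"
raw = path.encode("utf-8")          # the accented A becomes the bytes 0xC3 0x81

try:
    raw.decode("ascii")             # what an implicit ASCII conversion attempts
except UnicodeDecodeError as exc:
    failed_at = exc.start           # index of the first byte ASCII cannot handle
    print("ascii decode failed at byte %d" % failed_at)

# Decoding with the codec the bytes were written in round-trips cleanly.
assert raw.decode("utf-8") == path
```

The general point is that the bytes are fine; the error comes from something converting them with the 'ascii' codec instead of 'utf-8'.<br>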
<br>
<br>
<br>
<br>
<br>
<br>
------------------------------<br>
<br>
Message: 5<br>
Date: Mon, 4 May 2009 10:38:31 -0600<br>
From: vince spicer <<a href="mailto:vinces1979@gmail.com">vinces1979@gmail.com</a>><br>
Subject: Re: [Tutor] Advanced String Search using operators AND, OR<br>
etc..<br>
To: Alex Feddor <<a href="mailto:alex.feddor@gmail.com">alex.feddor@gmail.com</a>><br>
Cc: <a href="mailto:tutor@python.org">tutor@python.org</a><br>
Message-ID:<br>
<<a href="mailto:1e53c510905040938q25d787f3w17f7a18f65bd0410@mail.gmail.com">1e53c510905040938q25d787f3w17f7a18f65bd0410@mail.gmail.com</a>><br>
Content-Type: text/plain; charset="iso-8859-1"<br>
<br>
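Advanced string searches can be done with regular expressions via the re module.<br>
<br>
EX:<br>
<br>
import re<br>
<br>
# FDA followed later by approve*/supported, or the exact word Ben<br>
m = re.compile(r"\bFDA\b.*?\b(approve\w*|supported)\b|\bBen\b", re.DOTALL)<br>
<br>
match = m.search(Text)<br>
if match:<br>
 print(match.group())<br>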
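<br>
A single pattern only finds the terms in one fixed order. One possible alternative, sketched here under the assumption that each term can be tested on its own, is to combine the results with Python's own and/or. The helper name has_term is invented for this example, and a trailing * is treated as "any word ending".<br>

```python
import re

def has_term(text, term):
    """True if term occurs as a whole word; a trailing * allows any word ending."""
    if term.endswith("*"):
        pattern = r"\b" + re.escape(term[:-1]) + r"\w*"
    else:
        pattern = r"\b" + re.escape(term) + r"\b"
    return re.search(pattern, text) is not None

Text = ("FDA meeting was successful. New drug is approved for whole sale "
        "distribution!")

# "FDA" AND ("approve*" OR "supported"), independent of word order
example_01 = has_term(Text, "FDA") and (
    has_term(Text, "approve*") or has_term(Text, "supported"))

# Exact word only: "Ben" must not match inside "Benquick" or "Benseek"
example_02 = has_term("Benquick and Benseek", "Ben")
example_03 = has_term("Ask Ben about it", "Ben")

print(example_01, example_02, example_03)
```

This returns True/False rather than -1; a position could be returned instead by keeping the match object from re.search.<br>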
<br>
<br>
Vince<br>
<br>
<br>
<br>
------------------------------<br>
<br>
_______________________________________________<br>
Tutor maillist - <a href="mailto:Tutor@python.org">Tutor@python.org</a><br>
<a href="http://mail.python.org/mailman/listinfo/tutor" target="_blank">http://mail.python.org/mailman/listinfo/tutor</a><br>
<br>
<br>
End of Tutor Digest, Vol 63, Issue 8<br>
************************************<br>
</blockquote></div><br>