Case tagging and python
aaa at bbb.it
Thu Jul 31 13:00:52 CEST 2008
I'm relatively new to programming in general, and totally new to Python,
and I've been told that this language is particularly good for what I
need to do. Let me explain.
I have a large corpus of English text, in the form of several files.
First of all I would like to scan each file. Then, for each word I find,
I'd like to examine its case status, and write the (lower case) word back
to another text file - with, appended, a tag stating the case it had in
the original file.
An example. Suppose we have three possible "case conditions":
-all lowercase
-all uppercase
-initial uppercase only
Three corresponding tags for each of these might be, respectively:
nocap, allcaps, cap
Therefore, given the string
"The Chairman of BP was asleep"
I would like to produce
"the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap"
and write this into a file.
I have the following algorithm in mind:
-open input file
-open output file
-get line of text
-split line into words
-for each word
-tag = checkCase(word)
-newword = lowercase(word) + append(tag)
-rejoin words into line
-write line into output file
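The steps above could be sketched roughly as follows. This is only a minimal sketch: the names tag_lines and demo_tagger are made up for illustration, the case test is passed in as a parameter because it is a separate question, and demo_tagger is a deliberately incomplete stand-in that knows only two of the three tags. In-memory streams are used instead of real files so the example runs on its own; with real files, the two streams would come from open(...).

```python
import io

def tag_lines(infile, outfile, check_case):
    """Copy infile to outfile, lowercasing each word and appending a case tag.

    check_case is any function that maps a word to a tag string.
    Returns the number of lines processed.
    """
    count = 0
    for line in infile:
        words = line.split()                      # split on whitespace
        tagged = [w.lower() + '/' + check_case(w) for w in words]
        outfile.write(' '.join(tagged) + '\n')    # rejoin and write
        count += 1
    return count

# Stand-in tagger for this demo only: it does not handle the 'cap' case.
def demo_tagger(word):
    return 'allcaps' if word.isupper() else 'nocap'

# In-memory streams stand in for files opened with open(...).
source = io.StringIO('of BP\nwas asleep\n')
target = io.StringIO()
n = tag_lines(source, target, demo_tagger)
print(n, repr(target.getvalue()))
```

Note that split() with no arguments collapses runs of whitespace, so the original spacing of each line is not preserved in the output.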
Now, I managed to write the following initial code:
    lines = 0
    for s in file:  # file: the input file, already opened
        lines += 1
        if lines % 1000 == 0:
            print('%d lines' % lines)  # print the number of lines read so far
        sent = s.split()  # split the line on whitespace
But then I don't quite know what would be the fastest/best way to do
this. Could I use the join function to rebuild the string? And, regarding
the casetest() function, what do you suggest? Should I test each
character of each word, or are there faster methods?
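One possible approach, sketched here under some assumptions, is to lean on the built-in string methods rather than looping over characters: str.isupper() checks a whole word at once, and ' '.join(...) rebuilds the line. The names check_case and tag_sentence are made up for illustration. The rules for borderline words are assumptions, not something the example settles: a one-letter word such as "I" comes out as allcaps, and a mixed-case word such as "McDonald" comes out as cap because only its first letter is examined.

```python
def check_case(word):
    """Return 'allcaps', 'cap' or 'nocap' for a single word."""
    if word.isupper():            # every cased character is uppercase
        return 'allcaps'
    if word[:1].isupper():        # only the first character is examined
        return 'cap'
    return 'nocap'                # also covers the empty string and digits

def tag_sentence(sentence):
    """Lowercase each word, append its tag, and rejoin with single spaces."""
    return ' '.join(w.lower() + '/' + check_case(w) for w in sentence.split())

print(tag_sentence('The Chairman of BP was asleep'))
# prints: the/cap chairman/cap of/nocap bp/allcaps was/nocap asleep/nocap
```

Punctuation stays attached to the word it touches (for example "BP," is tagged as a single token), so a real corpus may need a tokenizing step first.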
Thanks very much,