Simple Text Processing Help
patrick.waldo at gmail.com
Sun Oct 14 18:57:06 CEST 2007
Thank you both for helping me out. I am still rather new to Python
and so I'm probably trying to reinvent the wheel here.
When I try Paul's suggestion, I get
>>>tokens = line.strip().split()
So I am not quite sure how to read line by line.
tokens = input.read().split() gets me all the information from the
file. tokens[2:-1] = [u' '.join(tokens[2:-1])] works just fine, like
in the example; however, how can I loop this for the entire document?
Also, when I try output.write(tokens), I get "TypeError: coercing to
Unicode: need string or buffer, list found".
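
The TypeError above appears because write() expects a string, while tokens is a list; joining the list first avoids it. Iterating over the file object itself yields one line at a time, which answers the looping question. The following is only a minimal sketch in Python 3 syntax (the thread itself uses Python 2 with codecs.open). It assumes one record per line, uses io.StringIO objects as stand-ins for the real files, and format_line is a hypothetical helper name, not something from the thread.

```python
import io

def format_line(line):
    """Turn one whitespace-separated record into a '|'-separated string."""
    tokens = line.split()
    # Collapse everything between the CAS number and the formula into one name.
    tokens[2:-1] = [" ".join(tokens[2:-1])]
    return "|".join(tokens)

# Stand-ins for the real input and output files (assumption: one record per line).
source = io.StringIO(
    "200-720-7 69-93-2 kyselina močová C5H4N4O3\n"
    "200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na\n"
)
output = io.StringIO()

for line in source:            # iterating a file object yields one line at a time
    if not line.strip():       # skip blank lines
        continue
    # write() needs a string, so the token list is joined before writing
    output.write(format_line(line) + "\n")

result = output.getvalue()
print(result)
```

With real files, the two StringIO objects would be replaced by opened file objects using the UTF-8 encoding.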
On Oct 14, 4:25 pm, Paul Hankin <paul.han... at gmail.com> wrote:
> On Oct 14, 2:48 pm, patrick.wa... at gmail.com wrote:
> > Hi all,
> > I started Python just a little while ago and I am stuck on something
> > that is really simple, but I just can't figure out.
> > Essentially I need to take a text document with some chemical
> > information in Czech and organize it into another text file. The
> > information is always EINECS number, CAS, chemical name, and formula
> > in tables. I need to organize them into lines with | in between. So
> > it goes from:
> > 200-763-1 71-73-8
> > nátrium-tiopentál C11H18N2O2S.Na
> > to:
> > 200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na
> > but if I have a chemical like: kyselina močová
> > I get:
> > 200-720-7|69-93-2|kyselina|močová
> > |C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál
> > and then it is all off.
> > How can I get Python to realize that a chemical name may have a space
> > in it?
> In the original file, is every chemical on a line of its own? I assume
> it is here.
> You might use a regexp (look at the re module), or I think here you
> can use the fact that only chemicals have spaces in them. Then, you
> can split each line on whitespace (like you're doing), and join back
> together all the words between the 3rd (ie index 2) and the last (ie
> index -1) using tokens[2:-1] = [u' '.join(tokens[2:-1])]. This uses
> the somewhat unusual python syntax for replacing a section of a list
> with another list.
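
The slice assignment described in the paragraph above can be tried on a single record. This is a small illustration in Python 3 syntax, using the uric acid line from the original question as sample data; it is a sketch of the idea rather than the exact code from the thread.

```python
tokens = "200-720-7 69-93-2 kyselina močová C5H4N4O3".split()
# tokens has five items here because the chemical name contains a space.

name = " ".join(tokens[2:-1])   # everything between the CAS number and the formula
tokens[2:-1] = [name]           # replace those items with a single one; the list shrinks

joined = "|".join(tokens)
print(joined)
```

When the name is a single word, the slice holds one item and is replaced by the same item, so the record is unchanged.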
> The approach you took involves reading the whole file, and building a
> list of all the chemicals which you don't seem to use: I've changed it
> to a per-line version and removed the big lists.
> import codecs
> path = "c:\\text_samples\\chem_1_utf8.txt"
> path2 = "c:\\text_samples\\chem_2.txt"
> input = codecs.open(path, 'r', 'utf8')
> output = codecs.open(path2, 'w', 'utf8')
> for line in input:
>     # split on whitespace, then rejoin the words of the chemical name
>     tokens = line.strip().split()
>     tokens[2:-1] = [u' '.join(tokens[2:-1])]
>     chemical = u'|'.join(tokens)
>     print chemical + u'\n'
>     output.write(chemical + u'\r\n')
> input.close()
> output.close()
> Obviously, this isn't tested because I don't have your chem_1_utf8.txt
> Paul Hankin
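
The regexp route that Paul mentions could look roughly like the sketch below, written in Python 3 syntax. It assumes each record sits on one line and that only the chemical name may contain spaces; RECORD and to_pipes are hypothetical names chosen for this illustration.

```python
import re

# Two space-free fields, then a name that may contain spaces, then a space-free formula.
RECORD = re.compile(r"^(\S+)\s+(\S+)\s+(.+?)\s+(\S+)$")

def to_pipes(line):
    """Return the record joined with '|', or None if the line does not fit the pattern."""
    match = RECORD.match(line.strip())
    if match is None:
        return None
    return "|".join(match.groups())

print(to_pipes("200-720-7 69-93-2 kyselina močová C5H4N4O3"))
print(to_pipes("200-763-1 71-73-8 nátrium-tiopentál C11H18N2O2S.Na"))
```

Compared with the split-and-join version, the pattern rejects lines that have fewer than four fields instead of silently reshaping them, which may make malformed input easier to spot.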