# Simple Text Processing Help

patrick.waldo at gmail.com
Sun Oct 14 15:48:51 CEST 2007

Hi all,

I started learning Python just a little while ago, and I am stuck on
something that is really simple but that I just can't figure out.

Essentially I need to take a text document with some chemical
information in Czech and organize it into another text file.  The
information always comes in tables as EINECS number, CAS number,
chemical name, and formula.  I need to turn each entry into one line
with | between the fields.  So it goes from:

200-763-1                     71-73-8
nátrium-tiopentál           C11H18N2O2S.Na

to:

200-763-1|71-73-8|nátrium-tiopentál|C11H18N2O2S.Na

but if I have a chemical like: kyselina močová

I get:
200-720-7|69-93-2|kyselina|močová
|C5H4N4O3|200-763-1|71-73-8|nátrium-tiopentál

and then it is all off.
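
To illustrate what seems to be going wrong, here is a minimal sketch. The sample text is typed in by hand rather than read from the file, and the sketch uses Python 3 syntax. As far as I can tell, split() with no arguments breaks on every run of whitespace, so the space inside the name counts as a separator too:

```python
# Hand-typed sample entry (an assumption about how the table looks in the file).
row = "200-720-7                     69-93-2\nkyselina močová               C5H4N4O3"

# split() with no arguments splits on any whitespace, including the
# single space inside the chemical name.
words = row.split()

print(len(words))  # 5 words instead of the expected 4
print(words)
```

With five words instead of four, every later group of four is shifted by one position.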

How can I get Python to realize that a chemical name may have a space
in it?

Thank you,
Patrick

So far I have:

#take tables in one text file and organize them into lines in another

import codecs

path = "c:\\text_samples\\chem_1_utf8.txt"
path2 = "c:\\text_samples\\chem_2.txt"
input = codecs.open(path, 'r','utf8')
output = codecs.open(path2, 'w', 'utf8')

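#read the whole file and enter it into a list
chem_file = []
chem_file.append(input.read())

#split on whitespace and store the words in a list
#(a name like "kyselina močová" ends up as two words here)
for word in chem_file:
    words = word.split()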

#starting values in list
e=0   #EINECS
c=1   #CAS
ch=2  #chemical name
f=3   #formula

n=0
loop=1
x=len(words)              #counts how many words there are in the file

print('-' * 100)
while loop == 1:
    #stop before the formula index runs past the end of the list
    if n < x and f < x:
        print(words[e] + ' | ' + words[c] + ' | ' + words[ch] + ' | ' + words[f])
        output.write(words[e])
        output.write('|')
        output.write(words[c])
        output.write('|')
        output.write(words[ch])
        output.write('|')
        output.write(words[f])
        output.write('\r\n')
        #increase variables by 4 to get next set
        e = e + 4
        c = c + 4
        ch = ch + 4
        f = f + 4
        #increase by 1 to repeat
        n = n + 1
    else:
        loop = 0

input.close()
output.close()
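
One direction that might work is sketched below. It assumes that the columns in the table are always separated by at least two spaces (or a tab), while the words inside a chemical name are separated by a single space. The sample text and the helper names split_columns and table_to_records are made up for illustration, and the sketch uses Python 3 syntax. It is a starting point under those assumptions, not a finished solution:

```python
import re


def split_columns(line):
    """Split one table line on runs of two or more whitespace characters, or a tab."""
    return [field for field in re.split(r"\s{2,}|\t", line.strip()) if field]


def table_to_records(text):
    """Collect the fields line by line, then join them in groups of four."""
    fields = []
    for line in text.splitlines():
        fields.extend(split_columns(line))
    # Each complete group is EINECS, CAS, chemical name, formula.
    # An incomplete trailing group is ignored.
    return ["|".join(fields[i:i + 4]) for i in range(0, len(fields) - 3, 4)]


# Hand-typed sample (assumed layout: wide gaps between columns).
sample = (
    "200-720-7                     69-93-2\n"
    "kyselina močová               C5H4N4O3\n"
    "200-763-1                     71-73-8\n"
    "nátrium-tiopentál           C11H18N2O2S.Na\n"
)

records = table_to_records(sample)
for record in records:
    print(record)
```

If the real file ever separates two columns with a single space, this would merge them into one field, so the assumption about the gaps would need checking against the actual data.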