parsing tab and newline delimited text
Tim Chase
python.list at tim.thechases.com
Tue Aug 3 22:49:57 EDT 2010
On 08/03/10 21:14, elsa wrote:
> I have a large file of text I need to parse. Individual 'entries' are
> separated by newline characters, while fields within each entry are
> separated by tab characters.
>
> So, an individual entry might have this form (in printed form):
>
> Title date position data
>
> with each field separated by tabs, and a newline at the end of data.
> So, I thought I could simply open a file, read each line in in turn,
> and parse it....
>
> f=open('MyFile')
> line=f.readline()
> parts=line.split('\t')
>
> etc...
>
> However, 'data' is a fairly random string of characters. Because the
> files I'm processing are large, there is a good chance that in every
> file, there is a data field that might look like this:
>
> 899998dlKKlS\lk3#kdf\nllllKK99
My first question is whether the field data contains actual
newline/tab characters, or only the string representation of
them. For one of the lines in question, what does

    print repr(line)

(or "print line.encode('hex')") produce? If the line has extra
literal tabs, then you may be stuck; if the line has escaped text
(a backslash followed by an "n" or "t", i.e. 2 characters) then
it's pretty straightforward. Ideally, you'd see something like
    >>> print repr(line)
    'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99'
            ^tab        ^tab^tab         ^backslash^
where the backslashes are literal.
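
As a rough illustration of that check (a sketch only, written in Python 3
syntax where print is a function; the helper name describe_line and the
sample string are made up for this example), one could count real control
characters and escaped sequences separately:

```python
def describe_line(line):
    """Count real tabs/newlines and escaped two-character sequences in a line."""
    body = line.rstrip('\n')  # ignore the newline that terminates the record
    return {
        'literal_tabs': body.count('\t'),      # real tab characters
        'literal_newlines': body.count('\n'),  # real newlines inside the record
        'escaped_n': body.count('\\n'),        # a backslash followed by "n"
        'escaped_t': body.count('\\t'),        # a backslash followed by "t"
    }

# Hypothetical record whose data field holds escaped text, not real newlines.
escaped = 'MyTitle\t2010-08-02\t42\t89998dlKKlS\\lk3#kdf\\nlllKK99\n'
info = describe_line(escaped)
print(repr(escaped))
print(info)
```

Exactly three literal tabs and no literal newlines would suggest the
field separators are the only real control characters in the record.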
If you know that it's only the last ("data") field that can contain
such characters, you can at least cope with stray tab characters by
limiting the number of splits to the first N:

    parts = line.split('\t', 3)
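
To make the difference concrete, here is a small sketch (Python 3 syntax;
the record and its field values are invented for the example) comparing
an unlimited split with one limited to three splits:

```python
# Made-up record: the data field itself contains two stray tabs.
line = 'MyTitle\t2010-08-02\t42\tabc\tdef\tghi\n'

record = line.rstrip('\n')        # drop the record-terminating newline
naive = record.split('\t')        # splits on every tab
limited = record.split('\t', 3)   # at most 3 splits, so at most 4 fields

print(naive)
print(limited)

# With the limited split, the stray tabs stay inside the last field.
title, date, position, data = limited
```

This only works under the assumption that the first three fields never
contain tabs themselves.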
That doesn't solve the newline problem, though: your file's
definition makes it impossible to tell how to read something like

    filedata = 'title1\tdate1\tpos1\tdata1\nxxxx\tyyyy\tzzzz\twwww\n'

Would xxxx/yyyy/zzzz/wwww be a continuation of data1, or are they
the fields of the next row?
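
The ambiguity can be shown directly. In this sketch (Python 3 syntax; the
serialize helper is hypothetical and simply assumes the file is written
with no escaping at all), two different sets of records produce exactly
the same text, so no parser could tell them apart afterwards:

```python
def serialize(records):
    """Naive writer: tab between fields, newline after each record, no escaping."""
    return ''.join('\t'.join(fields) + '\n' for fields in records)

# Interpretation A: two separate four-field records.
two_rows = [
    ['title1', 'date1', 'pos1', 'data1'],
    ['xxxx', 'yyyy', 'zzzz', 'wwww'],
]

# Interpretation B: one record whose data field contains a newline and tabs.
one_row = [
    ['title1', 'date1', 'pos1', 'data1\nxxxx\tyyyy\tzzzz\twwww'],
]

same_text = serialize(two_rows) == serialize(one_row)
print(same_text)
```

If the writer escaped the data field instead (for example, by emitting a
backslash followed by "n"), the two cases would no longer collide.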
-tkc