Reading by positions plain text files

Tim Harig usernet at ilthio.net
Wed Dec 1 01:15:12 EST 2010


On 2010-12-01, javivd <javiervandam at gmail.com> wrote:
> On Nov 30, 11:43 pm, Tim Harig <user... at ilthio.net> wrote:
>> On 2010-11-30, javivd <javiervan... at gmail.com> wrote:
>>
>> > I have a case now in which another file has been provided (besides the
>> > database) that tells me in which column of the file every variable is,
>> > because there isn't any blank or tab character that separates the
>> > variables; they are stuck together. This second file specifies the
>> > variable name and its position:
>>
>> > VARIABLE NAME      POSITION (COLUMN) IN FILE
>> > var_name_1                 123-123
>> > var_name_2                 124-125
>> > var_name_3                 126-126
>> > ..
>> > ..
>> > var_name_N                 512-513 (last positions)
>>
>> I am unclear on the format of these positions.  They do not look like
>> what I would expect from absolute references in the data.  For instance,
>> 123-123 may only contain one byte??? which could change for different
>> encodings and how you mark line endings.  Frankly, the use of the
>> word columns in the header suggests that the data *is* separated by
>> line endings rather than absolute position and the position refers to
>> the line number. In which case, you can use splitlines() to break up
>> the data and then address the proper line by index.  Nevertheless,
>> you can use file.seek() to move to an absolute offset in the file,
>> if that really is what you are looking for.
>
> I work in a survey research firm. The data I'm talking about has a lot
> of 0-1 variables, meaning yes or no answers to a lot of questions, so
> only one character position (not byte) is needed, which explains the
> 123-123 kind of positions for a lot of variables.

Then file.seek() is what you are looking for, but you need to be aware of
line endings and encodings as indicated.  Make sure that you open the file
using whatever encoding was used when it was generated, or multibyte
characters could shift the offsets.
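
As a rough sketch of both approaches (written for Python 3, with a made-up
LAYOUT table and helper names that are only illustrative): the first
helper decodes each line and slices it by 1-based inclusive columns, so
the positions count characters; the second uses seek() on the raw bytes
and is only valid if every record has the same byte length and the
encoding is single-byte.

```python
# Hypothetical layout: name -> (first, last), 1-based inclusive columns,
# mirroring the "123-123" / "124-125" style of the position file.
LAYOUT = {
    "var_name_1": (1, 1),
    "var_name_2": (2, 3),
    "var_name_3": (4, 4),
}


def extract_fields(line, layout):
    """Slice one decoded record into fields by character column."""
    return {name: line[first - 1:last]
            for name, (first, last) in layout.items()}


def read_records(path, layout, encoding="ascii"):
    """Yield one dict per line of the data file.

    Decoding with the right encoding first means the columns count
    characters, not bytes.  newline="" keeps line endings untranslated,
    and they are stripped explicitly before slicing.
    """
    with open(path, encoding=encoding, newline="") as handle:
        for raw in handle:
            yield extract_fields(raw.rstrip("\r\n"), layout)


def read_field_at(path, record_index, record_length, first, last,
                  encoding="ascii"):
    """Seek straight to one field of one record.

    Assumes fixed-length records in a single-byte encoding.
    record_length must include the line terminator
    (1 byte for "\n", 2 bytes for "\r\n").
    """
    with open(path, "rb") as handle:
        handle.seek(record_index * record_length + (first - 1))
        return handle.read(last - first + 1).decode(encoding)
```

If the file might contain multibyte characters, the slicing version is the
safer of the two, since the byte offset of a column is then no longer the
same on every line.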


