Why do my list go uni-code by itself?

MRAB python at mrabarnett.plus.com
Mon Dec 20 16:41:10 EST 2010


 > I'm reading a fixed format text file, line by line. I hereunder
 > present the code. I have <snipped> out part not related to the file reading.
 > Only relevant detail left out is the lstCutters. It looks like this:
 > [[1, 9], [11, 21], [23, 48], [50, 59], [61, 96], [98, 123], [125, 150]]
 > It specifies the first and last character position of each token in
 > the fixed format of the input line.
 > All this works fine, and is only to explain where I'm going.
 >
 > The code, in the function definition, is broken up in more lines than
 > necessary, to be able to monitor the variables, step by step.
 >
 > --- Code start ------
 >
 > import codecs
 >
 > <snip>
 >
 > def CutLine2List(strIn,lstCut):
 >     strIn = strIn.strip()
 >     print '>InNextLine>',strIn
 >     # skip if line is empty
 >     if len(strIn)<1:
 >         return False

More Pythonic would be:

     if not strIn:

 >     lstIn = list()
 >     for cc in lstCut:
 >         strSubline =strIn[cc[0]-1:cc[1]-1].strip()

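The start index is inclusive; the end index is exclusive. If the
positions in lstCut are 1-based and inclusive, as they appear to be,
then subtracting 1 from the end index drops the last character of each
token. The slice should probably be:

     strIn[cc[0]-1:cc[1]]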
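
A minimal sketch of the slicing point, assuming the positions are 1-based
and inclusive. The function name cut_line and the sample line are made up
for illustration; they are not from the original code.

```python
def cut_line(line, cutters):
    """Cut line at 1-based, inclusive (first, last) character positions."""
    # Python slices include the start index and exclude the end index,
    # so only the start needs the -1 adjustment.
    return [line[first - 1:last].strip() for first, last in cutters]

# Hypothetical sample: a 9-character field, a separator, an 11-character field.
sample = 'ABCDEFGHI JKLMNOPQRST'
cutters = [[1, 9], [11, 21]]

fields = cut_line(sample, cutters)

# The slice as written in the quoted code loses the last character.
short_field = sample[cutters[0][0] - 1:cutters[0][1] - 1]
```

Here fields keeps every character of both tokens, while short_field is one
character shorter than the first token.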

 >         lstIn.append(strSubline)
 >         print '>InSubline2>'+lstIn[len(lstIn)-1]+'<'
 >     del strIn, lstCut,cc

Not necessary to del the names. They exist only within the function,
which you're about to leave.

 >     print '>InReturLst>',lstIn
 >     return lstIn
 >
Sometimes it returns a list and sometimes False. That's a bad idea; try
to be consistent.
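
One way to make the return type consistent, as a sketch only: always
return a list, and let an empty list stand for a blank line. The name
cut_line_to_list is hypothetical, and the slice assumes 1-based,
inclusive positions.

```python
def cut_line_to_list(line, cutters):
    """Return the list of tokens; an empty list for a blank line."""
    line = line.strip()
    if not line:
        # Same type as the normal case, and still false in an if test.
        return []
    # Assumes 1-based, inclusive (first, last) positions.
    return [line[first - 1:last].strip() for first, last in cutters]

blank_result = cut_line_to_list('   \n', [[1, 3], [5, 7]])
normal_result = cut_line_to_list('abc def\n', [[1, 3], [5, 7]])
```

The caller can still write "if not lstIn:" to skip blank lines, but never
has to check whether it got a list or a bool.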

 > <snip>
 >
 > filIn = codecs.open(
 >                     strFileNameIn,
 >                     mode='r',
 >                     encoding='utf-8',
 >                     errors='strict',
 >                     buffering=1)

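You're decoding from UTF-8 to Unicode, so all the strings you're
working on are Unicode strings. When you print a list, Python shows the
repr of each item, which in Python 2 is why they appear as u'...'.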
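
A small sketch of what the decoding does, using a literal byte string in
place of the file (the sample data is made up). codecs.open with
encoding='utf-8' performs the same decode step on what it reads.

```python
# -*- coding: utf-8 -*-
raw = b'abc d\xc3\xa9f\n'       # UTF-8 bytes, as they would be stored on disk
line = raw.decode('utf-8')      # now a Unicode string

text_type = type(u'')           # unicode on Python 2, str on Python 3
is_text = isinstance(line, text_type)

# Slices of a Unicode string are Unicode strings too.
tokens = [line[0:3], line[4:7]]

# Printing a list shows the repr of each item:
# Python 2 prints [u'abc', u'd\xe9f'], Python 3 prints ['abc', 'déf'].
print(repr(tokens))
```

The strings themselves are the same either way; only the repr shown when
printing the list differs.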

 > for linIn in filIn:
 >     lstIn = CutLine2List(linIn,lstCutters)
 >



