Why do my list go uni-code by itself?
MRAB
python at mrabarnett.plus.com
Mon Dec 20 16:41:10 EST 2010
> I'm reading a fixed format text file, line by line. I hereunder
> present the code. I have <snipped> out part not related to the file reading.
> Only relevant detail left out is the lstCutters. It looks like this:
> [[1, 9], [11, 21], [23, 48], [50, 59], [61, 96], [98, 123], [125, 150]]
> It specifies the first and last character position of each token in
> the fixed format of the input line.
> All this works fine, and is only to explain where I'm going.
>
> The code, in the function definition, is broken up in more lines than
> necessary, to be able to monitor the variables, step by step.
>
> --- Code start ------
>
> import codecs
>
> <snip>
>
> def CutLine2List(strIn,lstCut):
>     strIn = strIn.strip()
>     print '>InNextLine>',strIn
>     # skip if line is empty
>     if len(strIn)<1:
>         return False
More Pythonic would be:
    if not strIn:
>     lstIn = list()
>     for cc in lstCut:
>         strSubline =strIn[cc[0]-1:cc[1]-1].strip()
The start index is inclusive; the end index is exclusive.
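To make that concrete, here is a small sketch using the first pair from lstCutters. It assumes the positions are 1-based and that the last position is meant to be included, as the description says; the sample line is made up for illustration.

```python
# Hypothetical sample line; positions are 1-based and inclusive, as described.
line = "ABCDEFGHI JKLMNOPQRST"
first, last = 1, 9

# The posted slice stops one character short, because the end index is exclusive.
short = line[first - 1:last - 1]

# Subtract 1 from the start only; the exclusive end index already stops after position `last`.
full = line[first - 1:last]

print(repr(short))
print(repr(full))
```

With those assumptions, the posted form loses the final character of every field, while the second form keeps all last - first + 1 characters.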
>         lstIn.append(strSubline)
>         print '>InSubline2>'+lstIn[len(lstIn)-1]+'<'
>     del strIn, lstCut,cc
Not necessary to del the names. They exist only within the function,
which you're about to leave.
>     print '>InReturLst>',lstIn
>     return lstIn
>
Sometimes it returns a list and sometimes False. That's a bad idea; try
to be consistent.
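One way to keep the return type consistent is to return an empty list for an empty line, so the caller always gets a list. The following is only a sketch: the name cut_line_to_list is hypothetical, the debugging prints are dropped, the strip() call is kept from the posted code, and the slice assumes 1-based, inclusive positions.

```python
def cut_line_to_list(line, cuts):
    """Split a fixed-format line into stripped fields.

    Always returns a list; an empty or blank line gives an empty list.
    """
    line = line.strip()  # kept from the posted code
    if not line:
        return []
    # cuts holds 1-based, inclusive (first, last) positions.
    return [line[first - 1:last].strip() for first, last in cuts]


# Hypothetical sample data, using the first two pairs from lstCutters.
cuts = [[1, 9], [11, 21]]
fields = cut_line_to_list("ABCDEFGHI JKLMNOPQRST\n", cuts)
blank = cut_line_to_list("   \n", cuts)
print(fields)
print(blank)
```

The caller can then write "if fields:" to skip empty lines, since an empty list is false, without having to check for two different types.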
> <snip>
>
> filIn = codecs.open(
>     strFileNameIn,
>     mode='r',
>     encoding='utf-8',
>     errors='strict',
>     buffering=1)
You're decoding from UTF-8 to Unicode, so all the strings you're
working on are Unicode strings.
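A small sketch of what that means in practice. The temporary file and its contents are made up for illustration; the check uses type(u"") so the same code runs on Python 2 and Python 3.

```python
import codecs
import os
import tempfile

# Write a small UTF-8 file to a temporary location (hypothetical sample data).
path = os.path.join(tempfile.mkdtemp(), "sample.txt")
with codecs.open(path, mode="w", encoding="utf-8") as f:
    f.write(u"abc def\n")

# Reading it back through codecs.open decodes each line to a Unicode string.
with codecs.open(path, mode="r", encoding="utf-8") as f:
    fields = [line.strip() for line in f]

# type(u"") is unicode on Python 2 and str on Python 3.
is_unicode = all(isinstance(s, type(u"")) for s in fields)

print(repr(fields))
```

On Python 2, printing a list shows the repr of each item, so Unicode strings appear with a u prefix, as in [u'abc def']. The strings themselves are unchanged; printing a single item shows it without the prefix.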
> for linIn in filIn:
>     lstIn = CutLine2List(linIn,lstCutters)
>