[ python-Bugs-1224047 ] Len too large with national characters

SourceForge.net noreply at sourceforge.net
Mon Jun 20 15:12:01 CEST 2005


Bugs item #1224047, was opened at 2005-06-20 11:52
Message generated for change (Comment added) made by mwh
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=105470&aid=1224047&group_id=5470

Please note that this message will contain a full copy of the comment thread,
including the initial issue submission, for this request,
not just the latest update.
Category: Python Library
Group: Python 2.4
>Status: Closed
>Resolution: Invalid
Priority: 5
Submitted By: Henrik Winther Jensen (henrikwj)
>Assigned to: Michael Hudson (mwh)
Summary: Len too large with national characters

Initial Comment:
It looks as if len returns the length of a UTF-8 string
even if the string
only contains ASCII characters and the default encoding is
ASCII. This
means that if you insert, e.g., one Danish ø in a
string, len will return a
value of 2, i.e.

a='ø'
print len(a)

gives:
2

----------------------------------------------------------------------

>Comment By: Michael Hudson (mwh)
Date: 2005-06-20 14:12

Message:

Well, what encoding is the file in?

I suspect that it's in utf-8, so when you open the file and
call read() you get utf-8 data and thus your Danish
character is represented as two bytes.
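
As a minimal sketch of that point (not taken from the report; the
variable names are only illustrative), the single character ø becomes
two bytes once it is encoded as utf-8, and len on the byte form counts
bytes rather than characters:

```python
# Illustrative sketch; the names below are hypothetical, not from the report.
char = u'\xf8'                      # the Danish letter ø as a unicode string
utf8_bytes = char.encode('utf-8')   # the byte form a utf-8 file or console delivers

print(len(char))                        # number of characters
print(len(utf8_bytes))                  # number of bytes
print(len(utf8_bytes.decode('utf-8')))  # decoding restores the character count
```

Decoding the bytes before calling len is what brings the count back to
the number of characters.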

You might want to do 

import codecs
fileobj = codecs.open('filename.txt', encoding='utf-8')

and then fileobj.read() will return a unicode string of the
length you're expecting.
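
A rough, self-contained way to check this, assuming the file really is
utf-8 (the temporary file and its contents are made up for the
example), is to compare a raw binary read with a decoded read of the
same file:

```python
import codecs
import os
import tempfile

# Write the two utf-8 bytes for ø to a throwaway file (hypothetical example data).
fd, path = tempfile.mkstemp(suffix='.txt')
with os.fdopen(fd, 'wb') as raw:
    raw.write(b'\xc3\xb8')

# Reading in binary mode yields the undecoded bytes.
with open(path, 'rb') as fileobj:
    raw_data = fileobj.read()

# codecs.open decodes while reading, so read() returns a unicode string.
with codecs.open(path, encoding='utf-8') as fileobj:
    text = fileobj.read()

os.remove(path)

print(len(raw_data))  # length in bytes
print(len(text))      # length in characters
```

If the file turns out to be in another encoding, the encoding argument
would need to change accordingly.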

At any rate, I see no evidence of a Python bug here, so closing.

----------------------------------------------------------------------

Comment By: Henrik Winther Jensen (henrikwj)
Date: 2005-06-20 14:06

Message:

Actually, the problem persists whether I am reading from a
file or typing at the keyboard. I am using Python from the
command line in a Linux shell. I don't know which console that is,
but it is able to show the Danish characters on the screen as
well as read them from the keyboard.

----------------------------------------------------------------------

Comment By: Michael Hudson (mwh)
Date: 2005-06-20 13:12

Message:

How are you getting your Danish character into the string?  If it's by
typing it into a console, is your console in utf-8 mode?

----------------------------------------------------------------------
