"BG" == Ben Gertzfield <che@debian.org> writes:

>> To follow up, I believe I have this working now.  Here's how it
>> works.

BG> Thanks for the excellent explanation and implementation,
BG> Barry.

Took me two days.  I still say Unicode is something everyone wants
until they get it. :)

BG> I'll test this when it's checked in.  Some comments below..

Excellent!

>> First, the only change to the MemberAdaptor API is that real
>> names can now be Unicode strings as well as 8-bit strings.  If
>> they're 8-bit then they'll contain only ascii characters.

BG> ASCII is by definition 7-bit, Barry.  Did you mean ISO-8859-1
BG> here?

Sorry, I meant "normal" Python strings (sometimes called "8-bit
strings") but which contain only 7-bit ascii characters.  Those
beasties I don't convert to Python unicode strings.

>> When a real name is entered into a web form, we'll first
>> attempt to convert it to us-ascii.  If that succeeds, we know
>> the real name is ascii only and we'll store it in the
>> membership database as an 8-bit ascii-only-containing string.

BG> Again, I assume you mean ISO-8859-1 instead of ascii here.

Same thing here.  We do name.encode('us-ascii') and catch any
UnicodeError that might occur.  If no error occurs, we know we have a
string with 7-bit ascii characters in it, so we store that as an 8-bit
Python string, not as a unicode Python string.

>> If the conversion fails, we'll convert the real name to Unicode
>> using the charset of the context's language (i.e. list
>> preferred if we're looking at an admin page, user preferred if
>> we're looking at an options page, and form value if we're
>> looking at the subscribe page -- all with appropriate fallbacks
>> to Something Sensible).  We'll also do html entity replacement
>> (e.g. &#246; -> ö).  We'll store this Unicode string as the
>> member's real name in the membership database, but we don't
>> store the charset because...

BG> This is a good thing.  Note that some browsers might (I
BG> haven't checked this) incorrectly send the entity &#246; for
BG> whatever character is at position 246 in the user's default
BG> character set, not character 246 in Unicode.  This might be
BG> something to look out for, but I don't know if it's important.

I don't know what else to do.  Note that you could literally type ö
into the web form and it would have the same effect.  This is probably
an 80/20 solution.

BG> Everything else looks good.  The kludge to assume iso-8859-1
BG> on us-ascii pages is unfortunately a generally good one, as
BG> that will make the most people happy.  I hate to do it,
BG> though!

Me too!  It means that names in other charsets will be screwed on
English lists, but again, I think this is the best we can do for a
practical 80/20 solution.

Thanks for the feedback.
-Barry
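
The real-name handling described in this thread can be sketched roughly as follows. This is only an illustration in modern Python 3, not the code being checked in. The names `store_real_name` and `context_charset` are hypothetical, `bytes` stands in for the "8-bit string" and `str` for the Unicode string, and the standard library's `html.unescape` is swapped in for whatever entity replacement the real code performs.

```python
import html


def store_real_name(raw: bytes, context_charset: str = "us-ascii"):
    """Hypothetical sketch: return the value to store for a real name.

    Pure 7-bit ASCII input is returned unchanged as bytes; anything else
    is decoded to a str using the charset of the page's language.
    """
    # Step 1: try us-ascii first. If this succeeds the name is 7-bit
    # clean and is stored as-is, without conversion to Unicode.
    try:
        raw.decode("us-ascii")
        return raw
    except UnicodeError:
        pass

    # Step 2: the kludge discussed above. A us-ascii page cannot
    # explain the non-ASCII bytes, so assume iso-8859-1 instead.
    charset = context_charset
    if charset.lower() == "us-ascii":
        charset = "iso-8859-1"
    text = raw.decode(charset, errors="replace")

    # Step 3: replace HTML entities such as &#246; with the character
    # they name. No charset is stored alongside the resulting str.
    return html.unescape(text)
```

For example, `store_real_name(b"Barry Warsaw")` gives back the same bytes, while `store_real_name("Jörg".encode("iso-8859-1"))` yields the string `"Jörg"` even on a us-ascii page. One consequence of following the order described above literally: an all-ASCII value that contains only an entity passes the first check and comes back unchanged in this sketch. Whether the real code behaves that way is not stated in the thread.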