[Tutor] How does len() compute length of a string in UTF-8, 16, and 32?

Thu Aug 10 21:40:05 EDT 2017

On Thu, Aug 10, 2017 at 8:01 AM, Steven D'Aprano <steve at pearwood.info> wrote:
>
> Another **Must Read** resource for unicode is:
>
> The Absolute Minimum Every Software Developer Absolutely Positively Must
> Know About Unicode (No Excuses!)
>
> https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/

This was an enjoyable read, but it did not have as much technical
detail as the two videos Zach had referenced.  But then the author did
say "the absolute minimum ...".  I will strive to avoid peeling onions
on a sub!

> (By the way, it is nearly 14 years later, and PHP still believes that
> the world is ASCII.)

I thought you must surely be engaging in hyperbole, but at
http://php.net/manual/en/xml.encoding.php I found:

"The default source encoding used by PHP is ISO-8859-1."

>
> Python 3 makes Unicode about as easy as it can get. To include a unicode
> string in your source code, you just need to ensure your editor saves
> the file as UTF-8, and then insert (by whatever input technology you
> have) the character you want. You want a Greek pi?
>
> pi = "π"
>
> How about an Israeli sheqel?
>
> money = "₪1000"
>
> So long as your editor knows to save the file in UTF-8, it will Just
> Work.

So Python 3's default behavior for strings is to store them UTF-8
encoded, both in RAM and in files?  No funny business anywhere?  Except
perhaps in my Windows 7 cmd.exe and PowerShell, but that's not
Python's fault.  Which makes me wonder, what encoding does my editor
use by default when it saves and opens files?  I will have to
investigate!
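
In the meantime, here is a minimal sketch of how I might check some of
this at the interpreter.  The names samples, code_points and
byte_counts are my own inventions, and I am assuming Python 3 with the
file saved as UTF-8.  As far as I understand it, len() on a str counts
Unicode code points, and a byte count only exists once the string has
been encoded with some particular encoding:

```python
import locale
import sys

# Hypothetical sample strings, borrowed from Steven's examples above.
samples = {"pi": "π", "money": "₪1000"}

# len() on a str counts Unicode code points, not bytes.
code_points = {name: len(text) for name, text in samples.items()}

# The byte count depends on the encoding chosen when encoding.
# The "-le" codec names are used so that no byte order mark is prepended.
encodings = ("utf-8", "utf-16-le", "utf-32-le")
byte_counts = {
    name: {enc: len(text.encode(enc)) for enc in encodings}
    for name, text in samples.items()
}

print(code_points)
print(byte_counts)

# Encoding that str.encode() and bytes.decode() use when none is given.
print(sys.getdefaultencoding())

# The locale's preferred encoding; this one is platform dependent.
print(locale.getpreferredencoding(False))
```

If I have the arithmetic right, "π" is 1 code point that takes 2, 2
and 4 bytes in UTF-8, UTF-16 and UTF-32, while "₪1000" is 5 code
points that take 7, 10 and 20 bytes.  The last line printed will vary
by platform, so it looks as if text files are not necessarily UTF-8 by
default unless I ask for it with open(..., encoding="utf-8").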

Cheers!

-- 
boB