Mailman 3 Unicode 8.0 and 3.5 - Python-Dev

Unicode 8.0 and 3.5

Terry Reedy

18 Jun 2015

2:27 p.m.

Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

-- Terry Jan Reedy

Larry Hastings

18 Jun

2:33 p.m.

On 06/18/2015 11:27 AM, Terry Reedy wrote:

...

Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What does this entail? Data changes, code changes, both? //arry/

MRAB

3:34 p.m.

On 2015-06-18 19:33, Larry Hastings wrote:

...

On 06/18/2015 11:27 AM, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What does this entail? Data changes, code changes, both?

It looks like just data changes. There are additional codepoints and a renamed property (which the standard library doesn't support anyway).

Steven D'Aprano

7:56 p.m.

On Thu, Jun 18, 2015 at 08:34:14PM +0100, MRAB wrote:

...

On 2015-06-18 19:33, Larry Hastings wrote:

...
On 06/18/2015 11:27 AM, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What does this entail? Data changes, code changes, both?

It looks like just data changes.

At the very least, there is a change to the casefolding algorithm. Cherokee was classified as unicameral but is now considered bicameral (two cases, like English). Unusually, case-folding Cherokee maps to uppercase rather than lowercase.

The full set of changes is listed here:

http://unicode.org/versions/Unicode8.0.0/

Apart from the addition of 7716 characters and changes to str.casefold(), I don't think any of the changes will make a big difference to Python's implementation. But it would be good to support Unicode 8 (to the degree that Python actually does support Unicode, rather than just the character set part of it).
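The Cherokee case-folding behaviour described above can be checked from Python. The following is a minimal sketch, assuming an interpreter whose character database is Unicode 8.0 or later (on older databases U+AB70 is unassigned and the mappings are identities). The helper name `case_report` is made up for this illustration.

```python
import unicodedata

SMALL_A = "\uab70"    # CHEROKEE SMALL LETTER A (added in Unicode 8.0)
CAPITAL_A = "\u13a0"  # CHEROKEE LETTER A (the pre-existing, now uppercase, letter)


def case_report(ch):
    """Return the name and case mappings of a single character (hypothetical helper)."""
    return {
        "name": unicodedata.name(ch, "<unassigned>"),
        "lower": "U+%04X" % ord(ch.lower()),
        "upper": "U+%04X" % ord(ch.upper()),
        "casefold": "U+%04X" % ord(ch.casefold()),
    }


if __name__ == "__main__":
    print("database:", unicodedata.unidata_version)
    for ch in (SMALL_A, CAPITAL_A, "a", "A"):
        print("U+%04X" % ord(ch), case_report(ch))
```

With Unicode 8.0 data, the small Cherokee letter folds to the capital one, whereas the Latin letters fold to lowercase as usual.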

...

There are additional codepoints and a renamed property (which the standard library doesn't support anyway).

Which one are you referring to, Indic_Matra_Category renamed to Indic_Positional_Category?

-- Steve

MRAB

8:55 p.m.

On 2015-06-19 00:56, Steven D'Aprano wrote:

...

On Thu, Jun 18, 2015 at 08:34:14PM +0100, MRAB wrote:

...
On 2015-06-18 19:33, Larry Hastings wrote:

...
On 06/18/2015 11:27 AM, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What does this entail? Data changes, code changes, both?

It looks like just data changes.

At the very least, there is a change to the casefolding algorithm. Cherokee was classified as unicameral but is now considered bicameral (two cases, like English). Unusually, case-folding Cherokee maps to uppercase rather than lowercase.

Doesn't the case-folding just depend on the data and the algorithm remains the same?

...

The full set of changes is listed here:

http://unicode.org/versions/Unicode8.0.0/

Apart from the addition of 7716 characters and changes to str.casefold(), I don't think any of the changes will make a big difference to Python's implementation. But it would be good to support Unicode 8 (to the degree that Python actually does support Unicode, rather than just that character set part of it).

...
There are additional codepoints and a renamed property (which the standard library doesn't support anyway).

Which one are you referring to, Indic_Matra_Category renamed to Indic_Positional_Category?

Yes.

Steven D'Aprano

11:33 p.m.

On Fri, Jun 19, 2015 at 01:55:07AM +0100, MRAB wrote:

...

On 2015-06-19 00:56, Steven D'Aprano wrote:

...

...
At the very least, there is a change to the casefolding algorithm. Cherokee was classified as unicameral but is now considered bicameral (two cases, like English). Unusually, case-folding Cherokee maps to uppercase rather than lowercase.

Doesn't the case-folding just depend on the data and the algorithm remains the same?

That depends on what algorithm str.casefold uses :-)

Case folding is specifically mentioned as something that people migrating to Unicode 8 will need to take care with. The migration notes also say:

"This mapping also has consequences on identifiers, as described in the changes to UAX #31, Unicode Identifier and Pattern Syntax."

http://unicode.org/versions/Unicode8.0.0/#Migration

-- Steve

Serhiy Storchaka

19 Jun

12:56 a.m.

On 18.06.15 22:34, MRAB wrote:

...

On 2015-06-18 19:33, Larry Hastings wrote:

...
On 06/18/2015 11:27 AM, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What does this entail? Data changes, code changes, both?

It looks like just data changes.

There are additional codepoints and a renamed property (which the standard library doesn't support anyway).

Maybe the private table for case-insensitive matching in the re module should be updated too.

Serhiy Storchaka

28 Jun

2:03 a.m.

On 19.06.15 07:56, Serhiy Storchaka wrote:

...

Maybe the private table for case-insensitive matching in the re module should be updated too.

I can confirm that the re module doesn't need an update for Unicode 8.0.

Jim J. Jewett

22 Jun

10:17 a.m.

On Thu Jun 18 20:33:13 CEST 2015, Larry Hastings asked:

...

On 06/18/2015 11:27 AM, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

...

What does this entail? Data changes, code changes, both?

Note that the Unicode 7 changes also need to be considered, because Python 3.4 used Unicode 6.3.

There are some changes to the recommendations on what to use in identifiers. Python doesn't precisely follow the previous rules, but it would be good to ensure that any newly allowed characters are intentional, particularly the newly defined characters.

My gut feeling is that it would have been fine during beta, but for the 3rd RC I am not so sure.

-jJ

-- If there are still threading problems with my replies, please email me with details, so that I can try to resolve them. -jJ
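One way to see which characters an interpreter accepts at the start of an identifier is to probe str.isidentifier() over a code point range. This is only a sketch, not the procedure anyone in the thread used; the function name `identifier_start_chars` is hypothetical. Running it under two interpreters with different Unicode databases and comparing the results would show the newly allowed characters.

```python
import unicodedata


def identifier_start_chars(start, stop):
    """Return the code points in range(start, stop) that are valid one-character identifiers."""
    return [cp for cp in range(start, stop) if chr(cp).isidentifier()]


if __name__ == "__main__":
    # Cherokee Supplement block (U+AB70..U+ABBF), new in Unicode 8.0.
    allowed = identifier_start_chars(0xAB70, 0xABC0)
    print("database:", unicodedata.unidata_version)
    print("identifier-start code points in block:", len(allowed))
    for cp in allowed[:3]:
        print("U+%04X" % cp, unicodedata.name(chr(cp), "<unassigned>"))
```

A single-character string is a valid identifier only if the character may start one, so this probes identifier-start characters, not characters that are allowed only in later positions.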

Larry Hastings

18 Jun

11:30 p.m.

On 06/18/2015 11:27 AM, Terry Reedy wrote:

...

Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

What do the Python unicodedata experts say? According to the Dev Guide, that'd be "loewis", "lemburg", and "ezio.melotti".

//arry/

Benjamin Peterson

27 Jun

4:57 p.m.

On Thu, Jun 18, 2015, at 13:27, Terry Reedy wrote:

...

Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

3.5 now has Unicode 8.0.0.
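To check which Unicode database a given interpreter carries, the unicodedata module exposes the version as a string. A small sketch (the helper name `unicode_db_version` is made up here):

```python
import sys
import unicodedata


def unicode_db_version():
    """Return the Unicode database version as a tuple of ints, e.g. (8, 0, 0)."""
    return tuple(int(part) for part in unicodedata.unidata_version.split("."))


if __name__ == "__main__":
    print("Python", ".".join(map(str, sys.version_info[:3])))
    print("Unicode database", unicodedata.unidata_version)
    # U+1F32E was added in Unicode 8.0; older databases report it as unassigned.
    print(unicodedata.name("\U0001F32E", "<unassigned>"))
```

Comparing the tuple, rather than the raw string, avoids ordering mistakes such as "10.0.0" sorting before "8.0.0".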

Terry Reedy

28 Jun

12:22 a.m.

On 6/27/2015 4:57 PM, Benjamin Peterson wrote:

...

On Thu, Jun 18, 2015, at 13:27, Terry Reedy wrote:

...
Unicode 8.0 was just released. Can we have unicodedata updated to match in 3.5?

3.5 now has Unicode 8.0.0.

Great. Does the release PEP or something else have instructions on how to do this?

-- Terry Jan Reedy
