One-byte string dtype: third time's the charm?
The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].

tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?

A key consequence of not having a one-byte string dtype is that handling ASCII data stored in binary formats such as HDF or FITS is basically broken in Python 3. Packages like h5py, pytables, and astropy.io.fits all return text data arrays with the numpy 'S' type, and in fact have no direct support for the numpy wide unicode 'U' type. In Python 3, the 'S' type array cannot be compared with the Python str type, so that something like below fails:
mask = (names_array == "john") # FAIL
Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.

For a good top-level summary of much of the previous thread discussion, see [5] from Chris Barker. Condensing this down to just a few points:

- *Changing* the behavior of the existing 'S' type is going to break code and seems a bad idea.
- *Adding* a new dtype 's' will work and allow highly performant conversion from 'S' to 's' via view().
- Using the latin-1 encoding will minimize code breakage vis-a-vis what works in Python 2 [6].

Using latin-1 is a pragmatic compromise that provides continuity to allow scientists to run their existing code in Python 3 and have things just work. It isn't perfect and it should not be the end of the story, but it would be good. This single issue is the *only* thing blocking me and my team from using Python 3 in operations.

As a final point, I don't know the numpy internals at all, but it *seems* like this proposal is one of the easiest to implement amongst those that were discussed.

Cheers,
Tom

[1]: http://mail.scipy.org/pipermail/numpy-discussion/2014-January/068622.html
[2]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070574.html
[3]: https://github.com/astropy/astropy/issues/3311
[4]: http://astropy.readthedocs.org/en/latest/api/astropy.table.Table.html#astrop...
[5]: http://mail.scipy.org/pipermail/numpy-discussion/2014-July/070631.html
[6]: It is not uncommon to store uint8 data in a bytestring
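For readers following along, here is a minimal sketch of the two workarounds the original post alludes to, assuming a recent NumPy on Python 3. The array contents are made up for illustration. The direct comparison against a `str` is left out on purpose, since its exact behavior has varied between NumPy versions.

```python
import numpy as np

# Illustrative data: the kind of 'S' (bytes) array that h5py, PyTables or
# astropy.io.fits return for ASCII text columns.
names_array = np.array([b"john", b"jane", b"tom"], dtype="S4")

# Workaround 1: compare against bytes instead of str. No copy is made,
# but every string literal in user code has to become a bytes literal.
mask_bytes = names_array == b"john"

# Workaround 2: convert to the wide unicode 'U' dtype, which compares
# with str but stores four bytes per character (UCS-4).
names_unicode = names_array.astype("U")
mask_unicode = names_unicode == "john"

print(mask_bytes, mask_unicode)
print("bytes per element:", names_array.itemsize, "->", names_unicode.itemsize)
```

The fourfold growth in `itemsize` is the memory hit mentioned for gigabyte-sized arrays.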
On 22/02/15 19:21, Aldcroft, Thomas wrote:
Problems like this are now showing up in the wild [3]. Workarounds are also showing up, like a way to easily convert from 'S' to 'U' within astropy Tables [4], but this is really not a desirable way to go. Gigabyte-sized string data arrays are not uncommon, so converting to UCS-4 is a real memory and performance hit.
Why UCS-4? Python's internal "flexible string representation" will use ascii for ascii text. By PEP 393 an application should not assume an internal string representation at all:

https://www.python.org/dev/peps/pep-0393/

If the problem is PEP 393 violation in NumPy string or unicode dtype, we shouldn't violate it even further by adding a latin-1 encoded ascii string. We should let Python represent strings as it wants, and it will not bloat.

I am -1 on adding latin-1 and +1 on making the unicode dtype PEP 393 compliant if it is not. And on Python 3 'U' and 'S' should just be synonyms.

You can also store an array of bytes with uint8. Then you can decode it however you like to make a Python string. If it is encoded as latin-1 then decode it as latin-1:

    In [1]: import numpy as np

    In [2]: ascii_bytestr = "The quick brown fox jumps over the lazy dog".encode('latin-1')

    In [3]: numpy_bytestr = np.array(memoryview(ascii_bytestr))

    In [4]: numpy_bytestr.dtype, numpy_bytestr.shape
    Out[4]: (dtype('uint8'), (43,))

    In [5]: bytes(numpy_bytestr).decode('latin-1')
    Out[5]: 'The quick brown fox jumps over the lazy dog'

Sturla
On Sun, Feb 22, 2015 at 10:21 AM, Aldcroft, Thomas
The idea of a one-byte string dtype has been extensively discussed twice before, with a lot of good input and ideas, but no action [1, 2].
tl;dr: Perfect is the enemy of good. Can numpy just add a one-byte string dtype named 's' that uses latin-1 encoding as a bridge to enable Python 3 usage in the near term?
I think this is a good idea. I think overall it would be good for numpy to switch to using variable-length strings in most cases (cf. pandas), which is a different kind of change, but fixed-length 8-bit encoded text is obviously a common on-disk format in scientific applications, so numpy will still need some way to deal with it conveniently. In the long run we'd like to have more flexibility (e.g. allowing choice of character encoding), but since this proposal is a subset of that functionality, it won't interfere with later improvements. I can see an argument for utf8 over latin1, but it really doesn't matter that much so whatever, blue and purple bikesheds are both fine.

The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since to do it directly, without first refactoring how dtypes work, you'll have to add lots of copy-paste code to all the different ufuncs.

-n
-- Nathaniel J. Smith -- http://vorpus.org
On Sun, Feb 22, 2015 at 11:29 AM, Sturla Molden
Why UCS-4? Python's internal "flexible string representation" will use ascii for ascii text.
This is a discussion about how strings are represented as bit-patterns inside ndarrays; the internal storage representation used by 'str' is irrelevant. -n -- Nathaniel J. Smith -- http://vorpus.org
On Sun, Feb 22, 2015 at 7:29 PM, Sturla Molden
Why UCS-4? Python's internal "flexible string representation" will use ascii for ascii text.
numpy's 'U' dtype is UCS-4, and this is what Thomas is referring to, not Python's string type. It cannot have a flexible representation as it *is* the representation. Python 3's `str` type is opaque, so it can freely choose how to represent the data in memory. numpy dtypes transparently describe how the data is represented in memory. -- Robert Kern
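The distinction between a transparent dtype and an opaque `str` can be seen directly. This is a small sketch assuming a native-byte-order 'U' array; the variable names are illustrative only.

```python
import sys
import numpy as np

arr = np.array(["abc"], dtype="U3")

# Each element occupies 3 characters * 4 bytes, regardless of content.
print(arr.dtype.itemsize)

# Reinterpreting the same buffer as 32-bit integers exposes the code points,
# because the dtype fully describes the memory layout.
code_points = arr.view(np.uint32)
print(code_points)

# A Python str does not expose its layout; only the total object size is
# observable, and that is an implementation detail of the interpreter.
print(sys.getsizeof("abc"))
```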
On 22/02/15 21:04, Robert Kern wrote:
Python 3's `str` type is opaque, so it can freely choose how to represent the data in memory. numpy dtypes transparently describe how the data is represented in memory.
Hm, yes, that is a good point. Sturla
On 22/02/15 20:57, Nathaniel Smith wrote:
This is a discussion about how strings are represented as bit-patterns inside ndarrays; the internal storage representation used by 'str' is irrelevant.
I thought it would be clever to just use the same internal representation as Python would choose. But obviously it is not. UTF-8 would fail because it is a variable-width encoding, so characters are not stored at regular offsets. And every string in an ndarray will need to have the same encoding, but Python might think otherwise.

Sturla
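A short sketch of why a variable-width encoding sits awkwardly in fixed-width array slots. It uses only the standard library, and the sample words are chosen for illustration.

```python
# "naive", "naïve" and a two-character Japanese word, written with escapes
# so the example does not depend on the source file encoding.
words = ["naive", "na\u00efve", "\u65e5\u672c"]

for word in words:
    utf8 = word.encode("utf-8")
    # latin-1 can only encode code points below 256.
    try:
        latin1_len = len(word.encode("latin-1"))
    except UnicodeEncodeError:
        latin1_len = None
    print(repr(word), len(word), len(utf8), latin1_len)

# Bytes needed per word in UTF-8: not a fixed multiple of the character count.
utf8_lengths = [len(w.encode("utf-8")) for w in words]
```

With latin-1 the byte count always equals the character count, which is what makes a fixed-width one-byte dtype straightforward; the price is that anything outside the first 256 code points cannot be stored at all.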
On Sun, Feb 22, 2015 at 2:52 PM, Nathaniel Smith
The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since to do it directly without refactoring how dtypes work first then you'll have to add lots of copy-paste code to all the different ufuncs.
I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the "annoying" copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
- Tom
-n
-- Nathaniel J. Smith -- http://vorpus.org _______________________________________________ NumPy-Discussion mailing list NumPy-Discussion@scipy.org http://mail.scipy.org/mailman/listinfo/numpy-discussion
On Sun, Feb 22, 2015 at 12:52 PM, Nathaniel Smith
The tricky bit here is "just" :-). Do you want to implement this? Do you know someone who does? It's possible but will be somewhat annoying, since to do it directly without refactoring how dtypes work first then you'll have to add lots of copy-paste code to all the different ufuncs.
We're also running out of letters for types. We need to decide on how to extend that representation. It would seem straightforward to just start using multiple letters, but there is a lot of code that uses things like `for dt in 'efdg':`. Can we perhaps introduce an extended dtype structure, maybe with some ideas from dynd and versioning?

Chuck
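To make the concern concrete, this is the kind of loop that relies on one character per type code. It is a sketch using the existing floating-point codes; the `split_codes` line only illustrates what would happen to a multi-letter code in such a loop.

```python
import numpy as np

# The four floating-point type codes: half, single, double, long double.
for code in "efdg":
    dt = np.dtype(code)
    print(code, dt.name, dt.itemsize)

float_names = [np.dtype(code).name for code in "efd"]
float_kinds = {np.dtype(code).kind for code in "efdg"}

# Iterating over a string yields single characters, so a two-letter code
# would silently be treated as two separate codes by loops like the above.
split_codes = list("ab")
```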
On Sun, Feb 22, 2015 at 3:40 PM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
I would be happy to have a go at this, with the caveat that someone who understands numpy would need to get me started with a minimal prototype. From there I can do the "annoying" copy-paste for ufuncs etc, writing tests and docs. I'm assuming that with a prototype the rest can be done without any deep understanding of numpy internals (which I do not have).
- Tom
The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational`, that got added and could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.

Chuck
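For reference, a quick look at the existing `c` type code mentioned above. This is a sketch assuming current NumPy behavior, where `c` is a one-byte character type equivalent in storage to `S1`.

```python
import numpy as np

char_dtype = np.dtype("c")
print(char_dtype, char_dtype.itemsize)

# With dtype='c' a string is split into one-byte elements rather than
# being stored as a single fixed-width string.
chars = np.array("abc", dtype="c")
print(chars.shape, chars.dtype)
```

Because `c` already has this splitting behavior attached to it, reusing it for a latin-1 text dtype would change what existing code sees, which may be the kind of problem hinted at.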
On Sun, Feb 22, 2015 at 2:42 PM, Charles R Harris
We're also running out of letters for types. We need to decide on how to extend that representation. It would seem straightforward to just start using multiple letters, but there is a lot of code that uses things like `for dt in 'efdg':`. Can we perhaps introduce an extended dtype structure, maybe with some ideas from dynd and versioning?
I don't mind using "s" for this particular case, but in general I think we should de-emphasise the string representations, and even allow new dtypes to forgo them entirely. We have all of Python to work with. It's much nicer for users and for us to write things like

dtype=np.someclass(special_option=True)

instead of

dtype="SC[S_O=T]"

or whatever weird ad-hoc syntax we come up with. (Obviously there are some details to work out with things like the .npy format, but these seem solvable.)

-n
-- Nathaniel J. Smith -- http://vorpus.org
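A rough sketch of the contrast between string codes and Python objects. The first two lines use the existing NumPy API; `EncodedText` is a hypothetical class invented here for illustration and is not part of NumPy.

```python
from dataclasses import dataclass

import numpy as np

# Existing NumPy API: the same dtype spelled as a string code or with objects.
from_code = np.dtype("S5")
from_objects = np.dtype((np.bytes_, 5))


@dataclass(frozen=True)
class EncodedText:
    """Hypothetical dtype description; not part of NumPy."""

    length: int
    encoding: str = "latin-1"

    @property
    def storage_dtype(self) -> np.dtype:
        # Fixed-width one-byte storage, reusing the existing bytes dtype.
        return np.dtype((np.bytes_, self.length))


spec = EncodedText(length=5)
arr = np.zeros(3, dtype=spec.storage_dtype)
print(spec, arr.dtype)
```

The point of the object form is that options such as `encoding` get real names and defaults instead of being squeezed into an ad-hoc string syntax.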
On Sun, Feb 22, 2015 at 5:46 PM, Charles R Harris wrote:
The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational` that got added, that could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.
OK I'll have a look at those.
Thanks,
Tom
On Sun, Feb 22, 2015 at 2:46 PM, Charles R Harris
The last two new types added to numpy were float16 and datetime64. Might be worth looking at the steps needed to implement those. There was also a user type, `rational` that got added, that could also provide a template. Maybe we need to have a way to add 'numpy certified' user data types. It might also be possible to reuse the `c` data type, currently implemented as `S1` IIRC, but that could cause some problems.
float16 and rational probably aren't too relevant because they are fixed-size types, and variable-size dtypes are much trickier. datetime64 will be more similar, but also add its own irrelevant complexities -- you might be best off just looking at how S and U work and copying them. -n -- Nathaniel J. Smith -- http://vorpus.org
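The difference between fixed-size and flexible dtypes can be seen from the existing 'S' and 'U' types. This is a minimal sketch; the variable names are illustrative.

```python
import numpy as np

# Fixed-size: the type code alone determines the item size.
half_size = np.dtype("e").itemsize  # float16

# Flexible: the code names a family, and a length completes the dtype.
unsized = np.dtype("S")
sized_bytes = np.dtype("S5")
sized_unicode = np.dtype("U5")
print(half_size, unsized.itemsize, sized_bytes.itemsize, sized_unicode.itemsize)

# On array creation the length is inferred from the longest element.
inferred = np.array([b"ab", b"abcde"]).dtype
print(inferred)
```

A new one-byte text dtype would need the same "family plus length" handling, which is why 'S' and 'U' are suggested as the closer templates.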
On Sun, Feb 22, 2015 at 5:56 PM, Aldcroft, Thomas < aldcroft@head.cfa.harvard.edu> wrote:
OK I'll have a look at those.
On second thought.. Maybe I'm being naive, but I think that starting from scratch looking at entirely new dtypes is harder than it needs to be, or at least not the most straightforward path [EDIT: just saw email from Nathan agreeing here]. What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'. In other words 's' will interface with Python as a bytes (aka str) object just like 'S'.
- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in all operations, but interfaces with Python as a latin-1 encoded string.

So the only difference is at the interface layer with Python (initialization, comparison, iteration, etc). So as a starting point we would want to clone 'S' to 's', then fix up the interface to Python 3. Does that sound about right?

- Tom
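The Python 3 behavior described in this proposal can be emulated today with plain functions over an 'S' array. This is only a sketch of the intended semantics, not an implementation of the dtype; `s_equal` and `s_item` are hypothetical helper names.

```python
import numpy as np

ENCODING = "latin-1"  # encoding proposed in the thread


def s_equal(byte_array, text):
    """Hypothetical helper: compare an 'S' array with a Python str."""
    return byte_array == text.encode(ENCODING)


def s_item(byte_array, index):
    """Hypothetical helper: return one element as a Python str."""
    return byte_array[index].decode(ENCODING)


# Storage stays one byte per character, exactly as with 'S'.
names = np.array([b"john", b"caf\xe9"], dtype="S4")

mask = s_equal(names, "caf\u00e9")  # compare with the str "café"
first = s_item(names, 0)
print(mask, first)
```

In the proposed dtype these conversions would happen inside numpy at the interface layer, so user code could write the comparison against a `str` directly.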
On Feb 22, 2015 3:39 PM, "Aldcroft, Thomas" wrote:
On second thought.. Maybe I'm being naive, but I think that starting from scratch looking at entirely new dtypes is harder than it needs to be, or at least not the most straightforward path [EDIT: just saw email from Nathan agreeing here]. What is being proposed is essentially:

- For Python 2, the 's' type is exactly a clone of 'S'. In other words 's' will interface with Python as a bytes (aka str) object just like 'S'.
- For Python 3, the 's' type is internally the same as 'S' (np.bytes_) in all operations, but interfaces with Python as a latin-1 encoded string.

So the only difference is at the interface layer with Python (initialization, comparison, iteration, etc). So as a starting point we would want to clone 'S' to 's', then fix up the interface to Python 3. Does that sound about right?

Sounds reasonable to me. You'll also want to consider interactions between the dtypes -- mixed operations like array("a", dtype="s") == array("a", dtype="U") should do the right thing, and casting s<->U ditto.

-n
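What "the right thing" for s<->U might look like can be approximated with existing functions. This sketch uses `np.char.decode` and `np.char.encode` on an ordinary 'S' array standing in for the proposed 's' dtype; the data is illustrative.

```python
import numpy as np

# Latin-1 encoded bytes standing in for the proposed 's' dtype.
s_like = np.array([b"abc", b"caf\xe9"], dtype="S4")
u_arr = np.array(["abc", "caf\u00e9"], dtype="U4")

# Explicit latin-1 casts in both directions.
as_unicode = np.char.decode(s_like, "latin-1")
as_bytes = np.char.encode(u_arr, "latin-1")

# Mixed comparison, done by casting to a common representation first.
mixed_equal = as_unicode == u_arr
print(as_unicode, as_bytes, mixed_equal)
```

A real implementation would presumably perform this promotion implicitly, and would have to decide what happens when a 'U' value contains characters that latin-1 cannot represent.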
Hi all,
Using latin-1 is a pragmatic compromise that provides continuity to allow scientists to run their existing code in Python 3 and have things just work. It isn't perfect and it should not be the end of the story, but it would be good. This single issue is the *only* thing blocking me and my team from using Python 3 in operations.
Since you mentioned HDF compatibility, I would just note that the two string formats HDF5 supports are ASCII and UTF-8, although presently no validation is performed by HDF5 as to the actual contents. This shouldn't discourage anyone from going with Latin-1, but it would mean that h5py (and presumably PyTables) would have to choose from the following options:

1. Convert to UTF-8, and risk truncation
2. Store as ASCII and replace out-of-range characters with "?"
3. Just store the Latin-1 text in a type labelled "ASCII", and live with it.
4. Raise an exception if non-ASCII characters are present

Realistically, h5py might go with (3) as the ASCII type in HDF5 is much abused already.

Andrew
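The four options can be illustrated with plain Python string encoding. This is a sketch only: it does not call h5py or HDF5, and the four-byte field width is an assumption chosen to make the truncation visible.

```python
text = "caf\u00e9"   # "café": one character outside ASCII
field_width = 4      # assumed bytes available in a fixed-width field

# 1. Convert to UTF-8: the text grows to 5 bytes and is cut to fit,
#    here in the middle of a two-byte sequence.
option1 = text.encode("utf-8")[:field_width]

# 2. Store as ASCII, replacing out-of-range characters with "?".
option2 = text.encode("ascii", errors="replace")

# 3. Store the Latin-1 bytes unchanged under the "ASCII" label.
option3 = text.encode("latin-1")

# 4. Raise if any non-ASCII character is present.
try:
    text.encode("ascii")
    option4_error = None
except UnicodeEncodeError as exc:
    option4_error = exc

print(option1, option2, option3, type(option4_error).__name__)
```

Only option 3 preserves the original bytes exactly, which fits the goal of minimizing breakage, at the cost of a label that no longer matches the contents.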
On Mon, Feb 23, 2015 at 11:55 AM, Andrew Collette wrote:
Since you mentioned HDF compatibility, I would just note that the two string formats HDF5 supports are ASCII and UTF-8, although presently no validation is performed by HDF5 as to the actual contents. This shouldn't discourage anyone from going with Latin-1, but it would mean that h5py (and presumably PyTables) would have to choose from the following options:

1. Convert to UTF-8, and risk truncation
2. Store as ASCII and replace out-of-range characters with "?"
3. Just store the Latin-1 text in a type labelled "ASCII", and live with it.
4. Raise an exception if non-ASCII characters are present

Realistically, h5py might go with (3) as the ASCII type in HDF5 is much abused already.

I was working on the assumption that (3) would be the best choice, for the reason you gave and to minimize breakage in transitioning to Python 3.

- Tom
participants (6)
- Aldcroft, Thomas
- Andrew Collette
- Charles R Harris
- Nathaniel Smith
- Robert Kern
- Sturla Molden