>> A compact dtype for mostly-ascii text:
>
> I'm a little confused about exactly what you're trying to do.

Actually, *I* am not trying to do anything here -- I'm the one who said computers are so big and fast now that we shouldn't whine about 4 bytes per character... but this whole conversation started with that request, and I have sympathy: no one likes to waste memory. After all, numpy supports small numeric dtypes, too.

> Do you need your in-memory format for this data to be compatible with anything in particular?

Not for this requirement -- binary interchange is a separate requirement.

> If you're not reading or writing files in this format, then it's just a matter of storing a whole bunch of things that are already python strings in memory. Could you use an object array? Or do you have an enormous number, so that you need a more compact, fixed-stride memory layout?

That's the whole point, yes. Object arrays would be a good solution to the full Unicode problem, but not to the "why am I wasting so much space when all my data are ascii?" problem.

> Presumably you're getting byte strings (with unknown encoding).

No -- this is for creating and using mostly-ascii string data with python and numpy.

Bytes of unknown encoding belong in byte arrays -- they are not text.

I DO recommend Latin-1 as a default encoding, but ONLY for "mostly ascii, with a few extra characters" data. With all the sloppiness over the years, there are way too many files like that.

Note: the primary use-case I have in mind is working with ascii text in numpy arrays efficiently -- folks have called for that. All I'm saying is use Latin-1 instead of ascii -- that buys you some useful extra characters.
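To make the memory argument concrete, here's a quick sketch -- nothing but today's numpy, with the fixed-width bytes dtype standing in for the hypothetical one-byte latin-1 dtype:

    import numpy as np

    text = ["voilà", "naïve", "plain ascii"]  # mostly ascii, a few latin-1 chars

    # What numpy gives you today: fixed-width UCS-4, 4 bytes per character.
    u = np.array(text)                  # dtype '<U11'
    print(u.dtype.itemsize)             # 44 bytes per element

    # The one-byte-per-character layout, faked with the bytes dtype plus
    # explicit latin-1 encoding/decoding at the boundaries:
    b = np.array([s.encode("latin-1") for s in text])   # dtype 'S11'
    print(b.dtype.itemsize)             # 11 bytes per element

    # latin-1 maps bytes 0-255 straight onto U+0000..U+00FF, so the
    # round trip is lossless for this kind of data:
    assert [s.decode("latin-1") for s in b] == text

Same text, a quarter of the memory -- that's all a latin-1 dtype would buy you, minus the manual encode/decode at the edges.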
> If your question is "what should numpy's default string dtype be?", well, maybe default to object arrays;

Or UCS-4.

I think object arrays would be more problematic for npz storage and raw "tostring" dumping. (And pickle?) Not sure how important that is.

And it would be good to have something that plays well with recarrays.

> anyone who just has a bunch of python strings to store is unlikely to be surprised by this. Someone with more specific needs will choose a more specific -- that is, not default -- string data type.

Exactly.

-CHB
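PS: to put a little flesh on the npz worry above -- a rough sketch, assuming recent numpy behavior (the allow_pickle gate on load is newer than the underlying issue):

    import numpy as np

    fixed = np.array(["short", "longer text"])               # '<U11', one flat buffer
    objs = np.array(["short", "longer text"], dtype=object)  # array of pointers

    # A fixed-width array *is* its buffer: .tobytes() (the old .tostring())
    # gives bytes you can dump to disk and reconstruct exactly:
    raw = fixed.tobytes()
    again = np.frombuffer(raw, dtype=fixed.dtype)

    # An object array's buffer is just pointers, so the same raw dump is
    # meaningless, and .npz storage has to fall back to pickle -- which
    # recent numpy makes you opt into when loading:
    np.savez("strings.npz", fixed=fixed, objs=objs)
    back = np.load("strings.npz", allow_pickle=True)["objs"]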