`to_file()` method for strings

As a social scientist trying to help other social scientists move from languages like R, Stata, and Matlab into Python, one of the behaviors I've found unnecessarily difficult to explain is the "file.open()/file.close()" idiom (or, alternatively, context managers). In most operating systems and many high-level languages, saving is a one-step operation. I understand there are situations where an open file handle is useful, but it seems a simple `to_file` method on strings (essentially wrapping a context manager) would be really nice, as it would save users from learning this idiom. Apparently there's something like this in the pathlib library (https://docs.python.org/dev/library/pathlib.html#pathlib.Path.write_text), but I suspect most people have no idea about that method (I've been doing Python for years and this has always been a personal frustration, and I've asked several others for better options and no one had any to offer), and it seems like it would make much more sense as a string method. If someone has a string they want to save to disk, I can't imagine them looking in the pathlib library. I respect the desire to avoid bloat -- the context-manager or open/close idiom has just always felt unnecessarily complicated (dare I say unpythonic?) for a common task. Thanks! Nick
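(For concreteness, a minimal sketch of the behaviour being proposed, written as a plain helper function since str itself can't be extended from user code; the name and the utf-8 default are illustrative assumptions:)

    def to_file(s, path, encoding='utf-8', errors='strict'):
        # Wrap the open/write/close idiom in a single call,
        # as the proposed str.to_file() would.
        with open(path, 'w', encoding=encoding, errors=errors) as f:
            f.write(s)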

On Tue, Mar 22, 2016 at 11:06 PM, Nick Eubank <nickeubank@gmail.com> wrote:
it seems a simple `to_file` method on strings (essentially wrapping a context-manager) would be really nice
-1. It is a rare situation when you would want to write just a single string to a file. In most cases you write several strings and/or do other file operations between opening and closing a file stream. The proposed to_file() method may become an attractive nuisance leading to highly inefficient code. Remember: opening or closing a file is still, in most setups, a mechanical operation that involves moving macroscopic physical objects, not just electrons.

On Tue, Mar 22, 2016 at 11:32 PM, Nick Eubank <nickeubank@gmail.com> wrote:
I output "single strings" to LaTeX all the time.
If you do this in an interactive session, take a look at IPython. Its %%writefile [1] magic command is not quite what you want, but close. You may also have a better chance getting a new feature into IPython than here. If you do that from a script -- stop doing that -- and your script will run faster. [1] https://ipython.org/ipython-doc/3/interactive/magics.html#cellmagic-writefil...

On Mar 22, 2016, at 20:29, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
I do it all the time in other languages when dealing with smallish files. Python's very nice file-object concept, its slant toward iterator-based processing, and its amazingly consistent ecosystem mean that the same issues don't apply, so I'd rarely do the same thing. But for users migrating to Python from another language, or using Python occasionally while primarily using another language, I can see it being a lot more attractive. Also, you're neglecting the atomic-write issue. Coming up with a good API for an iterative atomic write is hard; for single-string write-all, it's just an extra flag on the function.
In most cases you write several strings and or do other file operations between opening and closing a file stream. The proposed to_file() method may become an attractive nuisance leading to highly inefficient code. Remember: opening or closing a file is still in most setups a mechanical operation that involves moving macroscopic physical objects, not just electrons.
All the more reason to assemble the whole thing in memory and write it all at once, rather than streaming it out. Then you don't have to worry about how good the buffering is, what happens if there's a power failure (or just an exception, if you don't use with statements) halfway through, etc. It's definitely going to be as fast and as safe as possible if all of the details are done by the stdlib instead of user code. (I trust Python's buffering in 3.6, but not in 2.x or 3.1--and I've seen people even in modern 3.x try to "optimize" by opening files in raw mode and writing 7 bytes here and 18 bytes there, which is going to be much slower than concatenating onto a buffer and writing blocks at a time...)
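(A small illustration of the assemble-in-memory-then-write-once pattern described above; the file name and contents are made up:)

    parts = ["record %d\n" % i for i in range(1000)]  # build everything in memory
    with open("out.txt", "w", encoding="utf-8") as f:
        f.write("".join(parts))                       # one buffered write, then close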

Thanks for the thoughts both! I'm not opposed to `a = str.read_file()` -- it does require knowing classes to grok it, but it's super readable and intuitive to look at (i.e. pythonic?). Regarding bytes, I was thinking `to_file()` would include a handful of arguments to support unusual encodings or bytes, but leaving the default to utf-8 text. On Tue, Mar 22, 2016 at 8:52 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Tue, Mar 22, 2016 at 11:51 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Then you don't have to worry about how good the buffering is, what happens if there's a power failure ...
You are right: it is not uncommon for data scientists to prefer losing all the computed data rather than the last few blocks in this case. :-)

On Mar 22, 2016, at 21:32, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
The question is whether it's better to have the complete previous version of the data, or 21% of the new version of the data.[1] And the answer is different in different cases, which is why atomic pretty much has to be an option (or a separate function), not always-on. But I think on by default makes sense. --- [1]: In fact, with delete=False and a sensible pattern, atomic actually gives you the best of both worlds: the previous version of the data is still there, and 21% of the new version is in a tempfile that you can recover if you know what you're doing...

Andrew Barnert via Python-ideas writes:
It occurs to me that we already have a perfectly appropriate builtin for the purpose anyway: print. Add a filename= keyword argument, and make use of both file= and filename= in the same call an error. If that's not the right answer, I don't see how str.to_file() can possibly be better.
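(A sketch of the semantics such a filename= argument might have, expressed as a hypothetical wrapper; the utf-8 default here is an assumption, not part of the suggestion:)

    def print_to_file(*args, filename, **kwargs):
        # One-step print: open, write, close -- what filename= would imply.
        with open(filename, 'w', encoding='utf-8') as f:
            print(*args, file=f, **kwargs)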

On 24 March 2016 at 18:02, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Well, with print("A string", filename='foo.txt') what encoding should be used? What if the user wants control of the encoding (on Windows, it's far from uncommon to want UTF-8 instead of the system default, for example)? What about error handlers? The same questions apply for str.to_file() of course, although having extra arguments is likely to be less of an issue there. And yes, at some point the answer is "if the convenience function doesn't suit your needs, you can fall back to explicit file handling". But the trick is finding the *right* balance of convenience. And having a "simple" way of writing a string to a file that doesn't at least hint to the user that they should be thinking about encodings seems to me like something of an attractive nuisance. Paul

On Mar 24, 2016, at 11:02, Stephen J. Turnbull <stephen@xemacs.org> wrote:
That's not a bad idea. However, it's obvious how to extend str.to_file() to reads (a str.from_file() classmethod), binary data (bytes.to_file()), atomic writes (str.to_file(path, atomic=True)), and non-default encodings (str.to_file(path, encoding='Latin-1', errors='replace')). With print, those are all less obvious. Maybe input() for reads, but how can it handle bytes? And, while you could toss on more parameters to print() for some of these, it already has a lot, and adding new parameters that are only legal if filename is specified seems like it'll make the docs even harder to follow. Of course if the answers are "we don't want any of those", then being more limited is a good thing, not a problem. :)

I did a quick survey of other languages, cross-language frameworks, etc. to see what they offer, and I was surprised. To look at my quick survey (or correct me! I wasn't super-careful here, and may easily have taken misleading StackOverflow posts or the like at face value): https://goo.gl/9bJcAf Anyway, the original proposal in this thread is very close to Cocoa, but that turns out to be pretty uncommon. Most frameworks use free functions, not methods on either string or path objects. Most don't offer an option for atomic writes or exclusive locks--although some do them whether you want it or not. Most have an option for appending. All but one either let you specify the encoding or don't do text. Almost none of those were what I expected.

Also, Python's strings are immutable, so we really don't want to encourage people to build up a big string in memory anyway. And what's wrong with:

open(a_path, 'w').write(the_string)

Short, simple one-liner. OK, non-CPython garbage collection may leave that file open and dangling, but we're talking about the quick-scripting data-analysis type user -- the script will terminate soon enough. BTW, numpy does offer one-stop ways to save and load arrays to a file, binary or text -- and that's a lot more useful than a simple string, especially the reading. Oh, and for reading:

string = open(path).read()

I really don't get the point of all this. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 25 March 2016 at 08:07, Chris Barker <chris.barker@noaa.gov> wrote:
One of the few downsides of Python's popularity as both a scripting language and an app development language is that a lot of tutorials are written for the latter, and in app development, relying on the GC for external resource cleanup isn't a great habit to get into. As a result, tutorials will introduce the deterministic cleanup form, and actively discourage use of the non-deterministic cleanup.

Independent of the "Why not just rely on the GC?" question, though, we're also forcing the user to upgrade their mental model to achieve their objective.

User model: "I want to save this data to disk at a particular location and be able to read it back later"

By contrast, unpacking the steps in the one-liner:

- open the nominated location for writing (with implied text encoding & error handling)
- write the data to that location

It's that switch from a 1-step process to a 2-step process that breaks flow, rather than the specifics of the wording in the code (Python 3 at least improves the hidden third step in the process by having the implied text encoding typically be UTF-8 rather than ASCII).

Formulating the question this way does suggest a somewhat radical notion, though: what if we had a JSON-based save builtin that wrote UTF-8 encoded files based on json.dump()? That is, "save(a_path, the_string)" would attempt to do:

with open(a_path, 'w', encoding='utf-8', errors='strict') as f:
    json.dump(the_string, f)

While a corresponding "load(a_path)" would attempt to do:

with open(a_path, 'r', encoding='utf-8', errors='strict') as f:
    return json.load(f)

The format of the created files would be using a well-defined standard rather than raw data dumps (as well as covering more than just pre-serialised strings), and the degenerate case of saving a single string would just have quotation marks added to the beginning and end. If we later chose to define a "__save__" and "__load__" protocol, then json.dump/load would also be able to benefit.

There'd also be a potential long term security benefit here, as folks are often prone to reaching for pickle to save data structures to disk, which creates an arbitrary code execution security risk when loading them again later. Explicitly favouring the less dangerous JSON as the preferred serialisation format can help nudge people towards safer practices without forcing them to consciously think through the security implications.

Switching from the save/load builtins to manual file management would then be a matter of improving storage efficiency and speed of access, just as switching to any other kind of persistent data store would be (and a JSON-based save/load would migrate very well to NoSQL-style persistent data storage services).

Cheers, Nick.

P.S. I'm going to be mostly offline until late next week, but found this idea intriguing enough to share before I left. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

Thanks Nick -- I really couldn't have said it better than "we're also forcing the user to upgrade their mental model to achieve their objective." Python is beautiful in part because it takes all sorts of idioms that are complicated in other languages and wraps them up in something simple and intuitive ("in" has to be my favorite builtin of all time). This feels like one of those few cases where Python still feels like C, and it's not at all clear to me why that needs to be the case. On Thu, Mar 24, 2016 at 7:22 PM Nick Coghlan <ncoghlan@gmail.com> wrote:

On Mar 24, 2016 10:22 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
I was going to suggest that, if str save/load were to be implemented, it could be made as a parallel to json.dump and json.load (as string module members?). Any new features like atomicity would be "backported" to json.dump/load. Same with pickle.load/dump. It's just str serialization, isn't it? If there is a save/load builtin (or dump/load), I don't see what's so special about JSON. (YAML is more like Python, including in readability.) I'd suggest that you could specify a "format" parameter (or with another name) to use it with pickle, json, yaml, etc., but that means two ways of doing things. Deprecate the original way?

import json
load("data.json", format=json)

Without a format kwarg, Python might look at the file extension and try to guess the format. Modules could register extensions (is this magic?), or it could look for an object with the same name which has __dump__/__load__ (is this VERY magic?). I don't like either: you have to import pickle to load it, but you wouldn't have to name it explicitly to use it. I'd hate `import pickle; pickle.register(".p")`: this thread is for scripters and not application devs, so wordiness matters. Your proposal means I don't even have to import json to use json, and json needs explaining to a data scientist (who won't typically have web dev experience).
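(To make the extension-registration idea concrete, a sketch; every name here is hypothetical:)

    import os

    _loaders = {}  # maps a file extension to a module with a load() function

    def register(extension, module):
        _loaders[extension] = module

    def load(path, format=None):
        # An explicit format wins; otherwise guess from the file extension.
        if format is None:
            format = _loaders[os.path.splitext(path)[1]]
        with open(path, 'r', encoding='utf-8') as f:
            return format.load(f)

    # e.g.: import json; register('.json', json); data = load('data.json')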

On Sat, Mar 26, 2016 at 3:30 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
The thing that's special about JSON is that nearly everyone recognizes the name. Thanks to JSON's extensive use in networked systems (courtesy of web browser <-> web server communication), pretty much everyone will be able to grok bidirectional JSON handling. Maybe YAML would be better, but if you publish something that uses JSON, nobody will be confused. Not sure that's enough justification to make it the default, but it is significant.
You can slap those lines into sitecustomize.py though. Or third party Python distributors could slap those lines into sitecustomize.py, and you won't even be aware that stuff had to be registered. But I don't like the registration of extensions as a means of selecting an output format, generally. Too magical. ChrisA

On Mar 24, 2016, at 7:22 PM, Nick Coghlan : what if we had a JSON-based save builtin that wrote UTF-8 encoded files based on json.dump()?
I've been thinking about this for a while, but would rather have a "pyson" format -- i.e., Python literals, rather than JSON. This would preserve the tuple vs list and integer vs float distinctions, and allow more options for dictionary keys (and sets?). Granted, you'd lose the interoperability, but for the quick saving and loading of data, it'd be pretty nice. There is also jsonpickle: https://jsonpickle.github.io Though as I understand it, it has the same security issues as pickle. But could we make a not-quite-as-complete pickle-like protocol that could save and load arbitrary objects, without ever running arbitrary code? -CHB

Le 28/03/2016 17:30, Chris Barker - NOAA Federal a écrit :
If it's for quick data saving, security is not an issue, since the data will never come from an attacker if you're writing a quick script. For other needs, where security is an issue, having a one-liner to dump some serialization is not going to make much of a difference.

On Mon, Mar 28, 2016 at 8:44 AM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
If it's for quick data saving, security is not an issue, since the data will never come from an attacker if you're writing a quick script.
That's why we have pickle already -- but a nice human-readable and editable form would be nice... Also -- when it comes to security, it's a tough one -- people DO start with one thing, thinking "I will always trust this source" (if they think about it at all), then later expand the system to be a web service, or ... and oops!
For other needs, where security is an issue, having a one-liner to dump some serialization is not going to make much of a difference.
No -- I kind of mixed topics here -- my "safe json serialization" would be for web services, configuration, etc. -- where security matters, but quick one-liner access is not so important -- though why not have one thing for multiple uses? -CHB

On 29 March 2016 at 01:44, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
"These files will never be supplied or altered by an attacker" is the kind of assumption that has graced the world with such things as MS Office macro viruses. That means that as Python makes more inroads into the traditional territory of MS Excel and other spreadsheets, ensuring we encourage a clear distinction between code (which is always dangerous to trust) and data (which *should* be safe to read, aside from processing capacity limits) becomes increasingly important. If we ever did something like this, then Chris's suggestion of a Python-specific format that can be loaded from a string via ast.literal_eval() rather than using JSON likely makes sense [1], but it would also be appropriate to revisit that idea first as a project outside the standard library for ad hoc data persistence, before proposing it for standard library inclusion. Cheers, Nick. [1] https://code.google.com/archive/p/pyon/ is a project from several years ago aimed at that task. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

As with my previous email about __main__ for random, uuid and os, I wish to suggest a similar __main__ for datetime. The target audience may not be the same, so I'm making it a separate proposal. E.g.:

python -m datetime now => print(str(datetime.datetime.now()))
python -m datetime utcnow => print(str(datetime.datetime.utcnow()))
python -m time epoch => print(time.time())
python -m datetime now "%d/%m/%Y" => print(datetime.datetime.now().strftime("%d/%m/%Y"))
python -m datetime utcnow "%d/%m/%Y" => print(datetime.datetime.utcnow().strftime("%d/%m/%Y"))
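(A sketch of what such a datetime.__main__ could look like; the command handling is illustrative only:)

    import sys
    from datetime import datetime

    def main(argv):
        # argv[0] is the command name, argv[1] an optional strftime format.
        command = argv[0]
        fmt = argv[1] if len(argv) > 1 else None
        dt = datetime.utcnow() if command == 'utcnow' else datetime.now()
        print(dt.strftime(fmt) if fmt else dt)

    if __name__ == '__main__':
        main(sys.argv[1:])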

Note: new subjects should be posted as new threads. This was posted as a response in the unrelated "`to_file()` method for strings" thread and will not be seen by anyone who has killed that thread or who has threads collapsed. On 4/1/2016 2:02 PM, Michel Desmoulin wrote:
As with my previous email about __main__ for random, uuid and os, I wish to suggest a similar __main__ for datetime.
File __main__ is for packages. I believe none of the modules you mention are packages, so I presume you mean adding within the module something like

def main(args):
    ...

if __name__ == '__main__':
    from sys import argv
    main(argv[1:])

or the equivalent in C.
What is the particular motivation for this package? Should we add a command line interface to math? So that python -m math func arg => print(func(arg)) ? I am wondering what is or should be the general policy on the subject. How easy is this for C-coded modules? -- Terry Jan Reedy

On 01/04/2016 19:02, Michel Desmoulin wrote:
Not very funny for 1st April. Would you care to have another go? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Fri, Apr 1, 2016 at 2:02 PM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
+1 In fact, I would not mind seeing a fairly complete GNU date utility [1] reimplementation as datetime.__main__. [1] http://www.gnu.org/software/coreutils/manual/html_node/date-invocation.html#...

On Sat, Apr 2, 2016 at 1:49 AM Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
Seems like Microsoft is solving this problem for you. http://www.hanselman.com/blog/DevelopersCanRunBashShellAndUsermodeUbuntuLinu...

[1] https://code.google.com/archive/p/pyon/ is a project from several years ago aimed at that task.
Thanks for the link -- I'll check it out. - Chris

On Thu, Mar 24, 2016 at 7:22 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
No, it's not. Though now that I think about it, while I understand that context managers are great as a universal and flexible way to, well, manage context, for file objects themselves:

open('a_file', 'w').write(some_stuff)

is really nice -- while we don't want to make reference-counting garbage collection a required part of the language in general, I wonder if it would be practical to make it part of the spec for SOME objects -- i.e., kind of like you can define a given object to be a context manager, you could define certain objects to clean up after themselves when they go out of scope. It would sure be nice in this example; it would buy us that simple syntax, the atomic operation, etc... I have not thought this through AT ALL, but I wonder if one could avoid requiring a reference-counting system by catching at the parsing stage that the object created is never assigned to anything -- and thus you know it's going to go out of scope as soon as that line is finished processing. In fact, if you have a non-reference-counting garbage collection system, it might be helpful to special-case all objects with very short lives, so you don't pile up a bunch of temporaries that have to be collected later.... -CHB

I wonder how reasonable it would be to add a new keyword argument to open that would .close() the file object upon a single write/read. Consider:

data = open("foo.txt", "r", close_after_use=True).read()

It doesn't rely on garbage collection to work properly, is a fancy one-liner, and is obvious that you can only read or write from/to it once. Could use a better keyword for that, though. -Emanuel ~ If it doesn't quack like a duck, add a quack() method ~
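(One conceivable implementation of the flag, as a wrapper around an ordinary file object -- entirely hypothetical:)

    class _CloseAfterUse:
        """Proxy that closes the underlying file after the first read/write."""
        def __init__(self, f):
            self._f = f
        def read(self, *args):
            try:
                return self._f.read(*args)
            finally:
                self._f.close()
        def write(self, data):
            try:
                return self._f.write(data)
            finally:
                self._f.close()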

On Mon, Mar 28, 2016 at 8:54 AM, Émanuel Barry <vgr255@live.ca> wrote:
It makes it a tiny bit shorter than using "with", but doesn't solve Nick's mental model issue -- the user still needs to be thinking about the fact that they are creating a file object and that it needs to be closed when you are done with it. And I'm not sure how you would define "use" -- any I/O operation? I.e.:

infile = open("foo.txt", "r", close_after_use=True)
first_line = infile.readline()

Now what is the state of infile? A closed file object? This is exactly why context managers were introduced:

with open("foo.txt", "r", close_after_use=True) as infile:
    first_line = infile.readline()
something_else...

Now it's pretty clear that infile is no longer a useful object. -CHB

On Mon, Mar 28, 2016 at 8:54 AM, Émanuel Barry <vgr255@live.ca> wrote: I wonder how reasonable it would be to add a new keyword argument to open that would .close() the file object upon a single write/read. Consider:

data = open("foo.txt", "r", close_after_use=True).read()

It makes it a tiny bit shorter than using "with", but doesn't solve Nick's mental model issue -- the user still needs to be thinking about the fact that they are creating a file object and that it needs to be closed when you are done with it. And I'm not sure how you would define "use" -- any I/O operation? I.e.:

infile = open("foo.txt", "r", close_after_use=True)
first_line = infile.readline()

Now what is the state of infile? A closed file object?

Sure. The keyword really could benefit from a better name, but it might solve some part of the issue. It doesn't solve the mental model issue, though; to be fair, open() should always be wrapped in a context manager (IMO, anyway). -Emanuel

This is exactly why context managers were introduced:

with open("foo.txt", "r", close_after_use=True) as infile:
    first_line = infile.readline()
something_else...

Now it's pretty clear that infile is no longer a useful object. -CHB

On Mon, Mar 28, 2016 at 9:32 AM, Émanuel Barry <vgr255@live.ca> wrote:
Then we don't need anything else :-) But the name of the keyword is not the point here -- the issue is what it means to "use" a file object. Using "with" lets the user clearly define when they are done with the object, and if a reference to the object is not stored anywhere, then you can be sure the user doesn't expect to be able to use it again. But:

data = open("foo.txt", "r", close_after_use=True).read()

and

infile = open("foo.txt", "r", close_after_use=True)
data = infile.read()

look exactly the same to the file object itself; it has no idea when the user is done with it. In CPython, the reference counter knows that the file object has no references to it when that first line is done running, so it can clean up and delete the object -- but without a reference-counting system, who knows when it will get cleaned up? -CHB

On Mon, Mar 28, 2016 at 12:40 PM, Chris Barker <chris.barker@noaa.gov> wrote:
It is my understanding that under the "close_after_use" proposal, infile will be closed by .read() in both cases. Still, I don't like this idea. It puts the action specification (close) too far from where the action is taken. The two-line example already makes me uncomfortable; imagine if infile is passed through several layers of function calls before infile.read() is called. I would rather see read_and_close() and write_and_close() convenience methods for file objects:

data = open("foo.txt").read_and_close()
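(Those convenience methods are easy to prototype as plain helper functions today, e.g.:)

    def read_and_close(f):
        # Read everything, then close -- the proposed file-object
        # method expressed as a function.
        try:
            return f.read()
        finally:
            f.close()

    data = read_and_close(open("foo.txt"))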

On Tue, Mar 29, 2016 at 2:45 AM, Chris Barker <chris.barker@noaa.gov> wrote:
The whole point of the context manager is that it defines the notion "go out of scope". But if you want the simple syntax, the easiest way is to create a helper:

def write_file(filename, contents):
    with open(filename, "w") as f:
        f.write(contents)

write_file('a_file', some_stuff)

Or, if you're writing a quick one-off script, just use the line you have above, and don't worry about guaranteed closing. What's the worst that can happen? AIUI buffered data will be written out at process termination if not before, and of course the file will be closed. The main case where you absolutely must close the file promptly is when you plan to then open it again (in the same or another process), and a quick script will usually know if that's going to be a possibility. (How bad can it be? I suppose https://xkcd.com/292/ level bad. But not usually.) ChrisA

On Mon, Mar 28, 2016 at 8:55 AM, Chris Angelico <rosuav@gmail.com> wrote:
The whole point of the context manager is that it defines the notion "go out of scope".
Of course it does -- very useful, but a lot of extra for a simple "read this file then close it right away", for the casual scripting user :-)

So I guess this comes around to the OP's point -- a helper like that should be built in, and easy and obvious to the scripting user. I don't think it should be a string method, but...

Or, if you're writing a quick one-off script, just use the line you have above, and don't worry about guaranteed closing.

And that comes back around to my original post on this thread -- we should just tell scripting users to do that :-) But I think it was Nick's point that Python is used for both scripting and "real" system development, so it's hard to communicate best practices for both... Given that, I'm kind of liking "write_file". It solves Nick's "mental model issue": "I want to write some stuff to a file", rather than "I want to create a file object, then write stuff to it". -CHB

Just to make sure I follow: Chris, you're proposing a modification to garbage collection to clean up `open()` if `open()` is used but not closed and not assigned to anything, so that `open(file).write(stuff)` is safe? That sounds brilliant to me, and totally meets my goal of having a "one-step" method for writing to disk that doesn't require users to develop a two-step mental model. If it's feasible (I have nothing but the most basic understanding of how the garbage collector works), that would totally obviate my desire to have a `to_file()` method on strings. On Mon, Mar 28, 2016 at 8:46 AM Chris Barker <chris.barker@noaa.gov> wrote:

Just to make sure I follow: Chris, you're proposing a modification to garbage collection to clean up `open()` if `open()` is used but not closed and not assigned to anything, so that `open(file).write(stuff)` is safe?

Well, I wouldn't call it a proposal, more a random inspiration that popped up when thinking about this thread.

That sounds brilliant to me, and totally meets my goal of having a "one-step" method for writing to disk that doesn't require users to develop a two-step mental model. If it's feasible (I have nothing but the most basic understanding of how the garbage collector works), that would totally obviate my desire to have a `to_file()` method on strings.

IIUC, in the current CPython implementation, you do indeed have that -- I've used it for years, long before context managers existed, and still do for quickie scripts. CPython uses a reference-counting scheme: each time an object is referenced, its count is increased; with each lost reference, it is decreased. When the count goes to zero, the object is deleted. So in:

data = open(filename).read()

the file object is created and given a refcount of one. Then the read() method is called, creating a string, which is then bound to the name data. When the next line is reached, the refcount of the file object is reduced to zero, and the object is deleted -- and the internal file pointer is closed before deleting the object.

The "trick" is that the Python language spec does not require this kind of garbage collection. So Jython, or PyPy, or IronPython may not clean up that file object right away; it could hang around in an open and unflushed state until the garbage collector gets around to cleaning it up. But for short scripts, it'll get cleaned up at the end of the script anyway. The thing is that while I, at least, think it's a fine practice for scripting, it's really not a good idea for larger system development. Thus the general advice to use "with".

As for any proposal, it dawned on me that while we don't want Python to require any particular garbage-collecting scheme, I wonder if it would be possible (or desirable) to specify that temporary objects used on one line of code, and never referenced any other way, get deleted right away. It's actually very common to create a lot of temporaries on a line of code -- when you chain operations or methods. So it might be a small performance tweak worth doing (or it may not :-) ) -CHB
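(The refcounting behaviour described above is easy to observe in CPython -- though, as noted, other implementations make no such promise:)

    class Probe:
        def __del__(self):
            print("collected")

    Probe()                  # CPython: "collected" is printed immediately,
    print("next statement")  # before this line runs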

Brevity may be the soul of wit, but I am completely happy without a convenience function to open and read/write, then close a file. In production, the ``with`` statement seems concise enough. If I'm messing around at the prompt we can already write ``text = open(filename).read()``. We can also write ``open(filename, 'w').write(text)`` and not worry about closing the file in the basic Python prompt. On Mon, Mar 28, 2016 at 8:49 PM Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:

[Sorry, Nick, didn't mean to just send it to you.] On 03/28/2016 11:16 AM, Nick Eubank wrote:
Even if he is, I don't see it happening. Modifying the garbage collector to act sooner on one particular type of object when we already have good, if very mildly inconvenient, solutions is not going to be a high-priority item -- especially if it has /any/ negative impact on performance. -- ~Ethan~

On Tue, Mar 29, 2016 at 5:06 AM, Eric V. Smith <eric@trueblade.com> wrote:
Again -- this is a thought experiment, not a proposal, but: it's only relevant for other implementations -- CPython already cleans up the file object (any object) when it is no longer referenced. And to move the thought experiment along -- it was never specific to file objects, but rather: in the case of temporary objects that are never referred to outside a single line of code, delete them right away. In theory, this could help all implementations leave a little less garbage lying around. Whether that is possible or desirable, I have no idea. -CHB

On Tue, 29 Mar 2016 at 15:11 Chris Barker <chris.barker@noaa.gov> wrote:
It's not desirable to dictate that various Python implementations must make sure their garbage collector supports direct collection like this for this specific case rather than staying general. It would very much be a shift in the definition of the Python project, where we have purposefully avoided dictating how objects need to behave in terms of garbage collection. -Brett

On Wed, Mar 30, 2016 at 9:10 AM, Chris Barker <chris.barker@noaa.gov> wrote:
The problem is that there's no way to be sure. For instance, compare these lines of code:

from threading import Thread
open("marker", "w").write("Ha")
Thread(target=threadfunc).start()

One of them has finished with the object completely. The other most certainly has not. Python prefers, in the case of ambiguity, to require explicit directives. ChrisA

On Wed, Mar 30, 2016 at 3:33 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
There'll be another reference somewhere. Presumably a new stack is allocated or something, and that's what keeps a reference. It doesn't matter _what_ keeps the ref; all that matters is that there is one, and that the code line itself doesn't show that. Hence you need to explicitly say "open this, and close it when you're done". ChrisA

On Mar 22, 2016, at 20:06, Nick Eubank <nickeubank@gmail.com> wrote:
As a social scientist trying to help other social scientists move from languages like R, Stata, and Matlab into Python, one of the behaviors I've found unnecessarily difficult to explain is the "file.open()/file.close()" idiom (or, alternatively, context managers). In most operating systems and many high-level languages, saving is a one-step operation.
I understand there are situations where an open file handle is useful, but it seems a simple `to_file` method on strings (essentially wrapping a context manager) would be really nice, as it would save users from learning this idiom.
Funny, I've seen people looking for a one-step file-load more often than file-save. But really, the two go together; any environment that provided one but not the other would seem strange to me. The question is, if you stick the save as a string method s.to_file(path), where do you stick the load? Is it a string class method str.from_file(path)? That means you have to teach novices about class methods very early on... Alternatively, they could both be builtins, but adding two new builtins is a pretty heavy-duty change. Anything else seems like it would be too non-parallel to be discoverable by novices or remembered by occasional Python users. I'd assume you'd also want this on bytes (and probably bytearray) for dealing with binary files.

Anyway, another huge advantage of this proposal is that it can handle atomic writes properly. After all, even novices can quickly learn this:

with open(path, 'w') as f:
    f.write(s)

... but how many of them can even understand this:

with tempfile.NamedTemporaryFile('w', dir=os.path.dirname(path), delete=False) as f:
    f.write(s)
    f.flush()
    os.replace(f.path, path)

(That's assuming I even got the latter right. I'm on my phone right now, so I can't check, but I wouldn't be surprised if there's a mistake there...)

One last thing: I think many of Cocoa, .NET, Qt, Java, etc. have a similar facility. And some of them come from languages that force everything to be a method rather than a function, which means they'd have good answers for the "where does from_file go?" question. Any serious proposal should probably survey those frameworks (along with the scientific systems--and probably how NumPy compares to the latter).

Actually, one _more_ last thing. Having this as a Path method makes sense; your problem is that novices don't discover pathlib, and experts don't use pathlib because it's not usable in most non-trivial cases (i.e., you can't use it even with stdlib modules like zipfile, much less third-party libs, without explicitly "casting" Path and str back and forth all over the place). Assuming that the long-term goal is to get everyone using pathlib, rather than just abandoning it as a stdlib fossil of a nice idea that didn't work out, copying its useful features like read_text and write_text out to builtins could be taken as working against that long-term goal. (Also compare Cocoa, which quasi-deprecated hundreds of methods, and then actually deprecated and removed some, to get everyone to switch to using NSURL for all paths.)

On Tue, Mar 22, 2016 at 11:33 PM, Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
This would be a reasonable addition as a pathlib.Path method.
(That's assuming I even got the latter right. I'm on my phone right now, so I can't check, but I wouldn't be surprised if there's a mistake there...)
You've got it wrong, but I understand what you tried to achieve. Note that the "write to temp and move" trick may not work if your /tmp and your path are mounted on different filesystems. And with some filesystems it may not work at all, but I agree that it would be nice to have a state of the art atomic write method somewhere in stdlib.
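(For reference, one conservative way to spell such an atomic write with today's stdlib -- a sketch only: permissions and xattrs are ignored, and the temp file is created next to the target so the rename stays on one filesystem:)

    import os
    import tempfile

    def atomic_write_text(path, text, encoding='utf-8'):
        directory = os.path.dirname(os.path.abspath(path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        try:
            with os.fdopen(fd, 'w', encoding=encoding) as f:
                f.write(text)
                f.flush()
                os.fsync(f.fileno())  # push the data to disk before the rename
            os.replace(tmp, path)     # atomic replace-or-fail (Python 3.3+)
        except BaseException:
            os.unlink(tmp)            # on any failure, remove the temp file
            raise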

On Wed, Mar 23, 2016 at 3:22 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
It's specifically selecting a directory for the temp file, so it ought to work. However, I'm not certain in my own head of the interaction between NamedTemporaryFile with delete=False and os.replace (nor of exactly how the latter differs from os.rename); what exactly happens when the context manager exits here? And what happens if there's an exception in the middle of this and stuff doesn't complete properly? Are there points at which this could (a) destroy data by deleting without saving, or (b) leave cruft around? This would be very nice to have as either stdlib or a well-documented recipe. ChrisA

On Mar 22, 2016, at 21:40, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
That's why I used os.replace, not os.rename. Which _is_ guaranteed to either replace atomically or do nothing and fail (and works with the most important filesystems on Windows 2003+ and every major POSIX OS). So that's not the problem either. The "delete=False" is also not a problem. Doing the replace inside the NamedTemporaryFile may be a problem on Windows. Or maybe it was doing it right after a flush without something else in between? I forget. Whatever it is, I have a solution in the library that I use (which also emulates os.replace on Python 2.6-7 and 3.1-2), so I don't have to remember how to solve it every time. One argument against adding it to write_text (besides the argument that novices don't discover pathlib and experts don't use it for other reasons) is that I think the one obvious way to do it should be atomic by default for novices, but that would be a breaking API change to the pathlib function. But if that were the way to get atomic writes in the stdlib, I'd start using pathlib for them.

On Wed, Mar 23, 2016 at 12:40:56AM -0400, Alexander Belopolsky wrote:
I don't think that Python can guarantee to do more than the file system. The best Python can do is promise to do an atomic write if the OS and file system support it, otherwise to be no worse than what the OS and file system support. -- Steve

On Tue, Mar 22, 2016 at 9:33 PM, Chris Angelico <rosuav@gmail.com> wrote:
Also: cross-platform support (Windows Is Different), handling of permissions, do you care about xattrs?, when you say "atomic" then do you mean atomic WRT power loss?, ... it's not actually 100% clear to me that a one-size-fits-all atomic write implementation is possible. -n -- Nathaniel J. Smith -- https://vorpus.org

On Mar 22, 2016, at 21:49, Nathaniel Smith <njs@pobox.com> wrote:
I know a lot of people who never touch Windows with a 10-foot pole think this problem is still unsolved, but that's not true. Microsoft added sufficient atomic-replace APIs in 2003 (in the Win32 API only, not in crt/libc), and as of 3.3, Python's os.replace really is guaranteed to either atomically replace the file or leave it untouched and raise an exception on all platforms (as long as the files are on the same filesystem, and as long as there's not an EIO or equivalent because of an already-corrupted filesystem or a physical error on the media). (For platforms besides Windows and POSIX, it does this just by not existing and raising a NameError...) Likewise for safe temporary files--as of 3.3, tempfile.NamedTemporaryFile is safe on every platform where it exists, and that includes Windows.
handling of permissions, do you care about xattrs?
That can be handled effectively the same way as copy vs. copy2 if desired. I don't know if it's important enough, but if it is, it's easy. (My library actually does have options for preserving different levels of stuff, but I never use them.)
when you say "atomic" then do you mean atomic WRT power loss?
Write-and-replace is atomic WRT both exceptions and power loss. Until the replace succeeds, the old version of the file is still there. This is guaranteed by POSIX and by Windows. If the OS can't offer that on some filesystem, it won't let you call os.replace.
, ... it's not actually 100% clear to me that a one-size-fits-all atomic write implementation is possible.
At least a one-size-fits-all-POSIX-and-post-XP-Windows solution is possible, and that's good enough for many things in the stdlib today. If atomic writes had "availability: Unix, Windows", like most of the os module, I think that would be fine.

If you are interested by "atomic write into a file", see the practical issues to implement a portable function: http://bugs.python.org/issue8604 Victor

On Mar 23, 2016 1:13 AM, "Andrew Barnert" <abarnert@yahoo.com> wrote:
On Mar 22, 2016, at 21:49, Nathaniel Smith <njs@pobox.com> wrote:
On Tue, Mar 22, 2016 at 9:33 PM, Chris Angelico <rosuav@gmail.com> wrote:
I know a lot of people who never touch Windows with a 10-foot pole think this problem is still unsolved, but that's not true. Microsoft added sufficient atomic-replace APIs in 2003 (in the Win32 API only, not in crt/libc), and as of 3.3, Python's os.replace really is guaranteed to either atomically replace the file or leave it untouched and raise an exception on all platforms (as long as the files are on the same filesystem, and as long as there's not an EIO or equivalent because of an already-corrupted filesystem or a physical error on the media). Likewise for safe temporary files--as of 3.3, tempfile.NamedTemporaryFile is safe on every platform where it exists, and that includes Windows.
Ah, thanks! I indeed didn't know about this, and missed that the code was calling os.replace rather than os.rename.
Right, but this is the kind of thing that makes me worry about a one-size-fits-all solution :-).
POSIX doesn't guarantee anything whatsoever over power loss. Individual filesystem implementations make somewhat stronger guarantees, but it's a mess: http://danluu.com/file-consistency/

At the very least, atomicity requires fsync'ing the new file before calling rename, or you might end up with:

Original code:

new = open("new", "w")
new.write(...)
new.close()
os.replace("new", "old")

Gets reordered on the way to the hard drive to become:

new = open("new", "w")
os.replace("new", "old")
new.write(...)
new.close()

POSIX does of course guarantee that if the OS reorders things like this then it has to hide that from you -- processes will always see the write happen before the rename. Except if there's a power loss; now we can have:

Gets executed as:

new = open("new", "w")
os.replace("new", "old")
--- whoops, power lost here, and so is the file contents ---

But fsync is a very expensive operation; there are plenty of applications for atomic writes where this is unnecessary (e.g. if the file is being used as an IPC mechanism, so power loss -> the processes die, and no one cares about their IPC channel anymore). And there are plenty of applications where this is insufficient (e.g. if you expect/need atomic_write(path1, data1); atomic_write(path2, data2) to guarantee that the two atomic writes can't be reordered relative to each other). I don't want to get sucked into a long debate about this; it's entirely likely that adding something like that original recipe to the stdlib would be an improvement, so long as it had *very* detailed docs explaining the exact tradeoffs made. All I want to do is raise a cautionary flag that such an effort would need to tread carefully :-) -n

On Wed, Mar 23, 2016 at 10:14:07AM -0700, Nathaniel Smith wrote:
And then there are file media which lie to you, and return from an fsync before actually syncing, because that makes their benchmarks look good. I've seen file corruption on USB sticks that do this, including some otherwise "respectable" brands. Worst case I ever saw was a USB stick that (fortunately) had a blinking light to show when it was writing. It continued writing for *eight minutes* (I timed it) after returning from fsync, and after the OS had unmounted the stick and it was nominally safe to remove. I'm told that some hard drives will do the same, although I've never knowingly seen it. -- Steve

On Thu, Mar 24, 2016 at 2:10 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, but not as much of it; usually with hard drives, it's a small amount of on-board cache, which tends to improve overall performance by batching writes. That can be disabled by the paranoid (which you should be). More commonly, SSDs do the same bare-faced lying that you noted with USB sticks. There was a time when I took the paranoid approach of eschewing SSDs even on my client computers because they're not trustworthy. I'm beginning to shift toward an even more paranoid approach of not trusting _any_ local storage - all my important data gets pushed to a remote server anyway (maybe GitHub, maybe something on the LAN, whatever), so it won't matter if a power failure corrupts stuff. But for a server with a database on it, you kinda need to solve this problem three ways at once, rather than depending on just one. ChrisA

On Wed, Mar 23, 2016 at 8:10 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Not just USB sticks and hard drives -- whole operating systems too. Ones you might have heard of! Like... OS X. Their fsync() syscall is intentionally designed and documented to provide absolutely zero guarantees of anything. (From the man page: "[after calling fsync], if the drive loses power or the OS crashes, the application may find that only some or none of their data was written [...] This is not a theoretical edge case. This scenario is easily reproduced with real world workloads".) A real fsync call is still available, it's just hidden in fcntl. While we're speaking of it, Python's os.fsync on OS X should probably call that fcntl instead of fsync. Right now os.fsync is effectively a no-op on OS X. (Apparently this is https://bugs.python.org/issue11877 but that issue seems to have petered out years ago in confusion.) -n -- Nathaniel J. Smith -- https://vorpus.org
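(For the curious, that "real fsync" hidden in fcntl is reachable from Python -- a sketch, assuming a platform where F_FULLFSYNC is defined:)

    import fcntl

    def full_fsync(fd):
        # Ask the drive itself to flush its cache (macOS); plain os.fsync()
        # there only pushes data to the drive, not through it.
        fcntl.fcntl(fd, fcntl.F_FULLFSYNC)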

On 23 March 2016 at 13:33, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
A key part of the problem here is that from a structured design perspective, in-memory representation, serialisation and persistence are all separate concerns. If you're creating a domain specific language (and yes, I count R, Stata and MATLAB as domain specific), then you can make some reasonable assumptions for all of those, and hide more of the complexities from your users. In Python, for example, the Pandas IO methods are closer to what other data analysis and modelling environments are able to offer: http://pandas.pydata.org/pandas-docs/stable/io.html NumPy similarly has some dedicated IO routines: http://docs.scipy.org/doc/numpy/reference/routines.io.html However, for a general purpose language, providing convenient "I don't care about the details, just do something sensible" defaults gets trickier, as not only are suitable defaults often domain specific, you have a greater responsibility to help folks figure out when they have ventured into territory where those defaults are no longer appropriate. If you can't figure out a reasonable set of default behaviours, or can't figure out how to nudge people towards alternatives when they start hitting the limits of the default behaviour, then you're often better off ducking the question entirely and getting people to figure it out for themselves. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

semi OT, but: On Tue, Mar 22, 2016 at 8:33 PM, Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
+inf on that -- we really do need to get the entire stdlib to work with Path objects! -CHB

Not to knock too much on R, but this kind of thing was one of the things I always really hated about that language. Separating the operations of converting a string and then writing/printing/sending/etc. that string is just so much more flexible. It reminds me of the many objects I worked with in R that had natural print methods, but did not have natural conversions to strings (at least nowhere near as easy as it should have been). There were so many times that having the operations as separate steps (e.g. having print just call __str__() or whatever) is so much better. I understand the argument that it feels more natural to certain people, but it really incentivizes bad design imo. Definitely a big -1 from me. On 03/22/2016 08:06 PM, Nick Eubank wrote:

On 23.03.2016 05:35, Thomas Nyberg wrote:
and inconvenient. I am no social scientist but I share the sentiment.

bash:

echo "my string" >> my_file

python:

with open('my_file', 'w') as f:
    f.write('my string')

Hmm. Python actually is best at avoiding special characters, but regarding files, it's still on the level of C.
Maybe I missed that, but when did Nick talk about non-strings being written to a file?
I understand the argument that it feels more natural to certain people, but it really incentivizes bad design imo.
I disagree. It incentivizes good design, as it forces you to 1) prepare your file name and 2) prepare your data properly, and 3) it keeps you from "inventing" self-proclaimed-but-not-really-atomic writes. I can tell you from my experience with several aged Python developers that they regularly fail to implement atomic file operations. Just saying.
Definitely a big -1 from me.
+1 from me. Best, Sven

On Wed, Mar 23, 2016 at 6:20 PM, Sven R. Kunze <srkunze@mail.de> wrote:
bash: echo "my string" >> my_file python: with open('my_file', 'w') as f: f.write('my string')
...
I can tell you from my experience with several aged Python developers that they regularly fail to implement atomic file operations. Just saying.
What makes you think your bash example implements an atomic write? It actually performs an append and is therefore not equivalent to the Python code that followed.

On 23.03.2016 23:30, Alexander Belopolsky wrote:
What makes you think that I think my bash example implements an atomic write? ;-) My point was the simplicity of the bash command. Make it as simple AND atomic in Python (with the file handling) and people will drop bash immediately. Best, Sven

On 23/03/2016 23:27, Sven R. Kunze wrote:
In four decades I've never once felt the need for a one-liner that writes a string to a file. How do you drop something that you've never used and have no interest in? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Wed, Mar 23, 2016 at 5:06 AM, Nick Eubank <nickeubank@gmail.com> wrote: [...]
Maybe you know this, but Path.read_text and Path.write_text (and the bytes versions) were introduced in Python 3.5, released only half a year ago, so there has not really been time for their adoption.

Path("mytextfile.txt").write_text(...)
Path("mytextfile.txt").read_text()

Besides being very readable and convenient, one of the nice things about this is that it helps avoid code like

open("mytextfile.txt").write(text)

which you see all the time, even if the file will be left open for some time after this in non-ref-counting implementations of Python. So, as mentioned before by others, all we need is better interplay of pathlib with the rest of the stdlib, and for people to learn about it. Personally, I now always use `pathlib` instead of `open` when I don't need to care about compatibility with old Python versions. Maybe some day, Path could be in builtins like open is now? - Koos

On Thu, Mar 24, 2016 at 7:46 PM, Sven R. Kunze <srkunze@mail.de> wrote:
For me, it's most often str->Path, meaning things like Path("file.txt"), or

directory = Path("/path/to/some/directory")
with (directory / "file.txt").open() as f:
    # do something with f
with (directory / "otherfile.txt").open() as f:
    # do something with f

Then, somewhat less often, I need to give a str as an argument to something. I then need an additional str(...) around the Path object. That just feels stupid and makes me start wishing Path was a subclass of str, and/or that Path was a builtin. Or even better, that you could do p"filename.txt", which would give you a Path string object. Has this been discussed? - Koos

On Thu, Mar 24, 2016 at 5:06 PM, Koos Zevenhoven <k7hoven@gmail.com> wrote:
Yes, right in the PEP 428 where pathlib was proposed. [1] [1]: https://www.python.org/dev/peps/pep-0428/#id29

On Thu, Mar 24, 2016 at 11:22 PM, Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
Thanks. The PEP indeed mentions that str (and other builtins like tuple) are not subclassed in pathlib, but I couldn't find the explanation why. Well, it would add all kinds of attributes, many of which will just be confusing and not useful for paths. - Koos

On 24.03.2016 22:06, Koos Zevenhoven wrote:
Or even better, that you could do p"filename.txt", which would give you a Path string object. Has this been discussed?
Interesting. I had this thought half a year ago. Quite recently I mentioned this to Andrew in a private conversation. p'/etc/hosts' would make a perfect path which subclasses str. The p-string idea has not been discussed in the PEP, I think. The subclassing thing, however, was, and I think its resolution was a mistake. The explanation, at least from my point of view, is a bit weird. Maybe this can be re-discussed in a separate thread? Especially when different people think independently of the same issue and the same solution. Best, Sven

Thanks for the thoughts both! I'm not opposed to `a = str.read_file()` -- it does require knowing classes to grok it, but it's super readable and intuitive to look at (i.e. pythonic?). Regarding bytes, I was thinking `to_file()` would include a handful of arguments to support unusual encodings or bytes, but leave the default as utf-8 text. On Tue, Mar 22, 2016 at 8:52 PM Andrew Barnert <abarnert@yahoo.com> wrote:

On Tue, Mar 22, 2016 at 11:51 PM, Andrew Barnert <abarnert@yahoo.com> wrote:
Then you don't have to worry about how good the buffering is, what happens if there's a power failure ...
You are right: it is not uncommon for data scientists to prefer losing all the computed data rather than the last few blocks in this case. :-)

On Mar 22, 2016, at 21:32, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
The question is whether it's better to have the complete previous version of the data, or 21% of the new version of the data.[1] And the answer is different in different cases, which is why atomic pretty much has to be an option (or a separate function), not always-on. But I think on by default makes sense. --- [1]: In fact, with delete=False and a sensible pattern, atomic actually gives you the best of both worlds: the previous version of the data is still there, and 21% of the new version is in a tempfile that you can recover if you know what you're doing...
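A rough sketch of that delete=False pattern (atomic_write_text is an illustrative name, not an existing API; assumes Python 3.3+ for os.replace):

    import os
    import tempfile

    def atomic_write_text(path, text, encoding="utf-8"):
        # Write to a named temp file in the same directory, then swap it
        # into place. On failure, the old version of the file survives and
        # the partial new data remains in the temp file for recovery.
        d = os.path.dirname(os.path.abspath(path))
        with tempfile.NamedTemporaryFile("w", encoding=encoding, dir=d,
                                         prefix=os.path.basename(path) + ".",
                                         delete=False) as f:
            f.write(text)
            f.flush()
            os.fsync(f.fileno())
            tmp = f.name
        os.replace(tmp, path)  # atomic on POSIX and on Windows since 3.3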

Andrew Barnert via Python-ideas writes:
It occurs to me that we already have a perfectly appropriate builtin for the purpose anyway: print. Add a filename= keyword argument, and make use of both file= and filename= in the same call an error. If that's not the right answer, I don't see how str.to_file() can possibly be better.
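Nothing like filename= exists on print() today; as a purely hypothetical illustration, a wrapper with roughly that shape might look like:

    def print_to_file(*args, filename, encoding="utf-8", **kwargs):
        # Hypothetical: what print(..., filename=...) might do internally.
        with open(filename, "w", encoding=encoding) as f:
            print(*args, file=f, **kwargs)

    print_to_file("A string", filename="foo.txt")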

On 24 March 2016 at 18:02, Stephen J. Turnbull <stephen@xemacs.org> wrote:
Well, with print("A string", filename='foo.txt') what encoding should be used? What if the user wants control of the encoding (on Windows, it's far from uncommon to want UTF-8 instead of the system default, for example)? What about error handlers? The same questions apply for str.to_file() of course, although having extra arguments is likely to be less of an issue there. And yes, at some point the answer is "if the convenience function doesn't suit your needs, you can fall back to explicit file handling". But the trick is finding the *right* balance of convenience. And having a "simple" way of writing a string to a file that doesn't at least hint to the user that they should be thinking about encodings seems to me like something of an attractive nuisance. Paul

On Mar 24, 2016, at 11:02, Stephen J. Turnbull <stephen@xemacs.org> wrote:
That's not a bad idea. However, it's obvious how to extend str.to_file() to reads (a str.from_file() classmethod), binary data (bytes.to_file()), atomic writes (str.to_file(path, atomic=True)), and non-default encodings (str.to_file(path, encoding='Latin-1', errors='replace')). With print, those are all less obvious. Maybe input() for reads, but how can it handle bytes? And, while you could toss on more parameters to print() for some of these, it already has a lot, and adding new parameters that are only legal if filename is specified seems like it'll make the docs even harder to follow. Of course if the answers are "we don't want any of those", then being more limited is a good thing, not a problem. :)

I did a quick survey of other languages, cross-language frameworks, etc. to see what they offer, and I was surprised. Take a look at my quick survey (or correct me! I wasn't super-careful here, and may easily have taken misleading StackOverflow posts or the like at face value): https://goo.gl/9bJcAf Anyway, the original proposal in this thread is very close to Cocoa, but that turns out to be pretty uncommon. Most frameworks use free functions, not methods on either string or path objects. Most don't offer an option for atomic writes or exclusive locks--although some do them whether you want it or not. Most have an option for appending. All but one either let you specify the encoding or don't do text at all. Almost none of those were what I expected.

also, Python's strings are immutable, so we really don't want to encourage people to build up a big string in memory anyway. And what's wrong with:

    open(a_path, 'w').write(the_string)

A short, simple one-liner. OK, non-CPython garbage collection may leave that file open and dangling, but we're talking about the quick-scripting, data-analysis type of user -- the script will terminate soon enough. BTW, numpy does offer one-stop ways to save and load arrays to a file, binary or text -- and that's a lot more useful than a simple string, especially the reading. Oh, and for reading:

    string = open(path).read()

I really don't get the point of all this. -CHB -- Christopher Barker, Ph.D. Oceanographer Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception Chris.Barker@noaa.gov

On 25 March 2016 at 08:07, Chris Barker <chris.barker@noaa.gov> wrote:
One of the few downsides of Python's popularity as both a scripting language and an app development language is that a lot of tutorials are written for the latter, and in app development, relying on the GC for external resource cleanup isn't a great habit to get into. As a result, tutorials will introduce the deterministic cleanup form, and actively discourage use of the non-deterministic cleanup. Independent of the "Why not just rely on the GC?" question, though, we're also forcing the user to upgrade their mental model to achieve their objective.

User model: "I want to save this data to disk at a particular location and be able to read it back later."

By contrast, unpacking the steps in the one-liner:

- open the nominated location for writing (with implied text encoding & error handling)
- write the data to that location

It's that switch from a 1-step process to a 2-step process that breaks flow, rather than the specifics of the wording in the code (Python 3 at least improves the hidden third step in the process by having the implied text encoding typically be UTF-8 rather than ASCII). Formulating the question this way does suggest a somewhat radical notion, though: what if we had a JSON-based save builtin that wrote UTF-8 encoded files based on json.dump()? That is, "save(a_path, the_string)" would attempt to do:

    with open(a_path, 'w', encoding='utf-8', errors='strict') as f:
        json.dump(the_string, f)

While a corresponding "load(a_path)" would attempt to do:

    with open(a_path, 'r', encoding='utf-8', errors='strict') as f:
        return json.load(f)

The format of the created files would be using a well-defined standard rather than raw data dumps (as well as covering more than just pre-serialised strings), and the degenerate case of saving a single string would just have quotation marks added to the beginning and end. If we later chose to define a "__save__" and "__load__" protocol, then json.dump/load would also be able to benefit. There'd also be a potential long term security benefit here, as folks are often prone to reaching for pickle to save data structures to disk, which creates an arbitrary code execution security risk when loading them again later. Explicitly favouring the less dangerous JSON as the preferred serialisation format can help nudge people towards safer practices without forcing them to consciously think through the security implications. Switching from the save/load builtins to manual file management would then be a matter of improving storage efficiency and speed of access, just as switching to any other kind of persistent data store would be (and a JSON based save/load would migrate very well to NoSQL style persistent data storage services). Cheers, Nick. P.S. I'm going to be mostly offline until late next week, but found this idea intriguing enough to share before I left. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
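Treating those hypothetical builtins as ordinary functions for a moment, a self-contained round-trip sketch might look like this (save/load are the proposed names from the message above, not real builtins; the file names are illustrative):

    import json

    def save(a_path, obj):
        # Hypothetical 'save' builtin, per the sketch above: UTF-8 JSON.
        with open(a_path, "w", encoding="utf-8", errors="strict") as f:
            json.dump(obj, f)

    def load(a_path):
        with open(a_path, "r", encoding="utf-8", errors="strict") as f:
            return json.load(f)

    # Round trip: works for strings and for richer JSON-friendly objects.
    save("results.json", {"run": 3, "scores": [0.91, 0.87]})
    assert load("results.json") == {"run": 3, "scores": [0.91, 0.87]}
    save("note.json", "just a string")  # file contains: "just a string"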

Thanks Nick -- I really couldn't have said it better than "we're also forcing the user to upgrade their mental model to achieve their objective." Python is beautiful in part because it takes all sorts of idioms that are complicated in other languages and wraps them up in something simple and intuitive ("in" has to be my favorite builtin of all time). This feels like one of those few cases where Python still feels like C, and it's not at all clear to me why that needs to be the case. On Thu, Mar 24, 2016 at 7:22 PM Nick Coghlan <ncoghlan@gmail.com> wrote:

On Mar 24, 2016 10:22 PM, "Nick Coghlan" <ncoghlan@gmail.com> wrote:
I was going to suggest that, if str save/load were to be implemented, it could be made as a parallel to json.dump and json.load (as string module members?). Any new features like atomicity would be "backported" to json.dump/load. Same with pickle.load/dump. It's just str serialization, isn't it? If there is a save/load builtin (or dump/load), I don't see what's so special about JSON. (YAML is more like Python, including in readability.) I'd suggest that you can specify a "format" parameter (or under another name) to use it with pickle, json, yaml, etc., but that means two ways of doing things. Deprecate the original way?

    import json
    load("data.json", format=json)

Without a format kwarg, Python might look at the file extension and try to guess the format. Modules could register extensions (is this magic?), or it could look for an object with the same name which has __dump__/__load__ (is this VERY magic?). I don't like either: You have to import pickle to load it, but you wouldn't have to name it explicitly to use it. I'd hate `import pickle; pickle.register(".p")`: this thread is for scripters and not application devs, so wordiness matters. Your proposal means I don't even have to import json to use json, and json needs explaining to a data scientist (who won't typically have web dev experience).

On Sat, Mar 26, 2016 at 3:30 AM, Franklin? Lee <leewangzhong+python@gmail.com> wrote:
The thing that's special about JSON is that nearly everyone recognizes the name. Thanks to JSON's extensive use in networked systems (courtesy of web browser <-> web server communication), pretty much everyone will be able to grok bidirectional JSON handling. Maybe YAML would be better, but if you publish something that uses JSON, nobody will be confused. Not sure that's enough justification to make it the default, but it is significant.
You can slap those lines into sitecustomize.py though. Or third party Python distributors could slap those lines into sitecustomize.py, and you won't even be aware that stuff had to be registered. But I don't like the registration of extensions as a means of selecting an output format, generally. Too magical. ChrisA

On Mar 24, 2016, at 7:22 PM, Nick Coghlan : what if we had a JSON-based save builtin that wrote UTF-8 encoded files based on json.dump()?
I've been thinking about this for a while, but would rather have a "pyson" format -- i.e., Python literals, rather than JSON. This would preserve the tuple vs list and integer vs float distinctions, and allow more options for dictionary keys (and sets?). Granted, you'd lose the interoperability, but for the quick saving and loading of data, it'd be pretty nice. There is also jsonpickle: https://jsonpickle.github.io Though as I understand it, it has the same security issues as pickle. But could we make a not-quite-as-complete pickle-like protocol that could save and load arbitrary objects, without ever running arbitrary code? -CHB

Le 28/03/2016 17:30, Chris Barker - NOAA Federal a écrit :
If it's for quick data saving, security is not an issue, since the data will never come from an attacker if you are writing a quick script. For other needs, where security is an issue, having a one-liner to dump some serialization is not going to make much of a difference.

On Mon, Mar 28, 2016 at 8:44 AM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
If it's for quick data saving, security is not an issue, since the data will never come from an attacker if you are writing a quick script.
that's why we have pickle already -- but a nice human-readable and editable form would be nice... Also -- when it comes to security it's a tough one -- people DO start with one thing, thinking "I will always trust this source" (if they think about it at all), then later expand the system to be a web service, or .. and oops!
For other needs, where security is an issue, having a one-liner to dump some serialization is not going to make much of a difference.
no -- I kind of mixed topics here -- my "safe json serialization" would be for web services, configuration, etc. -- where security matters, but quick one-liner access is not so important -- though why not have one thing for multiple uses? -CHB

On 29 March 2016 at 01:44, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
"These files will never be supplied or altered by an attacker" is the kind of assumption that has graced the world with such things as MS Office macro viruses. That means that as Python makes more inroads into the traditional territory of MS Excel and other spreadsheets, ensuring we encourage a clear distinction between code (which is always dangerous to trust) and data (which *should* be safe to read, aside from processing capacity limits) becomes increasingly important. If we ever did something like this, then Chris's suggestion of a Python-specific format that can be loaded from a string via ast.literal_eval() rather than using JSON likely makes sense [1], but it would also be appropriate to revisit that idea first as a project outside the standard library for ad hoc data persistence, before proposing it for standard library inclusion. Cheers, Nick. [1] https://code.google.com/archive/p/pyon/ is a project from several years ago aimed at that task. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia

As with my previous email about __main__ for random, uuid and os, I wish to suggest a similar __main__ for datetime. The target audience may not be the same, so I'm making it a separate proposal. E.g.:

    python -m datetime now                => print(datetime.datetime.now())
    python -m datetime utcnow             => print(datetime.datetime.utcnow())
    python -m time epoch                  => print(time.time())
    python -m datetime now "%d/%m/%Y"     => print(datetime.datetime.now().strftime("%d/%m/%Y"))
    python -m datetime utcnow "%d/%m/%Y"  => print(datetime.datetime.utcnow().strftime("%d/%m/%Y"))

Note: new subjects should be posted as new threads. This was posted as a response in the unrelated "`to_file()` method for strings" thread and will not be seen by anyone who has killed that thread or who has threads collapsed. On 4/1/2016 2:02 PM, Michel Desmoulin wrote:
As with my previous email about __main__ for random, uuid and os, I wish to suggest a similar __main__ for datetime.
File __main__ is for packages. I believe none of the modules you mention are packages, so I presume you mean adding within the module something like

    def main(args):
        ...

    if __name__ == '__main__':
        from sys import argv
        main(argv[1:])

or the equivalent in C.
What is the particular motivation for this package? Should we add a command line interface to math? So that python -m math func arg => print(func(arg)) ? I am wondering what is or should be the general policy on the subject. How easy is this for C-coded modules? -- Terry Jan Reedy
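For concreteness, a minimal sketch of what such an entry point for datetime might look like, following Michel's examples (purely hypothetical; no such CLI exists in the stdlib):

    import sys
    from datetime import datetime

    def main(args):
        # Hypothetical dispatch for "python -m datetime now [format]" etc.
        command, *rest = args or ["now"]
        dt = datetime.utcnow() if command == "utcnow" else datetime.now()
        print(dt.strftime(rest[0]) if rest else dt)

    if __name__ == '__main__':
        main(sys.argv[1:])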

On 01/04/2016 19:02, Michel Desmoulin wrote:
Not very funny for 1st April. Would you care to have another go? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

On Fri, Apr 1, 2016 at 2:02 PM, Michel Desmoulin <desmoulinmichel@gmail.com> wrote:
+1 In fact, I would not mind seeing a fairly complete GNU date utility [1] reimplementation as datetime.__main__. [1] http://www.gnu.org/software/coreutils/manual/html_node/date-invocation.html#...

On Sat, Apr 2, 2016 at 1:49 AM Alexander Belopolsky < alexander.belopolsky@gmail.com> wrote:
Seems like Microsoft is solving this problem for you. http://www.hanselman.com/blog/DevelopersCanRunBashShellAndUsermodeUbuntuLinu...

[1] https://code.google.com/archive/p/pyon/ is a project from several years ago aimed at that task.
Thanks for the link -- I'll check it out. - Chris

On Thu, Mar 24, 2016 at 7:22 PM, Nick Coghlan <ncoghlan@gmail.com> wrote:
no, it's not. Though now that I think about it: while I understand that context managers are great as a universal and flexible way to, well, manage context, for file objects themselves:

    open('a_file', 'w').write(some_stuff)

is really nice -- while we don't want to make reference-counting garbage collection a required part of the language in general, I wonder if it would be practical to make it part of the spec for SOME objects -- i.e., kind of like you can define a given object to be a context manager, you could define certain objects to clean up after themselves when they go out of scope. It would sure be nice in this example; it would buy us that simple syntax, the atomic operation, etc. I have not thought this through AT ALL, but I wonder if one could avoid requiring a reference-counting system by catching at the parsing stage that the object created is never assigned to anything -- and thus you know it's going to go out of scope as soon as that line is finished processing. In fact, if you have a non-reference-counting garbage collection system, it might be helpful to special-case all objects with very short lives, so you don't pile up a bunch of temporaries that have to be collected later.... -CHB

I wonder how reasonable it would be to add a new keyword to open that would .close() the file object upon a single write/read. Consider:

    data = open("foo.txt", "r", close_after_use=True).read()

It doesn’t rely on garbage collection to work properly, is a fancy one-liner, and makes it obvious that you can only read or write from/to it once. Could use a better keyword for that, though. -Emanuel ~ If it doesn’t quack like a duck, add a quack() method ~

On Mon, Mar 28, 2016 at 8:54 AM, Émanuel Barry <vgr255@live.ca> wrote:
it makes it a tiny bit shorter than using "with", but doesn't solve Nick's mental model issue -- the user still needs to be thinking about the fact that they are creating a file object and that it needs to be closed when they are done with it. And I'm not sure how you would define "use" -- any I/O operation? I.e.:

    infile = open("foo.txt", "r", close_after_use=True)
    first_line = infile.readline()

Now what is the state of infile? A closed file object? This is exactly why context managers were introduced:

    with open("foo.txt", "r") as infile:
        first_line = infile.readline()
    something_else...

Now it's pretty clear that infile is no longer a useful object. -CHB

Sure. The keyword really could benefit from a better name, but it might solve some part of the issue. It doesn’t solve the mental model issue though, but to be fair, open() should always be wrapped in a context manager (IMO, anyway). -Emanuel

On Mon, Mar 28, 2016 at 9:32 AM, Émanuel Barry <vgr255@live.ca> wrote:
then we don't need anything else :-) But the name of the keyword is not the point here -- the issue is: what does it mean to "use" a file object? Using "with" lets the user clearly define when they are done with the object. And if a reference to the object is not stored anywhere, then you can be sure the user doesn't expect to be able to use it again. But:

    data = open("foo.txt", "r", close_after_use=True).read()

and

    infile = open("foo.txt", "r", close_after_use=True)
    data = infile.read()

look exactly the same to the file object itself; it has no idea when the user is done with it. In CPython, the reference counter knows that the file object has no references to it when that first line is done running, so it can clean up and delete the object -- but without a reference-counting system, who knows when it will get cleaned up? -CHB

On Mon, Mar 28, 2016 at 12:40 PM, Chris Barker <chris.barker@noaa.gov> wrote:
It is my understanding that under the "close_after_use" proposal, infile will be closed by .read() in both cases. Still, I don't like this idea. It puts the action specification (close) too far from where the action is taken. The two-line example already makes me uncomfortable, and imagine if infile is passed through several layers of function calls before infile.read() is called. I would rather see read_and_close() and write_and_close() convenience methods for file objects:

    data = open("foo.txt").read_and_close()
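Neither method exists today; as an illustration, the read half is easy to sketch as a plain helper:

    def read_and_close(f):
        # What a read_and_close() method might do: read everything,
        # then close the stream no matter what.
        try:
            return f.read()
        finally:
            f.close()

    data = read_and_close(open("foo.txt"))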

On Tue, Mar 29, 2016 at 2:45 AM, Chris Barker <chris.barker@noaa.gov> wrote:
The whole point of the context manager is that it defines the notion "go out of scope". But if you want the simple syntax, the easiest way is to create a helper:

    def write_file(filename, contents):
        with open(filename, "w") as f:
            f.write(contents)

    write_file('a_file', some_stuff)

Or, if you're writing a quick one-off script, just use the line you have above, and don't worry about guaranteed closing. What's the worst that can happen? AIUI buffered data will be written out at process termination if not before, and of course the file will be closed. The main case where you absolutely must close the file promptly is when you plan to then open it again (in the same or another process), and a quick script will usually know if that's going to be a possibility. (How bad can it be? I suppose https://xkcd.com/292/ level bad. But not usually.) ChrisA

On Mon, Mar 28, 2016 at 8:55 AM, Chris Angelico <rosuav@gmail.com> wrote:
The whole point of the context manager is that it defines the notion "go out of scope".
of course it does -- very useful, but a lot of extra for a simple "read this file then close it right away", for the casual scripting user :-)
So I guess this comes around to the OP's point -- a helper like that should be built in, and easy and obvious to the scripting user. I don't think it should be a string method, but... Or, if you're writing a quick one-off script, just use the line you have above, and don't worry about guaranteed closing. And that comes back around to my original post on this thread -- we should just tell scripting users to do that :-) But I think it was Nick's point that Python is used for both scripting and "real" system development, so it's hard to communicate best practices for both... Given that, I'm kind of liking "write_file". It solves Nick's "mental model" issue: "I want to write some stuff to a file", rather than "I want to create a file object, then write stuff to it". -CHB

Just to make sure I follow: Chris, you're proposing a modification to garbage collection to clean up `open()` if `open()` is used but not closed and not assigned to anything, so that `open(file).write(stuff)` is safe? That sounds brilliant to me, and totally meets my goal of having a "one-step" method for writing to disk that doesn't require users to develop a two-step mental model. If it's feasible (I have nothing but the most basic understanding of how the garbage collector works), that would totally obviate my desire to have a `to_file()` method on strings. On Mon, Mar 28, 2016 at 8:46 AM Chris Barker <chris.barker@noaa.gov> wrote:

Just to make sure I follow: Chris, you're proposing a modification to garbage collection to clean up `open()` if `open()` is used but not closed and not assigned to anything, so that `open(file).write(stuff)` is safe? Well, I wouldn't call it a proposal, more a random inspiration that popped up when thinking about this thread. That sounds brilliant to me, and totally meets my goal of having a "one-step" method for writing to disk that doesn't require users to develop a two-step mental model. If it's feasible (I have nothing but the most basic understanding of how the garbage collector works), that would totally obviate my desire to have a `to_file()` method on strings. IIUC, in the current CPython implementation, you do indeed have that -- I've used it for years, long before context managers existed, and still do for quickie scripts. CPython uses a reference-counting scheme: each time an object is referenced, its count is increased; each time a reference is lost, it is decreased. When the count goes to zero, the object is deleted. So in:

    data = open(filename).read()

the file object is created and given a refcount of one. Then the read() method is called, creating a string, which is then bound to the name data. When the next line is reached, the refcount of the file object is reduced to zero, and the object is deleted. And the internal file handle is closed before deleting the object. The "trick" is that the Python language spec does not require this kind of garbage collection. So Jython, or PyPy, or IronPython may not clean up that file object right away; it could hang around in an open and unflushed state until the garbage collector gets around to cleaning it up. But for short scripts, it'll get cleaned up at the end of the script anyway. The thing is that while I, at least, think it's a fine practice for scripting, it's really not a good idea for larger system development. Thus the general advice to use "with". As for any proposal, it dawned on me that while we don't want Python to require any particular garbage-collecting scheme, I wonder if it would be possible (or desirable) to specify that temporary objects used on one line of code, and never referenced any other way, get deleted right away. It's actually very common to create a lot of temporaries on a line of code -- when you chain operations or methods. So it might be a small performance tweak worth doing (or it may not :-) ) -CHB

Brevity may be the soul of wit, but I am completely happy without a convenience function to open and read/write, then close a file. In production, the ``with`` statement seems concise enough. If I'm messing around at the prompt we can already write ``text = open(filename).read()``. We can also write ``open(filename, 'w').write(text)`` and not worry about closing the file in the basic Python prompt. On Mon, Mar 28, 2016 at 8:49 PM Chris Barker - NOAA Federal < chris.barker@noaa.gov> wrote:

[Sorry, Nick, didn't mean to just send it to you.] On 03/28/2016 11:16 AM, Nick Eubank wrote:
Even if he is, I don't see it happening. Modifying the garbage collector to act sooner on one particular type of object when we already have good, if very mildly inconvenient, solutions is not going to be a high-priority item -- especially if it has /any/ negative impact on performance. -- ~Ethan~

On Tue, Mar 29, 2016 at 5:06 AM, Eric V. Smith <eric@trueblade.com> wrote:
Again -- this is a thought experiment, not a proposal, but: It's only relevant for other implementations -- CPython already cleans up the file object (any object) when it is no longer referenced. And to move the thought experiment along -- it was never specific to file objects, but rather: in the case of temporary objects that are never referred to outside a single line of code, delete them right away. In theory, this could help all implementations leave a little less garbage lying around. Whether that is possible or desirable, I have no idea. -CHB

On Tue, 29 Mar 2016 at 15:11 Chris Barker <chris.barker@noaa.gov> wrote:
It's not desirable to dictate that various Python implementations must make sure their garbage collectors support direct collection like this for one specific case rather than staying general. It would very much be a shift in the definition of the Python project, where we have purposefully avoided dictating how objects need to behave in terms of garbage collection. -Brett

On Wed, Mar 30, 2016 at 9:10 AM, Chris Barker <chris.barker@noaa.gov> wrote:
The problem is that there's no way to be sure. For instance, compare these lines of code:

    from threading import Thread
    open("marker", "w").write("Ha")
    Thread(target=threadfunc).start()

One of them has finished with the object completely. The other most certainly has not. Python prefers, in the case of ambiguity, to require explicit directives. ChrisA

On Wed, Mar 30, 2016 at 3:33 PM, Chris Barker - NOAA Federal <chris.barker@noaa.gov> wrote:
There'll be another reference somewhere. Presumably a new stack is allocated or something, and that's what keeps a reference. It doesn't matter _what_ keeps the ref; all that matters is that there is one, and that the code line itself doesn't show that. Hence you need to explicitly say "open this, and close it when you're done". ChrisA

On Mar 22, 2016, at 20:06, Nick Eubank <nickeubank@gmail.com> wrote:
Funny, I've seen people looking for a one-step file-load more often than file-save. But really, the two go together; any environment that provided one but not the other would seem strange to me. The question is, if you stick the save as a string method s.to_file(path), where do you stick the load? Is it a string class method str.from_file(path)? That means you have to teach novices about class methods very early on... Alternatively, they could both be builtins, but adding two new builtins is a pretty heavy-duty change. Anything else seems like it would be too non-parallel to be discoverable by novices or remembered by occasional Python users. I'd assume you'd also want this on bytes (and probably bytearray) for dealing with binary files. Anyway, another huge advantage of this proposal is that it can handle atomic writes properly. After all, even novices can quickly learn this:

    with open(path, 'w') as f:
        f.write(s)

... but how many of them can even understand this:

    with tempfile.NamedTemporaryFile('w', dir=os.path.dirname(path), delete=False) as f:
        f.write(s)
        f.flush()
        os.replace(f.name, path)

(That's assuming I even got the latter right. I'm on my phone right now, so I can't check, but I wouldn't be surprised if there's a mistake there...) One last thing: I think many of Cocoa, .NET, Qt, Java, etc. have a similar facility. And some of them come from languages that force everything to be a method rather than a function, which means they'd have good answers for the "where does from_file go?" question. Any serious proposal should probably survey those frameworks (along with the scientific systems--and probably how NumPy compares to the latter). Actually, one _more_ last thing. Having this as a Path method makes sense; your problem is that novices don't discover pathlib, and experts don't use pathlib because it's not usable in most non-trivial cases (i.e., you can't use it even with stdlib modules like zipfile, much less third-party libs, without explicitly "casting" Path and str back and forth all over the place). Assuming that the long-term goal is to get everyone using pathlib, rather than just abandoning it as a stdlib fossil of a nice idea that didn't work out, copying its useful features like read_text and write_text out to builtins could be taken as working against that long-term goal. (Also compare Cocoa, which quasi-deprecated hundreds of methods, and then actually deprecated and removed some, to get everyone to switch to using NSURL for all paths.)

On Tue, Mar 22, 2016 at 11:33 PM, Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
This would be a reasonable addition as a pathlib.Path method.
(That's assuming I even got the latter right. I'm on my phone right now, so I can't check, but I wouldn't be surprised if there's a mistake there...)
You've got it wrong, but I understand what you tried to achieve. Note that the "write to temp and move" trick may not work if your /tmp and your path are mounted on different filesystems. And with some filesystems it may not work at all, but I agree that it would be nice to have a state of the art atomic write method somewhere in stdlib.

On Wed, Mar 23, 2016 at 3:22 PM, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
It's specifically selecting a directory for the temp file, so it ought to work. However, I'm not certain in my own head of the interaction between NamedTemporaryFile with delete=False and os.replace (nor of exactly how the latter differs from os.rename); what exactly happens when the context manager exits here? And what happens if there's an exception in the middle of this and stuff doesn't complete properly? Are there points at which this could (a) destroy data by deleting without saving, or (b) leave cruft around? This would be very nice to have as either stdlib or a well-documented recipe. ChrisA

On Mar 22, 2016, at 21:40, Alexander Belopolsky <alexander.belopolsky@gmail.com> wrote:
That's why I used os.replace, not os.rename. Which _is_ guaranteed to either replace atomically or do nothing and fail (and works with the most important filesystems on Windows 2003+ and every major POSIX OS). So that's not the problem either. The "delete=False" is also not a problem. Doing the replace inside the NamedTemporaryFile may be a problem on Windows. Or maybe it was doing it right after a flush without something else in between? I forget. Whatever it is, I have a solution in the library that I use (which also emulates os.replace on Python 2.6-7 and 3.1-2), so I don't have to remember how to solve it every time. One argument against adding it to write_text (besides the argument that novices don't discover pathlib and experts don't use it for other reasons) is that I think the one obvious way to do it should be atomic by default for novices, but that would be a breaking API change to the pathlib function. But if that were the way to get atomic writes in the stdlib, I'd start using pathlib for them.

On Wed, Mar 23, 2016 at 12:40:56AM -0400, Alexander Belopolsky wrote:
I don't think that Python can guarantee to do more than the file system. The best Python can do is promise to do an atomic write if the OS and file system support it, otherwise to be no worse than what the OS and file system support. -- Steve

On Tue, Mar 22, 2016 at 9:33 PM, Chris Angelico <rosuav@gmail.com> wrote:
Also: cross-platform support (Windows Is Different), handling of permissions, do you care about xattrs?, when you say "atomic" then do you mean atomic WRT power loss?, ... it's not actually 100% clear to me that a one-size-fits-all atomic write implementation is possible. -n -- Nathaniel J. Smith -- https://vorpus.org

On Mar 22, 2016, at 21:49, Nathaniel Smith <njs@pobox.com> wrote:
I know a lot of people who never touch Windows with a 10-foot pole think this problem is still unsolved, but that's not true. Microsoft added sufficient atomic-replace APIs in 2003 (in the Win32 API only, not in crt/libc), and as of 3.3, Python's os.replace really is guaranteed to either atomically replace the file or leave it untouched and raise an exception on all platforms (as long as the files are on the same filesystem, and as long as there's not an EIO or equivalent because of an already-corrupted filesystem or a physical error on the media). (For platforms besides Windows and POSIX, it does this just by not existing and raising an AttributeError...) Likewise for safe temporary files--as of 3.3, tempfile.NamedTemporaryFile is safe on every platform where it exists, and that includes Windows.
handling of permissions, do you care about xattrs?
That can be handled effectively the same way as copy vs. copy2 if desired. I don't know if it's important enough, but if it is, it's easy. (My library actually does have options for preserving different levels of stuff, but I never use them.)
when you say "atomic" then do you mean atomic WRT power loss?
Write-and-replace is atomic WRT both exceptions and power loss. Until the replace succeeds, the old version of the file is still there. This is guaranteed by POSIX and by Windows. If the OS can't offer that on some filesystem, it won't let you call os.replace.
, ... it's not actually 100% clear to me that a one-size-fits-all atomic write implementation is possible.
At least a one-size-fits-all-POSIX-and-post-XP-Windows solution is possible, and that's good enough to many things in the stdlib today. If atomic writes had "availability: Unix, Windows", like most of the os module, I think that would be fine.

If you are interested by "atomic write into a file", see the practical issues to implement a portable function: http://bugs.python.org/issue8604 Victor

On Mar 23, 2016 1:13 AM, "Andrew Barnert" <abarnert@yahoo.com> wrote:
Ah, thanks! I indeed didn't know about this, and missed that the code was calling os.replace rather than os.rename.
Right, but this is the kind of thing that makes me worry about a one-size-fits-all solution :-).
POSIX doesn't guarantee anything whatsoever over power loss. Individual filesystem implementations make somewhat stronger guarantees, but it's a mess: http://danluu.com/file-consistency/ At the very least, atomicity requires fsync'ing the new file before calling rename, or you might end up with this. Original code:

    new = open("new", "w")
    new.write(...)
    new.close()
    os.replace("new", "old")

Gets reordered on the way to the hard drive to become:

    new = open("new", "w")
    os.replace("new", "old")
    new.write(...)
    new.close()

POSIX does of course guarantee that if the OS reorders things like this then it has to hide that from you -- processes will always see the write happen before the rename. Except if there's a power loss, now we can have it executed as:

    new = open("new", "w")
    os.replace("new", "old")
    # --- whoops, power lost here, and so is the file contents ---

But fsync is a very expensive operation; there are plenty of applications for atomic writes where this is unnecessary (e.g. if the file is being used as an IPC mechanism, so power loss -> the processes die, no one cares about their IPC channel anymore). And there are plenty of applications where this is insufficient (e.g. if you expect/need atomic_write(path1, data1); atomic_write(path2, data2) to guarantee that the two atomic writes can't be reordered relative to each other). I don't want to get sucked into a long debate about this; it's entirely likely that adding something like that original recipe to the stdlib would be an improvement, so long as it had *very* detailed docs explaining the exact tradeoffs made. All I want to do is raise a cautionary flag that such an effort would need to tread carefully :-) -n
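As a sketch of the fsync-before-rename step Nathaniel describes, plus the directory fsync needed on POSIX for the rename itself to be durable (durable_replace is an illustrative helper, not an existing API; details vary by OS and filesystem):

    import os

    def durable_replace(tmp_path, final_path):
        # Flush the new file's data to disk before the rename...
        fd = os.open(tmp_path, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
        os.replace(tmp_path, final_path)
        # ...then flush the directory entry so the rename survives power loss.
        dfd = os.open(os.path.dirname(os.path.abspath(final_path)), os.O_RDONLY)
        try:
            os.fsync(dfd)
        finally:
            os.close(dfd)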

On Wed, Mar 23, 2016 at 10:14:07AM -0700, Nathaniel Smith wrote:
And then there are file media which lie to you, and return from an fsync before actually syncing, because that makes their benchmarks look good. I've seen file corruption on USB sticks that do this, including some otherwise "respectable" brands. The worst case I ever saw was a USB stick that (fortunately) had a blinking light to show when it was writing. It continued writing for *eight minutes* (I timed it) after returning from fsync, when the OS had unmounted the stick and it was nominally safe to remove. I'm told that some hard drives will do the same, although I've never knowingly seen it. -- Steve

On Thu, Mar 24, 2016 at 2:10 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Yes, but not as much of it; usually with hard drives, it's a small amount of on-board cache, which tends to improve overall performance by batching writes. That can be disabled by the paranoid (which you should be). More commonly, SSDs do the same bare-faced lying that you noted with USB sticks. There was a time when I took the paranoid approach of eschewing SSDs even on my client computers because they're not trustworthy. I'm beginning to shift toward an even more paranoid approach of not trusting _any_ local storage - all my important data gets pushed to a remote server anyway (maybe GitHub, maybe something on the LAN, whatever), so it won't matter if a power failure corrupts stuff. But for a server with a database on it, you kinda need to solve this problem three ways at once, rather than depending on just one. ChrisA

On Wed, Mar 23, 2016 at 8:10 PM, Steven D'Aprano <steve@pearwood.info> wrote:
Not just USB sticks and hard drives -- whole operating systems too. Ones you might have heard of! Like... OS X. Their fsync() syscall is intentionally designed and documented to provide absolutely zero guarantees of anything. (From the man page: "[after calling fsync], if the drive loses power or the OS crashes, the application may find that only some or none of their data was written [...] This is not a theoretical edge case. This scenario is easily reproduced with real world workloads".) A real fsync call is still available, it's just hidden in fcntl. While we're speaking of it, Python's os.fsync on OS X should probably call that fcntl instead of fsync. Right now os.fsync is effectively a no-op on OS X. (Apparently this is https://bugs.python.org/issue11877 but that issue seems to have petered out years ago in confusion.) -n -- Nathaniel J. Smith -- https://vorpus.org
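For completeness, the "real" fsync on OS X is reachable from Python via the fcntl module (POSIX-only; F_FULLFSYNC is defined only on platforms that support it), along these lines:

    import fcntl
    import os

    def full_fsync(fd):
        # On OS X, F_FULLFSYNC asks the drive to flush its cache all the
        # way to permanent storage; plain os.fsync() does not do that there.
        if hasattr(fcntl, "F_FULLFSYNC"):
            fcntl.fcntl(fd, fcntl.F_FULLFSYNC)
        else:
            os.fsync(fd)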

On 23 March 2016 at 13:33, Andrew Barnert via Python-ideas <python-ideas@python.org> wrote:
A key part of the problem here is that from a structured design perspective, in-memory representation, serialisation and persistence are all separate concerns. If you're creating a domain specific language (and yes, I count R, Stata and MATLAB as domain specific), then you can make some reasonable assumptions for all of those, and hide more of the complexities from your users. In Python, for example, the Pandas IO methods are closer to what other data analysis and modelling environments are able to offer: http://pandas.pydata.org/pandas-docs/stable/io.html NumPy similarly has some dedicated IO routines: http://docs.scipy.org/doc/numpy/reference/routines.io.html However, for a general purpose language, providing convenient "I don't care about the details, just do something sensible" defaults gets trickier, as not only are suitable defaults often domain specific, you have a greater responsibility to help folks figure out when they have ventured into territory where those defaults are no longer appropriate. If you can't figure out a reasonable set of default behaviours, or can't figure out how to nudge people towards alternatives when they start hitting the limits of the default behaviour, then you're often better off ducking the question entirely and getting people to figure it out for themselves. Regards, Nick. -- Nick Coghlan | ncoghlan@gmail.com | Brisbane, Australia
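For instance, a sketch of the kind of one-step IO those libraries provide (assuming the pandas and NumPy packages are installed):

    import numpy as np
    import pandas as pd

    # One-step, domain-specific save/load, much like R or MATLAB:
    df = pd.DataFrame({"x": [1, 2, 3], "y": [4.0, 5.0, 6.0]})
    df.to_csv("results.csv", index=False)  # write
    df2 = pd.read_csv("results.csv")       # read back

    arr = np.arange(6).reshape(2, 3)
    np.savetxt("arr.txt", arr)             # plain-text array dump
    arr2 = np.loadtxt("arr.txt")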

semi OT, but: On Tue, Mar 22, 2016 at 8:33 PM, Andrew Barnert via Python-ideas < python-ideas@python.org> wrote:
+inf on that -- we really do need to get the entire stdlib to work with Path objects! -CHB

Not to knock too much on R, but this kind of thing was one of the things I always really hated about that language. Separating the operations of converting an object to a string and then writing/printing/sending/etc. that string is just so much more flexible. It reminds me of the many objects I worked with in R that had natural print methods, but did not have natural conversions to strings (at least nowhere near as easy as it should have been). There were so many times when having the operations as separate steps (e.g. having print just call __str__() or whatever) is so much better. I understand the argument that it feels more natural to certain people, but it really incentivizes bad design imo. Definitely a big -1 from me. On 03/22/2016 08:06 PM, Nick Eubank wrote:

On 23.03.2016 05:35, Thomas Nyberg wrote:
and inconvenient. I am no social scientist but I share the sentiment.

bash:

    echo "my string" >> my_file

python:

    with open('my_file', 'w') as f:
        f.write('my string')

Hmm. Python actually is best at avoiding special characters, but regarding files, it's still on the level of C.
Maybe I missed that, but when did Nick talk about non-strings being written to a file?
I understand the argument that it feels more natural to certain people, but it really incentivizes bad design imo.
I disagree. It incentivizes good design, as it forces you to 1) prepare your file name and 2) your data properly, and 3) not "invent" self-proclaimed-but-not-really-atomic writes. I can tell you from my experience with several seasoned Python developers that they regularly fail to implement atomic file operations. Just saying.
Definitely a big -1 from me.
+1 from me. Best, Sven

On Wed, Mar 23, 2016 at 6:20 PM, Sven R. Kunze <srkunze@mail.de> wrote:
bash: echo "my string" >> my_file python: with open('my_file', 'w') as f: f.write('my string')
...
I can tell you from my experience with several aged Python developers that they regularly fail to implement atomic file operations. Just saying.
What makes you think your bash example implements an atomic write? It actually performs an append and is therefore not equivalent to the Python code that follows it.

On 23.03.2016 23:30, Alexander Belopolsky wrote:
What makes you think that I think my bash example implements an atomic write? ;-) My point was the simplicity of the bash command. Make it as simple AND atomic in Python (with the file handling included) and people will drop bash immediately. Best, Sven

On 23/03/2016 23:27, Sven R. Kunze wrote:
In four decades I've never once felt the need for a one-liner that writes a string to a file. How do you drop something that you've never used and have no interest in? -- My fellow Pythonistas, ask not what our language can do for you, ask what you can do for our language. Mark Lawrence

participants (24)
- Alexander Belopolsky
- Andrew Barnert
- Brett Cannon
- Chris Angelico
- Chris Barker
- Chris Barker - NOAA Federal
- Eric V. Smith
- Ethan Furman
- Franklin? Lee
- Koos Zevenhoven
- Mark Lawrence
- Michael Selik
- Michel Desmoulin
- Nathaniel Smith
- Nick Coghlan
- Nick Eubank
- Paul Moore
- Stephen J. Turnbull
- Steven D'Aprano
- Sven R. Kunze
- Terry Reedy
- Thomas Nyberg
- Victor Stinner
- Émanuel Barry