[issue25495] binascii documentation incorrect
New submission from Mouse: binascii b2a_base64() documentation says: The length of data should be at most 57 to adhere to the base64 standard. This is incorrect, because there is no base64 standard that restricts the length of input data, especially to such a small value. What RFC4648 (that superseded RFC3548 that your documentation still keeps referring to) actually says is that MIME enforces the limit ofthe OUTPUT LINE length at 76, but NOT of the entire output, and certainly not of the entire input. Please correct the documentation, making it conformant with what the ACTUAL base64 standard says. See https://en.wikipedia.org/wiki/Base64 and https://tools.ietf.org/html/rfc4648 Thanks! ---------- assignee: docs@python components: Documentation messages: 253572 nosy: docs@python, mouse07410 priority: normal severity: normal status: open title: binascii documentation incorrect versions: Python 3.5 _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
R. David Murray added the comment: I agree that the documentation is not optimal. To give you some background, binascii was primarily implemented to support the email module, and the standard it is referring to is in fact the MIME standard that references base64 (I believe at the time the independent base64 standard did not exist). The number 57 is derived from the fact that 57 * 4/3 = 76; that is, input data of length 57 will result in an encoded line that is equal to the maximum recommended line length. Thus in encoding an email the email package (originally, it no longer does this) passed the data in in 57 byte chunks and appended the resulting lines to the output buffer. So, this documentation is historically correct, but no longer particularly useful. Suggested improvements are welcome. This state of affairs exists because the binascii module doesn't really have a current maintainer. Someday I'd love to have enough time to pick it up, since I maintain email and it is still used by email (and could be used better, with some module improvements). ---------- nosy: +r.david.murray _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: Yes I know where this came from. :-) Here is my proposed change. Replace the statement The length of data should be at most 57 to adhere to the base64 standard. with: To be MIME-compliant, the Base64 output (as defined in RFC4648) should be broken into lines of at most 76 characters long. This post-processing of the output is the responsibility of the caller. Note that the original PEM context-transfer encoding limited line length to 64 characters. Would this change be agreeable to you? ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Stéphane Wirtel added the comment: Here is a patch with the submitted description. ---------- keywords: +patch nosy: +matrixise Added file: http://bugs.python.org/file40878/issue25495.patch _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: Thank you for the quick turn-around, and for taking care of this issue! ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: Thanks for bringing this up, it has often bugged me. My understanding, based on the original wording and places where this is used, is that the data is often pre-processed into 57-byte chunks, rather than post-processing it. Maybe it is worthwhile restoring that info. It should help understanding the presence of the newline parameter (or why a newline is always added in 3.5). Also, the link between RFC 4648 and this function could be made even more explicit. Maybe move “as defined” into the first sentence, or change “the Base64 output” to “the function’s output”. ---------- nosy: +martin.panter stage: -> patch review versions: +Python 2.7, Python 3.4, Python 3.6 _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: As far as I remember, the data was not "originally processed in 57-byte chunks". I've been around the first PEM and MIME standards and discussions (and code, though not in Python, which wasn't around then) to be in position to know. :) Whether the user prefers to process data in chunks or not, is up to the user. Not to mention that PEM is long gone, and MIME also changed somewhat. The link between this function and RFC4648 can and should be more explicit, but I think just referring to it is enough. Do you have a recommendation for additional info to explain newline issues? Yes, changing "Base64 output" to "function output" makes perfect sense. Thanks! ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: I was only referring to the original Python documentation and library. See the base64.encode() implementation for an example which does do this 57-byte pre-chunking. Simplified: MAXLINESIZE = 76 # Excluding the CRLF MAXBINSIZE = (MAXLINESIZE//4)*3 # 57 ... while True: s = input.read(MAXBINSIZE) if not s: break line = binascii.b2a_base64(s) output.write(line) Here’s my attempt to rewrite the doc (3.6 version): ''' Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true. To be MIME-compliant, base 64 output should be broken into lines at most 76 characters long. One way to do this is to call this function with 57-byte chunks and ``newline=True``. Also, the original PEM context-transfer encoding limited the line length to 64 characters. ''' But if PEM is long gone as you say, perhaps we don’t need that last sentence? ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: 1. I concede knowing nothing about the early Python library implementation, functionality, or even purpose. 2. I don't think it makes sense now to either refer to PEM. We'd be two decades too late for that (well, 27 years, to be precise :). See https://en.wikipedia.org/wiki/Privacy-enhanced_Electronic_Mail 3. I don't think we are in position to tell programmers how to split a string of characters into 76-long chunks. Not to mention that the example you gave is likely to suffer in performance (just count those function calls), compared to other methods, and won't reflect well on the authors. Here's one possible doc version: ''' Convert binary data to the base 64 encoding defined in :rfc:`4648`. The return value includes a trailing newline ``b"\n"`` if *newline* is true. If the output is used as Base64 transfer encoding for MIME (:rfc: 2045), base 64 output should be broken into lines at most 76 characters long to be compliant. Base64 encoding standard does not limit the maximum encoded line length. ''' ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
R. David Murray added the comment: Add a parenthetical "(57 bytes of the input per line)" and I'll be happy with that. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: Let's not insinuate anything about the input. This is about what constraints on the OUTPUT MAY be there, not a tutorial from the 80-ties on how one might accomplish it. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: And even those constraints depend on the use. E.g. X.509 does not have those. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
R. David Murray added the comment: Please take a look at the Examples section of this: http://perldoc.perl.org/MIME/Base64.html Looks kind of like Martin's suggestion :) ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: 1. I am OK with the following text, modeling referred Perldoc: b2a_base64( $bytes, $eol ); Encode data by calling the encode_base64() function. The first argument is the byte string to encode. The second argument is optional, and provides the line-ending sequence to use. When it is given, the returned encoded string is broken into lines of no more than 76 characters each and it will end with $eol unless it is empty. Pass an empty string, or no second argument at all if you do not want the encoded string to be broken into lines. 2. I already had people telling me that "Python-3 doc prohibits input longer than 57 bytes, even though it doesn't currently enforce it". Please help putting end to spreading of this confusion. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: FWIW that Perl function looks like it does the line splitting for you. It mentions “chunks that are a multiple of 57 bytes”. The Python function does not do any line splitting. You have to use base64.encodebytes(), codecs.encode(encoding="base64") or perhaps something in the email package (or user code) for that. I think we all agree that there is no hard limit of 57. I have avoided this function in the past due to the documentation. The question is whether the documentation should mention that number in a more accurate context, or not at all. Personally I don’t see much harm in mentioning the 57-byte input chunking, as long as it is obvious it is not the only option. I don’t have a strong view; I am just trying to be conservative. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: The harm in mentioning the 57-byte chunking is that so far it successfully confused people. b2a_base64() function is not coupled to MIME. It has no constraints on either its input, or its output. *IF* it is used by (in) a MIME application, then the caller may want to make its output RFC 2045-compliant, by whatever way he chooses. Giving (an unwelcome) advice to a writer of one specific application is in my opinion completely out of scope here. Justification that it used to matter 25 years ago and therefore should be kept here doesn't make sense to me. I strongly insist that this "chunking" thing does not belong, and must be removed. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: Perhaps we can focus on the Python 2 version where there is always a newline appended. Here is a possible patch. ---------- Added file: http://bugs.python.org/file40953/issue25495.base64.2.7.patch _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: Unfortunately, NO. The problem (and this bug report) is for Python-3 documentation, so trying to address it in Python-2 rather than in Python-3 does not make sense. We seem to both understand and agree that there is no length limitation on b2a_base64() input, either recommended or enforced - contrary to what the current Python-3 documentation implies. We understand that *if* the *output* of this function is intended for use in MIME (rather than X.509 or whatever else Base64 is good for), then the caller should do other things besides calling b2a_base64(), and in all likelihood the caller is already aware of that - after all, if he figured that he needs Base64 in his stuff, he probably knows something about what MIME standards say and require?. I repeat my original complaint: Python-3 documentation is buggy because it implies a restriction on the input that is not there. This reference should be removed from there because it confuses people. I've talked to those confused personally, so this is first-hand. I refer you to the original msg253572 of this bug report. If you want to write a MIME-in-Python tutorial, it is up to you - but b2a_base64() does not seem to be the right place for it. (And I'd rather see an X.509 tutorial if you're dead set on writing something besides strict plain b2a_base64() doc. :-) ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: To add: I do not understand your attachment to that 57 "...(exactly 57 bytes of input data per line)", and request that this parenthesized sentence is removed from your Python-2.7 doc patch. Please give the reader the benefit of the doubt, and allow that *if* he wants to repeatedly call b2a_base64() instead of splitting its output - the ability to compute (76 * 3 / 4) is within his skill level. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Georg Brandl added the comment: issue25495.base64.2.7.patch looks good to me. A similar patch can be adapted for 3.x. ---------- nosy: +georg.brandl _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: Mouse, I know you originally opened this against 3.5. Apart from the module description at the bottom, my patch should be valid for 3.5 also. The relevant wording is identical to 2.7. I have resisted removing the magic number 57 for a couple of reasons. Reading existing code that uses this number may be harder. David said he would be happier with it kept. I believed we could solve your original complaint and explain why the number was really there at the same time. It helps explain how the function was originally to be used, and why the newline is appended. Anyway, I think it is best if I let this go, and someone else pick it up. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment:
my patch should be valid for 3.5 also. The relevant wording is identical to 2.7.
OK.
I have resisted removing the magic number 57 for a couple of reasons. Reading existing code that uses this number may be harder.
You expect to see "existing code that uses this number" in Python-3.5+? Interesting... (Care to point me at a couple of samples of such "existing" Python-3 code?) And you expect that the main info source for understanding the reason behind that "57" (assuming this function is invoked that way, as opposed to splitting the output :) would be the doc for this function, rather than the main program, or RFC 2045, or...? Fine.
It helps explain how the function was originally to be used, and why the newline is appended.
Pardon me, but why do you think anybody would care...? There are tons of functions, old and new, with more new ones popping up fast enough. I'd really envy a person who has time to enjoy history of one minuscule function of an old (albeit still useful :) library. OK. You think a history of this function should be documented - fine. I don't need it (and don't think anybody else wants to read it either), but it's not my doc or my decision. Just get the darn bug fixed. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Mouse added the comment: Status...? ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
R. David Murray added the comment: See also Issue 1753718. I will try to review both of these issues soon. ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Roundup Robot added the comment: New changeset 7b137466e879 by R David Murray in branch '2.7': #25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/7b137466e879 New changeset 3d5bf9bd15a3 by R David Murray in branch '3.4': #25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/3d5bf9bd15a3 New changeset ea9951598bab by R David Murray in branch '3.5': Merge: #25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/ea9951598bab New changeset 35650db28afe by R David Murray in branch 'default': Merge: #25495: Clarify b2a_base64 documentation vis 57 bytes. https://hg.python.org/cpython/rev/35650db28afe ---------- nosy: +python-dev _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
R. David Murray added the comment: I kept the 57 as part of an historical note explaining why the newline is added. I dropped that sentence in the 3.6 docs, where a keyword to control the apending of the newline has been added. ---------- resolution: -> fixed stage: patch review -> resolved status: open -> closed _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
Martin Panter added the comment: Thanks for fixing this up David ---------- _______________________________________ Python tracker <report@bugs.python.org> <http://bugs.python.org/issue25495> _______________________________________
participants (6)
-
Georg Brandl
-
Martin Panter
-
Mouse
-
R. David Murray
-
Roundup Robot
-
Stéphane Wirtel