tail

Sun May 8 15:47:18 EDT 2022

On Sun, 8 May 2022 at 20:31, Barry Scott <barry at barrys-emacs.org> wrote:
>
> > On 8 May 2022, at 17:05, Marco Sulla <Marco.Sulla.Python at gmail.com> wrote:
> >
> > def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
> >    n_chunk_size = n * chunk_size
>
> Why use tiny chunks? You can read 4 KiB as fast as 100 bytes, as it's typically the smallest size the file system will allocate.
> I tend to read in multiples of MiB, as it's near instant.

Well, I tested on a little file, a list of my preferred pizzas, so....

> >    pos = os.stat(filepath).st_size
>
> You cannot mix POSIX API with text mode.
> pos is in bytes from the start of the file.
> Text mode will be in code points. bytes != code points.
>
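
To make the bytes versus code points distinction concrete, here is a minimal sketch that writes a short string to a temporary file and compares st_size with the length of the decoded text. The helper name and the sample file name are made up for illustration, and UTF-8 is assumed as the encoding.

```python
import os
import tempfile


def byte_and_code_point_counts(text, encoding="utf-8"):
    """Write text to a temporary file; return (size in bytes, code points read back)."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "sample.txt")  # hypothetical file name

        # newline="" disables newline translation, so the counts are not
        # affected by the platform's line-ending convention.
        with open(path, "w", encoding=encoding, newline="") as f:
            f.write(text)

        size_in_bytes = os.stat(path).st_size

        with open(path, encoding=encoding, newline="") as f:
            n_code_points = len(f.read())

    return size_in_bytes, n_code_points


# "\u00e8" is a single code point (e with grave accent) that takes two bytes in UTF-8.
print(byte_and_code_point_counts("caff\u00e8\n"))
```

For pure ASCII content the two numbers coincide, which is why the difference is easy to miss when testing on a small ASCII file.
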
> >    chunk_line_pos = -1
> >    lines_not_found = n
> >
> >    with open(filepath, newline=newline, encoding=encoding) as f:
> >        text = ""
> >
> >        hard_mode = False
> >
> >        if newline is None:
> >            newline = _lf
> >        elif newline == "":
> >            hard_mode = True
> >
> >        if hard_mode:
> >            while pos != 0:
> >                pos -= n_chunk_size
> >
> >                if pos < 0:
> >                    pos = 0
> >
> >                f.seek(pos)
>
> In text mode you can only seek to a value returned by f.tell(); otherwise the behaviour is undefined.

Why? I don't see any recommendation about it in the docs:
https://docs.python.org/3/library/io.html#io.IOBase.seek
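
For context, the io.TextIOBase.seek entry in the same io documentation says that with SEEK_SET the offset must be zero or a number returned by TextIOBase.tell(), and that any other offset produces undefined behaviour. A minimal sketch of the supported pattern, assuming a throwaway UTF-8 file (the helper name and the file name are made up for illustration):

```python
import os
import tempfile


def seek_roundtrip_demo():
    """Save a position with tell() in text mode, then seek() back to it."""
    with tempfile.TemporaryDirectory() as tmp_dir:
        path = os.path.join(tmp_dir, "pizzas.txt")  # hypothetical file name

        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write("margherita\nnapoletana\ncapricciosa\n")

        with open(path, encoding="utf-8", newline="") as f:
            f.readline()            # consume the first line
            cookie = f.tell()       # opaque position, only meaningful to seek()
            second = f.readline()
            f.seek(cookie)          # supported: the offset came from tell()
            again = f.readline()

    return second, again


print(seek_roundtrip_demo())
```

The value returned by tell() in text mode is best treated as an opaque cookie rather than as a byte count to do arithmetic on.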

> >                text = f.read()
>
> You have no limit on the amount of data read.

I explained that previously. Anyway, chunk_size is small, so it's not
a great problem.

> >                lf_after = False
> >
> >                for i, char in enumerate(reversed(text)):
>
> Simply use text.rindex('\n') or text.rfind('\n') for speed.

I can't use them when I have to find either \n or \r. So I preferred to
simplify the code and use the for loop every time. Keep in mind
anyway that this is a prototype for a Python C API implementation
(builtin I hope, or a C extension if not).
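
For comparison, one possible way to keep str.rfind while still handling both characters is to search for each one and take the larger index. This is only a sketch, not the code of the prototype; the helper name is hypothetical.

```python
def last_line_break(text):
    """Return the index of the last "\\n" or "\\r" in text, or -1 if there is none."""
    # str.rfind returns -1 when the character is absent, so max() still
    # yields -1 when neither character occurs.
    return max(text.rfind("\n"), text.rfind("\r"))


print(last_line_break("margherita\rnapoletana\ncapricciosa"))
```

Note that for a "\r\n" pair this returns the index of the "\n", so a caller that wants to treat the pair as a single line break still has to look at the preceding character.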

> > In short, the file is always opened in text mode. The file is read from the
> > end in bigger and bigger chunks, until the start of the file is reached or
> > all the lines are found.
>
> It will fail if the contents are not ASCII.

Why?

> > Why? Because in encodings that have more than 1 byte per character, reading
> > a chunk of n bytes, then reading the previous chunk, can split a character
> > between the two chunks.
>
> No, it cannot. Text mode only knows how to return code points. Now if you are in
> binary mode it could be split, but you are not in binary mode so it cannot.

From the docs:

seek(offset, whence=SEEK_SET)
Change the stream position to the given byte offset.
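
If the offset really is a byte offset, it can land in the middle of a multi-byte character. A minimal sketch of that situation, working on a bytes object rather than on a text-mode file, and assuming UTF-8 (the helper name is hypothetical):

```python
def decodes_cleanly(data, start, encoding="utf-8"):
    """Return True if data[start:] can be decoded without errors."""
    try:
        data[start:].decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# "\u00e8" (e with grave accent) is encoded as the two bytes C3 A8 in UTF-8.
data = "\u00e8!".encode("utf-8")

for start in range(len(data)):
    # Offset 1 falls between the two bytes of the first character.
    print(start, decodes_cleanly(data, start))
```

Offsets 0 and 2 fall on character boundaries, while offset 1 starts on a continuation byte and cannot be decoded.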

> > Do you think there are chances to get this function as a method of the file
> > object in CPython? The method for a file object opened in bytes mode is
> > simpler, since there's no encoding and newline is only \n in that case.
>
> State your requirements. Then see if your implementation meets them.

The method should return the last n lines from a file object.
If the file object is in text mode, the newline parameter must be honored.
If the file object is in binary mode, a newline is always b"\n", to be
consistent with readline.
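
To illustrate the binary-mode requirement only, here is a rough sketch that reads backwards in fixed-size chunks and splits on b"\n" alone. It is not the earlier implementation from this thread; the function name, the signature and the default chunk size are assumptions made for the example.

```python
import io
import os


def tail_binary(f, n=10, chunk_size=4096):
    """Return the last n lines of a seekable binary file object as a list of bytes."""
    if n <= 0:
        return []

    f.seek(0, os.SEEK_END)
    pos = f.tell()
    buf = b""

    # Read backwards until the buffer holds more than n newlines (so the
    # first of the last n lines is complete) or the start of the file is reached.
    while pos > 0 and buf.count(b"\n") <= n:
        step = min(chunk_size, pos)
        pos -= step
        f.seek(pos)
        buf = f.read(step) + buf

    # Only b"\n" ends a line here, as with readline() on a binary file.
    parts = buf.split(b"\n")
    lines = [part + b"\n" for part in parts[:-1]]
    if parts[-1]:
        lines.append(parts[-1])  # last line without a trailing newline

    return lines[-n:]


sample = io.BytesIO(b"one\ntwo\r\nthree\nfour")
print(tail_binary(sample, 2, chunk_size=3))
```

A lone b"\r" is left inside the line it belongs to, which matches the requirement that a newline in binary mode is always b"\n".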

I suppose the current implementation of tail satisfies the
requirements for text mode. The previous one satisfied binary mode.

Anyway, apart from my implementation, I'm curious whether you think a
tail method is worth adding to the builtin file objects in
CPython.