tail

Sun May 8 12:05:11 EDT 2022

I think I've _almost_ found a simpler, general way:

import os

_lf = "\n"
_cr = "\r"

def tail(filepath, n=10, newline=None, encoding=None, chunk_size=100):
    n_chunk_size = n * chunk_size
    pos = os.stat(filepath).st_size
    chunk_line_pos = -1
    lines_not_found = n

    with open(filepath, newline=newline, encoding=encoding) as f:
        text = ""

        hard_mode = False

        if newline is None:
            newline = _lf
        elif newline == "":
            hard_mode = True

        if hard_mode:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

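                f.seek(pos)
                text = f.read()
                # Each pass rescans the whole tail, so restart the count.
                lines_not_found = n
                lf_after = False

                for i, char in enumerate(reversed(text)):
                    if char == _cr and lf_after:
                        # "\r" of a "\r\n" pair: its "\n" was already counted.
                        lf_after = False
                        continue

                    lf_after = char == _lf

                    # A newline at the very end of the text terminates the
                    # last line rather than starting a new one, so skip it.
                    if char in (_lf, _cr) and i != 0:
                        lines_not_found -= 1

                    if lines_not_found == 0:
                        # The wanted lines start right after this newline.
                        chunk_line_pos = len(text) - i
                        break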

                if lines_not_found == 0:
                    break
        else:
            while pos != 0:
                pos -= n_chunk_size

                if pos < 0:
                    pos = 0

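                f.seek(pos)
                text = f.read()
                # Each pass rescans the whole tail, so restart the count.
                lines_not_found = n

                # A newline at the very end of the text terminates the last
                # line rather than starting a new one, so skip it.
                end = len(text)
                if text.endswith(newline):
                    end -= len(newline)

                # rfind also copes with a two-character newline ("\r\n").
                while lines_not_found > 0:
                    end = text.rfind(newline, 0, end)
                    if end == -1:
                        break
                    lines_not_found -= 1
                    if lines_not_found == 0:
                        chunk_line_pos = end + len(newline)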

                if lines_not_found == 0:
                    break

    if chunk_line_pos == -1:
        chunk_line_pos = 0

    return text[chunk_line_pos:]

In short, the file is always opened in text mode. It is read from the end
in bigger and bigger chunks, until the start of the file is reached or all
the lines are found.
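
To make the growing window concrete, here is a small sketch of the offsets
the loop seeks to. The helper name window_starts is made up for
illustration; it only mirrors the "pos -= n_chunk_size" arithmetic in the
function above.

```python
def window_starts(file_size, step):
    """Yield the offsets a tail-style loop would seek to, last window first.

    Hypothetical helper: each offset marks the start of a read that runs
    to the end of the file, so every window contains the previous one.
    """
    pos = file_size
    while pos != 0:
        pos = max(pos - step, 0)
        yield pos


# A 2500-byte file scanned with n=10 and chunk_size=100 (step of 1000).
print(list(window_starts(2500, 10 * 100)))  # [1500, 500, 0]
```

Since every read runs to the end of the file, the last window is the only
one that matters for the result; the earlier ones are discarded.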

Why? Because in encodings that use more than one byte per character,
reading a chunk of n bytes and then the chunk before it can split a
character across the two chunks, leaving some of its bytes in each.
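
A minimal illustration of that split, assuming UTF-8 and a made-up helper
called decodes_cleanly. It cuts an encoded string between the two bytes of
one character and shows that neither half decodes on its own.

```python
def decodes_cleanly(data, encoding="utf-8"):
    """Return True if data decodes on its own, False otherwise."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


raw = "na\u00efve\n".encode("utf-8")  # U+00EF takes two bytes: b"\xc3\xaf"
left, right = raw[:3], raw[3:]        # the cut falls between those two bytes

print(decodes_cleanly(raw))    # True
print(decodes_cleanly(left))   # False: ends with a dangling lead byte
print(decodes_cleanly(right))  # False: starts with a continuation byte
```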

I think one could read chunk by chunk and handle the chunk junction
problem explicitly, and I suppose the code would be faster that way. Still,
this trick seems quite fast and it's a lot simpler.
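
For what it's worth, here is one way the junction test might look. This is
only a sketch and it assumes UTF-8, where continuation bytes always match
the bit pattern 10xxxxxx; other multi-byte encodings would need a different
test. The name split_at_char_boundary is hypothetical.

```python
def split_at_char_boundary(chunk):
    """Split a UTF-8 byte chunk into (fragment, aligned).

    fragment holds the leading continuation bytes (0b10xxxxxx) that belong
    to a character started in the previous chunk; aligned starts on a
    character boundary.
    """
    i = 0
    while i < len(chunk) and (chunk[i] & 0xC0) == 0x80:
        i += 1
    return chunk[:i], chunk[i:]


raw = "a\u20acb".encode("utf-8")  # the euro sign is b"\xe2\x82\xac"
fragment, aligned = split_at_char_boundary(raw[2:])
print(fragment, aligned)  # b'\x82\xac' b'b'
```

The fragment would then be appended to the chunk read just before this one,
so that the character is decoded whole.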

The final result is taken from the decoded chunk, not from the file, so
there are no problems of misalignment between bytes and text. Furthermore,
the built-in encoding parameter is used, so this should work with all
encodings (untested).
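
One thing that may be worth testing is the start of each read. The
documentation only guarantees text-mode seek() for offsets returned by
tell() (or zero), and in CPython an offset that lands inside a multi-byte
character typically makes the following read() raise UnicodeDecodeError.
The sketch below probes every byte offset of a tiny UTF-8 file; read_from
and check_offsets are made-up names.

```python
import os
import tempfile


def read_from(path, offset, encoding="utf-8"):
    """Seek to a byte offset in a text-mode file and read to the end."""
    with open(path, encoding=encoding) as f:
        f.seek(offset)
        return f.read()


def check_offsets(content="\u20ac\nx\n"):
    """Map each byte offset to the text read from it, or None on failure."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "w", encoding="utf-8", newline="") as f:
            f.write(content)
        results = {}
        for offset in range(os.stat(path).st_size):
            try:
                results[offset] = read_from(path, offset)
            except UnicodeDecodeError:
                # The offset fell inside a multi-byte character.
                results[offset] = None
        return results
    finally:
        os.remove(path)


print(check_offsets())
```

If that matters, a possible workaround is to catch the error and retry from
an offset one byte earlier, though that is untested as well.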

Furthermore, a newline parameter can be specified, as in open(). If it's
the empty string, things get a little more complicated, but I suppose the
code is clear. That case is untested too: I only tested with a UTF-8 Linux
file.
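
As a reminder of why the empty string is the hard case, here is a short
sketch of how open() treats the newline parameter when reading. The helper
read_with_newline is hypothetical and only exists for this demonstration.

```python
import os
import tempfile


def read_with_newline(raw, newline):
    """Write raw bytes to a temporary file and read them back as text."""
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(raw)
        with open(path, newline=newline, encoding="utf-8") as f:
            return f.read()
    finally:
        os.remove(path)


raw = b"a\r\nb\rc\n"
print(repr(read_with_newline(raw, None)))  # 'a\nb\nc\n'
print(repr(read_with_newline(raw, "")))    # 'a\r\nb\rc\n'
```

With newline=None every line ending arrives as "\n", so a single-character
search is enough. With newline="" the three endings "\r", "\n" and "\r\n"
all survive and have to be told apart.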

Do you think there's a chance of getting this function added as a method
of the file object in CPython? The method for a file object opened in bytes
mode is simpler, since there's no encoding and newline is only \n in that
case.
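
A rough sketch of what the bytes-mode version might look like, written as a
free function rather than a method. The name tail_bytes and the default
chunk_size are assumptions, not a proposal for the actual API.

```python
import os


def tail_bytes(filepath, n=10, chunk_size=4096):
    """Return the last n lines of a file as bytes (hypothetical sketch)."""
    if n <= 0:
        return b""
    with open(filepath, "rb") as f:
        pos = f.seek(0, os.SEEK_END)
        data = b""
        # Read fixed-size chunks backwards and prepend them. Bytes have no
        # character boundaries to respect, so nothing is read twice.
        # n + 1 newlines are enough even when the file ends with one.
        while pos > 0 and data.count(b"\n") <= n:
            step = min(chunk_size, pos)
            pos -= step
            f.seek(pos)
            data = f.read(step) + data

    # A trailing newline terminates the last line, so skip it.
    end = len(data)
    if data.endswith(b"\n"):
        end -= 1
    for _ in range(n):
        end = data.rfind(b"\n", 0, end)
        if end == -1:
            return data  # fewer than n lines: return everything read
    return data[end + 1:]
```

For example, on a file containing b"1\n2\n3\n4\n5\n", tail_bytes(path, 2)
should return b"4\n5\n".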