An efficient split function

Mon May 10 11:07:49 EDT 1999

I have posted this message to comp.lang.c++.moderated,
comp.lang.perl.moderated, and this group in hopes of sparking
discussion about a very common programming idiom which I describe
below.  Though this is mainly concerned with a C++ solution, it does
touch on Python, so I thought it might be of interest.

I have made an effort to write an efficient split function in C++,
comparing it to similar versions written in Python and Perl.  I'm
posting this to the three language communities because I would like
help in answering two questions:  Is my approach in general valid?
and, Are the particular solutions I've come up with in each language
reasonable?

An efficient split function is part of the foundation of many programs
--- especially those that, e.g, summarize information in log files of
various sorts.  The other part that is quite important for performance
issues, as I have discovered, is the routine used to read in the "raw"
information.  The idiom used here, quite common I'm sure, might be
thought of as "Read a line; split on a separator character; process
the fields" (we might call this the "RSP" idiom).  My focus below is
on the first two parts of this idiom.

I have been testing a set of functions I have written to split
character strings (both NULL-terminated character arrays and C++
strings).  I needed to see what sort of performance improvement I
could get over two interpreted languages I use quite frequently,
Python (version 1.5.2) and Perl (version 5.004_04).

For comparison, I used the following Perl program:

    my $count = 0;
    my @v;
    while (<STDIN>) {
        @v = split(/\|/);
        $count++;
    }
    print "$count\n"; 

and the following Python program:

    import sys, string

    readline = sys.stdin.readline
    split = string.split

    count = 0
    while 1: 
        line = readline()
        if not line: break
        l = split(line, '|')
        count = count + 1

    print count

As can be seen, to substitute for the third part of the RSP idiom I
simply coded a count of the number of lines processed.

Surprisingly, to me, the Python version far outperformed the Perl
version.  Running on 1 million lines of input of 9 fields each, the
Python version ran in just under 20 seconds, the Perl version in just
under 40 seconds (this on a 400Mhz Pentium Linux box).

For the C++ split function, the first time through I tried a very
straightforward approach using the C++ string and vector classes (I'm
using the egcs compiler, version 1.1b).  The split function prototype
is

    int split(vector<string>&, const string&, char);

(I've posted the implementation of the split functions below) and the
main function is:

    main() {
        string line;
        vector<string> v;
        int count = 0;

        while (getline(cin, line)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in about 8.5 seconds, (21% of Perl, 42% of Python).  But, this
wasn't the sort of speedup I had in mind, so I kept plowing ahead.

Previous tests of mine comparing the getline() version that works
with C++ strings and that which works with character arrays showed me
that the latter was much faster, albeit less flexible.  So, keeping
the general split algorithm intact, I recoded the split function:

    int split(vector<string>&, const char*, char);

and the main routine:

    main() {
        char line[1024];
        vector<string> v;
        int count = 0;

        while (cin.getline(line, 1024)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in about 4.75 seconds (12% of Perl, 24% of Python).  Better, but
I still wanted more.  So I devised a split routine to operate on character
arrays, but to create a vector of "spans" instead of strings:

    typedef pair<const char*, const char*> span;

    int split(vector<span>&, const char*, char);

and a new main routine:

    main() {
        char line[1024];
        vector<span> v(30);
        int count = 0;

        while (cin.getline(line, 1024)) {
            split(v, line, '|');
            ++count;
        }
        cout << count << endl;
    }

This ran in 1.2 seconds (3% of Perl, 6% of Python), which sounded about
where I wanted the performance to be.

For comparison purposes, here are the three split routines:

    // Split a string on a given character into a vector of strings
    // The vector is passed by reference and cleared each time
    // The number of strings split out is returned
    int split(vector<string>& v, const string& str, char c)
    {
        v.clear();
        string::const_iterator s = str.begin();
        while (true) {
            string::const_iterator begin = s;

            while (*s != c && s != str.end()) { ++s; }

	    v.push_back(string(begin, s));

	    if (s == str.end()) {
                break;
            }

            if (++s == str.end()) {
                v.push_back("");
                break;
            }
        }
        return v.size();
    }

    // Split a NULL-terminated character array on a given character into
    // a vector of strings
    // The vector is passed by reference and cleared each time
    // The number of strings split out is returned
    int split(vector<string>& v, const char* s, char c)
    {
        v.clear();
        while (true) {
            const char* begin = s;

            while (*s != c && *s) { ++s; }

	    v.push_back(string(begin, s));

	    if (!*s) {
                break;
            }

            if (!*++s) {
                v.push_back("");
                break;
            }
        }
        return v.size();
    }

    // Represents a span of a character array
    typedef pair<const char*, const char*> span;

    // Convenience function to set a span
    inline void set_span(span& s, const char* b, const char* e)
    {
        s.first = b;
        s.second = e;
    }

    // Split a NULL-terminated character array on a given character into
    // a vector of spans
    // The vector is of constant size is not cleared each time
    // The number of spans split out is returned
    int split(vector<span>& v, const char* s, char c)
    {
        int i = 0;
        while (true) {
            const char* begin = s;

            while (*s != c && *s) { ++s; }

	    set_span(v[i++], begin, s);

	    if (!*s) {
                break;
            }

	    if (!*++s) {
                set_span(v[i++], 0, 0);
                break;
            }
        }
        return i;
    }

Each of these routines will split the following string:

    <field0>|<field1>|<field2>| ... |<fieldN>

where any of the N fields can be empty, into N+1 fields.

So, what do folks think of this?  Obviously, the particular span
approach above has the drawback that is relies both on a fixed-length
input buffer and a fixed-length vector of spans.  However, for my
purposes, this is safe enough and speed is paramount at this time.

I'm sure there are better ways to approach this, issues I haven't
thought of, speed gains to be found here and there, etc.  I'd be glad
to hear of them.

Bill
-- 
William S. Lear | Who is there that sees not that this inextricable labyrinth
r a e l @       | of reasons  of state was artfully invented, lest the people
d e j a .       | should  understand  their own  affairs, and, understanding,
c o m           | become inclined to conduct them?    ---William Godwin, 1793