An efficient split function
William S. Lear
rael at see.sig
Mon May 10 11:07:49 EDT 1999
I have posted this message to comp.lang.c++.moderated,
comp.lang.perl.moderated, and this group in hopes of sparking
discussion about a very common programming idiom which I describe
below. Though this is mainly concerned with a C++ solution, it does
touch on Python, so I thought it might be of interest.
I have made an effort to write an efficient split function in C++,
comparing it to similar versions written in Python and Perl. I'm
posting this to the three language communities because I would like
help in answering two questions: Is my approach in general valid?
and, Are the particular solutions I've come up with in each language
reasonable?
An efficient split function is part of the foundation of many programs
--- especially those that, e.g, summarize information in log files of
various sorts. The other part that is quite important for performance
issues, as I have discovered, is the routine used to read in the "raw"
information. The idiom used here, quite common I'm sure, might be
thought of as "Read a line; split on a separator character; process
the fields" (we might call this the "RSP" idiom). My focus below is
on the first two parts of this idiom.
I have been testing a set of functions I have written to split
character strings (both NULL-terminated character arrays and C++
strings). I needed to see what sort of performance improvement I
could get over two interpreted languages I use quite frequently,
Python (version 1.5.2) and Perl (version 5.004_04).
For comparison, I used the following Perl program:
my $count = 0;
my @v;
while (<STDIN>) {
@v = split(/\|/);
$count++;
}
print "$count\n";
and the following Python program:
import sys, string
readline = sys.stdin.readline
split = string.split
count = 0
while 1:
line = readline()
if not line: break
l = split(line, '|')
count = count + 1
print count
As can be seen, to substitute for the third part of the RSP idiom I
simply coded a count of the number of lines processed.
Surprisingly, to me, the Python version far outperformed the Perl
version. Running on 1 million lines of input of 9 fields each, the
Python version ran in just under 20 seconds, the Perl version in just
under 40 seconds (this on a 400Mhz Pentium Linux box).
For the C++ split function, the first time through I tried a very
straightforward approach using the C++ string and vector classes (I'm
using the egcs compiler, version 1.1b). The split function prototype
is
int split(vector<string>&, const string&, char);
(I've posted the implementation of the split functions below) and the
main function is:
main() {
string line;
vector<string> v;
int count = 0;
while (getline(cin, line)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in about 8.5 seconds, (21% of Perl, 42% of Python). But, this
wasn't the sort of speedup I had in mind, so I kept plowing ahead.
Previous tests of mine comparing the getline() version that works
with C++ strings and that which works with character arrays showed me
that the latter was much faster, albeit less flexible. So, keeping
the general split algorithm intact, I recoded the split function:
int split(vector<string>&, const char*, char);
and the main routine:
main() {
char line[1024];
vector<string> v;
int count = 0;
while (cin.getline(line, 1024)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in about 4.75 seconds (12% of Perl, 24% of Python). Better, but
I still wanted more. So I devised a split routine to operate on character
arrays, but to create a vector of "spans" instead of strings:
typedef pair<const char*, const char*> span;
int split(vector<span>&, const char*, char);
and a new main routine:
main() {
char line[1024];
vector<span> v(30);
int count = 0;
while (cin.getline(line, 1024)) {
split(v, line, '|');
++count;
}
cout << count << endl;
}
This ran in 1.2 seconds (3% of Perl, 6% of Python), which sounded about
where I wanted the performance to be.
For comparison purposes, here are the three split routines:
// Split a string on a given character into a vector of strings
// The vector is passed by reference and cleared each time
// The number of strings split out is returned
int split(vector<string>& v, const string& str, char c)
{
v.clear();
string::const_iterator s = str.begin();
while (true) {
string::const_iterator begin = s;
while (*s != c && s != str.end()) { ++s; }
v.push_back(string(begin, s));
if (s == str.end()) {
break;
}
if (++s == str.end()) {
v.push_back("");
break;
}
}
return v.size();
}
// Split a NULL-terminated character array on a given character into
// a vector of strings
// The vector is passed by reference and cleared each time
// The number of strings split out is returned
int split(vector<string>& v, const char* s, char c)
{
v.clear();
while (true) {
const char* begin = s;
while (*s != c && *s) { ++s; }
v.push_back(string(begin, s));
if (!*s) {
break;
}
if (!*++s) {
v.push_back("");
break;
}
}
return v.size();
}
// Represents a span of a character array
typedef pair<const char*, const char*> span;
// Convenience function to set a span
inline void set_span(span& s, const char* b, const char* e)
{
s.first = b;
s.second = e;
}
// Split a NULL-terminated character array on a given character into
// a vector of spans
// The vector is of constant size is not cleared each time
// The number of spans split out is returned
int split(vector<span>& v, const char* s, char c)
{
int i = 0;
while (true) {
const char* begin = s;
while (*s != c && *s) { ++s; }
set_span(v[i++], begin, s);
if (!*s) {
break;
}
if (!*++s) {
set_span(v[i++], 0, 0);
break;
}
}
return i;
}
Each of these routines will split the following string:
<field0>|<field1>|<field2>| ... |<fieldN>
where any of the N fields can be empty, into N+1 fields.
So, what do folks think of this? Obviously, the particular span
approach above has the drawback that is relies both on a fixed-length
input buffer and a fixed-length vector of spans. However, for my
purposes, this is safe enough and speed is paramount at this time.
I'm sure there are better ways to approach this, issues I haven't
thought of, speed gains to be found here and there, etc. I'd be glad
to hear of them.
Bill
--
William S. Lear | Who is there that sees not that this inextricable labyrinth
r a e l @ | of reasons of state was artfully invented, lest the people
d e j a . | should understand their own affairs, and, understanding,
c o m | become inclined to conduct them? ---William Godwin, 1793
More information about the Python-list
mailing list