[Expat-bugs] [ expat-Patches-699487 ] Hash table improvement

Tue Apr 1 09:50:03 EST 2003

Patches item #699487, was opened at 2003-03-07 11:31
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=310127&aid=699487&group_id=10127

Category: None
Group: None
Status: Open
Resolution: None
Priority: 5
Submitted By: Karl Waclawek (kwaclaw)
Assigned to: Karl Waclawek (kwaclaw)
Summary: Hash table improvement

Initial Comment:
This patch addresses two areas:

1) The current hash function ignores the high byte
    of UTF-16 characters. This is not good.
2) There is a cheap way to switch from the current
    method of linear probing (to resolve collisions)
    to the slightly better double hashing.

Ad 1) hashing of wide characters:
Currently, the algorithm is like this:
  h = h * 33 + (unsigned char)c;
where h is the hash value and c is a character
in the string argument (char or wchar_t).

It seems this hashing algorithm belongs to a family
of algorithms of the type:
  h = h * <some number> + character
where <some number> is taken froma list of tried
and proven values like 31, 33, 37, 41, 65599 which
should be (but are not always) prime numbers.

I added a wchar_t version (#ifdef XML_UNICODE)
of this type: h = h * 65599 + (unsigned short)c;

Ad 2) double hashing:
Currently the hash table index is calculated from 
the hash value by masking out the lower bits
according to table size. This works because the
table size is a power of 2. But this also means that
some bits of the hash value are unused. From 
these we  can therefore generate the second hash 
value with little extra effort.

The patch also updates the inlined hash table 
implementation used for detecting duplicate 
prefixed attributes (in function storeAtts()).

See the attached file hashpatch1.diff
(to be applied against xmlparse.c rev. 1.109).

----------------------------------------------------------------------

>Comment By: Karl Waclawek (kwaclaw)
Date: 2003-04-01 12:50

Message:
Logged In: YES 
user_id=290026

The last change is not necessary, since the cast
to unsigned char takes care of overflow. It actually
introduces a bug, since for small sizes the step size
can now be larger than the table size.

We can change back and write it like this:

#define SECOND_HASH(hash, mask, power) \
  ((((hash) & ~(mask)) >> ((power) - 1)) & ((mask) >> 2))
#define PROBE_STEP(hash, mask, power) \
  ((unsigned char)((SECOND_HASH(hash, mask, power)) | 
1))

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-03-28 20:40

Message:
Logged In: YES 
user_id=290026

Further change for PROBE_STEP macro:
To prevent overflow (the result should fit into one Byte)
we replace "& (mask >> 2)" with "& (unsigned long)0xFF":

#define BYTE_MASK ((unsigned long)0xFF)
#define PROBE_STEP(hash, mask, power) \
  ((unsigned char)((((hash) & ~(mask)) >> ((power) - 1)) \
                   & BYTE_MASK) \
   | (unsigned char)(1))

which limits the maxium step size to 0xFF.
Large steps are not a good idea anyway, since that
might lead to cache misses.

----------------------------------------------------------------------

Comment By: Karl Waclawek (kwaclaw)
Date: 2003-03-24 13:29

Message:
Logged In: YES 
user_id=290026

Improvement to first patch version (hashpatch1.diff):

For calculating second hash, don't overwrite the
rightmost bit of the hash value's unused portion.
This means: change ">> (power)" to ">> ((power) - 1)":

#define PROBE_STEP(hash, mask, power) \
 ((unsigned char)((((hash) & ~(mask)) >> ((power) - 1)) & 
((mask) >> 2)) \
+   | (unsigned char)(1))

----------------------------------------------------------------------

You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=310127&aid=699487&group_id=10127