Hashtrie testing2 (was: Re: [PATCH 4/4] first conntrack ID must be 1 not 2)

Martin Josefsson gandalf at wlug.westbo.se
Sun Mar 5 18:48:58 CET 2006


On Sun, 2006-03-05 at 12:24 +0100, Jozsef Kadlecsik wrote:
> Hi Martin,

Hi Jozsef

> HASHSIFT (i.e. HASHNUM) defines the hash size of the hashtrie.
> But the hashtrie - according to your tests - are much "wider" than
> "deep". So there might be no point in widening the hash.

HASHSHIFT and HASHNUM controls the number of struct hashentry per child.
The original idea was to allocate 1 page (4kB) for each child but I got
better results with 2kB, which means it isn't as wide as it was from the
start.

> > > - other HASHALIGN, hashbit_t values: (32, u8), (64, u16) and (64, u32).
> >
> > Yes this needs more testing as well.
> >
> > >   The current (32, u8) doesn't look optimal on 64bit CPUs, (64, u16) seems
> > >   to be the best, but without testing it's hard to choose.
> >
> > Currently I don't have any 64bit cpus to test things on.

HASHALIGN is the size of struct hashentry, the idea is that HASHALIGN
should be equal to the cacheline size of the cpu it's running on. That
way you only get one cachemiss per level in the tree.
HASHALIGN is set to 32 bytes just because my laptop has 32byte
cachelines, other cpus has 64bytes or even 128bytes.

hashbit_t is an u8 because in my tests I havn't see any gain by going to
more bits.
The only purpose of hashbits is to "cheat" a bit during lookup, store a
part of the hash-value in the datastructure so you can easily determine
if a member even can be a possible match by comparing the hashbits to
the hashvalue you are searching for, this eliminates a lot of
cachemisses (without this the datastructure will behave more like a
linked list).

I've tested with diffrent number of bits for hashbits but I havn't seen
any real gain by going over 7 bits which is what is currently used, and
the last bit is used as status bit for ASSURED.

> With the default (32, u8) settings NUMENTRY looks too small on 64bit
> CPUs:

Yes it's too small, luckily I don't know of any 64bit cpus that have
32byte cachelines :)
Lets say we keep hashbits as an u8 then we get the following results for
diffrent cacheline sizes (I believe the intel p4 has 128byte cachelines)

HASHALIGN = 32
hashbits_t = u8

    32bit CPU:  NUMENTRY = (32 - 4)/(1+4) = 5
                PADNUM = 32 - 4 - 5*(1+4) = 3

    64bit CPU:  NUMENTRY = (32 - 8)/(1+8) = 2
                PADNUM = 32 - 8 - 2*(1+8) = 6


HASHALIGN = 64
hashbits_t = u8

    32bit CPU: NUMENTRY = (64 - 4)/(1 + 4) = 12
                       PADNUM = 64 - 4 - 12*(1 + 4) = 0

    64bit CPU: NUMENTRY = (64 - 8)/(1 + 8) = 6
                       PADNUM = 64 - 8 - 6*(1 + 8) = 2

HASHALIGN = 128
hashbits_t = u8

    32bit CPU: NUMENTRY = (128 - 4)/(1 +4) = 24
                       PADNUM = 128 - 4 - 24*(1 + 4) = 4

    64bit CPU: NUMENTRY = (128 - 8)/(1 + 8) = 13
                       PADNUM = 128 - 8 - 13*(1 + 8) = 3

When HASHALIGN goes up we need to reduce HASHSHIFT in order to make sure
we don't try to allocate > 1 page (usually 4kB) of memory.

> > > - DoS against hashtrie by non-random tuples: single fixed destination IP,
> > >   port and successive source IP, port numbers. (I don't think the current
> > >   max 7 levels (childs) can survive such an attack.)
[snip]
> > 2.4 million entries in the hashtrie with fixed dstip, port and 1024
> > srcip's with random ports and a maxdepth of 3. I'd say the jenkins hash
> > is doing its job quite nicely, wasn't expecting such good results.
> 
> The hashtrie tends to grow wider and not deeper:
> 
> level:		childs (max)	entries (max)
> 0		64		320
> 1		4160		20800
> 2		266304		1331520
> 3		17043520	85217600

It grows wider if and only if we get hashvalues that aren't too similar
in the high bits. If that happens we grow deeper instead of wider. But
it seems the jenkins hash is really good at distributing the changes
over all of the bits in the hashvalue. (it's just a little bit on the
heavy side, it uses plenty of cpu)

> > I'm almost starting to think there's a major bug in there somewhere.
> > I need to write some code to walk the tree and calculate the usage of
> > each level to see that it's actually 100% for all buckets that has a
> > child.
> 
> Yes, that could help to spot the problem. But if there are clashes
> (due to using to small parts of the hash key, i.e  hashbits_t = u8), then
> the node can be fairly low utilized and still there is a need to expand by
> new childs.

Small hashbits_t doesn't have anything to do with it, hashbits_t is just
for accelerating lookups. If we have a low number of struct hashentry
per child-node we'll probably fill them up faster since there's fewer of
them to be filled. And if we have a low NUMENTRY they will also be
filled faster.

> > Couldn't this lead to the situation where we evict an entry early on the
> > path, and then that slot gets reused for another entry that's also
> > unassured, and it repeats...
> 
> You mean, we could always evict the newest unassured entries instead of
> the oldest ones? The algorithm could be extended so that we'd walk the
> whole tree and register the oldest unassured entry in the path. I'm more
> worried that we'd check too litle number of entries as the tree is shorter
> than I thought.

The problem is how do you know which entry along a path is the oldest?
Entries aren't always added to the front like with the current linked
lists in the hashtable.

> > > I have been thinking on wether we could use simply, separated, unordered
> > > locking:
> [...]
> > The only case I can think about where it might matter is the case of
> > simultaneous open from both sides with the same source/destination
> > ports, dns (some clients issue requests from port 53), games and ipsec
> > isakmp come to mind.
> 
> That can be fixed by checking that we undo the insertion for the same
> conntrack entry we were unable to add at the second step.

What I meant is the following situation:

clients A & B are both located behind NAT and are trying to connect to
each other.
Both bind to the same port, lets say X and try to connect to each others
external ip.
Currently if the packets from both clients arrive at the same time at an
SMP firewall we add a conntrack entry for one of the clients but drop
the second one because of the global lock.
With Rustys lock ordering we keep this behaviour since both buckets are
locked in the same order for both packets.
With the "unordered independent locking" of buckets we might end up
with:

these happen at the same time:
packet from client A hashes into bucket C, lock and add entry.
packet from client B hashes into bucket D, lock and add entry.

and after that these happen at the same time:
packet from client A hashes into bucket D, lock and find entry, remove
entry in bucket C
packet from client B hashes into bucket C, lock and find entry, remove
entry in bucket D

This leads to no conntrack entry created for any of the clients
connection attempts.

The solution would be "ordered independent locking" which locks and adds
entries based on bucket number that the tuplex hash into.
That way we still keep the current behaviour while simplifying the
locking, no chance of deadlocks.

-- 
/Martin
-------------- next part --------------
A non-text attachment was scrubbed...
Name: not available
Type: application/pgp-signature
Size: 189 bytes
Desc: This is a digitally signed message part
Url : /pipermail/netfilter-devel/attachments/20060305/7bc29f5e/attachment.pgp


More information about the netfilter-devel mailing list