[RFC][PATCH] optimise iptables interface matching

Eric Dumazet dada1 at cosmosbay.com
Wed Jun 6 08:13:13 CEST 2007


Philip Craig wrote:
> Henrik Nordstrom wrote:
>> On Tue 2007-05-29 at 10:24 +1000, Philip Craig wrote:
>>> I'll try that, but that will also prevent loop unrolling if you're
>>> using -funroll-loops.  Not sure if any builds use this, but the
>>> comment in the code implies some do.
>> My GCC unrolls that loop just fine...
>>
>> x86_64 gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-51)
> 
> You're right, I should check things before making statements.
> 
>> Here is a corrected version of the for loop:
>>
>> for (i = 0, ret = 0;  i < IFNAMSIZ/sizeof(unsigned long) && ((const unsigned long *)ipinfo->outiface_mask)[i]; i++) {
>>
>>
>> but for modern 64-bit CPUs I suspect the original is actually fastest as
>> IFNAMSIZ is only 16 bytes and fits in two parallel 64-bit operations..
> 
> For my ARM platform, changing the for loop is a win for no interface
> matches, or short interface names (I didn't test longer interface names).
> 
> But looking at the generated assembly for x86_64, this results in more
> instructions and branches, and I can't see this being a win.  (I'm not
> set up to profile this.) 
> 
> Also, the two optimisations are not mutually exclusive: one is to skip
> the whole comparison completely (including inversion), and one is to
> terminate the for loop early.  So we can use your loop termination
> condition to skip the whole comparison too.
> 
> The attached patch is logically equivalent to my first (assuming a
> contiguous and zero padded mask), but it avoids messing with flags.  In
> practice, my profiling says it is slightly slower than the first patch
> for 0 interface matches, but slightly faster for 1 or 2.
> 
> Note: this patch (and the original patch) change the behaviour when
> inverting a zero length mask.  That is for either of:
> 	iptables -A INPUT ! -i +
> 	iptables -A INPUT ! -i ""
> Not sure if this matters.
> 
> I also tried testing just the first byte, instead of a long, but that
> was slower.
> 
> 

In my analysis (profiling with oprofile), I found that the number of
instructions was not really the problem.

The big problem is the amount of data the CPU has to read to perform a
typical table lookup.

I consider the 16-byte masks (32 bytes per rule: one mask for iface, one for
oface) a waste of memory. On current CPUs, with 64-byte cache lines, memory
bandwidth is the limiting factor.

We could probably use a mask length instead (one byte for iface, one for
oface) to reduce the memory footprint and have a better chance of keeping the
tables in the CPU caches.
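
To make the mask-length idea concrete, here is a rough sketch in plain
userspace C. It is not the existing ip_tables layout: the struct and helper
names (iface_match, iface_match_init, iface_matches) are made up for
illustration, and it assumes the mask is always a contiguous prefix, so a
single byte can say how many leading bytes of the name to compare. Inversion
flags are left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define IFNAMSIZ 16	/* same value the kernel uses */

/* Hypothetical compact form: one length byte replaces the 16-byte mask. */
struct iface_match {
	char name[IFNAMSIZ];	/* zero padded pattern */
	unsigned char cmp_len;	/* leading bytes to compare, 0 = match any */
};

/*
 * Build a match from an iptables-style pattern:
 *   "eth0"      exact match (the terminating NUL is compared too)
 *   "eth+"      prefix match on "eth"
 *   "" or "+"   match any interface
 */
static void iface_match_init(struct iface_match *m, const char *pattern)
{
	size_t len = strlen(pattern);

	assert(len < IFNAMSIZ);
	memset(m, 0, sizeof(*m));
	if (len > 0 && pattern[len - 1] == '+') {
		memcpy(m->name, pattern, len - 1);
		m->cmp_len = (unsigned char)(len - 1);
	} else if (len > 0) {
		memcpy(m->name, pattern, len);
		m->cmp_len = (unsigned char)(len + 1);
	}
}

/* devname must point to a zero padded buffer of IFNAMSIZ bytes. */
static int iface_matches(const struct iface_match *m, const char *devname)
{
	return m->cmp_len == 0 ||
	       memcmp(m->name, devname, m->cmp_len) == 0;
}
```

With this form the "no interface specified" case is a test of one byte, and
the per-rule cost of the two masks would drop from 32 bytes to 2. Whether that
is a measurable win would still have to be profiled.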



