[RFC][PATCH] optimise iptables interface matching
Eric Dumazet
dada1 at cosmosbay.com
Wed Jun 6 08:13:13 CEST 2007
Philip Craig wrote:
> Henrik Nordstrom wrote:
>> On Tue, 2007-05-29 at 10:24 +1000, Philip Craig wrote:
>>> I'll try that, but that will also prevent loop unrolling if you're
>>> using -funroll-loops. Not sure if any builds use this, but the
>>> comment in the code implies some do.
>> My GCC unrolls that loop just fine...
>>
>> x86_64 gcc (GCC) 4.1.1 20070105 (Red Hat 4.1.1-51)
>
> You're right, I should check things before making statements.
>
>> Here is a corrected version of the for loop:
>>
>> for (i = 0, ret = 0; i < IFNAMSIZ/sizeof(unsigned long) && ((const unsigned long *)ipinfo->outiface_mask)[i]; i++) {
>>
>>
>> but for modern 64-bit CPUs I suspect the original is actually fastest as
>> IFNAMSIZ is only 16 bytes and fits in two parallel 64-bit operations..
>
> For my ARM platform, changing the for loop is a win for no interface
> matches, or short interface names (I didn't test longer interface names).
>
> But looking at the generated assembly for x86_64, this results in more
> instructions and branches, and I can't see this being a win. (I'm not
> set up to profile this.)
>
> Also, the two optimisations are not mutually exclusive: one is to skip
> the whole comparison completely (including inversion), and one is to
> terminate the for loop early. So we can use your loop termination
> condition to skip the whole comparison too.
>
> The attached patch is logically equivalent to my first (assuming a
> contiguous and zero padded mask), but it avoids messing with flags. In
> practice, my profiling says it is slightly slower than the first patch
> for 0 interface matches, but slightly faster for 1 or 2.
>
> Note: this patch (and the original patch) change the behaviour when
> inverting a zero length mask. That is for either of:
> iptables -A INPUT ! -i +
> iptables -A INPUT ! -i ""
> Not sure if this matters.
>
> I also tried testing just the first byte, instead of a long, but that
> was slower.
>
>
In my analysis (profiling with oprofile), I found that the instruction count
was not really the problem. The big problem is the amount of data the CPU has
to read to perform a typical table lookup.
I consider the 16-byte masks (32 bytes per rule: one for iface, one for oface)
a waste of memory. On current CPUs, with 64-byte cache lines, memory
bandwidth is the limiting factor.
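
To make the data cost concrete, here is a minimal userspace sketch of the
masked comparison discussed in the quoted text, including the early exit on
a zero mask word. The struct and function names are hypothetical stand-ins,
not the kernel's definitions, and the early exit assumes a contiguous,
zero-padded mask. Each interface test reads the device name, the rule name
and the mask, so the names and masks together occupy one full 64-byte cache
line per rule.

```c
#include <stddef.h>
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16 /* same value as the kernel constant */
#endif

/* Hypothetical stand-in for the interface fields of a rule. */
struct rule_ifaces_masked {
    char iniface[IFNAMSIZ], outiface[IFNAMSIZ];
    unsigned char iniface_mask[IFNAMSIZ], outiface_mask[IFNAMSIZ];
};

/*
 * Returns nonzero when the device name does NOT match the rule.
 * All three buffers must be IFNAMSIZ bytes long and zero padded.
 */
static unsigned long iface_mismatch_masked(const char *dev,
                                           const char *rule_name,
                                           const unsigned char *mask)
{
    unsigned long ret = 0;
    size_t i;

    for (i = 0; i < IFNAMSIZ / sizeof(unsigned long); i++) {
        unsigned long d, r, m;

        /* memcpy avoids alignment and aliasing problems in userspace. */
        memcpy(&d, dev + i * sizeof(d), sizeof(d));
        memcpy(&r, rule_name + i * sizeof(r), sizeof(r));
        memcpy(&m, mask + i * sizeof(m), sizeof(m));

        if (m == 0)
            break; /* contiguous mask: nothing left to compare */
        ret |= (d ^ r) & m;
    }
    return ret;
}
```

With an all-zero mask the loop exits on the first word and every device
matches, which is the common case for rules that name no interface.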
We could probably use a mask length instead (one byte for iface, one for
oface) to reduce the memory footprint and have a better chance of keeping the
tables in the CPU caches.
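
A rough sketch of what a length-based match could look like, under the
assumption that masks are always a contiguous prefix (which is what the
"+" wildcard produces). The struct layout and the function name are
hypothetical, not an existing kernel interface. Replacing the two 16-byte
masks with two length bytes shrinks the interface fields from 64 to 34
bytes in this layout.

```c
#include <string.h>

#ifndef IFNAMSIZ
#define IFNAMSIZ 16 /* same value as the kernel constant */
#endif

/* Hypothetical compact layout: one length byte replaces each mask. */
struct rule_ifaces_len {
    char iniface[IFNAMSIZ], outiface[IFNAMSIZ];
    unsigned char iniface_len;  /* leading bytes of iniface to compare */
    unsigned char outiface_len; /* leading bytes of outiface to compare */
};

/*
 * Returns nonzero when the first len bytes of dev differ from rule_name.
 * len must not exceed IFNAMSIZ; len == 0 matches any device.
 * For an exact match, len counts the terminating NUL as well.
 */
static int iface_mismatch_len(const char *dev, const char *rule_name,
                              unsigned char len)
{
    return len != 0 && memcmp(dev, rule_name, len) != 0;
}
```

Under these assumptions a rule for "eth+" would store a length of 3 and a
rule for "eth0" a length of 5, and a rule with no interface match costs a
single byte test.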