Detailed report on SMP-build lockups [seems that it is locking problem in networking code] (2.4.0-test2-ac2 and later)
Tue, 11 Jul 2000 13:32:04 +0200
[cc list trimmed a bit]
On Tue, Jul 11, 2000 at 12:50:32PM +0200, Alexander Demenshin wrote:
> - Traffic generator used on _local_ interface:
> > A lot of fragmented packets:
> ifconfig lo mtu 256
> ping -f -s 8192 127.0.0.1
> > A lot of TCP traffic (connect/transfer/disconnect);
> > MTU does not matter.
> In my tests I used the following rules for iptables:
> iptables -t mangle -A PREROUTING -j QUEUE
> iptables -t mangle -A OUTPUT -j QUEUE
> I assume there are no other rules; but the problem occurs _only_
> when the QUEUE target is in effect - other rules do not matter as long
> as there are no QUEUE targets or if packets are not accepted in userspace.
The only thing I can see in ipqueue is that it turns off local bottom halves
for a long time during packet receive. That could probably force other
> If I use table 'filter' it also occurs (so nothing magical
> in 'mangle' table).
> So, once the rules above are in effect and the userspace module is running,
> a system lockup occurs after the traffic generator has run for a while
> (in my case - after processing of ca. 300K packets; but it depends -
> be patient :).
> No OOPs, no other kernel messages, _nothing_ except SysRq is active.
> Examining the code under EIP shows that the lockup occurs at:
> - In case of TCP traffic:
> --- src/net/ipv4/tcp_timer.c:690 tcp_synack_timer() ---
> /* Drop this request */
> write_lock(&tp->syn_wait_lock); /* <<< AT THIS PLACE */
This one is strange. Any chance of getting a multi-CPU backtrace for this?
(install kdb from oss.sgi.com:/projects/kdb/ , press pause during a hang,
enter bt and switch to the other CPUs using the cpu command and backtrace
> *reqp = req->dl_next;
> --- CUT ---
> - In case of ICMP (fragmented) traffic:
> --- src/net/ipv4/ip_fragment:202 ip_expire ---
> spin_lock(&ipfrag_lock); /* <<< AT THIS PLACE */
The fragment locking is known to be buggy. It should be fixed in 2.4.0pre3.
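For reference, a hang at a spin_lock() inside a timer handler is what you
would expect if process-context code on the same CPU holds that lock with
bottom halves enabled and the timer then fires on top of it. Whether that is
what happens here is an assumption, not something the trace above proves.
A minimal user-space sketch of the pattern follows; it is a single-CPU model,
and every name in it (model_cpu, model_timer_bh, model_process_path, ...) is
made up for illustration and is not a kernel interface:

```c
#include <stdbool.h>

/* Hypothetical single-CPU model; none of these names are kernel APIs. */
struct model_cpu {
    bool lock_held;     /* stands in for a spinlock such as ipfrag_lock */
    bool bh_disabled;   /* stands in for bottom halves being disabled */
    bool timer_pending; /* an expiry timer is waiting to run as a bottom half */
    bool deadlocked;    /* set once the timer would spin forever */
};

/* The timer handler wants the same lock that process context may hold. */
static void model_timer_bh(struct model_cpu *cpu)
{
    if (cpu->lock_held) {
        /*
         * A real spinlock would spin here forever: the holder is the
         * interrupted code on this same CPU, which cannot run again
         * until the handler returns.
         */
        cpu->deadlocked = true;
        return;
    }
    cpu->lock_held = true;
    /* ... the expiry work would happen here ... */
    cpu->lock_held = false;
    cpu->timer_pending = false;
}

/* A point at which a pending bottom half may interrupt process context. */
static void model_preempt_point(struct model_cpu *cpu)
{
    if (cpu->timer_pending && !cpu->bh_disabled)
        model_timer_bh(cpu);
}

/*
 * Process-context path that takes the lock, with or without bottom halves
 * disabled around it. Returns true if the model detected a deadlock.
 */
bool model_process_path(bool disable_bh)
{
    struct model_cpu cpu = { .timer_pending = true };

    cpu.bh_disabled = disable_bh;
    cpu.lock_held = true;
    model_preempt_point(&cpu);  /* the timer fires while the lock is held */
    cpu.lock_held = false;
    cpu.bh_disabled = false;
    model_preempt_point(&cpu);  /* a deferred bottom half runs only now */

    return cpu.deadlocked;
}
```

In this model, model_process_path(false) returns true (the timer finds the
lock held and would spin forever), while model_process_path(true) returns
false because the timer is deferred until the lock has been dropped.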
There was also a NAT bug where ip_defrag was called without bottom halves
disabled, which could cause deadlocks too, but that should already be fixed
(all ip_defrag calls in netfilter/* should be guarded by a local_bh_disable/