NMI Watchdog detected LOCKUP on CPU1

Martin Josefsson gandalf@wlug.westbo.se
Tue, 9 Oct 2001 21:09:29 +0200 (CEST)


Hi

(Sorry for the crosspost but I don't know on which list the people capable
of helping are subscribed :) (it's kernel, networkrelated and possibly
netfilter related)

As you can see (see topic) I have a small problem.

I got this while stresstesting an SMP router here.

hardware:
2 x pIII 600
440BX chipset motherboard
2 x 32MB pc100 ram
128MB flashdisk
D-Link DFE570-TX (quad DECchip 21143) NIC
Berkshire PCI watchdogcard.

Kernel: 2.4.8-ac12 (got a hang on 2.4.9-ac18 too but didn't have a monitor
or serial console attached to it and the watchdogcard rebooted it)

The tulipdriver I'm using isn't the "vanilla" one, it's the optimized
polling version by Jamal and Robert (ANK was involved here too I think?)

I can't rule this driver out but I don't think it's at fault here.

I was running a quite heavy stresstest between 3 ports on the NIC.
All was going just fine until I heard the watchdog go *beep* as it
rebooted the router. And then I found this on the serial-console:

"NMI Watchdog detected LOCKUP on CPU1, registers:" and a stackdump
(decoded below)

loaded modules:

ipt_MASQUERADE          1360   1  (autoclean)
ip_nat_irc              2768   0  (unused)
ip_conntrack_irc        2496   0  [ip_nat_irc]
ip_nat_ftp              3184   0  (unused)
ip_conntrack_ftp        3280   0  [ip_nat_ftp]
iptable_nat            13424   7  [ipt_MASQUERADE ip_nat_irc ip_nat_ftp]
ip_conntrack           14224   8  [ipt_MASQUERADE ip_nat_irc ip_conntrack_irc ip_nat_ftp ip_conntrack_ftp iptable_nat]
iptable_filter          1728   0  (unused)
ip_tables              10592   5  [ipt_MASQUERADE iptable_nat iptable_filter]
tulip                  42224   3 

/proc/version:
Linux version 2.4.8-ac12 (root@tux) (gcc version 2.95.4 20010604 (Debian
prerelease)) #7 SMP Sun Aug 26 22:02:31 CEST 2001

the router is set up to SNAT packets going out of eth0 and then I have a
portforward leading in to a machine on another port in the NIC.

I had one machine on each of eth0, eth1 and eth3 generating and recieving
traffic. (the traffic was generated by 'nc ip port < /dev/zero' and
revieced by 'nc -l -p port > /dev/null' so there's some pipeing going on
too.)

two streams coming in on eth1 and eth3 where SNAT'd out via eth0 and one
stream coming in on eth0 was DNAT'd to the machine on eth1.
And then there where two streams between eth1 and eth3, one in each
direction.

= there where quite a lot of packets.

The router had been running this test for over 36 hours when it died.
As I wrote earlier, 2.4.9-ac18 died too, within a few hours but I had no
chance of getting the output :(

Decoded Oops:

ksymoops 2.4.1 on i686 2.4.9-ac5.  Options used
     -V (default)
     -K (specified)
     -L (specified)
     -O (specified)
     -m System.map-2.4.8-ac12 (specified)

NMI Watchdog detected LOCKUP on CPU1, registers:            
CPU:    1                                       
EIP:    0010:[<c0105c5f>]
Using defaults from ksymoops -t elf32-i386 -a i386
EFLAGS: 00000087         
eax: c483f2a4   ebx: c1e8c200   ecx: c1e8c29c   edx: c1120000
esi: c1e8c2d0   edi: c11b2158   ebp: 5671b6f9   esp: c1121d58
ds: 0018   es: 0018   ss: 0018                               
Process swapper (pid: 0, stackpage=c1121000)
Stack: c483e8b5 c1e8c200 c11b2140 c4816734 c1e8c200 00000000 c022394e c1e8c200 
       00000000 c11b2158 c48065b1 c1970c80 c1679320 c26a3740 c11b2000 c02482d0 
       000000c8 c11b2158 00000000 5671b6f9 00000006 00000000 00000282 c11b2140 
Call Trace: [<c483e8b5>] [<c4816734>] [<c022394e>] [<c48065b1>] [<c02482d0>]   
   [<c022e71b>] [<c0226e9d>] [<c0248390>] [<c022d408>] [<c0245950>] [<c02482b2>] 
   [<c02482d0>] [<c02459aa>] [<c022d408>] [<c02458e8>] [<c0245950>] [<c0244aed>] 
   [<c0244910>] [<c022d408>] [<c0244776>] [<c0244910>] [<c022771e>] [<c011956f>] 
   [<c010861b>] [<c01051b0>] [<c01051b0>] [<c010a714>] [<c01051b0>] [<c01051b0>] 
   [<c01051dd>] [<c0105242>] [<c011577b>]                                        
Code: 81 38 00 00 00 01 75 f8 f0 81 28 00 00 00 01 0f 85 e4 ff ff 

>>EIP; c0105c5f <__write_lock_failed+7/20>   <=====
Trace; c483e8b5 <END_OF_CODE+4466b65/????>
Trace; c4816734 <END_OF_CODE+443e9e4/????>
Trace; c022394e <__kfree_skb+96/134>
Trace; c48065b1 <END_OF_CODE+442e861/????>
Trace; c02482d0 <ip_finish_output2+0/f8>
Trace; c022e71b <qdisc_restart+6b/17c>
Trace; c0226e9d <dev_queue_xmit+111/2ec>
Trace; c0248390 <ip_finish_output2+c0/f8>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c0245950 <ip_forward_finish+0/60>
Trace; c02482b2 <ip_finish_output+12a/134>
Trace; c02482d0 <ip_finish_output2+0/f8>
Trace; c02459aa <ip_forward_finish+5a/60>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c02458e8 <ip_forward+1f8/260>
Trace; c0245950 <ip_forward_finish+0/60>
Trace; c0244aed <ip_rcv_finish+1dd/220>
Trace; c0244910 <ip_rcv_finish+0/220>
Trace; c022d408 <nf_hook_slow+110/194>
Trace; c0244776 <ip_rcv+3b6/400>
Trace; c0244910 <ip_rcv_finish+0/220>
Trace; c022771e <net_rx_action+1da/334>
Trace; c011956f <do_softirq+6f/cc>
Trace; c010861b <do_IRQ+db/ec>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051b0 <default_idle+0/34>
Trace; c010a714 <call_do_IRQ+5/d>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051b0 <default_idle+0/34>
Trace; c01051dd <default_idle+2d/34>
Trace; c0105242 <cpu_idle+3e/54>
Trace; c011577b <_call_console_drivers+53/58>
Code;  c0105c5f <__write_lock_failed+7/20>
00000000 <_EIP>:
Code;  c0105c5f <__write_lock_failed+7/20>   <=====
   0:   81 38 00 00 00 01         cmpl   $0x1000000,(%eax)   <=====
Code;  c0105c65 <__write_lock_failed+d/20>
   6:   75 f8                     jne    0 <_EIP>
Code;  c0105c67 <__write_lock_failed+f/20>
   8:   f0 81 28 00 00 00 01      lock subl $0x1000000,(%eax)
Code;  c0105c6e <__write_lock_failed+16/20>
   f:   0f 85 e4 ff ff 00         jne    fffff9 <_EIP+0xfffff9> c1105c58 <END_OF_CODE+d2df08/????>


To me it looks like ip_finish_output2 freaked and did something nasty.

I hope someone who knows the networking a little better than me can help
with this one. My goal is to make linux survive at least a month of
stresstesting :) 

/Martin

Never argue with an idiot. They drag you down to their level, then beat you with experience.