Wednesday, May 26, 2010

Debugging: The elusive deadlock

It's very infrequent that I come across a deadlock bug in my code that 1) isn't easy to find, and 2) is very easy to reproduce.

I had a user report a bug in my solo6010 driver where he has two cards installed in the system. He is on a Core2Duo. If he starts mplayer up on each display for the two cards he has installed (2 mplayer instances), his machine instantly deadlocks and spews to the console.

At first I wasn't able to easily reproduce this. I'm on a Core2Quad, but since I have 4 cards installed I decided to start an mplayer instance for each display device for each card (4 mplayer instances). Oddly enough, I also deadlocked and spewed softlockup messages to the console.

Do you see where this is going? I decided, for clarity, to disable two of my cores:

echo 0 | sudo tee /sys/devices/system/cpu/cpu2/online
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online

Sure enough it only took two mplayer instances to deadlock my machine this time. Weird! Now my driver currently is able to pull 44 feeds from 4 cards at once for the MPEG feeds. Here, in this case, I am deadlocking with just two YUV feeds from the uncompressed video of the card. This code is much less complex, and the locking even less so. No parts of the driver share data between card instances (each card instance has it's own data and locks).

Upon further investigation I've noticed that this deadlock appears to happen in spin_unlock_irqrestore() during wake_up().

After carefully tracing the code, it was vaguely apparent that my logic around the wakeup routine for when it tries to grab a frame from the the hardware was a little off. I was using a different wake struct for each file handle, when I should have been using one per card. Not to mention, I was not taking advantage of the video sync IRQ to send a wakeup to the thread so that it knew a new frame was ready to grab (allowed me to spin less, and guaranteed the threads to be awoken when a new frame was ready).

Reworking this logic just a bit cleared the deadlock. Honestly, I'm not entirely sure of how the scenario caused a deadlock. It appears to be something in the underlying logic for wait/wake_up routines. I wont argue that it is fixed now, and my code is cleaner and more efficient, so I wont ask too many questions.

No comments:

Post a Comment