Friday, May 28, 2010

Dear Users: Do not withhold information from developers, kthxbye

I stumbled upon a debugging case today that seems to be a common problem. I won't call this user out directly, but he was a case study in what not to do when you want help from a developer like myself.

The basic volley started off with the usual chit chat in an IRC channel:

<User> Can someone help me compile a module for my kernel?
<Me> Sure, what seems to be the trouble?

So off we went with some IRC and PasteBin exchanges about his compile problem. I looked at the source code for the driver he was trying to compile, and it needed an obvious one-line fix to work with a newer kernel such as the one found on the Ubuntu 10.04 Lucid system he was running.

So now the module compiled, and he tried loading it. Hmm... the module disagreed with symbols from modules on his running system (videodev, to be exact).

Weird. That shouldn't happen. I asked him if he had compiled or installed different versions of v4l than what his system came with. He didn't recall. However, after getting him to pastebin an "ls -lR" of his modules directory, it was apparent that three days earlier he had in fact completely replaced the drivers/media install.

This meant that those modules didn't match the stock headers that came with his running kernel. Replacing them had taken him moments, but discovering it cost me considerable (volunteer) time. Once I had found it, he admitted to replacing the modules.

Now it was obvious that he was embarrassed to admit he had junked up his system, and even more embarrassed that I had caught him in a lie. He could have saved time for both of us. If I had given up after helping with his initial problem, he would have been stuck, not knowing how to fix it.

So the moral of the story here is, don't hide information from people trying to help you. Tell all the gross details. If you fed your cat buttermilk waffles off your keyboard, it might help to know that if your 'H' is stuck.

Wednesday, May 26, 2010

Debugging: The elusive deadlock

It's very infrequent that I come across a deadlock bug in my code that 1) isn't easy to find, and 2) is very easy to reproduce.

A user reported a bug in my solo6010 driver. He had two cards installed in a Core2Duo system, and if he started mplayer on the display device of each card (two mplayer instances), his machine instantly deadlocked and spewed to the console.

At first I wasn't able to reproduce this easily. I'm on a Core2Quad, but since I have 4 cards installed, I decided to start an mplayer instance on the display device of each card (4 mplayer instances). Oddly enough, my machine also deadlocked and spewed softlockup messages to the console.

Do you see where this is going? I decided, for clarity, to disable two of my cores:

echo 0 | sudo tee /sys/devices/system/cpu/cpu2/online
echo 0 | sudo tee /sys/devices/system/cpu/cpu3/online

Sure enough, it took only two mplayer instances to deadlock my machine this time. Weird! My driver is currently able to pull 44 MPEG feeds from 4 cards at once, yet here I was deadlocking with just two YUV feeds from the cards' uncompressed video. This code path is much less complex, and its locking even less so. No part of the driver shares data between card instances (each card instance has its own data and locks).

Upon further investigation, I noticed that the deadlock appeared to happen in spin_unlock_irqrestore() during wake_up().

After carefully tracing the code, it became apparent that my logic around the wakeup routine for grabbing a frame from the hardware was a little off. I was using a different wait structure for each file handle when I should have been using one per card. On top of that, I was not taking advantage of the video sync IRQ to wake the thread when a new frame was ready to grab (doing so lets me spin less and guarantees the threads are woken as soon as a new frame is ready).
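
The shape of the fix, as a rough sketch (struct solo_card and frame_ready() are stand-ins here, and the wait queue would be initialized with init_waitqueue_head() at probe time):

struct solo_card {
        wait_queue_head_t thread_wait;  /* one per card, not per file handle */
        /* ... */
};

/* In the video sync IRQ handler: a new frame is ready, wake the thread */
wake_up(&card->thread_wait);

/* In the frame-grabbing thread: sleep until the IRQ says a frame is ready */
wait_event_interruptible(card->thread_wait, frame_ready(card));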

Reworking this logic just a bit cleared the deadlock. Honestly, I'm not entirely sure how the original scenario caused a deadlock; it appears to be something in the underlying logic of the wait/wake_up routines. It is fixed now, though, and my code is cleaner and more efficient, so I won't ask too many questions.

Saturday, May 22, 2010

Review: Softlogic 6010 based MPEG-4/G.723 compression cards

So the company I work for (Bluecherry, LLC) is busy developing products around the Softlogic 6010 based compression card. My job there has been to rewrite the driver from scratch in order to make it more Linux-friendly. To be clear, then, I am writing this review from a programmer's perspective. I am not an MPEG expert, so I may skimp on some of the encoder details.

Let's start off with some specs. The base card supports full D1-quad compression of video into the MPEG-4 video format. This means it can encode 704x480 video at a rate of 120fps for NTSC, or 704x576 at 100fps for PAL. That breaks down to 4 full streams at 30fps and 25fps respectively. Alternatively, it can do CIF encoding (1/4 the size of D1) at 4 times that frame rate, or, for the math-lazy, 16 channels at 30fps at a 320x240 frame size.

The card can be purchased in 4, 8 or 16 channel input models. To take advantage of all 16 channels on the top model, you would either have to record in CIF mode (320x240) or increase the frame interval, dropping to 7.5fps per channel in full D1 mode (704x480). I will be speaking mostly in NTSC terms, but the card does support PAL, so do the conversions as we go.

The card allows for the usual MPEG encoding settings, including GOP (Group of Pictures) size, quantization and intervals. An interval is roughly the inverse of frames per second: an interval of 1 means the encoder captures every frame, while an interval of 3 means it skips 2 frames between every frame it encodes. The video muxer on the card runs at 30fps, so the interval setting decides how many of those frames get encoded.
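
A hypothetical helper makes the relationship concrete for NTSC, where the muxer runs at 30fps (interval 1 = 30fps, 2 = 15fps, 3 = 10fps; note the integer truncation at interval 4, which is really 7.5fps):

/* Effective NTSC frame rate for a given interval setting */
static inline int interval_to_fps(int interval)
{
        return 30 / interval;
}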

The encoder itself performs quite well. It does all encoding into an on-board SDRAM chip and can DMA the frames directly to host memory, which is great for performance. The original driver did not take advantage of this, since it copied the frames to userspace. The new driver I've written makes use of v4l2 and its videobuf-dma-contig framework, which allows buffers to be memory-mapped into userspace. This gives us zero-copy to userspace.

The encoder also supports side-by-side MJPEG compression of video frames. So while you can be recording the compressed MPEG-4 to disk, you can also frame grab JPEG images. This is useful for tools that want to do such frame grabbing for video analytics, or for live viewing over a web server (it's very easy to send frame grabs via an MJPEG cgi script).

All of this is built properly now on top of Linux's v4l2 API. Unfortunately the API does not expect compression cards to pipe MPEG-4 video, so most clients using v4l2 expect compressed video to be either MJPEG or MPEG-1/2 streams of some sort.

Currently the only drawback of the MPEG encoder is that the frames are standalone MPEG-4 video frames. I have to add a header to the key frames for them to be usable by most decoders.

Overall the video capture is great. I've run 44 simultaneous recordings (16-, 16-, 8- and 4-channel cards) on a Core2Duo with a system load average of 1.65 and only about 10% CPU usage. Most of the load is disk I/O.

Each encoder input also supports a graphical overlay that can be programmed at pixel level with varying colors. This is great for textual overlays. Currently we use it to place a descriptive name on the recording along with a timestamp.

In addition to the encoders, the card supports one uncompressed display port, currently exposed via v4l2 as a standard analog YUV device. It can be configured to show any of the input ports in tons of arrangements, so you can do things like a 4-up display. This live display also supports a graphical overlay.

The display is sent to the video-out port on the card (hard-wired), so it can be hooked to a monitor as well (good for surveillance applications such as what Bluecherry offers).

Finally, we'll discuss my least favorite part of this card. While it's not a killer, it is just odd that the card supports sound only in G.723 format. For surveillance applications this is just fine: it delivers 3-bit samples at an 8kHz sample rate, which works out to 24kbps. While this is good for bandwidth, it's bad for anything that needs better audio quality. Not to mention that storing the audio and video together in any sane format requires converting the G.723 to linear PCM.

However, the G.723 to linear PCM conversion isn't much performance overhead, and neither is encoding to 16kHz MP2 audio, which is how we store it for our surveillance products. Overall, our format is MPEG-4 video and MP2 audio in a Matroska (mkv) container. This is exactly how it was stored in my 44-stream example above.

So Pros:
  • Fast and efficient
  • Can handle multiple inputs easily
  • The new driver works well with v4l2 and alsa
  • Perfect for security applications
  • Nice OSD capabilities
  • Motion detection supported per input
  • Side-by-side MPEG-4 and JPEG capture modes per input
Cons:
  • MPEG-4 video only (the newer SOLO-6110 will support H.264)
  • Low-quality audio is not great for anything other than special applications (no TV DVR)
  • The G.723 audio format has been obsoleted twice since it was introduced; nothing uses it, so you must always re-encode it

Friday, May 21, 2010

Video4Linux2 Hardware Motion Detection Support

In about a week or so I'll be making a proposal for V4L2 to have API support for hardware that offers motion detection. Since my experience with this is limited to only one type of hardware, I'm hoping to gain feedback on making sure that the approach I'm offering is as generic as possible.

I'll describe the hardware that I'm working with, which is a Softlogic 6010 MPEG-4/G.723 encoder board supporting 4, 8 or 16 input channels (all of which can be encoded at once). Note that all of this applies to the SOLO-6110 card as well (the H.264 variant).

The motion detection exposed by the SOLO-6010 is on a per-input basis. When motion detection is enabled, it can be configured either to signal start-of-motion events only, or to signal both start and stop events with a configurable delay after actual motion has stopped (i.e. it will not send the stop signal until there has been no motion for n seconds).

Next, the SOLO-6010 allows you to set a threshold for when the hardware will detect an event. In my case, the higher the threshold, the less sensitive the detection. It has a range of 0 (anal) to 65535 (off), with a default of 768.

Exposing this via v4l2 controls is quite simple. In my current version of the solo6010 driver, I expose this via private CIDs (Control IDs) which can be easily converted to native CIDs in v4l2.

#define V4L2_CID_MOTION_ENABLE    (V4L2_CID_PRIVATE_BASE+0)
#define V4L2_CID_MOTION_THRESHOLD (V4L2_CID_PRIVATE_BASE+1)
#define V4L2_CID_MOTION_MODE      (V4L2_CID_PRIVATE_BASE+2)
#define V4L2_CID_MOTION_EASE_OFF  (V4L2_CID_PRIVATE_BASE+3)

In this case, V4L2_CID_MOTION_ENABLE is a boolean that turns motion detection on or off, V4L2_CID_MOTION_THRESHOLD is the threshold value I spoke of (a slider with the range above), V4L2_CID_MOTION_MODE is a menu control offering "Start events only" and "Start and stop events", and V4L2_CID_MOTION_EASE_OFF is the number of seconds of non-motion required before the stop event is triggered.
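
To give an idea of how userspace would drive these, a hypothetical snippet (assuming fd is an open file descriptor on the capture device):

struct v4l2_control ctrl;

/* turn motion detection on */
ctrl.id = V4L2_CID_MOTION_ENABLE;
ctrl.value = 1;
if (ioctl(fd, VIDIOC_S_CTRL, &ctrl) < 0)
        perror("VIDIOC_S_CTRL");

/* set the sensitivity (768 is the hardware default) */
ctrl.id = V4L2_CID_MOTION_THRESHOLD;
ctrl.value = 768;
if (ioctl(fd, VIDIOC_S_CTRL, &ctrl) < 0)
        perror("VIDIOC_S_CTRL");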

Now, I could combine V4L2_CID_MOTION_ENABLE and V4L2_CID_MOTION_MODE as just a menu control with "Disabled" as one option, but I'm not sure what the consensus would be. It would be confusing as a standard control for hardware that only supported on/off tuning of this feature.

Note that in "Start events only" mode, my hardware will continually produce motion events as long as the card sees motion, so I can emulate V4L2_CID_MOTION_EASE_OFF and a stop event in software.

Whether it's a good idea to always offer this support through the control, with the v4l2 middle layer deciding between the hardware and its own software implementation, is up for discussion. I'm all for making it transparent to the user, with the middle layer handling the guts and each driver deciding whether to let the middle layer do it or to expose its hardware support.

Now we just need to make userspace aware of these events. The easiest way I've found is to define some extra flags for struct v4l2_buffer that get set during dqbuf.

#define V4L2_BUF_FLAG_MOTION_ON         0x00000400
#define V4L2_BUF_FLAG_MOTION_START      0x00000800
#define V4L2_BUF_FLAG_MOTION_STOP       0x00001000

The reason for V4L2_BUF_FLAG_MOTION_ON is that we need userspace to be able to tell that motion detection is on without querying the controls every second or two. Remember that controls can be changed even while a recorder is running (and in the case of motion detection, I suspect that's a wanted feature).

So if userspace is reading packets, it knows that motion detection is on or off depending on that flag, and can act accordingly. The start and stop flags are self-explanatory.
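
Here's a rough sketch of that from the userspace side (start_recording() and stop_recording() are hypothetical application functions):

struct v4l2_buffer buf;

memset(&buf, 0, sizeof(buf));
buf.type = V4L2_BUF_TYPE_VIDEO_CAPTURE;
buf.memory = V4L2_MEMORY_MMAP;

if (ioctl(fd, VIDIOC_DQBUF, &buf) == 0 &&
    (buf.flags & V4L2_BUF_FLAG_MOTION_ON)) {
        if (buf.flags & V4L2_BUF_FLAG_MOTION_START)
                start_recording();
        else if (buf.flags & V4L2_BUF_FLAG_MOTION_STOP)
                stop_recording();
}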

Now, this is a good reason to promote a software-side (perhaps libv4l2?) ease-off for motion detection. Without creating another flag, there's no way to know whether motion detection is in start-only or start-stop mode. If we always implement the ease-off, we know we'll get a stop event eventually, whether or not the hardware supports it.

Moving back to threshold values... the SOLO-6010 actually supports a motion detection grid with block sizes of 32x32 pixels. The SOLO-6010's NTSC viewable field is 704x480 (704x576 for PAL), so that's either a 22x15 or a 22x18 grid of blocks, each of which can have its own threshold setting. I'm still up in the air about how to do this in a standard v4l2 API. For the SOLO-6010 I am using the low 16 bits of the control value to pass a threshold level, and the high 16 bits to select the block being affected (0xff000000 being the x value and 0x00ff0000 the y value on the grid). This works well in practice but is obviously not generic enough.
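
In code, that per-block encoding looks something like this (x, y and threshold being plain integers within the ranges above):

/* x in the top byte, y in the next byte, threshold in the low 16 bits */
ctrl.id = V4L2_CID_MOTION_THRESHOLD;
ctrl.value = (x << 24) | (y << 16) | (threshold & 0xffff);
ioctl(fd, VIDIOC_S_CTRL, &ctrl);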

Well that's all I have for today.

Tuesday, May 11, 2010

Writing an ALSA driver: PCM handler callbacks

So here we are on the final chapter of the ALSA driver series. We will finally fill in the meat of the driver with some simple handler callbacks for the PCM capture device we've been developing. In the previous post, Writing an ALSA driver: Setting up capture, we defined my_pcm_ops, which was used when calling snd_pcm_set_ops() for our PCM device. Here is that structure again:

static struct snd_pcm_ops my_pcm_ops = {
        .open      = my_pcm_open,
        .close     = my_pcm_close,
        .ioctl     = snd_pcm_lib_ioctl,
        .hw_params = my_hw_params,
        .hw_free   = my_hw_free,
        .prepare   = my_pcm_prepare,
        .trigger   = my_pcm_trigger,
        .pointer   = my_pcm_pointer,
        .copy      = my_pcm_copy,
};

First let's start off with the open and close methods defined in this structure. This is where your driver gets notified that someone has opened the capture device (file open) and subsequently closed it.

static int my_pcm_open(struct snd_pcm_substream *ss)
{
        ss->runtime->hw = my_pcm_hw;
        ss->private_data = my_dev;

        return 0;
}

static int my_pcm_close(struct snd_pcm_substream *ss)
{
        ss->private_data = NULL;

        return 0;
}

This is the minimum you would do for these two functions. If needed, you would allocate private data for this stream and free it on close.
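
Such an allocation might look like this (a sketch; struct my_stream is hypothetical, and runtime->private_data is a convenient place to keep it):

static int my_pcm_open(struct snd_pcm_substream *ss)
{
        struct my_stream *stm = kzalloc(sizeof(*stm), GFP_KERNEL);

        if (!stm)
                return -ENOMEM;

        ss->runtime->hw = my_pcm_hw;
        ss->runtime->private_data = stm;
        ss->private_data = my_dev;      /* as before */

        return 0;
}

static int my_pcm_close(struct snd_pcm_substream *ss)
{
        kfree(ss->runtime->private_data);
        ss->private_data = NULL;

        return 0;
}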

For the ioctl handler, unless you need something special, you can just use the standard snd_pcm_lib_ioctl callback.

The next three callbacks handle hardware setup.

static int my_hw_params(struct snd_pcm_substream *ss,
                        struct snd_pcm_hw_params *hw_params)
{
        return snd_pcm_lib_malloc_pages(ss,
                         params_buffer_bytes(hw_params));
}

static int my_hw_free(struct snd_pcm_substream *ss)
{
        return snd_pcm_lib_free_pages(ss);
}

static int my_pcm_prepare(struct snd_pcm_substream *ss)
{
        return 0;
}

Since we've been using standard memory allocation routines from ALSA, these functions stay fairly simple. If you have special exceptions between different versions of the hardware supported by your driver, you can make changes to the ss->runtime->hw structure (e.g. if one version of your card supports 96kHz, but the rest only support 48kHz max).
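
One hypothetical way to handle that, back where runtime->hw gets assigned in my_pcm_open ('rev' is an assumed revision field on struct my_device):

ss->runtime->hw = my_pcm_hw;
if (my_dev->rev >= 2) {         /* assumed: revision 2 can do 96kHz */
        ss->runtime->hw.rates |= SNDRV_PCM_RATE_96000;
        ss->runtime->hw.rate_max = 96000;
}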

The PCM prepare callback should handle anything your driver needs to do before alsa-lib can ask it to start sending buffers. My driver doesn't do anything special here, so I have an empty callback.

This next handler tells your driver when ALSA is going to start and stop capturing buffers from your device. Most likely you will enable and disable interrupts here.

static int my_pcm_trigger(struct snd_pcm_substream *ss,
                          int cmd)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);
        int ret = 0;

        switch (cmd) {
        case SNDRV_PCM_TRIGGER_START:
                // Start the hardware capture
                break;
        case SNDRV_PCM_TRIGGER_STOP:
                // Stop the hardware capture
                break;
        default:
                ret = -EINVAL;
        }

        return ret;
}

Let's move on to the handlers that are the workhorses in my driver. Since the hardware I'm writing this driver for cannot DMA directly into the memory that ALSA supplies for communicating with userspace, I need to use the copy handler to perform that operation.

static snd_pcm_uframes_t my_pcm_pointer(struct snd_pcm_substream *ss)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);

        return my_dev->hw_idx;
}

static int my_pcm_copy(struct snd_pcm_substream *ss,
                       int channel, snd_pcm_uframes_t pos,
                       void __user *dst,
                       snd_pcm_uframes_t count)
{
        struct my_device *my_dev = snd_pcm_substream_chip(ss);

        /* pos and count are in frames; convert to bytes for the copy */
        if (copy_to_user(dst,
                         my_dev->buffer + frames_to_bytes(ss->runtime, pos),
                         frames_to_bytes(ss->runtime, count)))
                return -EFAULT;

        return 0;
}

So here we've defined a pointer function, which the middle layer calls (on behalf of userspace) to find out where the hardware is in writing to the buffer.

Next, we have the actual copy function. You should note that count and pos are in frames, not bytes; since this hardware produces mono 8-bit samples, one frame happens to be one byte, but the copy above converts explicitly with frames_to_bytes(). The buffer itself is assumed to have been filled during interrupt handling.

Speaking of interrupts, the ISR (interrupt service routine) is also where you should signal to ALSA that there is more data to consume. In mine, I have this:

snd_pcm_period_elapsed(my_dev->ss);
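
For context, a sketch of that ISR side, assuming the substream pointer was stashed in my_dev->ss when the stream was opened and that the hardware interrupts once per 48-byte period:

static irqreturn_t my_isr(int irq, void *dev_id)
{
        struct my_device *my_dev = dev_id;

        /* ... read the new period from the hardware into
         * my_dev->buffer and advance my_dev->hw_idx ... */

        if (my_dev->ss)
                snd_pcm_period_elapsed(my_dev->ss);

        return IRQ_HANDLED;
}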

And I think we're done. Hopefully now you have at least the stubs in place for a working driver, and will be able to fill in the details for your hardware. One day I may come back and write another post on how to add mixer controls (e.g. volume).

Hope this series has helped you out!

<< Prev

Tuesday, May 4, 2010

Writing an ALSA driver: PCM Hardware Description

Welcome to the fourth installment in my "Writing an ALSA Driver" series. In this post, we'll dig into the snd_pcm_hardware structure that will be used in the next post which will describe the PCM handler callbacks.

Here is a look at the snd_pcm_hardware structure I have for my driver. It's fairly simple:

static struct snd_pcm_hardware my_pcm_hw = {
        .info = (SNDRV_PCM_INFO_MMAP |
                 SNDRV_PCM_INFO_INTERLEAVED |
                 SNDRV_PCM_INFO_BLOCK_TRANSFER |
                 SNDRV_PCM_INFO_MMAP_VALID),
        .formats          = SNDRV_PCM_FMTBIT_U8,
        .rates            = SNDRV_PCM_RATE_8000,
        .rate_min         = 8000,
        .rate_max         = 8000,
        .channels_min     = 1,
        .channels_max     = 1,
        .buffer_bytes_max = (32 * 48),
        .period_bytes_min = 48,
        .period_bytes_max = 48,
        .periods_min      = 1,
        .periods_max      = 32,
};

This structure describes how my hardware lays out the PCM data for capturing. As I described before, it writes out 48 bytes at a time for each stream, into 32 pages. A period essentially corresponds to an interrupt: it describes the "chunk" size in which the hardware supplies data.

This hardware supplies only mono data (1 channel) and only an 8000Hz sample rate. Most hardware seems to work in the range of 8000 to 48000, and there is a define covering that range, SNDRV_PCM_RATE_8000_48000. This is a bitmask field, so you can set whatever rates your hardware supports.

My driver describes this data as unsigned 8-bit format (it's actually signed 3-bit G.723-24, but ALSA doesn't support that, so I fake it). The most common PCM data is signed 16-bit little-endian (S16_LE). You would use whatever your hardware supplies, which can be more than one type: since the format is also a bitmask, you can define multiple data formats.
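
For contrast, a hypothetical stereo card supporting the common rates and both of those formats might fill in these fields like so:

        .formats      = SNDRV_PCM_FMTBIT_S16_LE | SNDRV_PCM_FMTBIT_U8,
        .rates        = SNDRV_PCM_RATE_8000_48000,
        .rate_min     = 8000,
        .rate_max     = 48000,
        .channels_min = 2,
        .channels_max = 2,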

Lastly, the info field describes some middle layer features that your hardware/driver supports. What I have here is the base for what most drivers will supply. See the ALSA docs for more details. For example, if your hardware has stereo (or multiple channels) but it does not interleave these channels together, you would not have the interleave flag.

Next post will give us some handler callbacks. It will likely be split into two posts.

<< Prev | Next >>

Sunday, May 2, 2010

Writing an ALSA driver: Setting up capture

Now that we have an ALSA card initialized and registered with the middle layer we can move on to describing to ALSA our capture device. Unfortunately for anyone wishing to do playback, I will not be covering that since my device driver only provides for capture. If I end up implementing the playback feature, I will make an additional post.

So let's get started. ALSA provides a PCM API in its middle layer. We will be making use of this to register a single PCM capture device that will have a number of subdevices depending on the low level hardware I have. NOTE: All of the initialization below must be done just before the call to snd_card_register() in the last posting.

struct snd_pcm *pcm;
ret = snd_pcm_new(card, card->driver, 0, 0, nr_subdevs,
                  &pcm);
if (ret < 0)
        return ret;

In the above code we allocate a new PCM structure. We pass the card we allocated beforehand. The second argument is a name for the PCM device, which I have just conveniently set to the same name as the driver. It can be whatever you like. The third argument is the PCM device number. Since I am only allocating one, it's set to 0.

The fourth and fifth arguments are the number of playback and capture streams associated with this device. For my purpose, playback is 0 and capture is the number I have detected that the card supports (4, 8 or 16).

The last argument is where ALSA allocates the PCM device. It will associate any memory for this with the card, so when we later call snd_card_free(), it will clean up our PCM device(s) as well.

Next we must associate the handlers for capturing sound data from our hardware. We have a struct defined as such:

static struct snd_pcm_ops my_pcm_ops = {
        .open      = my_pcm_open,
        .close     = my_pcm_close,
        .ioctl     = snd_pcm_lib_ioctl,
        .hw_params = my_hw_params,
        .hw_free   = my_hw_free,
        .prepare   = my_pcm_prepare,
        .trigger   = my_pcm_trigger,
        .pointer   = my_pcm_pointer,
        .copy      = my_pcm_copy,
};

I will go into the details of how to define these handlers in the next post, but for now we just want to let the PCM middle layer know to use them:

snd_pcm_set_ops(pcm, SNDRV_PCM_STREAM_CAPTURE,
                &my_pcm_ops);
pcm->private_data = mydev;
pcm->info_flags = 0;
strcpy(pcm->name, card->shortname);

Here, we first set the capture handlers for this PCM device to the one we defined above. Afterwards, we also set some basic info for the PCM device such as adding our main device as part of the private data (so that we can retrieve it more easily in the handler callbacks).

Now that we've made the device, we want to initialize the memory management associated with the PCM middle layer. ALSA provides some basic memory handling routines for various functions. We want to make use of it since it allows us to reduce the amount of code we write and makes working with userspace that much easier.

ret = snd_pcm_lib_preallocate_pages_for_all(pcm,
                     SNDRV_DMA_TYPE_CONTINUOUS,
                     snd_dma_continuous_data(GFP_KERNEL),
                     MAX_BUFFER, MAX_BUFFER);
if (ret < 0)
        return ret;

The MAX_BUFFER is something we've defined earlier and will be discussed further in the next post. Simply put, it's the maximum size of the buffer in the hardware (the maximum size of data that userspace can request at one time without waiting on the hardware to produce more data).
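
For reference, a definition consistent with the hardware description in the next post (32 periods of 48 bytes each) would be:

#define MAX_BUFFER (32 * 48)    /* 32 periods of 48 bytes each */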

We are using the simple continuous buffer type here. Your hardware may support DMAing directly into the buffers, in which case you would use SNDRV_DMA_TYPE_DEV along with your PCI device when preallocating. I'm using standard buffers because my hardware requires me to move the data around manually.
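
A sketch of that DMA variant for a PCI card, assuming the ALSA API of this era (snd_dma_pci_data() supplies the struct device used for DMA mapping):

ret = snd_pcm_lib_preallocate_pages_for_all(pcm,
                     SNDRV_DMA_TYPE_DEV,
                     snd_dma_pci_data(pci_dev),
                     MAX_BUFFER, MAX_BUFFER);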

Next post we'll actually define the hardware and the handler callbacks.

<< Prev | Next >>

Saturday, May 1, 2010

Writing an ALSA driver: The basics

In my last post I described a bit of hardware that I am writing an ALSA driver for. In this installment, I'll dig a little deeper into the base driver. I won't go into the details of the module and PCI initialization that were already present in my driver (I developed the core and v4l2 components first, so all of that is taken care of).

So first off I needed to register with ALSA that we actually have a sound card. This bit is easy, and looks like this:

struct snd_card *card;
ret = snd_card_create(SNDRV_DEFAULT_IDX1, "MySoundCard",
                      THIS_MODULE, 0, &card);
if (ret < 0)
        return ret;

This asks ALSA to allocate a new sound card with the name "MySoundCard". This is also the name that appears in /proc/asound/ as a symlink to the card ID (e.g. "card0"). In my particular instance I actually name the card with an ID number, so it ends up being "MySoundCard0". This is because I can, and typically do, have more than one of this type of device installed at a time. I notice some other sound drivers do not do this, probably because they don't expect more than one to be installed at a time (think HDA, which is usually embedded on the motherboard, so you won't have two or more inserted into PCIe slots). Next, we set some of the properties of this new card.

strcpy(card->driver, "my_driver");
strcpy(card->shortname, "MySoundCard Audio");
sprintf(card->longname, "%s on %s IRQ %d", card->shortname,
        pci_name(pci_dev), pci_dev->irq);
snd_card_set_dev(card, &pci_dev->dev);

Here, we've assigned the name of the driver that handles this card, which is typically the same as the actual name of your driver. Next is a short description of the hardware, followed by a longer description. Most drivers seem to set the long description to something containing the PCI info. If you have some other bus, then the convention would follow to use information from that particular bus. Finally, set the parent device associated with the card. Again, since this is a PCI device, I set it to that.

Now to attach our low-level device to this card. We add the next bit of code to do this:

static struct snd_device_ops ops = { NULL };
ret = snd_device_new(card, SNDRV_DEV_LOWLEVEL, mydev, &ops);
if (ret < 0)
        return ret;

We're basically telling ALSA to attach a new low-level device to the card we created. The mydev argument is passed as the private data that is associated with this device, for your convenience. We leave the ops structure as a no-op here for now.
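
If your device did need teardown when the card goes away, a minimal sketch would hook dev_free (the other callbacks, dev_register and dev_disconnect, can stay unset):

static int my_dev_free(struct snd_device *device)
{
        /* device->device_data is the mydev pointer we passed to
         * snd_device_new(); release any audio-side state it holds here */
        return 0;
}

static struct snd_device_ops ops = {
        .dev_free = my_dev_free,
};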

Lastly, to complete the registration with ALSA:

if ((ret = snd_card_register(card)) < 0)
        return ret;

ALSA now knows about this card, and lists it in /proc/asound/ among other places such as /sys. We still haven't told ALSA about the interfaces associated with this card (capture/playback); that will be discussed in the next installment. One last thing: when you clean up your device/driver, you must do so through ALSA as well, like this:

snd_card_free(card);

This will cleanup all items associated with this card, including any devices that we will register later.

<< Prev | Next >>