Multi-Target Tracking in OpenCV (28 Days of Hacking: Day 25)

I was playing around last night with some of the sample code provided with OpenCV when I came across the people detection code based on the Histogram of Oriented Gradients, or HOG. It worked pretty well against some test images I had, but those were fairly clear images with people posing. I remembered Lehigh had a webcam near the library, but unfortunately it was getting late and there was no way I was going to be able to make out anyone. So I started looking for other webcams and came across some others in well-lit and well-traveled areas.
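
The detector itself is only a few lines in OpenCV's Python bindings. Here is a minimal sketch along the lines of the bundled peopledetect sample; the image path and the detection parameters are placeholders, not values I tuned.

```python
import cv2

# Stock HOG descriptor with the pre-trained pedestrian SVM that ships
# with OpenCV.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

img = cv2.imread("frame.png")  # hypothetical test image
rects, weights = hog.detectMultiScale(img, winStride=(8, 8),
                                      padding=(16, 16), scale=1.05)
for (x, y, w, h) in rects:
    cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)
cv2.imwrite("detections.png", img)
```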

The first camera I came across was angled slightly downward, fairly low resolution, and at some distance from a busy area. Not ideal, and the HOG detector didn't do so well. After scavenging around a bit, I found another camera that was at ground level and wasn't too far from a busy sidewalk. The HOG detector worked pretty well at picking up people; however, it was quite susceptible to noise, even from static objects like a tree.

Late last night I realized I had to rethink the problem a bit if I wanted to successfully track people. If I reshape the problem, though, the solution becomes quite easy: if I constrain myself to an environment where the only moving things are people, the problem turns into one of detecting motion and associating that motion with a person. This is significantly easier. Unfortunately, it appears that Lehigh removed the public cameras, or at least I could not find them, so I decided to run my own capture of the same area. For obvious reasons, I will not be releasing the raw image files.

First, a bit of basic theory. If we want to find the rate of change of a continuous function, we take its derivative. Images are not continuous, but we can think of the derivative as Δf/Δt, and since we advance one time step (one sample) per frame, Δt = 1 and the discrete derivative reduces to f[n] - f[n-1]: the difference between two consecutive frames. This can be realized in OpenCV by the absdiff function, since we only care about the absolute value when it comes to pixels.
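
In code, the differencing step is tiny. A sketch, assuming a default webcam and a grayscale conversion:

```python
import cv2

cap = cv2.VideoCapture(0)              # webcam index 0 is an assumption
ok, prev = cap.read()
prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    diff = cv2.absdiff(gray, prev)     # |f[n] - f[n-1]| at every pixel
    prev = gray
    cv2.imshow("diff", diff)
    if cv2.waitKey(30) & 0xFF == 27:   # Esc quits
        break

cap.release()
cv2.destroyAllWindows()
```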

As a demo, I ran this code with my webcam and got the following result:



This works great in theory, but if we apply it to the camera feed we barely get any output! If you look really closely, you might see a few non-black pixels here and there.



So I guess we need to apply some type of thresholding to the image. I had to set a threshold of 5 (on a scale from 0 to 255) to get a reasonable output.
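
In OpenCV this is a one-liner on top of the difference image; the frame files here are hypothetical stand-ins for two consecutive frames.

```python
import cv2

a = cv2.imread("frame0.png", cv2.IMREAD_GRAYSCALE)  # hypothetical frames
b = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
diff = cv2.absdiff(b, a)

# Anything brighter than 5 becomes white (255); everything else goes black.
_, mask = cv2.threshold(diff, 5, 255, cv2.THRESH_BINARY)
```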



The algorithm is really simple: if the value of a pixel is greater than the threshold, set it to white; otherwise, make it black. This is known as binary thresholding. However, I am lazy, and if I write something, I would like it to work under varying conditions without me having to tweak it. Sounds like we might want to check out adaptive filters.

The first one I decided to try out is the mean threshold. The name adaptive filter may sound scary, but it really isn't too difficult. For the mean threshold, we compute the mean of a fixed-size neighborhood and use it as the threshold value. But this isn't exactly what we want, since roughly half of the pixels would land above the mean (for continuous values). So we use a constant c, which gets subtracted from the mean; this gives us better control over the threshold and adds some additional filtering. To show the effect, I varied c from 0 to 2.
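
In OpenCV this is the adaptiveThreshold function. A sketch; the neighborhood size of 11 is a guess on my part, and c = 2 is the top of the range I varied.

```python
import cv2

diff = cv2.imread("diff.png", cv2.IMREAD_GRAYSCALE)  # hypothetical difference image

# Each pixel is compared against the mean of its 11x11 neighborhood minus
# c. The neighborhood size must be odd.
mask = cv2.adaptiveThreshold(diff, 255, cv2.ADAPTIVE_THRESH_MEAN_C,
                             cv2.THRESH_BINARY, 11, 2)
```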







Another approach is to use a distribution, such as a Gaussian, instead of the mean. This weights the values in the neighborhood non-uniformly. The Gaussian in this case is windowed so that it is zero outside of the neighborhood area we specified.
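
In OpenCV this is just a different method constant in the same call as before:

```python
# Neighborhood values are weighted by a windowed Gaussian instead of being
# averaged uniformly; `diff` and the parameters carry over from the mean
# threshold sketch above.
mask = cv2.adaptiveThreshold(diff, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
                             cv2.THRESH_BINARY, 11, 2)
```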



I thought the mean threshold performed a little better in this use case; however, I probably could have tuned either one for better performance given more time. If you notice, there is still noise in the image, so we want to apply morphological transforms such as dilation to enhance our image. If we wanted to remove the noise, we could have applied an erosion to remove noisy pixels, then dilated to enhance the remaining pixels. However, some of the points seemed a little sparse, so I decided to skip the erosion step. Dilation slides a kernel over the image and sets a pixel to white if at least one pixel under the kernel is white. As a result, the objects in our image expand. This only really works on binary images, where white equates to 1 and black equates to 0.
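
A sketch of the dilation step; the kernel size and iteration count are guesses, not tuned values.

```python
import cv2
import numpy as np

mask = cv2.imread("mask.png", cv2.IMREAD_GRAYSCALE)  # hypothetical thresholded image

# A pixel turns white if any pixel under the 3x3 kernel is white, so the
# blobs grow.
kernel = np.ones((3, 3), np.uint8)
dilated = cv2.dilate(mask, kernel, iterations=2)

# The noise-removal variant I skipped would erode first, then dilate:
# cleaned = cv2.dilate(cv2.erode(mask, kernel), kernel)
```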



Now there is not a lot of noise and we have a pretty good representation of the objects. From this, we can compute the contours.
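
Something like this; note that findContours has a slightly different return signature across OpenCV versions.

```python
import cv2

dilated = cv2.imread("dilated.png", cv2.IMREAD_GRAYSCALE)  # hypothetical

# Outer contours of the white blobs. findContours returns (image, contours,
# hierarchy) in OpenCV 3 but (contours, hierarchy) in 2.x and 4.x; [-2]
# grabs the contour list either way.
contours = cv2.findContours(dilated, cv2.RETR_EXTERNAL,
                            cv2.CHAIN_APPROX_SIMPLE)[-2]
```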



We can take the contours and create a bounding box around each one. There is some noise where we end up with multiple bounding boxes around a single person.
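
Continuing from the contour sketch, with `frame` standing in for the color frame being annotated; the area cutoff is one guessed way to drop tiny fragments.

```python
import cv2

for cnt in contours:
    if cv2.contourArea(cnt) < 50:   # guessed cutoff for noise fragments
        continue
    x, y, w, h = cv2.boundingRect(cnt)
    cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
```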



With a little math, we can find the centroid of each object. I decided to draw a line from point (0, 0) to the centroid to represent it.
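
The math is just the first image moments: cx = m10/m00 and cy = m01/m00. Again, `contours` and `frame` carry over from the sketches above.

```python
import cv2

for cnt in contours:
    M = cv2.moments(cnt)
    if M["m00"] == 0:
        continue  # degenerate contour with no area
    cx = int(M["m10"] / M["m00"])
    cy = int(M["m01"] / M["m00"])
    cv2.line(frame, (0, 0), (cx, cy), (0, 0, 255), 1)  # line from the origin
```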



So now that we have the centroids, we can generate a really neat image, which highlights the various paths and how frequently they are traveled.
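
One way to build such an image is to accumulate every centroid into a float buffer over the whole capture and then normalize; `h`, `w`, and `all_centroids` are assumed to come from the capture loop.

```python
import cv2
import numpy as np

heat = np.zeros((h, w), np.float32)
for cx, cy in all_centroids:
    heat[cy, cx] += 1.0                      # count visits per pixel

heat = cv2.GaussianBlur(heat, (15, 15), 0)   # spread points into paths
heat = cv2.normalize(heat, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
cv2.imwrite("paths.png", heat)
```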



We can use some estimation techniques to find vectors from the centroids and then construct each path taken. I am still working on the path construction part. Another neat result is looking at the number of detections with respect to time. I started running the capture at 11:52 am and finished at 12:11 pm.
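
Plotting it is straightforward if you log a (timestamp, count) pair per frame during the capture; `detection_log` here is that assumed list.

```python
import matplotlib.pyplot as plt

times, counts = zip(*detection_log)  # assumed per-frame log
plt.plot(times, counts)
plt.xlabel("time")
plt.ylabel("detections per frame")
plt.savefig("detections_vs_time.png")
```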



It is probably worth noting that classes get out at 12:00 pm, and the next class doesn't start until 12:10 pm, so the data matches up with the traffic pattern I expected.