
Why It’s Not Just Cats and Dogs

It turns out that tracking is a lot more difficult and time-consuming than deciding whether something is a cat or a dog! And while we’re all impressed with the capabilities of self-driving cars, a closer look reveals limitations that aren’t immediately apparent: those algorithms are, quite rightly, carefully tailored to exactly what they need to do and nothing more. Here are some of the issues with using neural nets for tracking, compared with their typical uses:

Source Resolution. Many neural net algorithms are applied to quite small (pre-down-sampled) images, typically 100-200 pixels across. You don’t need every pixel to tell a cat from a dog, or find a face.

Result Resolution. Typical image processing neural nets continuously down-sample the image as additional processing is applied, so results are at only a fraction of the original (down-sampled!) resolution.

Positional Accuracy. Typically, items being identified in a picture (such as faces) aren’t located that accurately, because the original image has been downsampled and the results even more so. If an image is downsampled by 16x, you’re only locating features to the nearest sixteen pixels in the original image. (There are some techniques to try to go back and find the original pixels.)
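To make the arithmetic concrete, here is a minimal sketch of what "nearest sixteen pixels" means for a single coordinate. It assumes results can only land on multiples of the downsampling factor; the function names are ours for illustration and are not part of SynthEyes.

```python
import math


def snap_to_grid(x: float, stride: int) -> float:
    """Return x rounded to the nearest multiple of stride (halves round up)."""
    return float(math.floor(x / stride + 0.5) * stride)


def worst_case_error(stride: int) -> float:
    """Largest possible distance between a true position and its snapped one."""
    return stride / 2.0


if __name__ == "__main__":
    true_x = 100.0  # hypothetical true feature position, in source pixels
    for stride in (1, 4, 16):
        reported = snap_to_grid(true_x, stride)
        print(
            f"downsample {stride:2d}x: reported {reported:6.1f}, "
            f"error {abs(reported - true_x):4.1f}, "
            f"worst case {worst_case_error(stride):4.1f}"
        )
```

Under this assumption, a 16x downsample can misplace a feature by up to eight pixels, which is far too coarse for a tracker.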

Feature Size. Neural nets have a specific receptive field corresponding to the size of what they’re looking for. A perfectly good neural net for finding 80-pixel-high pedestrians in an autonomous car application isn’t going to find a person occupying a quarter of the overall image (270 high), or vice versa. And bigger receptive fields require more compute time.
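As a rough illustration of how a receptive field comes about, the sketch below applies the standard recurrence for a plain stack of convolution or pooling layers: each layer widens the field by (kernel - 1) times the accumulated stride. The layer lists are hypothetical examples, not the networks used in SynthEyes.

```python
from typing import Iterable, Tuple


def receptive_field(layers: Iterable[Tuple[int, int]]) -> int:
    """Receptive field, in input pixels, of a stack of (kernel, stride) layers.

    Assumes a simple sequential stack with no dilation or skip connections.
    """
    field = 1  # a single input pixel sees only itself
    jump = 1   # spacing, in input pixels, between adjacent outputs so far
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


if __name__ == "__main__":
    shallow = [(3, 1)] * 3  # three 3x3 layers, stride 1
    strided = [(3, 2)] * 3  # three 3x3 layers, stride 2
    print("shallow stack:", receptive_field(shallow), "pixels")
    print("strided stack:", receptive_field(strided), "pixels")
```

Covering a much taller subject means more layers, larger kernels, or more striding, and each option costs either compute time or resolution.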

Efficiency. While a complex net can be used to decide if an entire image is a cat or a dog, if we need to find all the cats or dogs, we need to be able to re-use many parts of the network to avoid truly stratospheric compute times. (Convolutional Neural Nets are used for efficiency.)

Accuracy of Results. Neural nets are often considered successes if they make mistakes only 20% of the time, and 5% is very good. With many trackers and many frames, we need to do even better.
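A back-of-the-envelope sketch shows why those error rates don't carry over to tracking. It assumes, purely for illustration, that every tracker position on every frame is an independent trial with the same error rate; the shot size used is hypothetical.

```python
def expected_errors(trackers: int, frames: int, error_rate: float) -> float:
    """Expected number of bad tracker positions over the whole shot."""
    return trackers * frames * error_rate


def chance_of_clean_shot(trackers: int, frames: int, error_rate: float) -> float:
    """Probability that no position is wrong, assuming independent trials."""
    return (1.0 - error_rate) ** (trackers * frames)


if __name__ == "__main__":
    trackers, frames = 50, 200  # hypothetical shot
    for rate in (0.20, 0.05, 0.001):
        print(
            f"error rate {rate:.3f}: "
            f"about {expected_errors(trackers, frames, rate):.0f} bad positions, "
            f"clean-shot probability {chance_of_clean_shot(trackers, frames, rate):.3g}"
        )
```

Even a rate that counts as very good for classification leaves hundreds of bad positions across a shot of this size.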

Consistency. In tracking, there often isn’t a single exact right answer for where a tracker should be placed, given a camera image with noise and various processing and compression artifacts. Instead, we need consistency from one frame to the next, so that a tracker doesn’t dance around too much and produce a high RMS error. You can see frame-to-frame inconsistency in self-driving car output displays, especially when looking at adjacent stopped cars.
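One simple way to put a number on "dancing around" is the RMS distance of a tracker from its own average position while the feature is stationary, as with the stopped cars. This is a minimal sketch with made-up positions; it is not how SynthEyes computes its RMS error.

```python
import math
from typing import Sequence, Tuple


def jitter_rms(positions: Sequence[Tuple[float, float]]) -> float:
    """RMS distance, in pixels, of per-frame positions from their mean."""
    count = len(positions)
    mean_x = sum(x for x, _ in positions) / count
    mean_y = sum(y for _, y in positions) / count
    squared = [(x - mean_x) ** 2 + (y - mean_y) ** 2 for x, y in positions]
    return math.sqrt(sum(squared) / count)


if __name__ == "__main__":
    # Hypothetical positions of one stationary feature over four frames.
    steady = [(120.0, 80.0)] * 4
    dancing = [(119.0, 80.0), (121.0, 80.0), (119.0, 80.0), (121.0, 80.0)]
    print("steady  jitter:", jitter_rms(steady))
    print("dancing jitter:", jitter_rms(dancing))
```

A tracker that sits slightly off but stays put scores zero here, while one that is right on average but hops between frames does not.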

With these complications in mind, we tried to tailor our neural nets to the specific, novel needs of tracking. The neural net capabilities in SynthEyes are a starting point, an experiment, something that can improve over time as advances in algorithms and hardware processing power allow. Just as Sizzle scripts allow exports to third-party applications to be updated easily without an overall SynthEyes app update, the neural nets are data files that can be updated without necessarily changing SynthEyes itself.

©2024 Boris FX, Inc. — UNOFFICIAL — Converted from original PDF.