How To Reduce Vision and Image Processing Times

Have you ever tried writing a program to analyze or process images? If so, you’re likely no stranger to the fact that analyzing large numbers of images can take forever. Whether you’re trying to perform real-time vision processing, machine learning with images, or an IoT image processing solution, you’ll often need to find ways to reduce the processing times if you’re handling large data sets.

All of the techniques listed in this article take advantage of the fact that images more often than not have more data than needed. For example, suppose you get a data set full of 4K resolution full-color images of planes. We’ll use this image below to track our optimization steps.

Removing Colors

There are many situations in which color is necessary. For example, if you’re trying to detect fresh bloodstains in an image, you normally wouldn’t turn an image into grayscale. This is because all fresh bloodstains are red, and so you would be throwing away critical information if you were to remove the color from an image.

However, if color is not necessary, it should be the first thing that you remove from an image to decrease processing times.

The reason removing color from an image decreases processing time is because there are fewer features to process, where we’ll say a feature is some measurable property.

With RGB (red, green, blue, ie; colored images), you have three separate features to measure, whereas with grayscale, you only have one. Our current plane image should now look like this:

Using Convolution Matrices

A convolution matrix, also known as a mask or a kernel, is a 3×3 or 5×5 matrix that is applied over an entire image. For this article, we will examine only 3×3 matrices.

For a 3×3 matrix, we select a 3×3 square in the image, and for each pixel, we multiply that pixel by its corresponding matrix position. We then set the pixel in the center of that 3×3 square to the average of those 9 pixels after the multiplication.

If you wanted this to output visually, you can simply set a pixel to 0 if it’s less than 0, and 255 if it’s greater than 255.

Immediately, you might realize that if we have to select a 3×3 square in the original image, then our convolution matrix would be useless if we selected the top left pixel. If the top left pixel is selected, then you wouldn’t be able to create a 3×3, since you would only have 4 pixels from the 3×3 (ie; you’d have a 2×2) and would be missing the remaining 5 pixels.

There are a wide variety of ways to handle these cases, although we won’t cover them in any depth in this article. For example, You could duplicate the 2×2 four times, by rotating the 2×2 around the center pixel to fill in the missing pixels, or you could just trivially set the missing pixels to 0 (results may be poor if you do this though).

There are massive lists of convolution matrices that can do all sorts of things from sharpening, blurring, detecting vertical lines, and detecting horizontal lines. Here’s our plane after applying a convolution matrix for detecting horizontal lines. Specifically, this matrix is [(-1, -1, -1), (2, 2, 2), (-1, -1, -1)]

Similarly, here’s the result after applying a convolution matrix for detecting vertical lines. The matrix for this one is [(-1, 2, -1), (-1, 2, -1), (-1, 2, -1)].

You might be wondering, “But how does this help me? It doesn’t reduce processing times at all!”. And you’re right. This only makes your processing time longer. However, notice that once you use convolution to extract out the high-level details you want, like edges, your image now has a lot of the excessive noise removed. For example, in the image above, you can see that the sky is no longer in the image.

This means that we’ve isolated the important parts of the images, which allows us to safely reduce the size of the resulting matrix without a huge loss in detail.

SIDE NOTE: You may be wondering why we can’t just downsize the image before we perform any processing steps on it. The reason for this is that if you downsize the image right away, you will almost always lose important detail. Additionally, downsizing an image can create artifacts, and if you are looking for particularly small details, like a 2-4 pixel pattern in a large image, you will almost certainly lose that detail when you scale down the image. This is why you should capture those details first before scaling down.


In a nutshell, pooling is a technique to reduce the size of a matrix. You pool after you apply your convolutions, because each time you pool, you will lose some features.

Generally, each cycle of pooling will decrease the number of features in your image by some multiplicative constant. It’s trivial to see that if you continuously pool your image over and over again, you will eventually lose too much detail (like if you pooled until you just had a single 1×1 matrix).

Pooling works by first selecting an arbitrarily sized square. Let’s say you want to use a 4×4 square. The goal of pooling is to take this 4×4 square in a matrix, and reduce it to a single 1×1 matrix. This can be done in many ways. For example, max pooling is when you take the maximum value in that 4×4 matrix, average pooling is when you average all the values of the matrix, and min pooling is when you take the minimum value from the matrix.

As a rule of thumb, you will want to use max pooling since that captures the most prominent part of the 4×4 matrix. For example, in edge detection, you would want to use max pooling because it would downsize the matrix while still showing you the location of the edges.

What you would not use is min pooling, because if there is even a single cell where no edge was detected inside a 4×4 matrix that is otherwise full of edges, the pooling step would leave you with a value of 0, indicating that there was no edge in that 4×4 matrix.

For a better understanding of why you should pool, consider the fact that a 4K image is a 3840 x 2160 image, which is 8,294,400 individual features to process. Suppose we can process ten 4K images a second (82,940,000 features a second). Let’s compare the original 3840 x 2160 representation versus a 480 x 270 pooled representation.

# Images3840 x 2160 image (time)480 x 270 image (time)
101 second0.015625 seconds
1,00016.67 minutes1.56 seconds
1,000,00011.57 days26.04 minutes
1,000,000,00031.71 years18.0844 days

At ten 4K images a second, it would take over 30 years to process a million images, whereas it would only take 18 days if you had done pooling.


When processing images, especially high-resolution images, it’s important that you shrink down the number of features. This can be done through many methods. In this article, we covered converting an image to grayscale, as well as techniques such as convolution to extract important features, and then pooling to reduce the spatial complexity.

In this article, we compared the difference between pooling and not pooling, and found that the difference of analyzing a million 4K grayscale image without pooling would take 31 years, versus 18 days if we had pooled it down to a 480 x 270 image. However, not turning the images into grayscale can also have a noticeable effect.

As a final food for thought, if you had performed none of the optimizations mentioned in this article, analyzing a million full-color 4K resolution images with convolutions would take nearly an entire century, versus a measly 18 days if you had turned them into grayscale and then performed convolution and pooling.

In other words, with no optimizations, your image processing would take so long, that you could be rolling in your grave, and your program still wouldn’t be done running.