How To Reduce Vision and Image Processing Times

Have you ever tried writing a program to analyze or process images? If so, you’re likely no stranger to the fact that analyzing large numbers of images can take forever. Whether you’re trying to perform real-time vision processing, machine learning with images, or an IoT image processing solution, you’ll often need to find ways to reduce the processing times if you’re handling large data sets.

All of the techniques listed in this article take advantage of the fact that images more often than not have more data than needed. For example, suppose you get a data set full of 4K resolution full-color images of planes. We’ll use this image below to track our optimization steps.

Removing Colors

There are many situations in which color is necessary. For example, if you’re trying to detect fresh bloodstains in an image, you normally wouldn’t turn an image into grayscale. This is because all fresh bloodstains are red, and so you would be throwing away critical information if you were to remove the color from an image.

However, if color is not necessary, it should be the first thing that you remove from an image to decrease processing times.

Removing color from an image decreases processing time because there are fewer features to process, where we’ll say a feature is some measurable property.

With RGB (red, green, blue, i.e., colored images), you have three separate features to measure for every pixel, whereas with grayscale, you only have one. Our current plane image should now look like this:
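As a rough illustration of going from three features per pixel to one, here is a minimal sketch of a grayscale conversion in plain Python. The weights are the commonly used ITU-R BT.601 luma coefficients, which is an assumption on my part; other weightings (or a plain average) also work, and the function name `to_grayscale` is just for illustration.

```python
# Assumed luma weights (ITU-R BT.601); other weightings exist.
LUMA_WEIGHTS = (0.299, 0.587, 0.114)


def to_grayscale(rgb_image):
    """Convert rows of (r, g, b) tuples into rows of single 0-255 values."""
    gray_image = []
    for row in rgb_image:
        gray_row = []
        for r, g, b in row:
            value = (LUMA_WEIGHTS[0] * r
                     + LUMA_WEIGHTS[1] * g
                     + LUMA_WEIGHTS[2] * b)
            gray_row.append(int(round(value)))
        gray_image.append(gray_row)
    return gray_image


# A tiny 2x2 "image": white, black, pure red, pure blue
rgb = [[(255, 255, 255), (0, 0, 0)],
       [(255, 0, 0), (0, 0, 255)]]
gray = to_grayscale(rgb)
print(gray)
```

Each pixel now holds one number instead of three, so any later step has a third as many values to touch.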

Using Convolution Matrices

A convolution matrix, also known as a mask or a kernel, is a small matrix, usually 3×3 or 5×5, that is applied over an entire image. For this article, we will examine only 3×3 matrices.

For a 3×3 matrix, we select a 3×3 square in the image and multiply each pixel in that square by the value at the corresponding matrix position. We then set the pixel in the center of that 3×3 square to the sum of those 9 products. (Some matrices, such as a simple blur, also divide by a constant like 9, which turns the sum into an average.)

If you want to display the result as an image, clamp each value to the valid range: set a pixel to 0 if it’s less than 0, and to 255 if it’s greater than 255.
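The steps above can be sketched in plain Python as follows. This is a minimal illustration, not an optimized implementation: it sums the nine products (divide by 9 if you want the averaged variant), skips border pixels entirely, and clamps the result to 0-255. The names `convolve3x3` and `clamp` are made up for this example.

```python
def clamp(value, low=0, high=255):
    """Force a value into the displayable 0-255 range."""
    return max(low, min(high, value))


def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to every pixel that has a full 3x3 neighborhood.

    Border pixels are skipped, so the output is 2 rows and 2 columns
    smaller than the input.
    """
    height, width = len(image), len(image[0])
    output = []
    for y in range(1, height - 1):
        out_row = []
        for x in range(1, width - 1):
            total = 0
            for ky in range(3):
                for kx in range(3):
                    total += image[y + ky - 1][x + kx - 1] * kernel[ky][kx]
            out_row.append(clamp(total))
        output.append(out_row)
    return output


# One example kernel: it responds strongly to horizontal lines
HORIZONTAL_LINES = [(-1, -1, -1), (2, 2, 2), (-1, -1, -1)]

# A 5x5 grayscale image with a bright horizontal stripe in the middle
image = [
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
    [90, 90, 90, 90, 90],
    [10, 10, 10, 10, 10],
    [10, 10, 10, 10, 10],
]
result = convolve3x3(image, HORIZONTAL_LINES)
for row in result:
    print(row)
```

Only the row centered on the stripe survives; the flat areas above and below it are pushed below zero and clamped away.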

Immediately, you might realize that this breaks down at the borders of the image. If the top left pixel is selected as the center, then you can’t form a full 3×3 square, since you would only have 4 of the 9 pixels (i.e., you’d have a 2×2) and would be missing the remaining 5 pixels.

There are a wide variety of ways to handle these cases, although we won’t cover them in any depth in this article. For example, you could fill in the missing pixels by duplicating the 2×2, rotating it around the center pixel, or you could simply set the missing pixels to 0 (though results may be poor if you do this).
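To make the border problem concrete, here is a small sketch of two padding strategies in plain Python. `pad_zero` is the "set the missing pixels to 0" option mentioned above. `pad_replicate` is a different, commonly used alternative that copies the nearest edge pixel outward; I’ve swapped it in because it is simpler to show than the rotation approach. Both function names are hypothetical.

```python
def pad_zero(image, border=1):
    """Surround the image with a border of zeros."""
    width = len(image[0])
    blank = [0] * (width + 2 * border)
    padded = [list(blank) for _ in range(border)]
    for row in image:
        padded.append([0] * border + list(row) + [0] * border)
    padded.extend(list(blank) for _ in range(border))
    return padded


def pad_replicate(image, border=1):
    """Surround the image by repeating its outermost pixels."""
    rows = []
    for row in image:
        rows.append([row[0]] * border + list(row) + [row[-1]] * border)
    top = [list(rows[0]) for _ in range(border)]
    bottom = [list(rows[-1]) for _ in range(border)]
    return top + rows + bottom


image = [[1, 2],
         [3, 4]]
zero_padded = pad_zero(image)
edge_padded = pad_replicate(image)
for row in zero_padded:
    print(row)
for row in edge_padded:
    print(row)
```

After padding by one pixel, every original pixel has a full 3×3 neighborhood, so the convolved output keeps the original dimensions.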

There are long lists of convolution matrices that can do all sorts of things, such as sharpening, blurring, detecting vertical lines, and detecting horizontal lines. Here’s our plane after applying a convolution matrix for detecting horizontal lines. Specifically, this matrix is [(-1, -1, -1), (2, 2, 2), (-1, -1, -1)].

Similarly, here’s the result after applying a convolution matrix for detecting vertical lines. The matrix for this one is [(-1, 2, -1), (-1, 2, -1), (-1, 2, -1)].

You might be wondering, “But how does this help me? It doesn’t reduce processing times at all!” And you’re right: on its own, this only makes your processing time longer. However, notice that once you use convolution to extract the high-level details you want, like edges, much of the irrelevant detail has been removed from your image. For example, in the image above, you can see that the sky is no longer in the image.

This means that we’ve isolated the important parts of the images, which allows us to safely reduce the size of the resulting matrix without a huge loss in detail.

SIDE NOTE: You may be wondering why we can’t just downsize the image before we perform any processing steps on it. The reason for this is that if you downsize the image right away, you will almost always lose important detail. Additionally, downsizing an image can create artifacts, and if you are looking for particularly small details, like a 2-4 pixel pattern in a large image, you will almost certainly lose that detail when you scale down the image. This is why you should capture those details first before scaling down.

Pooling

In a nutshell, pooling is a technique to reduce the size of a matrix. You pool after you apply your convolutions, because each time you pool, you will lose some features.

Generally, each cycle of pooling will decrease the number of features in your image by some multiplicative constant. It’s trivial to see that if you continuously pool your image over and over again, you will eventually lose too much detail (like if you pooled until you just had a single 1×1 matrix).

Pooling works by first selecting an arbitrarily sized square. Let’s say you want to use a 4×4 square. The goal of pooling is to take this 4×4 square in a matrix, and reduce it to a single 1×1 matrix. This can be done in many ways. For example, max pooling is when you take the maximum value in that 4×4 matrix, average pooling is when you average all the values of the matrix, and min pooling is when you take the minimum value from the matrix.

As a rule of thumb, you will want to use max pooling since that captures the most prominent part of the 4×4 matrix. For example, in edge detection, you would want to use max pooling because it would downsize the matrix while still showing you the location of the edges.

What you would not use is min pooling, because if there is even a single cell where no edge was detected inside a 4×4 matrix that is otherwise full of edges, the pooling step would leave you with a value of 0, indicating that there was no edge in that 4×4 matrix.
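Here is a minimal sketch of the three pooling variants in plain Python, using non-overlapping blocks. The `pool` helper and its `reducer` parameter are names I made up for this example. The sample matrix reproduces the situation described above: a 4×4 block full of detected edges with a single empty cell.

```python
def pool(matrix, size, reducer):
    """Reduce each non-overlapping size x size block to a single value."""
    pooled = []
    for top in range(0, len(matrix) - size + 1, size):
        pooled_row = []
        for left in range(0, len(matrix[0]) - size + 1, size):
            block = [matrix[top + dy][left + dx]
                     for dy in range(size) for dx in range(size)]
            pooled_row.append(reducer(block))
        pooled.append(pooled_row)
    return pooled


def average(values):
    return sum(values) / len(values)


# 255 = edge detected, 0 = no edge
edges = [
    [255, 255, 255, 255],
    [255, 255, 255, 255],
    [255, 255,   0, 255],
    [255, 255, 255, 255],
]

max_pooled = pool(edges, 4, max)
avg_pooled = pool(edges, 4, average)
min_pooled = pool(edges, 4, min)
print(max_pooled, avg_pooled, min_pooled)
```

Max pooling keeps the edge, average pooling dims it slightly, and min pooling erases it because of that single empty cell.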

For a better understanding of why you should pool, consider the fact that a 4K image is a 3840 x 2160 image, which is 8,294,400 individual features to process. Suppose we can process ten 4K images a second (82,944,000 features a second). A 480 x 270 pooled representation has only 129,600 features, which is 64 times fewer. Let’s compare the original 3840 x 2160 representation versus the 480 x 270 pooled representation.

| # Images | 3840 x 2160 image (time) | 480 x 270 image (time) |
|---|---|---|
| 10 | 1 second | 0.015625 seconds |
| 1,000 | 1.67 minutes | 1.56 seconds |
| 1,000,000 | 1.16 days | 26.04 minutes |
| 1,000,000,000 | 3.17 years | 18.08 days |

At ten 4K images a second, it would take over three years to process a billion images, whereas it would take only about 18 days if you had pooled them first.
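If you’d like to check the arithmetic yourself, the timings follow from just two stated assumptions: a throughput of ten 4K images per second, and processing time that scales linearly with the number of features. A minimal sketch (the constant and function names are my own):

```python
FULL_FEATURES = 3840 * 2160      # features in one 4K grayscale image
POOLED_FEATURES = 480 * 270      # features after pooling down
IMAGES_PER_SECOND_4K = 10        # assumed throughput from the text
FEATURES_PER_SECOND = FULL_FEATURES * IMAGES_PER_SECOND_4K


def seconds_needed(num_images, features_per_image):
    """Processing time, assuming it scales linearly with feature count."""
    return num_images * features_per_image / FEATURES_PER_SECOND


speedup = FULL_FEATURES / POOLED_FEATURES
print("Speedup from pooling:", speedup)

for n in (10, 1_000, 1_000_000, 1_000_000_000):
    full = seconds_needed(n, FULL_FEATURES)
    pooled = seconds_needed(n, POOLED_FEATURES)
    print(n, "images:", full, "s unpooled,", pooled, "s pooled")
```

Divide the results by 60, 3,600, or 86,400 to convert seconds into minutes, hours, or days.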

Conclusion

When processing images, especially high-resolution images, it’s important that you shrink down the number of features. This can be done through many methods. In this article, we covered converting an image to grayscale, as well as techniques such as convolution to extract important features, and then pooling to reduce the spatial complexity.

We also compared pooling against not pooling, and found that shrinking a 4K grayscale image down to 480 x 270 leaves 64 times fewer features to analyze, which is the difference between years and days on a large enough data set. However, not turning the images into grayscale can also have a noticeable effect.

As a final food for thought, if you had performed none of the optimizations mentioned in this article, every full-color 4K image would carry three color channels of 8,294,400 values each, which is 192 times as many features as a grayscale image pooled down to 480 x 270.

In other words, with no optimizations, your image processing would take so long that you could be rolling in your grave and your program still wouldn’t be done running.

Checking Whether or Not an Ad Is a Tide Ad with Keras And NumPy


There’s a bunch of kids running around with a Coca-Cola in their hands. But hold on: look at their clothes! They’re so clean and white. Too clean, almost. Could this be a Tide ad?

Machine learning to the rescue! In this article, I’ll be showing you how to use TensorFlow, a machine learning library, to predict whether or not an ad is a Tide ad.

Prerequisites


This tutorial will be using Linux. You can probably do it on Windows too, but you may have to change some things. Here are the things you will need :

1. Python 3
2. TensorFlow (pip3 install tensorflow)
3. Keras (pip3 install keras)
4. ffmpeg (sudo apt-get install ffmpeg)
5. h5py (pip3 install h5py)
6. HDF5 (sudo apt-get install libhdf5-serial-dev)
7. Pillow (pip3 install pillow)
8. NumPy (pip3 install numpy)

Although VirtualEnv is not required, it is suggested that you use VirtualEnv to prevent any conflicts / version mistakes between Python 2 and Python 3.

Also, you can find all the code and bash scripts here : https://github.com/HenryDangPRG/TideAdIdentifier

Getting Started


First, we need to describe what our neural network will do. In this case, our neural network will take one image as input and tell you whether or not that image belongs to a Tide ad. Using ffmpeg, we can split a video into its frames, which lets us feed an entire video into the neural network as well. If over 50% of the frames in a video are classified as “Tide ads”, then we will consider it to be a Tide ad.

Next, we need data for our neural network to train on. The data will be a large set of .png images that we will get from slicing a video into individual pictures. I will not provide the videos as a download here, so you will need to find the 1 minute 45 second video of all the Super Bowl Tide ads, as well as 5 minutes’ worth of non-Tide ads. Also, the two videos should have the same dimensions so that the images that come out are all the same size.

Once you obtain these two videos, convert them into .avi format and use ffmpeg to split them into their constituent frames. I’ve created a simple Bash script that will do the splitting process automatically for you, as long as you name the Tide ad video “tide.avi” and the non-Tide ad video “non_tide.avi”.

You can find the script here : https://github.com/HenryDangPRG/TideAdIdentifier/blob/master/generate_data.sh

The script above will take the two videos, and split 5 frames per second of the video, each frame being 512 x 288, into two separate folders. You can choose to do this on your own as well, but in this tutorial, as a convention, all Tide ad pictures will be in a directory called “tide_ads”, and all non-Tide ad pictures will be in a directory called “non_tide_ads”.
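If you’d rather do the splitting from Python than from Bash, here is a rough sketch of the equivalent ffmpeg call. It is not the script from the repository. The `fps` and `scale` video filters are standard ffmpeg filters, but the output file naming (`prefix-0000001.png` and so on) and the helper names are assumptions made for this example. ffmpeg must be installed, and the output directory must already exist.

```python
import subprocess


def build_split_command(video_path, output_dir, prefix,
                        fps=5, width=512, height=288):
    """Build an ffmpeg command that writes `fps` frames per second as PNGs."""
    # Hypothetical naming scheme: prefix-0000001.png, prefix-0000002.png, ...
    output_pattern = "{}/{}-%07d.png".format(output_dir, prefix)
    return [
        "ffmpeg",
        "-i", video_path,
        "-vf", "fps={},scale={}:{}".format(fps, width, height),
        output_pattern,
    ]


def split_video(video_path, output_dir, prefix):
    """Run ffmpeg; raises CalledProcessError if ffmpeg reports a failure."""
    subprocess.run(build_split_command(video_path, output_dir, prefix),
                   check=True)


command = build_split_command("tide.avi", "tide_ads", "tide")
print(" ".join(command))
```

Calling `split_video("tide.avi", "tide_ads", "tide")` and `split_video("non_tide.avi", "non_tide_ads", "non_tide")` would then fill the two directories used in this tutorial.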

We’ll have to do the same with the test data, and the prediction data, and these are the bash scripts for those:

https://github.com/HenryDangPRG/TideAdIdentifier/blob/master/generate_test.sh
https://github.com/HenryDangPRG/TideAdIdentifier/blob/master/generate_predictions.sh

For the predictions, you can input any video format as the argument for the Bash script, but .avi is suggested for consistency.

NOTE : Remember to use chmod +x BASH_SCRIPT_NAME.sh on all of the Bash scripts so that you can execute them!

Creating A Convolutional Neural Network


Although this analogy is not perfect, you can think of a neural network as a group of students in a classroom who are all shouting out an answer. In this classroom, students are trying to determine whether or not a single image is from a Tide ad. Some students have a louder voice, so their “vote” for an answer counts more. A neural network can be thought of as thousands of students, all shouting different answers. The loudest answer gets passed to the next classroom, and those students discuss the answer (with their answers being modified by the previous classroom’s answer, perhaps by peer pressure), until we reach the very last classroom, where the loudest answer is the answer for the neural network.


A convolutional neural network (CNN) is similar in that the students are still looking at an image, but they are only looking at a piece of the image. When they finish analyzing this image, they pass it to the next classroom, but the next classroom gets an even tinier piece of the original image. And so on and so forth, with each next classroom’s image getting smaller and smaller. When they’re done, a vote is outputted.

This is different in that in a normal neural network, the students vote for the entire image at once, but in a CNN, they each only vote for a piece of the image.

Now that you know the basics, let’s jump into the code.

Making The Magic Happen


First, let’s define the parameters for our CNN.

from keras.preprocessing.image import ImageDataGenerator
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D
from keras.layers import Activation, Dropout, Flatten, Dense
from keras import backend as K

# Could use larger dimensions, but will make training
# times much much longer
img_width, img_height = (128, 72)

train_dir = 'train_data'
test_dir = 'test_data'

num_train_samples = 4000
num_test_samples = 2000
epochs = 20
batch_size = 8

Each image will be shrunk to 128 x 72 pixels. Although we could go smaller, we would risk losing too much information. Larger could be better, but the larger the image dimensions are, the longer it will take for the CNN to train.

Next, we’ll have to tell our CNN whether the color channels come first or last in each image array. With the TensorFlow backend, Keras defaults to channels last, but the check below handles either setting. Note that an image could have just one color channel if it were grayscale, but in this case, we will only be using color images. We have to specify the ordering because Keras will convert our images into NumPy arrays, and the shape of those arrays has to match what the network expects.

# If data is formatted to have the channels first,
# then stick the RGB channels in front, else put them
# at the end.
if K.image_data_format() == 'channels_first':
    input_shape = (3, img_width, img_height)
else:
    input_shape = (img_width, img_height, 3)

Now that our CNN knows how many color channels (3 means RGB) our pictures have, as well as the dimensions of the image, we can apply three layers of convolution -> ReLU -> max pooling.

The convolution step is essentially taking a tiny matrix and multiplying it to sections of the original matrix. When this is done, the result of each multiplication is recorded into a new matrix. The result is that we get a smaller matrix, which is called a feature map, containing unique features of the image that the machine can then process.

Next, we will need to apply ReLU, or Rectified Linear Unit. ReLU is incredibly simple. If the number is negative, make it zero; otherwise, leave it alone. This is because for a neural network, a negative value doesn’t really offer much information in the context of an image.

Imagine if we were detecting whether or not an image has dark blue lines. A value of zero in the feature map just means that there are no dark blue lines, while a positive value means that there might be a dark blue line. In that case, a negative value has no useful meaning, and can just be set to zero. ReLU also makes computations easier because zeroes are incredibly easy to deal with.
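In code, the rule really is that short. A minimal sketch in plain Python, applied to a small made-up feature map (the helper name `relu_map` is just for illustration):

```python
def relu(value):
    """Negative values become zero; everything else is left alone."""
    return max(0, value)


def relu_map(feature_map):
    """Apply ReLU to every cell of a 2D feature map."""
    return [[relu(value) for value in row] for row in feature_map]


feature_map = [[-3, 0, 2],
               [5, -1, 4]]
activated = relu_map(feature_map)
print(activated)
```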

Finally, we apply max pooling, which takes subsections of a matrix and extracts the highest value from each subsection. This will shrink the matrix, reducing computation times, as well as giving us the most important parts of the matrix.

For our Tide ad predictor, we’ll use three layers of convolution, ReLU, and max pooling. When we’re done, we’ll put it all together with a fully connected layer.

# First convolution layer
model = Sequential()
model.add(Conv2D(32, (3, 3), input_shape=input_shape))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Second convolution layer
model.add(Conv2D(32, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

# Third convolution layer
model.add(Conv2D(64, (3, 3)))
model.add(Activation('relu'))
model.add(MaxPooling2D(pool_size=(2, 2)))

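# Convolution is done, so make the fully connected layer
model.add(Flatten())
model.add(Dense(64))
model.add(Activation('relu'))

# Dropout (explained below), then a single sigmoid output for the
# two-class prediction. The model must be compiled before training.
# Assumption: a 0.5 dropout rate, binary cross-entropy loss, and the
# rmsprop optimizer are common choices for a two-class problem;
# adjust as needed.
model.add(Dropout(0.5))
model.add(Dense(1))
model.add(Activation('sigmoid'))
model.compile(loss='binary_crossentropy',
              optimizer='rmsprop',
              metrics=['accuracy'])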
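To see how much the three convolution and pooling stages shrink a 128 x 72 input, you can trace the shapes by hand. The sketch below assumes the Keras defaults used by the layers above: Conv2D with no padding and a stride of 1 (so a 3×3 kernel trims 2 from each dimension), and MaxPooling2D with a 2×2 window that halves each dimension, rounding down. The helper names are my own.

```python
def conv_output(size, kernel=3):
    """Size after an unpadded, stride-1 convolution."""
    return size - kernel + 1


def pool_output(size, pool=2):
    """Size after non-overlapping pooling (rounds down)."""
    return size // pool


def trace_shapes(width, height, filters_per_stage=(32, 32, 64)):
    """Return the (width, height, filters) shape after each conv + pool stage."""
    shapes = []
    for filters in filters_per_stage:
        width, height = conv_output(width), conv_output(height)
        width, height = pool_output(width), pool_output(height)
        shapes.append((width, height, filters))
    return shapes


shapes = trace_shapes(128, 72)
final_width, final_height, final_filters = shapes[-1]
flattened = final_width * final_height * final_filters
print(shapes)
print("Values fed into the fully connected layer:", flattened)
```

The fully connected layer therefore sees a few thousand values per image rather than the 27,648 raw values (128 x 72 x 3) that went in.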

Then, we apply dropout. Dropout, in our classroom analogy, is like duct taping certain students so that they cannot vote. By using dropout, every student needs to learn to give the right answer, rather than depending on the “smart” students to give the right answers. This gives us an extra layer of redundancy and prevents the loudest students from always overpowering the quiet students.


In the context of neural networks, dropout reduces overfitting, which is when the neural network becomes too accustomed to the training data and fails to generalize to data that it hasn’t seen before. This can happen when the network comes to rely too heavily on a few neurons, especially when one neuron has an abnormally high weight. Dropout makes it so that the influence of any one neuron (or group of neurons) is significantly reduced.
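For the curious, here is a rough sketch of what dropout does to a layer’s outputs during training, written in plain Python rather than Keras. Each value is zeroed with probability `rate`, and the survivors are scaled up by 1 / (1 - rate) so the expected total stays the same, which is how the Keras Dropout layer behaves during training. The function name and the seeded random generator are choices made for this example.

```python
import random


def dropout(activations, rate, rng):
    """Zero each activation with probability `rate`; scale the rest up."""
    if not 0 <= rate < 1:
        raise ValueError("rate must be in [0, 1)")
    keep_scale = 1.0 / (1.0 - rate)
    return [0.0 if rng.random() < rate else a * keep_scale
            for a in activations]


rng = random.Random(0)          # seeded so the run is repeatable
activations = [1.0] * 10
dropped = dropout(activations, rate=0.5, rng=rng)
print(dropped)
```

At prediction time, dropout is switched off and every “student” gets to vote again.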

From here, everything is pretty self-explanatory. We create some extra data points by slightly modifying the original images, and then plug those into our model. Finally, we train the model and save the completed CNN.

# perform random transformations so that the
# data is more varied
train_datagen = ImageDataGenerator(
    rescale=1. / 255,
    shear_range=0.2,
    zoom_range=0.2,
    horizontal_flip=True)

test_datagen = ImageDataGenerator(rescale=1. / 255)

# make extra training data by modifying original training images
train_generator = train_datagen.flow_from_directory(
    train_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

# load the test images (rescaled only, no random transformations)
validation_generator = test_datagen.flow_from_directory(
    test_dir,
    target_size=(img_width, img_height),
    batch_size=batch_size,
    class_mode='binary')

# Train the CNN
model.fit_generator(
    train_generator,
    steps_per_epoch=num_train_samples // batch_size,
    epochs=epochs,
    validation_data=validation_generator,
    validation_steps=num_test_samples // batch_size)

# Save the CNN model for use with predictions
model.save('saved_model.h5')

If you don’t want to spend the time finding videos and training the CNN, the trained model is available in the GitHub repository here : https://github.com/HenryDangPRG/TideAdIdentifier

Predicting Whether a Video Is a Tide Ad


The prediction part is fairly trivial. All we need to do is take a video, turn it into image frames, turn those image frames into NumPy arrays, and then feed them into the trained CNN.

The video splitting part is done by generate_predictions.sh. All the Python code below does is feed those image frames into the CNN. If more than half of the frames aren’t Tide ads, then we can conclude that the video probably isn’t a Tide ad. (Note that in the prediction array, a value of 1 means it is NOT a Tide ad, and a value of 0 means that it IS a Tide ad.)

import numpy as np
import subprocess
from keras.preprocessing import image
from keras.models import load_model

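def img_to_array(file_name):
    loaded_img = image.load_img(file_name, target_size=(img_width, img_height))
    img_array = image.img_to_array(loaded_img)
    # Scale pixel values to [0, 1] to match the rescale=1. / 255
    # that was applied to the training images
    img_array = img_array / 255.
    # Add a leading batch dimension so the arrays can be stacked
    img_array = np.expand_dims(img_array, axis=0)
    return img_array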

if __name__ == "__main__":
    directory = "predictions/"
    num_images = int(subprocess.getoutput("ls predictions | wc -l"))
    img_width, img_height = (128, 72)

    # load the saved model, and use it for prediction
    model = load_model("saved_model.h5")
    images = []

    for i in range(1, num_images+1):
        # Need up to 6 leading zeroes for formatting
        i_ = "{:0>7}".format(i)
        next_img = img_to_array(directory + "prediction-" + str(i_) + ".png")
        images.append(next_img)  
    images = np.vstack(images) 
    prediction = model.predict(images)
    # Total of the per-frame predictions, where 1 means NOT a Tide ad
    not_tide_total = float(prediction.sum())
    percent_not_tide = not_tide_total * 100 / num_images

    print("There is a " + str(percent_not_tide) + "% chance this is not a Tide ad.")

    # If at least half the images are NOT Tide ads
    if not_tide_total >= num_images / 2:
        print("It can be concluded that this is NOT a Tide ad!")
    else:
        print("It can be concluded that this IS a Tide ad!")
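The script above adds up the raw per-frame predictions. If you would rather follow the “more than half of the frames” rule literally, you can first turn each frame’s score into a yes/no vote and then count the votes. Here is a minimal sketch of that variation in plain Python; the 0.5 cutoff and the function name `classify_video` are assumptions for illustration, and the scores are made up.

```python
def classify_video(frame_scores, threshold=0.5):
    """Return (is_tide_ad, percent_not_tide) from per-frame scores.

    Scores at or above `threshold` count as a vote for "not a Tide ad",
    following the convention that 1 means NOT a Tide ad and 0 means Tide ad.
    """
    if not frame_scores:
        raise ValueError("need at least one frame")
    not_tide_frames = sum(1 for score in frame_scores if score >= threshold)
    percent_not_tide = 100.0 * not_tide_frames / len(frame_scores)
    # The video is a Tide ad only if more than half the frames voted "Tide"
    is_tide_ad = percent_not_tide < 50.0
    return is_tide_ad, percent_not_tide


# Made-up scores for a four-frame video
scores = [0.1, 0.2, 0.9, 0.4]
is_tide_ad, percent_not_tide = classify_video(scores)
print(is_tide_ad, percent_not_tide)
```

Voting this way keeps one very confident frame from outweighing several mildly confident ones, at the cost of ignoring how confident each frame was.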

Conclusion


How well does this work? Well, not too well. This neural network uses a pitifully small data set, and was trained for very few epochs, so it makes sense that the results are not particularly accurate.

Feeding in Pepsi’s 2018 Super Bowl ad gives the following result :

There is a 45.833333333333336% chance this is not a Tide ad.
It can be concluded that this IS a Tide ad!

And feeding in Skittles’ 2018 Super Bowl ad gives this result :

There is a 53.4675615212528% chance this is not a Tide ad.
It can be concluded that this is NOT a Tide ad!

So, does this make every Super Bowl ad a Tide ad? Our neural network seems to think it does. Almost.