Neural Style Transfer in Pytorch
A Modern Starry Night
In this article we'll implement a neural algorithm of artistic style, based on the original paper by Gatys et al. The algorithm lets us separate the content and style of images and create new images by mixing the style of one image with the content of another. To do so, we'll use a pre-trained convolutional neural network and perform gradient descent on the pixel values of an input image to simultaneously minimize a combination of two distance measures, one for content and one for style. The model of choice is VGG-19, where the 19 stands for the number of weight layers in the network (16 convolutional and 3 fully connected). It is the same model the authors used in the original paper, and it is readily available with pre-trained weights.
Note that the gradient descent process here is a bit different from how neural networks are usually trained. Where one would usually adjust the weights of the neural network layers, we will instead keep them fixed and treat the input image's pixel values as parameters. Our gradients with respect to the distance measure will then be backpropagated to the inputs, thus transforming the inputs (and therefore the image itself).
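To make the idea concrete, here is a minimal sketch of optimizing an input while the network stays fixed. The tiny linear layer, the goal vector and the learning rate are illustrative assumptions, not part of the actual style transfer setup.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in "network" with fixed weights (assumption: the article uses VGG-19).
net = nn.Linear(4, 2)
for p in net.parameters():
    p.requires_grad_(False)

goal = torch.tensor([[1.0, -1.0]])          # output we want the network to produce
x = torch.zeros(1, 4, requires_grad=True)   # the input is the only trainable tensor
optimizer = torch.optim.SGD([x], lr=0.1)

initial_loss = ((net(x) - goal) ** 2).mean().item()
for _ in range(200):
    optimizer.zero_grad()
    loss = ((net(x) - goal) ** 2).mean()
    loss.backward()    # gradients flow back to x, not to the frozen weights
    optimizer.step()
final_loss = ((net(x) - goal) ** 2).mean().item()
```

Only `x` changes during the loop; the weights of `net` never receive a gradient.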
In addition to Pytorch, we'll make heavy use of the Torchvision package, which offers handy image transformation methods as well as pre-trained models. Let's import all the packages we need for our neural style transfer algorithm.
Before we start setting up the model, we'll first define some utility functions, which will help us transfer images to and from Pytorch tensors, compute features of a given layer, as well as a Gram matrix which we'll use in our style loss later on.
Since we want to work with our own uploaded image, we have to define a function to load such an image and turn it into a Pytorch tensor, so we can use it as an input to our model. The function will take as arguments an image path, as well as a maximum size and an optional shape argument. Large images will slow down the processing later on, and since we're impatient, we'll cap the image resolution to 600x400.
With the size set, we'll build up a list of image transformations using transforms.Compose. The first transform in this list resizes the image. Afterwards we transform the image into a Pytorch tensor because our model expects tensor inputs. The last transformation is also specific to the model we will be using: a normalization of the kind that was used when training the model for its original purpose, image classification on the ImageNet dataset. The exact numbers used there are the per-channel mean and standard deviation of the ImageNet dataset. We'll just take them as a given for now.
Let's test this function by loading our style image, The Starry Night by Vincent van Gogh.
torch.Size([1, 3, 400, 600])
As we can see, our function returned a tensor of shape torch.Size([1, 3, 400, 600]). This corresponds to the batch dimension, the RGB channels, and the height and width, which is the expected input shape for 2D convolution layers in Pytorch.
Next we also want a function to do the opposite: convert an image tensor back to a numpy array which we can display. For that we have to rearrange the dimensions in the right way and undo the normalization.
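A possible version of this inverse conversion is sketched below. The name `im_convert` is an assumption, and the constants are the usual ImageNet statistics, so it only undoes a normalization that used those same values.

```python
import numpy as np
import torch

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406])
IMAGENET_STD = np.array([0.229, 0.224, 0.225])

def im_convert(tensor):
    """Turn a normalized (1, 3, H, W) tensor into an (H, W, 3) array in [0, 1]."""
    image = tensor.detach().cpu().clone().squeeze(0).numpy()
    image = image.transpose(1, 2, 0)               # (3, H, W) -> (H, W, 3)
    image = image * IMAGENET_STD + IMAGENET_MEAN   # undo the normalization
    return image.clip(0, 1)                        # keep values displayable
```

The returned array can be passed directly to an image display function such as matplotlib's `imshow`.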
Now for the functions which compute our model's features. Why do we even need them? The feature maps of certain layers within a deep convolutional neural network (CNN) have been shown to capture both the style and the content of the images fed into the model. In their seminal paper, Gatys et al. write:
"When Convolutional Neural Networks are trained on object recognition, they develop a representation of the image that makes object information increasingly explicit along the processing hierarchy. Therefore, along the processing hierarchy of the network, the input image is transformed into representations that increasingly care about the actual content of the image compared to its detailed pixel values."
Therefore, we can refer to features of the later layers in the model as content features. Figure 1 in the above-mentioned paper provides a nice visualization of this.
Calculating the features of individual layers in our model requires the image for which we want to compute the features and the model itself. We'll keep the function general so we can use it to obtain features of the style layers as well as the content layers. The function performs a forward pass through the model, one layer at a time, and stores the feature map responses if the name of the layer matches one of the keys in the predefined layer dict. This dict serves as a mapping from the Pytorch VGG19 implementation's layer indices to the layer names defined in the paper. If no layers are specified, we'll use a complete set of both the content layer and the style layers as a default.
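The feature extraction described above can be sketched like this. The function name `get_features` and the constant `DEFAULT_LAYERS` are assumptions; the indices refer to the `features` block of torchvision's VGG-19, with conv4_2 as the content layer and the remaining entries as style layers.

```python
import torch
import torch.nn as nn

# Indices in torchvision's vgg19().features mapped to the paper's layer names.
DEFAULT_LAYERS = {"0": "conv1_1", "5": "conv2_1", "10": "conv3_1",
                  "19": "conv4_1", "21": "conv4_2", "28": "conv5_1"}

def get_features(image, model, layers=None):
    """Run image through model layer by layer and collect selected feature maps."""
    layers = DEFAULT_LAYERS if layers is None else layers
    features = {}
    x = image
    for name, layer in model.named_children():   # names are "0", "1", ... for nn.Sequential
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features
```

Here `model` is expected to be an `nn.Sequential`, such as the convolutional part of VGG-19, so that the child names are the layer indices.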
With that in place we can turn our focus to the representation of style in the layers of our model. As it turns out, style representations can be obtained by measuring the correlation between different feature map responses of a given layer, which boils down to computing the Gram matrix of the vectorised feature maps. For this we take a feature map tensor as input, flatten its spatial extent (height and width) so that each channel becomes one vector, and then compute the inner products between all pairs of channel vectors, i.e. we multiply the reshaped tensor by its own transpose.
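A small sketch of this computation, assuming a batch size of one and the hypothetical name `gram_matrix`:

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a (1, C, H, W) feature map: a (C, C) matrix of channel inner products."""
    _, c, h, w = tensor.size()
    flat = tensor.reshape(c, h * w)   # one row per channel
    return flat @ flat.t()
```

For a feature map with two channels whose flattened values are [0, 1, 2, 3] and [4, 5, 6, 7], the result is [[14, 38], [38, 126]].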
Now we have all the ingredients in place for the actual style transfer process.
For that we now load the VGG model with its pre-trained weights and set the requires_grad_ flag to False for all parameters (weights), so that no gradients are computed for the model's weights. First we download the weights.
OrderedDict([('features.0.weight', tensor([[[[-5.3474e-02, -4.9257e-02, -6.7942e-02],
[ 1.5314e-02, 4.5068e-02, 2.1444e-03],
[ 3.6226e-02, 1.9999e-02, 1.9864e-02]],
[[ 1.7015e-02, 5.5403e-02, -6.2293e-03],
... 2.4562e-03, -5.5497e-02,
-3.7866e-02, -5.5367e-02, -6.2990e-02, -2.2284e-03, -2.9548e-02,
-2.1679e-02, -5.1211e-02, -5.7297e-02, -5.1590e-02, -1.6711e-02,
-2.0071e-02, -5.8494e-02, 2.6189e-02, -2.8736e-02, 5.8095e-05]))])
As mentioned above, we don't want to compute gradients with respect to our model's weights, but only with respect to our input image tensor.
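Freezing the weights could be done with a small helper like the one below. The name `freeze` is an assumption, and the helper is shown on a generic module so that the sketch runs without downloading anything.

```python
import torch.nn as nn

def freeze(model):
    """Disable gradient tracking for all parameters and switch to eval mode."""
    for param in model.parameters():
        param.requires_grad_(False)
    return model.eval()

# With torchvision, this would typically be applied to the convolutional part
# of the pre-trained network, e.g.
#   vgg = freeze(models.vgg19(weights=models.VGG19_Weights.DEFAULT).features)
```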
A Little Trick
The authors propose to replace all max-pooling layers in the network with average pooling for better-looking results. So let's go ahead and do exactly that:
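One way to perform the swap is sketched here, assuming the layers live in an `nn.Sequential` (as VGG's `features` block does). The helper name `max_to_avg_pool` is hypothetical.

```python
import torch.nn as nn

def max_to_avg_pool(features):
    """Replace every MaxPool2d in an nn.Sequential by an AvgPool2d of the same geometry."""
    for i, layer in list(enumerate(features)):
        if isinstance(layer, nn.MaxPool2d):
            features[i] = nn.AvgPool2d(kernel_size=layer.kernel_size,
                                       stride=layer.stride,
                                       padding=layer.padding)
    return features
```

Pooling layers have no weights, so the pre-trained parameters of the remaining layers are unaffected by the replacement.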
Now we select the device on which we'll run the image transformation process. If available, we of course want to utilize the power of a GPU.
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(4): AvgPool2d(ker...ures=4096, bias=True)
(3): Linear(in_features=4096, out_features=4096, bias=True)
(6): Linear(in_features=4096, out_features=1000, bias=True)
Let's not forget to store all the relevant tensors on the same device.
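Device selection and placement might look like this. The small convolution and the random tensor are stand-ins for the VGG features and the loaded images.

```python
import torch
import torch.nn as nn

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = nn.Conv2d(3, 8, kernel_size=3, padding=1).to(device)   # stand-in for the model
image = torch.rand(1, 3, 32, 32).to(device)                    # stand-in for an image tensor

output = model(image)   # works because both live on the same device
```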
Loading the Content Image
Now we're ready to load the content image and compute content & style features using the utility functions defined above. Here we'll use this beautiful image taken in Shanghai by Photographer Cagdas Eli as the content image and transfer the style from The Starry Night, which we already loaded earlier. We'll also resize the style image to match the content, so we don't have to bother with dimensions later on.
With the images loaded we can then also start to compute the feature map responses of the layers we specified in the defaults of the function.
As stated above, we also want to compute the Gram matrices for all the style layers. Let's go ahead and build a dictionary with all the style Gram matrices.
Now we can create a third image, which will serve as our starting point for the image transformation process. Here we have to choose: do we start from the original content image or simply from random noise? There's definitely room to play around with different starting points, but for now, let's choose the latter.
Note that we set requires_grad_ to True for this tensor, so we can perform gradient descent updates on the image. Earlier, during the model setup, we set the model parameters' requires_grad_ flag to False, so in total we have a fixed model and a target tensor (image) which we can update. This is the crux of the neural style transfer algorithm.
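A noise-initialized target could be created as sketched below. The `content` tensor is a random stand-in for the loaded content image, and drawing the noise from a standard normal distribution is an assumption.

```python
import torch

content = torch.rand(1, 3, 40, 60)   # stand-in for the loaded content tensor

# Random noise with the same shape (and device) as the content image,
# flagged so that the optimizer may update its pixel values.
target = torch.randn_like(content).requires_grad_(True)
```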
For completeness, below is the code we could use when choosing to start from the original content image. For this we would just have to create a copy of the content image using content.clone() and again set the requires_grad_ flag to True.
Choosing random noise as a starting point will make it easier to juggle the content and style terms of the loss function, which we'll define in the following section.
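For reference, the alternative starting point (a copy of the content image) might be written like this, again with a random stand-in for the content tensor:

```python
import torch

content = torch.rand(1, 3, 40, 60)   # stand-in for the loaded content tensor

# Independent copy of the content image that the optimizer is allowed to modify.
target = content.clone().requires_grad_(True)
```

Because `clone()` copies the data, updates to `target` leave `content` untouched.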
A Loss Function of Artistic Style
We already established that our loss function will rely on feature map responses of various content and style layers. Now let's start to actually implement the full loss function with its content and style term.
In the style term of our loss function, we'll have multiple style layers contributing. It's helpful to have different layers contribute to the style term to different extents. We can achieve this by simple multiplicative weights for each layer. This enables us to tune the style artifacts to our liking. As a tendency, larger weights for earlier layers yield larger artifacts.
Of course we also want weights for the overall strength of the two individual loss terms (content and style). The original paper reports specific ratios of content weight to style weight; we'll go for a different ratio here.
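The weights could be collected as plain Python values, as in the sketch below. All numbers are placeholder assumptions to experiment with, not the article's or the paper's settings. The layer names follow the paper's naming.

```python
# Per-layer contributions to the style term (illustrative values).
style_weights = {
    "conv1_1": 1.0,
    "conv2_1": 0.75,
    "conv3_1": 0.2,
    "conv4_1": 0.2,
    "conv5_1": 0.2,
}

# Overall strength of the two loss terms (illustrative values).
content_weight = 1.0
style_weight = 1e6
```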
Now we have all the weights in place, but what does the loss function actually look like? As it turns out, it's rather simple: In the case of the content loss it's a mean squared-error loss between the two feature map responses of the target image and the content image.
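In code, the content term reduces to a single line. The function name `content_loss` is an assumption, and the inputs are the feature maps of the content layer for the target and the content image.

```python
import torch

def content_loss(target_feat, content_feat):
    """Mean squared error between two feature maps of identical shape."""
    return torch.mean((target_feat - content_feat) ** 2)
```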
The style loss will look pretty similar, just replacing the feature map responses by the Gram matrices and also dividing the mean squared-error loss by the total number of elements in the respective feature map.
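The style term for a single layer might be sketched as follows, assuming a batch size of one. The names `gram_matrix` and `layer_style_loss` are hypothetical; the full style loss would be the sum of this quantity over all style layers.

```python
import torch

def gram_matrix(tensor):
    """Gram matrix of a (1, C, H, W) feature map."""
    _, c, h, w = tensor.shape
    flat = tensor.reshape(c, h * w)
    return flat @ flat.t()

def layer_style_loss(target_feat, style_gram, weight=1.0):
    """Weighted MSE between Gram matrices, divided by the feature map's element count."""
    _, c, h, w = target_feat.shape
    mse = torch.mean((gram_matrix(target_feat) - style_gram) ** 2)
    return weight * mse / (c * h * w)
```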
The Style Transfer Loop
But before we construct the total loss, let's set a few hyperparameters for the style transfer process. First, we need an optimizer. The original paper reports using an L-BFGS optimizer, but we'll stick with the Adam optimizer, which nowadays is the default choice in most deep learning settings. If we were to use L-BFGS instead, we'd swap in torch.optim.LBFGS, which additionally requires a closure that re-evaluates the loss to be passed to its step method.
Now let's define how many iterations the style transfer loop should run for. To track the progress, we'll print out our total loss value, along with its breakdown into content and style losses, at regular intervals.
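These hyperparameters could be set up as in the sketch below. The learning rate, the number of steps and the logging interval are placeholder assumptions, and `target` is a random stand-in for the image being optimized.

```python
import torch

target = torch.rand(1, 3, 8, 8, requires_grad=True)   # stand-in for the target image

# The optimizer receives the image itself as the only "parameter".
optimizer = torch.optim.Adam([target], lr=0.01)

steps = 2000        # number of iterations (illustrative)
show_every = 400    # print the losses every show_every steps (illustrative)
```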
Now we can define the actual loop. For the defined number of iterations we'll compute the content and style losses (remember that multiple layers contribute to the style loss), multiply them by their respective weights and add them up for the total loss. With the total loss we can then perform the backpropagation step and iteratively update the pixel values of our image until finished.
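Put together, the loop could look like the following self-contained sketch. To keep it fast and free of downloads, it uses a tiny frozen CNN and random tensors in place of VGG-19 and the real images. The layer names, weights, learning rate and step count are all illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Tiny frozen CNN standing in for the VGG-19 features.
model = nn.Sequential(
    nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(),
    nn.AvgPool2d(2),
    nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(),
)
for p in model.parameters():
    p.requires_grad_(False)

layers = {"0": "conv1", "3": "conv2"}         # hypothetical layer names
style_weights = {"conv1": 1.0, "conv2": 0.5}  # per-layer style weights
content_layer = "conv2"

def get_features(image, model, layers):
    features, x = {}, image
    for name, layer in model.named_children():
        x = layer(x)
        if name in layers:
            features[layers[name]] = x
    return features

def gram_matrix(t):
    _, c, h, w = t.shape
    flat = t.reshape(c, h * w)
    return flat @ flat.t()

content = torch.rand(1, 3, 16, 16)   # stand-in content image
style = torch.rand(1, 3, 16, 16)     # stand-in style image
content_features = get_features(content, model, layers)
style_grams = {k: gram_matrix(v) for k, v in get_features(style, model, layers).items()}

target = torch.randn_like(content).requires_grad_(True)
optimizer = torch.optim.Adam([target], lr=0.05)
content_weight, style_weight = 1.0, 1.0
steps, show_every = 50, 10
history = []

for step in range(1, steps + 1):
    target_features = get_features(target, model, layers)
    content_loss = torch.mean(
        (target_features[content_layer] - content_features[content_layer]) ** 2)

    style_loss = 0.0
    for name, weight in style_weights.items():
        feat = target_features[name]
        _, c, h, w = feat.shape
        layer_loss = torch.mean((gram_matrix(feat) - style_grams[name]) ** 2)
        style_loss = style_loss + weight * layer_loss / (c * h * w)

    total_loss = content_weight * content_loss + style_weight * style_loss

    optimizer.zero_grad()
    total_loss.backward()   # gradients reach only the target image
    optimizer.step()

    history.append(total_loss.item())
    if step % show_every == 0:
        print(f"step {step}: total={total_loss.item():.4f} "
              f"content={content_loss.item():.4f} style={style_loss.item():.4f}")
```

With the real model, `model` would be the frozen VGG-19 features, `layers` the mapping to the paper's layer names, and `content` and `style` the loaded image tensors.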
Time to look at the result! Let's see how far we got with our choice of number of iterations and visualize the final image we converted back to a numpy array after finishing the style transfer loop.
Quite the modern Starry Night! But maybe the interested reader can find an even better set of parameters, or entirely different content and style images, yielding more beautifully stylized images! Time to let the inner neural artist take over!
As a short summary, this article demonstrated how to mix the style and content of two images using a pre-trained CNN and the gradient descent procedure to directly alter the image pixel values - quite in the same way as Gatys et al. The results demonstrate the power of CNNs in creating artistic imagery.