Dog breed classification from Udacity

This blog post describes a project that classifies dog breeds with several models, including models built with transfer learning. We go through the following sections:

  • Project Overview
  • Problem Statement
  • Metrics
  • Data Exploration
  • Visualization
  • Data Preprocessing steps
  • Implementation
  • Refinement
  • Results
  • Justification
  • Reflection
  • Future Improvements

Project Overview

This project uses Convolutional Neural Networks (CNNs) in a pipeline that processes real-world, user-supplied images. Given an image of a dog, the algorithm estimates the dog's breed. Given an image of a human, it identifies the dog breed that the person most resembles.

Problem Statement

The goal is to build an image classification model that takes an image and outputs the corresponding dog breed, or the most closely resembling breed when the image shows a human.

The model also applies transfer learning with the ResNet50 and Inception V3 models to get better results and save training time.

The model uses the dog and human datasets provided in the Udacity reference material and instructions. A link to the full code is given below.


Metrics

The model is evaluated with accuracy, trained with categorical cross-entropy as the loss, and optimized with RMSprop.

Accuracy is a valid evaluation metric for classification problems whose classes are reasonably balanced. Since this dataset has no severe class imbalance, accuracy is a good choice for comparing models.

The RMSprop optimizer divides each gradient component by a running average of its recent magnitude, which restricts oscillations in the steep ("vertical") direction. We can therefore use a larger learning rate and take larger steps in the flat ("horizontal") direction, converging faster. This makes RMSprop a good choice here.
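To make the RMSprop idea concrete, here is a minimal pure-Python sketch of the update rule on a toy two-parameter loss. This is not the project code: the toy loss, the `rmsprop_minimize` helper and all hyperparameter values are assumptions chosen only for illustration. The second parameter has a gradient 100 times larger than the first, yet both take steps of similar size because each gradient is scaled by its own running average.

```python
import math

def rmsprop_minimize(grad_fn, start, lr=0.01, rho=0.9, eps=1e-8, steps=200):
    """Run RMSprop from `start` and return the final parameter list."""
    params = list(start)
    avg_sq = [0.0] * len(params)  # running average of squared gradients
    for _ in range(steps):
        grads = grad_fn(params)
        for i, g in enumerate(grads):
            avg_sq[i] = rho * avg_sq[i] + (1.0 - rho) * g * g
            # Each step is scaled by the recent gradient magnitude of that parameter.
            params[i] -= lr * g / (math.sqrt(avg_sq[i]) + eps)
    return params

def loss(p):
    # Toy loss (assumption): flat along p[0], steep along p[1].
    return 0.5 * (p[0] ** 2 + 100.0 * p[1] ** 2)

def grad(p):
    return [p[0], 100.0 * p[1]]

start = [1.0, 1.0]
final = rmsprop_minimize(grad, start)
print("initial loss:", loss(start))
print("final loss:  ", loss(final))
```

In Keras the same optimizer is selected simply by passing `optimizer="rmsprop"` when compiling the model.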

Categorical cross-entropy is a loss function designed for multi-class classification tasks, which is why we use it here.
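As a rough illustration of how the loss and the metric relate, the sketch below computes both by hand for two samples and three classes. The helper names and the toy probabilities are hypothetical; Keras computes the same quantities internally when compiled with `loss="categorical_crossentropy"` and `metrics=["accuracy"]`.

```python
import math

def argmax(row):
    """Index of the largest value in a list."""
    return max(range(len(row)), key=row.__getitem__)

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    """Mean cross-entropy between one-hot labels and predicted probabilities."""
    total = 0.0
    for true_row, pred_row in zip(y_true, y_pred):
        # Only the true class contributes, because the label is one-hot.
        total -= sum(t * math.log(max(p, eps)) for t, p in zip(true_row, pred_row))
    return total / len(y_true)

def accuracy(y_true, y_pred):
    """Fraction of samples whose most probable class matches the label."""
    hits = sum(argmax(t) == argmax(p) for t, p in zip(y_true, y_pred))
    return hits / len(y_true)

# Toy data (assumption): first sample classified correctly, second one not.
y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = [[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]]

print("loss:    ", categorical_cross_entropy(y_true, y_pred))
print("accuracy:", accuracy(y_true, y_pred))
```

Accuracy only counts whether the top prediction is right, while cross-entropy also penalizes low confidence in the true class, which is why the latter is used for training and the former for reporting.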

Data Exploration

To get the files we need, we import a dataset of dog images and populate a few variables using the load_files function from the scikit-learn library. These are the variables we are going to use:

  • train_files, valid_files, test_files - numpy arrays containing file paths to images
  • train_targets, valid_targets, test_targets - numpy arrays containing one-hot encoded classification labels
  • dog_names - list of string-valued dog breed names for translating labels

The file counts are as follows.

There are 133 total dog categories.
There are 8351 total dog images.

There are 6680 training dog images.
There are 835 validation dog images.
There are 836 test dog images.

Counting the images per category gives the following:

The maximum number of dogs per category is 96.0
The minimum number of dogs per category is 33.0
The average number of dogs per category is 62.78947368421053

The maximum, minimum and average number of images per breed are shown above and are also visualized below. Every breed has at least 33 images, the average is about 63, and the maximum is 96, so the classes are reasonably balanced and accuracy is a suitable metric.
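The statistics above can be reproduced with a few lines of Python. The sketch below is an assumption about how such numbers could be computed, not the original notebook code; `per_class_stats` and the toy label list are hypothetical. The last lines only check the arithmetic of the reported totals.

```python
from collections import Counter

def per_class_stats(labels):
    """Return (max, min, mean) number of samples per class."""
    counts = Counter(labels)
    values = list(counts.values())
    return max(values), min(values), sum(values) / len(values)

# Toy labels (assumption): three breeds with 3, 1 and 2 images.
toy_labels = ["beagle", "beagle", "beagle", "poodle", "akita", "akita"]
print(per_class_stats(toy_labels))

# Consistency check on the numbers reported in this post.
total_images = 6680 + 835 + 836        # train + validation + test
print(total_images)                    # 8351
print(total_images / 133)              # average images per breed
```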

Some images are blurred. Because the dataset contains many more sharp images than blurred ones, this should not hurt the model much. Blurred images can even be useful, since they help the model cope with poorly captured inputs, for example in a web app where the webcam quality is low.

Next, we visualize the dog data.


Visualization

To visualize the data, each image is converted to a suitable format with OpenCV and then displayed.

The plot below shows the number of images per breed. The data is neither perfectly balanced nor severely imbalanced, and every breed has more than 30 images.

[Figure: number of images per dog breed]

The following output shows a human image with a bounding box.

[Figure: human image with bounding box]

Data Preprocessing steps

When using TensorFlow as backend, Keras CNNs require a 4D array (which we’ll also refer to as a 4D tensor) as input ( total number of images, rows, columns, channels ).

The path_to_tensor function below takes a string-valued file path to a color image as input and returns a 4D tensor suitable for supplying to a Keras CNN. The function first loads the image and resizes it to a square image. Next, the image is converted to an array, which is then resized to a 4D tensor.

In this case, since we are working with color images, each image has three channels. Likewise, since we are processing a single image (or sample), the returned tensor will always have shape ( 1, 224, 224, 3 ).

The paths_to_tensor function takes a numpy array of string-valued image paths as input and returns a 4D tensor with shape ( num_of_samples, 224, 224, 3 ).
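The two functions described above can be sketched roughly as follows. This is a reconstruction under assumptions, not the original notebook code: it assumes NumPy and a recent TensorFlow 2.x release where `tensorflow.keras.utils.load_img` and `img_to_array` are available, and `image_array_to_tensor` is a hypothetical helper added so the shape logic can be checked without any image files.

```python
import numpy as np

def image_array_to_tensor(img_array):
    """Add a leading batch axis: (224, 224, 3) -> (1, 224, 224, 3)."""
    return np.expand_dims(img_array, axis=0)

def path_to_tensor(img_path):
    """Load one image file and return a (1, 224, 224, 3) tensor."""
    # Imported lazily so the shape helpers work without TensorFlow installed.
    from tensorflow.keras.utils import load_img, img_to_array
    img = load_img(img_path, target_size=(224, 224))  # resize to a square image
    return image_array_to_tensor(img_to_array(img))

def paths_to_tensor(img_paths):
    """Stack single-image tensors into (num_of_samples, 224, 224, 3)."""
    return np.vstack([path_to_tensor(p) for p in img_paths])

# Shape check with synthetic arrays instead of real image files.
fake_images = [np.zeros((224, 224, 3), dtype="float32") for _ in range(4)]
batch = np.vstack([image_array_to_tensor(a) for a in fake_images])
print(batch.shape)  # (4, 224, 224, 3)
```

Stacking along the first axis is what turns several single-image tensors into one batch that a Keras CNN can consume.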


Implementation

The model uses convolutional layers to identify increasingly complex patterns, with the hinted architecture as a base. The number of filters increases with each convolutional layer (16, 32, 64) while the spatial size of the feature maps decreases, which resulted in slightly better accuracy. A max pooling layer follows each convolutional layer to halve the feature map size and make training more efficient. The ReLU activation function is used in four layers, and softmax is used in the last layer.

Layer (type)                   Output Shape             Param #
conv2d_568 (Conv2D)            (None, 224, 224, 16)     208
max_pooling2d_30 (MaxPooling   (None, 112, 112, 16)     0
conv2d_569 (Conv2D)            (None, 112, 112, 32)     2080
max_pooling2d_31 (MaxPooling   (None, 56, 56, 32)       0
conv2d_570 (Conv2D)            (None, 56, 56, 64)       8256
max_pooling2d_32 (MaxPooling   (None, 28, 28, 64)       0
dropout_3 (Dropout)            (None, 28, 28, 64)       0
flatten_4 (Flatten)            (None, 50176)            0
dense_5 (Dense)                (None, 512)              25690624
dropout_4 (Dropout)            (None, 512)              0
dense_6 (Dense)                (None, 133)              68229

Total params: 25,769,397
Trainable params: 25,769,397
Non-trainable params: 0

The model is then compiled with the RMSprop optimizer and the accuracy metric, and trained for 5 epochs.
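A model consistent with the summary above could look like the sketch below. Several details are assumptions: the 2x2 kernels and "same" padding are inferred from the parameter counts and output shapes, the dropout rate is not given in this post and is a placeholder, and the function names are hypothetical. The small helper functions re-derive the parameter counts so the table can be checked without TensorFlow.

```python
def conv2d_params(kernel, in_channels, filters):
    """Weights plus one bias per filter for a square-kernel Conv2D layer."""
    return (kernel * kernel * in_channels + 1) * filters

def dense_params(inputs, units):
    """Weights plus one bias per unit for a Dense layer."""
    return (inputs + 1) * units

def scratch_model_param_count():
    conv = conv2d_params(2, 3, 16) + conv2d_params(2, 16, 32) + conv2d_params(2, 32, 64)
    dense = dense_params(28 * 28 * 64, 512) + dense_params(512, 133)
    return conv + dense

def build_scratch_model(dropout_rate=0.3):  # dropout rate is an assumption
    # Imported lazily so the arithmetic above runs without TensorFlow.
    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import Conv2D, MaxPooling2D, Dropout, Flatten, Dense

    model = Sequential([
        Input(shape=(224, 224, 3)),
        Conv2D(16, kernel_size=2, padding="same", activation="relu"),
        MaxPooling2D(pool_size=2),
        Conv2D(32, kernel_size=2, padding="same", activation="relu"),
        MaxPooling2D(pool_size=2),
        Conv2D(64, kernel_size=2, padding="same", activation="relu"),
        MaxPooling2D(pool_size=2),
        Dropout(dropout_rate),
        Flatten(),
        Dense(512, activation="relu"),
        Dropout(dropout_rate),
        Dense(133, activation="softmax"),  # one output per dog breed
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

print(scratch_model_param_count())  # 25769397, matching "Total params"
```

Almost all of the parameters sit in the first dense layer, because it connects all 50,176 flattened features to 512 units.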

Implemented this way, the model reached only 8.25% accuracy.

The results are poor because the architecture is so simple, so we turn to transfer learning, which lets us use a better architecture.

Next come the functions that classify the breed of a dog using either the VGG or the Inception model, whichever we choose. Building these models is discussed in the following step.
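The overall decision logic can be sketched as below. The original functions are not reproduced in this post, so everything here is an assumption: `classify_image` and its three callable arguments (`dog_detector`, `face_detector`, `predict_breed`) are hypothetical names, and the message wording is only illustrative. Passing the detectors and the predictor in as arguments makes it easy to swap the VGG model for the Inception model.

```python
def classify_image(img_path, dog_detector, face_detector, predict_breed):
    """Return a message describing the breed, or resembling breed, for an image.

    dog_detector and face_detector take an image path and return a bool;
    predict_breed takes an image path and returns a breed name.
    """
    if dog_detector(img_path):
        return "This is a dog and its breed is {}.".format(predict_breed(img_path))
    if face_detector(img_path):
        return "This is a human and resembles the breed {}.".format(predict_breed(img_path))
    return "Neither a dog nor a human face was detected."

# Usage with stand-in functions (assumptions) in place of the real detectors and model.
message = classify_image(
    "some_photo.jpg",
    dog_detector=lambda path: False,
    face_detector=lambda path: True,
    predict_breed=lambda path: "Icelandic_sheepdog",
)
print(message)
```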


Refinement

To reduce training time without sacrificing accuracy, we train a CNN using transfer learning.

The model uses the pre-trained VGG-16 model as a fixed feature extractor: the output of the last convolutional block of VGG-16 is fed as input to our model. We add only a global average pooling layer and a fully connected layer, where the latter contains one node for each dog category and uses a softmax activation.
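A head of this kind might be defined as in the sketch below. It is an illustration under assumptions, not the original code: the function names are hypothetical, and the default feature shape (7, 7, 512) assumes VGG-16 convolutional features computed from 224x224 inputs. Since global average pooling has no weights, the only trainable parameters are those of the final dense layer.

```python
def head_param_count(channels, num_classes=133):
    """Trainable parameters of the head: dense weights plus biases."""
    return (channels + 1) * num_classes

def build_transfer_head(feature_shape=(7, 7, 512), num_classes=133):
    """Small classifier trained on top of pre-computed CNN features."""
    # Imported lazily so the arithmetic above runs without TensorFlow.
    from tensorflow.keras import Input
    from tensorflow.keras.models import Sequential
    from tensorflow.keras.layers import GlobalAveragePooling2D, Dense

    model = Sequential([
        Input(shape=feature_shape),
        GlobalAveragePooling2D(),                  # (7, 7, 512) -> (512,)
        Dense(num_classes, activation="softmax"),  # one node per dog category
    ])
    model.compile(optimizer="rmsprop",
                  loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

print(head_param_count(512))   # head on VGG-16 features (512 channels)
print(head_param_count(2048))  # head on Inception V3 features (2048 channels)
```

Compared with the roughly 25.8 million parameters of the model trained from scratch, this head is tiny, which is why training it is fast.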

The model is then compiled and trained. It reached 43.66% accuracy, which is much better than the previous model without transfer learning.

The same can be done with the Inception V3 model, which might give better results than VGG-16.

The Inception model reached 76% accuracy, which is better than both previous models.


Results

To test the model, we use the breed classification function created above.

It produces the following output:

[Figure: test photo]
Photo by Edgar on Unsplash

This is human and resembles breed of ages/train/084.Icelandic_sheepdog.


Justification

The basic architecture reached only about 8% accuracy, the VGG-16 model reached about 43%, and the Inception model reached 76%. VGG might also reach higher accuracy with more convolutional layers and more filters, at the cost of longer training time.

VGG16 is a good neural network architecture, but I think it may not perform as well on complex problems because it is a plain stack of convolutional and max-pooling layers followed by fully connected layers, which limits the complexity of the features it can extract. Inception networks, on the other hand, are built from inception modules that combine 1×1 filters (also known as pointwise convolutions) with convolutions of several filter sizes applied in parallel. This lets them learn more complex features. They are also deeper than VGG16, so they are often used for more complex problems.

That is why the Inception model gave better results here than the other two.


Reflection

First, we took the image data of dogs and humans and pre-processed it to a fixed size. We then built a simple Keras model with a few convolutional layers and filters, which gave low accuracy.

Next, we used our data with the VGG model through transfer learning, which gave higher accuracy.

Finally, we used the Inception model, which increased accuracy further. What I find most interesting is the concept behind the Inception model: it is deep, challenging and exciting to work with, and it makes image classification much easier.

Future Improvements

I think we could improve performance further by adding more layers on top of the Inception architecture.

Another important aspect is the filters used in the models. Increasing and tuning the number of filters, in combination with pooling layers, might give higher accuracy and better breed classification, which is the goal of this project.

The code for this blog is here.
