{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "tTbiC0_eruI1"
   },
   "source": [
    "# Pre-lab description\n",
    "In this pre-lab we will learn how to implement gradient descent for finding local minima of a given cost function. This will provide us with a basic tool for many learning and classification problems since, at the end of the day, finding a classifier amounts to solving some optimization problem. In this lab we will also learn how gradient descent\n",
    "can be implemented using [PyTorch](https://pytorch.org/tutorials/), a scientific library for developing machine (deep) learning methods. Towards that goal, we will be learning a [linear classifier](https://en.wikipedia.org/wiki/Linear_classifier) on the [MNIST](https://en.wikipedia.org/wiki/MNIST_database) digit [dataset](http://yann.lecun.com/exdb/mnist/). As a loss function, we will be using a simple quadratic function. You will first apply your calculus skills to the problem, and analytically solve it. Then you will learn how to solve the same problem by implementing the gradient descent method and applying it to the cost function. Finally, you will learn the basics of PyTorch by using the built-in functions to train the classifier. This pre-lab assignment needs to be solved in this Notebook.\n",
    "\n",
    "## The data\n",
    "The MNIST database \\[[1](http://yann.lecun.com/exdb/mnist/)\\]\\[[2](https://en.wikipedia.org/wiki/MNIST_database)\\] consists of $28\\times 28$ grayscale images of handwritten digits, along with the correct label for each image. These are in the form of $28\\times 28$ matrices with the value of each index being an 8-bit integer ranging from 0 (black) to 255 (white), and one integer between 0 and 9, corresponding to the digit displayed in the image. The database is split into two separate training and testing sets.\n",
    "\n",
    "\n",
    "## Problem description\n",
    "We are given a set of $N$ feature-label pairs $\\big\\{\\big(\\boldsymbol x_i,c_i\\big)\\big\\}_{i=0}^{N-1}$ where each $\\boldsymbol x_i\\in\\mathbb{R}^p$ corresponds to a vectorized $28\\times28$ grayscale image of a digit, and $c_i=\\{0,1,\\ldots,9\\}$ is the digit's class. Since we are dealing with a multi-class classification problem we will encode each digit's class with a one-hot embedding vector as:\n",
    "$$\\boldsymbol y_i = [y_{i0},\\ldots, y_{in}],\\quad y_{ij} = \\begin{cases}1 &c_i = j\\\\0&\\textrm{else}\\end{cases}.$$\n",
    "The goal is then to find a prediction function $f:\\mathbb{R}^p\\mapsto\\{0,1\\}^n$ that maps features $\\boldsymbol x_i$ (images) to labels $\\boldsymbol y_i$. In order to do so, we will use a linear prediction function:\n",
    "\n",
    "$$f(\\boldsymbol x) = \\boldsymbol W \\boldsymbol x,\\quad \\boldsymbol W\\in\\mathbb{R}^{p\\times n},$$\n",
    "where the $j$th row of $\\boldsymbol W$ represents a predictor for the $j$th class. In order to decide upon the estimated class we take the strongest response of our set of predictors, that is:\n",
    "$$\\widehat{c}_i = \\arg\\max_{j}\\,\\boldsymbol W\\boldsymbol x_i.$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Problem formulation (optimization problem)\n",
    "With all previous considerations in mind we can now define the optimization problem to estimate the parameters $\\boldsymbol W$ of our linear predictor. In order to do that, we need to define some loss function on our predictions that penalizes deviations from the true target. For this problem, we will be using a simple quadratic loss function $L\\big(f(\\boldsymbol x),\\boldsymbol y\\big) = \\lVert \\boldsymbol y - f(\\boldsymbol x)\\rVert_2^2$. The goal is then to find the parameters $\\boldsymbol W$ of our linear predictor function $f(\\cdot)$ that minimize the average loss over the set of samples:\n",
    "$$\\min_{\\boldsymbol W}\\; \\frac{1}{N}\\sum_{i=0}^{N-1}\\lVert \\boldsymbol y_i - \\boldsymbol W\\boldsymbol x_i\\rVert_2^2.$$\n",
    "Note that the above optimization problem can be expressed in a compact form as:\n",
    "$$\\min_{\\boldsymbol W}\\; \\frac{1}{N}\\lVert \\boldsymbol Y - \\boldsymbol W\\boldsymbol X\\rVert_F^2,$$\n",
    "where $\\lVert\\cdot\\rVert_F$ is the Frobenius ($\\ell_2$) norm of a matrix, and where the matrices $\\boldsymbol Y=[\\boldsymbol y_0,\\ldots,\\boldsymbol y_{N-1}]$ and $\\boldsymbol X = [\\boldsymbol x_0,\\ldots,\\boldsymbol x_{N-1}]$ consist of stacking the label and feature vector representations, respectively."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Exercise 1.** Given the feature and label matrices $\\boldsymbol X\\in\\mathbb{R}^{p\\times N}$ and $\\boldsymbol Y\\in\\mathbb{R}^{n\\times N}$, find a closed-form solution $\\boldsymbol W^\\star$ for the optimization problem:\n",
    "$$\\min_{\\boldsymbol W}\\; \\frac{1}{N}\\lVert \\boldsymbol Y - \\boldsymbol W\\boldsymbol X\\rVert_F^2.$$\n",
    "You can find the minimizer by setting the derivative of the cost function to zero."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Add derivation here and put the expression of the optimal solution:\n",
    "\n",
    "$$\\boldsymbol W^\\star = \\cdots$$"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "o3tBC7c6QOi4"
   },
   "source": [
    "# Data loading and pre-processing\n",
    "\n",
    "First we need to download the data to Google Drive and pre-process it. The MNIST database is very well known and available to download in PyTorch using a [pre-defined function](https://pytorch.org/docs/stable/torchvision/datasets.html#mnist). Then, we scale and shift the images such that the value associated with each index lies between -1 and +1. Again, we can use pre-defined PyTorch [transform functions](https://pytorch.org/docs/stable/torchvision/transforms.html) to do so. This requires an extra step of casting the image to a `torch.Tensor` object. Tensors are multi-dimensional array objects that PyTorch uses as variables."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "0oL54kYHruI2"
   },
   "source": [
    "As usual, we start our Python code by importing the dependencies, and mounting Google Drive:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 74
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 690,
     "status": "ok",
     "timestamp": 1568614925907,
     "user": {
      "displayName": "Pouria Tohidi",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AAuE7mDXZnza4SI7Mw2FVcQCql-1mx1oJlf3HpytIeLg0Q=s64",
      "userId": "07206253912023649922"
     },
     "user_tz": 240
    },
    "id": "RXhX_Wh9ruI4",
    "outputId": "e33784a9-5a52-4199-9184-96f42b20508f"
   },
   "outputs": [],
   "source": [
    "# import modules here\n",
    "import torch\n",
    "import torchvision\n",
    "import torchvision.transforms as transforms\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "\n",
    "# inline plots for matplotlib\n",
    "%matplotlib inline\n",
    "\n",
    "# mount GDrive\n",
    "from google.colab import drive\n",
    "drive.mount('gdrive/')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "mW4jxoUSTS8N"
   },
   "source": [
    "Next, we use the `torchvision.datasets.MNIST` command to load the database. For this purpose, make a folder named `prelab-02` inside a `bmdslab` directory in your Google Drive. As the database is very large and contains more images than we require, we use the PyTorch [dataloader function](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) to load a certain number of images, along with their labels, as local varables: "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {},
    "colab_type": "code",
    "id": "YZ4kUG9qmREf"
   },
   "outputs": [],
   "source": [
    "# set the seed of PyTorch random number generator for reproducibility\n",
    "torch.manual_seed(0)\n",
    "\n",
    "# define transformation object to be applied to the data, list of transformations through Compose\n",
    "# first convert to tensor\n",
    "# then subtract 0.5 to every entry\n",
    "transform = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5,), (0.5,))])\n",
    "\n",
    "# download the data if not already in the specified directory\n",
    "trainset = torchvision.datasets.MNIST('gdrive/My Drive/bmdslab/prelab-02/', train=True, transform=transform, download=True)\n",
    "testset  = torchvision.datasets.MNIST('gdrive/My Drive/bmdslab/prelab-02/', train=False, transform=transform, download=True)\n",
    "\n",
    "# specify the number of points to be extracted at every iteration\n",
    "trainloader = torch.utils.data.DataLoader(trainset, batch_size=10000,shuffle=False) # 10,000 images from the training set\n",
    "testloader = torch.utils.data.DataLoader(testset, batch_size=1000,shuffle=False)    #  1,000 images from the testing set\n",
    "\n",
    "# create an iterator to return the data from the data loader\n",
    "dataiter = iter(trainloader)\n",
    "# loading the training data into the images and labels variables\n",
    "images, labels = dataiter.next()\n",
    "\n",
    "# similarly for the test data\n",
    "test_dataiter = iter(testloader)\n",
    "test_images, test_labels = test_dataiter.next()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 2908,
     "status": "ok",
     "timestamp": 1568614928136,
     "user": {
      "displayName": "Pouria Tohidi",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AAuE7mDXZnza4SI7Mw2FVcQCql-1mx1oJlf3HpytIeLg0Q=s64",
      "userId": "07206253912023649922"
     },
     "user_tz": 240
    },
    "id": "yLAlH2ZJklzK",
    "outputId": "5ad9685c-7050-4bee-a07c-1ca3bf7022d5"
   },
   "outputs": [],
   "source": [
    "# check the sizes to make sure you have done everything right:\n",
    "print(images.shape) # you should get [10000,1,28,28]\n",
    "print(labels.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Now your data is in the `images` variable. Let's just display an example."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 309
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 3042,
     "status": "ok",
     "timestamp": 1568614928278,
     "user": {
      "displayName": "Pouria Tohidi",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AAuE7mDXZnza4SI7Mw2FVcQCql-1mx1oJlf3HpytIeLg0Q=s64",
      "userId": "07206253912023649922"
     },
     "user_tz": 240
    },
    "id": "xkWWjQavl-sr",
    "outputId": "7626d68b-58ab-4e0b-f6c2-692027193571"
   },
   "outputs": [],
   "source": [
    "# choose an index\n",
    "num_image = 5\n",
    "\n",
    "# for displaying an image we need to convert the tensor to a numpy array\n",
    "# the squeeze() function removes all redundant dimensions of the array i.e., \n",
    "# images[num_array] is a 1x1x28x28 array\n",
    "plt.imshow(images[num_image].numpy().squeeze(), cmap='gray');\n",
    "\n",
    "# maybe investigate the values:\n",
    "print(images[1,0,22,10].squeeze())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(images[1,0,0:5,0:5].squeeze())"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "qn8imXYvX42f"
   },
   "source": [
    "**Exercise 2.** Using the data provided for training and the expression for the optimal predictor's weights derived in the previous exercise compute the optimal predictor over the training data. Apply also your predictor to the testing set. Report classification accuracy over both training and testing sets."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "For a set of ground-truth labels $c_i$ and their corresponding estimates $\\widehat{c}_i$, the __accuracy__ is defined as:\n",
    "\n",
    "$$Acc = \\frac{1}{N}\\sum_{i=0}^{N-1}\\mathbb{1}\\big(c_i == \\widehat{c}_i\\big),\\quad \\mathbb{1}\\big(z\\big) = \\begin{cases}1&z\\; \\textrm{is true}\\\\0 &\\textrm{else}\\end{cases}.$$"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 92
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 3238,
     "status": "ok",
     "timestamp": 1568614928482,
     "user": {
      "displayName": "Pouria Tohidi",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AAuE7mDXZnza4SI7Mw2FVcQCql-1mx1oJlf3HpytIeLg0Q=s64",
      "userId": "07206253912023649922"
     },
     "user_tz": 240
    },
    "id": "IceYhRVam_LO",
    "outputId": "ff9fb2ff-d530-440f-da7b-80de36b612b8"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "Z2HEbvXhbVne"
   },
   "source": [
    "## Optimization via gradient descent\n",
    "The quadratic optimization problem considered in this pre-lab has a closed-form solution. In many practical cases however, a closed-form solution does not exist and the solution needs to be computed via an iterative method. Suppose we are given a smooth cost function $C(\\boldsymbol x)$ that we want to minimize over $\\boldsymbol x$. A very simple method to find a local minimizer of the function is to use a _gradient descent_ method. The main idea is to compute the gradient (derivative) of the cost function at a given point and move towards the direction opposite to the gradient of the function (recall that the gradient of a function gives you the direction of maximum variation of the signal). This process is repeated until convergence to a critical point of the cost function to be minimized. The general procedure for the gradient descent method at every iteration is given by the following update rule:\n",
    "$$\n",
    " \t\\boldsymbol x^{(k+1)} = \\boldsymbol x^{(k)} - \\mu \\nabla_{\\boldsymbol x} C\\big(\\boldsymbol x^{(k)}\\big),\n",
    "$$\n",
    "where $\\boldsymbol x^{(k)}$ denotes the estimate at $k$th iteration, $ \\nabla_{\\boldsymbol x} C\\big(\\cdot\\big)$ is the gradient of $C\\big(\\cdot\\big)$ with respect to $\\boldsymbol x$, and $\\mu$ is the step-size for the gradient descent updates."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "a52ffMcrYu1G"
   },
   "source": [
    "**Exercise 3.** _(Optional)_ Write down the equation for the gradient descent of the considered problem. Starting from an initial weight matrix of all zeros $\\boldsymbol W^{(0)}=\\boldsymbol 0$ implement a gradient descent optimization algorithm to find the optimal solution to our classification problem. Run the method for a sufficiently large number of iterations or until you meet some convergence criterion (_e.g.,_ relative change of the cost function smaller than some threshold). On two separate plots, display the evolution of the cost function over the iterations and the difference between your current estimate and the optimal estimate $\\lVert \\boldsymbol W^\\star - \\boldsymbol W^{(k)} \\rVert_F^2$. Since the considered cost function has a unique and global minimizer your iterates should converge to the optimal solution obtained from the analytical expression."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "hDnK4upde4F2"
   },
   "source": [
    "*Write the expression for the gradient update in this cell*\n",
    "\n",
    "$$ \\boldsymbol W^{(k)} = \\cdots$$ "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "colab": {
     "base_uri": "https://localhost:8080/",
     "height": 74
    },
    "colab_type": "code",
    "executionInfo": {
     "elapsed": 4426,
     "status": "ok",
     "timestamp": 1568614929683,
     "user": {
      "displayName": "Pouria Tohidi",
      "photoUrl": "https://lh3.googleusercontent.com/a-/AAuE7mDXZnza4SI7Mw2FVcQCql-1mx1oJlf3HpytIeLg0Q=s64",
      "userId": "07206253912023649922"
     },
     "user_tz": 240
    },
    "id": "Qb4FNIthqdcw",
    "outputId": "62c40f16-774c-40cc-9fb8-7e21b709bc9d"
   },
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "DD3QUdwLKImc"
   },
   "source": [
    "**Exercise 4.** Repeat the exercise before but now using PyTorch taking advantage of its automatic differentiation feature that will compute the gradients for you. Go to the [PyTorch tutorials](https://pytorch.org/tutorials/beginner/pytorch_with_examples.html) page in order to learn how to manipulate tensors with autograd. You can use random initialization for the weights $\\boldsymbol W^{(0)}$. Why initialization does not matter in this problem?"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "name": "prelab_2.ipynb",
   "provenance": [
    {
     "file_id": "1Isp5iA7eYiGKf8bW5Yt1Twvj1yh-F-8S",
     "timestamp": 1568574907710
    }
   ],
   "toc_visible": true,
   "version": "0.3.2"
  },
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 1
}