Getting Emotional with Deep Learning

Machine learning Inception – convolutional neural networks, genetic algorithms and emotion recognition…


Essentially, this was a project exploring the use of machine learning (genetic algorithms) to optimize the parameters of deep learning models that themselves learn to recognize emotions!

Current AI systems such as Amazon Alexa are able to respond to human feedback and can be refined based on it. They can also learn patterns in users' routines and predict needs from them. However, what these cutting-edge AI systems still lack is the ability to refine their behaviour based on emotional cues.

The goal of intelligent computers is arguably to serve mankind. However, this goal is hampered until AI systems can adapt effectively to the emotions of their users. One cannot have a meaningful relationship with an AI system unless it shows empathy, and a large part of empathy comes from recognizing the emotions of others.

Emotion recognition is implicit in human cognition: one does not need to be taught to recognize specific emotions.

However, it is extremely complicated. Emotions can be conveyed in the tone of the voice, in cues in motor activity, and in facial expressions. For facial expressions, the face needs to be perceived by the visual system, and this image then needs to be translated into a representation of electrical signals in the brain through the activation of different patterns of neurons [1].

The issue with designing computational systems for emotion recognition is that emotions are difficult to formalize in terms of rigid symbolic manipulation, as sequences of 1s and 0s. A question that needs to be answered when constructing such a system is therefore how to define a scale for emotions. Emotions convey information about the state of a being, but they are not merely binary, positive or negative: there is a whole range of them.

There are two schools of thought with regards to the classification of emotions: firstly, that each emotion is discrete and physiologically distinct [2]; secondly, that emotions are not discrete and so can be described in terms of common dimensions between the emotions [3].

In 1971, the research of Paul Ekman [4] led him to propose that there are six basic emotions which are universal and have an innate biological basis: happiness, sadness, fear, anger, disgust, and surprise.

Today, one of the most popular models for the classification of emotions is that of Robert Plutchik [5]. This model represents the basic emotions as a wheel, grouped into opposing positive and negative pairs: joy opposing sadness, fear opposing anger, and trust opposing disgust. In this model, complex emotions are formed by modifying the basic emotions, either by combining them or through cultural conditioning.

In 2011, Lövheim [6] proposed a Cube of Emotion which has a physiological basis. This model asserts that there are eight basic emotions and these result from different levels of dopamine, serotonin, and noradrenaline. For example, low serotonin, high dopamine, and high noradrenaline result in anger.

Although there are different models of the basic emotions, it is widely accepted that the basic emotions include at least anger, disgust, fear, happiness, sadness, and surprise. Each of these emotions has characteristic facial features, and this work will look at training a deep CNN to learn the facial characteristics that define these emotions, as well as contempt, as there is evidence that the facial expression for contempt is universally recognized [7].

Emotion Recognition and Deep Learning

Whilst emotions can be conveyed by many means, such as voice tone, body language, and even measurable brain activity [8], a simpler method is to analyze facial expressions.

Studies have found that humans predominantly use shape for emotion recognition, and in particular the distances between key facial landmarks such as the mouth and eyebrows [9]. For instance, when the mouth is open and the eyebrows are raised someone looks surprised, and when the corners of the mouth are raised someone appears happy.

The ability for AI to detect emotions would be invaluable in many different fields such as marketing, medicine and entertainment [10]. However, this is problematic for two reasons. Firstly, creating a model to characterize the facial expressions of emotions is limited by access to labeled training images, which are not abundant in the public domain. Secondly, lighting, perspective, and facial expressions vary dynamically, so to categorize emotions effectively a system needs to learn how to account for these changes.

Deep CNNs have been found to be particularly successful for facial expression categorization of images and are relatively invariant to changes in lighting [11]. However, arguably the most common criticism of deep learning is that it is used as a black box [12].

The problem lies in the very large number of non-linear transformations in a deep neural network: the high dimensionality of the intermediate layers makes them particularly difficult to interpret, so the network effectively functions as a black box. This makes it challenging to know whether CNNs are learning the same features that humans use for categorizing emotions, or whether they are also using features that humans cannot perceive.

For example, one study found that a neural network was able to identify criminals and non-criminals from face images with an accuracy of 89.5% [13]. This points to CNNs being able to pick up on subtle features which humans cannot, as it seems unlikely that a person could tell whether someone is a criminal from a photograph with anywhere near this accuracy. However, CNNs are modeled on the visual system, and if CNNs process visual information in a similar way to a biological visual system, it follows that they would use predominantly the same features as humans to categorize facial expressions of emotion. If CNNs use many of the same features as humans to detect emotions, one would expect that removing certain features would affect the accuracy of a CNN in the same way as it affects humans. Therefore, in this work, I want to test this hypothesis by manipulating the face images to retain just the locations of key landmarks, and by cropping out everything in the face image apart from the key facial landmarks.

Furthermore, I will analyze the impacts of image preprocessing and augmentation on classification accuracy. Image preprocessing can correct lighting differences between images. Therefore, as CNNs extract features from images in a hierarchical manner, one would expect a shallow CNN model to achieve better classification accuracy with preprocessed images than with unpreprocessed images.

This work aims to use genetic algorithms to analyze how the optimal parameters of a CNN change with image augmentation when the number of training epochs is limited (an epoch being a single forward and backward pass of all the training examples). One would expect the genetic algorithm to favor shallower CNN models when image augmentation is used, as a recent study found that shallow networks can mimic deep networks with fewer parameters. The majority of parameters in deep neural networks are redundant, and techniques to prevent overfitting of deep neural networks, such as regularisation and dropout, promote this redundancy.

Therefore, a shallow network which is efficient and maximizes the utility of each of its parameters can, in theory, achieve similar results to a deep CNN without the risk of overfitting. The drawback of a shallow network is that it requires more training data to achieve the same accuracy as a deeper CNN [14]. However, augmentation can be used to generate additional training examples, whereas deeper networks have more parameters to learn and so can take longer to converge, with considerable redundancy arising during training. Thus, the hypothesis is that the genetic algorithm will favor shallower networks, which are able to train effectively with less redundancy by exploiting image augmentation [15].

Objectives of this work

Convolutional neural networks (CNNs) have been found to be extremely effective image classifiers.

In face recognition, CNNs have achieved state-of-the-art results with a large decrease in error compared to other learning models [16].

This work aims to train a CNN to successfully recognize emotions of human faces, using the extended Facial Expression Recognition (FER+) dataset [17]. The effects of preprocessing and augmentation on classification accuracy were tested, and the trained CNN was then applied to live video data. This study also demonstrates that a genetic algorithm can be used to pick the optimal hyperparameters of a neural network (parameters of the CNN which are set prior to training), and that as the number of training epochs decreases (an epoch being one pass of the complete set of training images through the network), smaller networks are favoured by the genetic algorithm. This is advantageous in that a network with fewer units is less computationally expensive to train and, with fewer weights to learn, has a reduced chance of overfitting. Whilst deep networks typically have higher accuracy after convergence, a relatively small decrease in classification accuracy is a potentially beneficial trade-off for a neural network which has less redundancy and so is more computationally efficient for image classification.

The study then demonstrates that as different features of the face images are extracted, the optimal topology of the network changes, with smaller networks favoured for fewer face features. This supports the hypothesis that the optimal neural network size depends on the input feature space. Furthermore, using only the locations of the key facial features results in just a small decrease in accuracy, which points to these locations being a large component of what the neural network learns when classifying emotions. To examine whether the locations of key features are sufficient, or whether the information within each feature (the pixels making up the feature) is also of significant importance in classifying emotions, masks were created to crop out every part of the image not located at a key feature coordinate.

Furthermore, the CNN with the optimal topology and hyperparameters found by the genetic algorithm on the face images was then trained for more epochs until convergence and its classification accuracy evaluated. This study also compares the time taken for this CNN to classify the emotion of each image against that of the previous CNN configuration, which was adapted from Barsoum et al. [17].

This work has been limited by resources and time constraints. The genetic algorithms are extremely computationally intensive: each run involves training hundreds of CNNs and takes over 48 hours of training time on an Nvidia GTX 1060 GPU. Also, alternative models to CNNs were not researched, and future work could investigate other deep learning methods, such as deep support vector machines (SVMs), random forests and deep belief networks.

The primary and secondary objectives are defined below.

Primary Objectives

  • Implementation of a deep CNN for recognizing emotions of face images.
  • Compare the effects of image augmentation and preprocessing on classification performance.
  • Prototype an emotion recognition system which uses live video with a webcam.

Secondary Objectives

  • Successfully use genetic algorithms to optimize the hyperparameters of deep CNNs for emotion recognition.
  • Investigate how image augmentation and preprocessing impact the optimal CNN hyperparameters of the genetic algorithm.
  • Investigate how extracting features of the face images affects the classification accuracy of the CNN models, and how different feature spaces impact the optimal CNN hyperparameters of the genetic algorithm.
  • Investigate whether the genetic algorithm can be used to optimize the CNN architecture for live video.

Primary Achievements

The primary objectives have been achieved through this research. These achievements are described below:

  • A deep CNN was successfully used to categorize facial emotions using the FER+ dataset [17].
  • The CNN was used to evaluate the effects of image augmentation and preprocessing for training the deep CNN model.
  • The CNN was used with live video data from a laptop for emotion recognition.

Secondary Achievements

All secondary objectives have also been achieved through this research:

  • This work compares how training CNNs with different facial features affects the accuracy with which the CNN categorizes emotions. It was found that the locations of key features alone were sufficient to categorize the emotions successfully in the majority of cases; however, cropping the image to retain only the key features resulted in a higher validation accuracy, which was only 3% lower than using the full faces.
  • A genetic algorithm was successfully used to optimize the hyperparameters of a CNN. An adaptive genetic algorithm, with adaptive crossover and mutation rates, was compared with a non-adaptive algorithm.
  • The adaptive genetic algorithm was then used to investigate the impact of augmentation and preprocessing on the optimal hyperparameters of deep CNNs. Furthermore, different manipulations of the face images were compared to evaluate whether the locations of key facial features are adequate for emotion recognition and how the optimal CNN hyperparameters change with different feature spaces (different types of training data).
  • The genetic algorithms were found to favor shallower networks which converge quickly, as they were trained over a limited number of epochs. The CNN with the optimal hyperparameters found by the genetic algorithm was then trained over a hundred epochs and the weights saved after the epoch with the greatest validation accuracy. This model was found to be more efficient at emotion recognition than the model adapted from Barsoum et al. [17].

Context Survey

There are many different types of computational models for facial expression recognition. These include recurrent neural networks, CNNs, combining deep convolutional features with hand-crafted features, analyzing relations among expression-specific facial features using expression-specific Action Units, and using statistical local features boosted as local binary patterns [18].

Some of the most successful methods for human facial emotion recognition have been deep CNNs. In 2015, the Emotions in the Wild competition was held for such systems, and a CNN developed by Kim et al. won the contest with an accuracy of 61.6% [19].

The Perceptron Model

The perceptron was developed in 1958 by Rosenblatt, as a model inspired by neuroscience [20]. The perceptron receives multiple binary inputs (representing neural dendrites) and computes a weighted sum of them to produce a single binary output. The output of the neuron is 1 if the weighted sum of inputs is greater than the neuron's firing threshold value and 0 if it is less than the threshold. The decision making can thus be altered just by changing the threshold value and the weights.
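As a rough sketch (the weights and threshold below are arbitrary, chosen only for illustration), a single perceptron can be expressed in a few lines of Python:

```python
import numpy as np

def perceptron(inputs, weights, threshold):
    """Return 1 if the weighted sum of the binary inputs exceeds the
    firing threshold, otherwise return 0."""
    weighted_sum = np.dot(inputs, weights)
    return 1 if weighted_sum > threshold else 0

# Two binary inputs with hand-picked weights and a threshold of 0.5.
print(perceptron(np.array([1, 0]), np.array([0.6, 0.4]), threshold=0.5))  # -> 1
print(perceptron(np.array([0, 1]), np.array([0.6, 0.4]), threshold=0.5))  # -> 0
```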

Neural Networks

Neural networks are composed of multiple perceptrons. Networks of perceptrons are able to compute logical functions such as AND, OR and NAND, and because NAND gates are universal for Boolean logic, networks of perceptrons can compute any logical function [21]. Typically, the neurons are arranged in layers, with an input layer, an output layer and one or more hidden layers which transform the input.
Multilayer perceptrons (MLPs) have also been shown to be universal approximators [22]: they are able to approximate any continuous function. The layers are typically fully connected, meaning there is a connection between every neuron of adjacent layers but no connections between neurons of the same layer.

In order to approximate functions of different complexity, the number of units in each layer and the number of layers can be modified. However, this is challenging: if the network is too large then it will model the underlying noise in the data and will not generalize well (overfitting), whereas if the network is too small then it will not be adequate to model the underlying data well.

Activation Functions

Hidden units and output units in neural networks typically have activation functions. There are many different non-linear activation functions, and different ones may be advantageous for different problem domains. One of the most common is the sigmoid function, a non-linear curve which mimics the firing of a biological neuron: it takes any real-valued input and transforms it into the range 0 to 1. Alternatively, a hyperbolic tangent can be used as a zero-centered activation function. More recently, Rectified Linear Unit (ReLU) activation functions have often been found to be advantageous over sigmoid functions, as they improve training speed and do not suffer from the vanishing gradient problem.

Unlike binary units, ReLUs preserve information about the relative intensity of the inputs as information passes through many layers of a neural network [23]. This is why they have been found to be particularly effective in convolutional neural networks, as they can represent translated feature activity patterns with translation invariance; given ReLUs with zero biases and no noise, the translated feature representations vary in the same way as the image.
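A minimal sketch of the three activation functions mentioned above, written with NumPy:

```python
import numpy as np

def sigmoid(x):
    # Squashes any real value into the range (0, 1).
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Zero-centred alternative, output in the range (-1, 1).
    return np.tanh(x)

def relu(x):
    # Rectified Linear Unit: passes positive values through, zeros out the rest.
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(x))
print(tanh(x))
print(relu(x))
```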

Loss Functions

Neural networks are conventionally initialized with randomly generated weights. The input data is fed into the neural network in batches, and a cost function is used to guide how the weights are modified in an attempt to find a function which maps the set of inputs to their correct outputs. The loss function is simply a function which calculates the difference between the actual and the expected output of the network. This difference is then used to modify the weights in the direction that reduces the loss (gradient descent). There are many different loss functions, as a suitable loss function depends on the type of output information; for example, Euclidean distance is an effective means of calculating the distance between two vectors.

The weights of the neural network are updated using the loss function. In gradient descent, the derivative of the cost function with respect to each weight is subtracted from that weight. During training the derivatives get smaller, with the aim of the derivative term of the loss converging towards zero. Each weight update is also scaled by a learning rate which usually decreases during training: initially, when the network has random weights, the loss is likely to be large, but as training progresses the derivative term of the loss decreases, and smaller weight updates are needed to approach the minimum the network is converging towards.
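As a minimal sketch of a single gradient-descent update (using a made-up quadratic loss rather than a real network's cost function):

```python
import numpy as np

def gradient_descent_step(weights, gradients, learning_rate):
    """Subtract the derivative of the loss with respect to each weight,
    scaled by the learning rate."""
    return weights - learning_rate * gradients

# Toy example: loss = (w - 3)^2, so d(loss)/dw = 2 * (w - 3).
w = np.array([0.0])
for step in range(100):
    grad = 2.0 * (w - 3.0)
    w = gradient_descent_step(w, grad, learning_rate=0.1)

print(w)  # converges towards 3.0, where the toy loss is minimised
```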

The Limitations of multilayer perceptron models for image data

The issue with MLPs is that each neuron is fully connected to every neuron in the subsequent layer. This means that if a 48×48×1 image were fed into an MLP model, the input layer would have 2304 neurons (each representing one pixel), and each neuron in the subsequent hidden layer would therefore have 2304 weights. If the hidden layer had as many neurons as the input layer, there would be 2304² (roughly 5.3 million) weights. With successive layers, this quickly becomes infeasible for a large input space. Another issue with MLPs is that they are not able to learn the spatial structure of the data. A neural network is essentially a function which maps an input to a target, and because the layers are fully connected, pixels are treated the same regardless of where they are in the image. This makes MLPs poor at tasks like pattern recognition, as patterns can occur in different parts of the image.

It is trivial to train an MLP model to memorize images if one gives it the same dataset as training and test data: it will simply learn that certain image pixels map to a certain output, which a look-up table could do just as well. The aim of training a model, however, is for it to generalize to patterns of input data that were not present in the training dataset. This is the fundamental goal of machine learning:

“the goal of supervised learning is not to learn an exact representation of the training data itself, but rather to build a statistical model of the process which generates the data.”[24]

In achieving this goal there is a trade-off. On one hand, if a model is too simple it will be too crude to accurately model the information contained within the training data. On the other hand, a model that is too complex will fit the noise and so will be unable to distinguish changes in the input data that are due to changes in the target values from those resulting from external sources such as random noise and anomalies.

This is why MLP models do not scale well to high-resolution images: they suffer from the curse of dimensionality. This is the problem that, as the dimensionality increases, the volume of the space grows at such a rate that the available data becomes sparse, and the amount of data required to obtain reliable results often grows exponentially.

CNNs are better able to overcome this problem, as they can exploit spatial invariance in image classification. For example, if an MLP is trained with images of dogs and does not share weights across space, and it has only seen input images where the dog is in the top right corner, then the neurons which receive inputs from the top right of the image would have to learn the representation of a dog separately from the neurons receiving inputs from the other parts of the image. The network would therefore have to be trained with enough images of dogs that it had seen dogs in all possible locations in the image. A CNN, on the other hand, can extract local features from an image, breaking components of the image into sub-components. The filters which identify patterns in the image can learn a feature from one part of the image and generalize it to other image locations.

Convolutional Neural Networks

Convolutional neural networks are designed based on the visual system of living organisms.

The visual cortex of animals contains neurons which are activated only by the illumination of a specific region of the retina [25]. A receptive field of a visual neuron, therefore, is the two-dimensional region of visual space which the neuron responds to. The receptive fields of neighboring cells are similar and overlap. The size and location of receptive fields vary across the visual cortex.

At successive stages of visual processing, the size of receptive fields increases. Retinal ganglion cells located in the center of the retina, the fovea, have the smallest receptive fields, and cells located in the periphery of the retina have the largest receptive fields – which is why we have poor spatial resolution in the periphery of our vision.

Early work on the visual system in the brain identified two broad categories of cell called simple cells and complex cells, which respond to different inputs within their receptive fields:

  1. Simple cells: respond to oriented edges and gratings and are selectively activated by different orientations and frequencies.
  2. Complex cells: these also respond to oriented edges and gratings. However, they are spatially invariant, meaning that their output is independent of the exact location of the orientation within their receptive field. Some of these cells respond to movement in a specific direction.

Neocognitron

Kunihiko Fukushima developed the neocognitron, a hierarchical and multilayered artificial neural network, in the 1980s [26]. This is an early form of a CNN and has achieved success in tasks such as pattern and handwriting recognition. This model was inspired by the simple cells and complex cells which were found in the visual system.

The architecture of this network comprises multiple stages. There are alternating layers of S-cells and C-cells, and each cell receives input from a certain area of the preceding layer.

S-cells

These are for feature extraction. They are analogous to the simple cells in the primary visual cortex. The input weights of each S-cell are learned so that, after learning, each of these cells responds to a specific feature within its receptive field.

C-Cells

These are analogous to complex cells. The inputs of C-cells are fixed, so that each of these cells receives input from an area of S-cells in the previous layer. Each of these S-cells extracts the same feature but at a slightly different position, so the C-cells make the network resilient to errors in the position of features in the stimuli. A C-cell is activated if at least one of the connected S-cells in the previous layer is activated; the feature can therefore move slightly and the C-cell will still fire, effectively performing a blurring operation.

LeNet-5

LeNet-5 was one of the earliest convolutional neural networks, developed in 1998 by Yann LeCun et al. [27]. It has seven layers and is trained with backpropagation for tasks such as handwriting classification.

The LeNet network is comprised of alternating convolutional layers and sub-sampling layers. The output of the final sub-sampling layer feeds into a fully connected layer.

GPUs for training CNN’s

GPUs are designed to handle a high throughput of parallel tasks. They are significantly faster than CPUs for training CNNs and can reduce the training time of a neural network from days to minutes.

How do Convolutional Neural Networks Learn Features?

In order to classify emotions from facial expressions, a neural network needs to extract the features of the face that convey the expression. Feature extraction is important because the face images can contain many thousands of pixels, making it computationally unfeasible to treat every pixel independently. The dimensionality of the face image is therefore reduced to extract the important features which are used to distinguish between different emotions. A feature can be thought of as a relationship between multiple pixels in the image; these could be edges, corners, blobs, ridges, or a point within a region of interest (ROI).

Feature Extraction

Convolutional neural networks are used for feature extraction as they are more suitable for image data than standard feed-forward networks. They are able to learn features in a hierarchical manner, each layer extracting features from the previous one. A simple convolutional neural network typically consists of a convolution layer, followed by a non-linear activation function and then a pooling layer. The output of these layers is input into fully connected layers.

CNN Layers

There are four typical operations in a convolutional network:

  • Convolutional layer
  • Non Linearity (ReLU)
  • Pooling-Layer
  • Fully Connected Layer

Convolutional Layer

The convolutional layer performs convolution operations. A convolution is a mathematical operation which takes two input functions and produces an output function expressing how much the two functions overlap as one is shifted over the other. For continuous functions the output is an integral; for images it is computed as a sum over a small window of pixels.

In a CNN, the convolution is performed by a set of learned filters (kernels): each filter slides over the input and produces a feature map of its responses at every position. The convolution itself is linear, so a non-linear transformation, typically a ReLU, is applied to the resulting feature map; this is described in the next section.
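As an illustrative sketch, the discrete version of this operation used in CNNs (strictly a cross-correlation, since the kernel is not flipped) slides a small kernel over the image and sums the element-wise products at each position:

```python
import numpy as np

def convolve2d(image, kernel):
    """Naive 'valid' convolution: slide the kernel over the image and
    sum the element-wise products at each position."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    output = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            output[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return output

# A simple vertical edge-detection kernel applied to a toy 5x5 image.
image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)
kernel = np.array([[1, 0, -1],
                   [1, 0, -1],
                   [1, 0, -1]], dtype=float)
print(convolve2d(image, kernel))  # strongest responses along the edge
```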

Non-Linearity (ReLU) Activation

This operation is typically applied after each convolution. The rectified linear unit (ReLU) is a non-linear function applied to each individual element of the output from the previous layer (pixel by pixel), replacing the negative values in the feature map with zeros. Its purpose is to introduce non-linearity into the network: the convolutional layer is linear, but we want the convolutional network to be able to transform the input data in a non-linear fashion. The output of this layer is known as the 'rectified' feature map.

Pooling Layer

Pooling means sub-sampling: it takes multiple dimensions of the feature map and reduces them to a smaller number of dimensions which capture the most important features. This layer bins the output of the preceding layer by applying maximum or average pooling. Maximum pooling has been found to achieve better performance; the max pooling function selects the maximum value from a small window of the image [28]. The window traverses the image and may or may not be allowed to overlap other windows; often a 2×2 window is used without overlap.

This layer reduces the spatial dimensions of the input for two reasons. Firstly, it reduces the number of parameters, which reduces computational cost. Secondly, it controls over-fitting, so the model is better able to generalize to unseen test data.
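A minimal sketch of non-overlapping 2×2 max pooling on a small feature map:

```python
import numpy as np

def max_pool(feature_map, size=2):
    """Non-overlapping max pooling: take the maximum of each
    size x size window of the feature map."""
    h, w = feature_map.shape
    out = np.zeros((h // size, w // size))
    for i in range(0, h - h % size, size):
        for j in range(0, w - w % size, size):
            out[i // size, j // size] = np.max(feature_map[i:i + size, j:j + size])
    return out

fmap = np.array([[1, 3, 2, 0],
                 [4, 6, 1, 1],
                 [0, 2, 8, 5],
                 [1, 1, 3, 7]], dtype=float)
print(max_pool(fmap))  # [[6. 2.] [2. 8.]]
```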

Fully Connected Layer

The aim of this layer is to classify the images into classes which are task-dependent. This layer is an MLP, in which every individual neuron from the previous layer is connected to all neurons in the next layer. There can be more than one fully-connected layer, depending on the task, and the error rate can be used to determine whether more than one is required.

The output of the pooling layer is the input to this layer and represents a high-level feature map. The purpose of this layer is to use these features to classify the input image according to the training data; it can also learn non-linear combinations of these features to aid classification. The outputs of this layer usually sum to 1, which is ensured by using a softmax activation function as the output function: this takes an arbitrary vector and transforms the values to lie between zero and one so that they sum to 1.
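A minimal sketch of the softmax transformation (the max subtraction is a standard trick for numerical stability):

```python
import numpy as np

def softmax(logits):
    """Transform an arbitrary real-valued vector into values in (0, 1)
    that sum to 1."""
    exps = np.exp(logits - np.max(logits))  # shift by max for stability
    return exps / np.sum(exps)

scores = np.array([2.0, 1.0, 0.1])
probs = softmax(scores)
print(probs, probs.sum())  # the probabilities sum to 1.0
```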

CNN Architectures

The optimal architecture of a CNN depends on the problem; reviewing previous literature, many different CNN architectures have been used for emotion classification. The three most popular architectures are GoogLeNet [29], AlexNet [30] and VGG [31], and these were hand-crafted using large quantities of ImageNet data. Whilst these networks are popular because they generalize successfully to a large range of image classification problems, they also have a significant number of redundant parameters for many problems. Studies have found that reducing the number of parameters of these networks by up to 90 percent does not yield a significant decrease in accuracy (REF).

Reducing the number of parameters is beneficial in that it reduces computation, and could increase accuracy and prevent overfitting. The architecture of a CNN can be changed in three ways: the network depth, the connections between layers, and the number of neurons in each layer.
Studies on optimizing the hyperparameters of CNNs have approached this problem in different ways, such as grid search, random search [32], particle swarm optimization [33], genetic algorithms [34] and deep Q-learning [35].

Facial expression recognition using CNN’s

Overview

Recently, there have been many developments in the field of facial expression recognition using computer vision. Many different machine learning techniques have been applied to this problem, such as feature extraction using histograms of gradients [36], AU-aware facial features [37], boosted multi-resolution spatio-temporal descriptors [38], and recurrent neural networks for emotion recognition in video [39].

However, the most successful recent approaches have involved CNNs, with all the top entries in the Emotions in the Wild 2015 contest – which used static face images – using CNNs [40]. Furthermore, in a Kaggle facial expression recognition challenge, a competition to develop the best system for recognizing the emotion expressed in a human face, the top three teams all used convolutional neural networks trained discriminatively with image transformations [41]. The best hand-crafted feature approach, using Scale Invariant Feature Transform (SIFT) and multiple kernel learning [42], came in fourth place. The winner of the competition, Yichuan Tang, used an L2-SVM loss function, replacing the softmax output layer with a linear SVM top layer. This achieved an accuracy of 71.2% on the test data [43].

Yu and Zhang achieved the winning score in Emotions in the Wild 2015 using CNNs [44]. They used an ensemble of CNNs, randomly perturbing the training images with a composition of skew, rotation, and scale to augment the training images. They achieved an accuracy of 61.29% on the Static Facial Expressions in the Wild (SFEW) test set and surpassed the winning model for the Facial Expression Recognition 2013 (FER) dataset with a test accuracy of over 72%.

Each of their networks contained five convolutional layers, three stochastic pooling layers, three fully connected layers and a softmax output layer with a negative log-likelihood loss function. ReLU was used as the activation function for all the fully connected layers, and dropout was applied to the fully connected layers to improve generalization and prevent overfitting. Over-fitting is when a statistical model describes random noise instead of the underlying relationship, and deep networks are prone to this due to their high capacity. Stochastic pooling was chosen instead of max pooling as it has been found to prevent overfitting [45].

Related Work

A facial expression recognition system that runs on smartphones was developed in 2014 by Song et al. [46] using convolutional neural networks. The network has 65,000 neurons spanning five layers and was trained using the CK+ dataset along with three other datasets created by the authors. One problem with using large convolutional neural networks is that they tend to suffer from overfitting, particularly with small training datasets. To overcome this problem, Song et al. increased the amount of training data using data augmentation techniques and used dropout during training. The images were cropped to the regions containing face changes corresponding to expressions. They achieved 99.2% accuracy on the CK+ dataset; however, it is not mentioned whether there was overlap in subjects between the training and test sets.

Another method, by Burkert et al. [47], does not use hand-crafted feature extraction. Instead, the network consists of four components. The raw images are initially pre-processed automatically and the output is fed into the remaining components for automatic feature extraction. Finally, the extracted features are input into a fully connected layer to classify the facial expressions. This network architecture has 15 layers: 7 convolutional layers, 5 pooling layers, 2 concatenation layers and a normalization layer. Using the CK+ dataset they achieved a 99.6% accuracy rate; however, there was an overlap in subjects between the training and test sets.

A method by Lopes et al. [48] proposes a system which performs three learning stages with a CNN. During the training phase, used to learn the network weights, the system is trained with grayscale images of faces along with the respective expression labels and the locations of the eye centers. A separate validation set of images is used to select the optimal network weights, after inputting the image samples to the network in a randomized order. During the test phase, grayscale face images are input along with the locations of the eye centers, and the expression predicted by the network is used. The expressions are coded as integers.

During training, they generate new images to increase the size of the training set. The images are rotated so that the eyes are horizontally aligned, then cropped to remove background information, preserving the features of the image that are specific to the facial expression. The images are then down-sampled so that the features are in the same position in each of the different images. Finally, the intensity of each image is normalized, and these normalized images are used to train the CNN. The same preprocessing is applied at test time.

Research methodologies

Preprocessing and augmentation

Much of the research into emotion recognition using deep learning uses images which have been created in the lab with invariant lighting conditions and acted expressions [49]. In reality, however, lighting conditions are dynamic and unpredictable between different face images in different settings. Convolutional neural networks have been shown to cope with variable lighting conditions without image preprocessing, and in the paper accompanying the FER+ dataset, Barsoum et al. [50] do not discuss preprocessing the images to adjust for different lighting conditions. However, image preprocessing to adjust for lighting has been found to improve emotion recognition [51].

Genetic Algorithms

A genetic algorithm is a metaheuristic inspired by the process of natural selection. Genetic algorithms are often used in search and optimization problems to obtain high-quality solutions by using operators inspired by biology such as mutation, crossover, and selection [52].
Genetic algorithms essentially use two components: an individual genome, which is a genetic encoding of a candidate solution, and a fitness function, used to evaluate the fitness of every individual candidate solution. A typical genetic encoding of a candidate solution is an array of bits [53], the solution represented as an array of binary 1s and 0s.

Typically, the genetic algorithm will start with a population of individuals which each have a random candidate solution. Each individual of the starting population will be evaluated by the fitness function. The individuals with higher fitness will have a greater probability of being selected for the next population, and each of these successive populations is known as a generation. Each individual selected for the next population will typically be modified by recombination and/or random mutation. The next generation is then used for the next iteration of the algorithm.

Fundamentally, this process of selection enables survival of the fittest, preserving the stronger individuals and eliminating the weaker individuals.

A major limitation of using genetic algorithms for deep convolutional neural networks is that they are extremely computationally expensive, as a complete network training process needs to be conducted for every individual.

In 2017, Xie et al. [54] demonstrated the feasibility of using genetic algorithms to obtain efficient modern CNN architectures. In their paper, they state that prior to their work, genetic algorithms had been used to search for efficient neural network architectures, but had not been used to learn the architectures of modern CNNs.

In their paper, they demonstrate the feasibility of using genetic algorithms for this domain and found that the architectures generated by the genetic algorithm often performed better than the manually designed ones. On the ImageNet Large Scale Visual Recognition Challenge 2012 (ILSVRC2012) classification task [55], a benchmark for object classification and detection, their model achieved better performance than VGGNet.

Furthermore, they found that network structures discovered by running the genetic algorithm on a small dataset also transferred well to larger datasets.


Methodology


Overview

A deep CNN will be trained to classify the emotion of a face image. The impacts of image preprocessing and augmentation on emotion recognition accuracy will be evaluated. Each image in the dataset contains a face and so face detection is not required when training the CNN. However, for the live video emotion recognition, it is not assumed that each frame contains a face, and so face detection needs to be implemented. Face detection will be implemented using Haar Feature-based Cascade Classifiers [56], however the aim of this work is to analyse the modelling of emotion recognition using the Fer+ dataset.
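A minimal sketch of this face-detection step for the live-video prototype, using OpenCV's Haar cascade classifier; the cascade file, webcam index and 48×48 crop size below are assumptions for illustration:

```python
import cv2

# Load a pre-trained frontal-face Haar cascade shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
face_cascade = cv2.CascadeClassifier(cascade_path)

capture = cv2.VideoCapture(0)  # default webcam
ret, frame = capture.read()
if ret:
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.3, minNeighbors=5)
    for (x, y, w, h) in faces:
        # Crop and resize each detected face to 48x48 for the emotion CNN.
        face = cv2.resize(gray[y:y + h, x:x + w], (48, 48))
capture.release()
```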

Also, the images are manipulated to analyse how reducing the feature space, such as extracting the locations of key features and cropping out parts of the face (apart from the eyebrows, eyes, mouth, nose, and jawline), affects the emotion recognition accuracy of the deep CNNs. Next, genetic algorithms will be used to examine how changing the feature space affects the optimal hyperparameters of a deep CNN model. Furthermore, the genetic algorithm is used to find a CNN architecture which is optimal for live video data. Whilst deeper CNN architectures typically have higher accuracy, this comes at the cost of computational complexity. Therefore, the genetic algorithm is used to search for optimal hyperparameters under the constraint of a small number of training epochs, with the aim of finding an architecture with a reduced number of parameters and thus better suited to live video data than the models found to be successful in previous studies of emotion recognition.


Training a DCNN for Emotion Recognition


The Dataset

The dataset used to train and test the network is the FER+ dataset, based on the original FER dataset [57] but with additional labels [50]. This dataset consists of 35,887 examples of 48×48 grayscale images of faces. The FER dataset has a label for every image from seven emotional categories (0=Angry, 1=Disgust, 2=Fear, 3=Happy, 4=Sad, 5=Surprise, 6=Neutral). The FER data was prepared by Pierre-Luc Carrier and Aaron Courville [57].
The FER+ annotations are a set of new labels for the FER dataset. Whereas the original dataset was labelled by one crowd-sourced tagger per image, the FER+ dataset was labelled by 10 crowd-sourced taggers for each image. This enables a probability distribution to be estimated for the emotion of each face and allows multi-label outputs to be modelled rather than just a single-label output. The FER+ dataset has labels for eight emotional categories: neutral, happiness, surprise, sadness, anger, disgust, fear and contempt, plus labels for 'unknown' and 'not a face'. The labels were converted by dividing each emotion's count by the total number of labels for that image which were not 'unknown' or 'not a face', and images without any labels in the emotion categories were removed from the dataset.
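A minimal sketch of this label-cleaning step, assuming the ten tagger votes per image are available as per-emotion counts (the key names below are illustrative, not the actual FER+ column names):

```python
import numpy as np

EMOTIONS = ["neutral", "happiness", "surprise", "sadness",
            "anger", "disgust", "fear", "contempt"]

def to_probability_distribution(vote_counts):
    """Convert raw tagger counts into a target probability distribution,
    ignoring 'unknown' and 'not a face' votes. Returns None when no
    emotion votes remain, so the image can be dropped from the dataset."""
    emotion_votes = np.array([vote_counts[e] for e in EMOTIONS], dtype=float)
    total = emotion_votes.sum()
    if total == 0:
        return None
    return emotion_votes / total

votes = {"neutral": 1, "happiness": 7, "surprise": 2, "sadness": 0,
         "anger": 0, "disgust": 0, "fear": 0, "contempt": 0,
         "unknown": 0, "not_a_face": 0}
print(to_probability_distribution(votes))  # mass concentrated on 'happiness'
```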


Hypothesis: Image Preprocessing and Augmentation will improve model performance for emotion recognition

The face images used for training vary in terms of lighting conditions, rotation of the face and image size. It is important that these variations are corrected before using the images to train the network, as they are independent of the facial expressions and such noise will reduce model accuracy. The OpenCV library will be used for many aspects of image preprocessing [58].

The first step in preprocessing the images is rescaling the image pixel values from a range of 0-255 to 0-1. Then, for training models with preprocessing, the features of the dataset are centred so that the mean input is set at zero. This works by finding the mean value for each pixel in the dataset, and then it subtracts the mean image from each image of the dataset. Then Zero Components Analysis (ZCA) whitening is applied to the dataset, and this reduces the variance in illumination and shadow between images. This is employed to decorrelate features by transforming the data so that its covariance matrix matches the identity matrix [59]. This preserves edge information, but sets regions of the image that are relatively uniform in intensity to a pixel value of zero.

For the validation and test dataset, featurewise centering and ZCA whitening are also applied to each image but using the statistics of the training dataset.

CNNs are able to learn functions which are invariant to transformations; however, they need a lot of training data to do so. Therefore, following a process proposed by Simard et al. [60], the size of the training set is increased by modifying the images with combinations of rotations, shifts, zooms, shears and horizontal flips. This augmentation is applied on-the-fly during training but is not applied to the test and validation images.
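A sketch of how this preprocessing and on-the-fly augmentation can be configured with Keras's ImageDataGenerator; the exact transformation ranges below are illustrative assumptions rather than the values used in this work:

```python
from keras.preprocessing.image import ImageDataGenerator

# Augmentation and preprocessing applied to the training images.
train_datagen = ImageDataGenerator(
    rescale=1.0 / 255,          # pixel values from 0-255 to 0-1
    featurewise_center=True,    # subtract the mean image of the training set
    zca_whitening=True,         # reduce variance in illumination and shadow
    rotation_range=10,          # small random rotations
    width_shift_range=0.1,
    height_shift_range=0.1,
    zoom_range=0.1,
    shear_range=0.1,
    horizontal_flip=True)

# Validation/test images get the same centring and whitening (fitted on the
# training data) but no random augmentation.
test_datagen = ImageDataGenerator(
    rescale=1.0 / 255, featurewise_center=True, zca_whitening=True)

# Assuming x_train is an array of shape (num_images, 48, 48, 1):
# train_datagen.fit(x_train)
# test_datagen.fit(x_train)  # statistics always come from the training set
```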


Network Architecture

First, a deep CNN will be used to evaluate how image preprocessing and augmentation affect the performance of the model. Then the same architecture will be used to evaluate how changing the feature space affects performance. The architecture used is based on the custom VGG-13 model of Barsoum et al. [50], which they used to evaluate the FER+ dataset; this model will be referred to as the FER+ CNN model in this paper. However, because they used images at 64×64 resolution and the input images in this study are 48×48, the size of the network has been reduced.

The deep CNN comprises ten convolution layers, two fully connected layers and a softmax output layer. The activation functions used are ReLUs, and dropout is applied before each of the fully connected layers. Max pooling layers are applied after the second, fourth, seventh and tenth convolution layers. In the softmax layer, each of the eight units outputs a value between 0 and 1, with the outputs summing to 1; each output unit represents an emotion and gives the confidence that the image belongs to that emotion class.
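A sketch of a network with this layout in Keras; the filter counts, dense-layer sizes and dropout rates below are illustrative assumptions rather than the exact values used in this work (the Adam optimiser and cross-entropy loss match the training setup described later):

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense, Dropout

def build_model(input_shape=(48, 48, 1), num_classes=8):
    model = Sequential()
    # Ten 3x3 convolution layers with ReLU, max pooling after the
    # 2nd, 4th, 7th and 10th convolutions.
    filters = [64, 64, 128, 128, 256, 256, 256, 512, 512, 512]
    pool_after = {2, 4, 7, 10}
    for i, f in enumerate(filters, start=1):
        if i == 1:
            model.add(Conv2D(f, (3, 3), padding="same", activation="relu",
                             input_shape=input_shape))
        else:
            model.add(Conv2D(f, (3, 3), padding="same", activation="relu"))
        if i in pool_after:
            model.add(MaxPooling2D(pool_size=(2, 2)))
    model.add(Flatten())
    # Dropout before each fully connected layer.
    model.add(Dropout(0.5))
    model.add(Dense(1024, activation="relu"))
    model.add(Dropout(0.5))
    model.add(Dense(1024, activation="relu"))
    # Softmax over the eight emotion classes.
    model.add(Dense(num_classes, activation="softmax"))
    model.compile(optimizer="adam", loss="categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_model()
```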

[Figure: The deep CNN architecture, modified from the FER+ VGG13 network. It consists of 8 convolutional layers with kernel sizes of 3×3; the fully-connected layers use ReLU as an activation function and each convolutional layer is followed by a ReLU layer.]

[Figure: The FER+ network structure compared with the network which achieved the best results in the genetic algorithm.]


Training

The CNN will be trained using an extension of stochastic gradient descent (SGD). This is an incremental method whereby the error gradient is approximated after inputting a mini-batch of training examples and used to update the weights of the CNN to minimise the error between the network output and the target output.

The images are input during training in mini-batches of 64, updating the weights after every mini-batch, so that the CNN weights are updated multiple times per epoch (one epoch being one forward and one backward pass of all the training examples). The batch size is set at 64 because batch size directly impacts GPU memory usage, so a small batch size enables larger CNN topologies to be tested without running out of memory. The loss function used to calculate the error is categorical cross-entropy, as this takes advantage of the multiple labels per image and was found to result in better training accuracy compared to majority voting, which uses only the majority of the label distribution for calculating the cost function [50]. The cross-entropy loss is defined below:

\mathcal{L} = -\sum_{k=1}^{8} p_k \log q_k \qquad (1)

where p_k is the target probability of emotion class k, taken from the crowd-sourced label distribution, and q_k is the corresponding softmax output of the network.

The CNN weights are initialised from a uniform random distribution between 0 and 0.05. The training set is shuffled separately each epoch. The optimiser used for training the CNN is Adam, because it has a small memory footprint and is effective for problems with noisy gradients, large datasets and networks with large numbers of parameters (REF).

Adam adapts the learning rates of different parameters by estimating the first and second moments of the gradients [61]. It is designed to combine the advantages of two other stochastic gradient descent extensions: the Adaptive Gradient Algorithm (AdaGrad), which is effective for problems with sparse gradients as it maintains a per-parameter learning rate (REF), and Root Mean Square Propagation (RMSProp), which adapts per-parameter learning rates based on the recent magnitudes of the gradients for each weight and is effective for non-stationary problems.


Visualising the CNN learning

In order to visualise what the CNN is learning, two methods can be used. Firstly, layer activations and convolutional filters can be visualised during the forward pass: an image is input into the CNN and the filters and activations of different layers are visualised.

Secondly, because the CNN transforms the images into a high-dimensional space, a dimensionality reduction technique developed by Laurens van der Maaten and Geoffrey Hinton [62], called t-Distributed Stochastic Neighbor Embedding (t-SNE), is used to embed the high-dimensional space in 2D.
It does this so that similar points in the high-dimensional space are modelled by nearby points in 2D space, and dissimilar points are distant.
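A sketch of this using scikit-learn's TSNE on the activations of the penultimate layer; it assumes `model` is the CNN built in the earlier sketch, and `x_val` below is a random stand-in for the real validation images:

```python
import numpy as np
from keras.models import Model
from sklearn.manifold import TSNE

# Stand-in for the validation images (shape: num_images x 48 x 48 x 1).
x_val = np.random.rand(500, 48, 48, 1)

# Truncate the trained CNN at its penultimate layer, so each image is
# mapped to a high-dimensional feature vector.
feature_extractor = Model(inputs=model.input,
                          outputs=model.layers[-2].output)
features = feature_extractor.predict(x_val)

# Embed the high-dimensional features in 2D: nearby points in feature
# space should end up close together in the embedding.
embedding = TSNE(n_components=2, perplexity=30).fit_transform(features)
print(embedding.shape)  # (num_images, 2)
```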


Manipulating the Feature Space


Feature Extraction and Image masks

This study aims to examine how manipulating the feature space of the face images affects classification accuracy for emotion recognition.


Hypothesis: The location of key face features will be sufficient for emotion categorisation

The key features for emotion detection are the eyes, eyebrows, nose and mouth. If the CNN primarily uses the relative positions of these features to detect emotions, then training a model with just the coordinates of the facial features should be sufficient to achieve good accuracy for emotion categorisation.
Studies have found that humans predominantly use shape and the distances between facial features to perceive facial expressions of emotion. As CNNs mimic the human visual system, they should be able to achieve good accuracy using just this information. If similar accuracy is achieved as with the whole images, this suggests that only the positions and outlines of the facial features are required to discriminate between emotions. A Haar feature extractor can be used to extract the positions of these key features in the image with good accuracy, and the features can then be presented to the CNN as dots for each coordinate with connecting lines within each feature (eyebrows, eye outlines, mouth outline, nose outline and jaw line).
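As an illustrative sketch of this landmark-extraction step, here is how the key feature coordinates could be drawn as dots using dlib (listed later in the Software section); the use of dlib's standard 68-point predictor file is an assumption, as the exact detector configuration is not reproduced here:

```python
import cv2
import dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
# Pre-trained 68-point landmark model, downloaded separately from dlib.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def landmark_image(gray_face):
    """Return an image of the same size containing only dots at the
    detected facial landmark locations."""
    canvas = np.zeros_like(gray_face)
    for rect in detector(gray_face, 1):
        shape = predictor(gray_face, rect)
        for i in range(68):
            point = shape.part(i)
            cv2.circle(canvas, (point.x, point.y), 1, 255, -1)
    return canvas
```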


Hypothesis: The information contained within key face features will be sufficient for emotion categorisation

The key features for emotion detection are the eyes, eyebrows, nose and mouth. To test this hypothesis, a Haar feature detector is used to detect the positions of the key features, and the rest of the image is cropped out. If the CNN primarily uses the relative positions of these features to detect emotions, together with the information contained within each feature, then cropping out the rest of the image apart from the eyebrows, eyes, nose, mouth and jawline should not cause a significant decrease in training, validation or test accuracy.

If good accuracy is achieved with this condition but not with just the locations of key features, it suggests that the CNN requires pixel intensity information from within each feature and not just the positions of the features for categorising the emotions. The pixel intensity information could provide more information about the shapes of the features. Also, the CNN could be using texture discrimination between and within the features to learn information about the shape and perspective of the face and features. If the accuracy is much lower than training a CNN with the whole image, it suggests that the CNN is using information from the rest of the face (and perhaps background) for categorising the emotions.


Hypothesis: Supplementing the face images with locations of key face features will improve model training and emotion categorisation

The face images are combined with the facial landmarks, which are added as an additional channel to the image. The hypothesis is that this will improve model training and emotion categorisation accuracy, as the additional information will help the CNN learn the locations of key facial features.


Using Genetic Algorithms to Search for optimal CNN topology and hyper-parameters


Hypothesis: A genetic algorithm can be used to Search for optimal CNN topology and hyper-parameters

As mentioned above, a study in 2017 [54] demonstrated the viability of genetic algorithms for optimising modern CNN architectures. However, it is believed that no previous study has applied genetic algorithms to deep convolutional networks in order to test how the optimal topology changes with the feature space, which is what this study aims to do.

In the study of Xie et al. [54], they used an Nvidia Titan-X GPU, with a present RRP of £1,179 [63]. The training time for a revised LeNet network for CIFAR10 recognition (50,000 32×32 RGB images for training) was approximately 17 GPU-days (population size of 20, over 50 rounds and 240 training epochs per individual).

Therefore, a major issue with running multiple genetic algorithms for different manipulations of the dataset is that it can quickly become infeasible without high-end GPU clusters: running a genetic algorithm on even one training set is extremely computationally expensive, so running it for several training-set conditions multiplies that cost.

For this study, then, an issue that had to be overcome was ensuring that the training time was feasible given the time constraints. The GPU used was an Nvidia GTX 1060, with half the frame-buffer of an Nvidia Titan-X and considerably lower performance, and an RRP of £279.00 [63].

A particular issue with genetic algorithms is that they often exhibit slow finishing, where after several generations the average fitness converges without finding a global maximum. Alternatively, premature convergence can occur, where the population converges to a local maximum because several individuals with high fitness dominate the population and not enough exploration occurs. One solution to these problems, implemented in this work, is tournament selection: a method of selecting individuals from the population by running several tournaments among random subsets of the population, with the winner of each tournament (the individual with the best fitness) being selected for crossover and/or mutation.

Tournament selection has been found to be independent of the fitness function scaling of the genetic algorithm and allows for easy adjustment of the selection pressure of the genetic algorithm.
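A minimal sketch of tournament selection (here fitness is taken to be validation accuracy, so higher is better):

```python
import random

def tournament_select(population, fitnesses, tournament_size):
    """Pick tournament_size individuals at random and return the fittest.
    Selection pressure is adjusted simply by changing the tournament size."""
    contestants = random.sample(range(len(population)), tournament_size)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]

# Toy example: each "individual" is just a number and its fitness is its value.
population = ["net_a", "net_b", "net_c", "net_d", "net_e"]
fitnesses = [0.55, 0.71, 0.48, 0.66, 0.60]
print(tournament_select(population, fitnesses, tournament_size=3))
```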

Furthermore, it has been found that adaptive genetic algorithms can be more effective in terms of faster convergence towards solutions with better fitness, and they are less likely to get stuck in local optima than static genetic algorithms [64]. In this study, an adaptive genetic algorithm is used: the crossover rate, mutation rate and tournament size change depending on the population diversity and fitness.

A genetic algorithm is an improvement over randomly generating models because there are selection pressures. The individuals (networks) are evaluated with a fitness function (validation loss or accuracy), and the likelihood that a gene (layer parameter) in an individual will be passed on to the next generation is proportional to the fitness of the individual. However, converging to an optimum needs to be balanced against not decreasing the diversity so much that the algorithm gets stuck at a local optimum.


Making the Genetic Algorithm Adaptive

The passing on of genes from generation to generation is controlled by parameters of the genetic algorithm: the crossover probability, mutation probability and tournament size. These parameters are defined as follows:

Crossover Probability: the probability that an individual will undergo crossover, exchanging genes with another selected individual.

Mutation Probability: the probability that the individual will undergo mutation, with a parameter in the genome randomly changing.

Tournament Size: in tournament selection, a sample of the population is selected, with the fittest individual being selected for crossover and/or mutation. This occurs with replacement to make up the next generation.

The work of this study is based on previous research [64], which found that making these parameters adaptive was more effective in finding an optimal solution in fewer generations.

These parameters are adaptive in that they are influenced by the current state of the population, which is measured in terms of the diversity of the population and the diversity weighted by the fitness of the population. These measures are used to calculate the crossover rate and tournament size. The individuals that do not undergo crossover are mutated, with a per-gene mutation probability that adapts to the fitness of the individual relative to the fitness of the best individual in the previous generation and the fitness of the current population.

These parameters govern the convergence in two ways. Firstly, as diversity decreases, the tournament size and crossover rate decrease in order to stop the genetic algorithm converging at a local optimum. Fewer individuals undergoing crossover also means that a greater proportion of individuals will undergo mutation, increasing the diversity of the gene pool. Secondly, if an individual has high fitness, it will have a lower mutation rate. This works in a similar way to how the learning rate of gradient descent is typically decreased during training: as one gets closer to an optimal solution (higher fitness), a smaller mutation rate should aid convergence to an optimal genome.


Hypothesis: The Optimal Parameters of the CNN depend on the feature space

The genetic algorithm was tested for various configurations of feature space – manipulations of the face images as described above. For extracting the locations of key features, a pre-trained facial landmark detector was used (described above). The images are all rescaled to 0-1 and featurewise mean centred, and ZCA whitening is applied; the statistics of the training dataset are used to preprocess the validation and test data.


Implementation

The implementation will be discussed in this section. It gives details of the chosen network configuration and parameters, along with the parameters for pre-processing.


Hardware

The Nvidia GTX 1060 GPU was used for training the deep CNNs as this model supports CUDA and cuDNN. The Nvidia CUDA Deep Neural Network library (cuDNN) provides highly tuned implementations of standard deep learning routines such as forward and backward convolutions, pooling, normalisation and activation layers. The GPU has 6GB of memory, which is why, after testing, the batch size for the genetic algorithm was set at 64: the models exceeded this memory with larger batch sizes.


Software

The data was processed using Python 3.6 with OpenCV [65] and dlib [66] for image processing and feature extraction. Python 3.6 was chosen because most of the popular deep learning libraries support Python.

The deep learning framework used for implementing the CNNs was TensorFlow [67] with the Keras front-end [68]. This was chosen because TensorFlow is the most widely adopted Python deep learning library, as measured by the number of GitHub contributions [69]. Furthermore, Keras allows rapid prototyping with its high-level Python API, yet because it runs on top of TensorFlow there is little performance cost compared to using the lower-level framework directly. Keras also provides functions for image preprocessing and on-the-fly augmentation, using random zooms, crops, flips and warping. The Python package NumPy was used for working with large multi-dimensional arrays.


Dataset Scripts

The FER dataset consists of a CSV file with the first column representing the emotion as an integer between 1 and 7, and the next column containing a vector of the pixel values of the grayscale image. The images from this dataset were combined with the labels (0-7) of the FER+ dataset, which is also in CSV format. A script was written to convert these vectors into grayscale images, and an output CSV file was created with a column for the emotion labels as integers and another column for the image locations.
A further script was written to convert this CSV file to a NumPy array containing the labels, with each image represented as a 48 by 48 array of values from 0-255, each integer representing a pixel of the grayscale image.
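As an illustration, a minimal sketch of parsing such a CSV into label and image arrays is given below. The column names ('emotion', 'pixels') are illustrative and may differ from the actual files used in this work.

    import numpy as np
    import pandas as pd

    def fer_csv_to_arrays(csv_path):
        """Parse a FER-style CSV into a label vector and a stack of 48x48 images."""
        df = pd.read_csv(csv_path)
        labels = df['emotion'].values.astype(np.int64)
        # Each row's pixel string becomes a 48x48 uint8 array (values 0-255)
        images = np.stack([
            np.array(row.split(), dtype=np.uint8).reshape(48, 48)
            for row in df['pixels']
        ])
        return labels, images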

For evaluating preprocessing and augmentation of the face images, the training, validation and test sets were split according to the original splits in the FER. Keras was used for image preprocessing and on-the-fly augmentation.
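A minimal sketch of this setup with the Keras ImageDataGenerator is shown below. The augmentation parameter values are illustrative, and x_train, y_train, x_val, y_val (with image arrays of shape (N, 48, 48, 1)) and model are assumed to be defined elsewhere. The images are rescaled to 0-1 before fitting so that the mean and ZCA statistics are computed on the same scale as the data fed to the network.

    from keras.preprocessing.image import ImageDataGenerator

    # Rescale to 0-1 before computing any statistics
    x_train = x_train.astype('float32') / 255.0
    x_val = x_val.astype('float32') / 255.0

    # Augmentation plus preprocessing for the training images (values illustrative)
    train_gen = ImageDataGenerator(
        featurewise_center=True,   # subtract the per-pixel training-set mean
        zca_whitening=True,        # decorrelate pixels using training-set statistics
        rotation_range=10,
        width_shift_range=0.1,
        height_shift_range=0.1,
        zoom_range=0.1,
        horizontal_flip=True)

    # Preprocessing only (no augmentation) for the validation and test images
    eval_gen = ImageDataGenerator(featurewise_center=True, zca_whitening=True)

    # Both generators are fitted on the training data only, so the validation
    # and test images are centred and whitened with training-set statistics
    train_gen.fit(x_train)
    eval_gen.fit(x_train)

    model.fit_generator(
        train_gen.flow(x_train, y_train, batch_size=64),
        steps_per_epoch=len(x_train) // 64,
        validation_data=eval_gen.flow(x_val, y_val, batch_size=64),
        validation_steps=len(x_val) // 64,
        epochs=10)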


Manipulation of the face images to generate the training data

Different training sets were generated by manipulating the face images, as described in the methodology. For analysing how extracting different facial features affects the deep CNN emotion categorisation, first, a facial landmark detector was used to try to locate the following face regions:

    • Eyes
    • Eyebrows
    • Nose
    • Mouth
    • Jawline

The facial landmark detector used is included in the dlib C++ library and was accessed through the dlib Python API. This landmark detector is an implementation of Kazemi and Sullivan's [70] landmark detection model and was trained on the iBUG 300-W face landmark dataset [71]. The model uses an ensemble of regression trees to estimate the positions of different facial landmarks. The facial landmark detector did not detect facial landmarks in all of the images. Therefore, the dataset was randomised and then split into a 70% training set, 15% validation set and 15% test set. Then, for comparison, only the images for which dlib could extract the features were compared across the different manipulations of the feature space, so that the comparison between conditions was fair.
Furthermore, preprocessing was applied to each of the images, as described above.
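A minimal sketch of the landmark extraction and feature masking is given below. The 68-point model file (shape_predictor_68_face_landmarks.dat) and the index ranges for each region follow dlib's standard distribution; the masking radius and the exact masking approach are illustrative rather than the precise script used in this work.

    import cv2
    import dlib
    import numpy as np

    detector = dlib.get_frontal_face_detector()
    # 68-landmark model trained on the iBUG 300-W dataset, as distributed with dlib
    predictor = dlib.shape_predictor('shape_predictor_68_face_landmarks.dat')

    # Landmark index ranges in the standard 68-point annotation scheme
    REGIONS = {
        'jawline':  range(0, 17),
        'eyebrows': range(17, 27),
        'nose':     range(27, 36),
        'eyes':     range(36, 48),
        'mouth':    range(48, 68),
    }

    def extract_landmarks(gray_image):
        """Return a (68, 2) array of landmark coordinates, or None if no face is found."""
        faces = detector(gray_image, 1)
        if len(faces) == 0:
            return None
        shape = predictor(gray_image, faces[0])
        return np.array([[shape.part(i).x, shape.part(i).y] for i in range(68)])

    def mask_face(gray_image, landmarks, radius=3):
        """Black out everything except small patches around the detected landmarks."""
        mask = np.zeros_like(gray_image)
        for (x, y) in landmarks:
            cv2.circle(mask, (int(x), int(y)), radius, 255, -1)
        return cv2.bitwise_and(gray_image, gray_image, mask=mask)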

Figure:
The different image manipulations are shown. In the second condition, the landmark detector is used to output images of feature locations at 200×200 (shown in the image) and 48×48, and in the third condition the rest of the face is cropped out apart from the key facial landmarks.

Image image_manipulation

However, augmentation was not applied, as this increased training time considerably (from approximately 48 hours without augmentation) and so was not feasible for all of the conditions.


Genetic Algorithm

The genetic algorithm was implemented with the DEAP genetic algorithm Python library [72] and the DEvol (DeepEvolution) Python library [73]. The maximum number of convolutional layers was set at 9, with the first always active, and the maximum number of fully-connected layers was set at 2, not including the output softmax layer. The configuration of each network was randomly generated, with the following possible parameters encoded into the genome (a minimal sketch of decoding such a genome into a Keras model is given after the parameter ranges below):

Table:
The genome parameters
Training Parameters
batch size 64
optimizer Adam, RmsProp, AdaGrad, AdaDelta
Convolutional layers:
active 0, 1
number of filters 8, 16, 32, 64, 128, 256
batch_normalisation 0,1
dropout 0%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%
max pooling 0, 1
Dense Layers:
active 0, 1
number of nodes 16, 32, 64, 128, 256
batch_normalisation 0, 1
activation ReLU, Sigmoid
dropout 0%, 5%, 10%, 15%, 20%, 25%, 30%, 35%, 40%, 45%, 50%

If a layer has a value of 0 for active, then the layer is not included in the CNN model. The following parameter ranges were set for the genetic algorithm:

    • tournament size: 5 to 10
    • crossover probability: 10% to 80%
    • maximum mutation probability: 10.6%
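To make the encoding concrete, the sketch below decodes a genome into a Keras model. The genome is represented here as a dictionary of per-layer parameter dictionaries for readability; the actual DEvol encoding stores the same information as a flat list of integers, and the handling of per-layer activations is an illustrative assumption.

    from keras.models import Sequential
    from keras.layers import (Conv2D, MaxPooling2D, BatchNormalization,
                              Dropout, Dense, Flatten)

    def build_model(genome, input_shape=(48, 48, 1), n_classes=8):
        """Decode an (illustrative) genome into a compiled CNN."""
        model = Sequential()
        first = genome['conv'][0]                      # the first conv layer is always active
        model.add(Conv2D(first['filters'], (3, 3), padding='same',
                         activation='relu', input_shape=input_shape))
        for layer in genome['conv'][1:]:
            if not layer['active']:
                continue                               # inactive layers are skipped entirely
            model.add(Conv2D(layer['filters'], (3, 3), padding='same',
                             activation=layer.get('activation', 'relu')))
            if layer['batch_norm']:
                model.add(BatchNormalization())
            if layer['max_pooling']:
                model.add(MaxPooling2D((2, 2)))
            model.add(Dropout(layer['dropout']))
        model.add(Flatten())
        for layer in genome['dense']:
            if not layer['active']:
                continue
            model.add(Dense(layer['nodes'], activation=layer['activation']))
            if layer['batch_norm']:
                model.add(BatchNormalization())
            model.add(Dropout(layer['dropout']))
        model.add(Dense(n_classes, activation='softmax'))
        # Keras accepts the optimiser by name, e.g. 'adam', 'rmsprop', 'adagrad', 'adadelta'
        model.compile(optimizer=genome['optimizer'],
                      loss='categorical_crossentropy', metrics=['accuracy'])
        return model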

After preliminary testing, several modifications were made to the adaptive genetic algorithm of McGinley et al. [64].


Population Size

Papers have found that having a lower population size and more generations is more effective. The population size was therefore decreased from 50 to 30, and the mutation rate was made adaptive. Although variable filter sizes were initially part of the genome, this resulted in many of the networks being too large in memory for the graphics card; therefore, to reduce the number of parameters in such very deep networks, 3×3 filters are used in all convolutional layers (the convolution stride is set to 1).


Calculating Population Diversity

A particular issue with the implementation of the genetic algorithm is that different parameters, such as the number of filters in a layer and the activation function, have different numbers of possible values. This means that taking the Euclidean distance as a measure of population diversity was not an effective solution without extensive manipulation of the data. Initially, the Euclidean distance was used as a measure of population diversity by normalising each parameter to between 0 and 1. However, this was not a good measure of diversity, as some parameters had an over-representative effect on the diversity of the population.
Therefore, instead of the Euclidean distance, the Hamming distance is calculated between every pair of individuals in the population. For calculating the weighted population diversity, the Hamming distance is multiplied by the fitness scores of each of the genomes in the pair. The Hamming distance provides a better measure of weighted diversity because, firstly, it does not prioritise any parameter over another when the parameters have different scales, and secondly, it takes into account the distances between every pair of individuals, not just from every individual to the average. Whilst in the original study McGinley et al. [64] argue that taking the Euclidean distance of each individual from the weighted average, weighted by fitness, is a good measure of diversity, this has inherent flaws with a non-binary genome. Such a measure is only a crude approximation of the actual spread of the data, because it only takes the distance from the weighted average, which is susceptible to being skewed by outliers. For example, if there were only two small clusters of genomes with very high fitness, but each cluster was very distant from the other in the search space, this would result in a very high variance and diversity score regardless of the spread of each of these clusters perpendicular to the weighted average. Taking the Hamming distance between every pair of individuals would, in this case, better reflect the actual spread of the population.

A consideration in the implementation of the genetic algorithm was that some genes are effectively silent: just as living organisms have inactive genes, if a layer is inactive then the genes representing its parameters have no effect on the phenotype of the network. Therefore, it was decided that these genes should not be used in calculating the diversity; instead, the Hamming distance between the genes of a layer that is inactive in both individuals is set at 0. If the layer is active in one individual and inactive in the other, then every gene of that layer has a distance of 1. This is effective because it means the genetic algorithm will increase the mutation rate when there is little expressed diversity. Yet, if there are two competing models in the population with a different number of active layers, this will increase the measure of diversity and so reduce the tournament size, with the aim of preventing a local optimum being reached in either of the high-fitness sub-populations.
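A sketch of this diversity measure is given below, with each genome represented as a list of per-layer dictionaries that include an 'active' gene (the same illustrative representation as in the decoding sketch above); the actual implementation operates on DEvol's flat genome encoding.

    import itertools

    def layer_distance(layer1, layer2):
        """Hamming-style distance between the corresponding layer of two genomes."""
        if not layer1['active'] and not layer2['active']:
            return 0                       # both inactive: the genes are not expressed
        if layer1['active'] != layer2['active']:
            return len(layer1)             # one active, one inactive: every gene counts as different
        return sum(1 for key in layer1 if layer1[key] != layer2[key])

    def genome_distance(genome1, genome2):
        return sum(layer_distance(l1, l2) for l1, l2 in zip(genome1, genome2))

    def population_diversity(population, fitnesses):
        """Plain and fitness-weighted diversity, summed over every pair of individuals."""
        diversity, weighted = 0.0, 0.0
        for (i, g1), (j, g2) in itertools.combinations(enumerate(population), 2):
            d = genome_distance(g1, g2)
            diversity += d
            weighted += d * fitnesses[i] * fitnesses[j]
        return diversity, weighted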


Adaptive Selection

In the original algorithm (see [64]), tournament size is calculated by taking the Weighted Population Diversity (WPD) as a fraction of the Maximum Population Diversity (MPD) and multiplying by the maximum tournament size. This was found to result in tournament sizes of only a couple of individuals after the first generation, because of the large drop in diversity between the starting generation and the first offspring generation. Therefore, the maximum WPD is reset at the second generation – the WPD of the second generation is taken as the new maximum.
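A minimal sketch of this calculation is shown below; the exact scaling used in this work may differ, and the bounds correspond to the configured range of 5 to 10.

    def tournament_size(wpd, max_wpd, min_size=5, max_size=10):
        """Scale tournament size with the weighted population diversity (WPD).

        wpd     -- weighted diversity of the current population
        max_wpd -- reference maximum WPD (reset at the second generation, as described above)
        """
        if max_wpd <= 0:
            return min_size
        size = int(round((wpd / max_wpd) * max_size))
        # Keep the selection pressure within the configured bounds
        return max(min_size, min(max_size, size))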

Furthermore, it was found that sometimes a population would have an individual with relatively good fitness compared to the rest of the population, but after mutation and crossover this fitness would be lost. Therefore, with tournament selection with replacement, a copy of the fittest individual in a population is passed on to the next generation without modification (elitism).


Adaptive Mutation and Crossover

The mutation was implemented so that every individual not undergoing crossover undergoes mutation. For each such individual, a random genome is generated. The code iterates through each parameter and generates a random number between 0 and 1; if the random number is less than the mutation probability, then that parameter is copied into the genome from the randomly generated individual. Therefore, if the calculated mutation probability is 10%, the effective mutation rate is lower, as the inserted parameter can have the same value as the one already present at that position. This is actually advantageous, as it favours mutations of parameters with many possible values: dropout has the highest effective mutation probability, almost twice that of a parameter with only two possible values (e.g. active).
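A sketch of this per-gene mutation is given below, operating on a flat list of gene values; gene_choices holds the possible values for each position and is illustrative.

    import random

    def mutate(genome, gene_choices, mutation_prob):
        """Mutate a genome by copying genes from a freshly randomised genome.

        Because the inserted value may equal the existing one, the effective per-gene
        mutation rate is mutation_prob * (n - 1) / n for a gene with n possible values.
        """
        random_genome = [random.choice(choices) for choices in gene_choices]
        mutated = list(genome)
        for i in range(len(genome)):
            if random.random() < mutation_prob:
                mutated[i] = random_genome[i]
        return mutated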

The table below indicates how the effective mutation probability of a parameter depends on the number of possible values for that parameter, shown for a calculated mutation probability of 100%.

Table:
The effective mutation probability for each parameter, adjusted according to the number of possible values of the parameter:
Possible Parameter Values    Effective Mutation Probability
1 0.00%
2 50.00%
3 66.67%
4 75.00%
5 80.00%
6 83.33%
7 85.71%
8 87.50%
9 88.89%
10 90.00%
11 90.91%

The aim was to have approximately three mutations in every genome. With a maximum calculated mutation probability of 10% and a genome length of 66 (the first layer's active gene, batch size and filter size have only one possible value, but are still included in the genome for easier modification), the effective per-gene mutation probability ranges from roughly 5% (for two-value parameters) to 9.1% (for dropout), giving an estimated 3-6 mutations per genome.


Working with Live Video

For the live video emotion recognition model, preliminary testing indicated that the recognition system was too slow for live video input. Therefore, at intervals of 3 seconds, a frame from the web-cam video feed is captured with OpenCV and converted from colour to grayscale. A pre-trained dlib face detector is used to detect face(s) in the image. This uses a Histogram of Oriented Gradients (HOG) feature in combination with an image pyramid, sliding window detection process and a linear classifier [74] .

Image happy_webcam

Then, the image frame is cropped to retain the first face detected. The face image is re-sized to 48 x 48 and preprocessed in the same way as the images for which the neural network was trained (see above) and then input into the CNN. The output layer of the CNN is a softmax layer for the eight different facial expression categories. These outputs are then plotted as a bar-chart.
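A minimal sketch of this loop is given below. The preprocess function, the trained model and the ordering of the emotion labels are assumptions standing in for the components described above, and the softmax outputs are printed rather than plotted as a bar chart.

    import time
    import cv2
    import dlib

    EMOTIONS = ['neutral', 'happiness', 'surprise', 'sadness',
                'anger', 'disgust', 'fear', 'contempt']   # illustrative ordering

    detector = dlib.get_frontal_face_detector()            # HOG + linear classifier face detector
    cap = cv2.VideoCapture(0)                               # default web-cam

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector(gray, 1)
        if len(faces) > 0:
            f = faces[0]                                    # keep the first detected face
            face = gray[max(f.top(), 0):f.bottom(), max(f.left(), 0):f.right()]
            face = cv2.resize(face, (48, 48)).astype('float32')
            # preprocess() stands in for the same rescaling/centering/whitening
            # that was applied to the training images
            x = preprocess(face).reshape(1, 48, 48, 1)
            probs = model.predict(x)[0]                     # softmax over the eight emotion classes
            print(dict(zip(EMOTIONS, probs.round(3))))
        time.sleep(3)                                       # sample a frame every 3 seconds

    cap.release()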


Results


Image preprocessing and augmentation

It was found that augmenting the training images resulted in an increase in validation accuracy compared to the preprocessed and unedited images. Interestingly, the augmented training data (both with and without preprocessing) did not result in an increase in training accuracy compared to the unaugmented data. This demonstrates that the augmentation is beneficial in that it enables the CNN to generalise better to unseen data. Image augmentation is frequently used in image classification deep learning models for this reason: by adding random transformations to the input images, it enables the network to better learn universal features which it can use to categorise the emotions.

Figure:
Training and validation accuracy of the FER+ CNN model with preprocessing, augmentation and both preprocessing and augmentation

Image all_faces_accuracy

The same pattern can be seen in the loss values during training. Loss and accuracy are frequently used interchangeably to evaluate model performance, with a lower loss value usually corresponding to an increase in accuracy. It is interesting that although preprocessing with augmentation had a lower accuracy than augmentation without preprocessing, they had similar loss values.

This is because accuracy is a binary method of evaluating model performance: accuracy is measured as the percentage of validation images for which the model correctly predicts the majority emotion. Loss, however, uses categorical cross-entropy and so is arguably a better performance metric: it does not just evaluate model performance based on the majority class, but takes the difference between the softmax output of the CNN (a probability distribution) and the class distribution, as the dataset has been labelled by 10 people per image.
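A small worked sketch of this difference is given below: two predictions that both get the majority class right (and so score identically on accuracy) receive different cross-entropy losses depending on how closely they match the annotators' label distribution. The numbers are illustrative.

    import numpy as np

    def categorical_cross_entropy(label_dist, predicted_dist, eps=1e-7):
        """Cross-entropy between the annotator label distribution and the softmax output."""
        predicted_dist = np.clip(predicted_dist, eps, 1.0)
        return -np.sum(label_dist * np.log(predicted_dist))

    # Example: 10 annotators split 7/3 between 'happiness' and 'neutral' (8 classes)
    label_dist = np.array([0.3, 0.7, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])

    pred_a = np.array([0.35, 0.60, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005])
    pred_b = np.array([0.05, 0.90, 0.01, 0.01, 0.01, 0.01, 0.005, 0.005])

    # Both predictions are "correct" for accuracy: the arg-max matches the majority label
    print(np.argmax(pred_a) == np.argmax(label_dist))   # True
    print(np.argmax(pred_b) == np.argmax(label_dist))   # True

    # ...but pred_a matches the annotator distribution more closely, so its loss is lower
    print(categorical_cross_entropy(label_dist, pred_a))   # ~0.67
    print(categorical_cross_entropy(label_dist, pred_b))   # ~0.97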

Figure:
Training and validation loss of the FER+ CNN model with preprocessing, augmentation and both preprocessing and augmentation.

Image all_faces_loss


The Key Facial Features for Emotion Recognition in DCNNs and Humans


Learning from the distances between key features

The CNNs had worse performance (lower validation accuracy and higher validation loss) when just the locations of key features were used to train and evaluate the network, and when the masked faces were used. This points to the fact that the CNN is not just using the locations of key features to categorise the emotions. It is interesting that the masked faces gave a higher accuracy than the feature locations alone; this supports the hypothesis that the location of key features is not, by itself, sufficient to characterise the emotions.

Figure:
This graph shows the FER+ CNN model training and validation accuracy and loss on key feature locations, full faces, and faces which have been masked out apart from the locations of key features (eyebrows, eyes, nose, mouth and jawline).

Image filtered_faces_accuracy
Image filtered_faces_loss

It was hypothesised that the feature locations at 200 by 200 resolution would give an increase in accuracy compared to the 48 by 48 resolution. However, the network did not train effectively on this data, which suggests that the network architecture is not optimal for this image resolution. It is likely that increasing the depth of the CNN would enable it to train to an accuracy comparable to the other conditions. Also, the 3 by 3 filter size is likely too small for this image resolution, as the patterns in the image are too coarse for such a small filter size.

Also, adding the feature locations as an additional image channel did not change validation accuracy significantly compared to the control. This demonstrates that the network is already effective at learning the locations of key features in the image: the filters in the CNN are learnable and respond selectively to different patterns in the input feature space, so providing information about the location of the key features is effectively redundant. During training, the weights of the filters are adjusted so that different filters respond selectively to these key features.


Comparing the adaptive genetic algorithm with non-adaptive

The results demonstrate that a genetic algorithm can be used effectively to optimise the hyperparameters of deep CNNs. Whilst previous work often uses larger population sizes and more generations than in this study, with resource and time constraints this was not feasible.

The adaptive and non-adaptive genetic algorithms were compared on the preprocessed images. It was found that the adaptive genetic algorithm converged to a higher maximum accuracy than the non-adaptive one. However, because only one run of each genetic algorithm is compared, it is inconclusive whether more runs would find a significant difference between the non-adaptive and the adaptive genetic algorithm for this configuration. In the subsequent section, the adaptive genetic algorithm is used to examine how augmentation and preprocessing affect the optimal hyperparameters of the CNN.


The adaptive genetic algorithm network is more efficient than the VGG13 model

The optimal network from the adaptive genetic algorithm trained with augmentation and preprocessing was trained for 100 epochs, until convergence, and the model was saved at the epoch with the highest validation accuracy. The optimal genetic algorithm CNN is shallower than the other two networks, has fewer parameters than the FER+ network and has far fewer fully connected units than the FER+ network. Interestingly, its last convolutional layer has a sigmoid activation function, whilst all the other weight layers of each network have ReLU activations. As discussed in the introduction, ReLU is typically used in preference to sigmoid activation functions in CNNs. Because of this, and because the genetic algorithm likely had not reached a global optimum after only ten generations, this network configuration was trained both with the original sigmoid activation function in the last convolutional layer and with a ReLU activation in this layer.

Figure:
The GA model, had a peak validation accuracy of 81.72% (epoch 99) compared to 81.25% (epoch 68) when ReLU activation was used instead of sigmoid activation, and 80.88% for the FER+ model (epoch 31).

It was found that the optimal CNN found by the genetic algorithm (GA) had better validation accuracy (81.72%) than the same model with ReLU instead of sigmoid activation after the last convolutional layer (81.25%), and than the FER+ model (80.88%).

Figure:
The optimal genetic algorithm CNN configuration compared to the FER+ network and the network adapted from the FER+ network which was used in this work. All the non-output weight layers have ReLU activation functions apart from the last convolutional layer of the genetic algorithm architecture, which has a sigmoid activation function. The GA model has 4 convolutional layers and one fully connected layer, with batch normalisation after the second and third convolutional layers. There is a max-pooling layer after the first, second and fourth convolutional layers, 10% dropout after the first two convolutional layers, 15% dropout after the third and fourth, and 5% dropout after the fully connected layer.

Image model_architectures


Optimising the Network Hyper-parameters


The optimal topology of the CNN depends on augmentation and preprocessing

The graph indicates that augmentation resulted in an increase in both mean and max accuracy of the CNN during convergence of the genetic algorithm.

Figure:
The adaptive genetic algorithm was used to find the optimal hyperparameters for CNNs trained with augmentation, balanced classes and no augmentation. The images were preprocessed.


Balancing the classes reduced the training accuracy. Although synthetic image generation is typically used to aid generalisation, in this case it seems to have had the converse effect and led to overfitting of the training data. The classes were balanced by augmenting the minority classes before training the CNN, and it is likely that the CNN overfits when learning the characteristics of the minority classes. The augmented images are not completely unique: even though they have random transformations, rotations and so on applied to them, the synthetic images still share features with the originals. The CNN trains on these features and cannot distinguish genuine features of the emotion classes from features of the original images which are properties of those particular images rather than generalisable properties of the emotion.

This type of overfitting can be seen in the image below:

Image face_miscategorisation

In this example, the colours represent the majority emotion label of the image (blue for happy and gray for neutral). The image is taken from a representation of the last convolutional layer of the CNN, after applying t-SNE to reduce the dimensionality to 2D. The CNN categorised the baby as happy when it should be neutral (indicated by gray). It can be seen that there are several other images of happy babies in close proximity to it. It is likely that the CNN has over-fitted on the images of the babies and has learned that some baby features are characteristics of happiness. This is why more training data is always better: if the network were trained with more images of babies in other emotional categories, it could better learn to separate the image characteristics that define a baby from the characteristics that define the emotions of a baby.
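A minimal sketch of how such a visualisation can be produced is given below, using a Keras intermediate-layer model and scikit-learn's t-SNE; the trained model, x_val and y_val (one-hot labels) are assumed, and selecting the last convolutional layer by name is illustrative.

    import numpy as np
    import matplotlib.pyplot as plt
    from keras.models import Model
    from sklearn.manifold import TSNE

    # Build a model that outputs the activations of the last convolutional layer
    last_conv = [layer for layer in model.layers if 'conv' in layer.name][-1]
    feature_model = Model(inputs=model.input, outputs=last_conv.output)

    # Flatten the activations for a sample of validation images
    features = feature_model.predict(x_val).reshape(len(x_val), -1)

    # Reduce to two dimensions with t-SNE and colour each point by its majority emotion label
    embedding = TSNE(n_components=2).fit_transform(features)
    plt.scatter(embedding[:, 0], embedding[:, 1],
                c=np.argmax(y_val, axis=1), cmap='tab10', s=5)
    plt.show()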

Number of Active Layers

The violin plots below show how the validation accuracy of the networks in each condition is distributed across the number of active layers. The balanced classes have a worse validation accuracy; as discussed in the previous section, pre-balancing rather than live augmentation likely results in the model overfitting and being unable to generalise to the unseen validation images for the minority emotion classes.


For the CNN trained without augmentation, the peak validation accuracy occurs with 7-8 active layers. With the augmented dataset, the networks with the highest validation accuracy are distributed across 5-8 active layers, with the optimal architecture found at 5 active layers. One might expect convergence towards a higher number of layers in the augmentation condition compared to without augmentation, since augmentation is typically used with neural networks to avoid overfitting. In theory, a single-hidden-layer MLP is a universal approximator and, with enough training data, it can model any non-linear function. However, the issue with a shallower rather than a deeper network is that, in order to achieve the same generalisation error as the deeper network, the shallower network would need a much larger sample size.

The augmentation, in effect, increases the size of the training data, and this means that the network is able to generalise with a shallower architecture. Given a longer training time than ten epochs, it is likely that the genetic algorithm would favour a deeper architecture. However, with such a constrained number of training runs the shallow network is advantageous, in that it has less redundancy and so can achieve better generalisation accuracy after only ten training epochs.

Optimiser Function

The chart below indicates that Adam was the optimiser which appeared most frequently in networks with high validation accuracy. The advantages of Adam have been discussed in the Methodology.

Figure:
A violin plot with each dot representing a CNN. The individual CNNs are grouped by optimizer. The genetic algorithm was run three times: with augmentation, with balanced classes and without augmentation. It was found that for the CNNs with high validation accuracy, the majority had the Adam optimizer across conditions. However, in the balanced-classes condition, most of the CNNs had the AdaDelta optimizer.


It was found that with no augmentation, and with the pre-balanced classes, the average dropout of the CNNs did not drop below 0.15. However, with augmentation, high accuracy was achieved with low dropout (see the figure below). This is because augmentation and dropout are both techniques used to prevent overfitting. Therefore, with augmentation the CNN is able to have fewer layers and less dropout without overfitting the data when training over 10 epochs.

Figure:
The average dropout of the active layers during training of the genetic algorithm used to optimise the hyperparameters of CNNs trained with different image augmentation conditions.



How changing the feature space affects the optimal network hyperparameters

The graph below shows that, as before, training and evaluating the networks on feature locations rather than the face images results in lower training and validation accuracy than the face images. Interestingly, there is only a small difference in accuracy between the masked faces and the full face images. This suggests that the key features are adequate to convey emotions.


Number of Active Layers
The chart shows that for the full face images, 7-8 active layers are optimal for validation accuracy, and as the feature space is reduced, the optimal number of layers decreases (5-8 for masked faces, 3-6 for feature locations).


Again, for the networks which achieved the highest accuracy, Adam appears to be the optimal optimiser. It is interesting that for the feature spaces that resulted in poor accuracy, there was no clear difference between the optimisers.

Evaluation and Discussion



This study successfully implemented emotion recognition from face images using CNNs and live webcam video. It was demonstrated that image augmentation and preprocessing are beneficial for training CNNs.
Furthermore, whilst most research in the area compares different types of models with the aim of increasing accuracy, this study took a different approach of reducing the information contained within the feature space (e.g. the full faces versus just the feature locations) and examining how this affects the accuracy. It was found that the locations of key features alone were not sufficient for categorising emotions. However, masking out the image apart from the key face features did not decrease accuracy much and resulted in little change in the optimal CNN hyperparameters compared to the full face images.

Typically, state-of-the-art CNN image classification models such as AlexNet are handcrafted, and state-of-the-art CNN models have been getting deeper [75] as the performance of GPUs has increased; this leads to CNN models with increasing redundancy. For example, each of the top five models for the Labeled Faces in the Wild (LFW) dataset contains hundreds of millions of parameters [76]. This is an extremely inefficient approach, as increasing redundancy increases both the number of computations that the model needs to perform to predict a target and the memory footprint of the model.

This project was successful in demonstrating that genetic algorithms can be an effective means of studying the properties of convolutional neural networks. In particular, it demonstrated that genetic algorithms can be used to study how changes to the feature space of a deep CNN change its optimal hyperparameters. In the context of emotion recognition, the genetic algorithm showed that as the amount of information in the image is reduced (full faces, masked faces, feature locations), the optimal number of layers decreases. One surprising finding was that, despite ReLU being the typical activation function for CNNs and being preferred over the sigmoid activation, the optimal CNN architecture had a sigmoid activation in its last convolutional layer.

The reason why image augmentation led the genetic algorithm to favour shallower networks is as follows. The genetic algorithm, having a short training duration per individual, will favour networks which train fast. Whilst better accuracy could be achieved with a deeper network, the fact that there is only a small accuracy increase suggests that a deeper network would have more redundancy: the deeper network would have a larger number of nodes, so each node would make a smaller contribution to the classification of an image. Therefore, by constraining the network, the redundancy should decrease. This is also beneficial for live video performance, as it is not feasible to feed images at a high rate to an extremely deep network; the increased number of computations required to process each input image results in a longer computation time per image.

This also leads on to the possibility of dropping out weights from the network to boost speed. The genetic algorithm has found a network that is approximately optimal, although with a larger range of parameters and a longer training duration, accuracy would likely increase slightly further. One could use the same methodology to prune the network further: a genetic algorithm could be used to select the maximum number of nodes that can be dropped while incurring only a small drop in network accuracy. Whilst this would decrease accuracy slightly, it would be interesting to test what percentage of the nodes could be dropped out before accuracy starts to decrease significantly.

Furthermore, one area of future research could be to use a Q-learning CNN, trained to pick a small percentage of pixels (e.g. a 3×3 grid of pixels) with the goal of decreasing the accuracy of the trained CNN which is classifying emotions. This would show which pixels of the image are most important for the CNN when classifying the facial expressions. The network could also be trained to select a filter to drop out, with the reward proportional to the decrease in accuracy for emotion recognition. This would be interesting as it would show which filters are most important for classifying each of the emotions, in addition to the location of the most important pixels. It is likely that the filters dropped out would correspond to the pixels which are dropped out as well.

[1] J. D. Schall, A. Morel, D. J. King, and J. Bullier, “Topography of visual cortex connections with frontal eye field in macaque: convergence and segregation of processing streams.,” The journal of neuroscience : the official journal of the society for neuroscience, vol. 15, iss. 6, p. 4464–4487, 1995.
[2] [doi] P. Ekman, An argument for basic emotions, 1992.
[3] W. Wundt, “An Outline of Psychology,” Science, vol. 5, iss. 127, p. 882–884, 1897.
[4] [doi] P. Ekman and W. V. Friesen, “Constants across cultures in the face and emotion.,” Journal of personality and social psychology, vol. 17, iss. 2, p. 124–129, 1971.
[5] [doi] R. Plutchik, “The nature of emotions: Human emotions have deep evolutionary roots,” American scientist, vol. 89, iss. 4, p. 344–350, 2001.
[6] [doi] H. Lövheim, “A new three-dimensional model for emotions and monoamine neurotransmitters,” Medical hypotheses, vol. 78, iss. 2, p. 341–348, 2012.
[7] [doi] P. Ekman and K. G. Heider, “The universality of a contempt expression: A replication,” Motivation and emotion, vol. 12, iss. 3, p. 303–308, 1988.
[8] [doi] P. Abhang, S. Rao, B. W. Gawali, and P. Rokade, “Emotion recognition using speech and EEG signal: A review,” International journal of computer applications, vol. 15, iss. 3, p. 37–40, 2011.
[9] [doi] A. Martinez and S. Du, “A Model of the Perception of Facial Expressions of Emotion by Humans: Research Overview and Perspectives,” Journal of machine learning research, vol. 13, iss. 2012, p. 1589–1608, 2012.
[10] [doi] Z. S. Hippe, J. L. Kulikowski, T. Mroczek, and J. Wtorek, “Human-Computer Systems Interaction: Backgrounds and Applications 3,” Advances in intelligent systems and computing, vol. 300, 2014.
[11] [doi] H. Su, S. Maji, E. Kalogerakis, and E. Learned-miller, “Multi-view Convolutional Neural Networks for 3D Shape Recognition,” Ieee iccv, p. 945–953, 2015.
[12] R. Shwartz-Ziv and N. Tishby, “Opening the Black Box of Deep Neural Networks via Information,” , p. 1–19, 2017.
[13] X. Wu and X. Zhang, “Automated Inference on Criminality using Face Images,” Arxiv:1611.04135, 2016.
[14] [doi] T. Poggio, H. Mhaskar, L. Rosasco, B. Miranda, and Q. Liao, Why and when can deep-but not shallow-networks avoid the curse of dimensionality: A review, 2017.
[15] S. H. Hasanpour, M. Rouhani, and M. Fayyaz, “Let’s keep it simple: using simple architectures to outperform deeper and more complex architectures,” arXiv:1608.06037, 2016.
[16] S. Lawrence, C. L. Giles, A. C. Tsoi, and A. D. Back, “Face recognition: a convolutional neural-network approach,” IEEE Transactions on Neural Networks, vol. 8, iss. 1, p. 98–113, 1997.
[17] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution,” in Proceedings of the 18th ACM International Conference on Multimodal Interaction, 2016, p. 279–283.
[18] J. Kumari, R. Rajesh, and K. M. Pooja, “Facial Expression Recognition: A Survey,” in Procedia Computer Science, vol. 58, 2015, p. 486–491.
[19] B. Kim, H. Lee, J. Roh, and S. Lee, “Hierarchical Committee of Deep CNNs with Exponentially-Weighted Decision Fusion for Static Facial Expression Recognition,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, p. 427–434.
[20] F. Rosenblatt, “The perceptron: a probabilistic model for information storage and organization in the brain,” Psychological Review, vol. 65, iss. 6, p. 386–408, 1958.
[21] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward networks are universal approximators,” Neural Networks, vol. 2, iss. 5, p. 359–366, 1989.
[22] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Mathematics of Control, Signals, and Systems, vol. 2, p. 303–314, 1989.
[23] V. Nair and G. E. Hinton, “Rectified Linear Units Improve Restricted Boltzmann Machines,” in Proceedings of the 27th International Conference on Machine Learning, 2010, p. 807–814.
[24] K.-L. Du and M. N. S. Swamy, “Fundamentals of Machine Learning,” in Neural Networks and Statistical Learning, London: Springer London, 2014, p. 15–65.
[25] H. K. Hartline, “The response of single optic nerve fibers of the vertebrate eye to illumination of the retina,” American Journal of Physiology, vol. 121, p. 400–415, 1938.
[26] K. Fukushima, “Neocognitron: a self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, iss. 4, p. 193–202, 1980.
[27] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, iss. 11, p. 2278–2323, 1998.
[28] H. Lee, R. Grosse, R. Ranganath, and A. Y. Ng, “Convolutional deep belief networks for scalable unsupervised learning of hierarchical representations,” in Proceedings of the 26th Annual International Conference on Machine Learning – ICML ’09, 2009, p. 1–8.
[29] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2015, p. 1–9.
[30] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” in Proceedings of the 25th International Conference on Neural Information Processing Systems, 2012, p. 1097–1105.
[31] K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” in International Conference on Learning Representations (ICLR), 2015, p. 1–14.
[32] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, vol. 13, p. 281–305, 2012.
[33] P. R. Lorenzo, J. Nalepa, M. Kawulok, L. S. Ramos, and J. Ranilla, “Particle Swarm Optimization for Hyper-Parameter Selection in Deep Neural Networks,” in Proceedings of the 2017 Genetic and Evolutionary Computation Conference (GECCO), 2017.
[34] M. Suganuma, S. Shirakawa, and T. Nagao, “A genetic programming approach to designing convolutional neural network architectures,” in GECCO 2017 – Proceedings of the 2017 Genetic and Evolutionary Computation Conference, 2017.
[35] B. Baker, O. Gupta, N. Naik, and R. Raskar, “Designing Neural Network Architectures using Reinforcement Learning,” arXiv:1611.02167, p. 1–16, 2016.
[36] P. Carcagnì, M. Del Coco, M. Leo, and C. Distante, “Facial expression recognition and histograms of oriented gradients: a comprehensive study,” SpringerPlus, vol. 4, iss. 1, p. 645, 2015.
[37] A. Yao, J. Shao, N. Ma, and Y. Chen, “Capturing AU-aware facial features and their latent relations for emotion recognition in the wild,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, p. 451–458.
[38] G. Zhao and M. Pietikäinen, “Boosted multi-resolution spatiotemporal descriptors for facial expression recognition,” Pattern Recognition Letters, vol. 30, iss. 12, p. 1117–1127, 2009.
[39] S. E. Kahou, V. Michalski, and R. Memisevic, “Recurrent Neural Networks for Emotion Recognition in Video,” in Proceedings of the 2015 ACM International Conference on Multimodal Interaction, 2015, p. 467–474.
[40] A. Dhall, R. Goecke, T. Gedeon, and N. Sebe, “Emotion recognition in the wild,” Journal on Multimodal User Interfaces, vol. 10, iss. 2, p. 95–97, 2016.
[41] I. J. Goodfellow et al., “Challenges in representation learning: a report on three machine learning contests,” arXiv:1307.0414, 2013.
[42] I. J. Goodfellow et al., “Challenges in representation learning: a report on three machine learning contests,” arXiv:1307.0414, 2013.
[43] Y. Tang, “Deep Learning using Linear Support Vector Machines,” arXiv:1306.0239, 2013.
[44] [doi] Z. Yu and C. Zhang, “Image based Static Facial Expression Recognition with Multiple Deep Network Learning,” Proceedings of the 2015 acm on international conference on multimodal interaction, p. 435–442, 2015.
[Bibtex]
@article{Yu,
abstract = {We report our image based static facial expression recognition method for the Emotion Recognition in the Wild Challenge (EmotiW) 2015. We focus on the sub-challenge of the SFEW 2.0 dataset, where one seeks to automatically classify a set of static images into 7 basic emotions. The proposed method contains a face detection module based on the ensemble of three state-of-the-art face detectors, followed by a classification module with the ensemble of multiple deep convolutional neural networks (CNN). Each CNN model is initialized randomly and pre-trained on a larger dataset provided by the Facial Expression Recognition (FER) Challenge 2013. The pre-trained models are then fine-tuned on the training set of SFEW 2.0. To combine multiple CNN models, we present two schemes for learning the ensemble weights of the network responses: by minimizing the log likelihood loss, and by minimizing the hinge loss. Our proposed method generates state-of-the-art result on the FER dataset. It also achieves 55.96{\%} and 61.29{\%} respectively on the validation and test set of SFEW 2.0, surpassing the challenge baseline of 35.96{\%} and 39.13{\%} with significant gains.},
author = {Yu, Zhiding and Zhang, Cha},
doi = {10.1145/2818346.2830595},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Yu, Zhang - Unknown - Image based Static Facial Expression Recognition with Multiple Deep Network Learning.pdf:pdf},
isbn = {9781450339124},
journal = {Proceedings of the 2015 ACM on International Conference on Multimodal Interaction},
keywords = {convolutional neural net-,emotiw 2015 challenge,facial expression recognition,multiple network learning,work},
mendeley-groups = {Dissertation},
pages = {435--442},
title = {{Image based Static Facial Expression Recognition with Multiple Deep Network Learning}},
url = {http://www.contrib.andrew.cmu.edu/{~}yzhiding/publications/ICMI15.pdf},
year = {2015}
}
[45] M. D. Zeiler and R. Fergus, “Stochastic Pooling for Regularization of Deep Convolutional Neural Networks,” International conference on representation learning, p. 1–9, 2013.
[Bibtex]
@article{Zeiler,
abstract = {We introduce a simple and effective method for regularizing large convolutional neural networks. We replace the conventional deterministic pooling operations with a stochastic procedure, randomly picking the activation within each pooling region according to a multinomial distribution, given by the activities within the pooling region. The approach is hyper-parameter free and can be combined with other regularization approaches, such as dropout and data augmentation. We achieve state-of-the-art performance on four image datasets, relative to other approaches that do not utilize data augmentation.},
archivePrefix = {arXiv},
arxivId = {1301.3557},
author = {Zeiler, Matthew D and Fergus, Rob},
eprint = {1301.3557},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Zeiler, Fergus - Unknown - Stochastic Pooling for Regularization of Deep Convolutional Neural Networks.pdf:pdf},
journal = {International Conference on Representation Learning},
mendeley-groups = {Dissertation},
pages = {1--9},
title = {{Stochastic Pooling for Regularization of Deep Convolutional Neural Networks}},
url = {https://arxiv.org/pdf/1301.3557.pdf http://arxiv.org/abs/1301.3557},
year = {2013}
}
[46] [doi] I. Song, H. J. Kim, and P. B. Jeon, “Deep learning for real-time robust facial expression recognition on a smartphone,” in Digest of technical papers – ieee international conference on consumer electronics, 2014, p. 564–567.
[Bibtex]
@inproceedings{Song2014,
abstract = {We developed a real-time robust facial expression recognition function on a smartphone. To this end, we trained a deep convolutional neural network on a GPU to classify facial expressions. The network has 65k neurons and consists of 5 layers. The network of this size exhibits substantial overfitting when the size of training examples is not large. To combat overfitting, we applied data augmentation and a recently introduced technique called "dropout". Through experimental evaluation over various face datasets, we show that the trained network outperformed a classifier based on hand-engineered features by a large margin. With the trained network, we developed a smartphone app that recognized the user's facial expression. In this paper, we share our experiences on training such a deep network and developing a smartphone app based on the trained network.},
author = {Song, Inchul and Kim, Hyun Jun and Jeon, Paul Barom},
booktitle = {Digest of Technical Papers - IEEE International Conference on Consumer Electronics},
doi = {10.1109/ICCE.2014.6776135},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Song, Kim, Jeon - 2014 - Deep learning for real-time robust facial expression recognition on a smartphone.pdf:pdf},
isbn = {9781479912919},
issn = {0747668X},
mendeley-groups = {Dissertation},
month = {jan},
pages = {564--567},
publisher = {IEEE},
title = {{Deep learning for real-time robust facial expression recognition on a smartphone}},
url = {http://ieeexplore.ieee.org/document/6776135/},
year = {2014}
}
[47] P. Burkert, F. Trier, M. Z. Afzal, A. Dengel, and M. Liwicki, “DeXpression: Deep Convolutional Neural Network for Expression Recognition,” Arxiv preprint arxiv:1509.05371, 2015.
[Bibtex]
@article{Burkert2015,
abstract = {We propose a convolutional neural network (CNN) architecture for facial expression recognition. The proposed architecture is independent of any hand-crafted feature extraction and performs better than the earlier proposed convolutional neural network based approaches. We visualize the automatically extracted features which have been learned by the network in order to provide a better understanding. The standard datasets, i.e. Extended Cohn-Kanade (CKP) and MMI Facial Expression Databse are used for the quantitative evaluation. On the CKP set the current state of the art approach, using CNNs, achieves an accuracy of 99.2{\%}. For the MMI dataset, currently the best accuracy for emotion recognition is 93.33{\%}. The proposed architecture achieves 99.6{\%} for CKP and 98.63{\%} for MMI, therefore performing better than the state of the art using CNNs. Automatic facial expression recognition has a broad spectrum of applications such as human-computer interaction and safety systems. This is due to the fact that non-verbal cues are important forms of communication and play a pivotal role in interpersonal communication. The performance of the proposed architecture endorses the efficacy and reliable usage of the proposed work for real world applications.},
archivePrefix = {arXiv},
arxivId = {1509.05371},
author = {Burkert, Peter and Trier, Felix and Afzal, Muhammad Zeshan and Dengel, Andreas and Liwicki, Marcus},
eprint = {1509.05371},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Burkert et al. - 2015 - DeXpression Deep Convolutional Neural Network for Expression Recognition.pdf:pdf},
journal = {arXiv preprint arXiv:1509.05371},
mendeley-groups = {Dissertation},
month = {sep},
title = {{DeXpression: Deep Convolutional Neural Network for Expression Recognition}},
url = {http://arxiv.org/abs/1509.05371},
year = {2015}
}
[48] [doi] A. T. Lopes, E. de Aguiar, A. F. {De Souza}, and T. Oliveira-Santos, “Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order,” Pattern recognition, vol. 61, p. 610–628, 2017.
[Bibtex]
@article{Lopes2017,
abstract = {Facial expression recognition has been an active research area in the past 10 years, with growing application areas including avatar animation, neuromarketing and sociable robots. The recognition of facial expressions is not an easy problem for machine learning methods, since people can vary significantly in the way they show their expressions. Even images of the same person in the same facial expression can vary in brightness, background and pose, and these variations are emphasized if considering different subjects (because of variations in shape, ethnicity among others). Although facial expression recognition is very studied in the literature, few works perform fair evaluation avoiding mixing subjects while training and testing the proposed algorithms. Hence, facial expression recognition is still a challenging problem in computer vision. In this work, we propose a simple solution for facial expression recognition that uses a combination of Convolutional Neural Network and specific image pre-processing steps. Convolutional Neural Networks achieve better accuracy with big data. However, there are no publicly available datasets with sufficient data for facial expression recognition with deep architectures. Therefore, to tackle the problem, we apply some pre-processing techniques to extract only expression specific features from a face image and explore the presentation order of the samples during training. The experiments employed to evaluate our technique were carried out using three largely used public databases (CK+, JAFFE and BU-3DFE). A study of the impact of each image pre-processing operation in the accuracy rate is presented. The proposed method: achieves competitive results when compared with other facial expression recognition methods – 96.76{\%} of accuracy in the CK+ database – it is fast to train, and it allows for real time facial expression recognition with standard computers.},
author = {Lopes, Andr{\'{e}} Teixeira and de Aguiar, Edilson and {De Souza}, Alberto F. and Oliveira-Santos, Thiago},
doi = {10.1016/j.patcog.2016.07.026},
issn = {00313203},
journal = {Pattern Recognition},
keywords = {Computer vision,Convolutional Neural Networks,Expression specific features,Facial expression recognition,Machine learning},
mendeley-groups = {Dissertation},
pages = {610--628},
title = {{Facial expression recognition with Convolutional Neural Networks: Coping with few data and the training sample order}},
url = {http://www.sciencedirect.com/science/article/pii/S0031320316301753},
volume = {61},
year = {2017}
}
[49] [doi] H. Wang, S. Z. Li, Y. Wang, and W. Zhang, “Illumination modeling and normalization for face recognition,” 2003 ieee international workshop on analysis and modeling of faces and gestures (amfg 2003), p. 104–111, 2003.
[Bibtex]
@article{Wang2003,
abstract = {We present a general framework for face modeling under varying lighting conditions. First, we show that a face lighting subspace can be constructed based on three or more training face images illuminated by noncoplanar lights. The lighting of any face image can be represented as a point in this subspace. Second, we show that the extreme rays, i.e. the boundary of an illumination cone, cover the entire light sphere. Therefore, a relatively sparsely sampled face images can be used to build a face model instead of calculating each extremely illuminated face image. Third, we present a face normalization algorithm, illumination alignment, i.e. changing the lighting of one face image to that of another face image. Experiments are presented.},
author = {Wang, Haitao and Li, S. Z. and Wang, Yangsheng and Zhang, Weiwei},
doi = {10.1109/AMFG.2003.1240831},
isbn = {0-7695-2010-3},
journal = {2003 IEEE International Workshop on Analysis and Modeling of Faces and Gestures (AMFG 2003)},
mendeley-groups = {Dissertation},
pages = {104--111},
title = {{Illumination modeling and normalization for face recognition}},
url = {http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.8358},
year = {2003}
}
[50] [doi] E. Barsoum, C. Zhang, C. C. Ferrer, and Z. Zhang, “Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution,” in Proceedings of the 18th acm international conference on multimodal interaction, 2016, p. 279–283.
[Bibtex]
@inproceedings{Barsoum,
abstract = {Crowd sourcing has become a widely adopted scheme to collect ground truth labels. However, it is a well-known problem that these labels can be very noisy. In this paper, we demonstrate how to learn a deep convolutional neural network (DCNN) from noisy labels, using facial expression recognition as an example. More specifically, we have 10 taggers to label each input image, and compare four different approaches to utilizing the multiple labels: majority voting, multi-label learning, probabilistic label drawing, and cross-entropy loss. We show that the traditional majority voting scheme does not perform as well as the last two approaches that fully leverage the label distribution. An enhanced FER+ data set with multiple labels for each face image will also be shared with the research community.},
archivePrefix = {arXiv},
arxivId = {1608.01041},
author = {Barsoum, Emad and Zhang, Cha and Ferrer, Cristian Canton and Zhang, Zhengyou},
booktitle = {Proceedings of the 18th ACM International Conference on Multimodal Interaction},
doi = {10.1145/2993148.2993165},
eprint = {1608.01041},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Barsoum et al. - Unknown - Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution(2).pdf:pdf},
isbn = {9781450345569},
keywords = {annotation,con-,crowd sourcing,emotion recognition,facial expression recognition,volutional neural network},
mendeley-groups = {Dissertation},
pages = {279--283},
title = {{Training Deep Networks for Facial Expression Recognition with Crowd-Sourced Label Distribution}},
url = {https://arxiv.org/pdf/1608.01041.pdf http://arxiv.org/abs/1608.01041},
year = {2016}
}
[51] M. Leszczyński, “Image Preprocessing for Illumination Invariant Face Verification,” Journal of telecommunications and information technology, p. 19–25, 2010.
[Bibtex]
@article{Leszczynski,
abstract = {—Performance of the face verification system depend on many conditions. One of the most problematic is varying illumination condition. In this paper 14 normalization algo-rithms based on histogram normalization, illumination prop-erties and the human perception theory were compared using 3 verification methods. The results obtained from the exper-iments showed that the illumination preprocessing methods significantly improves the verification rate and it's a very im-portant step in face verification system. Keywords—DLDA, face verification, histogram normalization, homomorphic filtering, illumination normalization, LDA, PCA, preprocessing techniques, quotient image, retinex.},
author = {Leszczy{\'{n}}ski, Mariusz},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Leszczy{\'{n}}ski - Unknown - Image Preprocessing for Illumination Invariant Face Verification.pdf:pdf},
journal = {Journal of Telecommunications and Information Technology},
keywords = {dlda,face verification,histogram normalization,homomorphic filtering,illumination normalization,lda,pca,preprocessing methods to,preprocessing techniques,quotient image,retinex,the second approach using},
mendeley-groups = {Dissertation},
pages = {19--25},
title = {{Image Preprocessing for Illumination Invariant Face Verification}},
url = {http://dlibra.itl.waw.pl/dlibra-webapp/Content/1063/ISSN{\_}1509-4553{\_}4{\_}2010{\_}19.pdf},
year = {2010}
}
[52] [doi] M. Mitchell, “An Introduction to Genetic Algorithms (Complex Adaptive Systems),” The mit press, p. 221, 1998.
[Bibtex]
@article{Mitchell1998,
abstract = {Genetic algorithms have been used in science and engineering as adaptive algorithms for solving practical problems and as computational models of natural evolutionary systems. This brief, accessible introduction describes some of the most interesting research in the field and also enables readers to implement and experiment with genetic algorithms on their own. It focuses in depth on a small set of important and interesting topics-particularly in machine learning, scientific modeling, and artificial life-and reviews a broad span of research, including the work of Mitchell and her colleagues. The descriptions of applications and modeling projects stretch beyond the strict boundaries of computer science to include dynamical systems theory, game theory, molecular biology, ecology, evolutionary biology, and population genetics.},
author = {Mitchell, Melanie},
doi = {10.1016/S0898-1221(96)90227-8},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Mitchell - 1999 - An introduction to genetic algorithms.pdf:pdf},
isbn = {0262631857},
issn = {08981221},
journal = {The MIT Press},
mendeley-groups = {Dissertation},
pages = {221},
pmid = {21368999},
title = {{An Introduction to Genetic Algorithms (Complex Adaptive Systems)}},
url = {https://svn-d1.mpi-inf.mpg.de/AG1/MultiCoreLab/papers/ebook-fuzzy-mitchell-99.pdf http://www.amazon.ca/exec/obidos/redirect?tag=citeulike09-20{\&}path=ASIN/0262631857},
year = {1998}
}
[53] [doi] D. Whitley, “A genetic algorithm tutorial,” Statistics and computing, vol. 4, iss. 2, p. 65–85, 1994.
[Bibtex]
@article{Whitley1994,
abstract = {This tutorial covers the canonical genetic algorithm as well as more experimental forms of genetic algorithms, including parallel island models and parallel cellular genetic algorithms. The tutorial also illustrates genetic search by hyperplane sampling. The theoretical foun- dations of genetic algorithms are reviewed, include the schema theorem as well as recently developed exact models of the canonical genetic algorithm.},
archivePrefix = {arXiv},
arxivId = {arXiv:1011.1669v3},
author = {Whitley, Darrell},
doi = {10.1007/BF00175354},
eprint = {arXiv:1011.1669v3},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Whitley - 1994 - A genetic algorithm tutorial.pdf:pdf},
isbn = {1101001100},
issn = {09603174},
journal = {Statistics and Computing},
keywords = {Genetic algorithms,parallel algorithms,search},
mendeley-groups = {Dissertation},
month = {jun},
number = {2},
pages = {65--85},
pmid = {848},
publisher = {Kluwer Academic Publishers},
title = {{A genetic algorithm tutorial}},
url = {http://link.springer.com/10.1007/BF00175354},
volume = {4},
year = {1994}
}
[54] L. Xie and A. Yuille, “Genetic CNN,” Arxiv preprint arxiv:1703.01513, 2017.
[Bibtex]
@article{Xie2017,
abstract = {The deep Convolutional Neural Network (CNN) is the state-of-the-art solution for large-scale visual recognition. Following basic principles such as increasing the depth and constructing highway connections, researchers have manually designed a lot of fixed network structures and verified their effectiveness. In this paper, we discuss the possibility of learning deep network structures automatically. Note that the number of possible network structures increases exponentially with the number of layers in the network, which inspires us to adopt the genetic algorithm to efficiently traverse this large search space. We first propose an encoding method to represent each network structure in a fixed-length binary string, and initialize the genetic algorithm by generating a set of randomized individuals. In each generation, we define standard genetic operations, e.g., selection, mutation and crossover, to eliminate weak individuals and then gen-erate more competitive ones. The competitiveness of each individual is defined as its recognition accuracy, which is obtained via training the network from scratch and evaluat-ing it on a validation set. We run the genetic process on two small datasets, i.e., MNIST and CIFAR10, demonstrating its ability to evolve and find high-quality structures which are little studied before. These structures are also trans-ferrable to the large-scale ILSVRC2012 dataset.},
archivePrefix = {arXiv},
arxivId = {1703.01513},
author = {Xie, Lingxi and Yuille, Alan},
eprint = {1703.01513},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Xie, Yuille - 2017 - Genetic CNN.pdf:pdf},
journal = {arXiv preprint arXiv:1703.01513},
mendeley-groups = {Dissertation},
title = {{Genetic CNN}},
url = {https://arxiv.org/pdf/1703.01513.pdf},
year = {2017}
}
[55] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, and Others, “Imagenet large scale visual recognition challenge,” International journal of computer vision, vol. 115, iss. 3, p. 211–252, 2015.
[Bibtex]
@article{Russakovsky,
abstract = {The ImageNet Large Scale Visual Recogni-tion Challenge is a benchmark in object category classi-fication and detection on hundreds of object categories and millions of images. The challenge has been run an-nually from 2010 to present, attracting participation from more than fifty institutions. lenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recog-nition, provide a detailed analysis of the current state of the field of large-scale image classification and ob-ject detection, and compare the state-of-the-art com-puter vision accuracy with human accuracy. We con-clude with lessons learned in the five years of the chal-lenge, and propose future directions and improvements.},
author = {Russakovsky, Olga and Deng, Jia and Su, Hao and Krause, Jonathan and Satheesh, Sanjeev and Ma, Sean and Huang, Zhiheng and Karpathy, Andrej and Khosla, Aditya and Bernstein, Michael and Others},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Russakovsky et al. - 2015 - Imagenet large scale visual recognition challenge.pdf:pdf},
journal = {International Journal of Computer Vision},
keywords = {Benchmark {\textperiodcentered},Dataset {\textperiodcentered},Large-scale {\textperiodcentered},Object detection,Object recognition {\textperiodcentered}},
mendeley-groups = {Dissertation},
number = {3},
pages = {211--252},
title = {{Imagenet large scale visual recognition challenge}},
url = {https://arxiv.org/pdf/1409.0575.pdf},
volume = {115},
year = {2015}
}
[56] S. Soo, “Object detection using Haar-cascade Classifier,” , vol. 2, iss. 3, p. 1–12, 2014.
[Bibtex]
@article{Soo2014,
abstract = {Object detection is an important feature of computer science. The benefits of object detection is however not limited to someone with a doctorate of informatics. Instead, object detection is growing deeper and deeper into the common parts of the information society, lending a helping hand wherever needed. This paper will address one such possibility, namely the help of a Haar-cascade classifier. The main focus will be on the case study of a vehicle detection and counting system and the possibilities it will provide in a semi-enclosed area - both the statistical kind and also for the common man. The goal of the system to be developed is to further ease and augment the everyday part of our lives.},
author = {Soo, Sander},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Soo - Unknown - Object detection using Haar-cascade Classifier.pdf:pdf},
keywords = {classifier,ect detection using haar-cascade},
mendeley-groups = {Dissertation},
number = {3},
pages = {1--12},
title = {{Object detection using Haar-cascade Classifier}},
url = {http://ds.cs.ut.ee/Members/artjom85/2014dss-course-media/Object detection using Haar-final.pdf},
volume = {2},
year = {2014}
}
[57] [doi] I. J. Goodfellow, D. Erhan, P. {Luc Carrier}, A. Courville, M. Mirza, B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D. H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie, L. Romaszko, B. Xu, Z. Chuang, and Y. Bengio, “Challenges in representation learning: A report on three machine learning contests,” Neural networks, vol. 64, p. 59–63, 2015.
[Bibtex]
@article{Goodfellow2015,
abstract = {The ICML 2013 Workshop on Challenges in Representation Learning. 11http://deeplearning.net/icml2013-workshop-competition. focused on three challenges: the black box learning challenge, the facial expression recognition challenge, and the multimodal learning challenge. We describe the datasets created for these challenges and summarize the results of the competitions. We provide suggestions for organizers of future challenges and some comments on what kind of knowledge can be gained from machine learning competitions.},
archivePrefix = {arXiv},
arxivId = {1307.0414},
author = {Goodfellow, Ian J. and Erhan, Dumitru and {Luc Carrier}, Pierre and Courville, Aaron and Mirza, Mehdi and Hamner, Ben and Cukierski, Will and Tang, Yichuan and Thaler, David and Lee, Dong Hyun and Zhou, Yingbo and Ramaiah, Chetan and Feng, Fangxiang and Li, Ruifan and Wang, Xiaojie and Athanasakis, Dimitris and Shawe-Taylor, John and Milakov, Maxim and Park, John and Ionescu, Radu and Popescu, Marius and Grozea, Cristian and Bergstra, James and Xie, Jingjing and Romaszko, Lukasz and Xu, Bing and Chuang, Zhang and Bengio, Yoshua},
doi = {10.1016/j.neunet.2014.09.005},
eprint = {1307.0414},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Goodfellow et al. - Unknown - Challenges in Representation Learning A report on three machine learning contests(2).pdf:pdf},
isbn = {9783642420504},
issn = {18792782},
journal = {Neural Networks},
keywords = {Competition,Dataset,Representation learning},
mendeley-groups = {Dissertation},
month = {jul},
pages = {59--63},
pmid = {25613956},
title = {{Challenges in representation learning: A report on three machine learning contests}},
url = {http://arxiv.org/abs/1307.0414},
volume = {64},
year = {2015}
}
[58] OpenCV, About – OpenCV library, 2017.
[Bibtex]
@misc{OpenCV2017,
author = {OpenCV},
mendeley-groups = {Dissertation},
title = {{About - OpenCV library}},
url = {http://opencv.org},
urldate = {2017-06-15},
year = {2017}
}
[59] A. Krizhevsky, “Learning Multiple Layers of Features from Tiny Images,” Computer science department, university of toronto, tech. report, p. 1–60, 2009.
[Bibtex]
@article{Krizhevsky2009,
abstract = {Groups at MIT and NYU have collected a dataset of millions of tiny colour images from the web. It is, in principle, an excellent dataset for unsupervised training of deep generative models, but previous researchers who have tried this have found it difficult to learn a good set of filters from the images. We show how to train a multi-layer generative model that learns to extract meaningful features which resemble those found in the human visual cortex. Using a novel parallelization algorithm to distribute the work among multiple machines connected on a network, we show how training such a model can be done in reasonable time. A second problematic aspect of the tiny images dataset is that there are no reliable class labels which makes it hard to use for object recognition experiments. We created two sets of reliable labels. The CIFAR-10 set has 6000 examples of each of 10 classes and the CIFAR-100 set has 600 examples of each of 100 non-overlapping classes. Using these labels, we show that object recognition is significantly improved by pre-training a layer of features on a large set of unlabeled tiny images.},
archivePrefix = {arXiv},
arxivId = {arXiv:1011.1669v3},
author = {Krizhevsky, Alex},
eprint = {arXiv:1011.1669v3},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Krizhevsky - 2009 - Learning Multiple Layers of Features from Tiny Images.pdf:pdf},
isbn = {9788578110796},
issn = {1098-6596},
journal = {Computer Science Department, University of Toronto, Tech. Report},
mendeley-groups = {Dissertation},
pages = {1--60},
pmid = {25246403},
title = {{Learning Multiple Layers of Features from Tiny Images}},
url = {http://www.cs.toronto.edu/{~}kriz/learning-features-2009-TR.pdf},
year = {2009}
}
[60] [doi] P. Y. Simard, D. Steinkraus, and J. C. Platt, “Best practices for convolutional neural networks applied to visual document analysis,” Seventh international conference on document analysis and recognition, 2003. proceedings., vol. 1, iss. Icdar, p. 958–963, 2003.
[Bibtex]
@article{Simard2003,
abstract = {Neural Networks are a powerful technology for classification of visual inputs arising from documents. However, there is a confusing plethora of different neural network methods that are used in the literature and in industry. This paper describes a set of concrete best practices that document analysis researchers can use to get good results with neural networks. The most important practice is that convolutional neural networks are better suited for visual document tasks than fully connected networks. We propose that a simple "do-it-yourself" implementation of convolution neural networks does not require complex methods, such as momentum, weight decay, structure-dependent learning rates, averaging layers, tangent prop, or even finely-tuning the architecture. The end result is a very simple yet general architecture which can yield state-of-the-art performance for document analysis. We illustrate our claims on the MNIST set of English digit images.},
author = {Simard, P.Y. and Steinkraus, Dave and Platt, J.C.},
doi = {10.1109/ICDAR.2003.1227801},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Simard, Steinkraus, Platt - Unknown - Best Practices for Convolutional Neural Networks Applied to Visual Document Analysis.pdf:pdf},
isbn = {0-7695-1960-1},
issn = {15205363},
journal = {Seventh International Conference on Document Analysis and Recognition, 2003. Proceedings.},
keywords = {Best practices,Concrete,Convolution,Handwriting recognition,Industrial training,Information processing,Neural networks,Performance analysis,Support vector machines,Text analysis},
mendeley-groups = {Dissertation},
number = {Icdar},
pages = {958--963},
title = {{Best practices for convolutional neural networks applied to visual document analysis}},
url = {https://pdfs.semanticscholar.org/7b1c/c19dec9289c66e7ab45e80e8c42273509ab6.pdf http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1227801},
volume = {1},
year = {2003}
}
[61] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in International conference on learning representations, 2015.
[Bibtex]
@inproceedings{Kingma2015,
abstract = {We introduce Adam, an algorithm for first-order gradient-based optimization of stochastic objective functions, based on adaptive estimates of lower-order mo-ments. The method is straightforward to implement, is computationally efficient, has little memory requirements, is invariant to diagonal rescaling of the gradients, and is well suited for problems that are large in terms of data and/or parameters. The method is also appropriate for non-stationary objectives and problems with very noisy and/or sparse gradients. The hyper-parameters have intuitive interpre-tations and typically require little tuning. Some connections to related algorithms, on which Adam was inspired, are discussed. We also analyze the theoretical con-vergence properties of the algorithm and provide a regret bound on the conver-gence rate that is comparable to the best known results under the online convex optimization framework. Empirical results demonstrate that Adam works well in practice and compares favorably to other stochastic optimization methods. Finally, we discuss AdaMax, a variant of Adam based on the infinity norm.},
author = {Kingma, Diederik P and Ba, Jimmy},
booktitle = {International Conference on Learning Representations},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Kingma, Ba - Unknown - ADAM A METHOD FOR STOCHASTIC OPTIMIZATION.pdf:pdf},
isbn = {9781509044986},
mendeley-groups = {Dissertation},
title = {{Adam: A method for stochastic optimization}},
url = {https://arxiv.org/pdf/1412.6980.pdf},
year = {2015}
}
[62] [doi] C. R. García-Alonso, L. M. Pérez-Naranjo, and J. C. Fernández-Caballero, “Multiobjective evolutionary algorithms to identify highly autocorrelated areas: The case of spatial distribution in financially compromised farms,” Annals of operations research, vol. 219, iss. 1, p. 187–202, 2014.
[Bibtex]
@article{Maaten2008,
abstract = {Local Indicators of Spatial Aggregation (LISA) can be used as objectives in a multicriteria framework when highly autocorrelated areas (hot-spots) must be identified and geographically located in complex areas. To do so, a Multi-Objective Evolutionary Algorithm (MOEA) based on SPEA2 (Strength Pareto Evolutionary Algorithm v.2) has been designed to evaluate three different fitness functions (fine-grained strength, the weighted sum of objectives and fuzzy evaluation of weighted objectives) and three LISA methods. MOEA makes it possible to achieve a compromise between spatial econometric methods as it highlights areas where a specific phenomenon shows significantly high autocorrelation. The spatial distribution of financially compromised olive-tree farms in Andalusia (Spain) was selected for analysis and two fuzzy hot-spots were statistically identified and spatially located. Hot-spots can be considered to be spatial fuzzy sets where the spatial units have a membership degree that can also be calculated.},
archivePrefix = {arXiv},
arxivId = {1307.1662},
author = {Garc{\'{i}}a-Alonso, Carlos R. and P{\'{e}}rez-Naranjo, Leonor M. and Fern{\'{a}}ndez-Caballero, Juan C.},
doi = {10.1007/s10479-011-0841-3},
eprint = {1307.1662},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Maaten, Hinton - 2008 - Visualizing Data using t-SNE.pdf:pdf},
isbn = {1532-4435},
issn = {15729338},
journal = {Annals of Operations Research},
keywords = {Financially compromised areas,Fuzzy hot-spots,Local indicators of spatial aggregation,Multiobjective evolutionary algorithms,Spatial analysis},
mendeley-groups = {Dissertation},
number = {1},
pages = {187--202},
pmid = {20652508},
title = {{Multiobjective evolutionary algorithms to identify highly autocorrelated areas: The case of spatial distribution in financially compromised farms}},
url = {http://www.jmlr.org/papers/volume9/vandermaaten08a/vandermaaten08a.pdf http://link.springer.com/10.1007/s10479-011-0841-3{\%}5Cnhttp://www.ncbi.nlm.nih.gov/pubmed/20652508},
volume = {219},
year = {2014}
}
[63] GeForce, NVIDIA TITAN X Graphics Card with Pascal, 2017.
[Bibtex]
@misc{Nvidia,
author = {GeForce},
booktitle = {geforce.co.uk},
mendeley-groups = {Dissertation},
title = {{NVIDIA TITAN X Graphics Card with Pascal}},
url = {https://www.geforce.co.uk/hardware/10series/titan-x/},
urldate = {2017-08-25},
year = {2017}
}
[64] [doi] B. {Mc Ginley}, F. Morgan, and C. O’Riordan, “Maintaining diversity through adaptive selection, crossover and mutation,” Gecco’08: proceedings of the 10th annual conference on genetic and evolutionary computation 2008, vol. 4, p. 1127–1128, 2008.
[Bibtex]
@article{McGinley2008,
abstract = {This paper presents an Adaptive Genetic Algorithm (AGA) where selection pressure, crossover and mutation probabilities are adapted according to population diversity statistics. The creation and maintenance of a diverse population of healthy individuals is a central goal of this research. To realise this objective, population diversity measures are utilised by the parameter adaptation process to both explore (through diversity promotion) and exploit (by local search and maintenance of a presence in known good regions of the fitness landscape). The performance of the proposed AGA is evaluated using a multi-modal, multi-dimensional function optimisation benchmark. Results presented indicate that the AGA achieves better fitness scores faster compared to a traditional GA.},
author = {{Mc Ginley}, Brian and Morgan, Fearghal and O'Riordan, Colm},
doi = {10.1145/1389095.1389311},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Mc Ginley, Morgan, O 'riordan - Unknown - Maintaining Diversity through Adaptive Selection, Crossover and Mutation.pdf:pdf},
isbn = {9781605581309},
journal = {GECCO'08: Proceedings of the 10th Annual Conference on Genetic and Evolutionary Computation 2008},
keywords = {Adaptive Genetic Algorithm (AGA),Adaptive selection,Parameter adaptation,Weighted population diversity},
mendeley-groups = {Dissertation},
pages = {1127--1128},
title = {{Maintaining diversity through adaptive selection, crossover and mutation}},
url = {http://www.cs.bham.ac.uk/{~}wbl/biblio/gecco2008/docs/p1127.pdf http://www.scopus.com/inward/record.url?eid=2-s2.0-57349178416{\&}partnerID=tZOtx3y1},
volume = {4},
year = {2008}
}
[65] Opencv, OpenCV Library, 2014.
[Bibtex]
@misc{Opencv2014,
abstract = {Open Source Computer Vision is a library of programming functions for real time computer vision},
author = {Opencv},
mendeley-groups = {Dissertation},
title = {{OpenCV Library}},
url = {http://opencv.org/},
year = {2014}
}
[66] Dlib, D-lib C++ library, 2016.
[Bibtex]
@misc{Dlib2016,
author = {Dlib},
mendeley-groups = {Dissertation},
pages = {1},
title = {{D-lib C++ library}},
url = {http://dlib.net/},
year = {2016}
}
[67] P. Goldsborough, “A Tour of TensorFlow,” Arxiv, vol. 1610.01178, p. 1–16, 2016.
[Bibtex]
@article{Goldsborough2016,
abstract = {Deep learning is a branch of artificial intelligence employing deep neural network architectures that has signifi-cantly advanced the state-of-the-art in computer vision, speech recognition, natural language processing and other domains. In November 2015, Google released TensorFlow, an open source deep learning software library for defining, training and deploying machine learning models. In this paper, we review TensorFlow and put it in context of modern deep learning concepts and software. We discuss its basic computational paradigms and distributed execution model, its programming interface as well as accompanying visualization toolkits. We then compare Ten-sorFlow to alternative libraries such as Theano, Torch or Caffe on a qualitative as well as quantitative basis and finally comment on observed use-cases of TensorFlow in academia and industry.},
archivePrefix = {arXiv},
arxivId = {1610.01178},
author = {Goldsborough, Peter},
eprint = {1610.01178},
isbn = {9788578110796},
issn = {1098-6596},
journal = {Arxiv},
mendeley-groups = {Dissertation},
pages = {1--16},
pmid = {25246403},
title = {{A Tour of TensorFlow}},
volume = {1610.01178},
year = {2016}
}
[68] F. Chollet, Keras Documentation, 2017.
[Bibtex]
@misc{Chollet2017,
author = {Chollet, Francois},
booktitle = {Keras.io},
mendeley-groups = {Dissertation},
title = {{Keras Documentation}},
url = {https://keras.io/},
urldate = {2017-08-25},
year = {2017}
}
[69] Maciej, Machine Learning Frameworks Comparison, 2016.
[Bibtex]
@misc{Maciej2016,
author = {Maciej},
booktitle = {PaperspaceBlog},
mendeley-groups = {Dissertation},
title = {{Machine Learning Frameworks Comparison}},
url = {https://blog.paperspace.com/which-ml-framework-should-i-use/},
urldate = {2017-08-25},
year = {2016}
}
[70] [doi] V. Kazemi and J. Sullivan, “One millisecond face alignment with an ensemble of regression trees,” in Proceedings of the ieee computer society conference on computer vision and pattern recognition, 2014, p. 1867–1874.
[Bibtex]
@inproceedings{Kazemi2014,
abstract = {This paper addresses the problem of Face Alignment for a single image. We show how an ensemble of regression trees can be used to estimate the face's landmark positions directly from a sparse subset of pixel intensities, achieving super-realtime performance with high quality predictions. We present a general framework based on gradient boosting for learning an ensemble of regression trees that optimizes the sum of square error loss and naturally handles missing or partially labelled data. We show how using appropriate priors exploiting the structure of image data helps with ef- ficient feature selection. Different regularization strategies and its importance to combat overfitting are also investi- gated. In addition, we analyse the effect of the quantity of training data on the accuracy of the predictions and explore the effect of data augmentation using synthesized data.},
author = {Kazemi, Vahid and Sullivan, Josephine},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
doi = {10.1109/CVPR.2014.241},
isbn = {9781479951178},
issn = {10636919},
keywords = {Decision Trees,Face Alignment,Gradient Boosting,Real-Time},
mendeley-groups = {Dissertation},
month = {jun},
pages = {1867--1874},
publisher = {IEEE},
title = {{One millisecond face alignment with an ensemble of regression trees}},
url = {http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6909637},
year = {2014}
}
[71] [doi] C. Sagonas, G. Tzimiropoulos, S. Zafeiriou, and M. Pantic, “300 faces in-the-wild challenge: The first facial landmark Localization Challenge,” in Proceedings of the ieee international conference on computer vision, 2013, p. 397–403.
[Bibtex]
@inproceedings{Sagonas,
abstract = {Automatic facial point detection plays arguably the most important role in face analysis. Several methods have been proposed which reported their results on databases of both constrained and unconstrained conditions. Most of these databases provide annotations with different mark-ups and in some cases the are problems related to the accuracy of the fiducial points. The aforementioned issues as well as the lack of a evaluation protocol makes it difficult to compare performance between different systems. In this paper, we present the 300 Faces in-the-Wild Challenge: The first facial landmark localization Challenge which is held in conjunction with the International Conference on Computer Vision 2013, Sydney, Australia. The main goal of this challenge is to compare the performance of different methods on a new-collected dataset using the same evaluation protocol and the same mark-up and hence to develop the first standardized benchmark for facial landmark localization.},
author = {Sagonas, Christos and Tzimiropoulos, Georgios and Zafeiriou, Stefanos and Pantic, Maja},
booktitle = {Proceedings of the IEEE International Conference on Computer Vision},
doi = {10.1109/ICCVW.2013.59},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Sagonas et al. - 2013 - 300 faces in-the-wild challenge The first facial landmark Localization Challenge.pdf:pdf},
isbn = {9781479930227},
issn = {15505499},
mendeley-groups = {Dissertation},
pages = {397--403},
title = {{300 faces in-the-wild challenge: The first facial landmark Localization Challenge}},
url = {https://ibug.doc.ic.ac.uk/media/uploads/documents/sagonas{\_}iccv{\_}2013{\_}300{\_}w.pdf},
year = {2013}
}
[72] F. Fortin, F. {De Rainville}, M. Gardner, M. Parizeau, and C. Gagné, “DEAP: Evolutionary Algorithms Made Easy,” Journal of machine learning research, vol. 13, p. 2171–2175, 2012.
[Bibtex]
@article{FortinFELIX-ANTOINEFORTIN2012,
abstract = {DEAP is a novel evolutionary computation framework for rapid prototyping and testing of ideas. Its design departs from most other existing frameworks in that it seeks to make algorithms explicit and data structures transparent, as opposed to the more common black-box frameworks. Freely avail- able with extensive documentation at http://deap.gel.ulaval.ca, DEAP is an open source project under an LGPL license.},
author = {Fortin, F{\'{e}}lix-Antoine and {De Rainville}, Fran{\c{c}}ois-Michel and Gardner, Marc-Andr{\'{e}} and Parizeau, Marc and Gagn{\'{e}}, Christian},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Fortin FELIX-ANTOINEFORTIN et al. - 2012 - DEAP Evolutionary Algorithms Made Easy Fran{\c{c}}ois-Michel De Rainville.pdf:pdf},
isbn = {1532-4435},
issn = {1533-7928},
journal = {Journal of Machine Learning Research},
keywords = {distributed evolutionary algorithms,software tools},
mendeley-groups = {Dissertation},
pages = {2171--2175},
title = {{DEAP: Evolutionary Algorithms Made Easy}},
url = {http://www.jmlr.org/papers/volume13/fortin12a/fortin12a.pdf http://jmlr.csail.mit.edu/papers/volume13/fortin12a/fortin12a.pdf},
volume = {13},
year = {2012}
}
[73] Unknown bibtex entry with key [devol]
[Bibtex]
[74] Dlib face detector example (face_detector.py), 2005.
[Bibtex]
@misc{dlib,
mendeley-groups = {Dissertation},
pages = {183},
title = {{Dlib face detector example (face{\_}detector.py)}},
url = {http://dlib.net/face{\_}detector.py.html},
urldate = {2017-08-25},
year = {2005}
}
[75] L. Wang, C. Lee, Z. Tu, and S. Lazebnik, “Training Deeper Convolutional Networks with Deep Supervision,” Arxiv, 2015.
[Bibtex]
@article{Wang2015,
abstract = {One of the most promising ways of improving the perfor-mance of deep convolutional neural networks is by increas-ing the number of convolutional layers. However, adding layers makes training more difficult and computationally expensive. In order to train deeper networks, we propose to add auxiliary supervision branches after certain intermedi-ate layers during training. We formulate a simple rule of thumb to determine where these branches should be added. The resulting deeply supervised structure makes the train-ing much easier and also produces better classification re-sults on ImageNet and the recently released, larger MIT Places dataset.},
archivePrefix = {arXiv},
arxivId = {arXiv:1505.02496v1},
author = {Wang, Liwei and Lee, Chen-Yu and Tu, Zhuowen and Lazebnik, Svetlana},
eprint = {arXiv:1505.02496v1},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Wang et al. - Unknown - Training Deeper Convolutional Networks with Deep Supervision.pdf:pdf},
journal = {arXiv},
mendeley-groups = {Dissertation},
title = {{Training Deeper Convolutional Networks with Deep Supervision}},
url = {https://arxiv.org/pdf/1505.02496.pdf},
year = {2015}
}
[76] [doi] Y. Sun, X. Wang, and X. Tang, “Deeply learned face representations are sparse, selective, and robust,” in Proceedings of the ieee computer society conference on computer vision and pattern recognition, 2015, p. 2892–2900.
[Bibtex]
@inproceedings{Sun2015,
abstract = {This paper designs a high-performance deep convolutional network (DeepID2+) for face recognition. It is learned with the identification-verification supervisory signal. By increasing the dimension of hidden representations and adding supervision to early convolutional layers, DeepID2+ achieves new state-of-the-art on LFW and YouTube Faces benchmarks. Through empirical studies, we have discovered three properties of its deep neural activations critical for the high performance: sparsity, selectiveness and robustness. (1) It is observed that neural activations are moderately sparse. Moderate sparsity maximizes the discriminative power of the deep net as well as the distance between images. It is surprising that DeepID2+ still can achieve high recognition accuracy even after the neural responses are binarized. (2) Its neurons in higher layers are highly selective to identities and identity-related attributes. We can identify different subsets of neurons which are either constantly excited or inhibited when different identities or attributes are present. Although DeepID2+ is not taught to distinguish attributes during training, it has implicitly learned such high-level concepts. (3) It is much more robust to occlusions, although occlusion patterns are not included in the training set.},
archivePrefix = {arXiv},
arxivId = {1412.1265},
author = {Sun, Yi and Wang, Xiaogang and Tang, Xiaoou},
booktitle = {Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
doi = {10.1109/CVPR.2015.7298907},
eprint = {1412.1265},
file = {:cs/home/hj22/.local/share/data/Mendeley Ltd./Mendeley Desktop/Downloaded/Sun, Wang, Tang - Unknown - Deeply learned face representations are sparse, selective, and robust.pdf:pdf},
isbn = {9781467369640},
issn = {10636919},
mendeley-groups = {Dissertation},
pages = {2892--2900},
pmid = {21808091},
title = {{Deeply learned face representations are sparse, selective, and robust}},
url = {https://arxiv.org/pdf/1412.1265.pdf},
volume = {07-12-June},
year = {2015}
}
