There are thousands of deep learning papers; where to start? Here is a curated list of the greatest hits.
Year | Author/Title | Notes |
---|---|---|
1943 | Warren S. McCulloch and Walter Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity | Is a neural network a computing machine? McCulloch and Pitts are the first to model neural networks as an abstract computational system. They find that under various assumptions, networks of neurons are as powerful as propositional logic, sparking widespread interest in neural models of computation. |
1958 | Frank Rosenblatt, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain | Can an artificial neural network learn? Rosenblatt proposes the Perceptron Algorithm, a method for iteratively adjusting variable-weight connections between neurons to learn to solve a problem. He raises funds from the U.S. Navy to build a physical Perceptron machine. In press coverage, Rosenblatt anticipates walking, talking, self-conscious machines. |
1959 | Jerome Lettvin, Humberto Maturana, Warren McCulloch and Walter Pitts, What the Frog's Eye Tells the Frog's Brain | Do nerves transmit ideas? Lettvin provocatively proposes that the frog optic nerve signals the presence of meaningful patterns rather than just brightness, demonstrating that the eye is doing part of the computational work of vision. Lettvin is also known for his famous thought experiment that your brain might contain a Grandmother Neuron that you use to conceptualize your grandmother. |
1959 | David H. Hubel and Torsten N. Wiesel, Receptive Fields of Single Neurones in the Cat's Striate Cortex | How does biological vision work? This paper and its 1962 extension kick off a 25-year collaboration in which Hubel and Wiesel methodically analyze the processing of signals through mammalian visual systems, developing many specific insights about the operation of the Visual Cortex that later inspire and inform the design of convolutional neural networks. They win the Nobel Prize in 1981. |
1969 | Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry | What cannot be learned by a perceptron? During the early 1960s, while Rosenblatt argues that his neural networks can do almost anything, Minsky counters that they can do very little. This influential book lays out the negative argument, showing that many simple problems such as maze-solving or even XOR cannot be solved by a single-layer perceptron network. The sharp critique leads to one of the first AI Winter periods, during which many researchers abandon neural networks. |
1972 | Teuvo Kohonen, Correlation Matrix Memories | Can a neural network store memories? Kohonen (and, simultaneously, Anderson) observes that a single-layer network can act as a matrix Associative Memory if keys and data are seen as vectors of neural activations and the keys are linearly independent (sketched in code after the table). Associative memory becomes a major focus of neural network research in the coming decades. |
1981 | Geoffrey E. Hinton, Implementing Semantic Networks in Parallel Hardware | How are concepts represented? In a book on associative memory edited with Anderson, Hinton proposes that concepts should not be represented as single units but as vectors of activations, and he demonstrates a scheme that encodes complex relationships in a distributed fashion. Distributed representation becomes a core tenet of the Parallel Distributed Processing (PDP) framework, advanced in a book by Rumelhart, McClelland, and Hinton (1986), and a central dogma in the understanding of large neural networks. |
1986 | David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, Learning Representations by Back-Propagating Errors | How can a deep network learn? Learning in multilayer networks was not widely understood until this paper's explanation of the Backpropagation method, which updates weights by efficiently computing gradients layer by layer (a minimal sketch follows the table). While Griewank (2012) notes that reverse-mode auto-differentiation was discovered independently several times, notably by Seppo Linnainmaa (1970) and by Paul Werbos (1981), Rumelhart's letter to Nature demonstrating its power to learn nontrivial representations gains widespread attention and unleashes a new wave of innovation in neural networks. |
1988 | Sara A. Solla, Esther Levin and Michael Fleisher, Accelerated Learning in Layered Neural Networks | What should deep networks learn? In three concurrent papers, Solla et al., John Hopfield (1987), and Eric Baum and Frank Wilczek (1988) describe the insight that neural networks should often compute log probabilities rather than arbitrary numerical scores, and that the Cross Entropy Objective is frequently more natural and more effective than squared-error minimization. (How much more effective remains an open area of research: see Hui 2021 and Golik 2013.) |
1989 | Yann Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten Digit Recognition with a Back-Propagation Network | Can a deep network learn to see? In a technical tour de force, Le Cun devises the Convolutional Neural Network (CNN), inspired and informed by Hubel and Wiesel's biological studies, and demonstrates that backpropagation can train a CNN to accurately read handwritten digits on U.S. postal mail. The work demonstrates the value of a good network architecture and proves that deep networks can solve real-world problems. |
1990 | Jeffrey L. Elman, Finding Structure in Time | Can a deep network learn language? Adopting a three-layer Recurrent Neural Network (RNN) architecture devised by Michael Jordan (1986), Elman trains an RNN to model natural language text, starting from letters. Strikingly, he finds that the network learns to represent the structure of words, grammar, and elements of semantics. |
1990 | Léon Bottou and Patrick Gallinari, A Framework for the Cooperation of Learning Algorithms | What is the right notation for neural network architecture? Bottou observes that the backpropagation algorithm allows an elegant graphical notation: instead of a graph of neurons, the network is written as a graph of computation modules that encapsulate vectorized forward and backward gradient computations (sketched in code after the table). Bottou's modular idea is the basis for deep learning libraries such as Torch (Collobert 2002), Theano (Bergstra 2010), Caffe (Jia 2014), TensorFlow (Abadi 2016) and PyTorch (Paszke 2019). |
1991 | Kurt Hornik, Approximation Capabilities of Multilayer Feedforward Networks | What functions can a deep network compute? In 1989, George Cybenko proves that typical two-layer neural networks can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons, and Hornik generalizes the result, showing that any architecture with a nonconstant bounded nonlinearity will work. Cybenko and Hornik's results show that deep networks are Universal Approximators, far more expressive than the single-layer systems analyzed by Minsky and Papert. |
1991 | Anders Krogh and John A. Hertz, A Simple Weight Decay Can Improve Generalization | How can overfitting be avoided? This paper analyzes and advocates Weight Decay, a simple regularizer originally proposed as Ridge Regression (Hoerl, 1970) that imposes a penalty on the square of the weights of a model (sketched in code after the table). Krogh analyzes this trick in neural networks, demonstrating generalization gains in single-layer and multilayer networks. |
1997 | Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory | How can long recurrences be stabilized? Without special measures, iterating an RNN many times invariably leads to gradients that vanish or explode. This paper proposes the Long Short-Term Memory (LSTM) architecture, a gated but differentiable neural memory structure that can retain state over very long sequences while keeping gradients stable. |
2003 | Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model | Can a neural network model language at scale? This paper scales a nonrecurrent neural language model to a 15-million-word training set, beating the state-of-the-art traditional language modeling methods by a large margin. Rather than using a fully recurrent network, Bengio processes a fixed window of n words and devotes a network layer to learning a position-independent Word Embedding. |
2005 | Rodrigo Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch and Itzhak Fried, Invariant Visual Representation by Single Neurons in the Human Brain | What do individual biological neurons do? In a series of remarkable experiments probing single neurons of human epilepsy patients, several Multimodal Neurons are found: individual neurons that respond selectively to very different stimuli evoking the same concept, for example a neuron that responds to a written name, sketch, photo, or costumed figure of Halle Berry but not to other people, suggesting a simple physical encoding for high-level concepts in the brain. |
2005 | Geoffrey Hinton, What Kind of Graphical Model is the Brain? | Can networks be deepened like a spin glass? In the early 2000s, neural network research is focused on the problem of scaling networks deeper than three layers. A breakthrough comes from bidirectional-link models of neural networks inspired by spin-glass physics, such as Hopfield Networks (Hopfield, 1982) and Restricted Boltzmann Machines (RBMs) (Hinton, 1983). In 2005, Hinton shows that a stack of RBMs called a Deep Belief Network can be trained efficiently one layer at a time, and in 2006, Hinton and Salakhutdinov show that layers of autoencoders can be stacked if initialized by RBMs. |
2010 | Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio and Pierre-Antoine Manzagol, Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion | Can networks be deepened with unsupervised training? The search for simpler deep network initialization methods continues, and in 2010, Vincent finds an alternative to initialization by Boltzmann machines: train each layer as a Denoising Autoencoder that must learn to remove noise added to the training data. The same group also devises the Contractive Autoencoder (Rifai, 2011), in which a gradient penalty is incorporated into the loss. |
2010 | Xavier Glorot and Yoshua Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks | Can networks be deepened with simple changes? Glorot analyzes the problems with ordinary feed-forward training and proposes Xavier Initialization, a simple random initialization scaled to avoid vanishing or exploding gradients (sketched in code after the table). In a second important development, Nair (2010) and Glorot (2011) experimentally find that Rectified Linear Units (ReLU) work much better than the sigmoid nonlinearities that had previously been ubiquitous. These simple-to-apply innovations eliminate the need for complex pretraining, so that deep feedforward networks can be trained directly, end-to-end, from scratch, using backpropagation. |
2011 | Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa, Natural Language Processing (Almost) from Scratch | Can a neural network solve language problems? Previous work in natural language processing treats the problems of chunking, part-of-speech tagging, named entity recognition, and semantic role labeling separately. Collobert claims that a single neural network can do it all at once, using a Multi-Task Objective to learn a unified representation of language for all the tasks. They find that their network learns a satisfying word embedding that groups together meaningfully related words, but the performance claims are initially met with skepticism. |
2012 | Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks | Can a neural network do state-of-the-art computer vision? Krizhevsky shocks the computer vision community with a deep convolutional network that wins the annual ImageNet classification challenge (Deng, 2009) by a large margin. Krizhevsky's AlexNet is a deep eight-layer, 60-million-parameter convolutional network that combines the latest tricks such as ReLU and Dropout (Srivastava, 2014 and Hinton, 2012), and it is trained on a pair of consumer Graphics Processing Units (GPUs). The superior performance on the high-profile large-scale benchmark sparks an explosive resurgence of interest in deep network applications. |
2013 | Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, Distributed Representations of Words and Phrases and their Compositionality | Does massive data beat a complex network? While excitement grows over the power of vector representations, Google researcher Mikolov finds that his simple (non-deep) skip-gram model (Mikolov, 2013a) can learn a good word embedding that outperforms other (deep) embeddings by a large margin when trained on a massive 30-billion-word data set. This Word2Vec model exhibits Semantic Vector Composition for the first time. Google also trains an unsupervised model on YouTube image data (Le, 2011) using a Topographic Independent Component Analysis loss (Hyvärinen 2009), and observes the emergence of individual neurons for human faces and cats. |
2013 | Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, Playing Atari with Deep Reinforcement Learning | Can a network learn to play a game from raw input? DeepMind proposes Deep Reinforcement Learning (DRL), applying neural networks directly to the Q-learning algorithm, and demonstrates that their Deep Q-Network (DQN) architecture, which predicts action values directly from state observations, can learn joystick control well enough to play several Atari games better than human players. The work inspires many other DRL methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap 2016) and Proximal Policy Optimization (PPO) (Schulman 2017), and touches off the development of Atari-capable RL testing environments like OpenAI Gym. |
2013 | Diederik P. Kingma and Max Welling, Auto-Encoding Variational Bayes | What should an autoencoder reconstruct? The Variational Autoencoder (VAE) casts the autoencoder as a variational inference problem, matching distributions rather than instances: it maximizes the Evidence Lower Bound (ELBO) on the likelihood of the data while limiting the information carried by the stochastic latent, and uses a Reparameterization Trick to train a sampling process at the bottleneck (see the Doersch tutorial, and the sketch after the table). Descendants such as Beta-VAE (Higgins 2017) can learn disentangled representations, and VQ-VAE (van den Oord 2017) can do state-of-the-art image generation. |
2013 | Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, Intriguing Properties of Neural Networks | Do artificial neural networks have bugs? Using a simple optimization, Szegedy finds that it is easy to construct Adversarial Examples: inputs that differ imperceptibly from natural inputs yet fool a deep network into misclassifying them. The observation touches off many discoveries of further attacks (e.g., Papernot 2017), defenses (Madry 2018) and evaluations (Carlini 2017). |
2014 | Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation | Can a CNN locate an object in a scene? Computer vision is concerned with not just classifying, but locating and understanding the arrangements of objects in a scene. By exploiting the spatial arrangement of CNN features, Girshick's R-CNN (and Faster R-CNN, Ren 2015) can identify not only the class of an object but also its location in a scene, via both bounding-box estimation and semantic segmentation. |
2014 | Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio, Generative Adversarial Nets | Can an adversarial objective be learned? A Generative Adversarial Network (GAN) is trained to imitate a data set by learning to synthesize examples that fool a second, adversarial model simultaneously trained to distinguish real from generated data. The elegant method sparks a wave of new theoretical work as well as a new category of highly realistic image generation methods such as DCGAN (Radford 2016), Wasserstein GAN (Arjovsky 2017), BigGAN (Brock 2019), and StyleGAN (Karras 2019). |
2014 | Jason Yosinski, Jeff Clune, Yoshua Bengio and Hod Lipson, How Transferable are Features in Deep Neural Networks? | Can network parameters be reused in another network? Transfer Learning takes layers of a pretrained network to initialize a network that is trained to solve a different problem. Yosinski shows that such Fine-Tuning can outperform training a new network from scratch, and practitioners quickly recognize that initialization with a large Pretrained Model (PTM) is a way to get a high-performance network using only a small amount of training data. |
2014 | Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks | Can people understand deep networks? One of the critiques of deep learning is that its huge models are opaque to humans. Zeiler tackles this problem by reviewing and introducing several methods for Deep Feature Visualization, which depict individual signals within a network, and Salience Mapping, which summarizes the parts of the input that most influence the outcome of the complex computation. Zeiler's goal of Explainable AI (XAI) is further developed in feature optimization methods (Olah 2017), feature dissection (Bau 2017), and salience methods such as Grad-CAM (Selvaraju 2016) and Integrated Gradients (Sundararajan 2017). |
2014 | Ilya Sutskever, Oriol Vinyals and Quoc V. Le, Sequence to Sequence Learning with Neural Networks | Can a neural network translate human languages? Sutskever applies the LSTM architecture to English-to-French translation, combining an encoder phase with an autoregressive decoder phase. This demonstration of Neural Machine Translation does not beat the state-of-the-art machine translation methods of the time, but its competitive performance establishes the feasibility of the neural approach to translation, one of the classical grand challenges of AI. |
2015 | Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate | Can a network learn its own attention? While CNNs compare adjacent pixels and RNNs examine adjacent words, sometimes the most important data dependencies are not adjacencies. Bahdanau notices this problem in the way word order changes in machine translation and proposes a learned Attention model that can compute which parts of the input are relevant to each part of the output (sketched in code after the table). This innovation dramatically improves the performance of neural machine translation, and the idea of using learnable attention proves effective for many kinds of data, including graphs (Veličković 2018) and images (Zhang 2019). |
2015 | Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization | What learning rate should be used? The Adam Optimizer adaptively chooses the step size, using smaller steps for parameters whose gradients have been larger or noisier (sketched in code after the table). Combining ideas from Momentum (Polyak 1964), Adagrad (Duchi 2011) and RMSProp (Tieleman 2012), the Adam optimizer proves very effective in practice, enabling optimization of huge models with little or no manual tuning. |
2015 | Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | How can very large networks be stabilized? Even with clever initialization, signals in very deep ReLU networks eventually grow very large or very small. Batch Normalization solves this problem by normalizing each neuron to zero mean and unit variance within every training batch (sketched in code after the table). This practical step yields huge benefits, improving training speed, network performance and stability, and enabling very large models to be trained. |
2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep Residual Learning for Image Recognition | Can backpropagation work through 100 network layers? Building on batch normalization and analyzing gradient behavior, Kaiming He introduces two new ideas: his Residual Network (ResNet) architecture uses each layer to calculate a residual that is added to the previous layer's output instead of replacing it, shortening the path taken by most gradient signals (sketched in code after the table), and his Kaiming Initialization improves upon previous initialization schemes for ReLU networks. These innovations allow his convolutional network to stack to more than 100 layers, beating state-of-the-art classification accuracy on ImageNet and squarely closing the question of whether pure feedforward training methods can go deep enough: they can. |
2015 | Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan, Show and Tell: A Neural Image Caption Generator | Can a network describe an image in words? While it may seem that visual and language data must be very different, this work treats the problem like a language translation problem by feeding the output of a convolutional network to a recurrent network language decoder. The result is the first neural Image Captioning model that can produce free-text descriptions from any given image. A followup paper (Xu 2015) improves the method by incorporating learned attention over the image. |
2016 | David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel and Demis Hassabis, Mastering the Game of Go with Deep Neural Networks and Tree Search | Can an AI beat a master at Go? With hundreds of choices at every move, Go is far more difficult to reason about than chess, which was conquered by traditional AI in 1997. David Silver's AlphaGo system applies a deep convolutional network to evaluate positions, combining the network with tree search for training and evaluation. The system stuns both computer scientists and Go players by beating master Lee Sedol in four out of five games, becoming the first computer program to play Go at a championship level and achieving the breakthrough years before experts anticipated. |
2017 | Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals, Understanding Deep Learning Requires Rethinking Generalization | Why don't deep networks overfit? According to Vapnik-Chervonenkis (VC) theory, models with too many free parameters should be Overfitting instead of Generalizing, so the standard advice would be to reduce the number of parameters. Yet Zhang demonstrates that a standard AlexNet is so overparameterized that it can literally memorize a random labeling of ImageNet without any generalization, even though the same architecture generalizes well when trained on the true labels. This observation, along with others such as Double Descent (Nakkiran 2019), leads to the ongoing question: what can replace VC theory in explaining the generalization of deep networks? |
2017 | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, Attention is All You Need | Is attention all you need? The Transformer Network discards recurrence and convolution entirely, relying on stacked self-attention and feed-forward layers. It trains far more parallelizably than RNNs, sets a new state of the art in machine translation, and goes on to become the dominant architecture for large language models and, later, vision. |
2017 | Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks | Can a GAN translate one kind of image into another? The Pix2Pix model applies a conditional GAN to paired image-to-image translation tasks such as turning sketches or label maps into photographs; CycleGAN (Zhu, 2017) extends the idea to unpaired image collections using a cycle-consistency loss. |
2018 | Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever, Improving Language Understanding by Generative Pre-Training | Can unsupervised pretraining teach language understanding? OpenAI's GPT pretrains a Transformer decoder as a generative language model on unlabeled text and then fine-tunes it on individual tasks, establishing the generative pretraining recipe that later scales up into GPT-2 and GPT-3. |
2019 | Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Can pretraining look in both directions? BERT pretrains a bidirectional Transformer encoder with a masked-language-modeling objective; after fine-tuning, it sets new state-of-the-art results across a wide range of language understanding benchmarks and makes pretrained Transformers the default starting point for NLP. |
2020 | Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi and Ren Ng, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis | Can a network represent a 3D scene? A Neural Radiance Field (NeRF) encodes a scene as a network mapping 3D position and viewing direction to color and density; trained on a set of posed photographs and rendered by volume rendering, it synthesizes strikingly realistic novel views and launches a wave of neural scene representation research. |
2020 | Jonathan Ho, Ajay Jain and Pieter Abbeel, Denoising Diffusion Probabilistic Models | Can generation be learned as denoising? A Diffusion Model generates images by learning to reverse a gradual noising process, one small denoising step at a time. The approach proves stable to train and eventually rivals and surpasses GANs in image quality (Dhariwal 2021), becoming the foundation of modern image generators. |
2021 | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever, Learning Transferable Visual Models From Natural Language Supervision | Can language supervise vision? CLIP applies Contrastive Learning to hundreds of millions of image-caption pairs, aligning image and text embeddings in a shared space. The resulting model can classify images zero-shot from text prompts and becomes the vision-language backbone for later text-to-image generators. |
2022 | Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen, Hierarchical Text-Conditional Image Generation with CLIP Latents | Can a network paint what you describe? DALL-E 2 generates images from free-form text by producing a CLIP image embedding from the caption and then decoding it with a Text-to-Image Diffusion Model, demonstrating photorealistic, controllable image generation and bringing generative models to a broad public audience. |
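
A few minimal code sketches follow for the techniques flagged in the table above. Each is an illustrative toy in plain numpy with made-up data and arbitrary sizes, not the original authors' implementation. First, the correlation matrix memory from the 1972 Kohonen entry: key/data pairs are stored as a sum of outer products, and a key recalls its data by matrix multiplication (exact recall here because the hypothetical keys are orthonormal).

```python
import numpy as np

# Hypothetical orthonormal keys and the data vectors to store under them.
keys = np.eye(4)                      # 4 keys, each a 4-dimensional activation vector
data = np.array([[1., 0., 2.],        # data vector stored under key 0
                 [0., 1., 0.],        # ... key 1
                 [3., 1., 1.],        # ... key 2
                 [0., 2., 2.]])       # ... key 3

# Store: the memory is a single weight matrix, the sum of outer products data_i key_i^T.
W = sum(np.outer(d, k) for d, k in zip(data, keys))

# Recall: multiplying the matrix by a key retrieves the associated data vector.
recalled = W @ keys[2]
print(recalled)                       # -> [3. 1. 1.], the data stored under key 2
```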
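
Next, the backpropagation method from the 1986 Rumelhart entry: a two-layer sigmoid network trained with hand-derived gradients to solve XOR, the very function a single-layer perceptron cannot learn. The hidden width, learning rate, and squared-error loss are arbitrary choices for this sketch, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)       # hidden layer parameters
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)       # output layer parameters
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the squared-error gradient back through each layer.
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    # Gradient-descent parameter updates.
    lr = 1.0
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]
```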
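
The modular forward/backward notation from the 1990 Bottou and Gallinari entry, in the style later adopted by libraries such as Torch and PyTorch: each module caches what it needs during `forward` and returns the gradient for its input from `backward`, so a network is just a list of modules walked forward and then in reverse. The class names and the tiny example network are hypothetical.

```python
import numpy as np

class Linear:
    """A module encapsulating y = x W + b together with its backward gradient computation."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out))
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out    # gradients w.r.t. the parameters
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T       # gradient passed on to the previous module

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

# A network is a chain of modules; backpropagation walks the chain in reverse.
rng = np.random.default_rng(0)
net = [Linear(4, 16, rng), ReLU(), Linear(16, 1, rng)]
x, target = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))
for layer in net:
    x = layer.forward(x)
grad = 2.0 * (x - target) / len(target)   # gradient of the mean squared error
for layer in reversed(net):
    grad = layer.backward(grad)
print(net[0].dW.shape, net[2].dW.shape)   # parameter gradients, ready for an update
```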
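
The weight decay regularizer from the 1991 Krogh and Hertz entry: the penalty (decay/2)·‖W‖² adds decay·W to the gradient, so every update shrinks each weight slightly toward zero. The learning rate and decay constant below are arbitrary.

```python
import numpy as np

def weight_decay_update(W, grad_W, lr=0.01, decay=1e-4):
    """One gradient step with the L2 penalty (decay/2)*||W||^2 folded into the gradient.

    With a zero task gradient this shrinks the weights by a factor (1 - lr*decay)
    per step, which is why the trick is called weight decay.
    """
    return W - lr * (grad_W + decay * W)

# Hypothetical example: decay alone (zero task gradient) just shrinks the weights.
W = np.ones((3, 3))
for _ in range(1000):
    W = weight_decay_update(W, np.zeros_like(W), lr=0.1, decay=0.1)
print(W[0, 0])   # close to (1 - 0.1*0.1)**1000, roughly 4e-5
```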
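
Xavier initialization from the 2010 Glorot and Bengio entry: weights are drawn with a scale set by the layer's fan-in and fan-out so that signal magnitudes stay roughly constant through a deep stack. The depth and widths below are arbitrary, and the nonlinearity is omitted so the variance-preserving scaling is easy to see.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot/Xavier uniform initialization: the scale is chosen from fan-in and
    fan-out so that signal variance is roughly preserved from layer to layer."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
x_xavier = x_naive = rng.normal(size=(64, 512))
for _ in range(20):                                       # 20 stacked linear layers
    x_xavier = x_xavier @ xavier_init(512, 512, rng)      # scaled initialization
    x_naive = x_naive @ rng.normal(0.0, 1.0, (512, 512))  # naive unit-variance initialization
print(x_xavier.std())   # stays on the order of 1
print(x_naive.std())    # explodes by roughly a factor of sqrt(512) per layer
```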
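
The reparameterization trick and the KL term of the ELBO from the 2013 Kingma and Welling entry: sampling z = mu + sigma·eps makes the stochastic bottleneck differentiable with respect to the encoder outputs. The encoder outputs below are hypothetical numbers standing in for a real encoder network.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with
    eps ~ N(0, 1), so the sample is a differentiable function of mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)), the ELBO term that limits information in the latent."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a batch of 4 inputs and a 2-dimensional latent.
rng = np.random.default_rng(0)
mu = np.array([[0.0, 1.0], [0.5, -0.5], [2.0, 0.0], [-1.0, 1.0]])
log_var = np.zeros_like(mu)
z = reparameterize(mu, log_var, rng)
print(z.shape, kl_to_standard_normal(mu, log_var))
```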
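
The learned attention idea from the 2015 Bahdanau entry, in a simplified scaled dot-product form (Bahdanau's original scores alignments with a small additive network instead): each output position weighs all input positions by relevance and takes a weighted sum. The encoder states and query below are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Soft attention: score every input position against the query, normalize the
    scores into weights, and return the weighted sum of the values."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # relevance of each position
    weights = softmax(scores)                          # attention distribution
    return weights @ values, weights

# Hypothetical example: 5 encoder states of dimension 8 and one decoder query.
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(5, 8))    # encoder hidden states
query = rng.normal(size=(8,))              # current decoder state
context, weights = attention(query, keys, values)
print(weights.round(2), context.shape)
```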
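
The Adam optimizer from the 2015 Kingma and Ba entry: exponential moving averages of the gradient and its square give per-parameter step sizes, with a bias correction for the zero-initialized averages. The toy objective ‖w‖² is only for demonstration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m and v track the gradient and its square, so the effective
    step is smaller for parameters whose gradients have been larger or noisier."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical use: minimize f(w) = ||w||^2 (gradient 2w) from a random start.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
print(np.abs(w).max())   # near zero
```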
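
Batch normalization from the 2015 Ioffe and Szegedy entry: each feature is standardized over the current batch and then rescaled by learned parameters (the running statistics used at inference time are omitted from this sketch). The badly scaled input features are synthetic.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch,
    then rescale and shift with the learned parameters gamma and beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical batch of 32 examples with 4 badly scaled features.
rng = np.random.default_rng(0)
x = rng.normal(loc=[0, 100, -5, 3], scale=[1, 50, 0.01, 7], size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly 0 and 1 per feature
```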
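
Finally, the residual connection from the 2015 He et al. entry: each block computes a correction F(x) that is added to its input, so the identity path carries the signal (and its gradients) through a very deep stack. The block width, depth, and weight scale below are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """A residual block: the layers compute a correction F(x) that is added to the
    input, so the identity path gives gradients a short route through the network."""
    return relu(x + relu(x @ W1) @ W2)

# Hypothetical stack of 100 residual blocks with small random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
for _ in range(100):
    W1 = rng.normal(0.0, 0.01, size=(64, 64))
    W2 = rng.normal(0.0, 0.01, size=(64, 64))
    x = residual_block(x, W1, W2)
print(x.std())   # the signal keeps a healthy scale after 100 blocks instead of vanishing
```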