There are thousands of deep learning papers; where to start? Here is a curated list of the greatest hits.
Year | Author/Title | Notes |
---|---|---|
1943 | Warren S. McCulloch and Walter Pitts, A Logical Calculus of the Ideas Immanent in Nervous Activity | Is a neural network a computing machine? McCulloch and Pitts are the first to model neural networks as an abstract computational system. They find that under various assumptions, networks of neurons are as powerful as propositional logic, sparking widespread interest in neural models of computation. |
1958 | Frank Rosenblatt, The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain | Can an artificial neural network learn? Rosenblatt proposes the Perceptron Algorithm, a method for iteratively adjusting variable-weight connections between neurons to learn to solve a problem. He raises funds from the U.S. Navy to build a physical Perceptron machine. In press coverage, Rosenblatt anticipates walking, talking, self-conscious machines. |
1959 | Jerome Lettvin, Humberto Maturana, Warren McCulloch and Walter Pitts, What the Frog's Eye Tells the Frog's Brain | Do nerves transmit ideas? Lettvin provocatively proposes that the frog optic nerve signals the presence of meaningful patterns rather than just brightness, demonstrating that the eye is doing part of the computational work of vision. Lettvin is also known for his famous thought experiment that your brain might contain a Grandmother Neuron that you use to conceptualize your grandmother. |
1959 | David H. Hubel and Torsten N. Wiesel, Receptive Fields of Single Neurones in the Cat's Striate Cortex | How does biological vision work? This paper and its 1962 extension kick off a 25-year collaboration in which Hubel and Wiesel methodically analyze the processing of signals through mammalian visual systems, developing many specific insights about the operation of the Visual Cortex that later inspire and inform the design of convolutional neural networks. They win the Nobel Prize in 1981. |
1969 | Marvin Minsky and Seymour Papert, Perceptrons: An Introduction to Computational Geometry | What cannot be learned by a perceptron? During the early 1960s, while Rosenblatt argues that his neural networks can do almost anything, Minsky counters that they can do very little. This influential book lays out the negative argument, showing that many simple problems such as maze-solving or even XOR cannot be solved by a single-layer perceptron network. The sharp critique leads to one of the first AI Winter periods, during which many researchers abandon neural networks. |
1972 | Teuvo Kohonen, Correlation Matrix Memories | Can a neural network store memories? Kohonen (and, simultaneously, Anderson) observes that a single-layer network can act as a matrix Associative Memory if keys and data are seen as vectors of neural activations and the keys are linearly independent (sketched in code after the table). Associative memory becomes a major focus of neural network research in the coming decades. |
1981 | Geoffrey E. Hinton, Implementing Semantic Networks in Parallel Hardware | How are concepts represented? In a book on associative memory edited with Anderson, Hinton proposes that concepts should not be represented as single units but as vectors of activations, and he demonstrates a scheme that encodes complex relationships in a distributed fashion. Distributed representation becomes a core tenet of the Parallel Distributed Processing (PDP) framework, advanced in a book by Rumelhart, McClelland, and Hinton (1986), and a central dogma in the understanding of large neural networks. |
1986 | David E. Rumelhart, Geoffrey E. Hinton and Ronald J. Williams, Learning Representations by Back-Propagating Errors | How can a deep network learn? Learning in multilayer networks was not widely understood until this paper's explanation of the Backpropagation method, which updates weights by efficiently computing gradients layer by layer (a minimal sketch follows the table). While Griewank (2012) notes that reverse-mode auto-differentiation was discovered independently several times, notably by Seppo Linnainmaa (1970) and by Paul Werbos (1981), Rumelhart's letter to Nature demonstrating its power to learn nontrivial representations gains widespread attention and unleashes a new wave of innovation in neural networks. |
1988 | Sara A. Solla, Esther Levin and Michael Fleisher, Accelerated Learning in Layered Neural Networks | What should deep networks learn? In three concurrent papers, Solla et al., John Hopfield (1987), and Eric Baum and Frank Wilczek (1988) describe the insight that neural networks should often compute log probabilities rather than arbitrary numerical scores, and that the Cross Entropy Objective is frequently more natural and more effective than squared-error minimization. (How much more effective remains an open area of research: see Hui 2021 and Golik 2013.) |
1989 | Yann Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard and L. D. Jackel, Handwritten Digit Recognition with a Back-Propagation Network | Can a deep network learn to see? In a technical tour de force, Le Cun devises the Convolutional Neural Network (CNN), inspired and informed by Hubel and Wiesel's biological studies, and demonstrates that backpropagation can train a CNN to accurately read handwritten digits on U.S. postal mail. The work demonstrates the value of a good network architecture and proves that deep networks can solve real-world problems. |
1990 | Jeffrey L. Elman, Finding Structure in Time | Can a deep network learn language? Adopting a three-layer Recurrent Neural Network (RNN) architecture devised by Michael Jordan (1986), Elman trains an RNN to model natural language text, starting from letters. Strikingly, he finds that the network learns to represent the structure of words, grammar, and elements of semantics. |
1990 | Léon Bottou and Patrick Gallinari, A Framework for the Cooperation of Learning Algorithms | What is the right notation for neural network architecture? Bottou observes that the backpropagation algorithm allows an elegant graphical notation: instead of a graph of neurons, the network is written as a graph of computation modules that encapsulate vectorized forward and backward gradient computations (sketched in code after the table). Bottou's modular idea is the basis for deep learning libraries such as Torch (Collobert 2002), Theano (Bergstra 2010), Caffe (Jia 2014), TensorFlow (Abadi 2016) and PyTorch (Paszke 2019). |
1991 | Kurt Hornik, Approximation Capabilities of Multilayer Feedforward Networks | What functions can a deep network compute? In 1989, George Cybenko proves that typical two-layer neural networks can approximate any continuous function on a compact domain to arbitrary accuracy, given enough neurons, and Hornik generalizes the result, showing that any architecture with a nonconstant bounded nonlinearity will work. Cybenko and Hornik's results show that deep networks are Universal Approximators, far more expressive than the single-layer systems analyzed by Minsky and Papert. |
1991 | Anders Krogh and John A. Hertz, A Simple Weight Decay Can Improve Generalization | How can overfitting be avoided? This paper analyzes and advocates Weight Decay, a simple regularizer originally proposed as Ridge Regression (Hoerl, 1970) that imposes a penalty on the square of the weights of a model (sketched in code after the table). Krogh analyzes this trick in neural networks, demonstrating generalization gains in single-layer and multilayer networks. |
1997 | Sepp Hochreiter and Jürgen Schmidhuber, Long Short-Term Memory | How can long recurrences be stabilized? Without special measures, iterating an RNN many times invariably leads to gradients that vanish or explode. This paper proposes the Long Short-Term Memory (LSTM) architecture, a gated but differentiable neural memory structure that can retain state over very long sequences while keeping gradients stable. |
2003 | Yoshua Bengio, Réjean Ducharme, Pascal Vincent and Christian Jauvin, A Neural Probabilistic Language Model | Can a neural network model language at scale? This paper scales a nonrecurrent neural language model to a 15-million-word training set, beating the state-of-the-art traditional language modeling methods by a large margin. Rather than using a fully recurrent network, Bengio processes a fixed window of n words and devotes a network layer to learning a position-independent Word Embedding. |
2005 | Rodrigo Quian Quiroga, Leila Reddy, Gabriel Kreiman, Christof Koch and Itzhak Fried, Invariant Visual Representation by Single Neurons in the Human Brain | What do individual biological neurons do? In a series of remarkable experiments probing single neurons of human epilepsy patients, several Multimodal Neurons are found: individual neurons that respond selectively to very different stimuli evoking the same concept, for example a neuron that responds to a written name, sketch, photo, or costumed figure of Halle Berry but not to other people, suggesting a simple physical encoding for high-level concepts in the brain. |
2005 | Geoffrey Hinton, What Kind of Graphical Model is the Brain? | Can networks be deepened like a spin glass? In the early 2000s, neural network research is focused on the problem of scaling networks deeper than three layers. A breakthrough comes from bidirectional-link models of neural networks inspired by spin-glass physics, such as Hopfield Networks (Hopfield, 1982) and Restricted Boltzmann Machines (RBMs) (Hinton, 1983). In 2005, Hinton shows that a stack of RBMs called a Deep Belief Network can be trained efficiently one layer at a time, and in 2006, Hinton and Salakhutdinov show that layers of autoencoders can be stacked if initialized by RBMs. |
2010 | Pascal Vincent, Hugo Larochelle, Isabelle Lajoie, Yoshua Bengio and Pierre-Antoine Manzagol, Stacked Denoising Autoencoders: Learning Useful Representations in a Deep Network with a Local Denoising Criterion | Can networks be deepened with unsupervised training? The search for simpler deep network initialization methods continues, and in 2010, Vincent finds an alternative to initialization by Boltzmann machines: train each layer as a Denoising Autoencoder that must learn to remove noise added to the training data. The same group also devises the Contractive Autoencoder (Rifai, 2011), in which a gradient penalty is incorporated into the loss. |
2010 | Xavier Glorot and Yoshua Bengio, Understanding the Difficulty of Training Deep Feedforward Neural Networks | Can networks be deepened with simple changes? Glorot analyzes the problems with ordinary feed-forward training and proposes Xavier Initialization, a simple random initialization scaled to avoid vanishing or exploding gradients (sketched in code after the table). In a second important development, Nair (2010) and Glorot (2011) experimentally find that Rectified Linear Units (ReLU) work much better than the sigmoid nonlinearities that had previously been ubiquitous. These simple-to-apply innovations eliminate the need for complex pretraining, so that deep feedforward networks can be trained directly, end-to-end, from scratch, using backpropagation. |
2011 | Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu and Pavel Kuksa, Natural Language Processing (Almost) from Scratch | Can a neural network solve language problems? Previous work in natural language processing treats the problems of chunking, part-of-speech tagging, named entity recognition, and semantic role labeling separately. Collobert claims that a single neural network can do it all at once, using a Multi-Task Objective to learn a unified representation of language for all the tasks. They find that their network learns a satisfying word embedding that groups together meaningfully related words, but the performance claims are initially met with skepticism. |
2012 | Alex Krizhevsky, Ilya Sutskever and Geoffrey E. Hinton, ImageNet Classification with Deep Convolutional Neural Networks | Can a neural network do state-of-the-art computer vision? Krizhevsky shocks the computer vision community with a deep convolutional network that wins the annual ImageNet classification challenge (Deng, 2009) by a large margin. Krizhevsky's AlexNet is a deep eight-layer, 60-million-parameter convolutional network that combines the latest tricks such as ReLU and Dropout (Srivastava, 2014 and Hinton, 2012), and it is trained on a pair of consumer Graphics Processing Units (GPUs). The superior performance on the high-profile large-scale benchmark sparks an explosive resurgence of interest in deep network applications. |
2013 | Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado and Jeffrey Dean, Distributed Representations of Words and Phrases and their Compositionality | Does massive data beat a complex network? While excitement grows over the power of vector representations, Google researcher Mikolov finds that his simple (non-deep) skip-gram model (Mikolov, 2013a) can learn a good word embedding that outperforms other (deep) embeddings by a large margin when trained on a massive 30-billion-word data set. This Word2Vec model exhibits Semantic Vector Composition for the first time. Google also trains an unsupervised model on YouTube image data (Le, 2011) using a Topographic Independent Component Analysis loss (Hyvärinen 2009), and observes the emergence of individual neurons for human faces and cats. |
2013 | Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra and Martin Riedmiller, Playing Atari with Deep Reinforcement Learning | Can a network learn to play a game from raw input? DeepMind proposes Deep Reinforcement Learning (DRL), applying neural networks directly to the Q-learning algorithm, and demonstrates that their Deep Q-Network (DQN) architecture, which predicts action values directly from state observations, can learn joystick control well enough to play several Atari games better than human players. The work inspires many other DRL methods such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap 2016) and Proximal Policy Optimization (PPO) (Schulman 2017), and touches off the development of Atari-capable RL testing environments like OpenAI Gym. |
2013 | Diederik P. Kingma and Max Welling, Auto-Encoding Variational Bayes | What should an autoencoder reconstruct? The Variational Autoencoder (VAE) casts the autoencoder as a variational inference problem, matching distributions rather than instances: it maximizes the Evidence Lower Bound (ELBO) on the likelihood of the data while limiting the information carried by the stochastic latent, and uses a Reparameterization Trick to train a sampling process at the bottleneck (see the Doersch tutorial, and the sketch after the table). Descendants such as Beta-VAE (Higgins 2017) can learn disentangled representations, and VQ-VAE (van den Oord 2017) can do state-of-the-art image generation. |
2013 | Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian Goodfellow and Rob Fergus, Intriguing Properties of Neural Networks | Do artificial neural networks have bugs? Using a simple optimization, Szegedy finds that it is easy to construct Adversarial Examples: inputs that differ imperceptibly from natural inputs yet fool a deep network into misclassifying them. The observation touches off many discoveries of further attacks (e.g., Papernot 2017), defenses (Madry 2018) and evaluations (Carlini 2017). |
2014 | Ross Girshick, Jeff Donahue, Trevor Darrell and Jitendra Malik, Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation | Can a CNN locate an object in a scene? Computer vision is concerned with not just classifying, but locating and understanding the arrangements of objects in a scene. By exploiting the spatial arrangement of CNN features, Girshick's R-CNN (and Faster R-CNN, Ren 2015) can identify not only the class of an object but also its location in a scene, via both bounding-box estimation and semantic segmentation. |
2014 | Ian J. Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville and Yoshua Bengio, Generative Adversarial Nets | Can an adversarial objective be learned? A Generative Adversarial Network (GAN) is trained to imitate a data set by learning to synthesize examples that fool a second, adversarial model simultaneously trained to distinguish real from generated data. The elegant method sparks a wave of new theoretical work as well as a new category of highly realistic image generation methods such as DCGAN (Radford 2016), Wasserstein GAN (Arjovsky 2017), BigGAN (Brock 2019), and StyleGAN (Karras 2019). |
2014 | Jason Yosinski, Jeff Clune, Yoshua Bengio and Hod Lipson, How Transferable are Features in Deep Neural Networks? | Can network parameters be reused in another network? Transfer Learning takes layers of a pretrained network to initialize a network that is trained to solve a different problem. Yosinski shows that such Fine-Tuning can outperform training a new network from scratch, and practitioners quickly recognize that initialization with a large Pretrained Model (PTM) is a way to get a high-performance network using only a small amount of training data. |
2014 | Matthew D. Zeiler and Rob Fergus, Visualizing and Understanding Convolutional Networks | Can people understand deep networks? One of the critiques of deep learning is that its huge models are opaque to humans. Zeiler tackles this problem by reviewing and introducing several methods for Deep Feature Visualization, which depict individual signals within a network, and Salience Mapping, which summarizes the parts of the input that most influence the outcome of the complex computation. Zeiler's goal of Explainable AI (XAI) is further developed in feature optimization methods (Olah 2017), feature dissection (Bau 2017), and salience methods such as Grad-CAM (Selvaraju 2016) and Integrated Gradients (Sundararajan 2017). |
2014 | Ilya Sutskever, Oriol Vinyals and Quoc V. Le, Sequence to Sequence Learning with Neural Networks | Can a neural network translate human languages? Sutskever applies the LSTM architecture to English-to-French translation, combining an encoder phase with an autoregressive decoder phase. This demonstration of Neural Machine Translation does not beat the state-of-the-art machine translation methods of the time, but its competitive performance establishes the feasibility of the neural approach to translation, one of the classical grand challenges of AI. |
2015 | Dzmitry Bahdanau, Kyunghyun Cho and Yoshua Bengio, Neural Machine Translation by Jointly Learning to Align and Translate | Can a network learn its own attention? While CNNs compare adjacent pixels and RNNs examine adjacent words, sometimes the most important data dependencies are not adjacencies. Bahdanau notices this problem in the way word order changes in machine translation and proposes a learned Attention model that can compute which parts of the input are relevant to each part of the output (sketched in code after the table). This innovation dramatically improves the performance of neural machine translation, and the idea of using learnable attention proves effective for many kinds of data, including graphs (Veličković 2018) and images (Zhang 2019). |
2015 | Diederik P. Kingma and Jimmy Ba, Adam: A Method for Stochastic Optimization | What learning rate should be used? The Adam Optimizer adaptively chooses the step size, using smaller steps for parameters whose gradients have been larger or noisier (sketched in code after the table). Combining ideas from Momentum (Polyak 1964), Adagrad (Duchi 2011) and RMSProp (Tieleman 2012), the Adam optimizer proves very effective in practice, enabling optimization of huge models with little or no manual tuning. |
2015 | Sergey Ioffe and Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift | How can very large networks be stabilized? Even with clever initialization, signals in very deep ReLU networks eventually grow very large or very small. Batch Normalization solves this problem by normalizing each neuron to zero mean and unit variance within every training batch (sketched in code after the table). This practical step yields huge benefits, improving training speed, network performance and stability, and enabling very large models to be trained. |
2015 | Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun, Deep Residual Learning for Image Recognition | Can backpropagation work through 100 network layers? Building on batch normalization and analyzing gradient behavior, Kaiming He introduces two new ideas: his Residual Network (ResNet) architecture uses each layer to calculate a residual that is added to the previous layer's output instead of replacing it, shortening the path taken by most gradient signals (sketched in code after the table), and his Kaiming Initialization improves upon previous initialization schemes for ReLU networks. These innovations allow his convolutional network to stack to more than 100 layers, beating state-of-the-art classification accuracy on ImageNet and squarely closing the question of whether pure feedforward training methods can go deep enough: they can. |
2015 | Oriol Vinyals, Alexander Toshev, Samy Bengio and Dumitru Erhan, Show and Tell: A Neural Image Caption Generator | Can a network describe an image in words? While it may seem that visual and language data must be very different, this work treats the problem like a language translation problem by feeding the output of a convolutional network to a recurrent network language decoder. The result is the first neural Image Captioning model that can produce free-text descriptions from any given image. A followup paper (Xu 2015) improves the method by incorporating learned attention over the image. |
2016 | David Silver, Aja Huang, Chris J. Maddison, Arthur Guez, Laurent Sifre, George van den Driessche, Julian Schrittwieser, Ioannis Antonoglou, Veda Panneershelvam, Marc Lanctot, Sander Dieleman, Dominik Grewe, John Nham, Nal Kalchbrenner, Ilya Sutskever, Timothy Lillicrap, Madeleine Leach, Koray Kavukcuoglu, Thore Graepel and Demis Hassabis, Mastering the Game of Go with Deep Neural Networks and Tree Search | Can an AI beat a master at Go? With hundreds of choices at every move, Go is far more difficult to reason about than chess, which was conquered by traditional AI in 1997. David Silver's AlphaGo system applies a deep convolutional network to evaluate positions, combining the network with tree search for training and evaluation. The system stuns both computer scientists and Go players by beating master Lee Sedol in four out of five games, becoming the first computer program to play Go at a championship level and achieving the breakthrough years before experts anticipated. |
2017 | Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht and Oriol Vinyals, Understanding Deep Learning Requires Rethinking Generalization | Why don't deep networks overfit? According to Vapnik-Chervonenkis (VC) theory, models with too many free parameters should be Overfitting instead of Generalizing, so the standard advice would be to reduce the number of parameters. Yet Zhang demonstrates that a standard AlexNet is so overparameterized that it can literally memorize a random labeling of ImageNet without any generalization, even though the same architecture generalizes well when trained on the true labels. This observation, along with others such as Double Descent (Nakkiran 2019), leads to the ongoing question: what can replace VC theory in explaining the generalization of deep networks? |
2017 | Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser and Illia Polosukhin, Attention is All You Need | Is attention all you need? The Transformer Network discards recurrence and convolution entirely, relying on stacked self-attention and feed-forward layers. It trains far more parallelizably than RNNs, sets a new state of the art in machine translation, and goes on to become the dominant architecture for large language models and, later, vision. |
2017 | Phillip Isola, Jun-Yan Zhu, Tinghui Zhou and Alexei A. Efros, Image-to-Image Translation with Conditional Adversarial Networks | Can a GAN translate one kind of image into another? The Pix2Pix model applies a conditional GAN to paired image-to-image translation tasks such as turning sketches or label maps into photographs; CycleGAN (Zhu, 2017) extends the idea to unpaired image collections using a cycle-consistency loss. |
2018 | Alec Radford, Karthik Narasimhan, Tim Salimans and Ilya Sutskever, Improving Language Understanding by Generative Pre-Training | Can unsupervised pretraining teach language understanding? OpenAI's GPT pretrains a Transformer decoder as a generative language model on unlabeled text and then fine-tunes it on individual tasks, establishing the generative pretraining recipe that later scales up into GPT-2 and GPT-3. |
2019 | Jacob Devlin, Ming-Wei Chang, Kenton Lee and Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding | Can pretraining look in both directions? BERT pretrains a bidirectional Transformer encoder with a masked-language-modeling objective; after fine-tuning, it sets new state-of-the-art results across a wide range of language understanding benchmarks and makes pretrained Transformers the default starting point for NLP. |
2020 | Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi and Ren Ng, NeRF: Representing Scenes as Neural Radiance Fields for View Synthesis | Can a network represent a 3D scene? A Neural Radiance Field (NeRF) encodes a scene as a network mapping 3D position and viewing direction to color and density; trained on a set of posed photographs and rendered by volume rendering, it synthesizes strikingly realistic novel views and launches a wave of neural scene representation research. |
2020 | Jonathan Ho, Ajay Jain and Pieter Abbeel, Denoising Diffusion Probabilistic Models | Can generation be learned as denoising? A Diffusion Model generates images by learning to reverse a gradual noising process, one small denoising step at a time. The approach proves stable to train and eventually rivals and surpasses GANs in image quality (Dhariwal 2021), becoming the foundation of modern image generators. |
2021 | Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger and Ilya Sutskever, Learning Transferable Visual Models From Natural Language Supervision | Can language supervise vision? CLIP applies Contrastive Learning to hundreds of millions of image-caption pairs, aligning image and text embeddings in a shared space. The resulting model can classify images zero-shot from text prompts and becomes the vision-language backbone for later text-to-image generators. |
2022 | Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu and Mark Chen, Hierarchical Text-Conditional Image Generation with CLIP Latents | Can a network paint what you describe? DALL-E 2 generates images from free-form text by producing a CLIP image embedding from the caption and then decoding it with a Text-to-Image Diffusion Model, demonstrating photorealistic, controllable image generation and bringing generative models to a broad public audience. |
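
A few minimal code sketches follow for the techniques flagged in the table above. Each is an illustrative toy in plain numpy with made-up data and arbitrary sizes, not the original authors' implementation. First, the correlation matrix memory from the 1972 Kohonen entry: key/data pairs are stored as a sum of outer products, and a key recalls its data by matrix multiplication (exact recall here because the hypothetical keys are orthonormal).

```python
import numpy as np

# Hypothetical orthonormal keys and the data vectors to store under them.
keys = np.eye(4)                      # 4 keys, each a 4-dimensional activation vector
data = np.array([[1., 0., 2.],        # data vector stored under key 0
                 [0., 1., 0.],        # ... key 1
                 [3., 1., 1.],        # ... key 2
                 [0., 2., 2.]])       # ... key 3

# Store: the memory is a single weight matrix, the sum of outer products data_i key_i^T.
W = sum(np.outer(d, k) for d, k in zip(data, keys))

# Recall: multiplying the matrix by a key retrieves the associated data vector.
recalled = W @ keys[2]
print(recalled)                       # -> [3. 1. 1.], the data stored under key 2
```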
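
Next, the backpropagation method from the 1986 Rumelhart entry: a two-layer sigmoid network trained with hand-derived gradients to solve XOR, the very function a single-layer perceptron cannot learn. The hidden width, learning rate, and squared-error loss are arbitrary choices for this sketch, and convergence depends on the random initialization.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1, b1 = rng.normal(0.0, 1.0, (2, 8)), np.zeros(8)       # hidden layer parameters
W2, b2 = rng.normal(0.0, 1.0, (8, 1)), np.zeros(1)       # output layer parameters
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

for step in range(5000):
    # Forward pass.
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # Backward pass: propagate the squared-error gradient back through each layer.
    d_out = (out - y) * out * (1.0 - out)
    d_h = (d_out @ W2.T) * h * (1.0 - h)
    # Gradient-descent parameter updates.
    lr = 1.0
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(2).ravel())   # should approach [0, 1, 1, 0]
```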
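
The modular forward/backward notation from the 1990 Bottou and Gallinari entry, in the style later adopted by libraries such as Torch and PyTorch: each module caches what it needs during `forward` and returns the gradient for its input from `backward`, so a network is just a list of modules walked forward and then in reverse. The class names and the tiny example network are hypothetical.

```python
import numpy as np

class Linear:
    """A module encapsulating y = x W + b together with its backward gradient computation."""
    def __init__(self, n_in, n_out, rng):
        self.W = rng.normal(0.0, 1.0 / np.sqrt(n_in), (n_in, n_out))
        self.b = np.zeros(n_out)
    def forward(self, x):
        self.x = x                       # cache the input for the backward pass
        return x @ self.W + self.b
    def backward(self, grad_out):
        self.dW = self.x.T @ grad_out    # gradients w.r.t. the parameters
        self.db = grad_out.sum(axis=0)
        return grad_out @ self.W.T       # gradient passed on to the previous module

class ReLU:
    def forward(self, x):
        self.mask = x > 0
        return x * self.mask
    def backward(self, grad_out):
        return grad_out * self.mask

# A network is a chain of modules; backpropagation walks the chain in reverse.
rng = np.random.default_rng(0)
net = [Linear(4, 16, rng), ReLU(), Linear(16, 1, rng)]
x, target = rng.normal(size=(8, 4)), rng.normal(size=(8, 1))
for layer in net:
    x = layer.forward(x)
grad = 2.0 * (x - target) / len(target)   # gradient of the mean squared error
for layer in reversed(net):
    grad = layer.backward(grad)
print(net[0].dW.shape, net[2].dW.shape)   # parameter gradients, ready for an update
```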
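
The weight decay regularizer from the 1991 Krogh and Hertz entry: the penalty (decay/2)·‖W‖² adds decay·W to the gradient, so every update shrinks each weight slightly toward zero. The learning rate and decay constant below are arbitrary.

```python
import numpy as np

def weight_decay_update(W, grad_W, lr=0.01, decay=1e-4):
    """One gradient step with the L2 penalty (decay/2)*||W||^2 folded into the gradient.

    With a zero task gradient this shrinks the weights by a factor (1 - lr*decay)
    per step, which is why the trick is called weight decay.
    """
    return W - lr * (grad_W + decay * W)

# Hypothetical example: decay alone (zero task gradient) just shrinks the weights.
W = np.ones((3, 3))
for _ in range(1000):
    W = weight_decay_update(W, np.zeros_like(W), lr=0.1, decay=0.1)
print(W[0, 0])   # close to (1 - 0.1*0.1)**1000, roughly 4e-5
```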
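
Xavier initialization from the 2010 Glorot and Bengio entry: weights are drawn with a scale set by the layer's fan-in and fan-out so that signal magnitudes stay roughly constant through a deep stack. The depth and widths below are arbitrary, and the nonlinearity is omitted so the variance-preserving scaling is easy to see.

```python
import numpy as np

def xavier_init(n_in, n_out, rng):
    """Glorot/Xavier uniform initialization: the scale is chosen from fan-in and
    fan-out so that signal variance is roughly preserved from layer to layer."""
    limit = np.sqrt(6.0 / (n_in + n_out))
    return rng.uniform(-limit, limit, size=(n_in, n_out))

rng = np.random.default_rng(0)
x_xavier = x_naive = rng.normal(size=(64, 512))
for _ in range(20):                                       # 20 stacked linear layers
    x_xavier = x_xavier @ xavier_init(512, 512, rng)      # scaled initialization
    x_naive = x_naive @ rng.normal(0.0, 1.0, (512, 512))  # naive unit-variance initialization
print(x_xavier.std())   # stays on the order of 1
print(x_naive.std())    # explodes by roughly a factor of sqrt(512) per layer
```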
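
The reparameterization trick and the KL term of the ELBO from the 2013 Kingma and Welling entry: sampling z = mu + sigma·eps makes the stochastic bottleneck differentiable with respect to the encoder outputs. The encoder outputs below are hypothetical numbers standing in for a real encoder network.

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Reparameterization trick: draw z ~ N(mu, sigma^2) as z = mu + sigma * eps with
    eps ~ N(0, 1), so the sample is a differentiable function of mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def kl_to_standard_normal(mu, log_var):
    """KL(q(z|x) || N(0, I)), the ELBO term that limits information in the latent."""
    return 0.5 * np.sum(np.exp(log_var) + mu**2 - 1.0 - log_var, axis=-1)

# Hypothetical encoder outputs for a batch of 4 inputs and a 2-dimensional latent.
rng = np.random.default_rng(0)
mu = np.array([[0.0, 1.0], [0.5, -0.5], [2.0, 0.0], [-1.0, 1.0]])
log_var = np.zeros_like(mu)
z = reparameterize(mu, log_var, rng)
print(z.shape, kl_to_standard_normal(mu, log_var))
```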
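
The learned attention idea from the 2015 Bahdanau entry, in a simplified scaled dot-product form (Bahdanau's original scores alignments with a small additive network instead): each output position weighs all input positions by relevance and takes a weighted sum. The encoder states and query below are random placeholders.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(query, keys, values):
    """Soft attention: score every input position against the query, normalize the
    scores into weights, and return the weighted sum of the values."""
    scores = keys @ query / np.sqrt(query.shape[-1])   # relevance of each position
    weights = softmax(scores)                          # attention distribution
    return weights @ values, weights

# Hypothetical example: 5 encoder states of dimension 8 and one decoder query.
rng = np.random.default_rng(0)
keys = values = rng.normal(size=(5, 8))    # encoder hidden states
query = rng.normal(size=(8,))              # current decoder state
context, weights = attention(query, keys, values)
print(weights.round(2), context.shape)
```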
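
The Adam optimizer from the 2015 Kingma and Ba entry: exponential moving averages of the gradient and its square give per-parameter step sizes, with a bias correction for the zero-initialized averages. The toy objective ‖w‖² is only for demonstration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update: m and v track the gradient and its square, so the effective
    step is smaller for parameters whose gradients have been larger or noisier."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)            # bias correction for zero initialization
    v_hat = v / (1 - beta2**t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Hypothetical use: minimize f(w) = ||w||^2 (gradient 2w) from a random start.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    w, m, v = adam_step(w, 2.0 * w, m, v, t, lr=0.01)
print(np.abs(w).max())   # near zero
```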
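
Batch normalization from the 2015 Ioffe and Szegedy entry: each feature is standardized over the current batch and then rescaled by learned parameters (the running statistics used at inference time are omitted from this sketch). The badly scaled input features are synthetic.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Normalize each feature to zero mean and unit variance over the batch,
    then rescale and shift with the learned parameters gamma and beta."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Hypothetical batch of 32 examples with 4 badly scaled features.
rng = np.random.default_rng(0)
x = rng.normal(loc=[0, 100, -5, 3], scale=[1, 50, 0.01, 7], size=(32, 4))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=0).round(3), y.std(axis=0).round(3))   # roughly 0 and 1 per feature
```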
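
Finally, the residual connection from the 2015 He et al. entry: each block computes a correction F(x) that is added to its input, so the identity path carries the signal (and its gradients) through a very deep stack. The block width, depth, and weight scale below are arbitrary.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def residual_block(x, W1, W2):
    """A residual block: the layers compute a correction F(x) that is added to the
    input, so the identity path gives gradients a short route through the network."""
    return relu(x + relu(x @ W1) @ W2)

# Hypothetical stack of 100 residual blocks with small random weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(8, 64))
for _ in range(100):
    W1 = rng.normal(0.0, 0.01, size=(64, 64))
    W2 = rng.normal(0.0, 0.01, size=(64, 64))
    x = residual_block(x, W1, W2)
print(x.std())   # the signal keeps a healthy scale after 100 blocks instead of vanishing
```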