Welcome to IE598-ODL, 2019 Spring (2nd half semester)!
The course starts on Mar 12 and ends on Apr 30.
Note: The course is usually held in 303 Transportation Building (unless otherwise notified via email).
If you want to audit (and receive notification of room), you are encouraged to email me since the classroom may not be big enough.
Instructor: Ruoyu Sun, assistant professor, Dept of ISE, CSL (affiliate) and ECE (affiliate)
Office: Transportation Bldg 209D
Office Hour: Thursdays 2-3pm, or by appointment via email
Course time/location: Tu & Th: 3:30pm – 4:50PM, Transportation Building 303
Mathematical optimization is a crucial component of deep learning. This course aims to provide an understanding of optimization in deep learning from an optimization-theory perspective, and to discuss open research questions. On the algorithms side, we will discuss various algorithms used in deep learning, existing theoretical results, and the gaps between theory and practice. On the geometry side, we will discuss the optimization landscape of neural-network problems, which is closely related to the quality of the local minima. We will also briefly discuss min-max optimization related to neural networks, including GANs and adversarial attacks. This is a theory-oriented course.
Knowledge of linear algebra, calculus, and elementary probability is necessary. Students should have taken a class in optimization and machine learning, or be able to self-learn the basics. Some exposure to computer vision and complexity theory will be helpful.
Mostly based on research papers from the past decade. Some (but not all) of the recent results may be found in the following books/monographs:
- R. Sun, “Optimization Theory for Deep Learning: An Overview”, forthcoming, 2019 Apr.
- Goodfellow, Bengio and Courville, “Deep Learning”, MIT Press, 2016.
- R. Sun, “IE 510 Applied Nonlinear Programming”, course notes of Lecture 1-15.
- S. Sra, S. Nowozin, and S. J. Wright. "Optimization for Machine Learning". MIT Press, 2012.
- Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. "Optimization methods for large-scale machine learning." arXiv preprint arXiv:1606.04838 (2016).
There will be no exam or traditional “homework”. Grades will be based on:
Attendance (10%): attend the class and participate in discussions.
Scribing (40%): Each student needs to scribe lecture notes (typically, a group of 1-2 people scribes one lecture). You need to write coherent and logical lecture notes based on the class notes and related references. Readability is a high priority. The grade for this part depends on the quality of the lecture notes. See the page "scribing instructions".
Reviewing (30%): Each student needs to review 2 lecture notes posted online, somewhat like reviewing a paper. The grade for this part depends on the quality of the reviews. See the page "scribing instructions".
Practice Problems (20%): Practice problems will be provided occasionally in class. You can choose any two problems throughout the whole course, and email the instructor your solution within 2 weeks of the time it is assigned. Each problem is worth 10 points. The email should have a title like "[IE598] Problem 3 Submission from Adam Gates [your name] 1st time".
Tentative Schedule (subject to change)
Part 0: Basics of Deep Learning. (1.5 weeks).
Lecture 1: Motivation and Syllabus. (slides)
Discussion on the "Is AI Alchemy?" example. Discussion on theory of deep learning. Syllabus.
Lecture 2: Overview of DL theory: Representation.
Decomposition of optimization error and representation error.
Lecture 3: Overview of DL theory: Generalization error. Decomposition of the test error into three parts: representation error, optimization error, and generalization error.
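In symbols, one standard way to write the three-part decomposition (following, e.g., the Bottou-Curtis-Nocedal survey listed above; the notation here is illustrative) is as an exact telescoping identity:

```latex
% F: population risk; f^*: the Bayes-optimal predictor;
% h^*: the best predictor in the network's hypothesis class;
% \hat{h}: the empirical-risk minimizer; \tilde{h}: what the optimizer returns.
\begin{align*}
F(\tilde{h}) - F(f^*)
  &= \underbrace{F(h^*) - F(f^*)}_{\text{representation error}}
   + \underbrace{F(\hat{h}) - F(h^*)}_{\text{generalization error}}
   + \underbrace{F(\tilde{h}) - F(\hat{h})}_{\text{optimization error}}
\end{align*}
```

The three terms cancel telescopically, so the identity holds exactly; the content of the theory is bounding each term separately.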
Part I: Optimization for Supervised Learning.
Lecture 4: Deep neural networks: standard form, recursive form. Back-propagation.
See Wikipedia for more discussion of back-propagation. It was independently discovered by a number of groups. Some examples are:
Rumelhart, Hinton and Williams. Learning internal representations by error propagation. No. ICS-8506. California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
Rumelhart, Hinton and Williams. "Learning representations by back-propagating errors." Nature 323 (1986): 533-536.
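As a concrete illustration of the recursive form and back-propagation (a minimal numpy sketch, not code from the course; all names and shapes are illustrative):

```python
import numpy as np

# Two-layer network f(x) = W2 @ relu(W1 @ x) with squared loss.
rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 3))
W2 = rng.normal(size=(1, 4))
x = rng.normal(size=(3, 1))
y = np.array([[1.0]])

def loss_and_grads(W1, W2, x, y):
    # Forward pass (the "recursive form": layer by layer).
    z1 = W1 @ x                    # pre-activation of layer 1
    a1 = np.maximum(z1, 0.0)       # ReLU activation
    yhat = W2 @ a1                 # network output
    loss = 0.5 * float(np.sum((yhat - y) ** 2))
    # Backward pass: propagate the error signal back through the layers.
    delta2 = yhat - y                    # dL/dyhat
    gW2 = delta2 @ a1.T                  # dL/dW2
    delta1 = (W2.T @ delta2) * (z1 > 0)  # chain rule through the ReLU
    gW1 = delta1 @ x.T                   # dL/dW1
    return loss, gW1, gW2

loss, gW1, gW2 = loss_and_grads(W1, W2, x, y)

# Sanity check against a finite difference in one coordinate.
eps = 1e-6
W1p = W1.copy()
W1p[0, 0] += eps
num_grad = (loss_and_grads(W1p, W2, x, y)[0] - loss) / eps
assert abs(num_grad - gW1[0, 0]) < 1e-4
```

The finite-difference check at the end is the standard way to debug a hand-written backward pass.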
Lecture 5: Classical convergence theory applied to neural-nets. Gradient explosion and vanishing. Lipschitz constant explosion and vanishing. Deep 1-dim linear network.
Bertsekas, Nonlinear programming, version 2 (1998).
Shamir, "Exponential convergence time of gradient descent for one-dimensional deep linear neural networks." arXiv preprint arXiv:1809.08587 (2018).
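A toy numeric illustration of the explosion/vanishing phenomenon in a deep 1-dim linear network (my own sketch, in the spirit of the Shamir paper above): in y = w_L * ... * w_1 * x, the gradient with respect to w_1 contains the product of all the other weights, so it shrinks or blows up geometrically with depth.

```python
import numpy as np

def grad_w1(w, x=1.0, y=0.0):
    # loss = 0.5 * (prod(w) * x - y)^2
    # d(loss)/d(w_1) = (prod(w) * x - y) * x * prod(w[1:])
    p = np.prod(w)
    return (p * x - y) * x * np.prod(w[1:])

depth = 50
print(grad_w1(np.full(depth, 0.9)))  # weights slightly < 1: gradient vanishes
print(grad_w1(np.full(depth, 1.1)))  # weights slightly > 1: gradient explodes
```

With 50 layers, a per-layer factor of 0.9 or 1.1 already changes the gradient magnitude by several orders of magnitude in each direction.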
Lecture 6: Taming explosion/vanishing: initialization (1). Controlling network flow.
Glorot and Bengio, AISTATS'10 Understanding the difficulty of training deep feedforward neural networks
Hanin and Rolnick, NeurIPS'18 How to Start Training: The Effect of Initialization and Architecture
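A toy illustration of the fan-in scaling idea from the Glorot-Bengio paper above (this sketch uses linear layers for simplicity, so the scaling argument is exact): with weight standard deviation 1/sqrt(fan_in), activation variance is roughly preserved across layers; smaller scales make it vanish geometrically with depth.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth, batch = 256, 30, 512
x = rng.normal(size=(batch, n))

def forward_variance(std):
    # Push a batch through `depth` random linear layers and report the
    # variance of the final activations.
    h = x
    for _ in range(depth):
        W = rng.normal(scale=std, size=(n, n))
        h = h @ W
    return h.var()

print(forward_variance(1.0 / np.sqrt(n)))  # stays roughly O(1)
print(forward_variance(0.5 / np.sqrt(n)))  # shrinks geometrically with depth
```

Each layer multiplies the variance by roughly n * std^2, which explains both regimes: the factor is 1 for std = 1/sqrt(n) and 1/4 for std = 0.5/sqrt(n).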
Lecture 7: Taming explosion/vanishing: initialization (2). Dynamical isometry.
Lecture 8: BatchNorm.
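A minimal sketch of the batch normalization operation (training-mode forward pass only; gamma, beta, eps follow the original paper's notation, and the defaults here are illustrative):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    # Normalize each feature to zero mean / unit variance over the batch,
    # then apply the learnable affine transform gamma * x_hat + beta.
    mu = x.mean(axis=0)        # per-feature batch mean
    var = x.var(axis=0)        # per-feature batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

rng = np.random.default_rng(0)
x = 5.0 + 3.0 * rng.normal(size=(64, 10))  # shifted, scaled inputs
y = batch_norm(x)
# Each output feature now has (approximately) zero mean and unit variance.
```

At test time one would instead use running estimates of the mean and variance; that part is omitted here.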
Lecture 9: ResNet and other architecture.
Lecture 10: Learning rate schedules (step-size), SGD with momentum, optimization theory and gaps to practice.
Adam, AdaGrad, RMSProp, etc. Convergence analysis. Possible gaps.
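Minimal sketches of two of the update rules above (SGD with momentum, and Adam), run on a toy ill-conditioned quadratic; the hyperparameter values are illustrative, not recommendations from the lectures:

```python
import numpy as np

A = np.diag([1.0, 100.0])        # ill-conditioned quadratic
grad = lambda w: A @ w           # gradient of f(w) = 0.5 * w @ A @ w

def sgd_momentum(w, steps=200, lr=0.01, beta=0.9):
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad(w)   # decaying accumulation of gradients
        w = w - lr * v
    return w

def adam(w, steps=200, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    m, v = np.zeros_like(w), np.zeros_like(w)
    for t in range(1, steps + 1):
        g = grad(w)
        m = b1 * m + (1 - b1) * g            # first-moment estimate
        v = b2 * v + (1 - b2) * g * g        # second-moment estimate
        mhat = m / (1 - b1 ** t)             # bias correction
        vhat = v / (1 - b2 ** t)
        w = w - lr * mhat / (np.sqrt(vhat) + eps)
    return w

w0 = np.array([1.0, 1.0])
print(np.linalg.norm(sgd_momentum(w0)), np.linalg.norm(adam(w0)))
```

Adam's per-coordinate scaling by sqrt(vhat) is what distinguishes it from plain momentum, and it is also the source of several of the theoretical gaps discussed in this lecture.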
Lecture 11: Saddle points and local minima. Matrix factorization: no bad local minima. Landscape of neural networks: bad examples. Linear networks. Over-parameterized networks.
Part II: Min-max Problems (1 week)
Lecture 12: Adversarial attack and defense methods.
Things I wish to talk about: relation between robustness and generalization (see practice problem); trade-off between robustness and generalization; relation between robustness and Lipschitz constant; certified defense methods; poison attack.
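One concrete attack in this family is the fast gradient sign method (FGSM); here is a minimal sketch on a linear classifier (the "model" below is a stand-in for illustration, not anything from the lectures). Under an l_inf perturbation budget eps, moving each coordinate by eps against the gradient of the margin is the worst case, and it lowers the margin by exactly eps times the l1 norm of the weights:

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = rng.normal(size=5), 0.0
x = rng.normal(size=5)
label = 1.0 if w @ x + b > 0 else -1.0   # the model's own (correct) label

def margin(x):
    return label * (w @ x + b)           # > 0 means correctly classified

eps = 0.5
# FGSM step: the gradient of the margin w.r.t. x is label * w, so move each
# coordinate by eps in the opposite sign.
x_adv = x - eps * np.sign(label * w)

print(margin(x), margin(x_adv))  # the margin drops by eps * ||w||_1
```

This also illustrates the connection (mentioned above) between robustness and the Lipschitz constant: the damage an l_inf-bounded attack can do is governed by a norm of the model's gradient.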
Lecture 13: GANs: various formulations and recent progress.
Things I wish to talk about: a general framework to derive GAN in the f-GAN paper; formulation of W-GAN by duality; analysis of failure modes of GAN in the BigGAN paper; sample complexity of GANs.
Original GAN paper: https://arxiv.org/pdf/1406.2661.pdf
A few tricks, and proposed the Inception score: "Improved Techniques for Training GANs"
"Are GANs Created Equal?": an interesting empirical study showing that most GANs had similar performance at that time (first available online in 2017). Table 1 gives a good summary of the various GANs mentioned in class.
ICLR'2018: spectral normalized GAN which generated impressive ImageNet images
ICLR'2019: BigGANs which improved the IS and FID by a large amount, by a collection of a few tricks and large-scale training
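For reference, the original GAN paper linked above formulates the problem as the two-player min-max game

```latex
\min_G \max_D \;
  \mathbb{E}_{x \sim p_{\mathrm{data}}}\!\big[\log D(x)\big]
+ \mathbb{E}_{z \sim p_z}\!\big[\log\big(1 - D(G(z))\big)\big]
```

where D is the discriminator and G the generator; the f-GAN and W-GAN formulations discussed in this lecture replace this particular objective with other divergences/distances between the data and generator distributions.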
Optional: Min-max optimization.
Optional: Sharp and Wide Minima.
Optional: Visualization of neural-net landscape.
Optional: Over-parameterization eases landscape of non-convex problems.
Optional: Deep networks train faster.