Abadi et al., 2016

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … others. (2016). Tensorflow: a system for large-scale machine learning. 12th USENIX symposium on operating systems design and implementation (OSDI 16) (pp. 265–283).

Abdel-Hamid et al., 2014

Abdel-Hamid, O., Mohamed, A.-r., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on audio, speech, and language processing, 22(10), 1533–1545.

Ahmed et al., 2012

Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in latent variable models. Proceedings of the fifth ACM international conference on Web search and data mining (pp. 123–132).

Aji & McEliece, 2000

Aji, S. M., & McEliece, R. J. (2000). The generalized distributive law. IEEE transactions on Information Theory, 46(2), 325–343.

Alayrac et al., 2022

Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … others. (2022). Flamingo: a visual language model for few-shot learning. arXiv preprint arXiv:2204.14198.

Alsallakh et al., 2020

Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the pad–cnns can develop blind spots. arXiv preprint arXiv:2010.02178.

Aronszajn, 1950

Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American mathematical society, 68(3), 337–404.

Ba et al., 2016

Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. arXiv preprint arXiv:1607.06450.

Baevski & Auli, 2018

Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. International Conference on Learning Representations.

Bahdanau et al., 2014

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.

Bay et al., 2006

Bay, H., Tuytelaars, T., & Van Gool, L. (2006). Surf: speeded up robust features. European conference on computer vision (pp. 404–417).

Bengio et al., 2003

Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of machine learning research, 3(Feb), 1137–1155.

Bergstra et al., 2010

Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a cpu and gpu math compiler in python. Proc. 9th python in science conf (pp. 3–10).

Beutel et al., 2014

Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). Cobafi: collaborative bayesian filtering. Proceedings of the 23rd international conference on World wide web (pp. 97–108).

Bishop, 1995

Bishop, C. M. (1995). Training with noise is equivalent to tikhonov regularization. Neural computation, 7(1), 108–116.

Bishop, 2006

Bishop, C. M. (2006). Pattern recognition and machine learning. springer.

Black & Scholes, 1973

Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. The Journal of Political Economy, pp. 637–654.

Bodla et al., 2017

Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-nms–improving object detection with one line of code. Proceedings of the IEEE international conference on computer vision (pp. 5561–5569).

Bojanowski et al., 2017

Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.

Bollobas, 1999

Bollobás, B. (1999). Linear analysis. Cambridge University Press, Cambridge.

Bottou, 2010

Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.

Bottou & Le Cun, 1988

Bottou, L., & Le Cun, Y. (1988). Sn: a simulator for connectionist models. Proceedings of NeuroNimes 88 (pp. 371–382). Nimes, France.

Boucheron et al., 2005a

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: probability and statistics, 9, 323–375.

Boucheron et al., 2005b

Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: probability and statistics, 9, 323–375.

Bowman et al., 2015

Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. arXiv preprint arXiv:1508.05326.

Boyd & Vandenberghe, 2004

Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge University Press.

Bradley & Terry, 1952

Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: i. the method of paired comparisons. Biometrika, 39(3/4), 324–345.

Brown & Sandholm, 2017

Brown, N., & Sandholm, T. (2017). Libratus: the superhuman ai for no-limit poker. IJCAI (pp. 5226–5228).

Brown et al., 1990

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S. (1990). A statistical approach to machine translation. Computational linguistics, 16(2), 79–85.

Brown et al., 1988

Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin, P. (1988). A statistical approach to language translation. Coling Budapest 1988 Volume 1: International Conference on Computational Linguistics.

Brown et al., 2020

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … others. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877–1901.

Buslaev et al., 2020

Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: fast and flexible image augmentations. Information, 11(2), 125.

Campbell et al., 2002

Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep blue. Artificial intelligence, 134(1-2), 57–83.

Canny, 1987

Canny, J. (1987). A computational approach to edge detection. Readings in computer vision (pp. 184–203). Elsevier.

Cer et al., 2017

Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). Semeval-2017 task 1: semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).

Cheng et al., 2016

Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–561).

Chetlur et al., 2014

Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). Cudnn: efficient primitives for deep learning. arXiv preprint arXiv:1410.0759.

Cho et al., 2014a

Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: encoder-decoder approaches. arXiv preprint arXiv:1409.1259.

Cho et al., 2014b

Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using rnn encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078.

Chowdhery et al., 2022

Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … others. (2022). Palm: scaling language modeling with pathways. arXiv preprint arXiv:2204.02311.

Chung et al., 2014

Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555.

Clark et al., 2020

Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). Electra: pre-training text encoders as discriminators rather than generators. International Conference on Learning Representations.

Collobert et al., 2011

Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of machine learning research, 12, 2493–2537.

Cordonnier et al., 2020

Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. International Conference on Learning Representations.

Cover & Thomas, 1999

Cover, T., & Thomas, J. A. (1999). Elements of information theory. John Wiley & Sons.

Csiszar, 2008

Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.

Cybenko, 1989

Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of control, signals and systems, 2(4), 303–314.

Dalal & Triggs, 2005

Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (pp. 886–893).

DeCock, 2011

De Cock, D. (2011). Ames, iowa: alternative to the boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3).

Dean et al., 2012

Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … others. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems-Volume 1 (pp. 1223–1231).

DeCandia et al., 2007

DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: amazon's highly available key-value store. ACM SIGOPS operating systems review (pp. 205–220).

Deng et al., 2009

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: a large-scale hierarchical image database. 2009 IEEE conference on computer vision and pattern recognition (pp. 248–255).

DerKiureghian & Ditlevsen, 2009

Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? does it matter? Structural safety, 31(2), 105–112.

Devlin et al., 2018

Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

Doersch et al., 2015

Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE international conference on computer vision (pp. 1422–1430).

Dosovitskiy et al., 2021

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … others. (2021). An image is worth 16x16 words: transformers for image recognition at scale. International Conference on Learning Representations.

Doucet et al., 2001

Doucet, A., De Freitas, N., & Gordon, N. (2001). An introduction to sequential monte carlo methods. Sequential Monte Carlo methods in practice (pp. 3–14). Springer.

Duchi et al., 2011

Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul), 2121–2159.

Dumoulin & Visin, 2016

Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. arXiv preprint arXiv:1603.07285.

Dwork et al., 2015

Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving statistical validity in adaptive data analysis. Proceedings of the forty-seventh annual ACM symposium on Theory of computing (pp. 117–126).

Fechner, 1860

Fechner, G. T. (1860). Elemente der Psychophysik. Vol. 2. Breitkopf u. Härtel.

Fedus et al., 2022

Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.

Fernando, 2004

Fernando, R. (2004). GPU gems: programming techniques, tips, and tricks for real-time graphics. Vol. 590. Addison-Wesley Reading.

Field, 1987

Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12), 2379–2394.

Fisher, 1928

Fisher, R. (1928). Statistical methods for research workers. Stechert.

Flammarion & Bach, 2015

Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Conference on Learning Theory (pp. 658–695).

Frankle & Carbin, 2018

Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. arXiv preprint arXiv:1803.03635.

Frazier, 2018

Frazier, P. I. (2018). A tutorial on bayesian optimization. arXiv preprint arXiv:1807.02811.

Freund et al., 1996

Freund, Y., Schapire, R. E., & others. (1996). Experiments with a new boosting algorithm. ICML (pp. 148–156).

Friedman, 1987

Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American statistical association, 82(397), 249–266.

Frostig et al., 2018

Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level tracing. Systems for Machine Learning.

Fukushima, 1982

Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. Competition and cooperation in neural nets (pp. 267–285). Springer.

Garg et al., 2021

Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). Ratt: leveraging unlabeled data to guarantee generalization. International Conference on Machine Learning (pp. 3598–3609).

Gatys et al., 2016

Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2414–2423).

Gauss, 1809

Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke.

Gibbs, 1902

Gibbs, J. W. (1902). Elementary principles of statistical mechanics.

Ginibre, 1965

Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 440–449.

Girshick, 2015

Girshick, R. (2015). Fast r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 1440–1448).

Girshick et al., 2014

Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580–587).

Glorot & Bengio, 2010

Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the thirteenth international conference on artificial intelligence and statistics (pp. 249–256).

Goh, 2017

Goh, G. (2017). Why momentum really works. Distill. doi:10.23915/distill.00006

Goldberg et al., 1992

Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–71.

Golub & VanLoan, 1996

Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Johns Hopkins studies in the mathematical sciences.

Goodfellow et al., 2016

Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.

Goodfellow et al., 2014

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. Advances in neural information processing systems (pp. 2672–2680).

Gotmare et al., 2018

Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. arXiv preprint arXiv:1810.13243.

Goyal et al., 2021

Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. arXiv preprint arXiv:2110.07641.

Graham, 2014

Graham, B. (2014). Fractional max-pooling. arXiv preprint arXiv:1412.6071.

Graves, 2013

Graves, A. (2013). Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Graves et al., 2008

Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE transactions on pattern analysis and machine intelligence, 31(5), 855–868.

Graves & Schmidhuber, 2005

Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural networks, 18(5-6), 602–610.

Griewank, 1989

Griewank, A. (1989). On automatic differentiation. Mathematical Programming: recent developments and applications, 6(6), 83–107.

Gunawardana & Shani, 2015

Gunawardana, A., & Shani, G. (2015). Evaluating recommender systems. Recommender systems handbook (pp. 265–308). Springer.

Guo et al., 2017

Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). Deepfm: a factorization-machine based neural network for ctr prediction. Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1725–1731).

Guyon et al., 2008

Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature extraction: foundations and applications. Vol. 207. Springer.

Hadjis et al., 2016

Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-device deep learning on cpus and gpus. arXiv preprint arXiv:1606.04487.

Hartley & Zisserman, 2000

Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.

He et al., 2017a

He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask r-cnn. Proceedings of the IEEE international conference on computer vision (pp. 2961–2969).

He et al., 2015

He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on imagenet classification. Proceedings of the IEEE international conference on computer vision (pp. 1026–1034).

He et al., 2016a

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 770–778).

He et al., 2016b

He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European conference on computer vision (pp. 630–645).

He & Chua, 2017

He, X., & Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. Proceedings of the 40th International ACM SIGIR conference on Research and Development in Information Retrieval (pp. 355–364).

He et al., 2017b

He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th international conference on world wide web (pp. 173–182).

Hebb & Hebb, 1949

Hebb, D. O., & Hebb, D. (1949). The organization of behavior. Vol. 65. Wiley New York.

Hendrycks & Gimpel, 2016a

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Hendrycks & Gimpel, 2016b

Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415.

Hennessy & Patterson, 2011

Hennessy, J. L., & Patterson, D. A. (2011). Computer architecture: a quantitative approach. Elsevier.

Herlocker et al., 1999

Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. 22nd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 1999 (pp. 230–237).

Hidasi et al., 2015

Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2015). Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939.

Hochreiter et al., 2001

Hochreiter, S., Bengio, Y., Frasconi, P., Schmidhuber, J., & others. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Hochreiter & Schmidhuber, 1997

Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.

Hoffmann et al., 2022

Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … others. (2022). Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

Howard et al., 2019

Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for mobilenetv3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).

Hoyer et al., 2009

Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Advances in neural information processing systems (pp. 689–696).

Hu et al., 2018

Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 7132–7141).

Hu et al., 2008

Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008 Eighth IEEE International Conference on Data Mining (pp. 263–272).

Hu et al., 2020

Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2020). Text style transfer: a review and experimental evaluation. arXiv preprint arXiv:2010.12742.

Huang et al., 2018

Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D. (2018). Music transformer: generating music with long-term structure. International Conference on Learning Representations.

Huang et al., 2017

Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4700–4708).

Huang et al., 2015

Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional lstm-crf models for sequence tagging. arXiv preprint arXiv:1508.01991.

Hubel & Wiesel, 1959

Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. The Journal of physiology, 148(3), 574–591.

Hubel & Wiesel, 1962

Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. The Journal of physiology, 160(1), 106–154.

Hubel & Wiesel, 1968

Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. The Journal of physiology, 195(1), 215–243.

Ioffe, 2017

Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Advances in neural information processing systems (pp. 1945–1953).

Ioffe & Szegedy, 2015

Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.

Izmailov et al., 2018

Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. arXiv preprint arXiv:1803.05407.

Jacot et al., 2018

Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in neural information processing systems.

Jaeger, 2002

Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the "echo state network" approach. Vol. 5. GMD-Forschungszentrum Informationstechnik Bonn.

James, 2007

James, W. (2007). The principles of psychology. Vol. 1. Cosimo, Inc.

Jia et al., 2018

Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … others. (2018). Highly scalable deep learning training system with mixed-precision: training imagenet in four minutes. arXiv preprint arXiv:1807.11205.

Jia et al., 2014

Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM international conference on Multimedia (pp. 675–678).

Joshi et al., 2020

Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). Spanbert: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.

Jouppi et al., 2017

Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … others. (2017). In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (pp. 1–12).

Kalchbrenner et al., 2014

Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. arXiv preprint arXiv:1404.2188.

Kalman & Kwasny, 1992

Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. [Proceedings 1992] IJCNN International Joint Conference on Neural Networks (pp. 578–581).

Kaplan et al., 2020

Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. arXiv preprint arXiv:2001.08361.

Karras et al., 2017

Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of gans for improved quality, stability, and variation. arXiv preprint arXiv:1710.10196.

Kim, 2014

Kim, Y. (2014). Convolutional neural networks for sentence classification. arXiv preprint arXiv:1408.5882.

Kimeldorf & Wahba, 1971

Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33, 82–95.

Kingma & Ba, 2014

Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma & Welling, 2014

Kingma, D. P., & Welling, M. (2014). Auto-Encoding Variational Bayes. International Conference on Learning Representations (ICLR).

Kipf & Welling, 2016

Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907.

Koller & Friedman, 2009

Koller, D., & Friedman, N. (2009). Probabilistic graphical models: principles and techniques. MIT press.

Kolmogorov, 1933

Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4, 83–91.

Kolter, 2008

Kolter, Z. (2008). Linear algebra review and reference.

Koren et al., 2009

Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, pp. 30–37.

Krizhevsky et al., 2012

Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems (pp. 1097–1105).

Kung, 1988

Kung, S. Y. (1988). VLSI array processors. Englewood Cliffs, NJ: Prentice Hall.

Kuzovkin et al., 2018

Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications biology, 1(1), 1–12.

Lan et al., 2019

Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). Albert: a lite bert for self-supervised learning of language representations. arXiv preprint arXiv:1909.11942.

Lavin & Gray, 2016

Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 4013–4021).

LeCun et al., 1995a

LeCun, Y., Bengio, Y., & others. (1995). Convolutional networks for images, speech, and time series. The handbook of brain theory and neural networks, 3361(10), 1995.

LeCun et al., 1989

LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural computation, 1(4), 541–551.

LeCun et al., 1998a

LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. New York: Springer.

LeCun et al., 1998b

LeCun, Y., Bottou, L., Bengio, Y., Haffner, P., & others. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

LeCun et al., 1995b

LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … others. (1995). Comparison of learning algorithms for handwritten digit recognition. International conference on artificial neural networks (pp. 53–60).

Legendre, 1805

Legendre, A. M. (1805). Mémoire sur les opérations trigonométriques: dont les résultats dépendent de la figure de la terre. F. Didot.

Lewis et al., 2019

Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). Bart: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461.

Lewkowycz et al., 2022

Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … others. (2022). Solving quantitative reasoning problems with language models. arXiv preprint arXiv:2206.14858.

Li, 2017

Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral dissertation). CMU.

Li et al., 2014a

Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 583–598).

Li et al., 2014b

Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 661–670).

Lin et al., 2013

Lin, M., Chen, Q., & Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.

Lin et al., 2017a

Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE international conference on computer vision (pp. 2980–2988).

Lin et al., 2010

Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … others. (2010). Imagenet classification: fast descriptor coding and large-scale svm training. Large scale visual recognition challenge.

Lin et al., 2017b

Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.

Lipton et al., 2015

Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. arXiv preprint arXiv:1506.00019.

Lipton et al., 2016

Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with lstm recurrent neural networks. International Conference on Learning Representations (ICLR).

Lipton & Steinhardt, 2018

Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. Communications of the ACM (CACM).

Liu & Nocedal, 1989

Liu, D. C., & Nocedal, J. (1989). On the limited memory bfgs method for large scale optimization. Mathematical programming, 45(1), 503–528.

Liu et al., 2018

Liu, H., Simonyan, K., & Yang, Y. (2018). Darts: differentiable architecture search. arXiv preprint arXiv:1806.09055.

Liu et al., 2016

Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). Ssd: single shot multibox detector. European conference on computer vision (pp. 21–37).

Liu et al., 2019a

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Liu et al., 2019b

Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

Liu et al., 2021

Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).

Liu et al., 2022

Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A convnet for the 2020s. arXiv preprint arXiv:2201.03545.

Long et al., 2015

Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3431–3440).

Loshchilov & Hutter, 2016

Loshchilov, I., & Hutter, F. (2016). Sgdr: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983.

Lowe, 2004

Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2), 91–110.

Luo et al., 2018

Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch normalization. arXiv preprint.

Maas et al., 2011

Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th annual meeting of the association for computational linguistics: Human language technologies-volume 1 (pp. 142–150).

MacKay & MacKay, 2003

MacKay, D. J. C. (2003). Information theory, inference and learning algorithms. Cambridge university press.

Mangasarian, 1965

Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Oper. Res., 13, 444–452.

Mangram, 2013

Mangram, M. E. (2013). A simplified perspective of the markowitz portfolio theory. Global journal of business research, 7(1), 59–70.

McCann et al., 2017

McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: contextualized word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).

McCulloch & Pitts, 1943

McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. The bulletin of mathematical biophysics, 5(4), 115–133.

McMahan et al., 2013

McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … others. (2013). Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 1222–1230).

Mead, 1980

Mead, C. (1980). Introduction to vlsi systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1), 18.

Merity et al., 2016

Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843.

Micchelli, 1984

Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Approximation theory and spline functions (pp. 143–145). Springer.

Mikolov et al., 2013a

Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

Mikolov et al., 2013b

Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems (pp. 3111–3119).

Miller, 1995

Miller, G. A. (1995). Wordnet: a lexical database for english. Communications of the ACM, 38(11), 39–41.

Mirhoseini et al., 2017

Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device placement optimization with reinforcement learning. Proceedings of the 34th International Conference on Machine Learning-Volume 70 (pp. 2430–2439).

Mnih et al., 2014

Mnih, V., Heess, N., Graves, A., & others. (2014). Recurrent models of visual attention. Advances in neural information processing systems (pp. 2204–2212).

Moon et al., 2010

Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). Intervalrank: isotonic regression with listwise and pairwise constraints. Proceedings of the third ACM international conference on Web search and data mining (pp. 151–160).

Morey et al., 2016

Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic bulletin & review, 23(1), 103–123.

Morozov, 2012

Morozov, V. A. (2012). Methods for solving incorrectly posed problems. Springer Science & Business Media.

Nadaraya, 1964

Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & Its Applications, 9(1), 141–142.

Nair & Hinton, 2010

Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted boltzmann machines. ICML.

Nakkiran et al., 2021

Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.

Naor & Reingold, 1999

Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: luby-rackoff revisited. Journal of Cryptology, 12(1), 29–66.

Nesterov & Vial, 2000

Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming, Stochastic Programming E-Print Series.

Nesterov, 2018

Nesterov, Y. (2018). Lectures on convex optimization. Vol. 137. Springer.

Neyman, 1937

Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.

Ong et al., 2005

Ong, C. S., Smola, A., Williamson, R., & others. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research.

Papineni et al., 2002

Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). Bleu: a method for automatic evaluation of machine translation. Proceedings of the 40th annual meeting of the Association for Computational Linguistics (pp. 311–318).

Parikh et al., 2016

Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. arXiv preprint arXiv:1606.01933.

Park et al., 2019

Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337–2346).

Paszke et al., 2019

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … others. (2019). Pytorch: an imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 8026–8037.

Paulus et al., 2017

Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.

Pennington et al., 2017

Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in neural information processing systems (pp. 4785–4795).

Pennington et al., 2014

Pennington, J., Socher, R., & Manning, C. (2014). Glove: global vectors for word representation. Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP) (pp. 1532–1543).

Peters et al., 2017a

Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of causal inference: foundations and learning algorithms. MIT press.

Peters et al., 2017b

Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (pp. 1756–1765).

Peters et al., 2018

Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers) (pp. 2227–2237).

Petersen et al., 2008

Petersen, K. B., Pedersen, M. S., & others. (2008). The matrix cookbook. Technical University of Denmark, 7(15), 510.

Pleiss et al., 2017

Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-efficient implementation of densenets. arXiv preprint arXiv:1707.06990.

Polyak, 1964

Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.

Popper, 2005

Popper, K. (2005). The logic of scientific discovery. Routledge.

Quadrana et al., 2018

Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM Computing Surveys (CSUR), 51(4), 66.

Quinlan, 2014

Quinlan, J. R. (2014). C4.5: programs for machine learning. Elsevier.

Radford et al., 2021

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … others. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763).

Radford et al., 2015

Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434.

Radford et al., 2018

Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.

Radford et al., 2019

Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.

Radosavovic et al., 2019

Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1882–1890).

Radosavovic et al., 2020

Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).

Rae et al., 2021

Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … others. (2021). Scaling language models: methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446.

Raffel et al., 2020

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.

Rajpurkar et al., 2016

Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). Squad: 100,000+ questions for machine comprehension of text. arXiv preprint arXiv:1606.05250.

Ramachandran et al., 2019

Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.

Ramachandran et al., 2017

Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. arXiv preprint arXiv:1710.05941.

Ramesh et al., 2022

Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with clip latents. arXiv preprint arXiv:2204.06125.

Ranzato et al., 2007

Ranzato, M., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).

Reddi et al., 2019

Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237.

Redmon et al., 2016

Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779–788).

Redmon & Farhadi, 2018

Redmon, J., & Farhadi, A. (2018). Yolov3: an incremental improvement. arXiv preprint arXiv:1804.02767.

Reed & DeFreitas, 2015

Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. arXiv preprint arXiv:1511.06279.

Reed et al., 2022

Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … others. (2022). A generalist agent. arXiv preprint arXiv:2205.06175.

Ren et al., 2015

Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster r-cnn: towards real-time object detection with region proposal networks. Advances in neural information processing systems (pp. 91–99).

Rendle, 2010

Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).

Rendle et al., 2009

Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). Bpr: bayesian personalized ranking from implicit feedback. Proceedings of the twenty-fifth conference on uncertainty in artificial intelligence (pp. 452–461).

Revels et al., 2016

Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in julia. arXiv preprint arXiv:1607.07892.

Riesenhuber & Poggio, 1999

Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature neuroscience, 2(11), 1019–1025.

Rockafellar, 1970

Rockafellar, R. T. (1970). Convex Analysis. Vol. 28. Princeton, NJ: Princeton University Press.

Rolnick et al., 2017

Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. arXiv preprint arXiv:1705.10694.

Rudin, 1973

Rudin, W. (1973). Functional Analysis. New York: McGraw-Hill.

Rumelhart et al., 1988

Rumelhart, D. E., Hinton, G. E., Williams, R. J., & others. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5(3), 1.

Russakovsky et al., 2013

Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? International Conference on Computer Vision (ICCV).

Russell & Norvig, 2016

Russell, S. J., & Norvig, P. (2016). Artificial intelligence: a modern approach. Malaysia: Pearson Education Limited.

Saharia et al., 2022

Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … others. (2022). Photorealistic text-to-image diffusion models with deep language understanding. arXiv preprint arXiv:2205.11487.

Sanh et al., 2019

Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). Distilbert, a distilled version of bert: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.

Santurkar et al., 2018

Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems (pp. 2483–2493).

Sarwar et al., 2001

Sarwar, B. M., Karypis, G., Konstan, J. A., Riedl, J., & others. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of the 10th international conference on World Wide Web (pp. 285–295).

Schein et al., 2002

Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. Proceedings of the 25th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 253–260).

Scholkopf & Smola, 2002a

Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning Series.

Scholkopf & Smola, 2002b

Scholkopf, B., & Smola, A. J. (2002). Learning with kernels: support vector machines, regularization, optimization, and beyond. Adaptive Computation and Machine Learning Series.

Schuster & Paliwal, 1997

Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.

Scholkopf et al., 2001

Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. P. Helmbold & B. Williamson (Eds.), Proc. Annual Conf. Computational Learning Theory (pp. 416–426). London, UK: Springer-Verlag.

Scholkopf et al., 1996

Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning machines. International Conference on Artificial Neural Networks (pp. 47–52).

Sedhain et al., 2015

Sedhain, S., Menon, A. K., Sanner, S., & Xie, L. (2015). Autorec: autoencoders meet collaborative filtering. Proceedings of the 24th International Conference on World Wide Web (pp. 111–112).

Sennrich et al., 2015

Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Sergeev & DelBalso, 2018

Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in tensorflow. arXiv preprint arXiv:1802.05799.

Shannon, 1948

Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.

Shao et al., 2020

Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). Controlvae: controllable variational autoencoder. Proceedings of the 37th International Conference on Machine Learning.

Shaw et al., 2018

Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. arXiv preprint arXiv:1803.02155.

Silver et al., 2016

Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … others. (2016). Mastering the game of go with deep neural networks and tree search. Nature, 529(7587), 484.

Simard et al., 1998

Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition—tangent distance and tangent propagation. Neural networks: tricks of the trade (pp. 239–274). Springer.

Simonyan & Zisserman, 2014

Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

Sindhwani et al., 2015

Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep learning. arXiv preprint arXiv:1510.01722.

Sivic & Zisserman, 2003

Sivic, J., & Zisserman, A. (2003). Video google: a text retrieval approach to object matching in videos. Computer Vision, IEEE International Conference on (pp. 1470–1470).

Smith et al., 2022

Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … others. (2022). Using deepspeed and megatron to train megatron-turing nlg 530b, a large-scale generative language model. arXiv preprint arXiv:2201.11990.

Smola & Narayanamurthy, 2010

Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2), 703–710.

Speelpenning, 1980

Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral dissertation). University of Illinois at Urbana-Champaign.

Srivastava et al., 2022

Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … others. (2022). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615.

Srivastava et al., 2014

Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1), 1929–1958.

Srivastava et al., 2015

Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. arXiv preprint arXiv:1505.00387.

Strang, 1993

Strang, G. (1993). Introduction to linear algebra. Vol. 3. Wellesley, MA: Wellesley-Cambridge Press.

Su & Khoshgoftaar, 2009

Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in artificial intelligence, 2009.

Sukhbaatar et al., 2015

Sukhbaatar, S., Weston, J., Fergus, R., & others. (2015). End-to-end memory networks. Advances in neural information processing systems (pp. 2440–2448).

Sutskever et al., 2013

Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. International conference on machine learning (pp. 1139–1147).

Sutskever et al., 2014

Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in neural information processing systems (pp. 3104–3112).

Szegedy et al., 2017

Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, inception-resnet and the impact of residual connections on learning. Thirty-First AAAI Conference on Artificial Intelligence.

Szegedy et al., 2015

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1–9).

Szegedy et al., 2016

Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the inception architecture for computer vision. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2818–2826).

Tallec & Ollivier, 2017

Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. arXiv preprint arXiv:1705.08209.

Tan & Le, 2019

Tan, M., & Le, Q. (2019). Efficientnet: rethinking model scaling for convolutional neural networks. International conference on machine learning (pp. 6105–6114).

Tang & Wang, 2018

Tang, J., & Wang, K. (2018). Personalized top-n sequential recommendation via convolutional sequence embedding. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 565–573).

Taskar et al., 2004

Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin markov networks. Advances in neural information processing systems, 16, 25.

Tay et al., 2020

Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. arXiv preprint arXiv:2009.06732.

Teye et al., 2018

Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized deep networks. arXiv preprint arXiv:1802.06455.

Thomee et al., 2016

Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016). Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.

Tieleman & Hinton, 2012

Tieleman, T., & Hinton, G. (2012). Lecture 6.5-rmsprop: divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2), 26–31.

Tikhonov & Arsenin, 1977

Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of ill-posed problems. W.H. Winston.

Tolstikhin et al., 2021

Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … others. (2021). Mlp-mixer: an all-mlp architecture for vision. Advances in Neural Information Processing Systems, 34.

Torralba et al., 2008

Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11), 1958–1970.

Touvron et al., 2021

Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (pp. 10347–10357).

Tsoumakas & Katakis, 2007

Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining (IJDWM), 3(3), 1–13.

Turing, 1950

Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433.

Toscher et al., 2009

Töscher, A., Jahrer, M., & Bell, R. M. (2009). The bigchaos solution to the netflix grand prize. Netflix prize documentation, pp. 1–52.

Uijlings et al., 2013

Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International journal of computer vision, 104(2), 154–171.

VanLoan & Golub, 1983

Van Loan, C. F., & Golub, G. H. (1983). Matrix computations. Johns Hopkins University Press.

Vapnik, 1998

Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.

Vapnik & Chervonenkis, 1964

Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.

Vapnik & Chervonenkis, 1968

Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.

Vapnik & Chervonenkis, 1971

Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2), 264–281.

Vapnik & Chervonenkis, 1981

Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3), 543–564.

Vapnik & Chervonenkis, 1991

Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283–305.

Vapnik & Chervonenkis, 1974

Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote Control, 35, 1226–1235, 1403–1412.

Vapnik, 1992

Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in neural information processing systems (pp. 831–838).

Vapnik et al., 1994

Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the vc-dimension of a learning machine. Neural computation, 6(5), 851–876.

Vaswani et al., 2017

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems (pp. 5998–6008).

Wahba, 1990

Wahba, G. (1990). Spline models for observational data. SIAM.

Waibel et al., 1989

Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE transactions on acoustics, speech, and signal processing, 37(3), 328–339.

Wang et al., 2018

Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation. NETWORKS, 2(3), 2–3.

Wang et al., 2019

Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1810–1822).

Wang et al., 2016

Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-performance graph processing library on the gpu. ACM SIGPLAN Notices (p. 11).

Warstadt et al., 2019

Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.

Wasserman, 2013

Wasserman, L. (2013). All of statistics: a concise course in statistical inference. Springer Science & Business Media.

Watkins & Dayan, 1992

Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine learning, 8(3-4), 279–292.

Watson, 1964

Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pp. 359–372.

Welling & Teh, 2011

Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient langevin dynamics. Proceedings of the 28th international conference on machine learning (ICML-11) (pp. 681–688).

Wengert, 1964

Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.

Werbos, 1990

Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.

Wigner, 1958

Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Ann. Math (pp. 325–327).

Wood et al., 2011

Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer. Communications of the ACM, 54(2), 91–98.

Wu et al., 2016

Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … others. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.

Xiao et al., 2017

Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747.

Xiao et al., 2018

Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of cnns: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning (pp. 5393–5402).

Xie et al., 2017

Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492–1500).

Xiong et al., 2020

Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).

Xiong et al., 2018

Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The microsoft 2017 conversational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5934–5938).

Cajal & Azoulay, 1894

Ramón y Cajal, S., & Azoulay, L. (1894). Les nouvelles idées sur la structure du système nerveux chez l'homme et chez les vertébrés. C. Reinwald.

Yamaguchi et al., 1990

Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-independent isolated word recognition. First International Conference on Spoken Language Processing.

Yang et al., 2015

Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).

Ye et al., 2011

Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative point-of-interest recommendation. Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval (pp. 325–334).

You et al., 2017

You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888.

Yu et al., 2022

Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. arXiv preprint arXiv:2206.10789.

Zaheer et al., 2018

Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).

Zeiler, 2012

Zeiler, M. D. (2012). Adadelta: an adaptive learning rate method. arXiv preprint arXiv:1212.5701.

Zeiler & Fergus, 2013

Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. arXiv preprint arXiv:1301.3557.

Zhang et al., 2021a

Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. International Conference on Learning Representations.

Zhang et al., 2021b

Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.

Zhang et al., 2019

Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys (CSUR), 52(1), 5.

Zhang et al., 2022

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … others. (2022). Opt: open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.

Zhang & others, 1988

Zhang, W., & others. (1988). Shift-invariant pattern recognition neural network and its optical architecture. Proceedings of annual conference of the Japan Society of Applied Physics.

Zhang et al., 2021c

Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). Bytetrack: multi-object tracking by associating every detection box. arXiv preprint arXiv:2110.06864.

Zhao et al., 2019

Zhao, Z.-Q., Zheng, P., Xu, S.-t., & Wu, X. (2019). Object detection with deep learning: a review. IEEE transactions on neural networks and learning systems, 30(11), 3212–3232.

Zhu et al., 2017

Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE international conference on computer vision (pp. 2223–2232).

Zhu et al., 2015

Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE international conference on computer vision (pp. 19–27).

Zoph & Le, 2016

Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. arXiv preprint arXiv:1611.01578.