References

Abadi et al., 2016: Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., … et al. (2016). TensorFlow: a system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).
Abdel-Hamid et al., 2014: Abdel-Hamid, O., Mohamed, A.-R., Jiang, H., Deng, L., Penn, G., & Yu, D. (2014). Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10), 1533–1545.
Ahmed et al., 2012: Ahmed, A., Aly, M., Gonzalez, J., Narayanamurthy, S., & Smola, A. J. (2012). Scalable inference in latent variable models. Proceedings of the Fifth ACM International Conference on Web Search and Data Mining (pp. 123–132).
Akiba et al., 2019: Akiba, T., Sano, S., Yanase, T., Ohta, T., & Koyama, M. (2019). Optuna: a next-generation hyperparameter optimization framework. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.
Alayrac et al., 2022: Alayrac, J.-B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., … et al. (2022). Flamingo: a visual language model for few-shot learning. ArXiv:2204.14198.
Alsallakh et al., 2020: Alsallakh, B., Kokhlikyan, N., Miglani, V., Yuan, J., & Reblitz-Richardson, O. (2020). Mind the PAD – CNNs can develop blind spots. ArXiv:2010.02178.
Anil et al., 2023: Anil, R., Dai, A. M., Firat, O., Johnson, M., Lepikhin, D., Passos, A., … et al. (2023). PaLM 2 Technical Report. ArXiv:2305.10403.
Anil et al., 2020: Anil, R., Gupta, V., Koren, T., Regan, K., & Singer, Y. (2020). Scalable second-order optimization for deep learning. ArXiv:2002.09018.
Aronszajn, 1950: Aronszajn, N. (1950). Theory of reproducing kernels. Transactions of the American Mathematical Society, 68(3), 337–404.
Ba et al., 2016: Ba, J. L., Kiros, J. R., & Hinton, G. E. (2016). Layer normalization. ArXiv:1607.06450.
Baevski & Auli, 2018: Baevski, A., & Auli, M. (2018). Adaptive input representations for neural language modeling. International Conference on Learning Representations.
Bahdanau et al., 2014: Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly learning to align and translate. ArXiv:1409.0473.
Bai et al., 2022: Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., … et al. (2022). Constitutional AI: harmlessness from AI feedback. ArXiv:2212.08073.
Baptista & Poloczek, 2018: Baptista, R., & Poloczek, M. (2018). Bayesian optimization of combinatorial structures. Proceedings of the 35th International Conference on Machine Learning.
Bardenet et al., 2013: Bardenet, R., Brendel, M., Kégl, B., & Sebag, M. (2013). Collaborative hyperparameter tuning. Proceedings of the 30th International Conference on Machine Learning (ICML'13).
Bay et al., 2006: Bay, H., Tuytelaars, T., & Van Gool, L. (2006). SURF: Speeded up robust features. European Conference on Computer Vision (pp. 404–417).
Bellman, 1966: Bellman, R. (1966). Dynamic programming. Science, 153, 34–37.
Bellman, 1952: Bellman, R. (1952). On the theory of dynamic programming. Proceedings of the National Academy of Sciences, 38(8), 716–719.
Bellman, 1957a: Bellman, R. (1957). A Markovian decision process. Journal of Mathematics and Mechanics, 6(5), 679–684. URL: http://www.jstor.org/stable/24900506
Bellman, 1957b: Bellman, R. (1957). Dynamic Programming. Dover Publications.
Beltagy et al., 2020: Beltagy, I., Peters, M. E., & Cohan, A. (2020). Longformer: the long-document transformer. ArXiv:2004.05150.
Bengio et al., 2003: Bengio, Y., Ducharme, R., Vincent, P., & Jauvin, C. (2003). A neural probabilistic language model. Journal of Machine Learning Research, 3(Feb), 1137–1155.
Bengio et al., 1994: Bengio, Y., Simard, P., & Frasconi, P. (1994). Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2), 157–166.
Bergstra et al., 2011: Bergstra, J., Bardenet, R., Bengio, Y., & Kégl, B. (2011). Algorithms for hyper-parameter optimization. Advances in Neural Information Processing Systems, 24.
Bergstra et al., 2010: Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., … Bengio, Y. (2010). Theano: a CPU and GPU math compiler in Python. Proc. 9th Python in Science Conference (pp. 3–10).
Beutel et al., 2014: Beutel, A., Murray, K., Faloutsos, C., & Smola, A. J. (2014). CoBaFi: collaborative Bayesian filtering. Proceedings of the 23rd International Conference on World Wide Web (pp. 97–108).
Bishop, 1995: Bishop, C. M. (1995). Training with noise is equivalent to Tikhonov regularization. Neural Computation, 7(1), 108–116.
Bishop, 2006: Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.
Black & Scholes, 1973: Black, F., & Scholes, M. (1973). The pricing of options and corporate liabilities. Journal of Political Economy, 81, 637–654.
Bodla et al., 2017: Bodla, N., Singh, B., Chellappa, R., & Davis, L. S. (2017). Soft-NMS – improving object detection with one line of code. Proceedings of the IEEE International Conference on Computer Vision (pp. 5561–5569).
Bojanowski et al., 2017: Bojanowski, P., Grave, E., Joulin, A., & Mikolov, T. (2017). Enriching word vectors with subword information. Transactions of the Association for Computational Linguistics, 5, 135–146.
Bollobas, 1999: Bollobás, B. (1999). Linear Analysis. Cambridge University Press.
Bommasani et al., 2021: Bommasani, R., Hudson, D. A., Adeli, E., Altman, R., Arora, S., von Arx, S., … et al. (2021). On the opportunities and risks of foundation models. ArXiv:2108.07258.
Bottou, 2010: Bottou, L. (2010). Large-scale machine learning with stochastic gradient descent. Proceedings of COMPSTAT'2010 (pp. 177–186). Springer.
Bottou & Le Cun, 1988: Bottou, L., & Le Cun, Y. (1988). SN: a simulator for connectionist models. Proceedings of NeuroNimes 88 (pp. 371–382). Nimes, France. URL: http://leon.bottou.org/papers/bottou-lecun-88
Boucheron et al., 2005: Boucheron, S., Bousquet, O., & Lugosi, G. (2005). Theory of classification: a survey of some recent advances. ESAIM: Probability and Statistics, 9, 323–375.
Bowman et al., 2015: Bowman, S. R., Angeli, G., Potts, C., & Manning, C. D. (2015). A large annotated corpus for learning natural language inference. ArXiv:1508.05326.
Boyd & Vandenberghe, 2004: Boyd, S., & Vandenberghe, L. (2004). Convex Optimization. Cambridge, England: Cambridge University Press.
Bradley & Terry, 1952: Bradley, R. A., & Terry, M. E. (1952). Rank analysis of incomplete block designs: I. The method of paired comparisons. Biometrika, 39(3/4), 324–345.
Brown & Sandholm, 2017: Brown, N., & Sandholm, T. (2017). Libratus: the superhuman AI for no-limit poker. IJCAI (pp. 5226–5228).
Brown et al., 1990: Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Lafferty, J., … Roossin, P. S. (1990). A statistical approach to machine translation. Computational Linguistics, 16(2), 79–85.
Brown et al., 1988: Brown, P. F., Cocke, J., Della Pietra, S. A., Della Pietra, V. J., Jelinek, F., Mercer, R. L., & Roossin, P. (1988). A statistical approach to language translation. COLING Budapest 1988 Volume 1: International Conference on Computational Linguistics.
Brown et al., 2020: Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., … et al. (2020). Language models are few-shot learners. Advances in Neural Information Processing Systems, 33, 1877–1901.
Buslaev et al., 2020: Buslaev, A., Iglovikov, V. I., Khvedchenya, E., Parinov, A., Druzhinin, M., & Kalinin, A. A. (2020). Albumentations: Fast and flexible image augmentations. Information, 11(2), 125.
Campbell et al., 2002: Campbell, M., Hoane Jr, A. J., & Hsu, F.-h. (2002). Deep Blue. Artificial Intelligence, 134(1-2), 57–83.
Canny, 1987: Canny, J. (1987). A computational approach to edge detection. Readings in Computer Vision (pp. 184–203). Elsevier.
Cer et al., 2017: Cer, D., Diab, M., Agirre, E., Lopez-Gazpio, I., & Specia, L. (2017). SemEval-2017 Task 1: semantic textual similarity multilingual and crosslingual focused evaluation. Proceedings of the 11th International Workshop on Semantic Evaluation (SemEval-2017) (pp. 1–14).
Chan et al., 2015: Chan, W., Jaitly, N., Le, Q. V., & Vinyals, O. (2015). Listen, attend and spell. ArXiv:1508.01211.
Chen et al., 2021: Chen, L., Lu, K., Rajeswaran, A., Lee, K., Grover, A., Laskin, M., … Mordatch, I. (2021). Decision transformer: reinforcement learning via sequence modeling. Advances in Neural Information Processing Systems, 34, 15084–15097.
Chen et al., 2015: Chen, T., Li, M., Li, Y., Lin, M., Wang, N., Wang, M., … Zhang, Z. (2015). MXNET: a flexible and efficient machine learning library for heterogeneous distributed systems. ArXiv:1512.01274.
Cheng et al., 2016: Cheng, J., Dong, L., & Lapata, M. (2016). Long short-term memory-networks for machine reading. Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing (pp. 551–561).
Chetlur et al., 2014: Chetlur, S., Woolley, C., Vandermersch, P., Cohen, J., Tran, J., Catanzaro, B., & Shelhamer, E. (2014). CuDNN: Efficient primitives for deep learning. ArXiv:1410.0759.
Cho et al., 2014a: Cho, K., Van Merriënboer, B., Bahdanau, D., & Bengio, Y. (2014). On the properties of neural machine translation: Encoder–decoder approaches. ArXiv:1409.1259.
Cho et al., 2014b: Cho, K., Van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H., & Bengio, Y. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. ArXiv:1406.1078.
Chowdhery et al., 2022: Chowdhery, A., Narang, S., Devlin, J., Bosma, M., Mishra, G., Roberts, A., … et al. (2022). PaLM: scaling language modeling with pathways. ArXiv:2204.02311.
Chung et al., 2014: Chung, J., Gulcehre, C., Cho, K., & Bengio, Y. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling. ArXiv:1412.3555.
Clark et al., 2020: Clark, K., Luong, M.-T., Le, Q. V., & Manning, C. D. (2020). ELECTRA: pre-training text encoders as discriminators rather than generators. International Conference on Learning Representations.
Collobert et al., 2011: Collobert, R., Weston, J., Bottou, L., Karlen, M., Kavukcuoglu, K., & Kuksa, P. (2011). Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12, 2493–2537.
Cordonnier et al., 2020: Cordonnier, J.-B., Loukas, A., & Jaggi, M. (2020). On the relationship between self-attention and convolutional layers. International Conference on Learning Representations.
Cover & Thomas, 1999: Cover, T., & Thomas, J. (1999). Elements of Information Theory. John Wiley & Sons.
Csiszar, 2008: Csiszár, I. (2008). Axiomatic characterizations of information measures. Entropy, 10(3), 261–273.
Cybenko, 1989: Cybenko, G. (1989). Approximation by superpositions of a sigmoidal function. Mathematics of Control, Signals and Systems, 2(4), 303–314.
Dalal & Triggs, 2005: Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05) (pp. 886–893).
DeCock, 2011: De Cock, D. (2011). Ames, Iowa: alternative to the Boston housing data as an end of semester regression project. Journal of Statistics Education, 19(3).
Dean et al., 2012: Dean, J., Corrado, G. S., Monga, R., Chen, K., Devin, M., Le, Q. V., … et al. (2012). Large scale distributed deep networks. Proceedings of the 25th International Conference on Neural Information Processing Systems, Volume 1 (pp. 1223–1231).
DeCandia et al., 2007: DeCandia, G., Hastorun, D., Jampani, M., Kakulapati, G., Lakshman, A., Pilchin, A., … Vogels, W. (2007). Dynamo: Amazon's highly available key-value store. ACM SIGOPS Operating Systems Review (pp. 205–220).
Deng et al., 2009: Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. 2009 IEEE Conference on Computer Vision and Pattern Recognition (pp. 248–255).
DerKiureghian & Ditlevsen, 2009: Der Kiureghian, A., & Ditlevsen, O. (2009). Aleatory or epistemic? Does it matter? Structural Safety, 31(2), 105–112.
Devlin et al., 2018: Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2018). BERT: Pre-training of deep bidirectional transformers for language understanding. ArXiv:1810.04805.
Dinh et al., 2014: Dinh, L., Krueger, D., & Bengio, Y. (2014). NICE: non-linear independent components estimation. ArXiv:1410.8516.
Dinh et al., 2017: Dinh, L., Sohl-Dickstein, J., & Bengio, S. (2017). Density estimation using real NVP. International Conference on Learning Representations.
Doersch et al., 2015: Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation learning by context prediction. Proceedings of the IEEE International Conference on Computer Vision (pp. 1422–1430).
Dosovitskiy et al., 2021: Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., … et al. (2021). An image is worth 16 x 16 words: transformers for image recognition at scale. International Conference on Learning Representations.
Duchi et al., 2011: Duchi, J., Hazan, E., & Singer, Y. (2011). Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12, 2121–2159.
Dumoulin & Visin, 2016: Dumoulin, V., & Visin, F. (2016). A guide to convolution arithmetic for deep learning. ArXiv:1603.07285.
Dwivedi & Bresson, 2020: Dwivedi, V. P., & Bresson, X. (2020). A generalization of transformer networks to graphs. ArXiv:2012.09699.
Dwork et al., 2015: Dwork, C., Feldman, V., Hardt, M., Pitassi, T., Reingold, O., & Roth, A. L. (2015). Preserving statistical validity in adaptive data analysis. Proceedings of the 47th Annual ACM Symposium on Theory of Computing (pp. 117–126).
Elman, 1990: Elman, J. L. (1990). Finding structure in time. Cognitive Science, 14(2), 179–211.
Elsken et al., 2018: Elsken, T., Metzen, J. H., & Hutter, F. (2018). Neural architecture search: a survey. ArXiv:1808.05377 [stat.ML].
Fechner, 1860: Fechner, G. T. (1860). Elemente der Psychophysik. Vol. 2. Breitkopf u. Härtel.
Fedus et al., 2022: Fedus, W., Zoph, B., & Shazeer, N. (2022). Switch transformers: scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120), 1–39.
Fernando, 2004: Fernando, R. (2004). GPU Gems: Programming Techniques, Tips, and Tricks for Real-Time Graphics. Addison-Wesley.
Feurer & Hutter, 2018: Feurer, M., & Hutter, F. (2018). Hyperparameter optimization. Automatic Machine Learning: Methods, Systems, Challenges. Springer.
Feurer et al., 2022: Feurer, M., Letham, B., Hutter, F., & Bakshy, E. (2022). Practical transfer learning for Bayesian optimization. ArXiv:1802.02219 [stat.ML].
Field, 1987: Field, D. J. (1987). Relations between the statistics of natural images and the response properties of cortical cells. JOSA A, 4(12), 2379–2394.
Fisher, 1925: Fisher, R. A. (1925). Statistical Methods for Research Workers. Oliver & Boyd.
Flammarion & Bach, 2015: Flammarion, N., & Bach, F. (2015). From averaging to acceleration, there is only a step-size. Conference on Learning Theory (pp. 658–695).
Forrester et al., 2007: Forrester, A. I., Sóbester, A., & Keane, A. J. (2007). Multi-fidelity optimization via surrogate modelling. Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences, 463(2088), 3251–3269.
Franceschi et al., 2017: Franceschi, L., Donini, M., Frasconi, P., & Pontil, M. (2017). Forward and reverse gradient-based hyperparameter optimization. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
Frankle & Carbin, 2018: Frankle, J., & Carbin, M. (2018). The lottery ticket hypothesis: finding sparse, trainable neural networks. ArXiv:1803.03635.
Frazier, 2018: Frazier, P. I. (2018). A tutorial on Bayesian optimization. ArXiv:1807.02811.
Freund & Schapire, 1996: Freund, Y., & Schapire, R. E. (1996). Experiments with a new boosting algorithm. Proceedings of the International Conference on Machine Learning (pp. 148–156).
Friedman, 1987: Friedman, J. H. (1987). Exploratory projection pursuit. Journal of the American Statistical Association, 82(397), 249–266.
Frostig et al., 2018: Frostig, R., Johnson, M. J., & Leary, C. (2018). Compiling machine learning programs via high-level tracing. Proceedings of Systems for Machine Learning.
Fukushima, 1982: Fukushima, K. (1982). Neocognitron: a self-organizing neural network model for a mechanism of visual pattern recognition. Competition and Cooperation in Neural Nets (pp. 267–285). Springer.
Gardner et al., 2018: Gardner, J., Pleiss, G., Weinberger, K. Q., Bindel, D., & Wilson, A. G. (2018). GPyTorch: blackbox matrix–matrix Gaussian process inference with GPU acceleration. Advances in Neural Information Processing Systems.
Garg et al., 2021: Garg, S., Balakrishnan, S., Kolter, Z., & Lipton, Z. (2021). RATT: leveraging unlabeled data to guarantee generalization. International Conference on Machine Learning (pp. 3598–3609).
Gatys et al., 2016: Gatys, L. A., Ecker, A. S., & Bethge, M. (2016). Image style transfer using convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2414–2423).
Gauss, 1809: Gauss, C. F. (1809). Theoria motus corporum coelestium. Werke. Königlich Preussische Akademie der Wissenschaften.
Gibbs, 1902: Gibbs, J. W. (1902). Elementary Principles of Statistical Mechanics. Scribner's.
Ginibre, 1965: Ginibre, J. (1965). Statistical ensembles of complex, quaternion, and real matrices. Journal of Mathematical Physics, 6(3), 440–449.
Girshick, 2015: Girshick, R. (2015). Fast R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 1440–1448).
Girshick et al., 2014: Girshick, R., Donahue, J., Darrell, T., & Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 580–587).
Glorot & Bengio, 2010: Glorot, X., & Bengio, Y. (2010). Understanding the difficulty of training deep feedforward neural networks. Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (pp. 249–256).
Goh, 2017: Goh, G. (2017). Why momentum really works. Distill. URL: http://distill.pub/2017/momentum
Goldberg et al., 1992: Goldberg, D., Nichols, D., Oki, B. M., & Terry, D. (1992). Using collaborative filtering to weave an information tapestry. Communications of the ACM, 35(12), 61–71.
Golub & VanLoan, 1996: Golub, G. H., & Van Loan, C. F. (1996). Matrix Computations. Johns Hopkins University Press.
Goodfellow et al., 2016: Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press. http://www.deeplearningbook.org.
Goodfellow et al., 2014: Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., … Bengio, Y. (2014). Generative adversarial nets. Advances in Neural Information Processing Systems (pp. 2672–2680).
Gotmare et al., 2018: Gotmare, A., Keskar, N. S., Xiong, C., & Socher, R. (2018). A closer look at deep learning heuristics: learning rate restarts, warmup and distillation. ArXiv:1810.13243.
Goyal et al., 2021: Goyal, A., Bochkovskiy, A., Deng, J., & Koltun, V. (2021). Non-deep networks. ArXiv:2110.07641.
Graham, 2014: Graham, B. (2014). Fractional max-pooling. ArXiv:1412.6071.
Graves, 2013: Graves, A. (2013). Generating sequences with recurrent neural networks. ArXiv:1308.0850.
Graves et al., 2008: Graves, A., Liwicki, M., Fernández, S., Bertolami, R., Bunke, H., & Schmidhuber, J. (2008). A novel connectionist system for unconstrained handwriting recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 31(5), 855–868.
Graves & Schmidhuber, 2005: Graves, A., & Schmidhuber, J. (2005). Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5-6), 602–610.
Griewank, 1989: Griewank, A. (1989). On automatic differentiation. Mathematical Programming: Recent Developments and Applications (pp. 83–107). Kluwer.
Gulati et al., 2020: Gulati, A., Qin, J., Chiu, C.-C., Parmar, N., Zhang, Y., Yu, J., … et al. (2020). Conformer: convolution-augmented transformer for speech recognition. Proc. Interspeech 2020 (pp. 5036–5040).
Gunawardana & Shani, 2015: Gunawardana, A., & Shani, G. (2015). Evaluating recommender systems. Recommender Systems Handbook (pp. 265–308). Springer.
Guo et al., 2017: Guo, H., Tang, R., Ye, Y., Li, Z., & He, X. (2017). DeepFM: a factorization-machine based neural network for CTR prediction. Proceedings of the 26th International Joint Conference on Artificial Intelligence (pp. 1725–1731).
Guyon et al., 2008: Guyon, I., Gunn, S., Nikravesh, M., & Zadeh, L. A. (2008). Feature Extraction: Foundations and Applications. Springer.
Hadjis et al., 2016: Hadjis, S., Zhang, C., Mitliagkas, I., Iter, D., & Ré, C. (2016). Omnivore: an optimizer for multi-device deep learning on CPUs and GPUs. ArXiv:1606.04487.
Hartley & Zisserman, 2000: Hartley, R., & Zisserman, A. (2000). Multiple View Geometry in Computer Vision. Cambridge University Press.
Hartley & Kahl, 2009: Hartley, R. I., & Kahl, F. (2009). Global optimization through rotation space search. International Journal of Computer Vision, 82(1), 64–79.
He et al., 2022: He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2022). Masked autoencoders are scalable vision learners. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 16000–16009).
He et al., 2017a: He, K., Gkioxari, G., Dollár, P., & Girshick, R. (2017). Mask R-CNN. Proceedings of the IEEE International Conference on Computer Vision (pp. 2961–2969).
He et al., 2015: He, K., Zhang, X., Ren, S., & Sun, J. (2015). Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. Proceedings of the IEEE International Conference on Computer Vision (pp. 1026–1034).
He et al., 2016a: He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 770–778).
He et al., 2016b: He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual networks. European Conference on Computer Vision (pp. 630–645).
He & Chua, 2017: He, X., & Chua, T.-S. (2017). Neural factorization machines for sparse predictive analytics. Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 355–364).
He et al., 2017b: He, X., Liao, L., Zhang, H., Nie, L., Hu, X., & Chua, T.-S. (2017). Neural collaborative filtering. Proceedings of the 26th International Conference on World Wide Web (pp. 173–182).
Hebb, 1949: Hebb, D. O. (1949). The Organization of Behavior. Wiley.
Hendrycks & Gimpel, 2016: Hendrycks, D., & Gimpel, K. (2016). Gaussian error linear units (GELUs). ArXiv:1606.08415.
Hennessy & Patterson, 2011: Hennessy, J. L., & Patterson, D. A. (2011). Computer Architecture: A Quantitative Approach. Elsevier.
Herlocker et al., 1999: Herlocker, J. L., Konstan, J. A., Borchers, A., & Riedl, J. (1999). An algorithmic framework for performing collaborative filtering. 22nd Annual International ACM Conference on Research and Development in Information Retrieval, SIGIR 1999 (pp. 230–237).
Hidasi et al., 2015: Hidasi, B., Karatzoglou, A., Baltrunas, L., & Tikk, D. (2015). Session-based recommendations with recurrent neural networks. ArXiv:1511.06939.
Ho et al., 2020: Ho, J., Jain, A., & Abbeel, P. (2020). Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33, 6840–6851.
Hochreiter et al., 2001: Hochreiter, S., Bengio, Y., Frasconi, P., & Schmidhuber, J. (2001). Gradient flow in recurrent nets: the difficulty of learning long-term dependencies. A Field Guide to Dynamical Recurrent Neural Networks. IEEE Press.
Hochreiter & Schmidhuber, 1997: Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural Computation, 9(8), 1735–1780.
Hoffmann et al., 2022: Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., … et al. (2022). Training compute-optimal large language models. ArXiv:2203.15556.
Howard et al., 2019: Howard, A., Sandler, M., Chu, G., Chen, L.-C., Chen, B., Tan, M., … Adam, H. (2019). Searching for MobileNetV3. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1314–1324).
Hoyer et al., 2009: Hoyer, P. O., Janzing, D., Mooij, J. M., Peters, J., & Schölkopf, B. (2009). Nonlinear causal discovery with additive noise models. Advances in Neural Information Processing Systems (pp. 689–696).
Hu et al., 2018: Hu, J., Shen, L., & Sun, G. (2018). Squeeze-and-excitation networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 7132–7141).
Hu et al., 2008: Hu, Y., Koren, Y., & Volinsky, C. (2008). Collaborative filtering for implicit feedback datasets. 2008 8th IEEE International Conference on Data Mining (pp. 263–272).
Hu et al., 2022: Hu, Z., Lee, R. K.-W., Aggarwal, C. C., & Zhang, A. (2022). Text style transfer: a review and experimental evaluation. SIGKDD Explor. Newsl., 24(1). URL: https://doi.org/10.1145/3544903.3544906
Huang et al., 2018: Huang, C.-Z. A., Vaswani, A., Uszkoreit, J., Simon, I., Hawthorne, C., Shazeer, N., … Eck, D. (2018). Music transformer: generating music with long-term structure. International Conference on Learning Representations.
Huang et al., 2017: Huang, G., Liu, Z., Van Der Maaten, L., & Weinberger, K. Q. (2017). Densely connected convolutional networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4700–4708).
Huang et al., 2015: Huang, Z., Xu, W., & Yu, K. (2015). Bidirectional LSTM–CRF models for sequence tagging. ArXiv:1508.01991.
Hubel & Wiesel, 1959: Hubel, D. H., & Wiesel, T. N. (1959). Receptive fields of single neurones in the cat's striate cortex. Journal of Physiology, 148(3), 574–591.
Hubel & Wiesel, 1962: Hubel, D. H., & Wiesel, T. N. (1962). Receptive fields, binocular interaction and functional architecture in the cat's visual cortex. Journal of Physiology, 160(1), 106–154.
Hubel & Wiesel, 1968: Hubel, D. H., & Wiesel, T. N. (1968). Receptive fields and functional architecture of monkey striate cortex. Journal of Physiology, 195(1), 215–243.
Hutter et al., 2011: Hutter, F., Hoos, H., & Leyton-Brown, K. (2011). Sequential model-based optimization for general algorithm configuration. Proceedings of the Fifth International Conference on Learning and Intelligent Optimization (LION'11).
Hutter et al., 2019: Hutter, F., Kotthoff, L., & Vanschoren, J. (Eds.) (2019). Automated Machine Learning: Methods, Systems, Challenges. Springer.
Ioffe, 2017: Ioffe, S. (2017). Batch renormalization: towards reducing minibatch dependence in batch-normalized models. Advances in Neural Information Processing Systems (pp. 1945–1953).
Ioffe & Szegedy, 2015: Ioffe, S., & Szegedy, C. (2015). Batch normalization: accelerating deep network training by reducing internal covariate shift. ArXiv:1502.03167.
Izmailov et al., 2018: Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D., & Wilson, A. G. (2018). Averaging weights leads to wider optima and better generalization. ArXiv:1803.05407.
Jacot et al., 2018: Jacot, A., Gabriel, F., & Hongler, C. (2018). Neural tangent kernel: convergence and generalization in neural networks. Advances in Neural Information Processing Systems.
Jaeger, 2002: Jaeger, H. (2002). Tutorial on training recurrent neural networks, covering BPPT, RTRL, EKF and the “echo state network” approach. GMD-Forschungszentrum Informationstechnik Bonn.
Jamieson & Talwalkar, 2016: Jamieson, K., & Talwalkar, A. (2016). Non-stochastic best arm identification and hyperparameter optimization. Proceedings of the 17th International Conference on Artificial Intelligence and Statistics.
Jenatton et al., 2017: Jenatton, R., Archambeau, C., González, J., & Seeger, M. (2017). Bayesian optimization with tree-structured dependencies. Proceedings of the 34th International Conference on Machine Learning (ICML'17).
Jia et al., 2018: Jia, X., Song, S., He, W., Wang, Y., Rong, H., Zhou, F., … et al. (2018). Highly scalable deep learning training system with mixed-precision: training ImageNet in four minutes. ArXiv:1807.11205.
Jia et al., 2014: Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., … Darrell, T. (2014). Caffe: convolutional architecture for fast feature embedding. Proceedings of the 22nd ACM International Conference on Multimedia (pp. 675–678).
Joshi et al., 2020: Joshi, M., Chen, D., Liu, Y., Weld, D. S., Zettlemoyer, L., & Levy, O. (2020). SpanBERT: improving pre-training by representing and predicting spans. Transactions of the Association for Computational Linguistics, 8, 64–77.
Jouppi et al., 2017: Jouppi, N. P., Young, C., Patil, N., Patterson, D., Agrawal, G., Bajwa, R., … et al. (2017). In-datacenter performance analysis of a tensor processing unit. 2017 ACM/IEEE 44th Annual International Symposium on Computer Architecture (ISCA) (pp. 1–12).
Kalchbrenner et al., 2014: Kalchbrenner, N., Grefenstette, E., & Blunsom, P. (2014). A convolutional neural network for modelling sentences. ArXiv:1404.2188.
Kalman & Kwasny, 1992: Kalman, B. L., & Kwasny, S. C. (1992). Why tanh: choosing a sigmoidal function. Proceedings of the International Joint Conference on Neural Networks (IJCNN) (pp. 578–581).
Kaplan et al., 2020: Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., … Amodei, D. (2020). Scaling laws for neural language models. ArXiv:2001.08361.
Karnin et al., 2013: Karnin, Z., Koren, T., & Somekh, O. (2013). Almost optimal exploration in multi-armed bandits. Proceedings of the 30th International Conference on Machine Learning (ICML'13).
Karras et al., 2017: Karras, T., Aila, T., Laine, S., & Lehtinen, J. (2017). Progressive growing of GANs for improved quality, stability, and variation. ArXiv:1710.10196.
Kim et al., 2017: Kim, J., El-Khamy, M., & Lee, J. (2017). Residual LSTM: design of a deep recurrent architecture for distant speech recognition. ArXiv:1701.03360.
Kim, 2014: Kim, Y. (2014). Convolutional neural networks for sentence classification. ArXiv:1408.5882.
Kimeldorf & Wahba, 1971: Kimeldorf, G. S., & Wahba, G. (1971). Some results on Tchebycheffian spline functions. J. Math. Anal. Appl., 33, 82–95.
Kingma & Ba, 2014: Kingma, D. P., & Ba, J. (2014). Adam: a method for stochastic optimization. ArXiv:1412.6980.
Kingma & Welling, 2014: Kingma, D. P., & Welling, M. (2014). Auto-encoding variational Bayes. International Conference on Learning Representations (ICLR).
Kipf & Welling, 2016: Kipf, T. N., & Welling, M. (2016). Semi-supervised classification with graph convolutional networks. ArXiv:1609.02907.
Kojima et al., 2022: Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large language models are zero-shot reasoners. ArXiv:2205.11916.
Koller & Friedman, 2009: Koller, D., & Friedman, N. (2009). Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Kolmogorov, 1933: Kolmogorov, A. (1933). Sulla determinazione empirica di una legge di distribuzione. Inst. Ital. Attuari, Giorn., 4, 83–91.
Kolter, 2008: Kolter, Z. (2008). Linear algebra review and reference. Available online: http://cs229.stanford.edu/section/cs229-linalg.pdf.
Koren et al., 2009: Koren, Y., Bell, R., & Volinsky, C. (2009). Matrix factorization techniques for recommender systems. Computer, 42(8), 30–37.
Krizhevsky et al., 2012: Krizhevsky, A., Sutskever, I., & Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (pp. 1097–1105).
Kung, 1988: Kung, S. Y. (1988). VLSI Array Processors. Prentice Hall.
Kuzovkin et al., 2018: Kuzovkin, I., Vicente, R., Petton, M., Lachaux, J.-P., Baciu, M., Kahane, P., … Aru, J. (2018). Activations of deep convolutional neural networks are aligned with gamma band activity of human visual cortex. Communications Biology, 1(1), 1–12.
Lan et al., 2019: Lan, Z., Chen, M., Goodman, S., Gimpel, K., Sharma, P., & Soricut, R. (2019). ALBERT: a lite BERT for self-supervised learning of language representations. ArXiv:1909.11942.
Lavin & Gray, 2016: Lavin, A., & Gray, S. (2016). Fast algorithms for convolutional neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4013–4021).
Le, 2013: Le, Q. V. (2013). Building high-level features using large scale unsupervised learning. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (pp. 8595–8598).
LeCun et al., 1995a: LeCun, Y., Bengio, Y., … et al. (1995). Convolutional networks for images, speech, and time series. The Handbook of Brain Theory and Neural Networks (p. 3361). MIT Press.
LeCun et al., 1989: LeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W., & Jackel, L. D. (1989). Backpropagation applied to handwritten zip code recognition. Neural Computation, 1(4), 541–551.
LeCun et al., 1998a: LeCun, Y., Bottou, L., Orr, G., & Muller, K.-R. (1998). Efficient backprop. Neural Networks: Tricks of the Trade. Springer.
LeCun et al., 1998b: LeCun, Y., Bottou, L., Bengio, Y., & Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.
LeCun et al., 1995b: LeCun, Y., Jackel, L., Bottou, L., Brunot, A., Cortes, C., Denker, J., … et al. (1995). Comparison of learning algorithms for handwritten digit recognition. International Conference on Artificial Neural Networks (pp. 53–60).
Legendre, 1805: Legendre, A. M. (1805). Mémoire sur les Opérations Trigonométriques: dont les Résultats Dépendent de la Figure de la Terre. F. Didot.
Lewis et al., 2019: Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., … Zettlemoyer, L. (2019). BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. ArXiv:1910.13461.
Lewkowycz et al., 2022: Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V., … et al. (2022). Solving quantitative reasoning problems with language models. ArXiv:2206.14858.
Li et al., 2018: Li, L., Jamieson, K., Rostamizadeh, A., Gonina, K., Hardt, M., Recht, B., & Talwalkar, A. (2018). Massively parallel hyperparameter tuning. ArXiv:1810.05934.
Li, 2017: Li, M. (2017). Scaling Distributed Machine Learning with System and Algorithm Co-design (Doctoral dissertation). Carnegie Mellon University.
Li et al., 2014a: Li, M., Andersen, D. G., Park, J. W., Smola, A. J., Ahmed, A., Josifovski, V., … Su, B.-Y. (2014). Scaling distributed machine learning with the parameter server. 11th Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 583–598).
Li et al., 2014b: Li, M., Zhang, T., Chen, Y., & Smola, A. J. (2014). Efficient mini-batch training for stochastic optimization. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 661–670).
Liaw et al., 2018: Liaw, R., Liang, E., Nishihara, R., Moritz, P., Gonzalez, J., & Stoica, I. (2018). Tune: a research platform for distributed model selection and training. ArXiv:1807.05118.
Lin et al., 2013: Lin, M., Chen, Q., & Yan, S. (2013). Network in network. ArXiv:1312.4400.
Lin et al., 2017a: Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (pp. 2980–2988).
Lin et al., 2010: Lin, Y., Lv, F., Zhu, S., Yang, M., Cour, T., Yu, K., … et al. (2010). ImageNet classification: fast descriptor coding and large-scale SVM training. Large Scale Visual Recognition Challenge.
Lin et al., 2017b: Lin, Z., Feng, M., Santos, C. N. d., Yu, M., Xiang, B., Zhou, B., & Bengio, Y. (2017). A structured self-attentive sentence embedding. ArXiv:1703.03130.
Lipton et al., 2015: Lipton, Z. C., Berkowitz, J., & Elkan, C. (2015). A critical review of recurrent neural networks for sequence learning. ArXiv:1506.00019.
Lipton et al., 2016: Lipton, Z. C., Kale, D. C., Elkan, C., & Wetzel, R. (2016). Learning to diagnose with LSTM recurrent neural networks. International Conference on Learning Representations (ICLR).
Lipton & Steinhardt, 2018: Lipton, Z. C., & Steinhardt, J. (2018). Troubling trends in machine learning scholarship. Queue, 17(1), 45–77.
Liu & Nocedal, 1989: Liu, D. C., & Nocedal, J. (1989). On the limited memory BFGS method for large scale optimization. Mathematical Programming, 45(1), 503–528.
Liu et al., 2018: Liu, H., Simonyan, K., & Yang, Y. (2018). DARTS: differentiable architecture search. ArXiv:1806.09055.
Liu et al., 2016: Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.-Y., & Berg, A. C. (2016). SSD: single shot multibox detector. European Conference on Computer Vision (pp. 21–37).
Liu et al., 2019: Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., … Stoyanov, V. (2019). RoBERTa: a robustly optimized BERT pretraining approach. ArXiv:1907.11692.
Liu et al., 2021: Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., … Guo, B. (2021). Swin transformer: hierarchical vision transformer using shifted windows. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 10012–10022).
Liu et al., 2022: Liu, Z., Mao, H., Wu, C.-Y., Feichtenhofer, C., Darrell, T., & Xie, S. (2022). A ConvNet for the 2020s. ArXiv:2201.03545.
Long et al., 2015: Long, J., Shelhamer, E., & Darrell, T. (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 3431–3440).
Loshchilov & Hutter, 2016: Loshchilov, I., & Hutter, F. (2016). SGDR: stochastic gradient descent with warm restarts. ArXiv:1608.03983.
Lowe, 2004: Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Luo et al., 2018: Luo, P., Wang, X., Shao, W., & Peng, Z. (2018). Towards understanding regularization in batch normalization. ArXiv:1809.00846.
Maas et al., 2011: Maas, A. L., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 142–150).
Mack & Silverman, 1982: Mack, Y.-P., & Silverman, B. W. (1982). Weak and strong uniform consistency of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete, 61(3), 405–415.
MacKay, 2003: MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Maclaurin et al., 2015: Maclaurin, D., Duvenaud, D., & Adams, R. (2015). Gradient-based hyperparameter optimization through reversible learning. Proceedings of the 32nd International Conference on Machine Learning (ICML'15).
Mangasarian, 1965: Mangasarian, O. L. (1965). Linear and nonlinear separation of patterns by linear programming. Operations Research, 13, 444–452.
Mangram, 2013: Mangram, M. E. (2013). A simplified perspective of the Markowitz portfolio theory. Global Journal of Business Research, 7(1), 59–70.
Matthews et al., 2018: Matthews, A. G. d. G., Rowland, M., Hron, J., Turner, R. E., & Ghahramani, Z. (2018). Gaussian process behaviour in wide deep neural networks. ArXiv:1804.11271.
McCann et al., 2017: McCann, B., Bradbury, J., Xiong, C., & Socher, R. (2017). Learned in translation: Contextualized word vectors. Advances in Neural Information Processing Systems (pp. 6294–6305).
McCulloch & Pitts, 1943: McCulloch, W. S., & Pitts, W. (1943). A logical calculus of the ideas immanent in nervous activity. Bulletin of Mathematical Biophysics, 5(4), 115–133.
McMahan et al., 2013: McMahan, H. B., Holt, G., Sculley, D., Young, M., Ebner, D., Grady, J., … et al. (2013). Ad click prediction: a view from the trenches. Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1222–1230).
Mead, 1980: Mead, C. (1980). Introduction to VLSI systems. IEE Proceedings I-Solid-State and Electron Devices, 128(1), 18.
Merity et al., 2016: Merity, S., Xiong, C., Bradbury, J., & Socher, R. (2016). Pointer sentinel mixture models. ArXiv:1609.07843.
Micchelli, 1984: Micchelli, C. A. (1984). Interpolation of scattered data: distance matrices and conditionally positive definite functions. Approximation Theory and Spline Functions (pp. 143–145). Springer.
Mikolov et al., 2013a: Mikolov, T., Chen, K., Corrado, G., & Dean, J. (2013). Efficient estimation of word representations in vector space. ArXiv:1301.3781.
Mikolov et al., 2013b: Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., & Dean, J. (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems (pp. 3111–3119).
Miller, 1995: Miller, G. A. (1995). WordNet: a lexical database for English. Communications of the ACM, 38(11), 39–41.
Mirhoseini et al., 2017: Mirhoseini, A., Pham, H., Le, Q. V., Steiner, B., Larsen, R., Zhou, Y., … Dean, J. (2017). Device placement optimization with reinforcement learning. Proceedings of the 34th International Conference on Machine Learning (pp. 2430–2439).
Mnih et al., 2014: Mnih, V., Heess, N., Graves, A., … et al. (2014). Recurrent models of visual attention. Advances in Neural Information Processing Systems (pp. 2204–2212).
Mnih et al., 2013: Mnih, V., Kavukcuoglu, K., Silver, D., Graves, A., Antonoglou, I., Wierstra, D., & Riedmiller, M. (2013). Playing Atari with deep reinforcement learning. ArXiv:1312.5602.
Mnih et al., 2015: Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., … et al. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529–533.
Moon et al., 2010: Moon, T., Smola, A., Chang, Y., & Zheng, Z. (2010). IntervalRank: isotonic regression with listwise and pairwise constraints. Proceedings of the 3rd ACM International Conference on Web Search and Data Mining (pp. 151–160).
Morey et al., 2016: Morey, R. D., Hoekstra, R., Rouder, J. N., Lee, M. D., & Wagenmakers, E.-J. (2016). The fallacy of placing confidence in confidence intervals. Psychonomic Bulletin & Review, 23(1), 103–123.
Morozov, 1984: Morozov, V. A. (1984). Methods for Solving Incorrectly Posed Problems. Springer.
Nadaraya, 1964: Nadaraya, E. A. (1964). On estimating regression. Theory of Probability & its Applications, 9(1), 141–142.
Nair & Hinton, 2010: Nair, V., & Hinton, G. E. (2010). Rectified linear units improve restricted Boltzmann machines. ICML.
Nakkiran et al., 2021: Nakkiran, P., Kaplun, G., Bansal, Y., Yang, T., Barak, B., & Sutskever, I. (2021). Deep double descent: where bigger models and more data hurt. Journal of Statistical Mechanics: Theory and Experiment, 2021(12), 124003.
Naor & Reingold, 1999: Naor, M., & Reingold, O. (1999). On the construction of pseudorandom permutations: Luby–Rackoff revisited. Journal of Cryptology, 12(1), 29–66.
Neal, 1996: Neal, R. M. (1996). Bayesian Learning for Neural Networks. Springer.
Nesterov, 2018: Nesterov, Y. (2018). Lectures on Convex Optimization. Springer.
Nesterov & Vial, 2000: Nesterov, Y., & Vial, J.-P. (2000). Confidence level solutions for stochastic programming. Automatica, 44(6), 1559–1568.
Neyman, 1937: Neyman, J. (1937). Outline of a theory of statistical estimation based on the classical theory of probability. Philosophical Transactions of the Royal Society of London. Series A, Mathematical and Physical Sciences, 236(767), 333–380.
Norelli et al., 2022: Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodolà, E., & Locatello, F. (2022). ASIF: coupled data turns unimodal models to multimodal without training. ArXiv:2210.01738.
Novak et al., 2018: Novak, R., Xiao, L., Lee, J., Bahri, Y., Yang, G., Hron, J., … Sohl-Dickstein, J. (2018). Bayesian deep convolutional networks with many channels are Gaussian processes. ArXiv:1810.05148.
Novikoff, 1962: Novikoff, A. B. J. (1962). On convergence proofs on perceptrons. Proceedings of the Symposium on the Mathematical Theory of Automata (pp. 615–622).
Olshausen & Field, 1996: Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607–609.
Ong et al., 2005: Ong, C. S., Smola, A., & Williamson, R. (2005). Learning the kernel with hyperkernels. Journal of Machine Learning Research, 6, 1043–1071.
OpenAI, 2023: OpenAI. (2023). GPT-4 Technical Report. ArXiv:2303.08774.
Ouyang et al., 2022: Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C. L., Mishkin, P., … et al. (2022). Training language models to follow instructions with human feedback. ArXiv:2203.02155.
Papineni et al., 2002: Papineni, K., Roukos, S., Ward, T., & Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (pp. 311–318).
Parikh et al., 2016: Parikh, A. P., Täckström, O., Das, D., & Uszkoreit, J. (2016). A decomposable attention model for natural language inference. ArXiv:1606.01933.
Park et al., 2019: Park, T., Liu, M.-Y., Wang, T.-C., & Zhu, J.-Y. (2019). Semantic image synthesis with spatially-adaptive normalization. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2337–2346).
Parzen, 1957: Parzen, E. (1957). On consistent estimates of the spectrum of a stationary time series. Annals of Mathematical Statistics, 28, 329–348.
Paszke et al., 2019: Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., … et al. (2019). PyTorch: an imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 8026–8037.
Paulus et al., 2017: Paulus, R., Xiong, C., & Socher, R. (2017). A deep reinforced model for abstractive summarization. ArXiv:1705.04304.
Penedo et al., 2023: Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., … Launay, J. (2023). The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. ArXiv:2306.01116.
Pennington et al., 2017: Pennington, J., Schoenholz, S., & Ganguli, S. (2017). Resurrecting the sigmoid in deep learning through dynamical isometry: theory and practice. Advances in Neural Information Processing Systems (pp. 4785–4795).
Pennington et al., 2014: Pennington, J., Socher, R., & Manning, C. (2014). GloVe: global vectors for word representation. Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP) (pp. 1532–1543).
Peters et al., 2017a: Peters, J., Janzing, D., & Schölkopf, B. (2017). Elements of Causal Inference: Foundations and Learning Algorithms. MIT Press.
Peters et al., 2017b: Peters, M., Ammar, W., Bhagavatula, C., & Power, R. (2017). Semi-supervised sequence tagging with bidirectional language models. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, Volume 1 (pp. 1756–1765).
Peters et al., 2018: Peters, M., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., & Zettlemoyer, L. (2018). Deep contextualized word representations. Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (pp. 2227–2237).
Petersen & Pedersen, 2008: Petersen, K. B., & Pedersen, M. S. (2008). The Matrix Cookbook. Technical University of Denmark.
Pleiss et al., 2017: Pleiss, G., Chen, D., Huang, G., Li, T., Van Der Maaten, L., & Weinberger, K. Q. (2017). Memory-efficient implementation of DenseNets. ArXiv:1707.06990.
Polyak, 1964: Polyak, B. T. (1964). Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5), 1–17.
Prakash et al., 2016: Prakash, A., Hasan, S. A., Lee, K., Datla, V., Qadir, A., Liu, J., & Farri, O. (2016). Neural paraphrase generation with stacked residual LSTM networks. ArXiv:1610.03098.
Qin et al., 2023: Qin, C., Zhang, A., Zhang, Z., Chen, J., Yasunaga, M., & Yang, D. (2023). Is ChatGPT a general-purpose natural language processing task solver? ArXiv:2302.06476.
Quadrana et al., 2018: Quadrana, M., Cremonesi, P., & Jannach, D. (2018). Sequence-aware recommender systems. ACM Computing Surveys, 51(4), 66.
Quinlan, 1993: Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. Elsevier.
Rabiner & Juang, 1993: Rabiner, L., & Juang, B.-H. (1993). Fundamentals of Speech Recognition. Prentice-Hall.
Radford et al., 2021: Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., … et al. (2021). Learning transferable visual models from natural language supervision. International Conference on Machine Learning (pp. 8748–8763).
Radford et al., 2015: Radford, A., Metz, L., & Chintala, S. (2015). Unsupervised representation learning with deep convolutional generative adversarial networks. ArXiv:1511.06434.
Radford et al., 2018: Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving language understanding by generative pre-training. OpenAI.
Radford et al., 2019: Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI Blog, 1(8), 9.
Radosavovic et al., 2019: Radosavovic, I., Johnson, J., Xie, S., Lo, W.-Y., & Dollár, P. (2019). On network design spaces for visual recognition. Proceedings of the IEEE/CVF International Conference on Computer Vision (pp. 1882–1890).
Radosavovic et al., 2020: Radosavovic, I., Kosaraju, R. P., Girshick, R., He, K., & Dollár, P. (2020). Designing network design spaces. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 10428–10436).
Rae et al., 2021: Rae, J. W., Borgeaud, S., Cai, T., Millican, K., Hoffmann, J., Song, F., … et al. (2021). Scaling language models: methods, analysis & insights from training Gopher. ArXiv:2112.11446.
Raffel et al., 2020: Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., … Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21, 1–67.
Rajpurkar et al., 2016: Rajpurkar, P., Zhang, J., Lopyrev, K., & Liang, P. (2016). SQuAD: 100,000+ questions for machine comprehension of text. ArXiv:1606.05250.
Ramachandran et al., 2019: Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., & Shlens, J. (2019). Stand-alone self-attention in vision models. Advances in Neural Information Processing Systems, 32.
Ramachandran et al., 2017: Ramachandran, P., Zoph, B., & Le, Q. V. (2017). Searching for activation functions. ArXiv:1710.05941.
Ramesh et al., 2022: Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., & Chen, M. (2022). Hierarchical text-conditional image generation with CLIP latents. ArXiv:2204.06125.
Cajal & Azoulay, 1894: Ramón y Cajal, S., & Azoulay, L. (1894). Les Nouvelles Idées sur la Structure du Système Nerveux chez l'Homme et chez les Vertébrés. Paris, C. Reinwald & Cie.
Ranzato et al., 2007: Ranzato, M.-A., Boureau, Y.-L., Chopra, S., & LeCun, Y. (2007). A unified energy-based framework for unsupervised learning. Artificial Intelligence and Statistics (pp. 371–379).
Rasmussen & Williams, 2006: Rasmussen, C. E., & Williams, C. K. (2006). Gaussian Processes for Machine Learning. MIT Press.
Reddi et al., 2019: Reddi, S. J., Kale, S., & Kumar, S. (2019). On the convergence of Adam and beyond. ArXiv:1904.09237.
Redmon et al., 2016: Redmon, J., Divvala, S., Girshick, R., & Farhadi, A. (2016). You only look once: unified, real-time object detection. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 779–788).
Redmon & Farhadi, 2018: Redmon, J., & Farhadi, A. (2018). YOLOv3: an incremental improvement. ArXiv:1804.02767.
Reed & DeFreitas, 2015: Reed, S., & De Freitas, N. (2015). Neural programmer-interpreters. ArXiv:1511.06279.
Reed et al., 2022: Reed, S., Zolna, K., Parisotto, E., Colmenarejo, S. G., Novikov, A., Barth-Maron, G., … et al. (2022). A generalist agent. ArXiv:2205.06175.
Ren et al., 2015: Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems (pp. 91–99).
Rendle, 2010: Rendle, S. (2010). Factorization machines. 2010 IEEE International Conference on Data Mining (pp. 995–1000).
Rendle et al., 2009: Rendle, S., Freudenthaler, C., Gantner, Z., & Schmidt-Thieme, L. (2009). BPR: Bayesian personalized ranking from implicit feedback. Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (pp. 452–461).
Revels et al., 2016: Revels, J., Lubin, M., & Papamarkou, T. (2016). Forward-mode automatic differentiation in Julia. ArXiv:1607.07892.
Rezende et al., 2014: Rezende, D. J., Mohamed, S., & Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. International Conference on Machine Learning (pp. 1278–1286).
Riesenhuber & Poggio, 1999: Riesenhuber, M., & Poggio, T. (1999). Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11), 1019–1025.
Rockafellar, 1970: Rockafellar, R. T. (1970). Convex Analysis. Princeton University Press.
Rolnick et al., 2017: Rolnick, D., Veit, A., Belongie, S., & Shavit, N. (2017). Deep learning is robust to massive label noise. ArXiv:1705.10694.
Rudin, 1973: Rudin, W. (1973). Functional Analysis. McGraw-Hill.
Rumelhart et al., 1988: Rumelhart, D. E., Hinton, G. E., & Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive Modeling, 5(3), 1.
Russakovsky et al., 2013: Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? International Conference on Computer Vision (ICCV).
Russakovsky et al., 2015: Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., … et al. (2015). ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 211–252.
Russell & Norvig, 2016: Russell, S. J., & Norvig, P. (2016). Artificial Intelligence: A Modern Approach. Pearson Education Limited.
Saharia et al., 2022: Saharia, C., Chan, W., Saxena, S., Li, L., Whang, J., Denton, E., … et al. (2022). Photorealistic text-to-image diffusion models with deep language understanding. ArXiv:2205.11487.
Salinas et al., 2022: Salinas, D., Seeger, M., Klein, A., Perrone, V., Wistuba, M., & Archambeau, C. (2022). Syne Tune: a library for large scale hyperparameter tuning and reproducible research. First Conference on Automated Machine Learning.
Sanh et al., 2019: Sanh, V., Debut, L., Chaumond, J., & Wolf, T. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. ArXiv:1910.01108.
Sanh et al., 2021: Sanh, V., Webson, A., Raffel, C., Bach, S. H., Sutawika, L., Alyafeai, Z., … et al. (2021). Multitask prompted training enables zero-shot task generalization. ArXiv:2110.08207.
Santurkar et al., 2018: Santurkar, S., Tsipras, D., Ilyas, A., & Madry, A. (2018). How does batch normalization help optimization? Advances in Neural Information Processing Systems (pp. 2483–2493).
Sarwar et al., 2001: Sarwar, B. M., Karypis, G., Konstan, J. A., & Riedl, J. (2001). Item-based collaborative filtering recommendation algorithms. Proceedings of 10th International Conference on World Wide Web (pp. 285–295).
Scao et al., 2022: Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., … et al. (2022). BLOOM: a 176B-parameter open-access multilingual language model. ArXiv:2211.05100.
Schein et al., 2002: Schein, A. I., Popescul, A., Ungar, L. H., & Pennock, D. M. (2002). Methods and metrics for cold-start recommendations. Proceedings of the 25th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 253–260).
Schuhmann et al., 2022: Schuhmann, C., Beaumont, R., Vencu, R., Gordon, C., Wightman, R., Cherti, M., … et al. (2022). LAION-5B: an open large-scale dataset for training next generation image-text models. ArXiv:2210.08402.
Schuster & Paliwal, 1997: Schuster, M., & Paliwal, K. K. (1997). Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45(11), 2673–2681.
Scholkopf et al., 2001: Schölkopf, B., Herbrich, R., & Smola, A. J. (2001). A generalized representer theorem. In D. P. Helmbold & B. Williamson (Eds.), Proceedings of the Annual Conference on Computational Learning Theory (pp. 416–426). Springer-Verlag.
Scholkopf et al., 1996: Schölkopf, B., Burges, C., & Vapnik, V. (1996). Incorporating invariances in support vector learning machines. International Conference on Artificial Neural Networks (pp. 47–52).
Scholkopf & Smola, 2002: Schölkopf, B., & Smola, A. J. (2002). Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press.
Sedhain et al., 2015: Sedhain, S., Menon, A. K., Sanner, S., & Xie, L. (2015). AutoRec: autoencoders meet collaborative filtering. Proceedings of the 24th International Conference on World Wide Web (pp. 111–112).
Sennrich et al., 2015: Sennrich, R., Haddow, B., & Birch, A. (2015). Neural machine translation of rare words with subword units. ArXiv:1508.07909.
Sergeev & DelBalso, 2018: Sergeev, A., & Del Balso, M. (2018). Horovod: fast and easy distributed deep learning in TensorFlow. ArXiv:1802.05799.
Shannon, 1948: Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
Shao et al., 2020: Shao, H., Yao, S., Sun, D., Zhang, A., Liu, S., Liu, D., … Abdelzaher, T. (2020). ControlVAE: controllable variational autoencoder. Proceedings of the 37th International Conference on Machine Learning.
Shaw et al., 2018: Shaw, P., Uszkoreit, J., & Vaswani, A. (2018). Self-attention with relative position representations. ArXiv:1803.02155.
Shoeybi et al., 2019: Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., & Catanzaro, B. (2019). Megatron-LM: training multi-billion parameter language models using model parallelism. ArXiv:1909.08053.
Silver et al., 2016: Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., Van Den Driessche, G., … et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529(7587), 484–489.
Silverman, 1986: Silverman, B. W. (1986). Density Estimation for Statistical and Data Analysis. Chapman and Hall.
Simard et al., 1998: Simard, P. Y., LeCun, Y. A., Denker, J. S., & Victorri, B. (1998). Transformation invariance in pattern recognition – tangent distance and tangent propagation. Neural Networks: Tricks of the Trade (pp. 239–274). Springer.
Simonyan & Zisserman, 2014: Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ArXiv:1409.1556.
Sindhwani et al., 2015: Sindhwani, V., Sainath, T. N., & Kumar, S. (2015). Structured transforms for small-footprint deep learning. ArXiv:1510.01722.
Sivic & Zisserman, 2003: Sivic, J., & Zisserman, A. (2003). Video Google: a text retrieval approach to object matching in videos. Proceedings of the IEEE International Conference on Computer Vision (pp. 1470–1477).
Smith et al., 2022: Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., … et al. (2022). Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, a large-scale generative language model. ArXiv:2201.11990.
Smola & Narayanamurthy, 2010: Smola, A., & Narayanamurthy, S. (2010). An architecture for parallel topic models. Proceedings of the VLDB Endowment, 3(1-2), 703–710.
Snoek et al., 2012: Snoek, J., Larochelle, H., & Adams, R. (2012). Practical Bayesian optimization of machine learning algorithms. Advances in Neural Information Processing Systems 25 (pp. 2951–2959).
Sohl-Dickstein et al., 2015: Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., & Ganguli, S. (2015). Deep unsupervised learning using nonequilibrium thermodynamics. International Conference on Machine Learning (pp. 2256–2265).
Song & Ermon, 2019: Song, Y., & Ermon, S. (2019). Generative modeling by estimating gradients of the data distribution. Advances in Neural Information Processing Systems, 32.
Song et al., 2021: Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-based generative modeling through stochastic differential equations. International Conference on Learning Representations.
Speelpenning, 1980: Speelpenning, B. (1980). Compiling fast partial derivatives of functions given by algorithms (Doctoral dissertation). University of Illinois at Urbana-Champaign.
Srivastava et al., 2022: Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., … et al. (2022). Beyond the imitation game: quantifying and extrapolating the capabilities of language models. ArXiv:2206.04615.
Srivastava et al., 2014: Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1), 1929–1958.
Srivastava et al., 2015: Srivastava, R. K., Greff, K., & Schmidhuber, J. (2015). Highway networks. ArXiv:1505.00387.
Strang, 1993: Strang, G. (1993). Introduction to Linear Algebra. Wellesley–Cambridge Press.
Su & Khoshgoftaar, 2009: Su, X., & Khoshgoftaar, T. M. (2009). A survey of collaborative filtering techniques. Advances in Artificial Intelligence, 2009.
Sukhbaatar et al., 2015: Sukhbaatar, S., Weston, J., & Fergus, R. (2015). End-to-end memory networks. Advances in Neural Information Processing Systems (pp. 2440–2448).
Sutskever et al., 2013: Sutskever, I., Martens, J., Dahl, G., & Hinton, G. (2013). On the importance of initialization and momentum in deep learning. International Conference on Machine Learning (pp. 1139–1147).
Sutskever et al., 2014: Sutskever, I., Vinyals, O., & Le, Q. V. (2014). Sequence to sequence learning with neural networks. Advances in Neural Information Processing Systems (pp. 3104–3112).
Szegedy et al., 2017: Szegedy, C., Ioffe, S., Vanhoucke, V., & Alemi, A. A. (2017). Inception-v4, Inception-ResNet and the impact of residual connections on learning. 31st AAAI Conference on Artificial Intelligence.
Szegedy et al., 2015: Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., … Rabinovich, A. (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1–9).
Szegedy et al., 2016: Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., & Wojna, Z. (2016). Rethinking the Inception architecture for computer vision. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2818–2826).
Tallec & Ollivier, 2017: Tallec, C., & Ollivier, Y. (2017). Unbiasing truncated backpropagation through time. ArXiv:1705.08209.
Tan & Le, 2019: Tan, M., & Le, Q. (2019). EfficientNet: rethinking model scaling for convolutional neural networks. International Conference on Machine Learning (pp. 6105–6114).
Tang & Wang, 2018: Tang, J., & Wang, K. (2018). Personalized top-n sequential recommendation via convolutional sequence embedding. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining (pp. 565–573).
Taskar et al., 2004: Taskar, B., Guestrin, C., & Koller, D. (2004). Max-margin Markov networks. Advances in Neural Information Processing Systems, 16, 25.
Tay et al., 2020: Tay, Y., Dehghani, M., Bahri, D., & Metzler, D. (2020). Efficient transformers: a survey. ArXiv:2009.06732.
Taylor et al., 2022: Taylor, R., Kardas, M., Cucurull, G., Scialom, T., Hartshorn, A., Saravia, E., … Stojnic, R. (2022). Galactica: a large language model for science. ArXiv:2211.09085.
Teye et al., 2018: Teye, M., Azizpour, H., & Smith, K. (2018). Bayesian uncertainty estimation for batch normalized deep networks. ArXiv:1802.06455.
Thomee et al., 2016: Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., … Li, L.-J. (2016). YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2), 64–73.
Tieleman & Hinton, 2012: Tieleman, T., & Hinton, G. (2012). Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, Lecture 6.5-rmsprop.
Tikhonov & Arsenin, 1977: Tikhonov, A. N., & Arsenin, V. Y. (1977). Solutions of Ill-Posed Problems. W.H. Winston.
Tolstikhin et al., 2021: Tolstikhin, I. O., Houlsby, N., Kolesnikov, A., Beyer, L., Zhai, X., Unterthiner, T., … et al. (2021). MLP-Mixer: an all-MLP architecture for vision. Advances in Neural Information Processing Systems, 34.
Torralba et al., 2008: Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(11), 1958–1970.
Touvron et al., 2021: Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. International Conference on Machine Learning (pp. 10347–10357).
Touvron et al., 2023a: Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M.-A., Lacroix, T., … et al. (2023a). LLaMA: open and efficient foundation language models. ArXiv:2302.13971.
Touvron et al., 2023b: Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., … et al. (2023b). LLaMA 2: open foundation and fine-tuned chat models. ArXiv:2307.09288.
Tsoumakas & Katakis, 2007: Tsoumakas, G., & Katakis, I. (2007). Multi-label classification: an overview. International Journal of Data Warehousing and Mining, 3(3), 1–13.
Turing, 1950: Turing, A. (1950). Computing machinery and intelligence. Mind, 59(236), 433.
Toscher et al., 2009: Töscher, A., Jahrer, M., & Bell, R. M. (2009). The BigChaos solution to the Netflix grand prize.
Uijlings et al., 2013: Uijlings, J. R., Van De Sande, K. E., Gevers, T., & Smeulders, A. W. (2013). Selective search for object recognition. International Journal of Computer Vision, 104(2), 154–171.
Vapnik, 1995: Vapnik, V. (1995). The Nature of Statistical Learning Theory. New York: Springer.
Vapnik, 1998: Vapnik, V. (1998). Statistical Learning Theory. New York: John Wiley and Sons.
Vapnik & Chervonenkis, 1964: Vapnik, V., & Chervonenkis, A. (1964). A note on one class of perceptrons. Automation and Remote Control, 25.
Vapnik & Chervonenkis, 1968: Vapnik, V., & Chervonenkis, A. (1968). Uniform convergence of frequencies of occurrence of events to their probabilities. Dokl. Akad. Nauk SSSR, 181, 915–918.
Vapnik & Chervonenkis, 1971: Vapnik, V., & Chervonenkis, A. (1971). On the uniform convergence of relative frequencies of events to their probabilities. Theory Probab. Appl., 16(2), 264–281.
Vapnik & Chervonenkis, 1981: Vapnik, V., & Chervonenkis, A. (1981). The necessary and sufficient conditions for the uniform convergence of averages to their expected values. Teoriya Veroyatnostei i Ee Primeneniya, 26(3), 543–564.
Vapnik & Chervonenkis, 1991: Vapnik, V., & Chervonenkis, A. (1991). The necessary and sufficient conditions for consistency in the empirical risk minimization method. Pattern Recognition and Image Analysis, 1(3), 283–305.
Vapnik & Chervonenkis, 1974: Vapnik, V. N., & Chervonenkis, A. Y. (1974). Ordered risk minimization. Automation and Remote Control, 35, 1226–1235, 1403–1412.
Vapnik, 1992: Vapnik, V. (1992). Principles of risk minimization for learning theory. Advances in Neural Information Processing Systems (pp. 831–838).
Vapnik et al., 1994: Vapnik, V., Levin, E., & Le Cun, Y. (1994). Measuring the VC-dimension of a learning machine. Neural Computation, 6(5), 851–876.
Vaswani et al., 2017: Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., … Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (pp. 5998–6008).
Wahba, 1990: Wahba, G. (1990). Spline Models for Observational Data. SIAM.
Waibel et al., 1989: Waibel, A., Hanazawa, T., Hinton, G., Shikano, K., & Lang, K. J. (1989). Phoneme recognition using time-delay neural networks. IEEE Transactions on Acoustics, Speech, and Signal Processing, 37(3), 328–339.
Wang et al., 2022: Wang, H., Zhang, A., Zheng, S., Shi, X., Li, M., & Wang, Z. (2022). Removing batch normalization boosts adversarial training. International Conference on Machine Learning (pp. 23433–23445).
Wang et al., 2018: Wang, L., Li, M., Liberty, E., & Smola, A. J. (2018). Optimal message scheduling for aggregation. Networks, 2(3), 2–3.
Wang et al., 2019: Wang, Q., Li, B., Xiao, T., Zhu, J., Li, C., Wong, D. F., & Chao, L. S. (2019). Learning deep transformer models for machine translation. Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (pp. 1810–1822).
Wang et al., 2023: Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2023). Self-consistency improves chain of thought reasoning in language models. International Conference on Learning Representations.
Wang et al., 2016: Wang, Y., Davidson, A., Pan, Y., Wu, Y., Riffel, A., & Owens, J. D. (2016). Gunrock: a high-performance graph processing library on the GPU. ACM SIGPLAN Notices (p. 11).
Warstadt et al., 2019: Warstadt, A., Singh, A., & Bowman, S. R. (2019). Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7, 625–641.
Wasserman, 2013: Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer.
Watkins & Dayan, 1992: Watkins, C. J., & Dayan, P. (1992). Q-learning. Machine Learning, 8(3–4), 279–292.
Watson, 1964: Watson, G. S. (1964). Smooth regression analysis. Sankhyā: The Indian Journal of Statistics, Series A, pp. 359–372.
Wei et al., 2021: Wei, J., Bosma, M., Zhao, V. Y., Guu, K., Yu, A. W., Lester, B., … Le, Q. V. (2021). Finetuned language models are zero-shot learners. ArXiv:2109.01652.
Wei et al., 2022a: Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., … et al. (2022). Emergent abilities of large language models. ArXiv:2206.07682.
Wei et al., 2022b: Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E., Le, Q., & Zhou, D. (2022). Chain of thought prompting elicits reasoning in large language models. ArXiv:2201.11903.
Welling & Teh, 2011: Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. Proceedings of the 28th International Conference on Machine Learning (ICML-11) (pp. 681–688).
Wengert, 1964: Wengert, R. E. (1964). A simple automatic derivative evaluation program. Communications of the ACM, 7(8), 463–464.
Werbos, 1990: Werbos, P. J. (1990). Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10), 1550–1560.
Wigner, 1958: Wigner, E. P. (1958). On the distribution of the roots of certain symmetric matrices. Ann. Math. (pp. 325–327).
Wilson & Izmailov, 2020: Wilson, A. G., & Izmailov, P. (2020). Bayesian deep learning and a probabilistic perspective of generalization. Advances in Neural Information Processing Systems, 33, 4697–4708.
Wistuba et al., 2019: Wistuba, M., Rawat, A., & Pedapati, T. (2019). A survey on neural architecture search. ArXiv:1905.01392 [cs.LG].
Wistuba et al., 2018: Wistuba, M., Schilling, N., & Schmidt-Thieme, L. (2018). Scalable Gaussian process-based transfer surrogates for hyperparameter optimization. Machine Learning, 108, 43–78.
Wolpert & Macready, 1995: Wolpert, D. H., & Macready, W. G. (1995). No free lunch theorems for search. Technical Report SFI-TR-95-02-010, Santa Fe Institute.
Wood et al., 2011: Wood, F., Gasthaus, J., Archambeau, C., James, L., & Teh, Y. W. (2011). The sequence memoizer. Communications of the ACM, 54(2), 91–98.
Wu et al., 2018: Wu, B., Wan, A., Yue, X., Jin, P., Zhao, S., Golmant, N., … Keutzer, K. (2018). Shift: a zero flop, zero parameter alternative to spatial convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 9127–9135).
Wu et al., 2016: Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., … et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. ArXiv:1609.08144.
Xiao et al., 2017: Xiao, H., Rasul, K., & Vollgraf, R. (2017). Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. ArXiv:1708.07747.
Xiao et al., 2018: Xiao, L., Bahri, Y., Sohl-Dickstein, J., Schoenholz, S., & Pennington, J. (2018). Dynamical isometry and a mean field theory of CNNs: how to train 10,000-layer vanilla convolutional neural networks. International Conference on Machine Learning (pp. 5393–5402).
Xie et al., 2017: Xie, S., Girshick, R., Dollár, P., Tu, Z., & He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1492–1500).
Xiong et al., 2020: Xiong, R., Yang, Y., He, D., Zheng, K., Zheng, S., Xing, C., … Liu, T. (2020). On layer normalization in the transformer architecture. International Conference on Machine Learning (pp. 10524–10533).
Xiong et al., 2018: Xiong, W., Wu, L., Alleva, F., Droppo, J., Huang, X., & Stolcke, A. (2018). The Microsoft 2017 conversational speech recognition system. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5934–5938).
Yamaguchi et al., 1990: Yamaguchi, K., Sakamoto, K., Akabane, T., & Fujimoto, Y. (1990). A neural network for speaker-independent isolated word recognition. First International Conference on Spoken Language Processing.
Yang et al., 2016: Yang, Z., Hu, Z., Deng, Y., Dyer, C., & Smola, A. (2016). Neural machine translation with recurrent attention modeling. ArXiv:1607.05108.
Yang et al., 2015: Yang, Z., Moczulski, M., Denil, M., De Freitas, N., Smola, A., Song, L., & Wang, Z. (2015). Deep fried convnets. Proceedings of the IEEE International Conference on Computer Vision (pp. 1476–1483).
Ye et al., 2011: Ye, M., Yin, P., Lee, W.-C., & Lee, D.-L. (2011). Exploiting geographical influence for collaborative point-of-interest recommendation. Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 325–334).
You et al., 2017: You, Y., Gitman, I., & Ginsburg, B. (2017). Large batch training of convolutional networks. ArXiv:1708.03888.
Yu et al., 2022: Yu, J., Xu, Y., Koh, J. Y., Luong, T., Baid, G., Wang, Z., … Wu, Y. (2022). Scaling autoregressive models for content-rich text-to-image generation. ArXiv:2206.10789.
Zaheer et al., 2018: Zaheer, M., Reddi, S., Sachan, D., Kale, S., & Kumar, S. (2018). Adaptive methods for nonconvex optimization. Advances in Neural Information Processing Systems (pp. 9793–9803).
Zeiler, 2012: Zeiler, M. D. (2012). ADADELTA: an adaptive learning rate method. ArXiv:1212.5701.
Zeiler & Fergus, 2013: Zeiler, M. D., & Fergus, R. (2013). Stochastic pooling for regularization of deep convolutional neural networks. ArXiv:1301.3557.
Zhang et al., 2021a: Zhang, A., Tay, Y., Zhang, S., Chan, A., Luu, A. T., Hui, S. C., & Fu, J. (2021). Beyond fully-connected layers with quaternions: parameterization of hypercomplex multiplications with 1/n parameters. International Conference on Learning Representations.
Zhang et al., 2021b: Zhang, C., Bengio, S., Hardt, M., Recht, B., & Vinyals, O. (2021). Understanding deep learning (still) requires rethinking generalization. Communications of the ACM, 64(3), 107–115.
Zhang et al., 2019: Zhang, S., Yao, L., Sun, A., & Tay, Y. (2019). Deep learning based recommender system: a survey and new perspectives. ACM Computing Surveys, 52(1), 5.
Zhang et al., 2022: Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., … et al. (2022). OPT: open pre-trained transformer language models. ArXiv:2205.01068.
Zhang et al., 1988: Zhang, W., Tanida, J., Itoh, K., & Ichioka, Y. (1988). Shift-invariant pattern recognition neural network and its optical architecture. Proceedings of Annual Conference of the Japan Society of Applied Physics.
Zhang et al., 2021c: Zhang, Y., Sun, P., Jiang, Y., Yu, D., Yuan, Z., Luo, P., … Wang, X. (2021). ByteTrack: multi-object tracking by associating every detection box. ArXiv:2110.06864.
Zhang et al., 2023a: Zhang, Z., Zhang, A., Li, M., & Smola, A. (2023). Automatic chain of thought prompting in large language models. International Conference on Learning Representations.
Zhang et al., 2023b: Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., & Smola, A. (2023). Multimodal chain-of-thought reasoning in language models. ArXiv:2302.00923.
Zhao et al., 2019: Zhao, Z.-Q., Zheng, P., Xu, S.-t., & Wu, X. (2019). Object detection with deep learning: a review. IEEE Transactions on Neural Networks and Learning Systems, 30(11), 3212–3232.
Zhou et al., 2023: Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., … Chi, E. (2023). Least-to-most prompting enables complex reasoning in large language models. International Conference on Learning Representations.
Zhu et al., 2017: Zhu, J.-Y., Park, T., Isola, P., & Efros, A. A. (2017). Unpaired image-to-image translation using cycle-consistent adversarial networks. Proceedings of the IEEE International Conference on Computer Vision (pp. 2223–2232).
Zhu et al., 2015: Zhu, Y., Kiros, R., Zemel, R., Salakhutdinov, R., Urtasun, R., Torralba, A., & Fidler, S. (2015). Aligning books and movies: towards story-like visual explanations by watching movies and reading books. Proceedings of the IEEE International Conference on Computer Vision (pp. 19–27).
Zoph & Le, 2016: Zoph, B., & Le, Q. V. (2016). Neural architecture search with reinforcement learning. ArXiv:1611.01578.
