Dive into Deep Learning adds attention mechanism chapter
The newest chapter addresses a problem that often bedevils nonparametric machine learning models.
The global machine-learning market is expanding rapidly. In fact, Fortune Business Insights predicts that the market for machine-learning products will reach $117 billion by 2027. That growth has been accompanied by a spike in the demand for machine-learning skills.
“Check out our brand-new attention chapter, such as simpler illustrations using Nadaraya-Watson kernel regression (1964) before Bahdanau attention. Now PyTorch adaptation is complete for Vol.1. The first 1/3 part is also translated into Chinese and Turkish. https://t.co/MzhIsafM8o” (tweet from Dive into Deep learning (D2L.ai), @D2L_ai, January 6, 2021)
An Amazon team has helped those who are working on meeting that demand by creating a book called Dive into Deep Learning (aka D2L). Published in 2020 and drafted entirely through individually executable Jupyter notebooks, the book is a living document, with each update triggering updates to the PDF, HTML, and notebook versions available as open source.
Its original authors are Aston Zhang, an AWS senior applied scientist; Zachary Lipton, an Amazon visiting academic and assistant professor of Operations Research and Machine Learning at Carnegie Mellon University; Mu Li, AWS senior principal scientist; and Alex Smola, AWS vice president and distinguished scientist. The number of contributing authors is expanding as the book evolves to cover additional topics, deep learning frameworks such as PyTorch and TensorFlow, and languages such as Chinese and Turkish.
A recent addition is a chapter on “Attention Mechanisms”. Adds Smola: “Attention is one of the coolest things to happen in deep learning recently. It’s almost embarrassing that we didn’t already have something on it.”
The new chapter addresses a problem that often bedevils nonparametric machine learning models, namely that they must operate on varying amounts of information that need to be put into context.
Consider human vision: we can have hundreds of things within our field of vision, but our brain picks out what’s important to prevent sensory overload. That can range from the archaic (“Do I see a tiger near my camp?”) to the modern (“Where did I put my red coffee cup?”). This allows humans to adapt to information-rich scenes on the fly. Attending to only parts of a scene is both computationally economical and statistically effective.
“We can use an attention mechanism to solve computer vision tasks such as image recognition problems,” says Zhang. “That’s why we thought we needed a dedicated chapter to introduce some of the exciting stuff going on in this field. It also reflects how the field of deep learning keeps evolving: even as we add new material to the book, there are new developments to keep up with.”
Smola compares the problem to playing chess with chess pieces and a chessboard, versus playing with neither. “A physical chessboard allows players to focus on locations of specific pawns and rooks and the moves they might make,” he says. “Without a chessboard, players have to remember all the information and options on the board before making a move.”
While important in computer vision, attention also plays a key role in natural language processing (NLP). In these applications, a computer’s attention may be directed to specific words that suggest a logical next word or thought, e.g. the word “bank” may be followed by “deposit.” That accelerates a computer’s ability to predict or analyze speech and words. In fact, modern attention was invented by Bahdanau, Cho, and Bengio in 2014 to deal with the problem that models for machine translation had to “remember” the source when generating the translation. This is a challenge, particularly for long sentences. Attention allows the model to keep on peeking at the source and what it’s already translated to improve accuracy.
In the new chapter, Dive into Deep Learning offers an overview of research in the psychology of attention, which dates back to the 19th century, and how to apply our understanding of it to computers. That includes developing models for how a computer should “weight” what it sees in order of importance.
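The idea of weighting inputs by importance can be illustrated with Nadaraya-Watson kernel regression, which the D2L team's announcement names as the chapter's introductory example. The sketch below is not taken from the book: the function names, the Gaussian kernel, the `bandwidth` parameter, and the toy data are all assumptions chosen for illustration. Each known input (a "key") gets a weight based on how close it is to the point being asked about (the "query"), and the prediction is the weighted average of the known outputs (the "values").

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def nadaraya_watson(query, keys, values, bandwidth=1.0):
    """Predict a value for `query` as an attention-weighted average.

    Assumption: a Gaussian kernel, so the score of each key is
    -((query - key) / bandwidth)^2 / 2. Closer keys get larger weights.
    """
    scores = [-0.5 * ((query - k) / bandwidth) ** 2 for k in keys]
    weights = softmax(scores)
    prediction = sum(w * v for w, v in zip(weights, values))
    return prediction, weights


# Hypothetical toy data: four observed (key, value) pairs.
keys = [0.0, 1.0, 2.0, 3.0]
values = [0.0, 1.0, 4.0, 9.0]

prediction, weights = nadaraya_watson(1.5, keys, values)
print("attention weights:", [round(w, 3) for w in weights])
print("prediction:", round(prediction, 3))
```

Because the query 1.5 sits midway between the keys 1.0 and 2.0, those two keys receive the largest (and equal) weights, while the more distant keys contribute less. Shrinking `bandwidth` concentrates the weights more sharply on the nearest keys.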
One challenge in the attention field is that it remains computationally intensive, says Smola. In recent years the “Transformer” model, which uses attention to relate the words in a sequence to one another, has emerged to help address that challenge. Transformers drive some of the most powerful NLP models today, such as GPT-3.
“Transformers have revolutionized NLP,” says Smola. “They allow me to jump forward and back many steps — in a sequence of words, for instance — very efficiently, almost like a wave form, rather than just one step at a time. That creates a lot more computational efficiency.”
The new D2L chapter addresses key elements of Transformers and how to apply them to NLP or computer-vision problems.
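One of those key elements is scaled dot-product attention, the core operation of the Transformer. The following is a minimal sketch in plain Python rather than the book's MXNet or PyTorch code. As a simplifying assumption, the token vectors themselves serve as queries, keys, and values, whereas a real Transformer first passes them through learned linear projections. The function names and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    largest = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """For each query, return a weighted average of the value vectors.

    The score between a query and a key is their dot product divided by
    sqrt(d), where d is the key dimension.
    """
    d = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys
        ]
        weights = softmax(scores)
        output = [
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hypothetical toy "sentence": three tokens, each a 2-dimensional vector.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Self-attention: every token attends to every token, including itself.
outputs, attention = scaled_dot_product_attention(tokens, tokens, tokens)
for row in attention:
    print([round(w, 3) for w in row])
```

Every token is compared with every other token in a single pass, which is what lets a Transformer relate distant positions in a sequence directly. It is also why the cost grows with the square of the sequence length.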
The attention chapter is rich with models, examples, and code that students can experiment with as they go along. Examples are offered in both MXNet and PyTorch, giving users a choice of machine-learning frameworks.
The authors also focused on making the subject of attention approachable to students. Attention can be a difficult concept to conquer, with subtle mathematics and sometimes abstract concepts.
Meanwhile, the first third of the new chapter has been translated into Chinese and Turkish. China is one of the biggest markets for Dive into Deep Learning; 20 universities there use it as a textbook. The Turkish version came about due to the large community of machine-learning experts in that country.
As an adjunct to the book, the D2L team has also launched a series of “boot camp” video tutorials on its YouTube channel. Smola, who like many other people during the pandemic is working from home, discovered he had some time on his hands. He decided to spend some of it creating new videos, with his 11-year-old daughter helping with set-staging. Each video lasts from 15 to 20 minutes and is modeled on relevant chapters in Dive into Deep Learning.
The team is working on a print version of Dive into Deep Learning and is also evolving the model for additional content, where the original authors focus less on development of original content, and more on being stewards of what the broader machine-learning community contributes.