Simplifying BERT-based models to increase efficiency, capacity

New method would enable BERT-based natural-language-processing models to handle longer text strings, run in resource-constrained settings — or sometimes both.

In recent years, many of the best-performing models in the field of natural-language processing (NLP) have been built on top of BERT language models. Pretrained on large corpora of (unlabeled) public texts, BERT models encode the probabilities of sequences of words. Because a BERT model begins with extensive knowledge of a language as a whole, it can be fine-tuned on a more targeted task — like question answering or machine translation — with relatively little labeled data.

BERT models, however, are very large, and BERT-based NLP models can be slow — even prohibitively slow, for users with limited computational resources. Their complexity also limits the length of the inputs they can take, as their memory footprint scales with the square of the input length.

A simplified illustration of the Pyramid-BERT architecture.

At this year’s meeting of the Association for Computational Linguistics (ACL), my colleagues and I presented a new method, called Pyramid-BERT, that reduces the training time, inference time, and memory footprint of BERT-based models, without sacrificing much accuracy. The reduced memory footprint also enables BERT models to operate on longer text sequences.

BERT-based models take sequences of sentences as inputs and output vector representations — embeddings — of both each sentence as a whole and its constituent words individually. Downstream applications such as text classification and ranking, however, use only the complete-sentence embeddings. To make BERT-based models more efficient, we progressively eliminate redundant individual-word embeddings in intermediate layers of the network, while trying to minimize the effect on the complete-sentence embeddings.

We compare Pyramid-BERT to several state-of-the-art techniques for making BERT models more efficient and show that we can speed inference up 3- to 3.5-fold while suffering an accuracy drop of only 1.5%, whereas, at the same speeds, the best existing method loses 2.5% of its accuracy.


Moreover, when we apply our method to Performers — variations on BERT models that are specifically designed for long texts — we can reduce the models’ memory footprint by 70%, while actually increasing accuracy. At that compression rate, the best existing approach suffers an accuracy dropoff of 4%.

A token’s progress

Each sentence input to a BERT model is broken into units called tokens. Most tokens are words, but some are subword fragments, some are individual letters of acronyms, and so on. The start of each input is demarcated by a special token called — for reasons that will soon be clear — CLS, for classification.

Each token passes through a series of encoders — usually somewhere between four and 12 — each of which produces a new embedding for each input token. Each encoder has an attention mechanism, which decides how much each token’s embedding should reflect information carried by other tokens.

For instance, given the sentence “Bob told his brother that he was starting to get on his nerves,” the attention mechanism should pay more attention to the word “Bob” when encoding the word “his” but to the word “brother” when encoding the word “he”. It’s because the attention mechanism must compare every token in an input sequence to every other that a BERT model’s memory footprint scales with the square of the input length.
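That quadratic scaling is easy to see in code. The NumPy sketch below (illustrative only — real BERT attention adds learned projections, multiple heads, and a scaling factor) computes naive dot-product attention weights; for a length-L sequence, the weight matrix has L × L entries, which is what dominates memory:

```python
import numpy as np

def attention_weights(embeddings):
    """Naive dot-product self-attention weights: every token is compared
    with every other token, so for a length-L sequence the weight matrix
    has L x L entries -- hence the quadratic memory footprint."""
    scores = embeddings @ embeddings.T           # (L, L) pairwise similarities
    scores -= scores.max(axis=1, keepdims=True)  # for numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=1, keepdims=True)  # softmax over rows

L, d = 128, 16
tokens = np.random.default_rng(0).normal(size=(L, d))
weights = attention_weights(tokens)
print(weights.shape)  # the (L, L) matrix is the quadratic-memory culprit
```

Doubling the sequence length quadruples the size of this matrix, which is why input length is the binding constraint for standard BERT models.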


As tokens pass through the series of encoders, their embeddings factor in more and more information about other tokens in the sequence, since they’re attending to other tokens that are also factoring in more and more information. By the time the tokens pass through the final encoder, the embedding of the CLS token ends up representing the sentence as a whole (hence the CLS token’s name). But its embedding is also very similar to those of all the other tokens in the sentence. That’s the redundancy we’re trying to remove.

The basic idea is that, in each of the network’s encoders, we preserve the embedding of the CLS token but select a representative subset — a core set — of the other tokens’ embeddings.

Embeddings are vectors, so they can be interpreted as points in a multidimensional space. To construct core sets we would, ideally, sort embeddings into clusters of equal diameter and select the center point — the centroid — of each cluster.

Ideally, for each encoder in the network, we would construct a representative subset of token embeddings (green dots) by selecting the centroids (red dots) of token clusters (circles). The centroids would then pass to the next layer of the network.
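To make the ideal concrete, here is a small NumPy sketch that approximates it with ordinary k-means, keeping from each cluster the real token embedding nearest the cluster centroid. This is purely illustrative — the paper does not construct core sets this way, and the function name and defaults are ours:

```python
import numpy as np

def kmeans_core_set(embeddings, num_clusters=8, iters=20, seed=0):
    """Illustrative only: cluster token embeddings with plain k-means and
    keep, from each cluster, the actual token nearest the centroid."""
    rng = np.random.default_rng(seed)
    # initialize centroids at randomly chosen token embeddings
    centroids = embeddings[
        rng.choice(len(embeddings), size=num_clusters, replace=False)
    ].copy()
    for _ in range(iters):
        # assign each token to its nearest centroid
        dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its cluster (skip empty clusters)
        for c in range(num_clusters):
            members = embeddings[labels == c]
            if len(members) > 0:
                centroids[c] = members.mean(axis=0)
    # replace each centroid by the nearest real token embedding
    dists = np.linalg.norm(embeddings[:, None] - centroids[None], axis=2)
    return sorted(set(dists.argmin(axis=0).tolist()))
```

Even this stand-in hints at the cost problem: each iteration compares every token to every centroid, and an exact clustering with equal-diameter guarantees is far harder still.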

Unfortunately, the problem of constructing such a core set for a layer of a neural network is NP-hard, meaning that computing an exact solution is impractically time consuming.

As an alternative, our paper proposes a greedy algorithm that selects n members of the core set at a time. At each layer, we take the embedding of the CLS token, and then we find the n embeddings farthest from it in the representational space. We add those, along with the CLS embedding, to our core set. Then we find the n embeddings whose minimum distance from any of the points already in our core set is greatest, and we add those to the core set.


We repeat this process until the core set reaches the desired size. The result is provably a good approximation of the optimal core set.
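The greedy selection described above can be sketched in a few lines of NumPy. This is a minimal stand-in, not the paper's implementation — the function name and batch-update details are ours, and in the real model the selection operates on an encoder layer's outputs inside the network:

```python
import numpy as np

def greedy_core_set(embeddings, cls_index=0, target_size=8, n=2):
    """Greedy core-set selection: start from the CLS embedding, then
    repeatedly add the n tokens whose minimum distance to the current
    core set is largest, until target_size tokens are selected.
    Returns the sorted indices of the selected tokens."""
    selected = [cls_index]
    # distance from every token to the nearest already-selected token
    min_dist = np.linalg.norm(embeddings - embeddings[cls_index], axis=1)
    min_dist[cls_index] = -np.inf  # never reselect a chosen token
    while len(selected) < target_size:
        k = min(n, target_size - len(selected))
        # the k tokens currently farthest from the core set
        batch = np.argpartition(min_dist, -k)[-k:]
        for idx in batch:
            selected.append(int(idx))
            # update every token's distance to the enlarged core set
            min_dist = np.minimum(
                min_dist,
                np.linalg.norm(embeddings - embeddings[idx], axis=1),
            )
            min_dist[idx] = -np.inf
    return sorted(selected)
```

Selecting n points per step, rather than one, is what keeps the selection itself cheap enough that it doesn't eat the speedup it is meant to deliver.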

Finally, in our paper, we consider the question of how large each layer’s core set should be. We use an exponential-decay function to determine how much the core set shrinks from one layer to the next, and we investigate the trade-offs between accuracy and speedup or memory reduction that result from selecting different rates of decay.
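For intuition, under an exponential-decay schedule with rate r, layer k retains on the order of L·r^k tokens. The helper below is illustrative — its name and default decay rate are ours, not the paper's tuned values:

```python
import math

def core_set_sizes(seq_len, num_layers, decay=0.7):
    """Core-set size at each encoder layer under exponential decay:
    layer k keeps roughly seq_len * decay**k tokens (never fewer than 1,
    since the CLS embedding is always retained)."""
    return [max(1, math.ceil(seq_len * decay ** k))
            for k in range(num_layers + 1)]

print(core_set_sizes(128, 6))
```

A faster decay rate yields bigger speedups and memory savings but discards more information in the early layers, which is exactly the trade-off the paper's experiments map out.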

Acknowledgements: Ashish Khetan, Rene Bidart, Zohar Karnin
