Automatically generating text from structured data

Technique that lets devices convey information in natural language improves on state of the art.

Data-to-text generation converts information from a structured format such as a table into natural language. This allows structured information to be read or listened to, as when a device displays a weather forecast or a voice assistant answers a question.

Language models trained on billions of sentences learn common linguistic patterns and can generate natural-sounding sentences by predicting likely sequences of words. However, in data-to-text generation we want to generate language that not only is fluent but also conveys content accurately. 

Some approaches to data-to-text generation use a pipeline of machine learning models to turn the data into text, but this can be labor intensive to create, and pipelining poses the risk that errors in one step will compound in later steps.

In the Alexa AI organization, we’ve developed a neural, end-to-end, data-to-text generation system called DataTuner, which can be used for a variety of data types and topics to generate fluent and accurate texts. We've released the DataTuner code on GitHub under a noncommercial license.

[Image: DataTuner.png]
Alexa AI's new DataTuner software can convert structured information, such as the relationships encoded by knowledge graphs, into texts that are both semantically faithful and fluent.
Credit: Glynis Condon

At last year’s International Conference on Computational Linguistics (COLING), we presented a paper in which we compared our approach to its best-performing predecessors, using four data-to-text data sets. On automated metrics, DataTuner pushes the state of the art by significant margins, from 1.2 to 5.9 points according to the BLEU algorithm for evaluating text quality.

Human annotators also graded our texts as both more natural-sounding and more accurate than those of the previous best systems. In fact, on two of the four data sets, our texts were judged to be more natural-sounding, on average, than human-written texts.

Annotator evaluations showed that DataTuner improved the semantic accuracy of generated texts, with margins ranging from 5.3% to 40%. Our paper also introduces a model-based approach for measuring the accuracy of generated texts, an approach that is 4.2% to 14.2% more accurate at detecting errors than previous hand-crafted approaches. 

Semantic fidelity vs. fluency

To get a sense of the problem we address, consider an example in which we have some structured information about Michelle Obama that we want to convey to our readers or listeners. That information is organized in the entity-relation-entity format typical of knowledge graphs.

Michelle Obama | author of | Becoming 
Michelle Obama | birthplace | Chicago, Illinois, USA
Princeton University | alma mater of | Michelle Obama
Harvard University | alma mater of | Michelle Obama

We could imagine a text that conveys the meaning accurately but doesn’t sound very natural:

Michelle Obama is the author of Becoming. Michelle Obama was born in Chicago, Illinois, USA. Michelle Obama was educated at Princeton University. Michelle Obama was educated at Harvard University.

This text has high semantic fidelity but low fluency.

Alternatively, we could imagine a text that sounds very fluent but doesn’t accurately convey the information: 

Born in Chicago, Illinois, and educated at Harvard, Michelle Obama is the author of A Promised Land.

This text adds some information and leaves some out, so it has low semantic fidelity even though it has high fluency.

Pipeline-based approaches to data-to-text generation typically consist of steps such as (1) ordering the content; (2) dividing the content into sentences; (3) finding the right words and phrases to express the data (lexicalization and referring-expression generation); and (4) joining it all together to produce the final text (realization). These approaches usually generalize well to new concepts because of the separate lexicalization step, but they can be difficult to maintain and require training data for each step, which can be labor intensive to acquire.
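To make that division of labor concrete, here is a minimal, purely hypothetical sketch of such a pipeline in Python. The stage functions and template rules are our own illustration, not any particular published system:

def order_content(triples):
    # (1) Content ordering: decide the sequence in which facts are expressed.
    return sorted(triples, key=lambda t: t[1])

def plan_sentences(ordered_triples):
    # (2) Sentence planning: group facts into sentences (here, one per fact).
    return [[t] for t in ordered_triples]

def lexicalize(plan):
    # (3) Lexicalization: choose words and phrases for each fact, using
    # hand-written templates in this toy version.
    templates = {"author of": "{s} is the author of {o}.",
                 "birthplace": "{s} was born in {o}."}
    return [templates.get(p, "{s} {p} {o}.").format(s=s, p=p, o=o)
            for s, p, o in plan]

def realize(phrases):
    # (4) Realization: join everything together to produce the final text.
    return " ".join(phrases)

triples = [("Michelle Obama", "author of", "Becoming"),
           ("Michelle Obama", "birthplace", "Chicago, Illinois, USA")]
print(realize([phrase
               for plan in plan_sentences(order_content(triples))
               for phrase in lexicalize(plan)]))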

End-to-end approaches are trained on [data, text] pairs that can be gathered more easily, but it’s difficult to guarantee the semantic fidelity of the results. This is the problem we address with DataTuner.

The DataTuner model

DataTuner’s approach has two steps, generation and reranking. 

First, our language model generates texts from data. In our experiments, we started with a pretrained language model that could generate text, the GPT-2 model. To adapt it to the data-to-text task, we trained it on concatenated data and text, using the special tokens <data> and <text> to indicate which was which. When we use the trained model to generate text, the only input is the data.
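As a rough sketch of this setup (an illustration using the Hugging Face transformers library, not DataTuner's released training code), fine-tuning and generation might look like this:

from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")

# Register the special tokens that mark the data and text segments.
tokenizer.add_special_tokens({"additional_special_tokens": ["<data>", "<text>"]})
model.resize_token_embeddings(len(tokenizer))

# One training example: data and text concatenated into a single sequence.
data = "Michelle Obama | author of | Becoming"
text = "Michelle Obama wrote Becoming."
example = f"<data> {data} <text> {text}"

# Standard causal-language-model fine-tuning: the labels are the inputs.
inputs = tokenizer(example, return_tensors="pt")
loss = model(**inputs, labels=inputs["input_ids"]).loss
loss.backward()  # an optimizer step would follow in a full training loop

# At generation time, the model is prompted with the data portion only.
prompt = tokenizer(f"<data> {data} <text>", return_tensors="pt")
output_ids = model.generate(prompt["input_ids"],
                            max_new_tokens=40,
                            pad_token_id=tokenizer.eos_token_id)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))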

[Image: DataTuner architecture.png]
During training, the inputs to DataTuner's data-to-text model are data and text, separated by the special tokens <data> and <text>. At runtime, the only input is the data.
Credit: Hamza Harkous

Inside the model, we concatenate several types of embeddings, or vector representations whose spatial relationships indicate relationships between data (see figure above). The first type is token embeddings, which encode semantic information about individual input words. The second is positional embeddings, which represent each word's position in the text.

We also introduce what we call fine-grained state embeddings. To produce these, we use special tokens that indicate structural relationships between data items.

For example, we would convert the data triple Michelle Obama | author of | Becoming into the string <subject> Michelle Obama <predicate> author of <object> Becoming, with <subject>, <object>, and <predicate> as special tokens. The state embedding for any token is that of the special token that most recently precedes it; for example, the token Becoming will get the state embedding of <object>. 
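The bookkeeping behind these state embeddings is simple. Here is a minimal sketch (our own illustration, not DataTuner's code) of how each token can be assigned the state of the most recent special token:

STATE_TOKENS = {"<subject>", "<predicate>", "<object>"}

def state_labels(tokens):
    # Each token is assigned the most recent special token that precedes it;
    # the corresponding state embedding is the embedding of that special
    # token, concatenated with the token and positional embeddings.
    states, current = [], None
    for tok in tokens:
        if tok in STATE_TOKENS:
            current = tok
        states.append(current)
    return states

tokens = ["<subject>", "Michelle", "Obama",
          "<predicate>", "author", "of",
          "<object>", "Becoming"]
print(list(zip(tokens, state_labels(tokens))))
# ("Becoming", "<object>"): the token Becoming gets the state of <object>.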

Second, we train a semantic-fidelity classifier. This takes the input data and a generated text and identifies whether the text accurately conveys the data or whether it adds, repeats, omits, or changes any of the content. We use this classifier to rerank the generated texts according to accuracy.
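In outline, the reranking step amounts to sorting candidate texts by the classifier's fidelity score; the interface below is a hypothetical sketch of that idea:

def rerank(data, candidates, fidelity_score):
    # fidelity_score(data, text) returns, for example, the classifier's
    # probability that the text conveys the data accurately.
    return sorted(candidates, key=lambda text: fidelity_score(data, text),
                  reverse=True)

# The candidates come from the language model (e.g., via beam search or
# sampling); the top-ranked candidate is returned as the final text.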

The classifier is trained using the same data we used to train our language model. Our original [data, text] pairs give us the examples that are to be classified as accurate. To get inaccurate examples, we use rule-based corruptions of the accurate [data, text] pairs. For example, we could take the training pair (Michelle Obama | author of | Becoming) and “Michelle Obama wrote Becoming” and swap the entities to create the inaccurate [data, text] pair (Michelle Obama | author of | the Gruffalo) and “Michelle Obama wrote Becoming”.
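For illustration, one such corruption, swapping an entity in the data so that it no longer matches the text, might be implemented along these lines (the swap rule and entity pool are assumptions for the sketch):

import random

def swap_object(triple, entity_pool):
    # Replace the object entity with a different entity so that the data no
    # longer matches the original text.
    subject, predicate, obj = triple
    replacement = random.choice([e for e in entity_pool if e != obj])
    return (subject, predicate, replacement)

accurate_pair = (("Michelle Obama", "author of", "Becoming"),
                 "Michelle Obama wrote Becoming.")
corrupted_data = swap_object(accurate_pair[0], ["Becoming", "The Gruffalo"])
inaccurate_pair = (corrupted_data, accurate_pair[1])  # labeled as inaccurate
print(inaccurate_pair)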

For this classifier we use the RoBERTa language model with an additional classification layer, an approach that has been successful in other tasks, such as natural-language inference. For each input token (either data or text), we take the token embeddings, positional embeddings, and segment embeddings (which distinguish data tokens from text tokens) and sum these element-wise to provide the input to RoBERTa’s first layer. A final single-layer neural network produces a classification label.
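A simplified version of this classifier can be sketched with the Hugging Face transformers library. Here the data/text distinction is handled by RoBERTa's standard sentence-pair encoding rather than by explicit segment embeddings, and the five-way label set is only an example:

import torch
from transformers import RobertaForSequenceClassification, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
# Example label set: accurate plus four error types (added, repeated,
# omitted, or changed content).
model = RobertaForSequenceClassification.from_pretrained("roberta-base",
                                                         num_labels=5)

data = "Michelle Obama | author of | Becoming"
text = "Michelle Obama wrote Becoming."

# Encode the (data, text) pair; embeddings are combined inside the model to
# form the input to the first layer, and a classification head on top of the
# final layer produces the label.
inputs = tokenizer(data, text, return_tensors="pt")
logits = model(**inputs).logits
print(torch.argmax(logits, dim=-1).item())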

Evaluation

We experimented with four different data sets in different formats, including news texts, restaurant reviews, and chats about video games. We evaluated the texts we generated both with automated metrics and by asking human annotators to rate fluency and accuracy via Amazon Mechanical Turk. 

In our experiments, we saw that a model trained without the fine-grained state embeddings is less accurate than a model with them and that adding the semantic-fidelity classifier boosts accuracy further.

We also examined the cases in which our generated texts were assessed as better than human-written texts, and we suspect that the reason is that our model learned to produce standard formulations, whereas humans sometimes write in non-standard or informal ways that other people might find less fluent.

We also investigated the use of our semantic-fidelity classifier as a method for automatically evaluating the accuracy of texts generated by different models and found that, for two datasets, it was a significantly better predictor of annotators’ evaluations than existing heuristic approaches.
