Automatically generating text from structured data

Technique that lets devices convey information in natural language improves on state of the art.

Data-to-text generation converts information from a structured format such as a table into natural language. This allows structured information to be read or listened to, as when a device displays a weather forecast or a voice assistant answers a question.

Language models trained on billions of sentences learn common linguistic patterns and can generate natural-sounding sentences by predicting likely sequences of words. However, in data-to-text generation we want to generate language that not only is fluent but also conveys content accurately. 

Some approaches to data-to-text generation use a pipeline of machine learning models to turn the data into text, but this can be labor intensive to create, and pipelining poses the risk that errors in one step will compound in later steps.

In the Alexa AI organization, we’ve developed a neural, end-to-end, data-to-text generation system called DataTuner, which can be used for a variety of data types and topics to generate fluent and accurate texts. We've released the DataTuner code on GitHub under a noncommercial license.

Alexa AI's new DataTuner software can convert structured information, such as the relationships encoded by knowledge graphs, into texts that are both semantically faithful and fluent.
Credit: Glynis Condon

At last year’s International Conference on Computational Linguistics (COLING), we presented a paper in which we compared our approach to its best-performing predecessors, using four data-to-text data sets. On automated metrics, DataTuner pushes the state of the art by significant margins, from 1.2 to 5.9 points according to the BLEU algorithm for evaluating text quality.

Human annotators also graded our responses as both more natural-sounding and more accurate. In fact, on two of the four data sets, our texts were judged to be more natural-sounding, on average, than human-written texts.

Annotator evaluations showed that DataTuner improved the semantic accuracy of generated texts, with margins ranging from 5.3% to 40%. Our paper also introduces a model-based approach for measuring the accuracy of generated texts, an approach that is 4.2% to 14.2% more accurate at detecting errors than previous hand-crafted approaches. 

Semantic fidelity vs. fluency

To get a sense of the problem we address, consider an example in which we have some structured information about Michelle Obama that we want to convey to our readers or listeners. That information is organized in the entity-relation-entity format typical of knowledge graphs.

Michelle Obama | author of | Becoming 
Michelle Obama | birthplace | Chicago, Illinois, USA
Princeton University | alma mater of | Michelle Obama
Harvard University | alma mater of | Michelle Obama

We could imagine a text that conveys the meaning accurately but doesn’t sound very natural:

Michelle Obama is the author of Becoming. Michelle Obama was born in Chicago, Illinois, USA. Michelle Obama was educated at Princeton University. Michelle Obama was educated at Harvard University.

This text has high semantic fidelity but low fluency.

Alternatively, we could imagine a text that sounds very fluent but doesn’t accurately convey the information: 

Born in Chicago, Illinois, and educated at Harvard, Michelle Obama is the author of A Promised Land.

This text adds information that is not in the data (A Promised Land) and omits information that is (Becoming and Princeton University), so it has low semantic fidelity even though it has high fluency.

Pipeline-based approaches to data-to-text generation typically consist of steps such as (1) ordering the content; (2) dividing the content into sentences; (3) finding the right words and phrases to express the data (lexicalization and referring-expression generation); and (4) joining it all together to produce the final text (realization). These approaches usually generalize well to new concepts because of the separate lexicalization step, but they can be difficult to maintain, and each step requires its own training data, which can be labor intensive to acquire.

End-to-end approaches are trained on [data, text] pairs that can be gathered more easily, but it’s difficult to guarantee the semantic fidelity of the results. This is the problem we address with DataTuner.

The DataTuner model

DataTuner’s approach has two steps, generation and reranking. 

First, our language model generates texts from data. In our experiments, we started with a pretrained language model that could generate text, the GPT-2 model. To adapt it to the data-to-text task, we trained it on concatenated data and text, using the special tokens <data> and <text> to indicate which was which. When we use the trained model to generate text, the only input is the data.
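
As a minimal sketch of that input format, the training sequence and the generation-time prompt might be assembled as follows. The function names, the use of single spaces as separators, and the choice to end the prompt with <text> are assumptions made for illustration; they are not taken from the released DataTuner code.

```python
DATA_TOKEN = "<data>"
TEXT_TOKEN = "<text>"


def build_training_sequence(data: str, text: str) -> str:
    """Concatenate data and text, marking each segment with its special token."""
    return f"{DATA_TOKEN} {data} {TEXT_TOKEN} {text}"


def build_generation_prompt(data: str) -> str:
    """Build the generation-time input, which contains only the data.

    Assumption: the prompt ends with <text> so the model continues with the text.
    """
    return f"{DATA_TOKEN} {data} {TEXT_TOKEN}"
```

Under these assumptions, every training sequence begins with the prompt the model would see at generation time, so the model learns to continue a data segment with its matching text.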

During training, the inputs to DataTuner's data-to-text model are data and text, separated by the special tokens <data> and <text>. At runtime, the only input is the data.
Credit: Hamza Harkous

Inside the model, we concatenate several types of embeddings, or vector representations whose spatial relationships indicate relationships between data (see figure above). The first type is token embeddings, which encode semantic information about individual input words. The second is an embedding that represents words’ positions in the text.

We also introduce what we call fine-grained state embeddings. To produce these, we use special tokens that indicate structural relationships between data items.

For example, we would convert the data triple Michelle Obama | author of | Becoming into the string <subject> Michelle Obama <predicate> author of <object> Becoming, with <subject>, <object>, and <predicate> as special tokens. The state embedding for any token is that of the special token that most recently precedes it; for example, the token Becoming will get the state embedding of <object>. 
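
The rule for assigning states can be sketched in a few lines. This is an illustration only: it splits on whitespace, whereas GPT-2 works on subword tokens, and it assumes that each special token is assigned its own state. The function names are hypothetical.

```python
from typing import List, Optional, Tuple

SPECIAL_TOKENS = ("<subject>", "<predicate>", "<object>")


def linearize_triple(subject: str, predicate: str, obj: str) -> str:
    """Turn a data triple into a string marked up with special tokens."""
    return f"<subject> {subject} <predicate> {predicate} <object> {obj}"


def assign_states(linearized: str) -> List[Tuple[str, Optional[str]]]:
    """Pair each token with the special token that most recently precedes it.

    Assumptions: whitespace tokenization, and a special token takes itself
    as its state.
    """
    pairs = []
    current = None
    for token in linearized.split():
        if token in SPECIAL_TOKENS:
            current = token
        pairs.append((token, current))
    return pairs
```

For the triple above, the token Becoming is paired with <object>, while Michelle and Obama are both paired with <subject>.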

Second, we train a semantic-fidelity classifier. It takes the input data and a generated text and identifies whether the text accurately conveys the data or whether it adds, repeats, omits, or changes any of the content. We use this to rerank the generated texts according to accuracy.
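
The reranking step itself is simple once a scoring function is available. The sketch below assumes a function that maps a (data, text) pair to a number that is higher for more accurate texts; toy_score is a crude string-matching stand-in used only to make the example runnable, not the trained classifier.

```python
from typing import Callable, List


def rerank(data: str, candidates: List[str],
           accuracy_score: Callable[[str, str], float]) -> List[str]:
    """Order candidate texts from most to least accurate for the given data.

    Python's sort is stable, so candidates with equal scores keep their
    original generation order.
    """
    return sorted(candidates, key=lambda text: accuracy_score(data, text), reverse=True)


def toy_score(data: str, text: str) -> float:
    """Hypothetical stand-in scorer: fraction of data fields that appear in the text."""
    fields = [field.strip() for field in data.split("|")]
    return sum(field in text for field in fields) / len(fields)
```

With the data Michelle Obama | author of | Becoming, a candidate that names A Promised Land scores lower than one that names Becoming, so it moves down the list.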

The classifier is trained using the same data we used to train our language model. Our original [data, text] pairs give us the examples that are to be classified as accurate. To get inaccurate examples, we use rule-based corruptions of the accurate [data, text] pairs. For example, we could take the training pair (Michelle Obama | author of | Becoming) and “Michelle Obama wrote Becoming” and swap the entities to create the inaccurate [data, text] pair (Michelle Obama | author of | The Gruffalo) and “Michelle Obama wrote Becoming”.
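
One such corruption rule, the entity swap in the example above, might look like the following sketch. The two label names and the function names are assumptions for illustration; the classifier described here distinguishes more kinds of error than this single rule produces.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]
Example = Tuple[Triple, str, str]  # (data, text, label)


def swap_object(triple: Triple, replacement: str) -> Triple:
    """Replace the object entity of a triple, leaving subject and predicate intact."""
    subject, predicate, _ = triple
    return (subject, predicate, replacement)


def make_examples(triple: Triple, text: str, replacement: str) -> List[Example]:
    """Return the original pair as accurate and a corrupted copy as inaccurate.

    The text is left unchanged, so after the swap it no longer matches the data.
    """
    return [
        (triple, text, "accurate"),
        (swap_object(triple, replacement), text, "inaccurate"),
    ]
```

Because only the data side changes, each accurate training pair can yield inaccurate ones without any extra annotation.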

For this classifier we use the RoBERTa language model with an additional classification layer, an approach that has been successful in other tasks, such as natural-language inference. For each input token (either data or text), we take the token embeddings, positional embeddings, and segment embeddings (embeddings of the tokens that distinguish text and data) and sum these element-wise to provide the input to RoBERTa’s first layer. A final single-layer neural network produces a classification label.
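
The element-wise sum can be illustrated with plain Python lists. The lookup tables, their two-dimensional toy vectors, and the function name are all hypothetical; a real model would use learned embedding matrices inside a deep-learning framework.

```python
from typing import Dict, List

Vector = List[float]


def input_vectors(tokens: List[str], segments: List[str],
                  token_table: Dict[str, Vector],
                  position_table: List[Vector],
                  segment_table: Dict[str, Vector]) -> List[Vector]:
    """Sum token, positional, and segment embeddings element-wise for each token."""
    vectors = []
    for position, (token, segment) in enumerate(zip(tokens, segments)):
        parts = (token_table[token], position_table[position], segment_table[segment])
        # zip(*parts) groups the same dimension from all three embeddings.
        vectors.append([sum(values) for values in zip(*parts)])
    return vectors
```

Summing keeps the input the same size as a single embedding, so the extra segment information does not change the shape that the first layer expects.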

Evaluation

We experimented with four different data sets in different formats, including news texts, restaurant reviews, and chats about video games. We evaluated the texts we generated both with automated metrics and by asking human annotators to rate fluency and accuracy via Amazon Mechanical Turk. 

In our experiments, we saw that a model trained without the fine-grained state embeddings is less accurate than a model with them and that adding the semantic-fidelity classifier boosts accuracy further.

We also examined the cases in which our generated texts were assessed as better than human-written texts, and we suspect that the reason is that our model learned to produce standard formulations, whereas humans sometimes write in non-standard or informal ways that other people might find less fluent.

We also investigated the use of our semantic-fidelity classifier as a method for automatically evaluating the accuracy of texts generated by different models and found that, for two data sets, it was a significantly better predictor of annotators’ evaluations than existing heuristic approaches.

About the Author
Isabel Groves is a computational linguist in the Alexa AI organization.
