Automatically generating text from structured data

Technique that lets devices convey information in natural language improves on state of the art.

Data-to-text generation converts information from a structured format such as a table into natural language. This allows structured information to be read or listened to, as when a device displays a weather forecast or a voice assistant answers a question.

Language models trained on billions of sentences learn common linguistic patterns and can generate natural-sounding sentences by predicting likely sequences of words. However, in data-to-text generation we want to generate language that not only is fluent but also conveys content accurately. 

Some approaches to data-to-text generation use a pipeline of machine learning models to turn the data into text, but this can be labor intensive to create, and pipelining poses the risk that errors in one step will compound in later steps.

In the Alexa AI organization, we’ve developed a neural, end-to-end, data-to-text generation system called DataTuner, which can be used for a variety of data types and topics to generate fluent and accurate texts. We've released the DataTuner code on GitHub under a noncommercial license.

Alexa AI's new DataTuner software can convert structured information, such as the relationships encoded by knowledge graphs, into texts that are both semantically faithful and fluent.
Credit: Glynis Condon

At last year’s International Conference on Computational Linguistics (COLING), we presented a paper in which we compared our approach to its best-performing predecessors, using four data-to-text data sets. On automated metrics, DataTuner pushes the state of the art by significant margins, from 1.2 to 5.9 points according to the BLEU algorithm for evaluating text quality.

Human annotators also graded our responses as both more natural-sounding and more accurate. In fact, on two of the four data sets, our texts were judged to be more natural-sounding, on average, than human-written texts.

Annotator evaluations showed that DataTuner improved the semantic accuracy of generated texts, with margins ranging from 5.3% to 40%. Our paper also introduces a model-based approach for measuring the accuracy of generated texts, an approach that is 4.2% to 14.2% more accurate at detecting errors than previous hand-crafted approaches. 

Semantic fidelity vs. fluency

To get a sense of the problem we address, consider an example in which we have some structured information about Michelle Obama that we want to convey to our readers or listeners. That information is organized in the entity-relation-entity format typical of knowledge graphs.

Michelle Obama | author of | Becoming 
Michelle Obama | birthplace | Chicago, Illinois, USA
Princeton University | alma mater of | Michelle Obama
Harvard University | alma mater of | Michelle Obama

We could imagine a text that conveys the meaning accurately but doesn’t sound very natural:

Michelle Obama is the author of Becoming. Michelle Obama was born in Chicago, Illinois, USA. Michelle Obama was educated at Princeton University. Michelle Obama was educated at Harvard University.

This text has high semantic fidelity but low fluency.

Alternatively, we could imagine a text that sounds very fluent but doesn’t accurately convey the information: 

Born in Chicago, Illinois, and educated at Harvard, Michelle Obama is the author of A Promised Land.

This text adds information that is not in the data (A Promised Land) and omits information that is (Becoming and Princeton University), so it has low semantic fidelity even though it has high fluency.

Pipeline-based approaches to data-to-text generation typically consist of steps such as (1) ordering the content; (2) dividing the content into sentences; (3) finding the right words and phrases to express the data (lexicalization and referring-expression generation); and (4) joining it all together to produce the final text (realization). These approaches usually generalize well to new concepts because of the separate lexicalization step, but they can be difficult to maintain, and each step requires its own training data, which can be labor intensive to acquire.

End-to-end approaches are trained on [data, text] pairs that can be gathered more easily, but it’s difficult to guarantee the semantic fidelity of the results. This is the problem we address with DataTuner.

The DataTuner model

DataTuner’s approach has two steps, generation and reranking. 

First, our language model generates texts from data. In our experiments, we started with a pretrained language model that could generate text, the GPT-2 model. To adapt it to the data-to-text task, we trained it on concatenated data and text, using the special tokens <data> and <text> to indicate which was which. When we use the trained model to generate text, the only input is the data.
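As a rough illustration of this input format (not the released DataTuner code), the Python sketch below builds a training string and a generation-time prompt. The function names and the exact layout are assumptions for illustration, in particular the choice to end the prompt with <text> so that the model continues from there.

```python
DATA_TOKEN = "<data>"
TEXT_TOKEN = "<text>"


def build_training_example(data: str, text: str) -> str:
    """Concatenate data and text, marking each segment with its special token."""
    return f"{DATA_TOKEN} {data} {TEXT_TOKEN} {text}"


def build_generation_prompt(data: str) -> str:
    """Build the generation-time input, which contains only the data.

    Assumption: the prompt ends with the <text> token so that the
    language model's continuation is the generated text.
    """
    return f"{DATA_TOKEN} {data} {TEXT_TOKEN}"


example = build_training_example(
    "Michelle Obama | author of | Becoming",
    "Michelle Obama wrote Becoming.",
)
prompt = build_generation_prompt("Michelle Obama | author of | Becoming")
```

With this layout, every training example begins with the prompt that would be used at generation time for the same data.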

During training, the inputs to DataTuner's data-to-text model are data and text, separated by the special tokens <data> and <text>. At runtime, the only input is the data.
Credit: Hamza Harkous

Inside the model, we concatenate several types of embeddings, or vector representations whose spatial relationships indicate relationships between data items (see figure above). The first type is token embeddings, which encode semantic information about individual input words. The second is positional embeddings, which represent words’ positions in the text.

We also introduce what we call fine-grained state embeddings. To produce these, we use special tokens that indicate structural relationships between data items.

For example, we would convert the data triple Michelle Obama | author of | Becoming into the string <subject> Michelle Obama <predicate> author of <object> Becoming, with <subject>, <object>, and <predicate> as special tokens. The state embedding for any token is that of the special token that most recently precedes it; for example, the token Becoming will get the state embedding of <object>. 
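The assignment rule can be sketched in a few lines of Python. This is an illustration under simplifying assumptions, not the DataTuner implementation: tokens are split on whitespace rather than with GPT-2's subword tokenizer, each special token is assumed to take its own state, and the function names are hypothetical.

```python
from typing import List, Optional

SPECIAL_TOKENS = ("<subject>", "<predicate>", "<object>")


def linearize_triple(subject: str, predicate: str, obj: str) -> str:
    """Turn a data triple into a string marked up with special tokens."""
    return f"<subject> {subject} <predicate> {predicate} <object> {obj}"


def state_labels(tokens: List[str]) -> List[Optional[str]]:
    """Label each token with the special token that most recently precedes it.

    Assumption: a special token is labeled with itself. Tokens that appear
    before any special token get None.
    """
    labels: List[Optional[str]] = []
    current: Optional[str] = None
    for token in tokens:
        if token in SPECIAL_TOKENS:
            current = token
        labels.append(current)
    return labels


# Whitespace tokenization is a simplification for readability.
tokens = linearize_triple("Michelle Obama", "author of", "Becoming").split()
labels = state_labels(tokens)
```

Each label would then be mapped to a learned state embedding; here, the token Becoming is labeled <object>, matching the example above.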

Second, we train a semantic-fidelity classifier. This takes the input data and a generated text and identifies whether the text accurately conveys the data or whether it adds, repeats, omits, or changes any of the content. We use the classifier's output to rerank the generated texts according to accuracy.
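The reranking step can be sketched as follows. This is a minimal Python illustration that assumes the classifier's output can be reduced to a single fidelity score per candidate; the rerank function and the toy scorer are hypothetical stand-ins, and a real system would call the trained classifier instead.

```python
from typing import Callable, List


def rerank(
    data: str,
    candidates: List[str],
    fidelity_score: Callable[[str, str], float],
) -> List[str]:
    """Order candidate texts from most to least faithful to the data.

    fidelity_score(data, text) is assumed to return a higher value for
    texts that convey the data more accurately. Ties keep their original
    order because Python's sort is stable.
    """
    return sorted(
        candidates,
        key=lambda text: fidelity_score(data, text),
        reverse=True,
    )


def toy_score(data: str, text: str) -> float:
    """Stand-in scorer: the fraction of data fields that appear verbatim."""
    fields = [field.strip() for field in data.split("|")]
    return sum(field in text for field in fields) / len(fields)


ranked = rerank(
    "Michelle Obama | author of | Becoming",
    ["Michelle Obama wrote A Promised Land.", "Michelle Obama wrote Becoming."],
    toy_score,
)
```

In this toy run, the candidate that mentions Becoming is ranked first because it matches more of the data fields.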

The classifier is trained using the same data we used to train our language model. Our original [data, text] pairs give us the examples that are to be classified as accurate. To get inaccurate examples, we use rule-based corruptions of the accurate [data, text] pairs. For example, we could take the training pair (Michelle Obama | author of | Becoming) and “Michelle Obama wrote Becoming” and swap the entities to create the inaccurate [data, text] pair (Michelle Obama | author of | The Gruffalo) and “Michelle Obama wrote Becoming”.
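One such corruption, the entity swap, might look like the Python sketch below. It is an assumption-laden illustration rather than the rules used in the paper: the function name, the tuple representation of a triple, and the pool of replacement entities are all hypothetical.

```python
import random
from typing import List, Tuple

Triple = Tuple[str, str, str]


def corrupt_by_entity_swap(
    triple: Triple,
    text: str,
    replacement_entities: List[str],
    rng: random.Random,
) -> Tuple[Triple, str]:
    """Create an inaccurate [data, text] pair by replacing the object entity.

    The text is left unchanged, so it no longer matches the corrupted data.
    """
    subject, predicate, obj = triple
    choices = [entity for entity in replacement_entities if entity != obj]
    if not choices:
        raise ValueError("need at least one entity that differs from the original")
    return (subject, predicate, rng.choice(choices)), text


rng = random.Random(0)  # seeded so the corruption is reproducible
corrupted_data, unchanged_text = corrupt_by_entity_swap(
    ("Michelle Obama", "author of", "Becoming"),
    "Michelle Obama wrote Becoming",
    ["Becoming", "The Gruffalo"],
    rng,
)
```

The corrupted pair would be labeled as inaccurate, while the original pair stays in the training set as an accurate example.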

For this classifier we use the RoBERTa language model with an additional classification layer, an approach that has been successful in other tasks, such as natural-language inference. For each input token (either data or text), we take the token embeddings, positional embeddings, and segment embeddings (embeddings of the tokens that distinguish text and data) and sum these element-wise to provide the input to RoBERTa’s first layer. A final single-layer neural network produces a classification label.
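To make the element-wise sum concrete, here is a small Python sketch that uses plain lists in place of tensors. The two-dimensional vectors and the function name are made up for illustration; a real model would use learned embedding tables with hundreds of dimensions.

```python
from typing import List

Vector = List[float]


def sum_input_embeddings(
    token: List[Vector],
    position: List[Vector],
    segment: List[Vector],
) -> List[Vector]:
    """Sum token, positional, and segment embeddings for each input token."""
    if not (len(token) == len(position) == len(segment)):
        raise ValueError("all three embedding sequences must have the same length")
    return [
        [t + p + s for t, p, s in zip(token_vec, position_vec, segment_vec)]
        for token_vec, position_vec, segment_vec in zip(token, position, segment)
    ]


# One input token with illustrative two-dimensional embeddings.
summed = sum_input_embeddings(
    token=[[1.0, 2.0]],
    position=[[0.5, 0.5]],
    segment=[[0.25, 0.0]],
)
```

The result has one vector per input token, with the same dimensionality as each of the three embeddings that were summed.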

Evaluation

We experimented with four different data sets in different formats, including news texts, restaurant reviews, and chats about video games. We evaluated the texts we generated both with automated metrics and by asking human annotators to rate fluency and accuracy via Amazon Mechanical Turk. 

In our experiments, we saw that a model trained without the fine-grained state embeddings is less accurate than a model with them and that adding the semantic-fidelity classifier boosts accuracy further.

We also examined the cases in which our generated texts were assessed as better than human-written texts, and we suspect that the reason is that our model learned to produce standard formulations, whereas humans sometimes write in non-standard or informal ways that other people might find less fluent.

We also investigated the use of our semantic-fidelity classifier as a method for automatically evaluating the accuracy of texts generated by different models and found that, for two data sets, it was a significantly better predictor of annotators’ evaluations than existing heuristic approaches.
