Adversarial training produces synthetic data for machine learning

One-sided label smoothing and techniques to prevent mode collapse improve performance of sentiment analysis models.

Sentiment analysis is the attempt, computationally, to determine from someone’s words how he or she feels about something. It has a host of applications, in market research, media analysis, customer service, and product recommendation, among other things.

Sentiment classifiers are typically machine learning systems, and any given application of sentiment analysis may suffer from a lack of annotated data for training purposes. In a paper I’m presenting at the International Conference on Acoustics, Speech, and Signal Processing, I describe my efforts to build a system that will generate synthetic training data for applications of sentiment analysis where real data is scarce.

Although the results I report are modest — augmenting training sets with my synthetic data improved sentiment classifiers’ accuracy by around 2% — they do demonstrate the viability of the approach. And the paper includes an analysis of the data generated by my system that could point toward techniques for improving its quality.

One of my first design decisions was not to try to generate actual text. Instead, the generator produces embeddings, a uniform way of representing text strings that is ubiquitous in natural-language-understanding applications.

Embeddings represent texts as points — vectors — in a high-dimensional space, such that texts with similar meanings are grouped together. Most common embeddings are based on analyses of huge bodies of text, in which words are judged to have similar meanings if they commonly co-occur with the same groups of other words.

tsne.png._CB469614397_.png
A 2-D projection of the high-dimensional embeddings used to train a sentiment classifier, indicating that the synthetic data (orange) does not yet capture the diversity of the natural data (blue).

The embeddings of any number of words can be averaged to produce a new point in the embedding space, so the length of the embedding vector is fixed, no matter the length of the corresponding text string.

To train my embedding generator, I used a generative adversarial network (GAN), an instance of an increasingly popular machine learning technique called adversarial training. The standard GAN consists of two neural networks, a generator and a discriminator. The discriminator is trained to distinguish between real data and the fake data produced by the generator; at the same time, the generator is trained to fool the discriminator. Hence the adversarial relationship.

In my case, the inputs to the discriminator, both real and fake, had two components: an embedding vector and a one-hot vector. A one-hot vector is a string of zeroes with, somewhere among them, a single one. The location of the one corresponds to a particular property — in this case, the sentiment of the text that the embedding vector (supposedly) represents.

My goal was to train the generator to produce synthetic data that could augment the training data for another neural network, a sentiment classifier. The addition of data produced by this type of plain-vanilla GAN, however, did not improve the sentiment classifier’s performance.

So I made several modifications to the system. The first was to equip it with a simple sentiment classifier, trained only on the real data that the generator was intended to augment. During training, the generator tried not only to fool the discriminator but also to produce one-hot vectors that matched the outputs of the simple classifier. This ensured greater consistency between the semantic content of the synthetic embeddings and the associated sentiments.

The other modifications addressed a problem called mode collapse, common in GANs. If the generator stumbles across a type of output that will reliably fool the discriminator, it has an incentive to restrict itself to outputs of that type. But this leads to very homogenous outputs, and homogenous data is not useful for training neural networks.

In my experiments, I was using two types of data, for both training and testing. One data set consisted of product reviews, the other of movie reviews. The training sets were small, to mimic the case in which training data is scarce.

To combat mode collapse, I first trained the GAN on a much larger set of texts, labeled according to sentiment — a set of Twitter posts, commonly used as a benchmark in the field. The tweets were shorter than the reviews and covered a wider range of topics, but they primed the generator to produce more diverse embeddings and the discriminator to recognize more subtle distinctions. After training the GAN on the tweets, I then fine-tuned it on review data.

As is typical in machine learning, I retrained the GAN several times on the same training data, until further training no longer improved its performance. With each pass through the training data, I added a different, random noise pattern to each training example; the data looked somewhat different each time around. That discouraged the generator from keying in on a single trick for fooling the discriminator.

Finally, I also used a technique called one-sided label smoothing. During training, instead of labeling the inputs to the GAN as 0 or 1 — fake or real — I label them as 0 or .9 — fake or 90% likely to be real. If the discriminator is never more than 90% confident in its classification of real inputs, the generator will keep exploring new options, in an attempt to wring out that extra 10% of certainty.

With these modifications, data produced by the generator led to slight improvements in the sentiment classifier’s performance, 1.6% on the movie reviews and 1.7% on the product reviews.

After each experiment, I used a technique called t-SNE (t-distributed stochastic neighbor embedding) to project the high-dimensional embeddings into a two-dimensional space. As can be seen in the figure above, the fake data never exhibited as much diversity as the real data.

However, after my modifications to the GAN, the data diversity did improve, which suggests a correlation between the diversity of the synthetic data and the performance of the sentiment classifier. GANs were first introduced in 2014, and some more recent GAN architectures appear to do a better job at preventing mode collapse than the architecture I used. In future work, my colleagues and I will explore some of those architectures, as well as experimenting with other techniques for diversifying the generator’s outputs.

Related content

IN, TN, Chennai
DESCRIPTION The Digital Acceleration (DA) team in India is seeking a talented, self-driven Applied Scientist to work on prototyping, optimizing, and deploying ML algorithms for solving Digital businesses problems. Key job responsibilities - Research, experiment and build Proof Of Concepts advancing the state of the art in AI & ML. - Collaborate with cross-functional teams to architect and execute technically rigorous AI projects. - Thrive in dynamic environments, adapting quickly to evolving technical requirements and deadlines. - Engage in effective technical communication (written & spoken) with coordination across teams. - Conduct thorough documentation of algorithms, methodologies, and findings for transparency and reproducibility. - Publish research papers in internal and external venues of repute - Support on-call activities for critical issues 4:14 BASIC QUALIFICATIONS - Experience building machine learning models or developing algorithms for business application - PhD, or a Master's degree and experience in CS, CE, ML or related field - Knowledge of programming languages such as C/C++, Python, Java or Perl - Experience in any of the following areas: algorithms and data structures, parsing, numerical optimization, data mining, parallel and distributed computing, high-performance computing - Proficiency in coding and software development, with a strong focus on machine learning frameworks. - Understanding of relevant statistical measures such as confidence intervals, significance of error measurements, development and evaluation data sets, etc. - Excellent communication skills (written & spoken) and ability to collaborate effectively in a distributed, cross-functional team setting. 4:14 PREFERRED QUALIFICATIONS - 3+ years of building machine learning models or developing algorithms for business application experience - Have publications at top-tier peer-reviewed conferences or journals - Track record of diving into data to discover hidden patterns and conducting error/deviation analysis - Ability to develop experimental and analytic plans for data modeling processes, use of strong baselines, ability to accurately determine cause and effect relations - Exceptional level of organization and strong attention to detail - Comfortable working in a fast paced, highly collaborative, dynamic work environment We are open to hiring candidates to work out of one of the following locations: Chennai, TN, IND
US, VA, Arlington
Machine learning (ML) has been strategic to Amazon from the early years. We are pioneers in areas such as recommendation engines, product search, eCommerce fraud detection, and large-scale optimization of fulfillment center operations. The Generative AI team helps AWS customers accelerate the use of Generative AI to solve business and operational challenges and promote innovation in their organization. As an applied scientist, you are proficient in designing and developing advanced ML models to solve diverse challenges and opportunities. You will be working with terabytes of text, images, and other types of data to solve real-world problems. You'll design and run experiments, research new algorithms, and find new ways of optimizing risk, profitability, and customer experience. We’re looking for talented scientists capable of applying ML algorithms and cutting-edge deep learning (DL) and reinforcement learning approaches to areas such as drug discovery, customer segmentation, fraud prevention, capacity planning, predictive maintenance, pricing optimization, call center analytics, player pose estimation, event detection, and virtual assistant among others. AWS Sales, Marketing, and Global Services (SMGS) is responsible for driving revenue, adoption, and growth from the largest and fastest growing small- and mid-market accounts to enterprise-level customers including public sector. The AWS Global Support team interacts with leading companies and believes that world-class support is critical to customer success. AWS Support also partners with a global list of customers that are building mission-critical applications on top of AWS services. Key job responsibilities The primary responsibilities of this role are to: - Design, develop, and evaluate innovative ML models to solve diverse challenges and opportunities across industries - Interact with customer directly to understand their business problems, and help them with defining and implementing scalable Generative AI solutions to solve them - Work closely with account teams, research scientist teams, and product engineering teams to drive model implementations and new solution About the team About AWS Diverse Experiences AWS values diverse experiences. Even if you do not meet all of the preferred qualifications and skills listed in the job description, we encourage candidates to apply. If your career is just starting, hasn’t followed a traditional path, or includes alternative experiences, don’t let it stop you from applying. Why AWS? Amazon Web Services (AWS) is the world’s most comprehensive and broadly adopted cloud platform. We pioneered cloud computing and never stopped innovating — that’s why customers from the most successful startups to Global 500 companies trust our robust suite of products and services to power their businesses. Inclusive Team Culture Here at AWS, it’s in our nature to learn and be curious. Our employee-led affinity groups foster a culture of inclusion that empower us to be proud of our differences. Ongoing events and learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences, inspire us to never stop embracing our uniqueness. Mentorship & Career Growth We’re continuously raising our performance bar as we strive to become Earth’s Best Employer. That’s why you’ll find endless knowledge-sharing, mentorship and other career-advancing resources here to help you develop into a better-rounded professional. Work/Life Balance We value work-life harmony. Achieving success at work should never come at the expense of sacrifices at home, which is why we strive for flexibility as part of our working culture. When we feel supported in the workplace and at home, there’s nothing we can’t achieve in the cloud. We are open to hiring candidates to work out of one of the following locations: Arlington, VA, USA | Atlanta, GA, USA | Austin, TX, USA | Houston, TX, USA | New York, NJ, USA | New York, NY, USA | San Francisco, CA, USA | Santa Clara, CA, USA | Seattle, WA, USA
US, WA, Seattle
Prime Video offers customers a vast collection of movies, series, and sports—all available to watch on hundreds of compatible devices. U.S. Prime members can also subscribe to 100+ channels including Max, discovery+, Paramount+ with SHOWTIME, BET+, MGM+, ViX+, PBS KIDS, NBA League Pass, MLB.TV, and STARZ with no extra apps to download, and no cable required. Prime Video is just one of the savings, convenience, and entertainment benefits included in a Prime membership. More than 200 million Prime members in 25 countries around the world enjoy access to Amazon’s enormous selection, exceptional value, and fast delivery. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! Key job responsibilities As a Data Scientist at Amazon Prime Video, you will work with massive customer datasets, provide guidance to product teams on metrics of success, and influence feature launch decisions through statistical analysis of the outcomes of A/B experiments. You will develop machine learning models to facilitate understanding of customer's streaming behavior and build predictive models to inform personalization and ranking systems. You will work closely other scientists, economists and engineers to research new ways to improve operational efficiency of deployed models and metrics. A successful candidate will have a strong proven expertise in statistical modeling, machine learning, and experiment design, along with a solid practical understanding of strength and weakness of various scientific approaches. They have excellent communication skills, and can effectively communicate complex technical concepts with a range of technical and non-technical audience. They will be agile and capable of adapting to a fast-paced environment. They have an excellent track-record on delivering impactful projects, simplifying their approaches where necessary. A successful data scientist will own end-to-end team goals, operates with autonomy and strive to meet key deliverables in a timely manner, and with high quality. About the team Prime Video discovery science is a central team which defines customer and business success metrics, models, heuristics and econometric frameworks. The team develops, owns and operates a suite of data science and machine learning models that feed into online systems that are responsible for personalization and search relevance. The team is responsible for Prime Video’s experimentation practice and continuously innovates and upskills teams across the organization on science best practices. The team values diversity, collaboration and learning, and is excited to welcome a new member whose passion and creativity will help the team continue innovating and enhancing customer experience. We are open to hiring candidates to work out of one of the following locations: Seattle, WA, USA
US, CA, Palo Alto
We’re working to improve shopping on Amazon using the conversational capabilities of large language models, and are searching for pioneers who are passionate about technology, innovation, and customer experience, and are ready to make a lasting impact on the industry. You’ll be working with talented scientists, engineers, and technical program managers (TPM) to innovate on behalf of our customers. If you’re fired up about being part of a dynamic, driven team, then this is your moment to join us on this exciting journey! We are open to hiring candidates to work out of one of the following locations: Palo Alto, CA, USA
US, NJ, Newark
Employer: Audible, Inc. Title: Data Scientist II Location: 1 Washington Street, Newark, NJ, 07102 Duties: Design and implement scalable and reliable approaches to support or automate decision making throughout the business. Apply a range of data science techniques and tools combined with subject matter expertise to solve difficult business problems and cases in which the solution approach is unclear. Acquire data by building the necessary SQL/ETL queries. Import processes through various company specific interfaces for accessing RedShift, and S3/edX storage systems. Build relationships with stakeholders and counterparts, and communicate model outputs, observations, and key performance indicators (KPIs) to the management to develop sustainable and consumable products. Explore and analyze data by inspecting univariate distributions and multivariate interactions, constructing appropriate transformations, and tracking down the source and meaning of anomalies. Build production-ready models using statistical modeling, mathematical modeling, econometric modeling, machine learning algorithms, network modeling, social network modeling, natural language processing, or genetic algorithms. Validate models against alternative approaches, expected and observed outcome, and other business defined key performance indicators. Implement models that comply with evaluations of the computational demands, accuracy, and reliability of the relevant ETL processes at various stages of production. Position reports into Newark, NJ office; however, telecommuting from a home office may be allowed. Requirements: Requires a Master’s in Statistics, Computer Science, Data Science, Machine Learning, Applied Math, Operations Research, Economics, or a related field plus two (2) years of Data Scientist or other occupation/position/job title with research or work experience related to data processing and predictive Machine Learning modeling at scale. Experience may be gained concurrently and must include: Two (2) years in each of the following: - Building statistical models and machine learning models using large datasets from multiple resources - Non-linear models including Neural Nets or Deep Learning, and Gradient Boosting - Applying specialized modelling software including Python, R, SAS, MATLAB, or Stata. One (1) year in the following: - Using database technologies including SQL or ETL. Alternatively, will accept a Bachelor's and five (5) years of experience. Salary: $163,238 - $178,400/year. Multiple positions. Apply online: www.amazon.jobs Job Code: ADBL135. We are open to hiring candidates to work out of one of the following locations: Newark, NJ, USA
CN, 11, Beijing
Amazon Search JP builds features powering product search on the Amazon JP shopping site and expands the innovations to world wide. As an Applied Scientist on this growing team, you will take on a key role in improving the NLP and ranking capabilities of the Amazon product search service. Our ultimate goal is to help customers find the products they are searching for, and discover new products they would be interested in. We do so by developing NLP components that cover a wide range of languages and systems. As an Applied Scientist for Search JP, you will design, implement and deliver search features on Amazon site, helping millions of customers every day to find quickly what they are looking for. You will propose innovation in NLP and IR to build ML models trained on terabytes of product and traffic data, which are evaluated using both offline metrics as well as online metrics from A/B testing. You will then integrate these models into the production search engine that serves customers, closing the loop through data, modeling, application, and customer feedback. The chosen approaches for model architecture will balance business-defined performance metrics with the needs of millisecond response times. Key job responsibilities - Designing and implementing new features and machine learned models, including the application of state-of-art deep learning to solve search matching, ranking and Search suggestion problems. - Analyzing data and metrics relevant to the search experiences. - Working with teams worldwide on global projects. Your benefits include: - Working on a high-impact, high-visibility product, with your work improving the experience of millions of customers - The opportunity to use (and innovate) state-of-the-art ML methods to solve real-world problems with tangible customer impact - Being part of a growing team where you can influence the team's mission, direction, and how we achieve our goals We are open to hiring candidates to work out of one of the following locations: Beijing, 11, CHN | Shanghai, 31, CHN
DE, BE, Berlin
The Artificial General Intelligence (AGI) team is looking for a Senior Applied Scientist with background in Large Language Model, Natural Language Process and Machine/Deep learning. You will be work with a team of applied and research scientists to bring all existing Alexa features and beyond to LLM empowered Alexa. You will interact in a cross-functional capacity with science, product and engineering leaders. Key job responsibilities • Conducting research leading to improved Alexa AI systems • Communicating effectively with leadership team as well as with colleagues from science, engineering and business backgrounds. • Providing research directions and mentoring junior researchers. We are open to hiring candidates to work out of one of the following locations: Berlin, BE, DEU
US, MA, North Reading
Working at Amazon Robotics Are you inspired by invention? Is problem solving through teamwork in your DNA? Do you like the idea of seeing how your work impacts the bigger picture? Answer yes to any of these and you’ll fit right in here at Amazon Robotics. We are a smart, collaborative team of doers that work passionately to apply cutting-edge advances in robotics and software to solve real-world challenges that will transform our customers’ experiences in ways we can’t even imagine yet. We invent new improvements every day. We are Amazon Robotics and we will give you the tools and support you need to invent with us in ways that are rewarding, fulfilling and fun. Position Overview The Amazon Robotics (AR) Software Research and Science team builds and runs simulation experiments and delivers analyses that are central to understanding the performance of the entire AR system. This includes operational and software scaling characteristics, bottlenecks, and robustness to “chaos monkey” stresses -- we inform critical engineering and business decisions about Amazon’s approach to robotic fulfillment. We are seeking an enthusiastic Data Scientist to design and implement state-of-the-art solutions for never-before-solved problems. The DS will collaborate closely with other research and robotics experts to design and run experiments, research new algorithms, and find new ways to improve Amazon Robotics analytics to optimize the Customer experience. They will partner with technology and product leaders to solve business problems using scientific approaches. They will build new tools and invent business insights that surprise and delight our customers. They will work to quantify system performance at scale, and to expand the breadth and depth of our analysis to increase the ability of software components and warehouse processes. They will work to evolve our library of key performance indicators and construct experiments that efficiently root cause emergent behaviors. They will engage with software development teams and warehouse design engineers to drive the evolution of the AR system, as well as the simulation engine that supports our work. Inclusive Team Culture Here at Amazon, we embrace our differences. We are committed to furthering our culture of inclusion. We have 12 affinity groups (employee resource groups) with more than 87,000 employees across hundreds of chapters around the world. We have innovative benefit offerings and host annual and ongoing learning experiences, including our Conversations on Race and Ethnicity (CORE) and AmazeCon (gender diversity) conferences. Amazon’s culture of inclusion is reinforced within our 14 Leadership Principles, which reminds team members to seek diverse perspectives, learn and be curious, and earn trust. Flexibility It isn’t about which hours you spend at home or at work; it’s about the flow you establish that brings energy to both parts of your life. We offer flexibility and encourage you to find your own balance between your work and personal lives. Mentorship & Career Growth We care about your career growth too. Whether your goals are to explore new technologies, take on bigger opportunities, or get to the next level, we'll help you get there. Our business is growing fast and our people will grow with it. A day in the life Amazon offers a full range of benefits that support you and eligible family members, including domestic partners and their children. Benefits can vary by location, the number of regularly scheduled hours you work, length of employment, and job status such as seasonal or temporary employment. The benefits that generally apply to regular, full-time employees include: 1. Medical, Dental, and Vision Coverage 2. Maternity and Parental Leave Options 3. Paid Time Off (PTO) 4. 401(k) Plan If you are not sure that every qualification on the list above describes you exactly, we'd still love to hear from you! At Amazon, we value people with unique backgrounds, experiences, and skillsets. If you’re passionate about this role and want to make an impact on a global scale, please apply! We are open to hiring candidates to work out of one of the following locations: North Reading, MA, USA
US, MA, Boston
The Artificial General Intelligence (AGI) - Automations team is developing AI technologies to automate workflows, processes for browser automation, developers and ops teams. As part of this, we are developing services and inference engine for these automation agents, and techniques for reasoning, planning, and modeling workflows. If you are interested in a startup mode team in Amazon to build the next level of agents then come join us. Scientists in AGI - Automations will develop cutting edge multimodal LLMs to observe, model and derive insights from manual workflows to automate them. You will get to work in a joint scrum with engineers for rapid invention, develop cutting edge automation agent systems, and take them to launch for millions of customers. Key job responsibilities - Build automation agents by developing novel multimodal LLMs. A day in the life An Applied Scientist with the AGI team will support the science solution design, run experiments, research new algorithms, and find new ways of optimizing the customer experience.; while setting examples for the team on good science practice and standards. Besides theoretical analysis and innovation, an Applied Scientist will also work closely with talented engineers and scientists to put algorithms and models into practice. We are open to hiring candidates to work out of one of the following locations: Boston, MA, USA
US, WA, Seattle
Are you motivated to explore research in ambiguous spaces? Are you interested in conducting research that will improve the employee and manager experience at Amazon? Do you want to work on an interdisciplinary team of scientists that collaborate rather than compete? Join us at PXT Central Science! The People eXperience and Technology Central Science Team (PXTCS) uses economics, behavioral science, statistics, and machine learning to proactively identify mechanisms and process improvements which simultaneously improve Amazon and the lives, wellbeing, and the value of work to Amazonians. We are an interdisciplinary team that combines the talents of science and engineering to develop and deliver solutions that measurably achieve this goal. We are seeking a senior Applied Scientist with expertise in more than one or more of the following areas: machine learning, natural language processing, computational linguistics, algorithmic fairness, statistical inference, causal modeling, reinforcement learning, Bayesian methods, predictive analytics, decision theory, recommender systems, deep learning, time series modeling. In this role, you will lead and support research efforts within all aspects of the employee lifecycle: from candidate identification to recruiting, to onboarding and talent management, to leadership and development, to finally retention and brand advocacy upon exit. The ideal candidate should have strong problem-solving skills, excellent business acumen, the ability to work independently and collaboratively, and have an expertise in both science and engineering. The ideal candidate is not methods-driven, but driven by the research question at hand; in other words, they will select the appropriate method for the problem, rather than searching for questions to answer with a preferred method. The candidate will need to navigate complex and ambiguous business challenges by asking the right questions, understanding what methodologies to employ, and communicating results to multiple audiences (e.g., technical peers, functional teams, business leaders). About the team We are a collegial and multidisciplinary team of researchers in People eXperience and Technology (PXT) that combines the talents of science and engineering to develop innovative solutions to make Amazon Earth's Best Employer. We leverage data and rigorous analysis to help Amazon attract, retain, and develop one of the world’s largest and most talented workforces. We are open to hiring candidates to work out of one of the following locations: Arlington, VA, USA | Austin, TX, USA | Chicago, IL, USA | New York, NY, USA | Seattle, WA, USA | Sunnyvale, CA, USA