WACV: Transformers for video and contrastive learning

Amazon’s Joe Tighe on the major trends he sees in the field of computer vision.

Joe Tighe, senior manager for computer vision at Amazon Web Services, is a coauthor on two papers being presented at this year’s Winter Conference on Applications of Computer Vision (WACV), and as he prepares to attend the conference, he sees two major trends in the field of computer vision.

“One is Transformers and what they can do, and the other is self-supervised or unsupervised learning and how we can apply that,” Tighe says.

The Transformer is a neural-network architecture that uses attention mechanisms to improve performance on machine learning tasks. When processing part of a stream of input data, the Transformer attends to data from other parts of the stream, which influences its handling of the data at hand. Transformers have enabled state-of-the-art performance on natural-language-processing tasks because of their ability to model long-range correlations — recognizing, for instance, that the name at the start of a sentence might be the referent of a pronoun at the sentence’s end.
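The attention step described above can be sketched in a few lines. This is a minimal illustration, not a full Transformer: it assumes identity projections (a real model learns separate query, key, and value projections and uses several attention heads), and the toy vectors and the function names `softmax` and `self_attention` are made up for the example.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections.

    tokens: list of equal-length vectors (lists of floats).
    Returns (outputs, weights); weights[i][j] is how strongly
    position i attends to position j.
    """
    dim = len(tokens[0])
    outputs, weights = [], []
    for query in tokens:
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
                  for key in tokens]
        row = softmax(scores)
        weights.append(row)
        # Each output is a weighted average over the whole stream.
        outputs.append([sum(w * tok[d] for w, tok in zip(row, tokens))
                        for d in range(dim)])
    return outputs, weights

# Toy stream: positions 0 and 3 hold similar vectors (think of a name and
# the pronoun that refers to it), separated by unrelated tokens.
stream = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [4.0, 0.0]]
outputs, weights = self_attention(stream)
```

In this toy input, position 0 puts far more weight on the distant position 3 than on its immediate neighbor, which is the long-range behavior the paragraph describes.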

In visual data, on the other hand, locality tends to matter more: usually, the value of a pixel is more strongly correlated with those of the pixels around it than with pixels that are farther away. Computer vision has traditionally relied on convolutional neural networks (CNNs), which step through an image applying the same set of filters — or kernels — to each patch of an image. That way, the CNN can find the patterns it’s looking for — say, visual characteristics of dog ears — wherever in the image they occur.
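To make the shared-filter idea concrete, here is a small sketch in plain Python. The 2x2 diagonal pattern, the 5x5 images, and the helper names are all hypothetical; a real CNN learns many kernels and stacks many such layers.

```python
def convolve2d(image, kernel):
    """Apply one kernel to every patch of a 2D image (stride 1, no padding).

    Deep-learning "convolution" is technically cross-correlation: the
    kernel is not flipped before it is applied.
    """
    k_rows, k_cols = len(kernel), len(kernel[0])
    out_rows = len(image) - k_rows + 1
    out_cols = len(image[0]) - k_cols + 1
    return [[sum(kernel[i][j] * image[top + i][left + j]
                 for i in range(k_rows) for j in range(k_cols))
             for left in range(out_cols)]
            for top in range(out_rows)]

def strongest_response(response):
    """Return the (row, col) of the patch where the kernel fired hardest."""
    cells = [(r, c) for r in range(len(response))
             for c in range(len(response[0]))]
    return max(cells, key=lambda rc: response[rc[0]][rc[1]])

def image_with_pattern(pattern, top, left, size=5):
    """Blank size x size image with `pattern` pasted at (top, left)."""
    image = [[0] * size for _ in range(size)]
    for i, row in enumerate(pattern):
        for j, value in enumerate(row):
            image[top + i][left + j] = value
    return image

# Hypothetical 2x2 diagonal pattern standing in for "dog ears".
pattern = [[1, 0],
           [0, 1]]

found_a = strongest_response(convolve2d(image_with_pattern(pattern, 0, 0), pattern))
found_b = strongest_response(convolve2d(image_with_pattern(pattern, 3, 2), pattern))
```

Because the same kernel is applied to every patch, the strongest response follows the pattern from the top-left corner to the lower part of the image without any change to the weights.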

“We've been successful in basically achieving the same accuracy as convolutional networks with these Transformers,” Tighe says. “And we maintain that locality constraint by, for instance, feeding in patches of images, because with a patch, you have to be local. Or we start out with a CNN and then feed mid-level features from the CNN into the Transformer, and then you let the Transformer go and relate any patch to any other patch.

“But I don't think what Transformers are going to bring to our field is higher accuracy for just embedding images. What they are incredibly good at — and we’re already seeing strong results — is processing structured data.”

One of the WACV papers on which Tighe is a coauthor describes a machine learning model that uses attention mechanisms to determine which frames of a video are most relevant to the task of action recognition. At left are video clips, at right heat maps that indicate where the model is attending. Where action is uniform, so is the model's attention (top). In other cases, the model attends only to the most informative parts of the clip (red boxes, center and bottom). From "NUTA: Non-uniform temporal aggregation for action recognition".

For instance, Tighe explains, Transformers can more naturally infer object permanence: determining that a collection of pixels in one frame of video designates the same object as a different collection of pixels in a different frame.

This is crucial to a number of video applications. For instance, determining the semantic content of a film or TV show requires recognizing the same characters across different shots. Similarly, Amazon Go — the Amazon service that enables checkout-free shopping in physical stores — needs to recognize that the same customer who picked up canned peaches on aisle three also picked up raisin bran on aisle five.

“To understand a movie, we can't just send in frames,” Tighe says. “One of the things my group is doing — as well as a lot of different groups — is using Transformers to take in audio information, take in text, like subtitles, and take in the visual information, the movie content, into one framework. Because what you see is only half of it. What you hear is as, if not more, important for understanding what's going on in these movies. I see Transformers as a powerful tool to finally not have ad hoc ways to combine audio, text, and video together.”

Contrastive learning

On the topic of unsupervised and self-supervised learning, Tighe says, the most interesting recent development has been the exploration of contrastive learning. With contrastive learning, a neural network is fed pairs of inputs, some from the same class and some from different classes, and it learns to produce embeddings — vector representations — that cluster instances of the same class together and separate instances of different classes. The trick is to do this with unlabeled data.
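One common way to write the pairwise objective is a margin-based contrastive loss. The sketch below assumes that formulation, which the article does not name; the `margin` value and the function names are illustrative.

```python
import math

def distance(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def pair_loss(emb_a, emb_b, same_class, margin=1.0):
    """Margin-based contrastive loss for one pair of embeddings."""
    d = distance(emb_a, emb_b)
    if same_class:
        # Matching pairs are penalized for any distance between them.
        return d ** 2
    # Non-matching pairs are penalized only while closer than `margin`.
    return max(0.0, margin - d) ** 2

far_match = pair_loss([0.0, 0.0], [3.0, 4.0], same_class=True)
close_mismatch = pair_loss([0.0, 0.0], [0.0, 0.5], same_class=False)
far_mismatch = pair_loss([0.0, 0.0], [3.0, 4.0], same_class=False)
```

Minimizing this loss pulls same-class embeddings together and pushes different-class embeddings apart until they are at least `margin` away, after which they stop contributing.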

“If you take an image, and then you augment it, you change its color, you take a really aggressive crop, you add a bunch of noise, then you have two examples,” Tighe explains. “You put those both through the network and you say, These two things are the same thing. You can be very aggressive with your augmentations. So when you get, say, a crop of a dog's head and a crop of a dog's tail, you're telling the network these are semantically the same object. And so it needs to learn high-level semantics of dog parts.

“But you also need to push them apart from something else. It’s easy to find examples that are far away already, but that doesn’t help the network learn. What we really need is to find the closest example and push away from that. So I think one of the key innovations here is that you have this large bank of image embeddings that you should push against. The network is going to pick out the really hard examples, the ones that it naturally is embedding very close together. It's going to try and push those apart, and that's how this embedding is learned very well.

“Then at the end, when you're going to test how well it does, you just train a single linear layer with all your labeled data. The idea is, if this works, we should be able to throw the world of images at one of these systems, train the ultimate embedding that can describe the entire world, and then, with our specific task in mind, just with a little bit of data, train that last layer and have very high performance.”
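The push against a bank of embeddings is often written as an InfoNCE-style loss, and the sketch below assumes that form; the article does not say which loss Tighe has in mind. The `temperature` value, the toy vectors, and the function names are illustrative.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]

def info_nce(anchor, positive, bank, temperature=0.1):
    """InfoNCE-style loss for one anchor, one positive, and a bank of negatives.

    Returns (loss, negative_weights). negative_weights[i] is the softmax
    probability given to bank[i]; the gradient that pushes bank[i] away
    is proportional to it.
    """
    anchor = normalize(anchor)
    logits = [dot(anchor, normalize(positive)) / temperature]
    logits += [dot(anchor, normalize(neg)) / temperature for neg in bank]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return -math.log(probs[0]), probs[1:]

anchor = [1.0, 0.0]
positive = [0.9, 0.1]          # an augmented view of the same image
bank = [[0.8, 0.6],            # hard negative: already close to the anchor
        [-1.0, 0.0],           # easy negative: already far away
        [0.0, 1.0]]            # unrelated negative
loss, negative_weights = info_nce(anchor, positive, bank)
```

In this toy bank the hard negative receives almost all of the weight among the negatives, while the easy one contributes next to nothing, which matches the point that faraway examples do not help the network learn.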

Action recognition

In his own papers at WACV, Tighe and his colleagues are exploring both attention mechanisms and semi-supervised learning — although not exactly Transformers and contrastive learning.

“One WACV paper is looking at how we use the Transformer mechanism of self-attention to aggregate temporal information,” he explains. “It's actually a CNN, but then we use that self-attention mechanism to aggregate information across the whole video. So we get the ability to share information globally inside this network as well.

“The other one is looking at, if you have a dictionary of actions, how can you predict the different actions that are occurring by looking at a bunch of events? One of the datasets we look at is gymnastics. So if we look at the floor plan for a gymnastics event, and you have a number of examples of that, we predict the fine-grain actions like a flip or turnover that happened without supervision of those fine-grain actions.”
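The idea in the first of these papers, using attention to aggregate information across frames, can be illustrated with a simple attention-weighted pooling step. This is only a sketch of the general technique, not the architecture of the paper: the per-frame feature vectors, the `query` vector (which a real model would learn), and the function name are assumptions.

```python
import math

def attention_pool(frame_features, query):
    """Collapse per-frame feature vectors into one clip-level vector.

    Each frame is scored against `query`; a softmax over the scores gives
    the weight of that frame in the final average.
    Returns (clip_vector, weights).
    """
    dim = len(query)
    scores = [sum(q * f for q, f in zip(query, frame)) / math.sqrt(dim)
              for frame in frame_features]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    clip_vector = [sum(w * frame[d] for w, frame in zip(weights, frame_features))
                   for d in range(dim)]
    return clip_vector, weights

query = [1.0, 0.0]  # hypothetical; stands in for a learned parameter

# Four identical frames: attention is spread evenly.
_, uniform_weights = attention_pool([[1.0, 1.0]] * 4, query)

# One informative frame among uninformative ones: attention concentrates on it.
_, peaked_weights = attention_pool(
    [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [0.0, 0.0]], query)
```

Unlike plain averaging over time, the weights here depend on the content of each frame, so uniform action yields uniform weights and a single informative frame dominates the clip-level vector.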

As for what the future may hold, “what's really missing from video research is around how you model the temporal dimension,” Tighe says. “And I'm not claiming to know what that means yet. But it's inherently a different signal; it can't just be treated like another space dimension.”
