Prime Video's work on 3-D scene reconstruction, image representation

CVPR papers examine the recovery of 3-D information from camera movement and learning general representations from weakly annotated data.

At this year’s Conference on Computer Vision and Pattern Recognition (CVPR), Prime Video is presenting a pair of papers that indicate the range of problems we work on.

In one paper, “Depth-guided sparse structure-from-motion for movies and TV shows”, we present a method for determining the camera movement and 3-D geometry of scenes depicted in videos. An important application of this work is to enable the accurate insertion of digital objects into already recorded videos. Our approach, which leverages off-the-shelf depth estimators to enhance the standard geometric-optimization approach, results in improvements of 10% to 30% on six different performance measures, relative to the best-performing prior technique.

SfM.gif
The Prime Video structure-from-motion system at work. At top is the input video. At lower left is the video with keypoints (colored circles) added. The keypoints are tracked accurately from frame to frame, and their color indicates their depth, as estimated by a machine learning model. At lower right is the 3-D model of the keypoints (whose rotation, to demonstrate the 3-D structure, is not synchronized with the video).

In the other paper, “Robust cross-modal representation learning with progressive self-distillation,” we expand on the CLIP method of using paired images and texts found online to train a model that produces image and text representations useful for downstream tasks, such as image classification or text-based image retrieval.

Where CLIP enforces a hard alignment between Web-crawled images and their associated texts, our method is more flexible, allowing for partial correspondences between a given image and texts associated with other images. We also use a self-distillation technique, in which our model progressively creates some of its own training targets, to steadily refine its representations.

Related content
Detectors for block corruption, audio artifacts, and errors in audio-video synchronization are just three of Prime Video’s quality assurance tools.

In two different image classification settings, our method outperforms CLIP across the board, by significant margins — 30% to 90% — on some datasets. Our method also consistently outperforms its CLIP counterpart on the tasks of image-based text retrieval and text-based image retrieval.

Structure-from-motion

Structure-from-motion is the problem of determining the 3-D structure of a scene from parallax — the relative displacement of objects in the scene as the camera moves. There are robust solutions for videos with large camera movements, but they don’t work as well for feature films and TV shows, where the camera movements tend to be more restrained.

The standard approach to determining structure from motion uses geometric optimization. First, the method estimates the location of a set of 3-D points in the scene, and then, based on that estimation, it re-projects them onto a 2-D image corresponding to each camera location. The optimization procedure minimizes the distance between points in the original 2-D image and the corresponding points of the 2-D projection.

We improve on this approach by introducing depth estimates performed by off-the-shelf, pretrained models. Instead of minimizing only the difference between the original and the projected 2-D points, our approach minimizes both the reprojection error of the 2-D points and the depth measurement error, relative to the output of the depth estimation model.

Double loss.png
Our approach jointly minimizes 2-D reprojection error and depth estimate error.

Our approach begins by using a standard method to detect image keypoints — salient points in the image, usually at object corners and other edge intersections — and identify their correspondences across successive frames of video. Then, through bilinear interpolation, we use the depth map obtained from an off-the-shelf depth estimator to determine the ground-truth keypoint depths. We use the depth information not only during optimization but also during the initialization stage of the process, when we produce our initial estimates of 3-D scene structure and relative camera pose.

SfM.png
The Prime Video structure-from-motion technique identifies keypoints in input video, finds their correspondences across frames, and then estimates their depth using bilinear interpolation on a dense depth map.

We experimented with several different depth estimation models and found that the results of our approach were essentially the same with all of them. And, in all cases, our approach improved substantially on the state of the art.

Cross-modal representations

In natural-language processing, the best-performing models in recent years have been built on top of language models that learn generic linguistic representations from huge corpora of unannotated public texts. The language models can then be fine-tuned for specific tasks with minimal additional data.

CLIP (contrastive language-image pretraining) seeks to do something similar for computer vision, learning generic visual representations from images harvested from the Web and their associated texts.

Related content
The switch to WebAssembly increases stability, speed.

Like many such weakly supervised models, CLIP is trained through contrastive learning. Intuitively, for each training image, the model is fed two texts: one, the positive training example, is the text associated with the image online; the other text, the negative example, is randomly chosen. CLIP learns a data representation that pulls the image and the positive text together in the representation space and pushes the image and the negative text apart.

Although CLIP has yielded impressive results on downstream computer vision tasks, its training approach has two drawbacks. First, the web-harvested data is noisy: the text associated with an image may in fact be semantically unrelated to it. Conversely, the text randomly selected as a negative example may in fact be semantically related to the image. CLIP can thus steer the model toward erroneous associations and away from correct ones.

Our method attempts to address this problem. Rather than learn a hard alignment between image and text, we learn a soft alignment, which gives the resulting model more interpretive flexibility.

For example, in one of our experiments, both the CLIP baseline and our model were trained on datasets that included images of goldfish. When presented with an image of a stained-glass window depicting a goldfish — a type of image not included in the training data — CLIP guessed that it was a guinea pig or maybe a beer glass, while our model guessed that it was a goldfish or possibly a clown fish. That is, our model learned a representation general enough to accommodate the stylization of the stained-glass artist’s rendering style.

CV model learning.png
CLIP’s contrastive-learning procedure enforces connections between web-harvested images and their associated texts (green lines, at left) while dissociating them from other images’ texts (red lines). Our approach instead privileges associated texts but also learns softer, probabilistic alignments with other images’ texts (dotted blue lines).

Our model learns its soft alignments through a self-distillation process. First, the model learns an initial data representation through the same contrastive-loss function that CLIP uses.

Over the course of training, however, we use the model itself to make predictions about the training examples and use those predictions as additional training targets. At first, the loss function gives these self-predictions little weight, but it gradually increases the weight as training progresses.

Related content
In a pilot study, an automated code checker found about 100 possible errors, 80% of which turned out to require correction.

The idea is that, over time, the model learns more reliable correlations between training images and texts. Self-distillation reinforces those correlations, so the model isn’t encouraged to break semantic connections between images and texts that may very well be present in the data. Similarly, over time, the model learns to give less weight to spurious connections between images and the texts initially associated with them.

The great virtue of general representation models like ours and CLIP is that they can be applied to a wide variety of computer vision problems. So the accuracy improvements that our approach affords should pay dividends for Prime Video customers in a range of contexts over the next few years.

Research areas

Related content

GB, MLN, Edinburgh
We’re looking for a Machine Learning Scientist in the Personalization team for our Edinburgh office experienced in generative AI and large models. You will be responsible for developing and disseminating customer-facing personalized recommendation models. This is a hands-on role with global impact working with a team of world-class engineers and scientists across the Edinburgh offices and wider organization. You will lead the design of machine learning models that scale to very large quantities of data, and serve high-scale low-latency recommendations to all customers worldwide. You will embody scientific rigor, designing and executing experiments to demonstrate the technical efficacy and business value of your methods. You will work alongside a science team to delight customers by aiding in recommendations relevancy, and raise the profile of Amazon as a global leader in machine learning and personalization. Successful candidates will have strong technical ability, focus on customers by applying a customer-first approach, excellent teamwork and communication skills, and a motivation to achieve results in a fast-paced environment. Our position offers exceptional opportunities for every candidate to grow their technical and non-technical skills. If you are selected, you have the opportunity to make a difference to our business by designing and building state of the art machine learning systems on big data, leveraging Amazon’s vast computing resources (AWS), working on exciting and challenging projects, and delivering meaningful results to customers world-wide. Key job responsibilities Develop machine learning algorithms for high-scale recommendations problems. Rapidly design, prototype and test many possible hypotheses in a high-ambiguity environment, making use of both quantitative analysis and business judgement. Collaborate with software engineers to integrate successful experimental results into large-scale, highly complex Amazon production systems capable of handling 100,000s of transactions per second at low latency. Report results in a manner which is both statistically rigorous and compellingly relevant, exemplifying good scientific practice in a business environment.
US, WA, Seattle
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! In Prime Video READI, our mission is to automate infrastructure scaling and operational readiness. We are growing a team specialized in time series modeling, forecasting, and release safety. This team will invent and develop algorithms for forecasting multi-dimensional related time series. The team will develop forecasts on key business dimensions with optimization recommendations related to performance and efficiency opportunities across our global software environment. As a founding member of the core team, you will apply your deep coding, modeling and statistical knowledge to concrete problems that have broad cross-organizational, global, and technology impact. Your work will focus on retrieving, cleansing and preparing large scale datasets, training and evaluating models and deploying them to production where we continuously monitor and evaluate. You will work on large engineering efforts that solve significantly complex problems facing global customers. You will be trusted to operate with complete independence and are often assigned to focus on areas where the business and/or architectural strategy has not yet been defined. You must be equally comfortable digging in to business requirements as you are drilling into design with development teams and developing production ready learning models. You consistently bring strong, data-driven business and technical judgment to decisions. You will work with internal and external stakeholders, cross-functional partners, and end-users around the world at all levels. Our team makes a big impact because nothing is more important to us than delivering for our customers, continually earning their trust, and thinking long term. You are empowered to bring new technologies to your solutions. If you crave a sense of ownership, this is the place to be.
US, WA, Seattle
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video team member, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! Key job responsibilities As an Applied Scientist in the Content Understanding Team, you will lead the end-to-end research and deployment of video and multi-modal models applied to a variety of downstream applications. More specifically, you will: - Work backwards from customer problems to research and design scientific approaches for solving them - Work closely with other scientists, engineers and product managers to expand the depth of our product insights with data, create a variety of experiments to determine the high impact projects to include in planning roadmaps - Stay up-to-date with advancements and the latest modeling techniques in the field - Publish your research findings in top conferences and journals About the team Our Prime Video Content Understanding team builds holistic media representations (e.g. descriptions of scenes, semantic embeddings) and apply them to new customer experiences supply chain problems. Our technology spans the entire Prime Video catalogue globally, and we enable instant recaps, skip intro timing, ad placement, search, and content moderation.
IN, HR, Gurugram
We're on a journey to build something new a green field project! Come join our team and build new discovery and shopping products that connect customers with their vehicle of choice. We're looking for a talented Senior Applied Scientist to join our team of product managers, designers, and engineers to design, and build innovative automotive-shopping experiences for our customers. This is a great opportunity for an experienced engineer to design and implement the technology for a new Amazon business. We are looking for a Applied Scientist to design, implement and deliver end-to-end solutions. We are seeking passionate, hands-on, experienced and seasoned Senior Applied Scientist who will be deep in code and algorithms; who are technically strong in building scalable computer vision machine learning systems across item understanding, pose estimation, class imbalanced classifiers, identification and segmentation.. You will drive ideas to products using paradigms such as deep learning, semi supervised learning and dynamic learning. As a Senior Applied Scientist, you will also help lead and mentor our team of applied scientists and engineers. You will take on complex customer problems, distill customer requirements, and then deliver solutions that either leverage existing academic and industrial research or utilize your own out-of-the-box but pragmatic thinking. In addition to coming up with novel solutions and prototypes, you will directly contribute to implementation while you lead. A successful candidate has excellent technical depth, scientific vision, project management skills, great communication skills, and a drive to achieve results in a unified team environment. You should enjoy the process of solving real-world problems that, quite frankly, haven’t been solved at scale anywhere before. Along the way, we guarantee you’ll get opportunities to be a bold disruptor, prolific innovator, and a reputed problem solver—someone who truly enables AI and robotics to significantly impact the lives of millions of consumers. Key job responsibilities Architect, design, and implement Machine Learning models for vision systems on robotic platforms Optimize, deploy, and support at scale ML models on the edge. Influence the team's strategy and contribute to long-term vision and roadmap. Work with stakeholders across , science, and operations teams to iterate on design and implementation. Maintain high standards by participating in reviews, designing for fault tolerance and operational excellence, and creating mechanisms for continuous improvement. Prototype and test concepts or features, both through simulation and emulators and with live robotic equipment Work directly with customers and partners to test prototypes and incorporate feedback Mentor other engineer team members. A day in the life - 6+ years of building machine learning models for retail application experience - PhD, or Master's degree and 6+ years of applied research experience - Experience programming in Java, C++, Python or related language - Experience with neural deep learning methods and machine learning - Demonstrated expertise in computer vision and machine learning techniques.
US, MA, Boston
The Artificial General Intelligence (AGI) team is looking for a passionate, talented, and inventive Applied Scientist with a strong deep learning background, to build industry-leading technology with Large Language Models (LLMs) and multi-modal systems. You will support projects that work on technologies including multi-modal model alignment, moderation systems and evaluation. Key job responsibilities As an Applied Scientist with the AGI team, you will support the development of novel algorithms and modeling techniques, to advance the state of the art with LLMs. Your work will directly impact our customers in the form of products and services that make use of speech and language technology. You will leverage Amazon’s heterogeneous data sources and large-scale computing resources to accelerate advances in generative artificial intelligence (GenAI). You are also expected to publish in top tier conferences. About the team The AGI team has a mission to push the envelope in LLMs and multimodal systems. Specifically, we focus on model alignment with an aim to maintain safety while not denting utility, in order to provide the best-possible experience for our customers.
IN, HR, Gurugram
Our customers have immense faith in our ability to deliver packages timely and as expected. A well planned network seamlessly scales to handle millions of package movements a day. It has monitoring mechanisms that detect failures before they even happen (such as predicting network congestion, operations breakdown), and perform proactive corrective actions. When failures do happen, it has inbuilt redundancies to mitigate impact (such as determine other routes or service providers that can handle the extra load), and avoids relying on single points of failure (service provider, node, or arc). Finally, it is cost optimal, so that customers can be passed the benefit from an efficiently set up network. Amazon Shipping is hiring Applied Scientists to help improve our ability to plan and execute package movements. As an Applied Scientist in Amazon Shipping, you will work on multiple challenging machine learning problems spread across a wide spectrum of business problems. You will build ML models to help our transportation cost auditing platforms effectively audit off-manifest (discrepancies between planned and actual shipping cost). You will build models to improve the quality of financial and planning data by accurately predicting ship cost at a package level. Your models will help forecast the packages required to be pick from shipper warehouses to reduce First Mile shipping cost. Using signals from within the transportation network (such as network load, and velocity of movements derived from package scan events) and outside (such as weather signals), you will build models that predict delivery delay for every package. These models will help improve buyer experience by triggering early corrective actions, and generating proactive customer notifications. Your role will require you to demonstrate Think Big and Invent and Simplify, by refining and translating Transportation domain-related business problems into one or more Machine Learning problems. You will use techniques from a wide array of machine learning paradigms, such as supervised, unsupervised, semi-supervised and reinforcement learning. Your model choices will include, but not be limited to, linear/logistic models, tree based models, deep learning models, ensemble models, and Q-learning models. You will use techniques such as LIME and SHAP to make your models interpretable for your customers. You will employ a family of reusable modelling solutions to ensure that your ML solution scales across multiple regions (such as North America, Europe, Asia) and package movement types (such as small parcel movements and truck movements). You will partner with Applied Scientists and Research Scientists from other teams in US and India working on related business domains. Your models are expected to be of production quality, and will be directly used in production services. You will work as part of a diverse data science and engineering team comprising of other Applied Scientists, Software Development Engineers and Business Intelligence Engineers. You will participate in the Amazon ML community by authoring scientific papers and submitting them to Machine Learning conferences. You will mentor Applied Scientists and Software Development Engineers having a strong interest in ML. You will also be called upon to provide ML consultation outside your team for other problem statements. If you are excited by this charter, come join us!
US, WA, Seattle
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video team member, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! Key job responsibilities As an Applied Scientist in the Content Understanding Team, you will lead the end-to-end research and deployment of video and multi-modal models applied to a variety of downstream applications. More specifically, you will: - Work backwards from customer problems to research and design scientific approaches for solving them - Work closely with other scientists, engineers and product managers to expand the depth of our product insights with data, create a variety of experiments to determine the high impact projects to include in planning roadmaps - Stay up-to-date with advancements and the latest modeling techniques in the field - Publish your research findings in top conferences and journals About the team Our Prime Video Content Understanding team builds holistic media representations (e.g. descriptions of scenes, semantic embeddings) and apply them to new customer experiences supply chain problems. Our technology spans the entire Prime Video catalogue globally, and we enable instant recaps, skip intro timing, ad placement, search, and content moderation.
US, WA, Seattle
Do you want to re-invent how millions of people consume video content on their TVs, Tablets and Alexa? We are building a free to watch streaming service called Fire TV Channels (https://techcrunch.com/2023/08/21/amazon-launches-fire-tv-channels-app-400-fast-channels/). Our goal is to provide customers with a delightful and personalized experience for consuming content across News, Sports, Cooking, Gaming, Entertainment, Lifestyle and more. You will work closely with engineering and product stakeholders to realize our ambitious product vision. You will get to work with Generative AI and other state of the art technologies to help build personalization and recommendation solutions from the ground up. You will be in the driver's seat to present customers with content they will love. Using Amazon’s large-scale computing resources, you will ask research questions about customer behavior, build state-of-the-art models to generate recommendations and run these models to enhance the customer experience. You will participate in the Amazon ML community and mentor Applied Scientists and Software Engineers with a strong interest in and knowledge of ML. Your work will directly benefit customers and you will measure the impact using scientific tools.
US, MA, Boston
The Artificial General Intelligence (AGI) team is looking for a passionate, talented, and inventive Senior Applied Scientist with a strong deep learning background, to build industry-leading technology with Large Language Models (LLMs) and multimodal systems. Key job responsibilities As a Senior Applied Scientist with the AGI team, you will work with talented peers to lead the development of novel algorithms and modeling techniques, to advance the state of the art with LLMs. Your work will directly impact our customers in the form of products and services that make use of speech and language technology. You will leverage Amazon’s heterogeneous data sources and large-scale computing resources to accelerate advances in generative artificial intelligence (GenAI). About the team The AGI team has a mission to push the envelope in LLMs and multimodal systems, in order to provide the best-possible experience for our customers.
IN, KA, Bangalore
Prime Video is a first-stop entertainment destination offering customers a vast collection of premium programming in one app available across thousands of devices. Prime members can customize their viewing experience and find their favorite movies, series, documentaries, and live sports – including Amazon MGM Studios-produced series and movies; licensed fan favorites; and programming from Prime Video add-on subscriptions such as Apple TV+, Max, Crunchyroll and MGM+. All customers, regardless of whether they have a Prime membership or not, can rent or buy titles via the Prime Video Store, and can enjoy even more content for free with ads. Are you interested in shaping the future of entertainment? Prime Video's technology teams are creating best-in-class digital video experience. As a Prime Video technologist, you’ll have end-to-end ownership of the product, user experience, design, and technology required to deliver state-of-the-art experiences for our customers. You’ll get to work on projects that are fast-paced, challenging, and varied. You’ll also be able to experiment with new possibilities, take risks, and collaborate with remarkable people. We’ll look for you to bring your diverse perspectives, ideas, and skill-sets to make Prime Video even better for our customers. With global opportunities for talented technologists, you can decide where a career Prime Video Tech takes you! Key job responsibilities As a highly experienced and seasoned science leader, you will apply state of the art natural language processing and computer vision research to video centric digital media, while also responsible for creating and maintaining the best environment for applied science in order to recruit, retain and develop top talent. You will lead the research direction for a team of deeply talented applied scientists, creating the roadmaps for forward-looking research and communicate them effectively to senior leadership. You will also hire and develop applied scientists - growing the team to meet the evolving needs of our customers.