Amazon at WACV: Computer vision is more than labeling pixels
Amazon distinguished scientist Gérard Medioni on the complexities of “understanding your environment through visual input.”
Gérard Medioni, an Amazon vice president and distinguished scientist, is the general chair at this year’s IEEE Winter Conference on Applications of Computer Vision (WACV), and in that capacity, he led the recruitment of the conference’s three keynote speakers.
On Wednesday, Lihi Zelnik-Manor, an associate professor of electrical engineering at Israel’s Technion, described her experiences working on computer vision and artificial-intelligence projects for Alibaba, the leading Chinese e-commerce company. Yesterday, Hao Li, a cofounder and CEO at Pinscreen, addressed the challenges of creating virtual online avatars that move and sound like real people. And today, Raquel Urtasun, chief scientist at Uber’s Advanced Technologies Group and a professor of computer science at the University of Toronto, will discuss the science of self-driving cars.
“It's really an international mix of people,” Medioni says. “Lihi is representing both Israel and China for Alibaba. Raquel is originally from Spain, educated in Switzerland, and leading the effort for Uber in Canada. Hao Li is from Germany and is working here in the U.S.”
The speakers’ topics demonstrate how expansive the applications of computer vision have become; they’re no longer just a matter of labeling pixels in an image.
“You have to take computer vision as the interpretation of the scene, not necessarily static, but also dynamic,” Medioni explains. “It’s understanding your environment through visual input. That involves anticipating and understanding actions as well. Activity understanding is a subfield of computer vision: ‘What is this person doing?’”
The ideal sandbox
In that context, “self-driving cars and Just Walk Out shopping are ideal sandboxes for computer vision,” Medioni says. “You need to solve every sub-problem that you can think of in computer vision. For autonomous driving, you need to understand the scene, which means you need to detect signs, you need to detect people, you need to detect cars, and you need to make inferences about behavior. And in addition to that, you have to provide the motor signals to actuate the car.
“Another interesting part — which is true for both Amazon Go and for self-driving cars — is that the basic case is fairly straightforward. But there is a very, very long tail of complicated cases. And because it's such a long tail, you cannot think in advance of all the cases and solve them in the lab. You have to actually gather tens of thousands of hours of driving experience to address these cases.
“Another part of the complexity is the combination of human drivers and self-driving cars. When you and I get to a stop sign at the same time, I look at you; you look at me. We have established contact. And now I can start going, and I know what's going to happen. This whole interplay that occurs nonverbally doesn't exist if you have a self-driving car and a human driver. There is no eye contact. So this is a very interesting aspect of it, too.”
In his keynote, Hao Li discussed the challenge that his company, Pinscreen, is addressing: the synthesis of realistic online avatars. Like the problem of self-driving cars, it’s a computer vision problem whose solution depends on accurately modeling and reproducing human behavior.
“When you and I talk, you’re not just the head,” Medioni explains. “Your hands are moving; your arms are moving; your shoulders are moving. If you have ever seen an avatar that just speaks with the face, and the arms are not moving, it is very disturbing. It looks fake.
“The complexity comes from the fact that we humans are very good at detecting any type of defect. Anything that looks slightly off is going to create this uncanny-valley effect. When a designer is looking to generate an expression, for example, that is different from just expression classification. You can say this person is smiling, or this person is frowning; well, that's just a label that you put on it. What Li’s doing is more complicated. Creating an expression involves tens of muscles in the face, and some of those muscles can have very, very subtle activations. Then you have parts that you do not necessarily see. Like when you open your mouth, well, you see part of the tongue and teeth. How do you do that? Li is one of the leaders in producing this type of richness of expression in the face.
“It still continues to amaze me what we are able to accomplish with computer vision today,” Medioni adds. “It's truly great to be in this field and to see the progress on a weekly basis.”