Amazon at CVPR: Pietro Perona on computer vision's frontiers
Efficient learning and the capacity for abstraction are attributes that will probably require new insights — but self-supervised learning could help.
The Conference on Computer Vision and Pattern Recognition (CVPR) — the premier conference in the field of computer vision — was first held in 1985. Pietro Perona, an Amazon Fellow and the Allan E. Puckett Professor of Electrical Engineering and Computation and Neural Systems at the California Institute of Technology, first attended in 1988, when he was a graduate student at the University of California, Berkeley.
“At the time, computer vision was a field for visionaries — pun intended — where we wanted to solve the question of how we can make a machine see,” Perona says. “The whole conference was maybe 200 people. And we had basically no clear idea how to make progress and so would try different things, and we would try and see if we could split the complex problem of vision into simpler questions. And the results were not very good. Now we see in the conference great systems working really well on very difficult problems. So the level of success and ambition is completely different.”
Much of that success stems, of course, from deep learning, which superseded many earlier computer vision techniques. But, Perona points out, it’s not as if computer vision researchers had simply failed to recognize the utility of deep learning for CVPR’s first 25 years. Until around 2010, he says, using deep learning to tackle computer vision problems wasn’t really an option.
“Deep learning has been around since the late ’80s,” he says, “but we simply didn't have enough computational power to run big experiments on complex images. You have to look to 2008, 2009, when good GPUs began coming out. Then, people in computer vision had to learn how to code up these GPUs. There were no special software tools at the time, so people were just handcrafting software.
“Another factor is the emergence of vast, well-annotated datasets of images, which came about in 2005 to 2010. That was the result of a couple of things. One was the Internet: all of a sudden there were tons of images available. The other thing is Amazon Mechanical Turk, which came out in 2005, and without which we would not be able to have these very large annotated datasets. It's funny, because within Amazon, people are not so aware of it, but Amazon Mechanical Turk was one of the three big factors for the AI revolution to come about. Datasets like ImageNet and COCO would not have been possible without it.”
For all of deep learning’s successes on such canonical computer vision tasks as object recognition, there are some respects in which it has made little headway, Perona says.
“One barrier is the efficiency of learning,” he says. “There was a paper from my team looking at classification of plants and animals. If you have 10,000 images per category — each species of bird or species of butterfly — then the machine will beat a human in accuracy. But the efficiency is not even close. If I give you a new species you have never seen before, and I show you three to five pictures of this new species, you become competent at recognizing that species. For a machine that would not be possible.”
One reason to try to break this barrier is scientific, Perona says. “Humans don't own a special kind of computation,” he says. “So it should be possible for machines to do it. You want to understand this exquisite ability that humans have, how it works.”
But, he adds, there are also practical reasons to worry about learning efficiency.
“If you think of people who are trying to use machine vision in industry or in science, something that is frequent is often not so important,” Perona explains. “What is rare is more important. So if you think of building a machine that can help an ophthalmologist recognize retinal disease, let's suppose, there are some 10 or 20 diseases of the retina that doctors see all the time. So they have no problem. They don't need help from a machine. But then there are another about 600 diseases that they see fewer times. And some of those are seen just by a few doctors per year.
“The world is a long-tailed distribution. A few things are very frequent, and most things are not frequent at all. How often do you see an elephant cross the road? But if you want to build autonomous vehicles, they should be able to handle elephants crossing the road.”
Another aspect of human visual reasoning that deep learning has struggled to duplicate is the capacity for abstraction, Perona says.
“Right now, we need to train machines with diverse backgrounds,” he says. “If you want to train a machine to recognize toads, you've got to show it pictures of toads in all possible environments and all possible poses for the machine to be able to abstract away the concept of toad. If you had trained the machine with pictures of toads always against the same piece of wallpaper or the same blank background, the machine would not be able to handle the toad in a new scenario. Or take a cow on the beach: machines have a terrible time recognizing a cow that is right in the middle of a picture, and it's on the beach. So we know that machines are not yet seeing objects the same way we see them. From the training examples, they are not able to abstract away the attributes of these objects. What is the face of the cow? And relating the face of the cow with the face of a dog and the face of a person — the machine is not yet able to do that.”
Before machines’ learning efficiency and capacity for abstraction can rival humans’, Perona says, “new insights are needed.” But in the near term, progress on both fronts could come from self-supervised learning, a topic that has, he says, grown in popularity at CVPR in recent years.
“Even if there is nobody teaching a machine what to look for, the machine can teach itself in some way and can be prepared to learn the next task,” Perona explains. “Let's suppose that we have a million images, for example, but no labels telling the machine what is in each picture. The machine has CPU cycles to spare, so what could it do? The images are all upside up, with the sky up and the ground down. But the machine could randomly flip a few and train itself to recognize when the image is flipped versus when the image is as it should be. Here’s another game you can play: each image is color, so there are three channels, RGB [red, green, blue]. So you could try and predict the green from the red and blue.
“Now, it turns out that in order to win at these games, it will have to develop some sense for the key features in the image. And one crucial feature is that trees grow from the ground up in some way. And so it has to recognize the structure of trees or the structure of things that are planted in the ground to recognize what is on the ground and what is not. It doesn't have a high level of semantic knowledge, but it does develop some features that are good preparation for the next step.
“To give you a more advanced example, a student of mine and I have a paper showing how a machine can learn about numbers purely by playing with objects. Suppose that you had a few M&Ms, and you are just tossing them into a cup in front of you, and then you're picking one up and moving it away or putting one in or just scrambling the ones you have and rearranging them like a child would do. We demonstrate that the machine is able to learn the concept of number, an abstract concept, purely by playing with little objects, taking one out and putting one in, and so on. And it's quite interesting how that concept, that abstraction, can emerge from no supervision at all.”
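For readers who want to see the two simpler "games" Perona describes in concrete terms, here is a minimal sketch of how unlabeled images could be turned into training pairs for them. It is an illustration under stated assumptions, not the method used in any particular paper: the image representation (rows of `(r, g, b)` tuples) and the function names `make_flip_example`, `make_green_example`, and `build_pretext_dataset` are hypothetical, and no model is trained here. The point is only that the labels come from the images themselves.

```python
import random

# Assumption: an image is a list of rows, and each row is a list of
# (r, g, b) tuples. Real pipelines would use array libraries instead.


def make_flip_example(image, rng):
    """Game 1: randomly turn the image upside down.

    Returns (image, label), where label is 1 if the image was flipped
    vertically and 0 if it was left as it was.
    """
    flipped = rng.random() < 0.5
    out = image[::-1] if flipped else image  # reversing rows flips top/bottom
    return out, int(flipped)


def make_green_example(image):
    """Game 2: predict the green channel from red and blue.

    Returns (inputs, targets): inputs hold (r, b) per pixel,
    targets hold the matching green value per pixel.
    """
    inputs = [[(r, b) for (r, g, b) in row] for row in image]
    targets = [[g for (r, g, b) in row] for row in image]
    return inputs, targets


def build_pretext_dataset(images, seed=0):
    """Build training pairs for both games from unlabeled images."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    flip_task = [make_flip_example(img, rng) for img in images]
    green_task = [make_green_example(img) for img in images]
    return flip_task, green_task


if __name__ == "__main__":
    # Toy 2x2 "photo": bluish sky on the top row, greenish ground below.
    toy = [[(0, 0, 255), (10, 20, 250)],
           [(30, 200, 40), (20, 180, 30)]]
    flip_task, green_task = build_pretext_dataset([toy] * 4, seed=1)
    print([label for _, label in flip_task])
    print(green_task[0])
```

In both games the supervision signal is manufactured from the data: the flip label is known because the program did the flipping, and the green target is simply a channel that was held back from the input. A network trained on such pairs would then serve as a starting point for a later, labeled task.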