WACV: Where application-based research finds a home

As video scales up — in both duration and resolution — it raises new research questions.

January 4, 2023

Zongyi (Joe) Liu is a principal scientist in the Amazon Customer Experience and Business Trends (CXBT) organization, which evaluates the customer experience across Amazon’s products and services and works to improve it. Among the products the organization evaluates are deployed artificial-intelligence models, such as Alexa’s voice recognizer or the computer vision models that identify scene boundaries in Prime Video content.

Image: Amazon principal scientist Zongyi (Joe) Liu outside St. Joseph's Oratory in Montréal, Quebec.

Liu is a coauthor on a paper at this year’s Winter Conference on Applications of Computer Vision (WACV), and he’s also one of the organizers of the conference’s Workshop on Video/Audio Quality in Computer Vision.

As its name suggests, Liu says, WACV is distinctive among the major computer vision conferences for its focus on applications. While Amazon researchers publish dozens of papers on fundamental computer vision research each year at conferences like CVPR and ECCV, WACV is an inviting venue for scientific work that directly addresses business problems.

“One thing unique to WACV is that it has an application track, where you don't need to have a fundamental innovation in the algorithm area,” Liu says. “It provides opportunities for industry research scientists to present their work. If you prove you can make an application work, they will take it.”

Expanding vision

Two examples of industrial applications that are driving research in computer vision, Liu says, are long-video segmentation — finding the scene boundaries in a long video — and object recognition and semantic segmentation in ultrahigh-definition (UHD) video.

“Computer vision is growing both spatially and temporally, and it’s hard in both directions,” Liu says. “Today, people focus on, say, 10-second videos, because that's what a GPU memory can handle. Once you go to an hour — or with a sporting event, three hours — that's almost impossible for any memory. If you break it down to a few seconds, that's too granular. You don't have a full picture.

“And the individual images are fundamentally changing, because we're going to UHD. Most images today are 1,080 [pixels per row] or even 640. But when you go to 2,096, it raises new questions. How do you make the algorithm scalable to do object detection? You can downsample, but that can be problematic if you want to detect a very small object. How do you detect the quality of the image and know if it’s been upscaled or if it’s original? What's the quality of it? How does a human perceive it? These are challenges in terms of both computational power and memory.”
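To put rough numbers on the scaling problem Liu describes, here is a back-of-the-envelope Python sketch. It is an illustration only, not anything from Liu's work: the function names, the 30-frames-per-second rate, the 1,920-by-1,080 and 3,840-by-2,160 frame sizes, and the tiling scheme are all assumptions chosen for the example. It estimates the raw size of uncompressed video, shows how small an object becomes after downsampling, and computes overlapping tile offsets, one common alternative to downsampling a whole frame.

```python
def raw_video_bytes(seconds, fps, width, height, channels=3, bytes_per_value=1):
    """Size of uncompressed video, assuming 8-bit RGB frames by default."""
    frames = int(seconds * fps)
    return frames * width * height * channels * bytes_per_value


def downsampled_size(object_px, src_width, dst_width):
    """Width in pixels of an object after the frame is resized to dst_width."""
    return object_px * dst_width / src_width


def tile_origins(length, tile, overlap):
    """Start offsets of overlapping tiles that cover the range [0, length)."""
    assert 0 <= overlap < tile, "overlap must be smaller than the tile size"
    if tile >= length:
        return [0]
    stride = tile - overlap
    starts = list(range(0, length - tile, stride))
    starts.append(length - tile)  # last tile sits flush with the far edge
    return starts


if __name__ == "__main__":
    # Hypothetical clip lengths at an assumed 30 fps and 1,920 x 1,080 pixels.
    ten_seconds = raw_video_bytes(10, 30, 1920, 1080)
    three_hours = raw_video_bytes(3 * 3600, 30, 1920, 1080)
    print(f"10 s: {ten_seconds / 1e9:.2f} GB, 3 h: {three_hours / 1e12:.2f} TB")

    # A 20-pixel object in an assumed 3,840-pixel-wide UHD frame, resized to 640.
    print(f"object width after downsampling: {downsampled_size(20, 3840, 640):.2f} px")

    # Horizontal tile offsets for 1,280-pixel tiles with 128 pixels of overlap.
    print(tile_origins(3840, 1280, 128))
```

Under these assumptions, ten seconds of uncompressed video comes to about 1.9 gigabytes, while three hours comes to about 2 terabytes, and a 20-pixel object shrinks to roughly 3 pixels when the frame is downsampled. Tiling keeps small objects at full resolution, at the cost of running the detector several times per frame.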

Audio-video alignment

Liu and his colleagues' paper on video segmentation examines the problem of recognizing which segments of a video stream are advertisements.

Figure: A schematic of the algorithm presented in "A deep neural framework to detect individual advertisement (ad) from videos".

“The main contribution of my paper is that we are using the audio instead of video, because using the audio to start segmentation is much more scalable,” Liu says. “Our time is around eight to ten times faster than the state of the art.”

The same principle — that scene breaks and audio disjunctions should align — can be used in reverse, Liu explains, to identify cases in which audio and video have gotten out of sync.

“Audio-video synchronization is a very hard topic because it’s multimodal and because humans are very sensitive to it,” Liu says. “You can accept that an image becomes lower resolution for some time, but if audio and video get out of sync for just a half-second, it can drive you crazy.

“One way people identify it is through cross-correlation. They have a window basically shift the audio with your image features and see which one gives the highest correlation. I can give you an example that I published earlier. When one shot ends and the next starts, you can tell. I detect the transition period from both video and audio. If the video and audio are in sync, their transition times should be aligned. There’s some noise, but if you’re systematically out of sync, you should be able to tell by aggregating all those video and audio signals.”
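The aggregation idea in the quote above can be sketched in a few lines of Python. This is a minimal illustration, not the method from Liu's earlier publication: it assumes shot transitions have already been detected separately in the video and audio streams, and the function name `estimate_av_offset`, the `max_offset` parameter, and the sample timestamps are all hypothetical. It pairs each video transition with the nearest audio transition and takes the median of the differences, so that a few noisy detections do not dominate the estimate.

```python
from statistics import median


def estimate_av_offset(video_cuts, audio_cuts, max_offset=2.0):
    """Estimate how many seconds the audio transitions trail the video ones.

    Positive values mean the audio is late; negative values mean it is early.
    Returns None when no video/audio pair falls within max_offset seconds.
    """
    if not audio_cuts:
        return None
    diffs = []
    for v in video_cuts:
        # Pair each video transition with its closest audio transition.
        nearest = min(audio_cuts, key=lambda a: abs(a - v))
        d = nearest - v
        # Discard pairs that are too far apart to be the same scene change.
        if abs(d) <= max_offset:
            diffs.append(d)
    # The median is robust to a handful of spurious or missed detections.
    return median(diffs) if diffs else None


if __name__ == "__main__":
    # Made-up transition times in seconds; audio is about half a second late.
    video_cuts = [10.0, 25.0, 41.5, 60.0, 78.2]
    audio_cuts = [10.52, 25.47, 42.0, 60.51, 78.69]
    print(estimate_av_offset(video_cuts, audio_cuts))  # 0.5
```

Nearest-neighbor pairing only works when the offset is small compared with the gap between shots. For larger offsets, a sliding-window search over candidate shifts, along the lines of the cross-correlation Liu mentions, would be the more natural choice.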

In this and similar ways, Liu says, focusing on the customer experience can give rise to new and interesting research questions. And when those questions involve computer vision, WACV is the ideal place to present them.

“In industry, we have scientists who aren’t working on academic algorithms,” Liu says. “WACV is really a good conference for scientists like us to present our work.”

About the Author

Larry Hardesty

Larry Hardesty is the editor of the Amazon Science blog. Previously, he was a senior editor at MIT Technology Review and the computer science writer at the MIT News Office.
