Voice assistants such as Siri and Alexa typically process users’ utterances through a pipeline that transcribes the audio into text, interprets the text, and finally responds to the user. One potential issue is that some utterances are devoid of any interesting speech and are thus not worth processing through the entire pipeline; examples include utterances that are dominated by noise or contain no intelligible speech. It is therefore desirable to have a model that filters out such useless utterances before they are ingested for downstream processing, saving system resources. Towards this end, we propose the Combination of Audio and Metadata (CAM) detector to identify utterances that contain only uninteresting speech. Our experimental results show that the CAM detector considerably outperforms either an audio model or a metadata model alone, which demonstrates the effectiveness of the proposed system.
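As a minimal illustration of the idea behind combining audio and metadata signals, the sketch below fuses the scores of two toy models with a weighted average. All function names, features, weights, and thresholds here are illustrative assumptions, not the actual CAM design.

```python
# Hypothetical late-fusion sketch: combine an audio-model score and a
# metadata-model score into one "uninteresting utterance" decision.
# All features, weights, and thresholds are illustrative assumptions.

def audio_score(features):
    # Stand-in for an acoustic model: fraction of low-energy frames.
    frames = features["frame_energies"]
    return sum(1 for e in frames if e < 0.1) / len(frames)

def metadata_score(meta):
    # Stand-in for a metadata model: very short requests without a
    # detected wake word are more likely uninteresting.
    score = 0.0
    if meta["duration_sec"] < 0.5:
        score += 0.6
    if not meta["wake_word_detected"]:
        score += 0.4
    return score

def cam_decision(features, meta, w_audio=0.5, w_meta=0.5, threshold=0.5):
    # Weighted average of the two model scores; True means the utterance
    # would be filtered out before downstream processing.
    fused = w_audio * audio_score(features) + w_meta * metadata_score(meta)
    return fused >= threshold

# Example: a mostly silent, very short clip with no wake word is filtered.
features = {"frame_energies": [0.02, 0.03, 0.01, 0.5]}
meta = {"duration_sec": 0.3, "wake_word_detected": False}
print(cam_decision(features, meta))  # True
```

A learned combiner (e.g. logistic regression over both scores) would replace the fixed weights in practice; the fixed weights here only keep the sketch self-contained.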