Learning self-supervised user representations using contextualized mixture of experts in transformers
Robotic traffic is an endemic problem in digital advertising, often driven by large numbers of fake users engaging in advertising fraud. Temporal sequences of user ad activity contain rich information about user intent while interacting with digital ads, and can be modeled effectively to separate robotic users with abnormal browsing patterns from regular human users. Sequence models trained on user ad activity trails with generative pre-training produce self-supervised user embeddings that work well on the downstream task of robotic user detection. However, they fall short on low-and-slow attacks with very short user sequences, i.e., low-activity robotic users with a small number of ad traffic events. As sophisticated bot traffic rapidly gravitates toward more complex modus operandi and exploits gaps in detection systems, there is a critical need for advanced user models that go beyond modeling activity sequences. This problem is addressed by a variation of TabTransformer networks, which simultaneously encode user behavioral information from sequential data (for long activity sequences) and from tabular and numerical user/ads metadata (for short sequences). Despite the overall improvement in detection with TabTransformers, there are pockets of under-represented traffic slices where model performance is sub-optimal, because the allocation of weights between sequential and tabular features is biased toward optimizing for high-volume slices. To that end, we propose a novel sparse Mixture of Experts with TabTransformers as component experts, where the sparse gating function follows a new context-aware routing mechanism comprising local and global experts. We demonstrate that our proposed model uniformly improves detection and de-biases vanilla TabTransformer networks with respect to user sequence length, with a maximum gain of 33% over the vanilla TabTransformer model achieved on short activity sequences.
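The abstract does not spell out the routing details, so the following is only a minimal sketch of one way a context-aware sparse gate with local and global experts could be arranged. Everything in it is an assumption made for illustration: the experts are stand-in callables rather than TabTransformers, the context is a hypothetical one-hot sequence-length bucket, the gate is a plain linear scorer with top-k selection, and the names `sparse_gate`, `moe_score`, and the mixing weight `alpha` are invented here.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]
Expert = Callable[[Vector], float]  # maps user features to a bot score


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sparse_gate(context: Vector, gate_weights: Sequence[Vector], k: int = 1) -> List[float]:
    """Return per-expert gate values with only the top-k entries non-zero.

    gate_weights[i] is a hypothetical linear scoring vector for local expert i.
    """
    logits = [sum(w * c for w, c in zip(row, context)) for row in gate_weights]
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    probs = softmax([logits[i] for i in top])  # renormalize over the selected experts
    gates = [0.0] * len(logits)
    for i, p in zip(top, probs):
        gates[i] = p
    return gates


def moe_score(
    features: Vector,
    context: Vector,
    global_expert: Expert,
    local_experts: Sequence[Expert],
    gate_weights: Sequence[Vector],
    alpha: float = 0.5,  # assumed fixed mix between global and local outputs
    k: int = 1,
) -> float:
    """Combine an always-on global expert with sparsely routed local experts."""
    gates = sparse_gate(context, gate_weights, k)
    # Only experts with a non-zero gate are evaluated, which is what makes the mixture sparse.
    local = sum(g * local_experts[i](features) for i, g in enumerate(gates) if g > 0.0)
    return alpha * global_expert(features) + (1.0 - alpha) * local


# Toy usage: the context is a one-hot bucket, [short sequence, long sequence].
toy_gate_weights = [[2.0, -2.0], [-2.0, 2.0]]
toy_local_experts = [lambda x: 0.9, lambda x: 0.1]  # stand-ins for slice-specific experts
toy_global_expert = lambda x: 0.5                   # stand-in for the shared expert
example_score = moe_score([], [1.0, 0.0], toy_global_expert, toy_local_experts, toy_gate_weights)
```

In this toy setup a short-sequence context routes to the first local expert only, so under-represented slices get a dedicated expert while the global expert still contributes to every prediction. A learned gate and a learned mixing weight would replace the fixed values used here.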