Publications

Insert-optimized implementation of streaming data sketches

Pascal Pfeil, Dominik Horn, Orestis Polychroniou, George Erickson, Zhe Heng Eng, Mengchu Cai, Tim Kraska

SIGMOD/PODS 2025 Workshop on Data Management on New Hardware

2025

We present insert-optimized implementations of three fundamental data sketching algorithms: Count Sketch (CS), SpaceSaving (SS), and Karnin-Lang-Liberty (KLL). While these sketches are widely used for approximate query processing and stream analytics, their practical insert performance often falls short of their full potential. Through careful engineering and novel implementation strategies, we achieve substantial

Cloud and systems

Analyzing metastable failures

Rebecca Isaacs, Peter Alvaro, Rupak Majumdar, Kiran Reddy, Mahmoud Salamati, Sadegh Soudjani

ACM SIGOPS 2025 Workshop on Hot Topics in Operating Systems

2025

A metastable failure is a self-sustaining congestive collapse in which a system degrades in response to a transient stressor (e.g., a load surge) but fails to recover after the stressor is removed. These rare but potentially catastrophic events are notoriously hard to diagnose and mitigate, sometimes causing prolonged outages affecting millions of users. Ideally, we would discover susceptibility to metastable

Cloud and systems

Managed resource scaling in Amazon EMR

Vishal Vyas, Andrei Paduroiu, Srikanth Kandula, Hari Ohm Prasath Rajagopal, Mukesh Punhani, Marco Manzo, Ankur Goyal, Santosh Chandrachood, Rick Sears, Joseph Marques, Sushant Majithia

SIGMOD/PODS 2025

2025

Compute elasticity is a primary benefit of using cloud-based data processing platforms such as Amazon EMR, where clusters can be scaled both horizontally and vertically. For example, a query scanning petabytes of data can run faster in a cluster with thousands of nodes compared to one with only a few hundred. However, not all workloads require the same computational power or have the same resource utilization

Cloud and systems

Program synthesis from partial traces

Margarida Ferreira, Victor Nicolet, Joey Dodds, Daniel Kroening

PLDI 2025

2025

We present the first technique to synthesize programs that compose side-effecting functions, pure functions, and control flow, from partial traces containing records of only the side-effecting functions. This technique can be applied to synthesize API composing scripts from logs of calls made to those APIs, or a script from traces of system calls made by a workload, for example. All of the provided traces

Cloud and systems

Marconi: Prefix caching for the era of hybrid LLMs

Rui Pan, Zhuang Wang, Zhen Jia, Can Karakus, Luca Zancato, Tri Dao, Yida Wang, Ravi Netravali

MLSys 2025

2025

Hybrid models that combine the language modeling capabilities of Attention layers with the efficiency of Recurrent layers (e.g., State Space Models) have gained traction in practically supporting long contexts in Large Language Model serving. Yet, the unique properties of these models complicate the usage of complementary efficiency optimizations such as prefix caching that skip redundant computations across

Cloud and systems

DiskGNN: Bridging I/O efficiency and model accuracy for out-of-core GNN training

Renjie Liu, Yichuan Wang, Xiao Yan, Haitian Jiang, Zhenkun Cai, Minjie Wang, Bo Tang, Jinyang Li

SIGMOD/PODS 2025

2025

Graph neural networks (GNNs) are models specialized for graph data and widely used in applications. To train GNNs on large graphs that exceed CPU memory, several systems have been designed to store data on disk and conduct out-of-core processing. However, these systems suffer from either read amplification when conducting random reads for node features that are smaller than a disk page, or degraded model

Cloud and systems

Deterministic in-fleet scan test for a cloud computing platform

Dan Trock, Subramanian Mahadevan, Nilanjan Mukherjee, Lee Harrison, Janusz Rajski, Jerzy Tyszer

ITC 2024

2024

Recently, the semiconductor industry has been alerted by hyperscaler companies reporting the impact of field errors in megascale datacenters. These errors tend to be elusive and very difficult to detect until they affect a particular application several days or months after the IC has been deployed in a fleet. Although the cause of such errors can be manifold, ranging from test escapes and design marginalities to design

Cloud and systems

Distributed training of large language models on AWS Trainium

Xinwei Fu, Zhen Zhang, Haozheng Fan, Guangtai Huang, Randy Huang, Rahul Solanki, Fei Wu, Ron Diamant, Yida Wang

ACM SoCC 2024

2024

Large language models (LLMs) are ubiquitously powerful but prohibitively expensive to train, often requiring thousands of compute devices, typically GPUs. To reduce the cost of training LLMs for customers, Amazon Web Services (AWS) launched the Amazon EC2 trn1 instances, powered by AWS Trainium, Amazon’s homegrown deep-learning accelerator, as an alternative to distributed LLM training. The trn1 instances

Cloud and systems

Forecasting algorithms for intelligent resource scaling: An experimental analysis

Yanlei Diao, Dominik Horn, Andreas Kipf, Oleksandr Shchur, Ines Benito, Wenjian Dong, Davide Pagano, Pascal Pfeil, Vikram Nathan, Murali Narayanaswamy, Tim Kraska

ACM SoCC 2024

2024

There has been a growing demand for making modern cloud-based data analytics systems cost-effective and easy to use. AI-powered intelligent resource scaling is one such effort, aiming at automating scaling decisions for serverless offerings like Amazon Redshift Serverless. The foundation of intelligent resource scaling lies in the ability to forecast query workloads and their resource consumption accurately

Cloud and systems

Vista: Machine learning based database performance troubleshooting framework in Amazon RDS

Vikramank Singh, Zhao Song, Murali Narayanaswamy, Kapil Eknath Vaidya, Tim Kraska

ACM SoCC 2024

2024

Database performance troubleshooting is a complex multi-step process that broadly involves three key stages: (a) Detection: determining what’s wrong and when; (b) Root Cause Analysis (RCA): reasoning about why the performance is poor; (c) Resolution: identifying a fix. A plethora of techniques exist to address each of these problems, but they hardly work in the real world at scale. First, real-world customer

Cloud and systems

The Fuse platform: Integrating data from IoT and other sensors into an industrial spatial digital twin

Gregory Biegel, Nicholas Bower, Will Castelnau

ISPRS Technical Commission IV Symposium 2024

2024

Digital Twins as virtual representations of industrial assets are being used to assimilate varied sources of data for improved awareness and decision making in operations and process optimisation. This paper explores the integration of IoT sensors into a spatial digital twin called Fuse that Woodside Energy has been building for the assets it operates. We describe the Fuse platform and its knowledge graph

Cloud and systems

Cloud resource protection via automated security property reasoning

Zhixing Xu, Daniel Guo, Oksana Tkachuk, Saeed Nejati, Niloofar Razavi, George Argyros

ASE 2024

2024

As cloud computing gains widespread adoption across various industries, securing cloud resources has become a top priority for cloud providers. However, ensuring configuration security among highly interconnected cloud resources is challenging due to the complexities of resource modeling, correlation analysis, and large-scale security checks. To tackle those practical challenges, we propose Security Invariants

Cloud and systems

Data science projects development with Amazon SageMaker

Yuri Demchenko, Oleg Chertov, Marharyta Aleksandrova, Juan J. Cuadrado-Gallego

Big Data Infrastructure Technologies for Data Analytics

2024

This chapter discusses SageMaker, a fully managed machine learning (ML) service provided by Amazon Web Services (AWS). Being a fully managed service means that a user does not have to deal with hardware setup, patching, management, backups, etc. All this is taken care of by the service provider. The user can choose from a wide variety of computing instance types that are optimized for different tasks,

Cloud and systems

Pattern template manifest for live video streaming

Yongjun Wu, Kyle Koceski, Mairo Pedrini, Sally Cheng, Parminder Singh

MMSP 2024

2024

In live video streaming, the size of the manifest grows linearly with the overall time duration it covers in many scenarios. Such behavior exists across streaming technologies, e.g., Dynamic Adaptive Streaming over HTTP (DASH), HTTP Live Streaming (HLS), and Microsoft Smooth Streaming (MSS). It introduces significant overhead for manifest generation on cloud services, download latency and network

Cloud and systems

Membrane – Safe and performant data access controls in Apache Spark in the presence of imperative code

Andrei Paduroiu, Sungheun Wi, Yan Yan, Roni Burd, Ruhollah Farchtchi, Giovanni Matteo Fumarola

VLDB 2024

2024

Data Governance is an increasingly critical feature of modern cloud database systems, enabling administrators to set granular access policies on their data. AWS customers want to define row or column filtering on their blob storage data and access it using popular tools such as Apache Spark. AWS EMR provides a managed and serverless solution that lets users run Spark jobs in the AWS cloud with imperative

Cloud and systems
