Machine learning

Making deep learning practical for Earth system forecasting

Novel “cuboid attention” helps transformers handle large-scale multidimensional data, while diffusion models enable probabilistic prediction.

By Zhihan Gao, Boran Han

September 15, 2023

7 min read

The Earth is a complex system. Variabilities ranging from regular events like temperature fluctuations to extreme events like drought, hailstorms, and the El Niño–Southern Oscillation (ENSO) phenomenon can influence crop yields, delay airline flights, and cause floods and forest fires. Precise and timely forecasting of these variabilities can help people take necessary precautions to avoid crises or better utilize natural resources such as wind and solar energy.

The success of transformer-based models in other AI domains has led researchers to attempt applying them to Earth system forecasting, too. But these efforts have encountered several major challenges. Foremost among these is the high dimensionality of Earth system data: naively applying the transformer’s quadratic-complexity attention mechanism is too computationally expensive.

Most existing machine-learning-based Earth systems models also output single, point forecasts, which are often averages across wide ranges of possible outcomes. Sometimes, however, it may be more important to know that there’s a 10% chance of an extreme weather event than to know the general averages across a range of possible outcomes. And finally, typical machine learning models don’t have guardrails imposed by physical laws or historical precedents and can produce outputs that are unlikely or even impossible.

In recent work, our team at Amazon Web Services has tackled all these challenges. Our paper “Earthformer: Exploring space-time transformers for Earth system forecasting”, published at NeurIPS 2022, suggests a novel attention mechanism we call cuboid attention, which enables transformers to process large-scale, multidimensional data much more efficiently.

And in “PreDiff: Precipitation nowcasting with latent diffusion models”, to appear at NeurIPS 2023, we show that diffusion models can both enable probabilistic forecasts and impose constraints on model outputs, making them much more consistent with both the historical record and the laws of physics.

Earthformer and cuboid attention

The heart of the transformer model is its “attention mechanism”, which enables it to weigh the importance of different parts of an input sequence when processing each element of the output sequence. This mechanism allows transformers to capture spatiotemporally long-range dependencies and relationships in the data, which have not been well modeled by conventional convolutional-neural-network- or recurrent-neural-network-based architectures.

Earth system data, however, is inherently high-dimensional and spatiotemporally complex. In the SEVIR dataset studied in our NeurIPS 2022 paper, for instance, each data sequence consists of 25 frames of data captured at five-minute intervals, each frame having a spatial resolution of 384 x 384 pixels. Using the conventional transformer attention mechanism to process such high-dimensional data would be extremely expensive.

In our NeurIPS 2022 paper, we proposed a novel attention mechanism we call cuboid attention, which decomposes input tensors into cuboids, or higher-dimensional analogues of cubes, and applies attention at the level of each cuboid. Since the computational cost of attention scales quadratically with the tensor size, applying attention locally in each cuboid is much more computationally tractable than trying to compute attention weights across the entire tensor at once. For instance, decomposing along the temporal axis can result in cost reduction by a factor of 384² for the SEVIR dataset, since each frame has a spatial resolution of 384 x 384 pixels

Of course, such decomposition introduces a limitation: attention functions independently within each cuboid, with no communication between cuboids. To address this issue, we also compute global vectors that summarize the cuboids’ attention weights. Other cuboids can factor the global vectors into their own attention weight computations.

Cuboid attention layer processing an input tensor *(X)* with global vectors *(G)*.

We call our transformer-based model with cuboid attention Earthformer. Earthformer adopts a hierarchical encoder-decoder architecture, which gradually encodes the input sequence to multiple levels of representations and generates the prediction via a coarse-to-fine procedure. Each hierarchy includes a stack of cuboid attention blocks. By stacking multiple cuboid attention layers with different configurations, we are able to efficiently explore effective space-time attention.

The Earthformer architecture is a hierarchical transformer encoder-decoder with cuboid attention. In this diagram, *“×D”* means to stack D cuboid attention blocks with residual connections, while *“×M”* means to have M layers of hierarchies.

We experimented with multiple methods for decomposing an input tensor into cuboids. Our empirical studies show that the “axial” pattern, which stacks three unshifted local decompositions along the temporal, height, and width axes, is both effective and efficient. It achieves the best performance while avoiding the exponential computational cost of vanilla attention.

Illustration of cuboid decomposition strategies when the input shape is (*T, H, W*) = (6, 4, 4), and cuboid size is (3, 2, 2). Elements that have the same color belong to the same cuboid and will attend to each other. Local decompositions aggregate contiguous elements of the tensor, and dilated decompositions aggregate elements according to a step function determined by the cuboid size. Both local and dilated decompositions, however, can be shifted by some number of elements along any of the tensor’s axes.

Experimental results

To evaluate Earthformer, we compared it to six state-of-the-art spatiotemporal forecasting models on two real-world datasets: SEVIR, for the task of continuously predicting precipitation probability in the near future (“nowcasting”), and ICAR-ENSO, for forecasting sea surface temperature (SST) anomalies.

On SEVIR, the evaluation metrics we used were standard mean squared error (MSE) and critical success index (CSI), a standard metric in precipitation nowcasting evaluation. CSI is also known as intersection over union (IoU): at different thresholds, it's denoted as CSI-thresh; their mean is denoted as CSI-M.

On both MSE and CSI, Earthformer outperformed all six baseline models across the board. Earthformer with global vectors also uniformly outperformed the version without global vectors.

Model	#Params.(M)	GFLOPS	Metrics
Model	#Params.(M)	GFLOPS	CSI-M↑	CSI-219↑	CSI-181↑	MSE(10^-3)↓
Persistence	-	-	0.2613	0.0526	0.0969	11.5338
UNet	16.6	33	0.3593	0.0577	0.1580	4.1119
ConvLSTM	14.0	527	0.4185	0.1288	0.2482	3.7532
PredRNN	46.6	328	0.4080	0.1312	0.2324	3.9014
PhyDNet	13.7	701	0.3940	0.1288	0.2309	4.8165
E3D-LSTM	35.6	523	0.4038	0.1239	0.2270	4.1702
Rainformer	184.0	170	0.3661	0.0831	0.1670	4.0272
Earthformer w/o global	13.1	257	0.4356	0.1572	0.2716	3.7002
Earthformer	15.1	257	0.4419	0.1791	0.2848	3.6957

On ICAR-ENSO, we report the correlation skill of the three-month-moving-averaged Nino3.4 index, which evaluates the accuracy of SST anomaly prediction across a certain area (170°-120°W, 5°S-5°N) of the Pacific. Earthformer consistently outperforms the baselines in all concerned evaluation metrics, and the version using global vectors further improves performance.

Model	#Params.(M)	GFLOPS	Metrics
Model	#Params.(M)	GFLOPS	C-Nino3.4-M↑	C-Nino3.4-WM↑	MSE(10^-4)↓
Persistence	-	-	0.3221	0. 447	4.581
UNet	12.1	0.4	0.6926	2.102	2.868
ConvLSTM	14.0	11.1	0.6955	2.107	2.657
PredRNN	23.8	85.8	0.6492	1.910	3.044
PhyDNet	3.1	5.7	0.6646	1.965	2.708
E3D-LSTM	12.9	99.8	0.7040	2.125	3.095
Rainformer	19.2	1.3	0.7106	2.153	3.043
Earthformer w/o global	6.6	23.6	0.7239	2.214	2.550
Earthformer	7.6	23.9	0.7329	2.259	2.546

PreDiff

Diffusion models have recently emerged as a leading approach to many AI tasks. Diffusion models are generative models that establish a forward process of iteratively adding Gaussian noise to training samples; the model then learns to incrementally remove the added noise in a reverse diffusion process, gradually reducing the noise level and ultimately resulting in clear and high-quality generation.

During training, the model learns a sequence of transition probabilities between each of the denoising steps it incrementally learns to perform. It is therefore an intrinsically probabilistic model, which is well suited for probabilistic forecasting.

A recent variation on diffusion models is the latent diffusion model: before passing to the diffusion model, an input is first fed to an autoencoder, which has a bottleneck layer that produces a compressed embedding (data representation); the diffusion model is then applied in the compressed space.

In our forthcoming NeurIPS paper, “PreDiff: Precipitation nowcasting with latent diffusion models”, we present PreDiff, a latent diffusion model that uses Earthformer as its core neural-network architecture.

By modifying the transition probabilities of the trained model, we can impose constraints on the model output, making it more likely to conform to some prior knowledge. We achieve this by simply shifting the mean of the learned distribution, until it complies better with the constraint we wish to impose.

An overview of PreDiff. The autoencoder *(e)* encodes the input as a latent vector *(z_cond)*. The latent diffusion model, which adopts the Earthformer architecture, then incrementally denoises *(steps z_t+1 to z₀)* the noisy version of the input *(z_T)*. In the knowledge control step, the transition distributions between denoising steps are modified to accord with prior knowledge.

Results

We evaluated PreDiff on the task of predicting precipitation intensity in the near future (“nowcasting”) on SEVIR. We use anticipated precipitation intensity as a knowledge control to simulate possible extreme weather events like rainstorms and droughts.

We found that knowledge control with anticipated future precipitation intensity effectively guides generation while maintaining fidelity and adherence to the true data distribution. For example, the third row of the following figure simulates how weather unfolds in an extreme case (with probability around 0.35%) where the future average intensity exceeds μ_τ + 4σ_τ. Such simulation can be valuable for estimating potential damage in extreme-rainstorm cases.

A set of example forecasts from PreDiff with knowledge control (PreDiff-KC), i.e., PreDiff under the guidance of anticipated average intensity. From top to bottom: context sequence y, target sequence x, and forecasts from PreDiff-KC showcasing different levels of anticipated future intensity *(μ_τ + nσ_τ)*, where n takes the values −4, −2, 0, 2, and 4.

About the Author

Zhihan Gao

Zhihan Gao is a graduate student in computer science and engineering at Hong Kong University of Science and Technology. He was an intern at Amazon when the work was done.

Boran Han

Boran Han is a senior applied scientist with Amazon Web Services.

Making deep learning practical for Earth system forecasting

Novel “cuboid attention” helps transformers handle large-scale multidimensional data, while diffusion models enable probabilistic prediction.

Earthformer and cuboid attention

Experimental results

PreDiff

Results

Related content

Work with us