SWIFT-AI-DS is a benchmark dataset of samples derived from two simulation runs (each 2.5 years long) of the chemistry and transport model ATLAS (Wohltmann and Rex, 2009; Wohltmann et al., 2010). With nearly 200 million samples, it meets the requirements of a labelled dataset and is well suited for training and testing a machine-learning-based surrogate model.
The simulation runs cover two time periods: November 1998 to March 2001 and November 2004 to March 2007.
The dataset covers the entire Earth geographically, but is vertically restricted to the lower to middle stratosphere, where the SWIFT approach (Rex et al., 2014; Kreyling et al., 2017; Wohltmann et al., 2017) of 24-hour ozone tendencies can be applied. Applicability was determined from the chemical lifetime of stratospheric ozone, which is a function of solar irradiance and altitude. The resulting range of applicability can be described by a dynamic upper bound (Kreyling et al., 2017). Within the range where the chemical lifetime is longer than 14 days, ozone is not in quasi-chemical equilibrium. Moreover, the dataset focuses on the lower to middle stratosphere because this region makes the largest contribution to the total ozone column.
State-of-the-art physical process models for stratospheric chemistry require enormous computational time. Our research focuses on developing much faster, yet accurate, surrogate models for computing the 24-hour tendencies of stratospheric ozone. Such fast ozone models open up new areas of application, for example in climate models. These surrogate models benefit greatly from the methodological and hardware improvements of the last decade.
Each simulation run uses the full stratospheric chemistry model to solve a system of differential equations involving 47 chemical species and 171 chemical reactions at a very high (time steps much shorter than seconds) and variable temporal resolution.
The ATLAS model is driven by ECMWF reanalysis data (either ERA-I or ERA5). The air parcel state was sampled at a 24-hour time step (00:00 UTC model time). During postprocessing, variables are stored, depending on the variable, as 24-hour averages, as 24-hour tendencies, or as the state at the beginning of the 24-hour time step. The dataset is stored in 12 monthly netCDF files.
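The three storage forms mentioned above (24-hour average, 24-hour tendency, state at the beginning of the time step) can be illustrated with a minimal sketch. It is not the ATLAS postprocessing code: the function name `summarize_24h`, the `DailyRecord` container, the trapezoidal time weighting, and the example numbers are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class DailyRecord:
    """Hypothetical container for the three storage forms of one variable."""
    start_state: float  # value at the beginning of the 24-hour step
    mean: float         # time-weighted average over the 24-hour step
    tendency: float     # value at the end minus value at the beginning


def summarize_24h(times_h, values):
    """Reduce a sub-daily series spanning one 24-hour step to a DailyRecord.

    times_h: strictly increasing sample times in hours (variable spacing allowed)
    values:  variable values at those times
    The average uses trapezoidal weighting, which is an assumption here.
    """
    if len(times_h) != len(values) or len(times_h) < 2:
        raise ValueError("need at least two samples with matching lengths")
    integral = 0.0
    for i in range(1, len(times_h)):
        dt = times_h[i] - times_h[i - 1]
        if dt <= 0:
            raise ValueError("times must be strictly increasing")
        integral += 0.5 * (values[i] + values[i - 1]) * dt
    span = times_h[-1] - times_h[0]
    return DailyRecord(
        start_state=values[0],
        mean=integral / span,
        tendency=values[-1] - values[0],
    )


# Illustrative numbers only (not taken from the dataset).
record = summarize_24h([0.0, 6.0, 24.0], [2.0, 4.0, 1.0])
print(record)
```

The time-weighted form is chosen because the chemistry is integrated at a variable temporal resolution, so a plain arithmetic mean over samples would over-weight densely sampled intervals.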
The benchmark dataset consists of training and test data. The variables are described in the document Description_variables.pdf.
Training data:
The training data consist of files containing ca. 100 million data samples. Each sample consists of the input and output features that can be used to train a data-driven model on a regression task.
Input X: choice of variables (see Description_variables.pdf)
Output y: 24-hour tendency of stratospheric ozone
Test data:
The test data are structured like the training data, but comprise ca. 100 million samples that have not been used for training. They can be used to assess model performance.
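The intended workflow (fit a regression model on the training samples, then score it on the held-out test samples) can be sketched as follows. This is a toy stand-in under stated assumptions: it uses a single input feature, an ordinary least-squares line instead of a real surrogate model, and synthetic numbers generated in the script. None of the values or function names come from SWIFT-AI-DS.

```python
import math
import random


def fit_linear(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("input feature has zero variance")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, xs):
    """Apply a fitted (slope, intercept) pair to new inputs."""
    slope, intercept = model
    return [slope * x + intercept for x in xs]


def rmse(y_true, y_pred):
    """Root-mean-square error between targets and predictions."""
    return math.sqrt(
        sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    )


def make_synthetic(n, seed):
    """Synthetic (X, y) pairs; a placeholder for samples read from the files."""
    rng = random.Random(seed)
    xs = [rng.uniform(0.0, 1.0) for _ in range(n)]
    ys = [-0.3 * x + 0.1 + rng.gauss(0.0, 0.01) for x in xs]
    return xs, ys


# Training and test sets are generated separately, so no test sample
# is seen during fitting.
x_train, y_train = make_synthetic(1000, seed=0)
x_test, y_test = make_synthetic(200, seed=1)

model = fit_linear(x_train, y_train)
test_error = rmse(y_test, predict(model, x_test))
print("test RMSE:", test_error)
```

With the real data, X would be the chosen input variables from Description_variables.pdf and y the 24-hour ozone tendency. The key point carried over from the sketch is that the error is reported only on samples not used for fitting.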