ADI Framework

The ADI Data Processing Framework consists of the following major components:

  • Library modules for the key data processing tasks

  • User hooks between each of the core modules

  • Supporting functions used to implement the core modules and user hooks

[Figure: Overview of the ADI data processing framework]

Process modules enable you to select, retrieve, and manipulate ARM data. The process modules form the steps of data processing and, in order of use, are:

  • Initialization

  • Retrieval

  • Merge

  • Transform

  • Finish

Initialization

Initialization begins the process of data retrieval.

Retrieval

The Retrieval module selects and downloads the data you want.

Merge

The Merge module(s) merge the downloaded data into a new data set.

Transform

The Transform Data module maps the retrieved data onto a user-defined coordinate grid using one of several supported transformation methods or by using a method developed and integrated into the ADI framework by the user. The specifics of how the mapping is performed are controlled through parameters set by the user. These parameters can be set in a configuration file or through the use of an internal function. A subset of the parameters that define the core criteria for performing a transformation can also be set in the PCM’s Coordinate System Definition Form.

Non-core parameters are parameters that define the characteristics of the input data. They provide more advanced control of the transformation onto the output grid. These supplemental parameters can be used to transform instantaneous values, smooth data, and handle unexpected variations in the data, such as gaps.

Quality Control (QC) that meets ARM standards is automatically incorporated into the transformation process. Values can also be excluded from the transformation by applying a user-defined QC mask prior to the execution of the transformation module. In addition, QC and metrics relevant to the transformation method applied are created during the mapping process and are available for inclusion in output data products.

The successful transformation of an input variable’s data to a new coordinate system involves

  • defining the output grid

  • selecting the method of transforming input to output

  • identifying and setting the parameters of the output and input grid that will affect the transformation

  • filtering undesirable input data from the transformation by using the input variable’s QC or building a QC mask as necessary.

The resulting output includes not only data transformed to the new coordinate system, but also QC describing various QC conditions that occurred during the transformation and metrics giving additional information about the transform.

The following sections describe

  • the supported transformation methods and how to apply them to a set of retrieved variables

  • summaries of the transformation parameters, output QC states, and metrics

  • an overview of the steps involved in executing a transformation

  • and the process of integrating and applying a transformation developed by the user.

Defining the Output Grid

A coordinate system is defined by specifying the values of individual coordinate fields. This can be done in several ways:

  1. specifying parameters that describe the dimension(s)

  2. defining the dimension as equal to a dimension associated with an input datastream

  3. explicitly defining the individual values of the coordinate dimension.

A grid is considered uniform if the intervals of all dimensions that make up the grid are equally spaced. An irregular grid is one whose dimension intervals are not equally spaced. The table below describes the three methods to define a grid, the type of grid to which each approach can be applied, and the method for defining the coordinate dimension.

Coordinate Dimension Definition Methods

Approach                                      | Grid Type          | Method
----------------------------------------------|--------------------|-------------------------------------------------------------
specify parameters                            | uniform            | set 3 of the following 4 parameters: start, end, interval, length (see Setting Transform Parameters)
map to existing input coordinate dimension    | uniform, irregular | use the PCM Coordinate System Definition Form
explicitly define coordinate dimension values | uniform, irregular | set coordinate variable data values using ADI functions

While a uniform dimension (i.e., one for which the interval or spacing between sample values is the same for all samples) can be defined using any of these three methods, it is most expedient to do so by specifying three of the following four transformation parameters: interval, start, end, and length. Mapping to an existing input coordinate dimension is most efficiently performed using the PCM. Currently, the only method of explicitly defining the values of an output coordinate dimension is to assign values to the coordinate dimension before the transformation process begins. This would typically be done in the pre_transform_hook, using ADI functions such as dsproc_get_coord_var and dsproc_alloc_var_data_index to get the coordinate dimension variable and an index to the first sample. The user can then set the values in any fashion that can be programmed (for example, by calculating values or by setting them to values found in a flat file).
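The sketch below, in C, illustrates this pattern. The hook and function signatures, the variable names, and the assumption that the coordinate variable is typed double are all illustrative; consult the libdsproc3 reference for the definitive forms.

int pre_transform_hook(void *user_data, time_t begin_date,
                       time_t end_date, CDSGroup *ret_data)
{
    CDSVar *var    = dsproc_get_retrieved_var("temperature", 0);
    CDSVar *height = dsproc_get_coord_var(var, 1);  /* dim 1 = height */
    int     nlev   = 100;                           /* hypothetical length */
    double *values;
    int     i;

    if (!var || !height) return -1;

    /* Get an index to the first sample of the coordinate variable */
    values = (double *)dsproc_alloc_var_data_index(height, 0, nlev);
    if (!values) return -1;

    /* Set monotonically increasing values (e.g., 30 m levels) */
    for (i = 0; i < nlev; i++) {
        values[i] = 30.0 * i;
    }

    return 1;
}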

When explicitly defining a coordinate variable, note that all coordinate dimensions other than time are expected to be complete (i.e., to have valid values for all samples across their length) and monotonically increasing or decreasing. Time is the unlimited dimension and as such can grow to any length. While also monotonically increasing, time is permitted to have gaps, but these gaps should be noted in a companion time QC variable.

Select Transform Method

This section discusses the three standard transformation methods supported in ADI (bin average, interpolation, and subsampling), as well as a custom "summing in quadrature" transform.

Transformations are defined and applied at the variable level. The transformation method that should be applied is a function of the variable being transformed and whether the input grid’s interval is equal, smaller, or larger than the output grid’s interval.

Variable Type                  | Grid Interval Relative Size                 | Transformation Method(s)
-------------------------------|---------------------------------------------|---------------------------------------
not an uncertainty measurement | input grid interval = output grid interval  | Bin Average, Interpolation, Subsample
not an uncertainty measurement | any                                         | Interpolation, Subsample
not an uncertainty measurement | input grid interval < output grid interval  | Bin Average
uncertainty measurement        | N/A                                         | Quadrature

If a transform method is not explicitly set, the bin average method is applied by default when the output grid interval is larger than the input grid interval; the interpolate method is used when the output grid interval is smaller than or equal to the input grid interval. If either or both grids are irregular, ADI will still attempt to guess which default transformation to use based on the average interval over the whole span of the grid, which can lead to unexpected results. Note that a grid intended to be uniform can appear to be irregular if there are one or more large gaps in the grid (i.e., gaps in the data in the input file). Because gaps in the time dimension (i.e., the distance between samples being larger than the defined interval) occur relatively frequently, it is recommended that the user always explicitly define input and output bin widths using the transform parameter 'width'. How to specify this is discussed in the next section, Setting Transform Parameters.

The subsections that follow describe each of the standard transform methods, present the applicable transform parameters, and list the possible QC states of the transformed variables, the metrics set, and the issues associated with each method.

Bin Average

Discussion

Averaging is the most complicated standard transform, because it requires your input and output data to represent a region (or span) of your coordinate space, not just a single point. To emphasize this fact we call this method a bin average; the input and output data are represented by bins with a finite width.

[Figure: Averaging One Minute Data onto Five Minute Bins for Different Alignments]

Thus, we need two numbers to index our variables: the front and back edge of each coordinate bin, or a single coordinate index and a bin width. Because netCDF allows only a single index for its coordinates, we generally use the latter formulation.

To fully describe each bin, therefore, we need a coordinate index, a width, and the alignment of that index. The alignment tells you to where in the bin your coordinate value is pegged. An alignment of 0.0 means you are indexing the front of the bin, while an alignment of 1.0 indexes the back edge of the bin and 0.5 indexes the middle of the bin. The span of your bin is therefore given by [coord-alignment*width, coord+(1.0-alignment)*width]. For example, if the coordinate is time, the width is 60 seconds, and the alignment is 0.5, the span of the bin indexed by the value time == 60 is [30,90] seconds.

Once we have a handle on our bins, then we are ready to take the average. The fundamental idea behind bin averaging is that each input bin is weighted by the fraction of the overlap with the span of the transformed bin. Most interior input bins will be completely covered by the transformed bin, so their weights will be 1.0. But bins on the edge may straddle two different transformed bins, and thus their contribution has to be split between them.

The figure Averaging One Minute Data onto Five Minute Bins for Different Alignments above illustrates some of these concepts. In the example, we are taking one-minute data and averaging it into five-minute bins. The input values that contribute to a given bin depend upon where the input index lies in each bin; if we index the front edge (Alignment=0.0) then we average the first five data points, but if Alignment=1.0 we average the second through sixth (index 1 through 5) data points. If Alignment=0.5, the first and last input bins do not completely fall within the span of the transformed bin, so those values must be weighted by the fraction that does lie in the transformed bin.
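The bin-edge and weighting arithmetic described above can be summarized in a short C sketch; the function names here are illustrative, not part of the ADI API.

#include <math.h>

/* Bin span from a coordinate value, width, and alignment:
 * [coord - alignment*width, coord + (1.0 - alignment)*width] */
void bin_edges(double coord, double width, double alignment,
               double *front, double *back)
{
    *front = coord - alignment * width;
    *back  = coord + (1.0 - alignment) * width;
}

/* Averaging weight: fraction of input bin [in_front, in_back]
 * that overlaps output bin [out_front, out_back]. Interior bins
 * yield 1.0; bins straddling an edge yield a partial weight. */
double bin_weight(double in_front, double in_back,
                  double out_front, double out_back)
{
    double overlap = fmin(in_back, out_back) - fmax(in_front, out_front);

    return (overlap > 0.0) ? overlap / (in_back - in_front) : 0.0;
}

For example, with width = 60, alignment = 0.5, and coord = 60, bin_edges() yields the span [30, 90], matching the time example above.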

Key Concepts

  • Bin span - width, alignment, and bin edges

  • Weights and input and output bin overlap

Other Concepts

  • Irregular grids

  • Rolling means and incomplete input spans

  • Default widths and alignments

Specific Transform Parameters

front_edge
back_edge
width
alignment

QC States

QC_ALL_BAD_INPUTS
QC_BAD
QC_ESTIMATED_INPUT_BIN
QC_ESTIMATED_OUTPUT_BIN
QC_INDETERMINATE
QC_NOT_USING_CLOSEST
QC_OUTSIDE_RANGE
QC_SOME_BAD_INPUTS
QC_ZERO_WEIGHT

Metrics

  • Standard deviation

  • % of output bin spanned by bad data

  • % of output bin spanned by indeterminate data

Interpolation

Discussion

The standard interpolation transformation is linear (as opposed to, for example, more complicated polynomial or spline interpolation methods). We take the nearest bracketing input points around our target transformed coordinate index, draw a straight line through them, and take the value of that line corresponding to our target index. This is the default transformation when we try to transform data from a larger grid to a smaller one, and can also be used to (for example) shift every index in a grid half a bin over.

Because we use only two input points to calculate every output point, the QC state of the inputs is very important. Thus, if one of our inputs has been flagged QC_BAD, we must not use it to calculate our output. Instead, we scan up or down the input grid until we find the nearest good point in that direction that is still within our defined range. If no such good point exists, we scan in the other direction until we find a good point to use; in that case, the transform actually becomes an extrapolation (which is mathematically identical to an interpolation; the only difference is that instead of bracketing our target index, the two points we use are on the same side).

If we do not use the two closest bracketing points to interpolate, we set the QC bit QC_INTERPOLATE to indicate that a non-standard interpolation took place. If we had to extrapolate we set the QC_EXTRAPOLATE bit, and if we could not find two good points to interpolate from within the given range of the transformation, we set the data to missing and set QC_BAD and QC_ALL_BAD_INPUTS.
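Once two usable points have been located, the calculation itself is the familiar straight-line formula; the sketch below (in C, with an illustrative function name) omits the scanning for good points and the QC bookkeeping described above.

/* Value at x of the straight line through (x0, y0) and (x1, y1).
 * If x lies outside [x0, x1], this same formula performs the
 * extrapolation described above. */
double linear_interp(double x, double x0, double y0,
                     double x1, double y1)
{
    return y0 + (y1 - y0) * (x - x0) / (x1 - x0);
}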

Key Concepts

  • Range

  • Extrapolation

Other Concepts

Specific Transform Parameters

range

QC States

QC_ALL_BAD_INPUTS
QC_BAD
QC_ESTIMATED_INPUT_BIN
QC_ESTIMATED_OUTPUT_BIN
QC_INDETERMINATE
QC_NOT_USING_CLOSEST
QC_OUTSIDE_RANGE
QC_SOME_BAD_INPUTS
QC_ZERO_WEIGHT

Metrics

Currently, only averaging has metrics defined. A "Distance" metric for interpolation (and subsampling) is suggested for future development.

Subsampling

Discussion

Subsampling is the simplest transform, and consists of simply taking the nearest good input point within our range. This might be called "nearest-neighbor interpolation"; we call it subsampling because it has been used primarily to map a rapidly sampled measurement onto a slower grid. The direction of the sampling doesn't matter; we take the value of the point with the least absolute distance to the target index. If no such good point can be found within the specified range, we set the output QC to QC_ALL_BAD_INPUTS and the output data value to missing. If the nearest good point is not the nearest absolute point (i.e., the nearest point was flagged as bad), we set the QC_NOT_USING_CLOSEST bit. If there are no input points at all within our specified range (if there was a gap in the data, for example), then the QC_OUTSIDE_RANGE bit is set.

Key Concepts

  • Range

Other Concepts

Specific Transform Parameters

range

QC States

QC_ALL_BAD_INPUTS
QC_BAD
QC_ESTIMATED_INPUT_BIN
QC_ESTIMATED_OUTPUT_BIN
QC_INDETERMINATE
QC_NOT_USING_CLOSEST
QC_OUTSIDE_RANGE
QC_SOME_BAD_INPUTS
QC_ZERO_WEIGHT

Metrics

Currently, only averaging has metrics defined. A "Distance" metric for interpolation (and subsampling) is suggested for future development.

Setting Transform Parameters

Transform parameters are variables that allow customization and information passing between different ADI functions and the data transformation module. Transformation parameters are defined to characterize the coordinate system grid being created and to specify information about the input data. Which parameters need to be set and whether to apply them to the input grid, output grid, or both is a function of qualities associated with the input data and the transformation method being applied. Core parameters refer to output grid parameters that need to be set for all supported transformation methods. These include transform, interval, alignment, and width.

Parameters that can be set through the PCM Coordinate System Definition Form should be set there. Transformation parameters that are needed but cannot be set in the PCM should be defined in a transformation configuration file. The use of the internal function cds_set_transform_param should be reserved for situations where the value of a parameter can only be deduced during processing.

The table below lists the available transformation parameters and indicates whether they should always be defined (Core parameters), and to what they are applied (i.e. an input datastream, output datastream, or a coordinate system). As noted in the table, there are transform parameters that are required for the input datastream. If these parameters are not explicitly defined then their value will be inferred by examining the data. While this works well in many cases, gaps in data (periods where the interval between samples is greater than the expected value) and other data variations can result in unexpected output. Users are strongly encouraged to explicitly set both the input and output grid core transform parameters, as it is difficult to predict when problems may arise from the default settings.

Transform Parameter | Core Parameter | Grid                  | Transform Type         | Grid Type
--------------------|----------------|-----------------------|------------------------|----------
transform           | Y              | coordinate sys        | all                    | both
interval            | Y              | input, output         | interpolate, average   | regular
alignment           | Y              | input, output         | average                | both
start               | Y              | input, output         | all                    | both
length              | Y              | input, output         | all                    | both
width               | Y              | input, output         | average                | regular
front_edge          | N              | input, output         | average                | both
back_edge           | N              | input, output         | average                | both
range               | N              | input, coordinate sys | interpolate, subsample | both
qc_bad              | N              | input                 | all                    | both
missing_value       | N              | input                 | all                    | both
qc_mask             | N              | input                 | all                    | both
std_bad_max         | N              | coordinate sys        | average                | both
std_ind_max         | N              | coordinate sys        | average                | both
goodfrac_bad_min    | N              | coordinate sys        | average                | both
goodfrac_ind_min    | N              | coordinate sys        | average                | both
data_type           | N              | coordinate sys        | all                    | both
units               | N              | coordinate sys        | all                    | both
values              | N              | coordinate sys        | all                    | both

Defining Transform Parameters in a Configuration File

A configuration file should be used to set transform parameters that cannot be set using the PCM. If a parameter is defined in both the PCM and a configuration file, the PCM value will override the value in the configuration file.

Location of transform configuration files

Location of transform configuration files that will NOT change over time

In most cases, the transformation configuration files will not change and as such should be defined in configuration files located in the VAP's code base and stored in the GitLab repository for that VAP. The files should be located in the VAP's conf directory, in a subdirectory named 'transform', in a file whose name indicates to what the parameters should be applied (i.e., the input datastream, coordinate system, or output datastream). When the VAP's build is executed, these files will be released to the following directory location:

$>VAP_HOME/conf/vap/<PCM process name>/transform

where ‘PCM process name’ refers to the name of the process as defined in the PCM.

Location of transform configuration files that will change over time

If it is anticipated that the files may need to be updated following release, they should instead be located in:

$>CONF_DATA/transform/<PCM process name>

Naming transformation configuration files

Parameters relating to the input grid are defined in files named after the input datastream from which the variable being transformed was retrieved. The file name should reflect at least the base platform name and data level of the input datastream (e.g., sirs.b1), but can include the site (e.g. sgpsirs.b1), or site and facility (sgpsirsC1.b1) if the parameters vary across site and facility.

Output grid transform parameters are defined in files named for the output coordinate grid as defined in the PCM Process GUI for that VAP. The name of these files should exactly match the name assigned to the coordinate system in the PCM. An example is a file that applies transformation parameters to the 'coords_1' coordinate system in the sfccldgrid2long_caracena process. The coordinate system as defined in the PCM Process GUI, and the file that sets the parameters (named after that coordinate system), are shown below. The location of the file is as installed on production; it is part of the sfccldgrid2long_caracena repository (https://code.arm.gov/vap/sfccldgrid2long_caracena/-/tree/master/conf).

/apps/process/conf/vap/sfccldgrid2long_caracena/transform/coords_1
How to Define Transformation Parameters in a Configuration File

Parameters are set in the configuration files using colons to separate variables, dimensions, and parameters, and an equals sign to assign a value.

Parameters can be applied to a specific dimension of a single variable:

<outvar>:<dim>:<parameter> = <value>

temperature:time:width = 60;

to all dimensions belonging to a variable:

<outvar>:<parameter> = <value>

temperature:alignment = 0.5;

or globally, to all variables that use a given dimension:

<dim>:<parameter> = <value>

time:width = 60;

The Transform Parameters section documents, for each transform parameter, whether it can be applied globally, to a dimension, or to a variable.

Using the Transform Parameter Function

WARNING:

Currently, only the input grid parameters and the core parameters that can be defined in the PCM can be set using cds_set_transform_param. The ability to define parameters relating to the output grid internally in code is functionality that will be rolled out in the future.

The cds_set_transform_param function should be used to set transform parameters only if their values are a function of process state.
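Pending a worked example in this documentation, the fragment below sketches the general calling pattern, assuming a cds_set_transform_param() signature of the form (parameter group, object name, parameter name, data type, length, value pointer); trans_params and computed_width are hypothetical placeholders, and the libcds3 reference should be consulted for the definitive form.

double width = computed_width;  /* value deduced during processing */

cds_set_transform_param(trans_params, "time", "width",
                        CDS_DOUBLE, 1, &width);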


Filtering Input Samples

Filtering input samples for undesirable values allows the transformations to avoid and/or replace bad data and to tag output points that use indeterminate input data. By default, ADI will use the QC associated with the variable being mapped to a new grid (or onto itself) to filter all samples with assessments of "Bad", as well as values that have been assigned the value of the variable's missing_value attribute. As such, input QC, if available, should be retrieved so it can be utilized in the transformation process (although it is not strictly necessary). Input variable QC is expected to use ARM standard bit-packed QC, along with the appropriate bit assessments. Certain transform parameters and ADI methods allow you to map non-standard input QC into usable ARM-standard QC values.

When a transformation encounters a data point that is flagged as bad or for which a mask has been created, it will not use it; instead, it attempts to “go around” that data point in whatever way makes sense for that transformation. The interpolation transformation, for example, will not interpolate using a bad input point but will scan up or down the input data until it finds the next nearest good bracketing point to interpolate to.

The qc_mask parameter should be applied to input that meets ARM QC standards, but for which one or more of the bits with assessments of "Bad" are to be transformed as if they were "Indeterminate". This allows developers to select which QC impacts the output data product.

The qc_bad parameter provides a mechanism for incorporating non-ARM standard QC into the transformation process. It should be applied to ARM input data that uses older conventions to document QC.

Transform Parameters

The directory structure and location of transformation parameter files are discussed in the tutorial section Defining Transform Parameters in a Configuration File.

The transform parameters supported by the ADI framework are described below.

transform

<outvar>:<dim>:transform = {string};
<outvar>:transform = {string};
<dim>:transform = {string};

Any of these forms allows you to specify the transform used for the given <outvar> and/or <dim>. Current allowable values are TRANS_BIN_AVERAGE, TRANS_INTERPOLATE, and TRANS_SUBSAMPLE; TRANS_PASSTHROUGH will eventually be supported. User-specified transform functions can be registered via assign_transform_function(); the "name" argument to that function must match this transform parameter for the custom transform to be used. This parameter is not applicable to input datastreams.

interval

<coordinate var name>:interval = {double};

This specifies the interval (the difference between two values) of the given coordinate variable. Obviously, it assumes a regular grid. This is used to assign default transformation functions - if the interval for the output coordinate is greater than the interval for the input coordinate, the default transformation is TRANS_BIN_AVERAGE; otherwise it is TRANS_INTERPOLATE. If this is not set we will attempt to infer it from the actual data.

alignment

<var>:<dim>:alignment = {double};
<var>:alignment = {double};
<dim>:alignment = {double};

Paired with width to specify the bin dimensions for each element of dim; this parameter tells you where in the bin the coordinate variable for <dim> is. Alignment == 0 means the coordinate variable is the front_edge of the bin, while alignment == 1 means it is the back_edge, and alignment == 0.5 is the center of the bin.

Thus:

front_edge[i] = coord[i] - alignment*width[i]
back_edge[i] = coord[i] + (1.0 - alignment)*width[i]

If front_edge and back_edge are specified, alignment and width are ignored.

For front, middle, and end alignment with a width of 60 seconds, a sample time of 60 seconds past midnight represents the ranges [60,120], [30,90], and [0,60] seconds past midnight, respectively. If you don't set the alignment transform parameter, the transform code assumes that both your input and output sample times represent the center of the time bin. Absent a specific need, a good value of alignment is 0.5, the center of the bin, to avoid the "shifting" that may occur when averaging data that is indexed at the edges of the bin.
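For example, centered one-minute time bins can be declared in a configuration file as follows (the values are illustrative):

time:width = 60;
time:alignment = 0.5;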

start

<coordinate var name>:start = {double};

Specifies the first value of the coordinate dimension. If the dimension is time, the units are seconds. The full size of the dimension is defined by setting the three transform parameters: start, length, and interval.

length

<coordinate var name>:length = {double};

Specifies the number of values for the coordinate dimension. The full size of the dimension is defined by setting the three transform parameters: start, length, and interval.
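For example, a one-day time grid of 60-second samples can be defined by setting the three parameters together (assuming start is expressed in seconds relative to the beginning of the processing interval):

time:start = 0;
time:interval = 60;
time:length = 1440;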

width

<var>:<dim>:width = {list of doubles of size length(dim)};
<var>:width = {list of doubles of size length(dim)};
<dim>:width = {list of doubles of size length(dim)};

The width parameter gives the width of each bin in the units of the dimension; you can specify just one value or a different value for each element in the dimension. If not explicitly set, the default behavior is to assume that the input bins completely span the input space. If there are no gaps and the grid is regularly spaced, this will be equal to the interval parameter. If there are gaps in the data, the bins where the gaps exist will be assumed to be irregularly sized, which can lead to unexpected output values. To avoid this possibility, explicitly define the width parameter of regularly spaced grids. For typical situations, set width equal to interval.

front_edge

<var>:<dim>:front_edge = {list of doubles of size length(dim)};
<var>:front_edge = {list of doubles of size length(dim)};
<dim>:front_edge = {list of doubles of size length(dim)};

Specifies the front edge of each bin in dimension <dim>; it is an error to provide a number of values that is not the same as the size of the dimension in question. This is used by TRANS_BIN_AVERAGE to figure out what fraction of each input bin overlaps the given output bin. (Bin descriptions work, and are probably required somehow, for both input and output variables).

back_edge

<var>:<dim>:back_edge = {list of doubles of size length(dim)};
<var>:back_edge = {list of doubles of size length(dim)};
<dim>:back_edge = {list of doubles of size length(dim)};

Same as above, only (obviously) the back_edge of each bin. You should specify both a front_edge and back_edge if you specify either.

range

<invar>:<dim>:range = {double};

Used by TRANS_INTERPOLATE and TRANS_SUBSAMPLE, range gives the maximum distance over which to interpolate or subsample in the given <dim>. For example, if range = 1800 in time, we will not interpolate a value if the nearest good sample time to interpolate with is more than half an hour away from the target time. Points that cannot be transformed because the good inputs are outside the range are filled with missing_values and the QC_OUTSIDE_RANGE bit is set. (Note that these points may still be filled in by a transformation in another dimension, if we are doing Serial_1D transformations.) The range defaults to the size of the data type associated with the dimension being transformed.

qc_bad

<qc_invar>:qc_bad = {list of integers};

Used to map non-bitpacked value-based QC (aqc) fields onto ARM standard QC (which is what the core transform functions need). Any integer values listed in qc_bad will set the QC_BAD bit on the QC used by the core transforms. Any non-zero integer value not listed will set the QC_INDETERMINATE bit.
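For example, if a value-based QC field marks bad samples with the (hypothetical) values 2 and 3, a configuration file entry of the following form maps them to QC_BAD (the comma-separated form is assumed from the {list of integers} notation above):

qc_temperature:qc_bad = 2, 3;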

missing_value

<invar>:<dim>:missing_value = {double};

Sets the missing value for the given input variable and dimension. Defaults to -9999; apparently does NOT read the metadata, which it probably should do.

qc_mask

<invar>:<dim>:qc_mask = {single bit packed integer};
<invar>:qc_mask = {single bit packed integer};
<dim>:qc_mask = {single bit packed integer};

Sets the qc_mask for the given input variable and dimension; these are the bits that, if set in qc_invar, will cause the transform to decide the input data is “Bad” and will need to be filled in (according to the rules of the transform). Any bits that are not in qc_mask will cause the transform to interpret the input data sample as “Indeterminate”; it will be used normally according to the rules of the transform, but the output qc var will have the QC_INDETERMINATE_INPUTS bit set. If defined for specific variables, <invar> is the name of the variable as defined in the input datastream DOD.

By default, the qc_mask is determined by reading the bit_N_assessment attributes of the input variable <qc_invar>. If these attributes do not exist, we assume that <qc_invar> is a value-based AQC field instead and must be mapped to QC bits as described above under qc_bad; accordingly, qc_mask then becomes QC_BAD.
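For example, to treat only bits 1 and 3 of the input QC as "Bad" (a hypothetical mask of 2^0 + 2^2 = 5):

temperature:qc_mask = 5;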

Note

When we go from state-based QC to bit-packed QC, we set QC_BAD for bad data, so we need QC_BAD set in our mask to properly filter that data.

std_bad_max

<invar>:<dim>:std_bad_max = {double};
<invar>:std_bad_max = {double};
<dim>:std_bad_max = {double};

Defines a maximum limit on the variable's transform metrics variable <var>_std such that samples with values above this limit have bit_10 (QC_BAD_SDEV) of qc_<var> set with a bad assessment.

std_ind_max

<invar>:<dim>:std_ind_max = {double};
<invar>:std_ind_max = {double};
<dim>:std_ind_max = {double};

Defines a maximum limit on the variable's transform metrics variable <var>_std such that samples with values above this limit have bit_11 (QC_INDETERMINATE_SDEV) of qc_<var> set with an indeterminate assessment.

goodfrac_bad_min

<invar>:<dim>:goodfrac_bad_min = {double};
<invar>:goodfrac_bad_min = {double};
<dim>:goodfrac_bad_min = {double};

Defines a minimum limit on the variable's transform metrics variable <var>_goodfraction such that samples with values below this limit have bit_12 (QC_BAD_GOODFRAC) of qc_<var> set with a bad assessment.

goodfrac_ind_min

<invar>:<dim>:goodfrac_ind_min = {double};
<invar>:goodfrac_ind_min = {double};
<dim>:goodfrac_ind_min = {double};

Defines a minimum limit on the variable's transform metrics variable <var>_goodfraction such that samples with values below this limit have bit_13 (QC_INDETERMINATE_GOODFRAC) of qc_<var> set with an indeterminate assessment.

data_type

<dim>:data_type = {type};

Specifies the data type to use for the coordinate variable (if not specified in the PCM).

units

<dim>:units = {string};

Specifies the units to use for the coordinate variable (if not specified in the PCM).

values

<dim>:values = {array of values};

Specifies the coordinate variable's data values.

QC

All QC fields are assumed to have regular ARM-style QC: in other words, they are bit-packed integers, where each bit represents a particular test for that field and a bit value of ‘1’ means that datum failed the test.

Output QC will be in this format, and the QC values set upon return from the driver function will apply tests particular to the transformation process. In other words, the QC field you get after transformation contains the QC for the transformation, which “washes out” the input QC.

Right now, we can tell on output whether some of the inputs used in the transformation were “Bad” or “Indeterminate” (i.e., there are bits set in the output QC field corresponding to those tests), but we can’t keep any other details about the input QC. This is because there is no one-to-one mapping (in general) between the input and output grid. For example, if you are taking hourly averages of one-minute data, you used up to sixty input points to create one output point; we have to condense the QC state of those sixty inputs, and in doing so we necessarily lose some information.

Input QC

Not all ARM data provides standard ARM QC fields yet; either no QC is done at all, or the QC is given in a non-standard way. There are a couple of ways to deal with this situation. First, we could build a preprocessing step or hook that constructs a proper bit-packed QC field from the available input information. This is the most flexible way, but it requires the developer to actually write a function to do this, and it would be specific to the fields in question (so you would have to write a whole new one each time you had to do this).

If the non-standard QC field follows certain conventions, then we have some built-in tools to help us. If the QC field is of type integer and uses specific values to indicate specific QC states (the most likely situation), then you can use the qc_bad transform parameter to map bad states to bad QC bits. If the qc_bad transform parameter exists, the driver function will automatically create a new array of bit-packed QC values, with the QC_BAD bit set if the input QC field matches one of the values in qc_bad, and QC_INDETERMINATE set if the input QC field was nonzero and did not match qc_bad.

Note, however, that the non-standard QC field still has to be passed in to the driver function; this means that whatever is calling the driver has to deal with it if the input QC field has a non-standard name, as well. Presumably, either the PCM or some other transform parameter could ultimately be used to indicate when one input field is actually a QC field of another field.

Output QC

For every variable transformed, a companion bit-packed QC variable is created with the name qc_<varname>. Whenever anything unusual happens in the transform, an appropriate QC bit describing that condition is set for the output point. Most of the output QC bits are indeterminate, because transformations attempt to "fix" bad data; the only way bad data comes out of the transformation is if no such correction was possible (for example, if there were no good input points to interpolate with within the specified range). In this case, the output data is set to the missing value and the QC_BAD bit is set in the QC.

Note that transformed data can still be flagged as bad post-hoc; if the transformed value exceeds a valid maximum, for example, it should still be flagged as bad. But the transformation itself will not usually set non-missing output data to bad. The table below summarizes all the possible QC tests applied to transformed data. The transform methods that can set each state are noted as follows.

QC_BAD

Description: Transformation was unsuccessful, and a valid data value could not be calculated, data value set to -9999.

QC Assessment: Bad

Comments: An example that will trip this bit is if all values are bad or outside the range.

Transformation Method(s): All

QC_INDETERMINATE

Description: Some, or all, of the input values used to create this output value had a QC assessment of Indeterminate.

Assessment: Indeterminate

Comments: Indeterminate inputs are still used exactly as if they were “Good”; the only indication that those inputs were not good is the fact that QC_INDETERMINATE is set on output.

Transformation Method(s): All

QC_INTERPOLATE

Description: Indicates that a non-standard interpolation, using points other than the two that bracket the target index, was applied.

Assessment: Indeterminate

Comments: An example of why this may occur is if one or both of the nearest points was flagged as bad.

Transformation Method(s): Interpolate

QC_EXTRAPOLATE

Description: Indicates that extrapolation was performed from two points on the same side of the target index.

Assessment: Indeterminate

Comments: This occurs because the input grid doesn't span the output grid, or because all the points within range and on one side of the target were flagged as bad.

Transformation Method(s): Interpolate

QC_NOT_USING_CLOSEST

Description: Nearest good point was not used (because it was missing or bad), but a valid, more distant point within range was found and used instead.

Assessment: Indeterminate

Comments:

Transformation Method(s): Subsample

QC_SOME_BAD_INPUTS

Description: Some, but not all, of the inputs in the averaging window were flagged as bad and excluded from the transform.

Assessment: Indeterminate

Comments: This means we are averaging fewer data values than expected.

Transformation Method(s): Bin average

QC_ZERO_WEIGHT

Description: The weights for all the input points to be averaged for this output bin were zero.

Assessment: Indeterminate

Comments: The output “average” value is set to zero, independent of the value of the input. The assessment is indeterminate because, depending on the variable to which it applies, this does not always reflect an error. For example, for a cloud liquid water path measurement when there is no cloud the weights would be zero, and the “averaged” output would, correctly, also be zero.

Transformation Method(s): Bin average

QC_OUTSIDE_RANGE (QC_NO_INPUTS)

Description: No input samples exist in the transformation region.

Assessment: Bad

Comments: Nearest good bracketing points are farther away than the “range” transform parameter if transformation is done using the interpolate or subsample method, or “width” if a bin average transform is applied. Test can also fail if more than half an input bin is extrapolated beyond the first or last point of the input grid.

If this test fails for a value dimensioned by time it probably reflects a gap in the input data; either a missing file or a “jump” in samples beyond the expected averaging interval. If this is flagged then QC_BAD is also set.

QC_OUTSIDE_RANGE does not necessarily indicate a problem with the data. If you are transforming the "time" coordinate dimension, it probably is a problem with the data (a data gap), but for other dimensions the output grid may have been set beyond the edges of the input grid. In some instances the setting of this flag should be expected. For example, there is no cloud or aerosol data above 20 km, but the atmosphere may be described up to 68 km for the thermodynamic fields; in such a case the cloud and aerosol fields would always have QC_OUTSIDE_RANGE for heights above 20 km.

Transformation Method(s): All

QC_ALL_BAD_INPUTS

Description: All the input values in the transformation region are bad.

Assessment: Bad

Comments: The transformation could not be completed. Values in output grid are set to missing and QC_BAD bit also set.

This means slightly different things for the different transforms. For the bin average transform method, the test means all the points that were “attempted to be averaged” were bad. For the interpolate and subsample methods it usually means “every” point in our 1D slice of data that is to be transformed was bad. This test reflects an occurrence of an unrecoverable situation for which a value for the variable on the new coordinate grid cannot be determined.

Transformation Method(s): All

QC_BAD_STD

Description: Standard deviation over averaging interval is greater than limit set by transform parameter std_bad_max.

Assessment: Bad

Comments: Applies only to the bin average transformation method

QC_INDETERMINATE_STD

Description: Standard deviation over averaging interval is greater than limit set by transform parameter std_ind_max.

Assessment: Indeterminate

Comments: Applies only to the bin average transformation method

QC_BAD_GOODFRAC

Description: Fraction of good and indeterminate points over the averaging interval is less than the limit set by transform parameter goodfrac_bad_min.

Assessment: Bad

Comments: Applies only to the bin average transformation method

QC_INDETERMINATE_GOODFRAC

Description: Fraction of good and indeterminate points over the averaging interval is less than the limit set by transform parameter goodfrac_ind_min.

Assessment: Indeterminate

Comments: Applies only to the bin average transformation method

The table below indicates the type of transformation to which each of the possible transform QC tests apply.

Transform QC Bit          | average method | interpolate method | subsample method
--------------------------|----------------|--------------------|------------------
QC_BAD                    | X              | X                  | X
QC_INDETERMINATE          | X              | X                  | X
QC_INTERPOLATE            |                | X                  |
QC_EXTRAPOLATE            |                | X                  |
QC_NOT_USING_CLOSEST      |                |                    | X
QC_SOME_BAD_INPUTS        | X              |                    |
QC_ZERO_WEIGHT            | X              |                    |
QC_OUTSIDE_RANGE          | X              | X                  | X
QC_ALL_BAD_INPUTS         | X              | X                  | X
QC_BAD_STD                | X              |                    |
QC_INDETERMINATE_STD      | X              |                    |
QC_BAD_GOODFRAC           | X              |                    |
QC_INDETERMINATE_GOODFRAC | X              |                    |

Transform Metrics

Transformations may create companion variables called "metrics" that provide additional details about the transformed data. Currently, only the bin average transform provides metrics: the standard deviation of the points used in the average, and a fractional indicator of the number of good points available in the averaging window. The naming convention for these variables is to append a suffix to the transformed variable name; in the case of the averaging metrics, _std is used for the standard deviation and _goodfraction for the fractional test.

Examples of these variables for the variable wspd_u are:

wspd_u_std(time) : double
long_name     = "Metric std for field wspd_u"
units         = "m/s"
missing_value = [-9999] : double
wspd_u_goodfraction(time) : double
long_name     = "Metric goodfraction for field wspd_u"
units         = "unitless"
missing_value = [-9999] : double

In the future additional metrics may be provided for any transform, but they have not been implemented yet.

To propagate metric variables to the output datastreams, you must define them in the output DOD using the variable names shown above. If they are not defined in a DOD, they will not be in the output.

Because these variables provide an indication of the completeness of the transformation, it is generally prudent to set a limit for one or both of these metrics below which transformed values are considered "Bad".
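For example, to flag averages built from bins that are less than half covered by good or indeterminate points as "Bad" (the 0.5 threshold is an arbitrary illustration):

wspd_u:goodfrac_bad_min = 0.5;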

Creating and Using Custom Transformations

We have built in the ability for end users to create their own custom transformation functions and hook them into the transformation process. For now, this will work with serial 1D transformations; in the future a similar (but slightly different) process will be how we create most of the multidimensional transformations.

There are a few coding steps that have to be taken to use a custom transformation:

  • You must create an interface function, and it must be prototyped according to specific rules so that we can assign it a pointer of the right type. This means the way the input and output data is passed into and out of the function is fixed.

  • You should create a core function that does the actual mathematics, and deals with non-ADI variables and structures (i.e., simple C data arrays). Strictly speaking, you could build the core transformation mathematics into the interface function - the driver only calls the interface function and doesn’t really care how it goes about calculating its outputs. But by putting the core mathematics into its own function, you increase the modularity of the code, which allows it to be used more easily in other contexts. It also means that the suite of interface functions will look very similar, which allows for code reuse - you may not have to do much more than copy an existing interface function and change a few variable names.

  • You have to do some initialization right at the start of your code to let ADI know about your custom interface function, and to assign it an appropriate label.

  • Finally, you have to set the transform transform parameter for the variables and dimensions for which you want to use the custom transform, using the label you created during the initialization step.

Once these coding steps are done, just compile and link your VAP with the custom transformation code and you are ready to go.

Interface Function

The interface function for your custom transformation has to be prototyped as follows:

int trans_name_interface (double *data, int *qc_data,
                          double *odata, int *qc_odata,
                          CDSVar *invar, CDSVar *outvar, int d)

The function name can actually be anything, but by tradition it follows the pattern given above (e.g., trans_interpolate_interface). The various arguments are:

  • data is a one-dimensional, double precision representation of the actual input data in this dimension. It has been pulled out of invar, which is a simple process for 1D data but much more complicated for 2D data. The driver function takes care of allocating and assigning this array; you just have to use it.

  • qc_data is the one-dimensional ARM standard QC for the input data in this dimension. If this input variable has no QC associated with it, this pointer will be NULL, so the interface function has to check that qc_data is not NULL before using it.

  • odata and qc_odata are the corresponding 1D output arrays in the output dimension. Both of these arrays will be created and allocated by the driver function; obviously, they are filled with 0s on input (because the whole point of the transform is to fill these two arrays).

  • invar and outvar are the CDSVar pointers to the actual input and output variables passed to the driver function. They are provided here to give the interface function the opportunity to find metadata like transform parameters and dimensional information for these dimensions. Under no circumstances should your interface function use these two pointers to read or write actual data - you must use the data and odata arrays instead.

  • d is the index of the dimension that is currently being transformed. This is used to find the appropriate coordinate fields for this transformation, and to get dimensional sizes and transform parameters.

To find the length of the input dimension:

ni = invar->dims[d]->length;

To build an array holding the input coordinate values:

incoord = cds_get_coord_var(invar, d);
index = (double *) cds_copy_array(incoord->type, ni,
                                  incoord->data.vp,
                                  CDS_DOUBLE, NULL,
                                  0, NULL, NULL, NULL, NULL, NULL, NULL);

To get the “qc_mask” transform parameter or calculate it if it doesn’t exist:

if (cds_get_transform_param_by_dim(invar, invar->dims[d],
                                   "qc_mask", CDS_INT,
                                   &one, &qc_mask) == NULL) {
    qc_mask = get_qc_mask(invar);
}

Calling the core function:

status = custom_core_fcn(data, qc_data, qc_mask, index,
                         ni, range, odata, qc_odata, target,
                         nt, missing_value);

In general, the interface function won't do anything to data, qc_data, odata, or qc_odata except pass them into the core function. (An exception might be if the input QC has to be calculated in some odd way that can't be done by using the qc_bad transform parameter.) The main purpose of the interface function is to find, by mining invar, outvar, and the transform parameters, all the inputs the core function needs. In the above example, the interface function has to determine index and target (the input and output coordinates used in the transformation), their lengths ni and nt, a missing_value, and the range, which can represent any piece of information this transformation needs.

Core Function

The core function can require any information it needs on input and the calling sequence can be whatever makes sense to the developer. However, it should use no CDS or dsproc structures and should be as general as possible. It should basically be whatever you would write if you were going to post how to do a 1D transformation of this type to a newsgroup or send it to an unknown developer using unknown resources other than a standard C compiler. Anything ADI-ish should be pushed back to the interface function, which should convert it into the standard C structures and arrays the core function can use.

This means, in particular, that transform parameters are not available to the core function, because it doesn't have the CDS infrastructure to find them. They will have to be read in by the interface function and passed into the core function in whatever way makes sense (i.e., as arguments to the core function itself).

Modularizing the core and interface functionality this way makes it easier to integrate core functions written by someone else: all you need is an interface that gathers information from and writes the output to the ADI tools, and you are ready to go. It also means your core functions can easily be used by someone outside of ADI, or even outside ARM, to duplicate what you have done.
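As a concrete illustration, the sketch below is a minimal core function matching the calling sequence shown earlier: a nearest-good-neighbor subsample in one dimension. The QC bit value and the exact semantics are illustrative assumptions, not the ADI implementation.

#include <math.h>

#define EX_QC_BAD 0x1   /* hypothetical output QC bit for this sketch */

int custom_core_fcn(double *data, int *qc_data, int qc_mask,
                    double *index, int ni, double range,
                    double *odata, int *qc_odata, double *target,
                    int nt, double missing_value)
{
    int t, i;

    for (t = 0; t < nt; t++) {
        int    best      = -1;
        double best_dist = range;

        for (i = 0; i < ni; i++) {
            double dist = fabs(index[i] - target[t]);

            /* Skip input samples whose QC matches the bad-bit mask */
            if (qc_data && (qc_data[i] & qc_mask)) continue;

            if (dist <= best_dist) {
                best_dist = dist;
                best      = i;
            }
        }

        if (best < 0) {
            /* No usable input within range: fill with missing value */
            odata[t]    = missing_value;
            qc_odata[t] = EX_QC_BAD;
        } else {
            odata[t]    = data[best];
            qc_odata[t] = 0;
        }
    }

    return 1;   /* success */
}

Note that the function uses only plain C arrays and scalars; everything ADI-specific (transform parameters, CDS variables, QC conventions) has been resolved by the interface function before this point.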

Initialization

Once you have your core and interface functions, you need to hook them into the shared libraries. The way you do that is with a call to assign_transform_function() in your main() function, before you call dsproc_vap_main() or dsproc_transform_main():

int main(int argc, char *argv[])
{
  assign_transform_function("TRANS_FOO", trans_foo_interface);
  dsproc_transform_main(...);
  return(0);
}

The two arguments to assign_transform_function() are a string which will be used to label this transformation, and the exact name of the interface function.

The label is the value to which you set the transform parameter to use this transformation. For example:

temperature:height:transform = TRANS_FOO;

would transform the temperature variable in the height dimension by the custom transformation we assigned in the above example.

Please do not assign a label that is already in use (i.e., TRANS_INTERPOLATE, TRANS_BIN_AVERAGE, TRANS_SUBSAMPLE).

FAQ

Q: The transformation is filling in values where there are unexpected large changes in the time dimension, and I do not want it to.

A: If the transformation applied is not an interpolation, this means that the input data for which the gaps are being filled has not had a width transform parameter defined in a transformation parameter file. Because of this, the bin widths are assumed to be variable, and when a large gap is encountered it is processed as a large bin. Widths of the time dimension for all input datastreams should be defined in a transformation parameter file. To understand where the parameter file should be located and what it should be named, see the section Defining Transform Parameters in a Configuration File. For information on the width parameter, see width.

If the transformation being applied is an interpolation, the filling of gaps is intended functionality, but it can be disabled by setting the range parameter on the input datastreams. The range parameter defines the maximum distance over which the data will be interpolated. See the discussion of range above.

Q: Can I use a transformation to create an average of all values in a retrieved variable and store it as a static variable?

A: Yes. Apply a transform to the variable to be averaged using the PCM 'Coord. System' form such that it has the desired interval and start time, but set the length equal to '1'. Define a dimensionless variable in the output DOD of the process. Do not use the PCM 'Outputs' functionality to populate the output variable. The output variable must be assigned the value resulting from the transformation in either the post_transform or process_data hook.

Finish

Hooks

ADI Data System Process Library

The Data System Process Library is a suite of callable C functions that simplify the access to, manipulation of, and addition of data to ADI’s internal data structures. Bindings to IDL and Python are available to support development in those languages. ADI’s core functionality of retrieving, processing, and storing data is performed using libdsproc3 functions.

The vap_process_data_loop function loops from the begin date to the end date, as specified in the VAP's '-b' and '-e' input arguments, in steps equal to the VAP's processing interval (typically a day); for each step it retrieves the input data and calls the processing function unique to the VAP (<vap_process>_process_data).

The VAP processing function declares pointers for the output variables' data types defined in the header file, sets nsamples for the output datastream, calls dsproc_get_dataset_vars to populate an array of pointers to the output variables and companion output QC variables that can be assigned to the declared data-type pointers (simplifying the readability and use of the data in the analysis portion of the VAP), and is the section of code into which a developer inserts the VAP algorithm. The flow of these steps is illustrated in Figure 8.2. The code for the VAP analysis typically consists of the calculation of the new output variables. Details of the updates are outlined in the following discussion.