Setting up Data and Creating New Projects

This section describes the environment variables used to locate data, process files, and executables; how to set up data using those environment variables; the locations the environment variables can point to on the development servers; and how to create new process repositories to store code. It concludes with an overview of the key development stages in the ADI Development Steps section.

Environment Variables

ADI shared libraries, VAPs, and ingests use environment variables to determine the location of data, configuration files, and binaries. Because these locations are defined by environment variables, they can be changed simply by resetting the variables, without modifying any source code.

Data Environment Variables

  • DATA_HOME
    • The base directory in which the data subdirectories required by ARM processes are located

    • Base directory for datastream, configuration, logs, and quicklook data which each also have environment variables

    • Expected last directory in path = ‘data’

    • Subdirectories:
      • conf

      • datastream

      • logs

      • quicklook(s) or www/process

    • Examples:
      • prod location: /data/

      • test location: /data/home/dev/vap/<pcm_process_name>/DATA/data

      • user location: /data/home/<username>/data

  • DATASTREAM_DATA
    • Location for netCDF data. For VAPs this includes input and output netcdf data

    • This must be equal to $DATA_HOME/datastream

    • Expected last directory in path = ‘datastream’

    • Subdirectories: /<site>/<site><datastream><facility>.<data_level>

    • Examples:
      • prod location: /data/datastream

      • test location: /data/home/dev/vap/<pcm_process>/DATA/data/datastream

      • user location: /data/home/<username>/data/datastream

  • LOGS_DATA:
    • Location of logs generated during run.

    • This must be equal to $DATA_HOME/logs.

    • Expected last directory in path = ‘logs’

    • Subdirectories: /<site>/proc_logs/<site><pcm_process_name><facility>

      !!Note: these subdirectories are always created by the ADI libraries!!

    • Examples:
      • prod location: /data/logs/

      • test location: /data/home/dev/vap/twrmr/DATA/data/logs

      • user location: /data/home/<username>/data/logs

  • COLLECTION_DATA:
    • Location of input data of raw files (applies to ingests only)

    • This must be equal to $DATA_HOME/collection

    • Expected last directory in path = ‘collection’

  • CONF_DATA
    • This must be equal to $DATA_HOME/conf

    • Location for configuration files that change more than once a year. Within this directory, files can be organized by site in
      • $CONF_DATA/<site>/<site><process_name><facility>

        or by vap in

      • $CONF_DATA/<vap>/

There are two additional datastream environment variables that can be used to isolate the input data sources from the output data sources. This is useful if you want to read the data in from /data/archive but write it out to another area.

  • DATASTREAM_DATA_IN
    • Expected last directory in path = ‘datastream’

      Same as DATASTREAM_DATA but only used to find the input datastream directories.

  • DATASTREAM_DATA_OUT
    • Expected last directory in path = ‘datastream’

      Same as DATASTREAM_DATA but only used to find the output datastream directories.
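Taken together, a development environment that honors the "must be equal to $DATA_HOME/..." constraints above can be sketched in the shell. This is a minimal illustration only; the scratch root used here is an assumption standing in for the prod, test, or user locations listed above.

```shell
# Minimal sketch: define DATA_HOME and its child variables.
# The scratch root here is illustrative only; on production DATA_HOME
# is /data, and on a development server it is one of the locations above.
export DATA_HOME="${HOME}/adi_scratch/data"

# Each child variable must be defined relative to DATA_HOME:
export DATASTREAM_DATA="${DATA_HOME}/datastream"
export LOGS_DATA="${DATA_HOME}/logs"
export CONF_DATA="${DATA_HOME}/conf"
export COLLECTION_DATA="${DATA_HOME}/collection"   # ingests only

# Create the subdirectories (the logs subtree is created by the
# ADI libraries at run time, as noted above):
mkdir -p "${DATASTREAM_DATA}" "${LOGS_DATA}" "${CONF_DATA}"
```

Resetting DATA_HOME (and redefining the child variables from it) is all that is needed to repoint a process at a different data area.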

VAP Executable Environment Variable

The following environment variables are only required if the VAP makes use of configuration files.

  • VAP_HOME
    • Base directory for VAP binaries and executables.

    • Expected last directory in path = none

    • Subdirectories:
      • bin

      • bytecode

      • conf

      • include

    • Examples:
      • prod location: /apps/process

      • user location: /home/<username>/apps/process

VAP Configuration Environment Variable

  • $VAP_HOME/vap/conf
    • Location of configuration files that do not change over time, or change at most once a year.

      These files are maintained in the VAP’s GitLab repository, and released to the $VAP_HOME/vap/conf as part of the build process.

Methods of updating VAP configuration files in CONF_DATA

Because the files in $CONF_DATA are not released, an alternative method of installing them on the production processing system is needed. There are two possible methods of updating files in CONF_DATA: (1) create a stand-alone task in ServiceNow to have the system administrators copy them into the desired location, or (2) use doorstep to install the configuration files. Details for both methods are described below.

  • Updating files by requesting they be copied to production.

    This method is recommended when the file will be updated infrequently (a few times a year), or when it only needs to be transferred to production once when the VAP (or a new site for the VAP) is set up, because subsequent updates will be made automatically by the VAP process.

    Requests to transfer files to production should be made via a ServiceNow task, preferably in an ENG or EWO associated with the VAP, or, if those are not available, in a stand-alone incident. Describe where the files should be installed in $CONF_DATA, note the location where the files to be transferred to production can be found, and assign the task to the ADC system administrators.

  • Updating files using doorstep:

    !!This method can currently ONLY be used to install files to $CONF_DATA/<site>/<site><process_name><facility>!! As such, it only supports installation of conf files that require a separate file for each site and facility. To use this method:

    1. Notify the individual who will be providing the new or updated files to deliver them via ftp.arm.gov as ‘anonymous’, using their email address as the password. They should place the files in the directory corresponding to the site and facility to which the conf files apply (i.e. /pub/sites/<site><facility>/<process_name>_conffiles).

    2. Submit a task in ServiceNow to have the doorstep.conf file updated. Preferably the task should be a child of an ENG or EWO associated with the VAP, or, if those are not available, a stand-alone incident. Assign this task to Brian Ermold. Note the process name, the sites and facilities that will have files, and who should receive notification when files have been updated.

ARM Data Locations

On production, DATA_HOME is always /data.

However, where to set DATA_HOME when running on the development server depends on why the process is being executed. The location differs based on whether the process is being run
  • on production in an event-driven mode

  • for production in ARMFlow in manual mode

  • for production in ARMFlow in reprocess mode

  • to execute a formal process test

  • to run an evaluation VAP whose output will be shipped to the archive

  • to do large scale testing to validate the logic and algorithm of a process.

For the first three cases, which relate to production processing, the user executes the processes via ARMFlow, which sets up all the environment variables. This section discusses the test data area, locations to process evaluation data, and where to set up data for large-scale testing and validation.

Test Data Area

This is a defined area for formal tests associated with VAPs and ingests. It is the location where the dsproc_test application expects to find the input data for the test cases it executes, and where it will write the output data it creates. The location is a function of the type of process (VAP or ingest) and includes a subdirectory ‘DATA’ before the end-point directory ‘data’.

  • Path and example for VAP
    • DATA_HOME = /data/home/dev/vap/<pcm_process_name>/DATA/data

    • DATA_HOME = /data/home/dev/vap/twrmr/DATA/data

  • Path and example for ingest
    • DATA_HOME = /data/home/dev/ingest/<pcm_process_name>/DATA/data

    • DATA_HOME = /data/home/dev/ingest/sirs/DATA/data

  • This directory and its child directories must all be set up by the developer responsible for the VAP.

  • Input files for each input data source must be copied to this area. (!!Do not use symbolic links to /data/archive!! as this could cause a test case to fail unexpectedly should the files in /data/archive be reprocessed.)

  • The permissions on all files created should be group writable (rw-rw-r--, i.e., 664) so that any developer can run the test and overwrite existing data.
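As a concrete sketch, setting up a VAP test data area might look like the following. The process name "examplevap" and TEST_ROOT are stand-ins for illustration; on the development server the real root is /data/home/dev/vap.

```shell
# Hypothetical process name and scratch root (assumptions for
# illustration); on the dev server the real root is /data/home/dev/vap.
TEST_ROOT="${HOME}/adi_scratch/dev/vap"
export DATA_HOME="${TEST_ROOT}/examplevap/DATA/data"

# The developer must create DATA_HOME and its child directories:
mkdir -p "${DATA_HOME}/datastream" "${DATA_HOME}/logs" "${DATA_HOME}/conf"

# Input files must be COPIED (never symlinked) from /data/archive, e.g.:
#   cp /data/archive/<site>/<input_datastream>/<file> \
#      "${DATA_HOME}/datastream/<site>/<input_datastream>/"

# Make everything group writable so any developer can rerun the test
# and overwrite existing data:
chmod -R g+w "${DATA_HOME}"
```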

Evaluation Data Area

Currently, evaluation processes cannot be run through ARMFlow. This area is set aside for creating evaluation data that is intended to be shipped to the archive via a ServiceNow Release Data to Archive workflow. The location is a function of the user who will be running the process and the process being run (i.e., the VAP repository, not the PCM process).

  • Path and example for vap
    • DATA_HOME = /data/vap/<username>/<vapname>/

    • DATA_HOME = /data/vap/gaustad/mfraod

  • Typically a ‘data’ directory is not included, and DATA_HOME is noted as above. A ‘data’ directory can be added if a user chooses, as long as the child environment variables are all defined with respect to DATA_HOME (i.e., DATASTREAM_DATA = $DATA_HOME/datastream).

  • This directory and its child directories must all be set up by the developer responsible for the VAP.

  • Input directories for each input data source should typically use symbolic links to /data/archive. (!!Do not copy files if they are unchanged from the /data/archive area!!)

  • The permissions on all files created should not be group writable (e.g., rw-r--r--, i.e., 644) so that other developers cannot overwrite existing data.
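Using the /data/vap/gaustad/mfraod example above, setting up an evaluation area could be sketched as follows. EVAL_ROOT is an assumption standing in for /data/vap on the development server.

```shell
# Scratch stand-in for /data/vap (assumption for illustration).
EVAL_ROOT="${HOME}/adi_scratch/vap"

# Location is <username>/<vapname>; no trailing 'data' directory
# is required for the evaluation area.
export DATA_HOME="${EVAL_ROOT}/gaustad/mfraod"

# Child variables are still defined relative to DATA_HOME:
export DATASTREAM_DATA="${DATA_HOME}/datastream"
export LOGS_DATA="${DATA_HOME}/logs"
mkdir -p "${DATASTREAM_DATA}" "${LOGS_DATA}"

# Remove group write permission so other developers cannot overwrite
# evaluation data destined for the archive:
chmod -R g-w "${DATA_HOME}"
```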

User Data Area

An additional location a developer can point to is their own /data/home/<username> area. This area can be used for testing outside of formal test data cases, but it is not intended to produce data that will be shipped to the archive; it is more of a scratch area. As such, DATA_HOME is not required to be in any particular directory within /data/home/<username>. This location can be used for digging deeper into certain periods where problems are found during larger-scale processing in the /data/vap area, without having to overwrite the files in /data/vap.

  • Paths are not managed beyond requiring that they be in /data/home/<username>; they can be set there, or by process name, or any other way a developer chooses. Sample locations include:
    • DATA_HOME = /data/home/<username>/data

    • DATA_HOME = /data/home/<username>/<process_name>/data

    • DATA_HOME = /data/home/<username>/DATA/data

  • This directory and its child directories must all be set up by the developer responsible for the VAP.

  • Input directories for each input data source should typically use symbolic links to /data/archive but can also link to another developer’s area or contain actual files if the number is small.
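A scratch-area setup with a symlinked input datastream might look like this sketch. The archive stand-in path and datastream name are assumptions for illustration; on the development server the link target would live under /data/archive.

```shell
# Scratch user data area (stand-in for /data/home/<username>/data).
export DATA_HOME="${HOME}/adi_scratch/userdata/data"
mkdir -p "${DATA_HOME}/datastream/sgp" "${DATA_HOME}/logs"

# Stand-in for an input datastream directory under /data/archive
# (hypothetical datastream name, for illustration only):
ARCHIVE_DS="${HOME}/adi_scratch/archive/sgp/sgpexampleE13.b1"
mkdir -p "${ARCHIVE_DS}"

# Symlink the input datastream directory rather than copying the files:
ln -sfn "${ARCHIVE_DS}" "${DATA_HOME}/datastream/sgp/sgpexampleE13.b1"
```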

Creating New Repositories in GitLab

Need section on creating empty repository

create_adi_project

create_adi_project is a source code generation tool that uses the PCM database entries to create a C, IDL, or Python software project for processes defined in the PCM. The generated project can compile and run with no additional code, producing netCDF files with all variables that can be derived from the database entries made via the PCM. The source code produced has hooks into which users can insert their own code, thus jump-starting the development of their ARM Value-Added Products (VAPs).

After the VAP process has been fully defined in the PCM and saved to the DSDB, the create_adi_project application can be run to create a C or Python project (use of IDL is discouraged) comprised of:
  • a main module,

  • hooks for the ADI Data Processing Modules (shown in green at <https://engineering.arm.gov/ADI_doc/framework.html#data-processing-modules>),

  • supporting files documenting retrieved, transformed, and output variables, and

  • files needed to build the VAP.

There are templates to create ingest and VAP projects.

create_adi_project Command Line Arguments

The required input parameters for create_adi_project include the specification of the process for which templates are being produced, the template type, and the directory into which the templates will be created. Optional input parameters are provided to document the source code with the developer’s contact information, to produce a dump of the DSDB elements associated with the process into a JSON data file, and to run from such a JSON dump rather than accessing process information from the DSDB. A complete summary of the create_adi_project command line options is shown in the following table along with an example.

create_adi_project Usage

Input Arguments

Argument           Value               Req  Description
-h, --help         N/A
-p, --process      <process>           Yes  Name of process defined in PCM
-t, --template     <template>          Yes  Type of template to create
-o, --output       <output directory>  Yes  Directory location to place templates
-d                 N/A                      Dump JSON data from web service
-i, --input        <input file>        No   JSON file to use instead of getting process parameters from the DSDB
-a, --author       <name>              No   Developer’s name
-n, --phone        <phone>             No   Developer’s phone number
-e, --email        <email>             No   Developer’s email address
-v, --dodversion   <version>           No   Create output field file using a specific DOD version; the default is to create the output file as the union of all DODs

create_adi_project Templates

Which template type should be provided in the -t option is dependent on whether the intent is to create the initial set of source code or to propagate changes made in the PCM that will impact the source code. Primary templates create all the necessary files for a process’s project. The available primary templates by supported languages are shown below.

create_adi_project Primary Templates

C           IDL             Python
transform   idl_transform   py_transform
retriever   idl_retriever   py_retriever
ingest      idl_ingest      py_ingest


WARNING: Updating any template in which you have inserted logic will result in the loss of that logic. To prevent overwriting the main module <process>_vap.c and supporting <process>_vap.h files into which developers add their code, the template generator should not be rerun using the ‘transform’, ‘retriever’, or ‘ingest’ templates after development has begun.

Individual elements of the project are also templates and should be used to implement updates, if necessary, after development has begun. The available supporting templates are listed in the table below, along with the primary template(s) to which each applies.

create_adi_project Supporting Templates by Primary Template

Primary Template     C                 IDL                       Python
transform            makefiles         idl_makefiles             makefiles py_makefile_lib
retriever            makefiles         idl_retriever_makefiles   makefiles py_makefile_lib
ingest               makefiles_ingest  idl_makefiles_ingest      makefiles_ingest py_makefile_lib_ingest
transform            vars              idl_vars                  py_vars
retriever            vars_retriever    idl_vars_retriever        py_vars_retriever
transform retriever  test              idl_test                  py_test
ingest               test              idl_test_ingest           py_test_ingest
transform retriever  input_fields      idl_input_fields          py_input_fields
transform retriever  output_fields     idl_output_fields         py_output_fields
transform retriever  trans_fields      idl_trans_fields          py_trans_fields

Not all updates to the PCM result in the need to regenerate header files. After any change to the data stored in the DSDB, the user must determine whether they need to rerun create_adi_project with a subcomponent template option, and if so, which one. A good rule of thumb is that if the change affects a DSDB entity (for example, a new variable is being retrieved, a variable is renamed, or a coordinate system is defined), then an update is needed. If, however, the change only affects a DSDB entity’s attribute (such as the units of a variable, or the sampling interval of a transform), then none of the VAP’s header files will need to be regenerated.

If a change was made to the PCM that required an update to the source code, and the template was not rerun to create the impacted file(s), the process will fail to run and will produce an error message indicating the header file with the inconsistency. Table 6.2 lists the template types for C projects, the files they create, and a description of each file’s purpose.

create_adi_project C Templates (IDL and Python are identical, but are preceded with idl or py followed by an underscore)

transform
  • Files created: <process>_vap.c, <process>_vap.h, Makefile, Makefile.aux, <process>_input_fields.h, <process>_trans_fields.h, <process>_output_fields.h

  • Creates all files that comprise a C project that will perform a transformation. Typically run only when creating the initial set of templates. <process>_vap.c is the main source code into which the user should add VAP-specific logic (see the following figure). <process>_vap.h defines the prototypes, structures, and macros needed by the user.

retriever
  • Files created: <process>_vap.c, <process>_vap.h, Makefile, Makefile.aux, <process>_input_fields.h, <process>_output_fields.h

  • Creates all files that comprise a C project that will not perform a transformation. Typically run only when creating the initial set of templates. <process>_vap.c is the main source code into which the user should add specific logic. <process>_vap.h defines the prototypes, structures, and macros needed by the user.

vars
  • Files created: <process>_input_fields.h, <process>_trans_fields.h, <process>_output_fields.h

  • Creates all header files. Users should not edit these files. This template can be run if the user is unsure whether, and if so which, header file is affected by a change to the PCM entries.

input_fields
  • Files created: <process>_input_fields.h

  • Contains the structure of retrieved variables and indexes to access the names within the structure. Not used by the ADI libraries, but provided to encourage standardized access to input fields. Users should not edit this file.

trans_fields
  • Files created: <process>_trans_fields.h

  • Contains structures of retrieved variables in the context of the coordinate systems to which they have been assigned in the PCM. In addition to the variable name, indexes to access the values within the structures are named based on the coordinate system and the datastream group from which the variable was retrieved. Not used by the ADI libraries, but provided to standardize access to transformed fields. Users should not edit this file.

output_fields
  • Files created: <process>_output_fields.h

  • Contains structures of output variables in the context of the output datastreams. In addition to the variable name, indexes to access the values within the structures are named based on the output datastream name and level. Not used by the ADI libraries, but provided to encourage standardized access to output fields. Users should not edit this file.

makefiles
  • Files created: Makefile, Makefile.aux

  • Makefile is for a system with a SWAWT environment (http://engineering.arm.gov/base/swawt/). Makefile.aux can be updated to link with outside libraries, adjust compile options, etc. Makefiles are typically created with a primary template (transform, retriever, ingest) and should not be recreated after development has begun, as that will overwrite user updates.

Running create_adi_project

The first time create_adi_project is run for a new VAP, it should be run using either the ‘transform’ or ‘retriever’ template and include the author information to document the VAP’s main C module and header file.

$> create_adi_project -p <process name> -t <primary template type> -o <project directory> -a <’developer name’> -n <developer phone number> -e <your email address>

Note: Quotes are necessary around inputs that contain white space, such as the developer name and possibly the phone number.

Compiling the code produced by the templater after being run with a primary template as input will produce a binary that will run “out of the box” with the capability of creating output netCDF file(s) with all passthrough variable values and completed DOD headers. The values of output variables whose values are to be calculated by VAP (i.e., the source code the user will add to the <process_name>.c file) will be populated with fill values to indicate that a value has not yet been assigned. Details of the VAP command line options are discussed in create_adi_project Command Line Arguments.

To compile and run a process created with the create_adi_project:

  1. cd to the location of the C project files

  2. make clean

  3. make

  4. $> <vap_process> -a <dsdb to access> -s <site> -f <facility> -b <YYYYMMDD> -e <YYYYMMDD> -D -R

Users should examine the output produced from the template prior to inserting their own code to validate that the PCM variable definitions, data conversions, and transformations were executed as expected.

Updating create_adi_project Projects

Once a developer has begun adding their own code to the VAP, future runs of the template generator should typically be limited to one of the header templates. A simple approach is to use the ‘vars’ template to recreate all template-generated header files.

$> create_adi_project -p <process name> -t vars -o <project directory>

A summary of the changes to PCM entries that can affect the content of one or more ADI templates is presented in the following table, PCM Changes that Impact create_adi_project Templates. Many changes to the PCM process, retrieval, and output DOD do not require a change to the project’s header files. For example, a reordering of variables in the output file does not affect the source code.

PCM Changes that Impact create_adi_project Templates

PCM Area               PCM Element            Change               Templates Affected
Process                Process Name           Rename               primary template [1] (i.e., transform or retriever)
Process                Process Type           Type assignment      primary template [1]
Retrieval Editor       Source(s): group name  Add, remove, rename  trans_fields
Retrieval Editor       Variable Name          Add, remove, rename  input_fields, trans_fields
Retrieval Editor       Coord System: name     Add, remove, rename  trans_fields
Output Datastream/DOD  Datastream name        Add, remove, rename  output_fields
Output Datastream/DOD  Datastream level       Change               output_fields
Output Datastream/DOD  Variable name          Add, remove, rename  output_fields

Adding Code to Templates

The only template files into which users should add code are Makefile.aux and the VAP source and header files, <process>_vap.c and <process>_vap.h. However, users are free to create additional *.c and *.h files as they see fit to organize their own code. It is recommended that code additions to the <process>_vap.c file be limited to calls to functions defined in other *.c files created by the user.

Before adding their own source code to a project created by create_adi_project, developers should understand the process through which data is retrieved, consolidated, and stored in the ADI applications (i.e., the data_consolidator tool and create_adi_project projects compiled ‘as is’).

ADI Development Steps

(Images: _images/ADIprocess_Page_1.jpg through _images/ADIprocess_Page_8.jpg, the ADI development steps diagrams.)

FAQ

Q: I am getting an error when running create_adi_project for my process, what is causing it?

A: If you have a datastream assigned as an output, but it does not have a DOD defined for it, the project cannot be created. Either add a DOD to the datastream or remove the output datastream from the PCM’s process definition form.

Q: Why does my process end with Suggested exit value:0 (successful), but the output is not what I expected, or the process did not run to completion?

A: The exit value of 0, representing success, indicates that the process completed with no unexpected errors. It is up to the user to insert the necessary error handling logic into their source code. If a process exits with success, but the output is not valid, or the process did not complete, then additional error handling is needed. Find the point the process deviated from the desired output, and set a process error using the DSPROC_ERROR macro or dsproc_error function.

Q: My process is not retrieving a companion QC variable because the variable is not an integer data type in the input file. How do I get my process to run?

A: If a companion QC variable is not an integer it cannot be retrieved. In such cases, do not select the QC checkbox in the Retriever Editor form. Instead, elect to set new min and max values on the variable through the Retriever Editor form. This will result in a new QC variable being created and the limits specified in the PCM applied. If the original QC variable had tests other than min, max, or delta applied, you will also need to explicitly retrieve the QC variable (rather than retrieving it by selecting the QC check box for an explicitly retrieved variable). YOU MUST rename the retrieved QC variable by entering a different name in the ‘Variable Name’ column to prevent it from conflicting with the auto-generated QC variable. Lastly, in a user hook, update the auto-generated QC variable as necessary to properly document the quality of the variable.

Q: How do I create quicklooks inside the process interval loop?

A: Output netCDF files are not created until the VAP has processed all dates falling between the begin and end dates specified on the command line. To create quicklooks for each process interval you must update the VAP to produce an output netCDF file as the processing for each individual interval completes. To do this, call dsproc_store_dataset at the end of the process_data hook (I need to add a link to this once this documentation is added to tutorial) and create the quicklooks when it completes.

Q: The contents of my output file are not what I expect.

A: If you made changes to your source code did you recompile? Is the location of the directory of the output you’re reviewing the same directory used to create data? Did you update the PCM entries? If so, did you recreate the input, transform, and output configuration files and recompile before rerunning your process? If you did, and the PCM changes you made still didn’t take effect, are you sure you selected the appropriate Save button (if you updated a DOD, it needs to be saved separately from changes made to the process definitions).

Q: My VAP runs with an exit value of 0: (Successful), but an output data file is not created.

A: Did you define mappings from your retrieved variables to output variables in the PCM Retriever Editor form? If not, the process will end successfully because it did not encounter an error, but there will be an indication of the problem in the debug messages of the process_data hook. It will be unable to produce a dump of the process_data structure and will note that no data could be found in the output dataset, as shown in the following example:

                                ----- ENTERING PROCESS DATA HOOK -------
dsproc_print.c:168              Creating dataset dump file:
                                 - dataset: /sgpvapexample3E13.c1
                                 - file:    ./debug_dumps/sgpvapexample3E13.c1.19700101.000000.process_data.debug
dsproc_hooks.c:179              ----- EXITING PROCESS DATA HOOK --------

dsproc_dataset_store.c:392      sgpvapexample3E13.c1: No data found in output dataset

===================================================================
dsproc.c:1577                   EXITING PROCESS
===================================================================