Set Up Iterative OBC Generation (for Long CESM-MOM6 Runs) with the Large Data Workflow#

Generating open boundary condition (OBC) data is essential for the entire model runtime, but it can be time-consuming and resource-intensive. Often an OBC dataset can be generated in one shot, but for longer and larger cases (like running the Northwest Atlantic for a year) the generation needs to be broken into iterations.

The Large Data Workflow in CrocoDash helps manage this by breaking data access into smaller, more manageable components.

Large Data Workflow Overview#

The workflow is enabled by setting the too_much_data boolean in case.configure_forcings. This copies a folder of scripts into the forcing folder of the case input directory and generates a configuration file describing how to download the required boundary condition files. An example of this is in CrocoGallery under features/add_data_products.ipynb. Users then run the workflow by adjusting options in the configuration file and executing driver.py in that forcing folder.

Folder Structure#

  • config.json – Defines the region-specific requirements and run parameters.

  • README – Explains the workflow.

  • driver.py – Executes all scripts needed to obtain OBC data.

  • Code/ – Contains all scripts used in the workflow.

  • raw_data/, regridded_data/ – Intermediate storage for workflow steps, preventing the need to rerun all scripts at once.

Scripts#

  1. get_data_piecewise – Retrieves raw, unprocessed data in chunks (size defined by config["params"]["step"]) and saves it to config["raw_data"]; the config keys used by these scripts are sketched after this list.

  2. regrid_data_piecewise – Processes raw data and stores it in config["regridded_data"].

  3. merge_piecewise_dataset – Combines regridded data into the final dataset for model input.
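
For orientation, here is a minimal Python sketch of the kind of content config.json holds, limited to the keys named above (params.step, raw_data, regridded_data). The exact field names and layout are assumptions for illustration only; the generated config.json and the README in the workflow folder are the authoritative reference.

import json

# Hypothetical illustration of what config.json might contain; the real schema
# comes from the generated file, so treat these field names as assumptions.
example_config = {
    "params": {"step": 5},                # chunk length in days used by get_data_piecewise
    "raw_data": "raw_data/",              # destination for unprocessed chunks
    "regridded_data": "regridded_data/",  # destination for regridded chunks
}
print(json.dumps(example_config, indent=2))

# config.json also records which function is used to download the data (see Step 2).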

How to Use#

  1. Identify and allocate available computing resources.

  2. Adjust the step parameter to match resource constraints (default: 5 days); a sketch of the chunking this implies follows this list.

  3. Run each step manually or use driver.py as a guide.
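
To make the step parameter concrete, the short sketch below (plain Python, not CrocoDash code) shows how a date range can be split into step-sized chunks for piecewise downloading and regridding; the function name is illustrative only.

from datetime import datetime, timedelta

def split_date_range(start, end, step_days=5):
    """Yield (chunk_start, chunk_end) pairs covering [start, end] in step_days pieces."""
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + timedelta(days=step_days), end)
        yield chunk_start, chunk_end
        chunk_start = chunk_end

for chunk in split_date_range(datetime(2020, 1, 1), datetime(2020, 1, 9)):
    print(chunk)
# Two chunks: 2020-01-01 to 2020-01-06, then 2020-01-06 to 2020-01-09.

With the default 5-day step, the 9-day example range used in Step 1 is processed in two pieces; a longer run such as a full year simply produces more chunks.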

Step 1: Trigger Large Data Workflow#

This is done by setting the too_much_data argument to True.

# `case` is the CrocoDash case object configured in the earlier setup steps of this demo.
case.configure_forcings(
    date_range = ["2020-01-01 00:00:00", "2020-01-09 00:00:00"],
    too_much_data = True
)

Step 2: Run the iterative OBC processor#

In a terminal session, navigate to the large_data_workflow folder, which is placed in the case input directory under the forcing folder (by default “glorys/large_data_workflow”). Then execute driver.py to generate the boundary conditions. The driver uses the config.json file to generate the OBCs piecewise and then merges them. Modify the code as you see fit!

In particular, consider adjusting which function config.json uses to download the data. On Derecho, use the RDA reader; on a local computer, use the Python GLORYS API. You can switch functions by changing the respective line in config.json, as sketched below.
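
As a rough illustration of that edit, the snippet below loads config.json, changes the download-function entry, and writes the file back. Both the key name ("download_function") and its value are placeholders, since the actual entry name and the available function names are defined by the generated config.json and its README; the same change can simply be made by hand in a text editor.

import json

# Open the generated configuration from inside the large_data_workflow folder.
with open("config.json") as f:
    config = json.load(f)

# Placeholder key and value: check the generated config.json/README for the real
# entry name and the function names available on your platform.
config["download_function"] = "rda_reader"

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)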

Step 3: Process forcing data#

In this final step, we call the process_forcings method of CrocoDash to interpolate the initial condition as well as all boundaries. CrocoDash also updates MOM6 runtime parameters and CESM XML variables accordingly. Because the large data workflow handles the OBCs, process_forcings automatically skips them.

# Interpolates the initial condition and updates MOM6 runtime parameters and CESM
# XML variables; OBC generation is skipped here because the large data workflow handles it.
case.process_forcings()