Large Data Workflow

Open boundary condition (OBC) data must be generated for the entire model run period, which can be time-consuming and resource-intensive. The Large Data Workflow in CrocoDash helps manage this by breaking data access into smaller, more manageable pieces.

Workflow Overview

The workflow is enabled by setting the too_much_data boolean in case.configure_forcings. This copies a folder of workflow scripts into the forcing folder of the case input directory and generates a configuration file describing the boundary condition files to download. An example is available in CrocoGallery under features/add_data_products.ipynb. To run the workflow, execute driver.py in the forcing folder, adjusting options in the config file as needed.
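
As a rough sketch, enabling the workflow during forcing configuration might look like the following. The date_range argument and the surrounding case setup are assumptions, so consult features/add_data_products.ipynb in CrocoGallery for the authoritative call.

    # Sketch only: enable the Large Data Workflow when configuring forcings.
    # `case` is an already-created CrocoDash case; the date_range values are
    # hypothetical placeholders for the OBC date range used in the setup.
    case.configure_forcings(
        date_range=["2020-01-01 00:00:00", "2020-01-31 00:00:00"],
        too_much_data=True,  # switch that triggers the piecewise workflow
    )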

Folder Structure

  • config.json – Defines the region-specific requirements and run parameters (a sketch of adjusting it follows this list).

  • README – Explains the workflow.

  • driver.py – Executes all scripts needed to obtain OBC data.

  • Code/ – Contains all scripts used in the workflow.

  • raw_data/, regridded_data/ – Intermediate storage between workflow steps, so the full pipeline does not have to be rerun from scratch.
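
As an illustration, the generated config file can be inspected and adjusted before running anything. The path and the exact key layout below are assumptions (only config["params"]["step"], raw_data, and regridded_data are referenced on this page), so check the generated config.json and README for the real schema.

    import json
    from pathlib import Path

    # Hypothetical location of the generated config inside the case forcing folder.
    config_path = Path("case_input_dir/forcing/config.json")

    config = json.loads(config_path.read_text())

    # Chunk size in days for the piecewise download (this page states a default of 5).
    config["params"]["step"] = 10

    config_path.write_text(json.dumps(config, indent=2))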

Scripts

  1. get_data_piecewise – Retrieves raw, unprocessed data in chunks (chunk size set by config["params"]["step"]) and saves it to config["raw_data"]; the chunking idea is sketched after this list.

  2. regrid_data_piecewise – Processes the raw data and stores it in config["regridded_data"].

  3. merge_piecewise_dataset – Combines regridded data into the final dataset for model input.
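
The chunking idea behind get_data_piecewise can be pictured as splitting the full OBC date range into step-sized windows that are fetched, regridded, and finally merged one at a time. The snippet below only illustrates that idea; it is not the script's actual code.

    from datetime import date, timedelta

    # Illustration only: split a date range into step-sized chunks
    # (step corresponds to config["params"]["step"], default 5 days).
    start, end, step = date(2020, 1, 1), date(2020, 3, 1), timedelta(days=5)

    chunks = []
    chunk_start = start
    while chunk_start < end:
        chunk_end = min(chunk_start + step, end)
        chunks.append((chunk_start, chunk_end))
        chunk_start = chunk_end

    print(chunks[:3])  # first few 5-day windows to download and regrid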

How to Use

  1. Identify and allocate available computing resources.

  2. Adjust the step parameter in config.json to match resource constraints (default: 5 days).

  3. Run each step manually or use driver.py as a guide.
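
For example, a minimal way to launch the full workflow from the generated forcing folder might look like this. The folder path is a hypothetical placeholder; driver.py is the entry point named above.

    import subprocess
    from pathlib import Path

    # Hypothetical path to the forcing folder inside the case input directory.
    forcing_dir = Path("case_input_dir/forcing")

    # Run the driver, which executes all workflow scripts in order.
    subprocess.run(["python", "driver.py"], cwd=forcing_dir, check=True)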

Check out the demo in CrocoGallery under features/add_data_products.ipynb.