.. _Cumulus2_Overview:

Cumulus
=======

System Overview
^^^^^^^^^^^^^^^

Cumulus is a midrange AMD-based system that consists of 16,384 processing cores with a
4-petabyte General Parallel File System (GPFS). It is used for Large-Eddy Simulation (LES)
ARM Symbiotic Simulation and Observation (LASSO) development and operation, radar data
processing, large-scale reprocessing, value-added product generation, data quality analysis,
and a variety of ARM-approved science projects. The system has (2) external login nodes and
(128) compute nodes.

.. list-table:: Architecture
   :widths: 15 10 60 10
   :header-rows: 1

   * - Node type
     - # Nodes
     - Compute
     - Memory
   * - Standard
     - 112
     - 2 x AMD Milan 7713 processors, 3 GHz (64 cores/processor), 128 cores per node
     - 256 GB
   * - High Memory
     - 16
     - 2 x AMD Milan 7713 processors, 3 GHz (64 cores/processor), 128 cores per node
     - 512 GB
   * - GPU
     - 1
     - 4 x NVIDIA A100 80 GB GPUs, 2 x AMD Milan 7713 processors, 2 GHz (64 cores/processor), 128 cores per node
     - 1 TB

Login
^^^^^

To log into Cumulus, use your XCAMS/UCAMS username and password:

``ssh -X username@cumulus.ccs.ornl.gov``

**NOTE**: If you do not have an account yet, you can follow the steps on this page: :ref:`Account Application`

Data Storage and Filesystems
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Cumulus mounts the OLCF open enclave file systems. The available filesystems are summarized
in the table below.

.. list-table::
   :widths: 25 30 25 25 25 25
   :header-rows: 1

   * - Filesystem
     - Mount Points
     - Backed Up?
     - Purged?
     - Quota?
     - Comments
   * - Home Directories (NFS)
     - /ccsopen/home/$USER
     - Yes
     - No
     - 50GB
     - User HOME for long term data
   * - Project Space (NFS)
     - /ccsopen/proj/$PROJECT_ID
     - Yes
     - No
     - 50GB
     -
   * - Parallel Scratch (GPFS)
     - | /gpfs/wolf/scratch/$USER/$PROJECT_ID
       | /gpfs/wolf/proj-shared/$PROJECT_ID
       | /gpfs/wolf/world-shared/$PROJECT_ID
     - No
     - Yes
     -
     - | User scratch space not accessible by other project members
       | Project shared space, with Read/Write access to all project members
       | Space for sharing data outside of project. Read/Write access to project members, read access for world.

Data Transfer
^^^^^^^^^^^^^

Users can use the following protocols for transferring data to or from the Cumulus cluster
(a short usage sketch follows the list):

* Secure copy (``scp``)
* Rsync (``rsync``)
* Globus transfers via Globus Online (select the ``NCCS Open DTN`` endpoint)

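
For example, ``scp`` and ``rsync`` can be run from your local machine as sketched below. The
file names and project ID are illustrative placeholders; the destination paths follow the
filesystem table above.

.. code-block:: bash

   # Copy a single file to your Cumulus home directory (paths are illustrative).
   scp results.nc username@cumulus.ccs.ornl.gov:/ccsopen/home/username/

   # Mirror a local directory into project-shared scratch space, preserving
   # timestamps/permissions and resuming interrupted transfers.
   rsync -avP ./run_output/ \
       username@cumulus.ccs.ornl.gov:/gpfs/wolf/proj-shared/PROJECT_ID/run_output/

For large transfers, Globus Online is generally preferred over ``scp`` or ``rsync``.
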
Programming Environments
^^^^^^^^^^^^^^^^^^^^^^^^

Software Modules
----------------

The software environment is managed through the **Environment Modules** tool.

.. list-table::
   :widths: 25 25
   :header-rows: 1

   * - Command
     - Description
   * - module list
     - Lists modules currently loaded in a user's environment
   * - module avail
     - Lists all available modules on a system in condensed format
   * - module avail -l
     - Lists all available modules on a system in long format
   * - module display <modulename>
     - Shows environment changes that will be made by loading a given module
   * - module load <modulename>
     - Loads a module
   * - module help <modulename>
     - Shows help for a module
   * - module swap <module1> <module2>
     - Swaps a currently loaded module for an unloaded module

After logging in, you can see the modules loaded by default with ``module list``.

To access ARM-managed software, add the following to your ``.bashrc`` or ``.bash_profile``:

``export MODULEPATH=$MODULEPATH:/sw/cirrus/ums/cli120/modulefiles/tcl/linux-rhel8-x86_64:/sw/cirrus/ums/cli120/custom/modulefiles``

Available software modules can be listed with ``module avail``:

.. code-block:: text

   ------------ /sw/cirrus/spack-envs/base/modules/spack/linux-rhel8-x86_64/openmpi/4.0.4-iooyv4p/Core ------------
   adios2/2.6.0           dftbplus/20.2.1   gromacs/2020.2-analysis  libquo/1.3.1    netcdf-cxx/4.2            parallel-netcdf/1.12.1  sundials/5.5.0      veloc/1.4
   boost/1.74.0           fftw/3.3.8-omp    gromacs/2020.2 (D)       mpip/3.5        netcdf-fortran/4.5.3 (D)  rempi/1.1.0             superlu-dist/6.4.0
   caliper/2.4.0          fftw/3.3.8        hdf5/1.10.7 (D)          nco/4.9.3 (D)   openpmd-api/0.12.0        scr/2.0.0               tau/2.30
   darshan-runtime/3.2.1  globalarrays/5.7  hypre/2.20.0             netcdf-c/4.7.4  parallel-io/2.4.4         strumpack/5.0.0         trilinos/12.14.1

   ------------ /sw/cirrus/spack-envs/base/modules/spack/linux-rhel8-x86_64/Core ------------
   bolt/1.0                ferret/7.4.4         gsl/2.5                    libfabric/1.8.0           octave/5.2.0-py3          qthreads/1.16 (D)   tasmanian/7.5 (D)
   bolt/2.0 (D)            fftw/3.3.8           gsl/2.7 (D)                libzmq/4.3.2              openblas/0.3.12-omp       r/4.0.0-py3         tau/2.30-no_omp
   boost/1.74.0            fftw/3.3.9-omp       hdf5/1.10.7                libzmq/4.3.3 (D)          openblas/0.3.12           r/4.0.0             tau/2.30.1-no_omp (D)
   boost/1.77.0 (D)        fftw/3.3.9 (D)       hpctoolkit/2020.08.03      mercurial/5.3-py3         openblas/0.3.17-omp       r/4.0.3-py3-X       tmux/3.1b
   ccache/3.7.11           gdb/9.2-py3          hpctoolkit/2021.05.15 (D)  mercurial/5.8-py3 (D)     openblas/0.3.17 (D)       r/4.0.3-py3         tmux/3.2a (D)
   ccache/4.4.2 (D)        gdb/11.1-py3 (D)     hpcviewer/2020.07          mercury/2.0.0             openmpi/3.1.4             r/4.1.0 (D)         umpire/4.1.2
   cdo/1.9.9               ghostscript/9.54.0   hpcviewer/2021.05 (D)      mercury/2.0.1 (D)         openmpi/4.0.4 (L)         raja/0.12.1         umpire/6.0.0 (D)
   cdo/1.9.10 (D)          git/2.29.0           htop/3.0.2                 nano/4.9                  openmpi/4.1.1-knem        sbt/1.1.6           upcxx/2020.3.0-py3
   cmake/3.18.4            git/2.31.1 (D)       imagemagick/7.0.8-7-py3    ncl/6.6.2-py3-parallel    openmpi/4.1.1 (D)         screen/4.8.0        upcxx/2021.3.0-py3 (D)
   cmake/3.21.3 (D)        gnupg/2.2.19         jasper/2.0.16-py3-opengl   ncl/6.6.2-py3-serial (D)  papi/6.0.0.1              spark/2.3.0         valgrind/3.15.0
   darshan-util/3.2.1      gnuplot/5.2.8-py3    jasper/2.0.16              nco/4.9.3                 paraview/5.8.1-py3-pyapi  spark/3.1.1 (D)     valgrind/3.17.0 (D)
   darshan-util/3.3.1 (D)  go/1.15.2            jasper/2.0.32-opengl       ncview/2.1.8              paraview/5.8.1-py3        subversion/1.14.0   vim/8.2.1201
   dyninst/10.2.1          go/1.17.1 (D)        jasper/2.0.32 (D)          netcdf-fortran/4.5.3      paraview/5.9.1-py3-pyapi  superlu/5.2.1       vim/8.2.2541 (D)
   dyninst/11.0.1 (D)      gotcha/1.0.3         kokkos-kernels/3.2.00      netlib-lapack/3.8.0       paraview/5.9.1 (D)        sz/2.1.11           wget/1.20.3
   emacs/27.1              grads/2.2.1-py3      kokkos/3.2.00              netlib-lapack/3.9.1 (D)   pdt/3.25.1                sz/2.1.12 (D)       wget/1.21.1 (D)
   emacs/27.2 (D)          grads/2.2.2-py3 (D)  kokkos/3.4.01 (D)          ninja/1.10.2              qthreads/1.14             tasmanian/7.3       zfp/0.5.5

   ------------ /sw/cirrus/spack-envs/base/modules/site/Core ------------
   gcc/10.3.0   intel/20.0.4   matlab/2020a   matlab/2021a (D)

   ------------ /sw/cirrus/modulefiles/core ------------
   idl/8.7.2   netcdf-wrf/0.1   python/3.7-anaconda3 (L)

   ------------ /sw/cirrus/ums/cli120/modulefiles/tcl/linux-rhel8-x86_64 ------------
   adi-idl-1.5.2-gcc-8.3.1-olc3pzp          libcds3-1.20.2-gcc-8.3.1-63coejn     libxcb-1.14-gcc-8.3.1-4xtrufa            py-adi-3.4.1-gcc-8.3.1-6n6sbr2
   autoconf-2.69-gcc-8.3.1-vhgdql6          libcds3-1.20.2-gcc-8.3.1-s75t6ms     libxcomposite-0.4.4-gcc-8.3.1-vt3adfy    py-cython-0.29.24-gcc-8.3.1-64gdxdm
   automake-1.16.5-gcc-8.3.1-6dbrwr4        libdbconn-1.12.3-gcc-8.3.1-ag52wxy   libxdmcp-1.1.2-gcc-8.3.1-curnhwx         py-numpy-1.22.2-gcc-8.3.1-ti5ujor
   compositeproto-0.4.2-gcc-8.3.1-p5zw2qc   libdsdb3-1.12.1-gcc-8.3.1-5kfxqx2    libxext-1.3.3-gcc-8.3.1-s5x334d          py-pip-21.3.1-gcc-8.3.1-xcmezev
   esmf-8.2.0-gcc-8.3.1-llwuxzs (L)         libdsproc3-2.55.0-gcc-8.3.1-3z2zwtx  libxfixes-5.0.2-gcc-8.3.1-iv6drhg        py-setuptools-59.4.0-gcc-8.3.1-hh6rl4n
   fontconfig-2.13.94-gcc-8.3.1-bibv45v     libffi-3.4.2-gcc-8.3.1-42idjbh       libxscrnsaver-1.2.2-gcc-8.3.1-dslgeju    py-wheel-0.37.0-gcc-8.3.1-6dik47c
   freetype-2.11.1-gcc-8.3.1-wh24lkd        libmsngr-1.10.2-gcc-8.3.1-tge5ev6    libxt-1.1.5-gcc-8.3.1-dt5clww            python-3.9.10-gcc-8.3.1-df3vytt
   idl-8.7.2-gcc-10.3.0-gcnbmav             libncds3-1.14.7-gcc-8.3.1-3vjjfs3    m4-1.4.19-gcc-8.3.1-r374d76              sqlite-3.37.2-gcc-8.3.1-rv7wp4n
   lassomod-1.3.2-gcc-10.3.0-qpbesmb        libsm-1.2.3-gcc-8.3.1-czia44e        matlab-runtime-R2017b-gcc-8.3.1-2jjvwmq  util-linux-uuid-2.37.4-gcc-8.3.1-zovvuhq
   lassomod-1.3.2-gcc-8.3.1-un7g2tu         libtrans-2.5.1-gcc-8.3.1-z3q3wiv     openblas-0.3.17-gcc-8.3.1-nf2mnrp        wrfout-1.0.1-gcc-8.3.1-74uji74
   libarmutils-1.14.4-gcc-8.3.1-qk27exz     libx11-1.7.0-gcc-8.3.1-crwtd3u       postgresql-14.0-gcc-8.3.1-tupsppa        wrfstat-1.0.2-gcc-8.3.1-h6q6g5f

   ------------ /sw/cirrus/ums/cli120/custom/modulefiles ------------
   lblrtm/12.1   monortm/5.2   monortm_wrapper/1.0.0

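
As a quick sketch of a typical module workflow (the module name below is taken from the listing
above; adjust versions to whatever ``module avail`` reports in your session):

.. code-block:: bash

   # Show what is loaded by default in this session.
   module list

   # Inspect a module before loading it.
   module display python/3.7-anaconda3

   # Load the Anaconda Python module from the listing above.
   module load python/3.7-anaconda3

   # Confirm the environment change.
   which python
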
Compilers
---------

The following compilers are available on Cumulus:

* Intel Composer XE
* GNU Compiler Collection

.. **NOTE**: Upon login, the default versions of the Intel compiler and associated Message Passing Interface (MPI) libraries
   are added to each user's environment through a programming environment module.
   Users do not need to make any environment changes to use the default version of Intel and MPI.

Compiler Environments
---------------------

If a different compiler is required, it is important to use the correct environment for each
compiler. To aid users in pairing the correct compiler and environment, compiler modules are
provided. The compiler modules will load the correct pairing of compiler version, message passing
libraries, and other items required to build and run. We highly recommend that the compiler
modules be used when changing compiler vendors.

The following programming environment modules are available:

* intel
* gcc

To change from the default Intel environment to the default GCC environment, unload Intel and
then load GCC:

``module unload intel``

``module load gcc``

Or, alternatively, use the swap command:

``module swap intel gcc``

Managing Compiler Versions
--------------------------

To use a specific compiler version, you must first ensure the compiler's module is loaded, and
then swap to the correct compiler version. For example, the following will configure the
environment to use the GCC compilers, then load a non-default GCC compiler version:

``module swap intel gcc``

``module swap gcc gcc/10.3.0``

Compiler Commands
-----------------

The C, C++, and Fortran compilers are invoked with the following commands:

* For the C compiler: ``cc``
* For the C++ compiler: ``CC``
* For the Fortran compiler: ``ftn``

These are compiler wrappers that automatically link in appropriate libraries (such as MPI and
math libraries) and build code that targets the compute-node processor architecture. These
wrappers should be used regardless of the underlying compiler vendor.

**NOTE**: You should not call the vendor compilers (e.g., ``icpc``, ``gcc``) directly. Commands
such as ``mpicc``, ``mpiCC``, and ``mpif90`` are not available on the system. You should use
``cc``, ``CC``, and ``ftn`` instead.

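
For example, a minimal build sketch using the wrappers; the source file names are placeholders
for your own code:

.. code-block:: bash

   # Select the GCC programming environment (optional; Intel is the default).
   module swap intel gcc

   # Build with the wrappers; MPI and math libraries are linked automatically.
   # hello.c, hello.cpp, and hello.f90 are illustrative file names.
   cc  hello.c   -o hello_c.x
   CC  hello.cpp -o hello_cxx.x
   ftn hello.f90 -o hello_f.x
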
General Guidelines
------------------

We recommend the following general guidelines for using the programming environment modules:

* Do not purge all modules; rather, use the default module environment provided at the time of login, and modify it.

Threaded Codes
--------------

When building threaded codes, you may need to take additional steps to ensure a proper build.

For Intel, use the ``-qopenmp`` option:

.. code-block:: bash

   $ cc -qopenmp test.c -o test.x
   $ export OMP_NUM_THREADS=2

For GNU, add ``-fopenmp`` to the build line:

.. code-block:: bash

   $ module swap intel gcc
   $ cc -fopenmp test.c -o test.x
   $ export OMP_NUM_THREADS=2

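
Inside a batch job, the threaded executable can then be launched with ``srun``; the node, task,
and thread counts below are illustrative only:

.. code-block:: bash

   # Run 8 MPI ranks across 2 nodes with 2 OpenMP threads per rank
   # (see the "Running Jobs" section below for batch job details).
   export OMP_NUM_THREADS=2
   srun -N 2 -n 8 -c 2 ./test.x
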
Running Jobs
^^^^^^^^^^^^

In High Performance Computing (HPC), computational work is performed by jobs. Individual jobs
produce data that lend relevant insight into grand challenges in science and engineering. As
such, the timely, efficient execution of jobs is the primary concern in the operation of any
HPC system.

A job on a commodity cluster typically comprises a few different components:

* A batch submission script.
* A binary executable.
* A set of input files for the executable.
* A set of output files created by the executable.

And the process for running a job, in general, is to:

#. Prepare executables and input files.
#. Write a batch script.
#. Submit the batch script to the batch scheduler.
#. Optionally monitor the job before and during execution.

The following sections describe in detail how to create, submit, and manage jobs for execution
on commodity clusters.

Login vs Compute Nodes
----------------------

When you log into an OLCF cluster, you are placed on a login node. Login node resources are
shared by all users of the system. Because of this, users should be mindful when performing
tasks on a login node.

Login nodes should be used for basic tasks such as file editing, code compilation, data backup,
and job submission. Login nodes should not be used for memory- or compute-intensive tasks. Users
should also limit the number of simultaneous tasks performed on the login resources. For example,
a user should not run (10) simultaneous tar processes on a login node.

**NOTE**: Compute-intensive, memory-intensive, or otherwise disruptive processes running on
login nodes may be killed without warning.

Batch Scheduler
---------------

Cumulus uses the Slurm batch scheduler. The following sections look at Slurm interaction in
more detail.

Writing Batch Scripts
---------------------

Batch scripts, or job submission scripts, are the mechanism by which a user configures and
submits a job for execution. A batch script is simply a shell script that also includes commands
to be interpreted by the batch scheduling software (e.g., Slurm).

Batch scripts are submitted to the batch scheduler, where they are then parsed for the scheduling
configuration options. The batch scheduler then places the script in the appropriate queue, where
it is designated as a batch job. Once the batch job makes its way through the queue, the script
will be executed on the primary compute node of the allocated resources.

Example Batch Script
--------------------

.. code-block:: bash
   :linenos:

   #!/bin/bash
   #SBATCH -A XXXYYY
   #SBATCH -J test
   #SBATCH -N 2
   #SBATCH -t 1:00:00

   cd $SLURM_SUBMIT_DIR
   date
   srun -n 8 ./a.out

**Interpreter Line**

1: This line is optional and can be used to specify a shell to interpret the script. In this
example, the bash shell will be used.

**Slurm Options**

2: The job will be charged to the "XXXYYY" project.

3: The job will be named test.

4: The job will request (2) nodes.

5: The job will request (1) hour walltime.

**Shell Commands**

6: This line is left blank, so it will be ignored.

7: This command will change the current directory to the directory from where the script was
submitted.

8: This command will run the date command.

9: This command will run (8) MPI instances of the executable a.out on the compute nodes allocated
by the batch system.

Submitting a Batch Script
-------------------------

Batch scripts can be submitted for execution using the ``sbatch`` command. For example, the
following will submit the batch script named test.slurm:

``sbatch test.slurm``

If successfully submitted, a Slurm job ID will be returned. This ID can be used to track the job.
It is also helpful in troubleshooting a failed job; make a note of the job ID for each of your
jobs in case you must contact the OLCF User Assistance Center for support.

Common Batch Options for Slurm
------------------------------

.. list-table::
   :widths: 15 30 55
   :header-rows: 1

   * - Option
     - Use
     - Description
   * - ``-A``
     - #SBATCH -A <account>
     - Causes the job time to be charged to <account>.
   * - ``-N``
     - #SBATCH -N <value>
     - Number of compute nodes to allocate. Jobs cannot request partial nodes.
   * - ``-t``
     - #SBATCH -t <time>
     - Maximum wall-clock time. <time> is given in the format HH:MM:SS.

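
Once submitted, a job can be tracked and managed with standard Slurm commands; the job ID shown
below is illustrative:

.. code-block:: bash

   # Submit the script and note the returned job ID.
   $ sbatch test.slurm
   Submitted batch job 123456

   # Check the job's state in the queue.
   $ squeue -u $USER

   # Show accounting information for the job after it starts or completes.
   $ sacct -j 123456

   # Cancel the job if needed.
   $ scancel 123456
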