with applications in parallel computing
Isaac Quintanilla Salinas
5/20/2022
You can follow this presentation at https://www.inqs.info/files/GSS/HPCC/hpcc_5-20.html
job: A task a computer must complete
bash script: a text file of commands that tells the computer what to do
Computationally Intensive
Use R and C++ (via Rcpp)
Edit any text document
Submit jobs
Upload/Download Data or Documents
Do not do any analysis in RStudio Server
sbatch: submit a job to the cluster
scancel: cancel jobs
squeue: view job status
slurm_limits: view what is available to you
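For example, a typical session looks like this (the script name comes from the example later in this talk; the job ID is a placeholder):
sbatch Cluster_Script.sh # submit the job; SLURM prints a job ID
squeue -u $USER # check the status of your jobs
scancel 123456 # cancel a job by its ID
slurm_limits # view the resources available to you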
Code on my computer at a smaller scale
Scale it up for the cluster
Have an R Script do everything for me
Save all results in an RData file
Use the try() function to catch errors (see the sketch after this list)
Use RProjects
Use a bash script to specify the parameters for the cluster
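A minimal sketch of the try() idea, reusing the model fit from the example later in this talk (sim_one is a hypothetical name):
sim_one <- function(data) {
  fit <- try(lm(y ~ x.1 + x.2 + x.3, data = data), silent = TRUE) # attempt the fit
  if (inherits(fit, "try-error")) return(NULL) # skip a failed data set instead of stopping the run
  coef(fit) # return the estimates on success
}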
Specify the parameters with #SBATCH lines, and add the Rscript command at the end of the bash script:
#!/bin/bash
#SBATCH --nodes=1 # Number of nodes; usually 1
#SBATCH --ntasks=1 # Number of tasks; usually 1
#SBATCH --cpus-per-task=8 # Number of CPUs per task
#SBATCH --mem-per-cpu=1G # RAM per CPU; usually 1 GB
#SBATCH --time=0-01:00:00 # Time limit; set based on the predicted length of the task
#SBATCH --output=my.stdout # File where standard output is written
#SBATCH --mail-user=NETID@ucr.edu # Where to email information about the job
#SBATCH --mail-type=ALL # Email on all job events (begin, end, fail)
#SBATCH --job-name="Cluster Job 1" # Name of the job; can be anything
#SBATCH -p statsdept # statsdept is the only partition Dept. of Statistics students can use
Rscript Parallel_Job.R # The command that runs your R script
# Obtain System Date and Time
date_time <- format(Sys.time(),"%Y-%m-%d-%H-%M")
# Set Working Directory
setwd("~/rwork")
# Load libraries and functions
library(parallel)
source("Fxs.R")
# Pre - Parallel Analysis
# Parallel Analysis
results <- mclapply(data, FUN, mc.cores = number_of_cores) # data, FUN, and number_of_cores are placeholders
# Post - Parallel Analysis
# Save Results
file_name <- paste("Results_", date_time, ".RData", sep = "")
save(results, file = file_name, version = 2)
For more information:
https://ucrgradstat.github.io/stat_comp/hpcc/parallel_notebook.nb.html
From the parallel R package:
mclapply()
Recommended for the cluster
Has a built-in try() function
Replace any lapply() with mclapply() and add the mc.cores= argument
parLapply()
Use if multiple nodes are involved
Use if on a Windows PC
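A sketch of both functions; data_list and slow_fun are placeholder names:
library(parallel)
res <- lapply(data_list, slow_fun) # serial baseline
res <- mclapply(data_list, slow_fun, mc.cores = 8) # forked workers; Linux/macOS only
cl <- makeCluster(8) # explicit worker pool; works on Windows and across nodes
clusterExport(cl, "slow_fun") # ship needed objects to the workers
res <- parLapply(cl, data_list, slow_fun)
stopCluster(cl) # always shut the pool down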
Identify loops or *apply functions
Identify bottlenecks
Vectorize your R code (see the sketch after this list):
Minimize loops and *apply functions
Use optimized functions such as colMeans() and rowMeans()
Implement C++ via Rcpp
More information: Advanced R (adv-r.hadley.nz)
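A small illustration of the idea: replace an explicit loop with an optimized function like colMeans():
x <- matrix(rnorm(1e6), ncol = 100) # a 10000 x 100 matrix of simulated values
m <- numeric(ncol(x)) # loop version: compute each column mean one at a time
for (j in seq_len(ncol(x))) m[j] <- mean(x[, j])
m <- colMeans(x) # vectorized version: a single call to optimized C code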
Show that Ordinary Least Squares provides consistent estimates
Model: Y = X^T β + ϵ
β = (β₀, β₁, β₂, β₃)^T = (5, 4, −5, −3)^T
X = (1, X₁, X₂, X₃)^T
ϵ ∼ N(0, 3)
Number of Data sets: 10000
Number of Observations: 200
(X₁, X₂, X₃)^T ∼ N((−2, 0, 2)^T, I₃)
I₃: 3 × 3 identity matrix
8 cores
Each core will process around 1250 data sets
## Date-Time ####
Date_Time <- format(Sys.time(),"%Y-%m-%d-%H-%M") #Used as a unique identifier
## Setting WD ####
setwd("~/rwork") # Setting working directory to rwork, where all the data is saved
## Loading R Packages ####
library(parallel)
## Functions ####
data_sim <- function(seed, nobs, beta, sigma, xmeans, xsigs){ # Simulates the data set
set.seed(seed) # Sets a seed
xrn <- cbind(rnorm(nobs, mean = xmeans[1], sd = xsigs[1,1]),
rnorm(nobs, mean = xmeans[2], sd = xsigs[2,2]),
rnorm(nobs, mean = xmeans[3], sd = xsigs[3,3])) # Simulates Predictors
xped <- cbind(rep(1,nobs),xrn) # Creating Design Matrix
y <- xped %*% beta + rnorm(nobs, 0, sigma) # Simulating Y
df <- data.frame(x=xrn, y=y) # Creating Data Frame
return(df)
}
parallel_lm <- function(data){ # Fits ordinary least squares to a data frame
lm_res <- lm(y ~ x.1 + x.2 + x.3, data = data) # Find OLS Estimates
return(list(coef=coef(lm_res), lm_results=lm_res))
}
## Parallel Parameters ####
ncores <- 8 # Number of cpus to be used
## Simulation Parameters ####
N <- 10000 # Number of Data sets
nobs <- 200 # Number of observations
beta <- c(5, 4, -5, -3) # beta parameters
xmeans <- c(-2, 0, 2) # Means for predictors
xsigs <- diag(rep(1, 3)) # Standard deviations for the predictors (diagonal entries used as sd)
sig <- 3 # Standard deviation of the error term
## Simulating Data ####
standard_data <- lapply(c(1:N), data_sim, # Using data_sim function to simulate N data sets
nobs = nobs, beta = beta, sigma = sig, # Model Parameters
xmeans = xmeans, xsigs = xsigs) # Predictor Parameters for simulation
## Obtaining Estimates ####
start <- Sys.time() # Used for timing the process
standard_results <- lapply(standard_data, parallel_lm) # Using 1 core to process the data
print("Standard lapply")
Sys.time() - start # Time it took
start <- Sys.time() # Used for timing the process
parallel_results <- mclapply(standard_data, parallel_lm, # Using multiple cores to process the data
                             mc.cores = ncores) # Setting the number of cores to use
print("mclapply")
Sys.time() - start # Time it took
## Extracting Betas ####
standard_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values
parallel_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values
for (i in 1:N){
  standard_beta[i, ] <- standard_results[[i]]$coef # Extracting coefficients from lapply
}
for (i in 1:N){
  parallel_beta[i, ] <- parallel_results[[i]]$coef # Extracting coefficients from mclapply
}
## Average Results ####
print("From Standard lapply")
colMeans(standard_beta)
print("From mclapply")
colMeans(parallel_beta)
## Saving Results ####
standard_save <- list(lm_res = standard_results, betas = standard_beta) # Creating a list of results from lapply
parallel_save <- list(lm_res = parallel_results, betas = parallel_beta) # Creating a list of results from mclapply
params <- list(N = N, # Creating a list of simulation parameters
nobs = nobs,
beta = beta,
xmeans = xmeans,
xsigs = xsigs,
sig = sig)
results <- list(standard = standard_save, parallel = parallel_save, data = standard_data, # Combining lists
parameters = params, Date_Time = Date_Time)
save_dir <- paste("Results_", Date_Time, ".RData", sep="") # Creating file name, contains date-time
save(results, file = save_dir, version = 2) # Saving RData file, recommend using version 2
RStudio Server
Login with credentials
Download Documents in R Console
Bash Script
Submit Job
In the Console pane, open the Terminal tab
Type: sbatch Cluster_Script.sh
Linux Commands
Read the UCR HPCC Handbook for more information.
Windows: MobaXterm or Windows Subsystem for Linux
macOS: Terminal app; install XQuartz
Linux: use any terminal app
When logging on, you will be in /rhome/netid
In Linux, ~ = /rhome/netid
ls: List all visible files
ls -a: List all files, including hidden ones
ls -l: Provides detailed information on files
ls -t: List files chronologically
ls -R: List all subdirectories recursively
pwd: Print working directory
cd: Change working directory
cd ~: Return to the home directory
cd ..: Move one directory up
cd ../../: Move two directories up
mkdir: Creates a directory
rmdir: Deletes an empty directory
rm -r: Deletes a nonempty directory
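A hypothetical session combining these commands:
cd ~ # go to the home directory (/rhome/netid)
mkdir rwork # create a directory for job files
cd rwork # move into it
ls -la # list all files, including hidden ones, with details
pwd # print the current location: /rhome/netid/rwork
cd .. # move one directory up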
Neovim is a text editor that lets you manipulate documents
3 modes:
Insert mode: change text; enter with i
Normal mode: navigate the document; return to it with ESC
Command mode: execute commands from Normal mode by typing :
:wq: Write and quit
:q, :q!: Quit (:q! discards unsaved changes)
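A typical editing session might look like this (the file name is a placeholder):
nvim Cluster_Script.sh # open the file in Neovim
# press i to enter Insert mode and edit the text
# press ESC to return to Normal mode
# type :wq to write the changes and quit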
Similar to the cp command, scp allows you to copy and paste files from the cluster to your personal computer.
Run scp from your personal computer; it will not work on the cluster.
scp requires 2 arguments: the file's location and where to paste it.
There are 2 components to the cluster-side address:
User and cluster
File path/location
Both are separated by a : (colon)
Download 1 file, or download all the files in a directory, as sketched below
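A sketch, assuming a NetID of netid, the login host cluster.hpcc.ucr.edu, and files stored in ~/rwork (all placeholders; check the HPCC manual for the exact host):
scp netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/Results.RData . # download 1 file to the current directory (.)
scp -r netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/ . # -r downloads every file in the directory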
rsync is similar to scp; however, it syncs folders instead of just copying and pasting.
The options -av represent archive and verbose, respectively.
archive means to sync all files
verbose means to print information
You need to specify the trailing /; otherwise, you will get strange results (see the sketch below)
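A sketch with the same placeholder NetID and host name as above:
rsync -av netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/ ./rwork/ # the trailing / syncs the contents of rwork into ./rwork/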
Incorporating an ssh key allows you to log on to the cluster without your PW + DUO
This can be thought of as the cluster holding a lock (your public key) that is opened by the key on your computer (your private key).
If you plan to ssh/scp/rsync to the cluster often, I highly recommend setting one up:
Windows: https://hpcc.ucr.edu/manuals/hpc_cluster/sshkeys/sshkeys_winos/
MacOS: https://hpcc.ucr.edu/manuals/hpc_cluster/sshkeys/sshkeys_macos/
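The linked guides are the authoritative steps; on macOS or Linux, the gist is roughly this (netid and the host name are placeholders):
ssh-keygen -t ed25519 # generate a key pair on YOUR computer; accept the default file location
ssh-copy-id netid@cluster.hpcc.ucr.edu # install the public key (the lock) on the cluster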