with applications in parallel computing
Isaac Quintanilla Salinas
5/20/2022
You can follow this presentation at https://www.inqs.info/files/GSS/HPCC/hpcc_5-20.html
job: A task a computer must complete
bash script: a text file of commands that tells the computer what to do
Computationally Intensive
Use R and C++ (via Rcpp)
Edit any text document
Submit jobs
Upload/Download Data or Documents
Do not do any analysis in RStudio Server
sbatch: submit a job to the cluster
scancel: cancel jobs
squeue: view job status
slurm_limits: view what is available to you
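For example, a typical session looks like this (the script name comes from the example later in this talk; the job ID is a placeholder):
sbatch Cluster_Script.sh # submit the job; SLURM prints a job ID
squeue -u $USER # check the status of your jobs
scancel 123456 # cancel a job by its ID
slurm_limits # view the resources available to you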
Code on my computer at a smaller scale
Scale it up for the cluster
Have an R Script do everything for me
Save all results in an RData file
Use the try() function to catch errors (see the sketch after this list)
Use RProjects
Use a bash script to specify the parameters for the cluster
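A minimal sketch of the try() idea, reusing the model fit from the example later in this talk (sim_one is a hypothetical name):
sim_one <- function(data) {
  fit <- try(lm(y ~ x.1 + x.2 + x.3, data = data), silent = TRUE) # attempt the fit
  if (inherits(fit, "try-error")) return(NULL) # skip a failed data set instead of stopping the run
  coef(fit) # return the estimates on success
}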
Specify the parameters with #SBATCH lines, and add the Rscript command at the end of the bash script:
#!/bin/bash
#SBATCH --nodes=1 # Number of nodes; usually 1
#SBATCH --ntasks=1 # Number of tasks; usually 1
#SBATCH --cpus-per-task=8 # Number of CPUs per task
#SBATCH --mem-per-cpu=1G # RAM per CPU; usually 1 GB
#SBATCH --time=0-01:00:00 # Time limit; set based on the predicted length of the task
#SBATCH --output=my.stdout # File where standard output is written
#SBATCH --mail-user=NETID@ucr.edu # Where to email information about the job
#SBATCH --mail-type=ALL # Email on all job events (begin, end, fail)
#SBATCH --job-name="Cluster Job 1" # Name of the job; can be anything
#SBATCH -p statsdept # statsdept is the only partition Dept. of Statistics students can use
Rscript Parallel_Job.R # The command that runs your R script
# Obtain System Date and Time
date_time <- format(Sys.time(),"%Y-%m-%d-%H-%M")
# Set Working Directory
setwd("~/rwork")
# Load libraries and functions
library(parallel)
source("Fxs.R")
# Pre - Parallel Analysis
# Parallel Analysis
results <- mclapply(data, FUN, mc.cores = number_of_cores) # data, FUN, and number_of_cores are placeholders
# Post - Parallel Analysis
# Save Results
file_name <- paste("Results_", date_time, ".RData", sep = "")
save(results, file = file_name, version = 2)
For more information:
https://ucrgradstat.github.io/stat_comp/hpcc/parallel_notebook.nb.html
From the parallel R package:
mclapply()
Recommended for the cluster
Has a built-in try() function
Replace any lapply() with mclapply() and add the mc.cores= argument
parLapply()
Use if multiple nodes are involved
Use if on a Windows PC
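A sketch of both functions; data_list and slow_fun are placeholder names:
library(parallel)
res <- lapply(data_list, slow_fun) # serial baseline
res <- mclapply(data_list, slow_fun, mc.cores = 8) # forked workers; Linux/macOS only
cl <- makeCluster(8) # explicit worker pool; works on Windows and across nodes
clusterExport(cl, "slow_fun") # ship needed objects to the workers
res <- parLapply(cl, data_list, slow_fun)
stopCluster(cl) # always shut the pool down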
Identify loops or *apply functions
Identify bottlenecks
Vectorize your R code (see the sketch after this list):
Minimize loops and *apply functions
Use optimized functions such as colMeans() and rowMeans()
Implement C++ via Rcpp
More information: Advanced R (adv-r.hadley.nz)
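A small illustration of the idea: replace an explicit loop with an optimized function like colMeans():
x <- matrix(rnorm(1e6), ncol = 100) # a 10000 x 100 matrix of simulated values
m <- numeric(ncol(x)) # loop version: compute each column mean one at a time
for (j in seq_len(ncol(x))) m[j] <- mean(x[, j])
m <- colMeans(x) # vectorized version: a single call to optimized C code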
Show that Ordinary Least Squares provides consistent estimates
Model: Y = X^T β + ϵ
β = (β₀, β₁, β₂, β₃)^T = (5, 4, −5, −3)^T
X = (1, X₁, X₂, X₃)^T
ϵ ∼ N(0, 3)
Number of Data sets: 10000
Number of Observations: 200
(X₁, X₂, X₃)^T ∼ N((−2, 0, 2)^T, I₃)
I₃: 3 × 3 identity matrix
8 cores
Each core will process around 1250 data sets
## Date-Time ####
Date_Time <- format(Sys.time(),"%Y-%m-%d-%H-%M") #Used as a unique identifier
## Setting WD ####
setwd("~/rwork") # Setting working directory to rwork, where all the data is saved
## Loading R Packages ####
library(parallel)
## Functions ####
data_sim <- function(seed, nobs, beta, sigma, xmeans, xsigs){ # Simulates the data set
set.seed(seed) # Sets a seed
xrn <- cbind(rnorm(nobs, mean = xmeans[1], sd = xsigs[1,1]),
rnorm(nobs, mean = xmeans[2], sd = xsigs[2,2]),
rnorm(nobs, mean = xmeans[3], sd = xsigs[3,3])) # Simulates Predictors
xped <- cbind(rep(1,nobs),xrn) # Creating Design Matrix
y <- xped %*% beta + rnorm(nobs, 0, sigma) # Simulating Y
df <- data.frame(x=xrn, y=y) # Creating Data Frame
return(df)
}
parallel_lm <- function(data){ # Fits ordinary least squares to a data frame
lm_res <- lm(y ~ x.1 + x.2 + x.3, data = data) # Find OLS Estimates
return(list(coef=coef(lm_res), lm_results=lm_res))
}
## Parallel Parameters ####
ncores <- 8 # Number of cpus to be used
## Simulation Parameters ####
N <- 10000 # Number of Data sets
nobs <- 200 # Number of observations
beta <- c(5, 4, -5, -3) # beta parameters
xmeans <- c(-2, 0, 2) # Means for predictors
xsigs <- diag(rep(1, 3)) # Standard deviations for the predictors (diagonal entries used as sd)
sig <- 3 # Standard deviation of the error term
## Simulating Data ####
standard_data <- lapply(c(1:N), data_sim, # Using data_sim function to simulate N data sets
nobs = nobs, beta = beta, sigma = sig, # Model Parameters
xmeans = xmeans, xsigs = xsigs) # Predictor Parameters for simulation
## Obtaining Estimates ####
start <- Sys.time() # Used for timing the process
standard_results <- lapply(standard_data, parallel_lm) # Using 1 core to process the data
print("Standard lapply")
Sys.time() - start # Time it took
start <- Sys.time() # Used for timing the process
parallel_results <- mclapply(standard_data, parallel_lm, # Using multiple cores to process the data
                             mc.cores = ncores) # Setting the number of cores to use
print("mclapply")
Sys.time() - start # Time it took
## Extracting Betas ####
standard_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values
parallel_beta <- matrix(ncol=4, nrow = N) # Creating a matrix for beta values
for (i in 1:N){
  standard_beta[i, ] <- standard_results[[i]]$coef # Extracting coefficients from lapply
}
for (i in 1:N){
  parallel_beta[i, ] <- parallel_results[[i]]$coef # Extracting coefficients from mclapply
}
## Average Results ####
print("From Standard lapply")
colMeans(standard_beta)
print("From mclapply")
colMeans(parallel_beta)
## Saving Results ####
standard_save <- list(lm_res = standard_results, betas = standard_beta) # Creating a list of results from lapply
parallel_save <- list(lm_res = parallel_results, betas = parallel_beta) # Creating a list of results from mclapply
params <- list(N = N, # Creating a list of simulation parameters
nobs = nobs,
beta = beta,
xmeans = xmeans,
xsigs = xsigs,
sig = sig)
results <- list(standard = standard_save, parallel = parallel_save, data = standard_data, # Combining lists
parameters = params, Date_Time = Date_Time)
save_dir <- paste("Results_", Date_Time, ".RData", sep="") # Creating file name, contains date-time
save(results, file = save_dir, version = 2) # Saving RData file, recommend using version 2
RStudio Server
Login with credentials
Download Documents in R Console
Bash Script
Submit Job
In the Console pane, open the Terminal tab
Type: sbatch Cluster_Script.sh
Linux Commands
Read the UCR HPCC Handbook for more information.
Windows: MobaXterm or Windows Subsystem for Linux
macOS: Terminal app; install XQuartz
Linux: use any terminal app
When logging on, you will be in /rhome/netid
In Linux, ~ = /rhome/netid
ls: List all visible files
ls -a: List all files, including hidden ones
ls -l: Provides detailed information on files
ls -t: List files chronologically
ls -R: List all subdirectories recursively
pwd: Print working directory
cd: Change working directory
cd ~: Return to the home directory
cd ..: Move one directory up
cd ../../: Move two directories up
mkdir: Creates a directory
rmdir: Deletes an empty directory
rm -r: Deletes a nonempty directory
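A hypothetical session combining these commands:
cd ~ # go to the home directory (/rhome/netid)
mkdir rwork # create a directory for job files
cd rwork # move into it
ls -la # list all files, including hidden ones, with details
pwd # print the current location: /rhome/netid/rwork
cd .. # move one directory up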
Neovim is a text editor that lets you manipulate documents
3 modes:
Insert mode: change text; enter with i
Normal mode: navigate the document; return to it with ESC
Command mode: execute commands from Normal mode by typing :
:wq: Write and quit
:q, :q!: Quit (:q! discards unsaved changes)
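A typical editing session might look like this (the file name is a placeholder):
nvim Cluster_Script.sh # open the file in Neovim
# press i to enter Insert mode and edit the text
# press ESC to return to Normal mode
# type :wq to write the changes and quit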
Similar to the cp command, scp allows you to copy and paste files from the cluster to your personal computer.
Run scp from your personal computer; it will not work on the cluster.
scp requires 2 arguments: the file's location and where to paste it.
There are 2 components to the cluster-side address:
User and cluster
File path/location
Both are separated by a : (colon)
Download 1 file, or download all the files in a directory, as sketched below
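A sketch, assuming a NetID of netid, the login host cluster.hpcc.ucr.edu, and files stored in ~/rwork (all placeholders; check the HPCC manual for the exact host):
scp netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/Results.RData . # download 1 file to the current directory (.)
scp -r netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/ . # -r downloads every file in the directory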
rsync is similar to scp; however, it syncs folders instead of just copying and pasting.
The options -av represent archive and verbose, respectively.
archive means to sync all files
verbose means to print information
You need to specify the trailing /; otherwise, you will get strange results (see the sketch below)
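A sketch with the same placeholder NetID and host name as above:
rsync -av netid@cluster.hpcc.ucr.edu:/rhome/netid/rwork/ ./rwork/ # the trailing / syncs the contents of rwork into ./rwork/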
Incorporating an ssh key allows you to log on to the cluster without your PW + DUO
This can be thought of as the cluster holding a lock (your public key) that is opened by the key on your computer (your private key).
If you plan to ssh/scp/rsync to the cluster often, I highly recommend setting one up:
Windows: https://hpcc.ucr.edu/manuals/hpc_cluster/sshkeys/sshkeys_winos/
MacOS: https://hpcc.ucr.edu/manuals/hpc_cluster/sshkeys/sshkeys_macos/
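The linked guides are the authoritative steps; on macOS or Linux, the gist is roughly this (netid and the host name are placeholders):
ssh-keygen -t ed25519 # generate a key pair on YOUR computer; accept the default file location
ssh-copy-id netid@cluster.hpcc.ucr.edu # install the public key (the lock) on the cluster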