class: center, middle, inverse, title-slide # Data Manipulation ### Isaac Quintanilla ### 2021-06-10 --- ## Presentation This presentation is available at: [gitlab.com/iquintanilla/hiss](https://gitlab.com/iquintanilla/hiss) --- ## dplyr - Known as the Grammar of Data Manipulation - dplyr.tidyverse.org --- ## dplyr Functions - `mutate()` adds new variables - `select()` selects variables - `filter()` filters data - `if_else()` conditional function that returns 2 values - `group_by()` a dataset is grouped by factors - `summarise()` provides summaries of data --- ## tidyr - Used to create tidy data - tidyr.tidyverse.org --- ## tidyr Functions - `pivot_longer()` (formerly `gather()`) transforms the data from wide to long - `pivot_wider()` (formerly `spread()`) transforms the data from long to wide - `separate()` seperates a one variable to multiple variables - `unite()` merge multiple variable to one variable --- ## Pipe Operator `%>%` - The pipe operator is the real power of tidyverse. - It takes the output of a function and uses it as input for another function. - Tidyverse works best when data frames (tibbles) are used a inputs. --- ## `mutate()` - Adds a new variable to a data frame - Example: ```r mtcars %>% mutate(log_mpg=log(mpg)) %>% head() ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb log_mpg ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.044522 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.044522 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.126761 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.063391 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.928524 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2.895912 ``` --- ## `mutate()` - Each argument is a new variable added - Example: ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% head() ``` ``` ## mpg cyl disp hp drat wt qsec vs am gear carb log_mpg ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 3.044522 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 3.044522 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 3.126761 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 3.063391 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 2.928524 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 2.895912 ## log_hp ## Mazda RX4 4.700480 ## Mazda RX4 Wag 4.700480 ## Datsun 710 4.532599 ## Hornet 4 Drive 4.700480 ## Hornet Sportabout 5.164786 ## Valiant 4.653960 ``` --- ## `select()` -This selects the variables to keep in the data frame -Example: ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% select(mpg,log_mpg,hp,log_hp) %>% head() ``` ``` ## mpg log_mpg hp log_hp ## Mazda RX4 21.0 3.044522 110 4.700480 ## Mazda RX4 Wag 21.0 3.044522 110 4.700480 ## Datsun 710 22.8 3.126761 93 4.532599 ## Hornet 4 Drive 21.4 3.063391 110 4.700480 ## Hornet Sportabout 18.7 2.928524 175 5.164786 ## Valiant 18.1 2.895912 105 4.653960 ``` --- ## `filter()` - Selects observations that satisfy a condition - Example: ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% select(mpg,log_mpg,hp,log_hp) %>% filter(log_hp<5) %>% head() ``` ``` ## mpg log_mpg hp log_hp ## Mazda RX4 21.0 3.044522 110 4.700480 ## Mazda RX4 Wag 21.0 3.044522 110 4.700480 ## Datsun 710 22.8 3.126761 93 4.532599 ## Hornet 4 Drive 21.4 3.063391 110 4.700480 ## Valiant 18.1 2.895912 105 4.653960 ## Merc 240D 24.4 3.194583 62 4.127134 ``` --- ## `if_else()` - A function that provides T (1) if the condition is met and F (0) otherwise - Example: ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% select(mpg,log_mpg,hp,log_hp) %>% filter(log_hp<5) %>% mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>% head() ``` ``` ## mpg log_mpg hp log_hp hilhp ## Mazda RX4 21.0 3.044522 110 4.700480 1 ## Mazda RX4 Wag 21.0 3.044522 110 4.700480 1 ## Datsun 710 22.8 3.126761 93 4.532599 1 ## Hornet 4 Drive 21.4 3.063391 110 4.700480 1 ## Valiant 18.1 2.895912 105 4.653960 1 ## Merc 240D 24.4 3.194583 62 4.127134 0 ``` --- ## `group_by()` - This groups the data frame - Example: ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% select(mpg,log_mpg,hp,log_hp) %>% filter(log_hp<5) %>% mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>% group_by(hilhp) %>% head() ``` ``` ## # A tibble: 6 x 5 ## # Groups: hilhp [2] ## mpg log_mpg hp log_hp hilhp ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 21 3.04 110 4.70 1 ## 2 21 3.04 110 4.70 1 ## 3 22.8 3.13 93 4.53 1 ## 4 21.4 3.06 110 4.70 1 ## 5 18.1 2.90 105 4.65 1 ## 6 24.4 3.19 62 4.13 0 ``` --- ## `summarise()` - Creates summary statistics for variables ```r mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% select(mpg,log_mpg,hp,log_hp) %>% filter(log_hp<5) %>% mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>% group_by(hilhp) %>% summarise(mean_mpg=mean(mpg),mean_lmpg=mean(log_mpg), sd_mpg=sd(mpg),sd_lmpg=sd(log_mpg)) %>% head() ``` ``` ## # A tibble: 2 x 5 ## hilhp mean_mpg mean_lmpg sd_mpg sd_lmpg ## <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 0 29.7 3.38 3.85 0.133 ## 2 1 22.0 3.08 3.46 0.148 ``` --- ## Long Data to Wide Data Example For the rest of this presentation we work on converting data from wide to long using the functions in the tidyr package. For many statistical analysis, long data is necessarry. --- ## Load Data Use the `read_csv()` to read 'data_3_4.csv' into an object called 'data1'; ```r data1<-read_csv(file="data_3_4.csv") ``` --- ## Wide Data ```r names(data1) ``` ``` ## [1] "ID1" "v1/mean" "v1/sd" "v1/median" "v2/mean" "v2/sd" ## [7] "v2/median" "v3/mean" "v3/sd" "v3/median" "v4/mean" "v4/sd" ## [13] "v4/median" ``` ```r data1[1,c(1:7)] ``` ``` ## # A tibble: 1 x 7 ## ID1 `v1/mean` `v1/sd` `v1/median` `v2/mean` `v2/sd` `v2/median` ## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> ## 1 Ad9131ee9 3.11 2.86 4.50 1.93 3.21 3.27 ``` --- ## Long Data ```r data1_long[c(1:4),c(1:5)] ``` ``` ## # A tibble: 4 x 5 ## ID1 time mean sd median ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 Ad9131ee9 v1 3.11 2.86 4.50 ## 2 Ad9131ee9 v2 1.93 3.21 3.27 ## 3 Ad9131ee9 v3 2.65 -0.383 3.23 ## 4 Ad9131ee9 v4 0.605 0.883 4.65 ``` --- ## `pivot_longer()` - The `pivot_longer()` function grabs the variables that repeated in an observation places them in one variable: ```r data1 %>% pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% head() ``` ``` ## # A tibble: 6 x 3 ## ID1 measurement value ## <chr> <chr> <dbl> ## 1 Ad9131ee9 v1/mean 3.11 ## 2 Ad9131ee9 v1/sd 2.86 ## 3 Ad9131ee9 v1/median 4.50 ## 4 Ad9131ee9 v2/mean 1.93 ## 5 Ad9131ee9 v2/sd 3.21 ## 6 Ad9131ee9 v2/median 3.27 ``` --- ## `separate()` - The `separate()` function will seperate a variable to multiple variables: ```r data1 %>% pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% separate(col=measurement,into=c("time","stat"),sep="/") %>% head() ``` ``` ## # A tibble: 6 x 4 ## ID1 time stat value ## <chr> <chr> <chr> <dbl> ## 1 Ad9131ee9 v1 mean 3.11 ## 2 Ad9131ee9 v1 sd 2.86 ## 3 Ad9131ee9 v1 median 4.50 ## 4 Ad9131ee9 v2 mean 1.93 ## 5 Ad9131ee9 v2 sd 3.21 ## 6 Ad9131ee9 v2 median 3.27 ``` --- ## `pivot_wider()` - The `pivot_wider()` function then converts long data to wide data. ```r data1 %>% pivot_longer(`v1/mean`:`v4/median`,"measurement","value") %>% separate(measurement,c("time","stat"),sep="/") %>% pivot_wider(names_from = stat,values_from = value) %>% head() ``` ``` ## # A tibble: 6 x 5 ## ID1 time mean sd median ## <chr> <chr> <dbl> <dbl> <dbl> ## 1 Ad9131ee9 v1 3.11 2.86 4.50 ## 2 Ad9131ee9 v2 1.93 3.21 3.27 ## 3 Ad9131ee9 v3 2.65 -0.383 3.23 ## 4 Ad9131ee9 v4 0.605 0.883 4.65 ## 5 A9c5988ea v1 2.03 2.90 2.08 ## 6 A9c5988ea v2 0.709 2.27 4.13 ``` --- ## Base R vs Tidyverse - They both achieve the same outcomes - Base R has more capabilities - Tidyverse easier to read and memorize - Example: - Create a table to showing the mean of `mpg` from the `mtcars` dataset grouped by the number of `cyl` and `gear.` --- ## Base R vs Tidyverse Use the `tapply()` function in Base R: ```r tapply(mtcars$mpg, list(mtcars$cyl,mtcars$gear), mean) ``` ``` ## 3 4 5 ## 4 21.50 26.925 28.2 ## 6 19.75 19.750 19.7 ## 8 15.05 NA 15.4 ``` --- ## Base R vs Tidyverse Use a selection `group_by()` and `summarise()` in Tidyverse: ```r mtcars %>% group_by(cyl,gear) %>% summarise(ave=mean(mpg)) ``` ``` ## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument. ``` ``` ## # A tibble: 8 x 3 ## # Groups: cyl [3] ## cyl gear ave ## <dbl> <dbl> <dbl> ## 1 4 3 21.5 ## 2 4 4 26.9 ## 3 4 5 28.2 ## 4 6 3 19.8 ## 5 6 4 19.8 ## 6 6 5 19.7 ## 7 8 3 15.0 ## 8 8 5 15.4 ``` --- ## Base R vs Tidyverse Use the `pivot_wider()` to to force one variable to multiple variables: ```r mtcars %>% group_by(cyl,gear) %>% summarise(ave=mean(mpg)) %>% pivot_wider(names_from = gear,values_from = ave) %>% rename("cyl/gear"=cyl) ``` ``` ## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument. ``` ``` ## # A tibble: 3 x 4 ## # Groups: cyl/gear [3] ## `cyl/gear` `3` `4` `5` ## <dbl> <dbl> <dbl> <dbl> ## 1 4 21.5 26.9 28.2 ## 2 6 19.8 19.8 19.7 ## 3 8 15.0 NA 15.4 ``` --- ## Tidyverse Advantage What if we want to know more summary statistics, use the `unite()` function to combine multiple variables to one variable: ```r mtcars %>% group_by(cyl,gear) %>% summarise(ave=mean(mpg),sd=round(sd(mpg),2)) %>% unite(ave,sd,col="mean_sd",sep="/") ``` ``` ## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument. ``` ``` ## # A tibble: 8 x 3 ## # Groups: cyl [3] ## cyl gear mean_sd ## <dbl> <dbl> <chr> ## 1 4 3 21.5/NA ## 2 4 4 26.925/4.81 ## 3 4 5 28.2/3.11 ## 4 6 3 19.75/2.33 ## 5 6 4 19.75/1.55 ## 6 6 5 19.7/NA ## 7 8 3 15.05/2.77 ## 8 8 5 15.4/0.57 ``` --- ```r mtcars %>% group_by(cyl,gear) %>% summarise(ave=mean(mpg),sd=round(sd(mpg),2)) %>% unite(ave,sd,col="mean_sd",sep="/") %>% pivot_wider(names_from = "gear",values_from = "mean_sd") ``` ``` ## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument. ``` ``` ## # A tibble: 3 x 4 ## # Groups: cyl [3] ## cyl `3` `4` `5` ## <dbl> <chr> <chr> <chr> ## 1 4 21.5/NA 26.925/4.81 28.2/3.11 ## 2 6 19.75/2.33 19.75/1.55 19.7/NA ## 3 8 15.05/2.77 <NA> 15.4/0.57 ``` --- ## Thank You! --- ## Resources - https://www.tidyverse.org/ - https://bookdown.org/yihui/rmarkdown/ - https://rstudio.com/ - https://slides.yihui.org/xaringan/#1 - https://remarkjs.com/#1