Data Manipulation

class: center, middle, inverse, title-slide

# Data Manipulation
### Isaac Quintanilla
### 2021-06-10

---

## Presentation

This presentation is available at:

[gitlab.com/iquintanilla/hiss](https://gitlab.com/iquintanilla/hiss)

---

## dplyr

- Known as the Grammar of Data Manipulation

- dplyr.tidyverse.org

---

## dplyr Functions

- `mutate()` adds new variables

- `select()` selects variables

- `filter()` filters data

- `if_else()` conditional function that returns 2 values

- `group_by()` a dataset is grouped by factors

- `summarise()` provides summaries of data

---

## tidyr

- Used to create tidy data

- tidyr.tidyverse.org

---

## tidyr Functions

- `pivot_longer()` (formerly `gather()`) transforms the data from wide to long

- `pivot_wider()` (formerly `spread()`) transforms the data from long to wide

- `separate()` seperates a one variable to multiple variables

- `unite()` merge multiple variable to one variable

---

## Pipe Operator `%>%`

- The pipe operator is the real power of tidyverse.

- It takes the output of a function and uses it as input for another function.

- Tidyverse works best when data frames (tibbles) are used a inputs.

---

## `mutate()`

- Adds a new variable to a data frame

- Example:

```r
mtcars %>% mutate(log_mpg=log(mpg)) %>% head()
```

```
##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb  log_mpg
## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 3.044522
## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 3.044522
## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 3.126761
## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 3.063391
## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 2.928524
## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 2.895912
```

---

## `mutate()`

- Each argument is a new variable added

- Example:

```r
mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% head()
```

---

## `select()`

-This selects the variables to keep in the data frame

-Example:

```r
mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  select(mpg,log_mpg,hp,log_hp) %>% head()
```

```
##                    mpg  log_mpg  hp   log_hp
## Mazda RX4         21.0 3.044522 110 4.700480
## Mazda RX4 Wag     21.0 3.044522 110 4.700480
## Datsun 710        22.8 3.126761  93 4.532599
## Hornet 4 Drive    21.4 3.063391 110 4.700480
## Hornet Sportabout 18.7 2.928524 175 5.164786
## Valiant           18.1 2.895912 105 4.653960
```

---

## `filter()`

- Selects observations that satisfy a condition

- Example:

```r
mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  select(mpg,log_mpg,hp,log_hp) %>% 
  filter(log_hp<5) %>% head()
```

```
##                 mpg  log_mpg  hp   log_hp
## Mazda RX4      21.0 3.044522 110 4.700480
## Mazda RX4 Wag  21.0 3.044522 110 4.700480
## Datsun 710     22.8 3.126761  93 4.532599
## Hornet 4 Drive 21.4 3.063391 110 4.700480
## Valiant        18.1 2.895912 105 4.653960
## Merc 240D      24.4 3.194583  62 4.127134
```

---

## `if_else()`

- A function that provides T (1) if the condition is met and F (0) otherwise

- Example:

```r
mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  select(mpg,log_mpg,hp,log_hp) %>% 
  filter(log_hp<5) %>% 
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>% head()
```

```
##                 mpg  log_mpg  hp   log_hp hilhp
## Mazda RX4      21.0 3.044522 110 4.700480     1
## Mazda RX4 Wag  21.0 3.044522 110 4.700480     1
## Datsun 710     22.8 3.126761  93 4.532599     1
## Hornet 4 Drive 21.4 3.063391 110 4.700480     1
## Valiant        18.1 2.895912 105 4.653960     1
## Merc 240D      24.4 3.194583  62 4.127134     0
```

---

## `group_by()`

- This groups the data frame

- Example:

```
## # A tibble: 6 x 5
## # Groups:   hilhp [2]
##     mpg log_mpg    hp log_hp hilhp
##   <dbl>   <dbl> <dbl>  <dbl> <dbl>
## 1  21      3.04   110   4.70     1
## 2  21      3.04   110   4.70     1
## 3  22.8    3.13    93   4.53     1
## 4  21.4    3.06   110   4.70     1
## 5  18.1    2.90   105   4.65     1
## 6  24.4    3.19    62   4.13     0
```

---

## `summarise()`

- Creates summary statistics for variables

```r
mtcars %>% mutate(log_mpg=log(mpg),log_hp=log(hp)) %>% 
  select(mpg,log_mpg,hp,log_hp) %>% 
  filter(log_hp<5) %>% 
  mutate(hilhp=if_else(log_hp>mean(log_hp),1,0)) %>% 
  group_by(hilhp) %>% 
  summarise(mean_mpg=mean(mpg),mean_lmpg=mean(log_mpg),
            sd_mpg=sd(mpg),sd_lmpg=sd(log_mpg)) %>% head()
```

```
## # A tibble: 2 x 5
##   hilhp mean_mpg mean_lmpg sd_mpg sd_lmpg
##   <dbl>    <dbl>     <dbl>  <dbl>   <dbl>
## 1     0     29.7      3.38   3.85   0.133
## 2     1     22.0      3.08   3.46   0.148
```

---

## Long Data to Wide Data Example

For the rest of this presentation we work on converting data from wide to long using the functions in the tidyr package. For many statistical analysis, long data is necessarry.

---

## Load Data

Use the `read_csv()` to read 'data_3_4.csv' into an object called 'data1';

```r
data1<-read_csv(file="data_3_4.csv")
```

---

## Wide Data

```r
names(data1)
```

```
##  [1] "ID1"       "v1/mean"   "v1/sd"     "v1/median" "v2/mean"   "v2/sd"    
##  [7] "v2/median" "v3/mean"   "v3/sd"     "v3/median" "v4/mean"   "v4/sd"    
## [13] "v4/median"
```

```r
data1[1,c(1:7)]
```

```
## # A tibble: 1 x 7
##   ID1       `v1/mean` `v1/sd` `v1/median` `v2/mean` `v2/sd` `v2/median`
##   <chr>         <dbl>   <dbl>       <dbl>     <dbl>   <dbl>       <dbl>
## 1 Ad9131ee9      3.11    2.86        4.50      1.93    3.21        3.27
```

---

## Long Data

```r
data1_long[c(1:4),c(1:5)]
```

```
## # A tibble: 4 x 5
##   ID1       time   mean     sd median
##   <chr>     <chr> <dbl>  <dbl>  <dbl>
## 1 Ad9131ee9 v1    3.11   2.86    4.50
## 2 Ad9131ee9 v2    1.93   3.21    3.27
## 3 Ad9131ee9 v3    2.65  -0.383   3.23
## 4 Ad9131ee9 v4    0.605  0.883   4.65
```

---

## `pivot_longer()`

- The `pivot_longer()` function grabs the variables that repeated in an observation places them in one variable:

```r
data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  head()
```

```
## # A tibble: 6 x 3
##   ID1       measurement value
##   <chr>     <chr>       <dbl>
## 1 Ad9131ee9 v1/mean      3.11
## 2 Ad9131ee9 v1/sd        2.86
## 3 Ad9131ee9 v1/median    4.50
## 4 Ad9131ee9 v2/mean      1.93
## 5 Ad9131ee9 v2/sd        3.21
## 6 Ad9131ee9 v2/median    3.27
```

---

## `separate()`

- The `separate()` function will seperate a variable to multiple variables:

```r
data1 %>% 
  pivot_longer(cols=`v1/mean`:`v4/median`,names_to = "measurement",values_to = "value") %>% 
  separate(col=measurement,into=c("time","stat"),sep="/") %>% 
  head()
```

```
## # A tibble: 6 x 4
##   ID1       time  stat   value
##   <chr>     <chr> <chr>  <dbl>
## 1 Ad9131ee9 v1    mean    3.11
## 2 Ad9131ee9 v1    sd      2.86
## 3 Ad9131ee9 v1    median  4.50
## 4 Ad9131ee9 v2    mean    1.93
## 5 Ad9131ee9 v2    sd      3.21
## 6 Ad9131ee9 v2    median  3.27
```

---

## `pivot_wider()`

- The `pivot_wider()` function then converts long data to wide data.

```r
data1 %>% 
  pivot_longer(`v1/mean`:`v4/median`,"measurement","value") %>% 
  separate(measurement,c("time","stat"),sep="/") %>% 
  pivot_wider(names_from = stat,values_from = value) %>% 
  head()      
```

```
## # A tibble: 6 x 5
##   ID1       time   mean     sd median
##   <chr>     <chr> <dbl>  <dbl>  <dbl>
## 1 Ad9131ee9 v1    3.11   2.86    4.50
## 2 Ad9131ee9 v2    1.93   3.21    3.27
## 3 Ad9131ee9 v3    2.65  -0.383   3.23
## 4 Ad9131ee9 v4    0.605  0.883   4.65
## 5 A9c5988ea v1    2.03   2.90    2.08
## 6 A9c5988ea v2    0.709  2.27    4.13
```

---

## Base R vs Tidyverse

- They both achieve the same outcomes

- Base R has more capabilities

- Tidyverse easier to read and memorize

- Example:
  - Create a table to showing the mean of `mpg` from the `mtcars` dataset grouped by the number of `cyl` and `gear.`

---

## Base R vs Tidyverse

Use the `tapply()` function in Base R:

```r
tapply(mtcars$mpg, list(mtcars$cyl,mtcars$gear), mean)
```

```
##       3      4    5
## 4 21.50 26.925 28.2
## 6 19.75 19.750 19.7
## 8 15.05     NA 15.4
```

---

## Base R vs Tidyverse

Use a selection `group_by()` and `summarise()` in Tidyverse:

```r
mtcars %>%
  group_by(cyl,gear) %>% 
  summarise(ave=mean(mpg))
```

```
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
```

```
## # A tibble: 8 x 3
## # Groups:   cyl [3]
##     cyl  gear   ave
##   <dbl> <dbl> <dbl>
## 1     4     3  21.5
## 2     4     4  26.9
## 3     4     5  28.2
## 4     6     3  19.8
## 5     6     4  19.8
## 6     6     5  19.7
## 7     8     3  15.0
## 8     8     5  15.4
```

---

## Base R vs Tidyverse

Use the `pivot_wider()` to to force one variable to multiple variables:

```r
mtcars %>%
  group_by(cyl,gear) %>% 
  summarise(ave=mean(mpg)) %>% 
  pivot_wider(names_from = gear,values_from = ave) %>% 
  rename("cyl/gear"=cyl)
```

```
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
```

```
## # A tibble: 3 x 4
## # Groups:   cyl/gear [3]
##   `cyl/gear`   `3`   `4`   `5`
##        <dbl> <dbl> <dbl> <dbl>
## 1          4  21.5  26.9  28.2
## 2          6  19.8  19.8  19.7
## 3          8  15.0  NA    15.4
```

---

## Tidyverse Advantage

What if we want to know more summary statistics, use the `unite()` function to combine multiple variables to one variable:

```r
mtcars %>%
  group_by(cyl,gear) %>% 
  summarise(ave=mean(mpg),sd=round(sd(mpg),2)) %>% 
  unite(ave,sd,col="mean_sd",sep="/")
```

```
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
```

```
## # A tibble: 8 x 3
## # Groups:   cyl [3]
##     cyl  gear mean_sd    
##   <dbl> <dbl> <chr>      
## 1     4     3 21.5/NA    
## 2     4     4 26.925/4.81
## 3     4     5 28.2/3.11  
## 4     6     3 19.75/2.33 
## 5     6     4 19.75/1.55 
## 6     6     5 19.7/NA    
## 7     8     3 15.05/2.77 
## 8     8     5 15.4/0.57
```

---

```r
mtcars %>%
  group_by(cyl,gear) %>% 
  summarise(ave=mean(mpg),sd=round(sd(mpg),2)) %>% 
  unite(ave,sd,col="mean_sd",sep="/") %>% 
  pivot_wider(names_from = "gear",values_from = "mean_sd")
```

```
## `summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
```

```
## # A tibble: 3 x 4
## # Groups:   cyl [3]
##     cyl `3`        `4`         `5`      
##   <dbl> <chr>      <chr>       <chr>    
## 1     4 21.5/NA    26.925/4.81 28.2/3.11
## 2     6 19.75/2.33 19.75/1.55  19.7/NA  
## 3     8 15.05/2.77 <NA>        15.4/0.57
```

---

## Thank You!

---

## Resources

- https://www.tidyverse.org/

- https://bookdown.org/yihui/rmarkdown/

- https://rstudio.com/

- https://slides.yihui.org/xaringan/#1

- https://remarkjs.com/#1