Here we go over the basic syntax for obtaining basic statistics. We will use the variables mpg and cyl from the mtcars data set. To view the first few rows of the data set, use head(mtcars).
We treat mpg as a continuous variable and cyl as a categorical variable.
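As a quick sketch of that first look, the following prints the first six rows with head() and then restricts the view to the two variables used in this section. The column selection is only an illustrative choice.

```r
# mtcars ships with base R (the datasets package), so nothing needs to be loaded.
head(mtcars)                      # first six rows, all columns

# Illustrative: show only the two variables used in this section.
head(mtcars[, c("mpg", "cyl")])

# cyl is stored as a number; table() treats each distinct value as a category.
class(mtcars$cyl)
```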
23.1.1 Point Estimates
The first basic statistics you can compute are point estimates, such as means and medians. Here we will calculate these estimates.
23.1.1.1 Mean
To obtain the mean, use mean(). You only need to supply the data as x=:
mean(mtcars$mpg)
[1] 20.09062
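The mtcars data have no missing values, but real data often do. The sketch below uses a hypothetical copy of mpg with one value set to NA (the name mpg_with_na is made up for illustration) to show the na.rm= argument, and then checks the mean against its definition as the sum divided by the number of observations.

```r
# Hypothetical copy of mpg with one value set to missing, for illustration only.
mpg_with_na <- mtcars$mpg
mpg_with_na[1] <- NA

mean(mpg_with_na)                 # NA, because one value is missing
mean(mpg_with_na, na.rm = TRUE)   # drops the NA before averaging

# The mean is the sum divided by the number of observations.
manual_mean <- sum(mtcars$mpg) / length(mtcars$mpg)
all.equal(manual_mean, mean(mtcars$mpg))
```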
23.1.1.2 Median
To obtain the median, use median(). You only need to supply the data as x=:
median(mtcars$mpg)
[1] 19.2
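To see why one might report the median alongside the mean, here is a small sketch with made-up numbers (not from mtcars) in which a single extreme value pulls the mean but leaves the median unchanged.

```r
# Made-up values with one extreme observation, for illustration only.
x <- c(1, 2, 3, 4, 100)

mean(x)     # 22: pulled upward by the value 100
median(x)   # 3: the middle value, unaffected by the extreme
```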
23.1.1.3 Frequency
To obtain a frequency table, use table(). You only need to supply the data as the first argument:
table(mtcars$cyl)
 4  6  8
11  7 14
23.1.1.4 Proportion
To obtain the proportions for a frequency table, use prop.table(). Its first argument must be the result of table(), so call table() inside prop.table():
prop.table(table(mtcars$cyl))
      4       6       8
0.34375 0.21875 0.43750
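The relationship between counts and proportions can be sketched as follows. The object names cyl_counts and cyl_props are illustrative; the point is that each proportion is a count divided by the total, so the proportions sum to 1.

```r
cyl_counts <- table(mtcars$cyl)        # counts per cylinder group
cyl_props  <- prop.table(cyl_counts)   # each count divided by the total

# Same result computed by hand.
cyl_counts / sum(cyl_counts)

sum(cyl_props)     # proportions sum to 1
100 * cyl_props    # the same values expressed as percentages
```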
23.1.2 Variability
In addition to point estimates, measures of variability are important to report because they tell the reader how spread out the data are. Here we will calculate several variability statistics.
23.1.2.1 Variance
To obtain the variance, use var(). You only need to supply the data as x=:
var(mtcars$mpg)
[1] 36.3241
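For reference, var() computes the sample variance, which divides the sum of squared deviations from the mean by n - 1. A minimal check of that formula (manual_var is an illustrative name):

```r
x <- mtcars$mpg
n <- length(x)

# Sample variance: squared deviations from the mean, divided by n - 1.
manual_var <- sum((x - mean(x))^2) / (n - 1)

all.equal(manual_var, var(x))
```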
23.1.2.2 Standard deviation
To obtain the standard deviation, use sd(). You only need to supply the data as x=:
sd(mtcars$mpg)
[1] 6.026948
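The standard deviation is the square root of the variance, which puts the spread back on the original scale of the data (miles per gallon here). A short sketch of that relationship:

```r
x <- mtcars$mpg

# sd() is the square root of var().
sqrt(var(x))
all.equal(sd(x), sqrt(var(x)))
```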
23.1.2.3 Max and Min
To obtain the maximum and minimum, use max() and min(), respectively. You only need to supply the data as the first argument:
max(mtcars$mpg)
[1] 33.9
min(mtcars$mpg)
[1] 10.4
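If both values are needed at once, base R's range() returns the minimum and maximum together. The sketch below also takes their difference, which is one simple measure of spread; mpg_range is an illustrative name.

```r
mpg_range <- range(mtcars$mpg)   # vector of c(minimum, maximum)
mpg_range

diff(mpg_range)                  # maximum minus minimum
```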
23.1.2.4 Q1 and Q3
To obtain Q1 and Q3, use quantile() and specify the desired quantile with probs= as a decimal. You only need to supply the data as the first argument and the probability as the second:
quantile(mtcars$mpg, .25)
25%
15.425
quantile(mtcars$mpg, .75)
75%
22.8
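Because probs= accepts a vector, both quartiles can be requested in one call. The sketch below does that and also computes the interquartile range (Q3 minus Q1) with IQR(); mpg_quartiles is an illustrative name.

```r
# Request Q1 and Q3 together by passing a vector to probs=.
mpg_quartiles <- quantile(mtcars$mpg, probs = c(0.25, 0.75))
mpg_quartiles

# Interquartile range: Q3 minus Q1.
IQR(mtcars$mpg)
```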
23.1.3 Associations
In statistics, we may be interested in how different variables are related to each other. These associations can be summarized with a numerical value.
23.1.3.1 Continuous and Continuous
When we measure the association between two continuous variables, we tend to use a correlation statistic. This statistic tells us how linearly associated the variables are with each other. Essentially, as one variable increases, what happens to the other variable? Does it increase (positive association) or does it decrease (negative association)? To find the correlation in R, use cor(). You will need to specify x= and y=, which are the vectors for each variable. Find the correlation between mpg and hp from the mtcars data set.
cor(mtcars$mpg, mtcars$hp)
[1] -0.7761684
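By default cor() computes the Pearson correlation, which is the covariance of the two variables divided by the product of their standard deviations. A minimal sketch checking that definition (r_manual is an illustrative name):

```r
x <- mtcars$mpg
y <- mtcars$hp

# Pearson correlation: covariance scaled by both standard deviations.
r_manual <- cov(x, y) / (sd(x) * sd(y))

all.equal(r_manual, cor(x, y))
```

The negative value means that cars with higher horsepower tend to have lower miles per gallon.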
23.1.3.2 Categorical and Continuous
When comparing a continuous variable across the levels of a categorical variable, reporting associations becomes a bit more nuanced. Most of the time you will discuss key differences between groups. Here, we will talk about how to get the means for different groups of data. Our continuous variable is mpg, and our categorical variable is cyl. Both are from the mtcars data set. The tapply() function splits the data into groups and then calculates a statistic for each group. We only need to specify X=, the R object to split; INDEX=, a list of factors or categories indicating how to split the data; and FUN=, the function to compute. Use tapply() to find the mean mpg for each cyl group: 4, 6, and 8.
tapply(mtcars$mpg, list(mtcars$cyl), mean)
       4        6        8
26.66364 19.74286 15.10000
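The same pattern works with any summary function passed as FUN=. As a sketch, the following computes the group medians with tapply() and then reproduces the group means with aggregate(), a base R alternative that returns a data frame; group_means is an illustrative name.

```r
# Swap in a different summary function: median mpg per cylinder group.
tapply(mtcars$mpg, mtcars$cyl, median)

# aggregate() with a formula gives the same group means as a data frame.
group_means <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
group_means
```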
23.1.3.3 Categorical and Categorical
Reporting the association between two categorical variables may be challenging. If you have a \(2\times 2\) table, you can report a ratio measure of association. However, any other case may be challenging. You can report a hypothesis test to indicate an association, but it does not provide much information about the effect of each variable. You can also report row, column, or table proportions. Here we will talk about creating cross tables and reporting these proportions. To create a cross table, use table() and supply the two categorical variables as the first two arguments. Create a cross tabulation between cyl and carb from the mtcars data set.
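A minimal sketch of that cross tabulation follows. The object name cyl_carb and the dimension labels in the second call are illustrative choices.

```r
# Cross tabulation: cyl defines the rows, carb defines the columns.
cyl_carb <- table(mtcars$cyl, mtcars$carb)
cyl_carb

# Naming the arguments labels the two dimensions in the printed table.
table(cyl = mtcars$cyl, carb = mtcars$carb)
```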
Notice that the first argument is represented in the rows and the second argument in the columns. Now create table proportions using both variables. You first need to create the table, store it in a variable, and then pass it to prop.table().
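The steps above can be sketched as follows, again using the illustrative name cyl_carb. The margin= argument of prop.table() switches between table, row, and column proportions.

```r
# Create the cross table and store it.
cyl_carb <- table(mtcars$cyl, mtcars$carb)

prop.table(cyl_carb)               # table proportions: all cells sum to 1
prop.table(cyl_carb, margin = 1)   # row proportions: each row sums to 1
prop.table(cyl_carb, margin = 2)   # column proportions: each column sums to 1
```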