How to Build and Interpret Statistical Models Using R

In R, statistical models help us interpret data and understand how the variables relate to one another. Statistical modeling is a powerful technique for identifying trends, correlations, and patterns in datasets, and the fitted models can be used to generate predictions, obtain insights, and support decision-making. In this article, we build and interpret basic and more advanced statistical models in R, which include the summary statistics of a vector, linear regression, and logistic regression.

Example 1: Create and Interpret a Simple Statistical Model in R

We start with the descriptive statistics of a numeric vector, which we obtain using the summary() function that is provided by the R language. The summary() function summarizes the vector values and prints the statistical report in the output. The code to build the vector and get its summary is given as follows:

odd_integers <- c(3, 5, 5, 7, 9, 11, 13, 15, 17, 17, 19, 21, 21, NA)
summary(odd_integers)

Here, we define a numeric vector of odd integers that also contains an “NA” value and store it in the “odd_integers” variable. After that, we call the summary() function on the vector. The summary includes the minimum, first quartile, median, mean, third quartile, and maximum of the vector. The “NA” value is left out of these calculations, and summary() reports the number of missing values separately under “NA's”.
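
The individual figures that summary() reports can also be computed one at a time, which makes the handling of “NA” explicit. The following is a minimal sketch that reuses the “odd_integers” vector; the “clean” variable is a name that is introduced here only for illustration.

```r
odd_integers <- c(3, 5, 5, 7, 9, 11, 13, 15, 17, 17, 19, 21, 21, NA)

# Count the missing values that summary() lists under "NA's"
sum(is.na(odd_integers))

# Without na.rm = TRUE, a single NA makes the mean NA
mean(odd_integers)

# Drop the missing value before computing the statistics
clean <- odd_integers[!is.na(odd_integers)]
range(clean)                             # minimum and maximum
median(clean)
mean(clean)
quantile(clean, probs = c(0.25, 0.75))   # first and third quartiles
```

Passing na.rm = TRUE to mean() or median() has the same effect as removing the missing value by hand.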

For this vector, the summary reports a minimum of 3, a first quartile of 7, a median of 13, a mean of about 12.54, a third quartile of 17, a maximum of 21, and one “NA” value.

Example 2: Create and Interpret the Linear Regression Statistical Model in R

Linear regression analysis is a statistical procedure that is frequently used to model the relationship between a response variable and a predictor variable. The response variable is the quantity that we want to explain or predict, and the predictor variable is the quantity that we use to explain it. Both variables are specified in the formula that is passed to R’s lm() function. The fitted regression can then be interpreted using the summary() function.

A <- c(221, 264, 248, 296, 228, 226, 279, 263, 222, 241)
B <- c(53, 71, 76, 31, 97, 27, 86, 82, 72, 43)
Data_Relation <- lm(B~A)
print(summary(Data_Relation))

Here, we define the “A” variable which stores a vector of three-digit integers. Similarly, we define the “B” variable which stores a vector of two-digit integers.

After that, we call the lm() function to carry out the linear regression and pass it the “B~A” formula. In a formula, the variable on the left of “~” is the response and the variable on the right is the predictor, so “B” is the response variable and “A” is the predictor variable. The fitted model is assigned to the “Data_Relation” variable which is passed to the summary() function to get the statistical overview of the linear regression. The summary() call is wrapped in R’s print() function to print its output.

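For these sample vectors, the summary shows almost no linear relation between the two variables: the estimated slope for “A” is close to zero, its p-value is large, and the R-squared value is close to zero as well.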
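
To check how strong the relation is, the relevant numbers can be read directly from the fitted object instead of from the printed report. This sketch assumes the same “A” and “B” vectors; “fit” and “fit_summary” are names that are chosen here for illustration.

```r
A <- c(221, 264, 248, 296, 228, 226, 279, 263, 222, 241)
B <- c(53, 71, 76, 31, 97, 27, 86, 82, 72, 43)

fit <- lm(B ~ A)               # B is the response, A is the predictor
fit_summary <- summary(fit)

coef(fit)                      # intercept and slope for A
fit_summary$coefficients       # estimates, standard errors, t values, p-values
fit_summary$r.squared          # share of the variance in B explained by A
cor(A, B)                      # with one predictor, cor(A, B)^2 equals R-squared
```

An R-squared value near 1 together with a small p-value for the slope points to a strong linear relation, whereas an R-squared value near 0 and a large p-value point to a weak one.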

Example 3: Create and Interpret the Logistic Regression Statistical Model in R

Next, we fit a logistic regression model which is used to predict a categorical (binary) variable. We can create the logistic regression in R using the glm() function and then interpret its statistical information via the summary() function. The glm() function takes the formula that represents the relationship between the variables, the dataset, and the family argument which is binomial for the logistic regression.

data(iris)
data <- as.data.frame(iris)
model <- glm(Species ~ Sepal.Width + Petal.Length + Petal.Width, data = data, family = binomial)
summary(model)

Here, we call as.data.frame() and pass it the iris dataset of R. Since iris is already a data frame, the call leaves it unchanged, and the result is stored in the “data” variable.
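
Before fitting the model, it can help to inspect the response column. The following short sketch uses only base R functions on the built-in iris dataset to check the type of the data and the levels of “Species”.

```r
data(iris)

is.data.frame(iris)            # iris already is a data frame
data <- as.data.frame(iris)
identical(data, iris)          # the conversion changes nothing here

levels(data$Species)           # the species names stored in the factor
table(data$Species)            # number of observations per species
```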

Next, we fit the logistic regression using the glm() function where the “Species ~ Sepal.Width + Petal.Length + Petal.Width” formula is defined. The formula indicates that “Species” is the dependent variable which depends on the “Sepal.Width”, “Petal.Length”, and “Petal.Width” predictor variables.

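Then, we pass the “data” variable that holds the data frame to the “data” argument. After that, we set the family argument to binomial which specifies that we fit a logistic regression for a binary outcome. Note that “Species” has three levels: when the response is a factor, glm() treats the first level (“setosa”) as failure and all other levels as success, so this model separates setosa from the other two species. Because the petal measurements separate setosa perfectly, R may warn that fitted probabilities numerically 0 or 1 occurred. Finally, we call the summary() function on the fitted model to print the summary of the logistic regression in the output.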
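
Because family = binomial models a two-class outcome while “Species” in iris has three levels, one option is to keep only two of the species so that the response is truly binary. The following sketch is an assumption about how that could be done and is not part of the original example; “two_species”, “binary_model”, and “probs” are hypothetical names.

```r
data(iris)

# Keep two species and drop the unused "setosa" level
two_species <- droplevels(subset(iris, Species != "setosa"))
levels(two_species$Species)    # first level is failure, second level is success

binary_model <- glm(Species ~ Sepal.Width + Petal.Length + Petal.Width,
                    data = two_species, family = binomial)
summary(binary_model)

# Fitted probabilities of the second level, "virginica"
probs <- predict(binary_model, type = "response")
head(probs)

# Exponentiated coefficients can be read as odds ratios
exp(coef(binary_model))
```

In the summary, a positive coefficient means that larger values of that predictor raise the estimated probability of the second level, and a negative coefficient lowers it.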

The output shows the summary of the fitted logistic regression, including the coefficient estimates that describe the relationship between the predictors and the binary outcome.

Example 4: Create and Interpret the Linear Regression Statistical Model to Draw a Graph in R

Now, we build a simple linear regression model, interpret its statistics using the summary() function, and draw the fitted regression line over a scatter plot of the data using the abline() function. The code to get the plot of the simple regression in R is given as follows:

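head(USArrests)
# Fit a simple linear regression of Assault on Murder
m1 <- lm(Assault ~ Murder, data = USArrests)
summary(m1)
# Plot only the two modeled columns so that the fitted line matches the axes
plot(Assault ~ Murder, data = USArrests, col = 'green', pch = '+', cex = 2)
abline(m1, col = 'blue', lwd = 2)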

Here, we call the head() function to display the first six rows of the “USArrests” dataset.

Then, we call the lm() function with its arguments and store the fitted model in the “m1” variable. First, we pass the “Assault ~ Murder” formula where the “Assault” column is the response that depends on the “Murder” column.

After that, we set the “data” argument to the “USArrests” dataset and get the summary of the regression using summary(). Once the summary is acquired, the linear regression is plotted. We call the plot() function on the “USArrests” data to render the scatter plot of green “+” symbols which are enlarged to twice the default size as “cex” is set to “2”.

Lastly, we call the abline() function which overlays the fitted regression line in blue with a line width of “2”, as “lwd” is set to the value of “2”.
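
The line that abline() draws is defined by the intercept and slope that are stored in the fitted model. A minimal sketch of reading those two numbers and using them for prediction follows; the “new_states” data frame and its “Murder” values are made up for illustration.

```r
m1 <- lm(Assault ~ Murder, data = USArrests)

# Intercept and slope that abline(m1) uses
coef(m1)

# Predicted assault rates for two hypothetical murder rates
new_states <- data.frame(Murder = c(5, 10))
predicted <- predict(m1, newdata = new_states)
predicted

# The same values computed by hand from the line equation
manual <- coef(m1)[["(Intercept)"]] + coef(m1)[["Murder"]] * new_states$Murder
manual
```

Both approaches give the same numbers, which confirms that the plotted line is simply the intercept plus the slope times the “Murder” value.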

The summary() output reports the coefficient estimates, their standard errors and p-values, and the R-squared value of the simple regression.

Then, we execute the plotting lines to render the scatter plot with the fitted regression line drawn over it.

Conclusion

The R programming language provides basic statistical computations for data exploration as well as more advanced statistics for predictive analysis by default. We learned to build statistical models and then interpret them in R. We explored different examples that include the summary statistics of a vector, linear regression, and logistic regression. Finally, we drew the linear regression line over a scatter plot after summarizing the model.

About the author

Saeed Raza
