R Programming Tutorial

Beginner to Expert Rapidly: Learn R and build your first machine learning model

1. Compiling and Running R from the Command Line

  • There are two ways to run an R program from the command line: Rscript, which is the more widely used and generally preferred option, and R CMD BATCH, which is less common. Both can be called directly from the command line or from a job scheduler.
  • You can also call these commands from a shell built into an IDE; the RStudio IDE now comes with tools that help manage Rscript and R CMD BATCH runs.
  • The source() function is a good alternative to the command line. It also runs a script, but you must already be inside the R environment to use it.
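
As a rough illustration of the options above, the sketch below writes a tiny script to a temporary file and runs it with source(). The script contents and the variable name answer are made up for this example; the commented lines show how the same file could be run from a terminal.

```r
# Write a two-line script to a temporary file (hypothetical content)
script_path <- tempfile(fileext = ".R")
writeLines(c("answer <- 6 * 7", "print(answer)"), script_path)

# Run the script from inside an R session
source(script_path)

# From a terminal, the same file could be run with either of:
#   Rscript path/to/script.R
#   R CMD BATCH path/to/script.R

# Remove the temporary file again
unlink(script_path)
```

Variables created by the sourced script, such as answer here, remain available in the session afterwards.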

2. Built-in Data Sets

  • To list the data sets that are built into R, use the data() command. Find the one you want, then pass its name to data(), as in data(datasetname).
  • The question mark (?) can be used to get help on a data set, as in ?datasetname.
  • To get an overview of a data set, use summary().
  • plot() draws a graph of the data.
  • Open the filename.R file that we saved in our myR folder. Windows users will have it on their “C” drive, Mac users on the Desktop, and Linux users in their home path.

Code:

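# Built-in datasets
data()                  # list the available data sets
data(datasetname)       # load a data set by name
?datasetname            # open its help page
datasetname             # print its contents
summary(datasetname)    # summary statistics for every column
plot(datasetname)       # plot the data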
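
To make the placeholder commands concrete, here is a small sketch that uses cars, one of the data sets shipped with R. The choice of cars is only an assumption for illustration; any built-in data set would work the same way.

```r
# Load the built-in "cars" data set (speed and stopping distance)
data(cars)

# Look at the first rows and the structure
print(head(cars))
str(cars)

# Summary statistics for every column
print(summary(cars))

# Draw a scatter plot only when running interactively,
# so the script also works in batch mode without a graphics window
if (interactive()) plot(cars)
```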

3. Data Frames

We use a data frame to store data tables. It is a list of vectors of equal length. For example, the following variable nm is a data frame containing three vectors x, y, z.


> x = c(2, 3, 5)
> y = c("aa", "bb", "cc")
> z = c(TRUE, FALSE, TRUE)
> nm = data.frame(x, y, z)       # nm is a data frame
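
The sketch below builds the same three-column data frame and shows a few ways to inspect it. The inspection steps (str(), column access, row filtering) are additions for illustration and not part of the original example.

```r
# Three vectors of equal length
x <- c(2, 3, 5)
y <- c("aa", "bb", "cc")
z <- c(TRUE, FALSE, TRUE)

# Combine them into a data frame; column names come from the variable names
nm <- data.frame(x, y, z)

str(nm)              # one column per vector, three rows
print(nm$y)          # extract a single column by name
print(nm[nm$z, ])    # keep only the rows where z is TRUE
```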

Built-in Data Frame

mtcars is a built-in data frame in R. We use it as an example here because it is easy to understand.


> mtcars
               mpg cyl disp  hp drat   wt ...
Mazda RX4     21.0   6  160 110 3.90 2.62 ...
Mazda RX4 Wag 21.0   6  160 110 3.90 2.88 ...
Datsun 710    22.8   4  108  93 3.85 2.32 ...
               ............
  • The top line of the table, called the header, contains the column names. Each horizontal line after it denotes a data row, which begins with the name of the row followed by the actual data. Each data member of a row is called a cell.
  • To retrieve the data in a cell, we enter its row and column coordinates in the single square bracket ‘[]’ operator, separated by a comma.
  • The order is essential: the row coordinate comes first, then the comma, then the column coordinate.
    The cell value in the 2nd row and 2nd column is given as:

    > mtcars[2, 2]
    [1] 6
    
  • We can also use row and column names instead of numeric coordinates:
    > mtcars["Mazda RX4 Wag", "cyl"]
    [1] 6
    
  • nrow function is used to find the number of rows in the data frame.
    > nrow(mtcars)     # number of data rows
    [1] 32
    
  • ncol function is used to find the number of columns in a data frame.
    > ncol(mtcars)      # number of columns
    [1] 11
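
The indexing rules above can be tried out in one short script. This is a minimal sketch using only base R and the built-in mtcars data; the multi-row selection is an extra illustration beyond what the text covers.

```r
# Single cell by position: row 2, column 2 (the cyl column)
print(mtcars[2, 2])

# The same cell by row name and column name
print(mtcars["Mazda RX4 Wag", "cyl"])

# Several rows and columns at once
print(mtcars[1:3, c("mpg", "cyl")])

# Number of rows and columns
print(nrow(mtcars))
print(ncol(mtcars))
```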
    

4. Loops

We use loops when we want to automate part of our code or repeat a sequence of instructions.

  • For loop in R: Suppose we want to print a message for each of several years. Without a loop, we would have to repeat the same statement for every year:
    
    
    print(paste("The year is", 2000))
    [1] "The year is 2000"
    print(paste("The year is", 2001))
    [1] "The year is 2001"
    print(paste("The year is", 2002))
    [1] "The year is 2002"
    print(paste("The year is", 2003))
    [1] "The year is 2003"
    print(paste("The year is", 2004))
    [1] "The year is 2004"
    print(paste("The year is", 2005))
    [1] "The year is 2005"
  • Instead of repeating the statement again and again, it is much easier to use a for loop, like this:
    
    
    for (year in c(2000,2001,2002,2003,2004,2005)){
      print(paste("The year is", year))
    }

    [1] "The year is 2000"
    [1] "The year is 2001"
    [1] "The year is 2002"
    [1] "The year is 2003"
    [1] "The year is 2004"
    [1] "The year is 2005"
  • While Loop:
    
    
    while (expression)
    {
    statement
    }

    If the result of expression is TRUE, the body of the loop is entered. The statements inside the loop are performed, and the flow returns to assess the expression again. The loop will repeat itself until the expression evaluates to FALSE, in which case the loop exits.

    
    
    Example of while Loop:
    i <- 0
    while (i < 5)
    {
    print(i)
    i = i + 1
    }
    Output:
    [1] 0
    [1] 1
    [1] 2
    [1] 3
    [1] 4
    i is initialized to 0 before the loop starts.
  • Here, the expression is i<5, which evaluates to TRUE since 0 is less than 5. Hence, the body of the loop is executed, and i is printed and then incremented.
  • It is important to increment i so that the condition eventually becomes FALSE; otherwise the loop would never end.
  • On the next iteration, the value of i is 1, and the loop continues.
  • The loop repeats until i equals 5. At that point the condition 5<5 evaluates to FALSE and the while loop exits.
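
Both loop types can be combined in one short script. The following is a minimal sketch; the variable names messages and total, and the idea of summing the numbers 1 to 5, are assumptions made for this illustration.

```r
# for loop over a sequence; 2000:2005 is shorthand for c(2000, ..., 2005)
for (year in 2000:2005) {
  print(paste("The year is", year))
}

# Vectorised alternative: paste() handles the whole vector at once,
# so no explicit loop is needed to build the messages
messages <- paste("The year is", 2000:2005)

# while loop that sums the integers 1 to 5
i <- 1
total <- 0
while (i <= 5) {
  total <- total + i
  i <- i + 1  # without this increment the loop would never end
}
print(total)
```

In R, many loops over vectors can be replaced by a vectorised call like the paste() line above, which is usually shorter and easier to read.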

5. Functions

To create a function, we use the function() directive. Functions are R objects of class "function".

f <- function() {
 ##some piece of instructions
 }

Notably,

  • Functions can be passed to other functions as arguments.
  • Functions can be nested, which lets you define a function inside another function.
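
The two properties above can be sketched as follows. All function names here (apply_twice, add_three, make_power, cube) are hypothetical and chosen only for this example.

```r
# A function that takes another function as an argument
apply_twice <- function(f, x) {
  f(f(x))
}

add_three <- function(x) x + 3
print(apply_twice(add_three, 1))  # (1 + 3) + 3 = 7

# A function defined inside another function
make_power <- function(n) {
  power <- function(x) x^n  # the inner function can see n
  power                     # return the inner function
}

cube <- make_power(3)
print(cube(2))       # 2^3 = 8
print(class(cube))   # "function"
```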

Function Arguments: Functions can have named arguments, which may have default values.

  • The arguments included in the function definition are the formal arguments.
  • The formals() function returns a list of all the formal arguments of a function.
  • Not every function call in R uses all the formal arguments.
  • Function arguments may have default values, or they may be missing.
    Defining a Function:
    f <- function(x, y = 1, z = 2, s = NULL) {
     }
    

    If an argument has no sensible default value, you can set its default to NULL.
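
To see how defaults, NULL and missing arguments behave, consider this sketch. The body of f and the helper g are invented for illustration; only the argument list of f comes from the definition above.

```r
# Same argument list as above, with a hypothetical body
f <- function(x, y = 1, z = 2, s = NULL) {
  total <- x + y + z
  if (!is.null(s)) total <- total * s  # s is used only when supplied
  total
}

print(names(formals(f)))  # "x" "y" "z" "s"
print(f(10))              # 10 + 1 + 2 = 13
print(f(10, z = 0))       # 10 + 1 + 0 = 11
print(f(10, s = 2))       # (10 + 1 + 2) * 2 = 26

# missing() tells you whether an argument was supplied in the call
g <- function(a, b) {
  if (missing(b)) "b was not supplied" else paste("b is", b)
}
print(g(1))
print(g(1, 2))
```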

6. Creating a logistic regression model with built-in data set

  • The glm() function is used in R to fit a logistic regression. It is similar to lm(), but glm() takes some additional parameters.
  • Its format is:
    glm(X~Z1+Z2+Z3, family=binomial(link="logit"), data=mydata)
  • X depends on the values of Z1, Z2, and Z3, which means Z1, Z2, and Z3 are the independent variables and X is the dependent variable. The function takes an extra parameter, family, whose value binomial(link="logit") means that the link function is logit and the probability distribution of the regression model is binomial.
  • Suppose a student is admitted on the basis of two exam results. The data set contains the following variables:
    1. result_1: Result-1 score
    2. result_2: Result-2 score
    3. admitted: 1 if admitted, 0 if not admitted

  • In this example, admitted takes the value 1 if a student got admission and 0 if not. We have to build a model that predicts whether a student gets admission. For this problem, admitted is the dependent variable, and result_1 and result_2 are the independent variables. The R code for the model is:
    > Model_1 <- glm(admitted ~ result_1 + result_2, family = binomial("logit"), data = data)
    
  • Suppose a student scored 65% in result-1 and 90% in result-2. To estimate the probability that this student gets admission, our R code is as below:
    > in_frame <- data.frame(result_1 = 65, result_2 = 90)
    > predict(Model_1, in_frame, type = "response")
    

    Output:
    0.9894302

  • The output above is a probability between 0 and 1. If it is less than 0.5, the prediction is FALSE, meaning the student does not get admission. If it is greater than 0.5, the prediction is TRUE, meaning the student gets admission. We can use the round() function to turn the probability into a prediction of 0 or 1.
  • The R code for that is shown below:
    > round(predict(Model_1, in_frame, type = "response"))

    Output: 1

  • The student gets admission, since the output is 1. We can predict other observations in the same manner.
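
The admission data itself is not included here, so the following sketch runs the same steps on the built-in mtcars data instead. This substitution is an assumption made purely so the code is runnable: am (0 = automatic, 1 = manual) stands in for admitted, and hp and wt stand in for result_1 and result_2. The names model_am, new_car, prob and predicted_class are hypothetical.

```r
# Fit a logistic regression on a built-in data set.
# Assumption: am plays the role of "admitted"; hp and wt play the
# roles of the two exam results.
model_am <- glm(am ~ hp + wt, family = binomial(link = "logit"), data = mtcars)

# A hypothetical new observation to score
new_car <- data.frame(hp = 120, wt = 2.8)

# Predicted probability that am equals 1
prob <- predict(model_am, new_car, type = "response")
print(prob)

# Turn the probability into a 0/1 prediction using a 0.5 cut-off
predicted_class <- as.integer(prob > 0.5)
print(predicted_class)
```

Calling summary(model_am) would additionally show the estimated coefficients and their standard errors.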

7. Using logistic regression model (scoring) with new data

  • When needed, we can save the model to a file.
  • The R code to train our model will look like this:
    the_model <- glm(my_formula, family=binomial(link='logit'),data=model_set)
  • This model can be saved with:
    save(the_model, file="filename")
  • After saving the file, you can load it again with this piece of R code:
    load(file="filename")
  • To apply the model to new data, you can use this line of code:
    model_set$pred <- predict(the_model, newdata=model_set, type='response')

NOTE:

Do not assign the result of load() to a variable in order to get the model: load() restores the saved object under its original name (here, the_model) and returns only the names of the objects it loaded. New observations do not change anything in the model. We use the saved model to make predictions about new data, not to refit it.
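
The save, load and score cycle can be sketched end to end as below. Because no admission data is available, the built-in mtcars data and the formula am ~ wt are used as stand-ins; the temporary file path and the new_data values are likewise hypothetical.

```r
# Train a model (assumption: mtcars stands in for the real training data)
the_model <- glm(am ~ wt, family = binomial(link = "logit"), data = mtcars)

# Save the model to a temporary file
model_path <- tempfile(fileext = ".RData")
save(the_model, file = model_path)

# Remove it from the session to simulate starting fresh
rm(the_model)

# load() restores the object under its original name
# and returns the names of the objects it loaded
loaded_names <- load(model_path)
print(loaded_names)

# Score new observations with the restored model
new_data <- data.frame(wt = c(2.0, 3.5))  # hypothetical new cars
new_data$pred <- predict(the_model, newdata = new_data, type = "response")
print(new_data)
```

If you would rather choose the variable name when reading the model back, saveRDS() and readRDS() are the usual alternative to save() and load().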
