R Programming for Data Science

A Complete Tutorial to learn Data Science in R from Scratch


  • Free tutorial to learn Data Science in R for beginners
  • Covers predictive modeling, data manipulation, data exploration, and machine learning algorithms in R


R is a powerful language used widely for data analysis and statistical computing. It was developed in the early 1990s. Since then, endless efforts have been made to improve R’s user interface. The journey of R from a rudimentary text editor to the interactive RStudio and, more recently, Jupyter Notebooks has engaged many data science communities across the world.

This was possible only because of generous contributions by R users worldwide. The inclusion of powerful packages has made R more and more capable over time. Packages such as dplyr, tidyr, readr, data.table, SparkR, and ggplot2 have made data manipulation, visualization, and computation much faster.

But, what about Machine Learning?

My first impression of R was that it’s just software for statistical computing. Good thing, I was wrong! R has enough provisions to implement machine learning algorithms in a fast and simple manner.

This is a complete tutorial to learn data science and machine learning using R. By the end of this tutorial, you will have good exposure to building predictive models using machine learning on your own.

Note: No prior knowledge of data science/analytics is required. However, prior knowledge of algebra and statistics will be helpful.

Table of Contents

  1. Basics of R Programming for Data Science
    • Why learn R ?
    • How to install R / R Studio ?
    • How to install R packages ?
    • Basic computations in R
  2. Essentials of R Programming
    • Data Types and Objects in R
    • Control Structures (Functions) in R
    • Useful R Packages
  3. Exploratory Data Analysis in R
    • Basic Graphs
    • Treating Missing values
    • Working with Continuous and Categorical Variables
  4. Data Manipulation in R
    • Feature Engineering
    • Label Encoding / One Hot Encoding
  5. Predictive Modeling using Machine Learning in R
    • Linear Regression
    • Decision Tree
    • Random Forest

1. Basics of R Programming

Why learn R ?

I don’t know if I have a solid reason to convince you, but let me share what got me started. I have no prior coding experience. In fact, I never had computer science among my subjects. I came to know that to learn data science, one must learn either R or Python as a starter. I chose the former. Here are some benefits I found after using R:

  1. The style of coding is quite easy.
  2. It’s open source. No need to pay any subscription charges.
  3. Instant access to over 7800 packages customized for various computation tasks.
  4. The community support is overwhelming. There are numerous forums to help you out.
  5. Get a high-performance computing experience (requires packages).
  6. One of the most sought-after skills among analytics and data science companies.

There are many more benefits. But these are the ones which have kept me going. If you think they are exciting, stick around and move on to the next section. And if you aren’t convinced, you may like Complete Python Tutorial from Scratch.

How to install R / R Studio ?

You could download and install plain R on its own. But I’d insist that you start with RStudio, which provides a much better coding experience (RStudio is an interface that runs on top of R, so R itself must be installed as well). For Windows users, RStudio is available for Windows Vista and later versions. Follow the steps below to install RStudio:

  1. Go to https://www.rstudio.com/products/rstudio/download/
  2. In the ‘Installers for Supported Platforms’ section, choose and click the RStudio installer based on your operating system. The download should begin as soon as you click.
  3. Click Next..Next..Finish.
  4. Download Complete.

To start RStudio, click on its desktop icon or use ‘search windows’ to access the program. It looks like this:


Let’s quickly understand the interface of RStudio:

R Console: This area shows the output of the code you run. You can also write code directly in the console. Code entered directly in the R console can’t be traced later. This is where the R script comes into use.

R Script: As the name suggests, here you get space to write code. To run it, simply select the line(s) of code and press Ctrl + Enter. Alternatively, you can click the little ‘Run’ button located at the top right corner of the R Script pane.

R Environment: This space displays the set of external elements added. This includes data sets, variables, vectors, functions, etc. To check whether data has been loaded properly in R, always look at this area.

Graphical Output: This space displays the graphs created during exploratory data analysis. Besides graphs, you can select packages and seek help from R’s embedded official documentation.

How to install R Packages ?

The sheer power of R lies in its incredible packages. In R, most data handling tasks can be performed in 2 ways: using R packages or using R base functions. In this tutorial, I’ll also introduce you to the most handy and powerful R packages. To install a package, simply type:

install.packages("package name")

As a first-time user, a pop-up might appear asking you to select your CRAN mirror (country server). Choose accordingly and press OK.

Note: You can type this either in console directly and press ‘Enter’ or in R script and click ‘Run’.
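As a rough sketch of how installation and loading fit together, the snippet below checks which packages are missing before installing them. The helper name missing_packages and the package list are my own illustrative choices, not part of the original tutorial; swap in the packages you actually need (for example "dplyr" or "ggplot2").

```r
# Hypothetical helper: returns the names in `pkgs` that are not installed yet
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Illustrative list; "stats" and "utils" ship with R, so nothing gets installed here
wanted <- c("stats", "utils")
to_install <- missing_packages(wanted)

# Install only what is missing
if (length(to_install) > 0) install.packages(to_install)

# Installing is done once; loading with library() is needed in every new session
library(stats)
```

Installing a package only copies it to your machine. You still have to call library() in each session before its functions become available.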

Basic Computations in R

Let’s begin with basics. To get familiar with the R coding environment, start with some basic calculations. The R console can be used as an interactive calculator too. Type the following in your console:

> 2 + 3
[1] 5

> 6 / 3
[1] 2

> (3*8)/(2*3)
[1] 4

> log(12)   #natural logarithm
[1] 2.484907

> sqrt(121)
[1] 11

Similarly, you can experiment with various combinations of calculations and get the results. In case you want to retrieve a previous calculation, this can be done in two ways. First, click in the R console and press the ‘Up / Down Arrow’ key on your keyboard. This will bring back the previously executed commands. Press Enter.

But what if you have done too many calculations? It would be too painful to scroll through every command to find it. In such situations, creating a variable is a helpful way.

In R, you can create a variable using the <- or = sign. Let’s say I want to create a variable x to compute the sum of 7 and 8. I’ll write it as:

> x <- 8 + 7
> x
[1] 15

Once we create a variable, we no longer get the output directly (like a calculator) unless we call the variable on the next line. Remember, variable names can be alphabetic or alphanumeric, but not purely numeric, and they cannot begin with a digit.
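The naming rules can be tried out with a small sketch like the one below. The variable names are arbitrary examples of mine; make.names() is a base R function that shows how R would repair an invalid name.

```r
# Both <- and = create a variable
total <- 8 + 7
doubled = total * 2

# A valid alphanumeric name: starts with a letter, may contain digits and underscores
score_2 <- 10

# "2x" starts with a digit, so it is not a valid name; make.names() repairs it
repaired <- make.names("2x")

print(c(total, doubled, score_2))
print(repaired)
```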

2. Essentials of R Programming

Understand and practice this section thoroughly. This is the building block of your R programming knowledge. If you get this right, you will face less trouble in debugging.

R has five basic or ‘atomic’ classes of objects. Wait, what is an object?

Everything you see or create in R is an object. A vector, matrix, data frame, even a variable is an object. R treats it that way. So, R has 5 basic classes of objects. This includes:

  1. Character
  2. Numeric (Real Numbers)
  3. Integer (Whole Numbers)
  4. Complex
  5. Logical (True / False)

Since these classes are self-explanatory by names, I wouldn’t elaborate on that. These classes have attributes. Think of attributes as their ‘identifier’, a name or number which aptly identifies them. An object can have following attributes:

  1. names, dimension names
  2. dimensions
  3. class
  4. length

Attributes of an object can be accessed using attributes() function. More on this coming in following section.

Let’s understand the concept of objects and attributes practically. The most basic object in R is known as a vector. You can create an empty vector using vector(). Remember, a vector contains objects of the same class.

For example: Let’s create vectors of different classes. We can also create a vector using the c() or concatenate command.

> a <- c(1.8, 4.5)   #numeric
> b <- c(1 + 2i, 3 - 6i) #complex
> d <- c(23L, 44L)   #integer (the L suffix is needed; c(23, 44) would be numeric)
> e <- vector("logical", length = 5)

Similarly, you can create vector of various classes.

Data Types in R

R has various ‘data types’, which include vectors (numeric, integer etc.), matrices, data frames and lists. Let’s understand them one by one.

Vector: As mentioned above, a vector contains objects of the same class. But you can mix objects of different classes too. When objects of different classes are mixed in a vector, coercion occurs. This causes the objects of different types to ‘convert’ into one class. For example:

> qt <- c("Time", 24, "October", TRUE, 3.33)  #character
> ab <- c(TRUE, 24) #numeric
> cd <- c(2.5, "May") #character

To check the class of any object, use the class(vector_name) function.

> class(qt)
[1] "character"

To convert the class of a vector, you can use the as. family of commands. Note that the result must be assigned back for the change to stick.

> bar <- 0:5
> class(bar)
[1] "integer"
> bar <- as.numeric(bar)   #assign the result back to change bar itself
> class(bar)
[1] "numeric"
> bar <- as.character(bar)
> class(bar)
[1] "character"

Similarly, you can change the class of any vector. But you should pay attention here. If you try to convert a “character” vector to “numeric”, every element that is not a valid number becomes NA. Hence, you should be careful when using this command.
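To see this coercion behaviour in action, here is a minimal sketch using made-up values. It converts a character vector in which only some elements are valid numbers.

```r
# Illustrative input: two valid numbers and one string that is not a number
raw <- c("12", "3.5", "abc")

# as.numeric() warns "NAs introduced by coercion"; suppressWarnings() hides that here
converted <- suppressWarnings(as.numeric(raw))
print(converted)

# Count how many values were lost in the conversion
failed <- sum(is.na(converted) & !is.na(raw))
print(failed)
```

Counting the newly created NAs right after a conversion is a cheap way to notice that some values were not numbers after all.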


List: A list is a special type of vector which can contain elements of different data types. For example:

> my_list <- list(22, "ab", TRUE, 1 + 2i)
> my_list
[[1]]
[1] 22

[[2]]
[1] "ab"

[[3]]
[1] TRUE

[[4]]
[1] 1+2i

As you can see, the output of a list is different from that of a vector. This is because all the objects are of different types. The double bracket [[1]] shows the index of the first element, and so on. Hence, you can easily extract the elements of a list using their index. Like this:

> my_list[[3]]
[1] TRUE

You can use the single bracket [] too. But that would return the list element along with its index number, instead of the result above. Like this:

> my_list[3]
[[1]]
[1] TRUE
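The difference between the two bracket styles becomes clearer when you check the class of what comes back. A small sketch, reusing the same list:

```r
my_list <- list(22, "ab", TRUE, 1 + 2i)

element <- my_list[[3]]   # double brackets return the value itself
sublist <- my_list[3]     # single brackets return a list of length one

print(class(element))
print(class(sublist))
```

In short, use [[ ]] when you want to work with the value, and [ ] when you want a smaller list.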


Matrices: When a vector is given rows and columns, i.e. a dimension attribute, it becomes a matrix. A matrix is represented by a set of rows and columns. It is a 2-dimensional data structure. It consists of elements of the same class. Let’s create a matrix of 3 rows and 2 columns:

> my_matrix <- matrix(1:6, nrow=3, ncol=2)
> my_matrix
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6

> dim(my_matrix)
[1] 3 2

> attributes(my_matrix)
$dim
[1] 3 2

As you can see, the dimensions of a matrix can be obtained using either dim() or attributes() command.  To extract a particular element from a matrix, simply use the index shown above. For example(try this at your end):

> my_matrix[,2]   #extracts second column
> my_matrix[,1]   #extracts first column
> my_matrix[2,]   #extracts second row
> my_matrix[1,]   #extracts first row
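Beyond whole rows and columns, the same [row, column] indexing can pick a single cell or a block. The sketch below rebuilds the matrix so it runs on its own; the variable names are only illustrative.

```r
my_matrix <- matrix(1:6, nrow = 3, ncol = 2)

single <- my_matrix[2, 2]      # element in row 2, column 2
block  <- my_matrix[1:2, ]     # first two rows, all columns

# drop = FALSE keeps the result as a one-column matrix instead of a plain vector
col_as_matrix <- my_matrix[, 2, drop = FALSE]

print(single)
print(dim(block))
print(dim(col_as_matrix))
```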

As an interesting fact, you can also create a matrix from a vector. All you need to do is assign dimensions with dim() later. Like this:

> age <- c(23, 44, 15, 12, 31, 16)
> age
[1] 23 44 15 12 31 16

> dim(age) <- c(2,3)
> age
     [,1] [,2] [,3]
[1,]   23   15   31
[2,]   44   12   16

> class(age)   #R 4.0 and later print "matrix" "array"
[1] "matrix"

You can also join two vectors using the cbind() and rbind() functions. But make sure that both vectors have the same number of elements. If not, R recycles the values of the shorter vector to fill the gap (with a warning when the lengths are not multiples of each other).

> x <- c(1, 2, 3, 4, 5, 6)
> y <- c(20, 30, 40, 50, 60, 70)
> cbind(x, y)
     x  y
[1,] 1 20
[2,] 2 30
[3,] 3 40
[4,] 4 50
[5,] 5 60
[6,] 6 70

> class(cbind(x, y))
[1] "matrix"

Data Frame: This is the most commonly used member of the data types family. It is used to store tabular data. It is different from a matrix. In a matrix, every element must have the same class. But in a data frame, you can put a list of vectors containing different classes. This means every column of a data frame acts like a list. Every time you read data in R, it will be stored as a data frame. Hence, it is important to understand the most commonly used commands on data frames:

> df <- data.frame(name = c("ash","jane","paul","mark"), score = c(67,56,87,91), stringsAsFactors = TRUE)   #needed on R 4.0+ to store name as a factor
> df
  name score
1  ash    67
2 jane    56
3 paul    87
4 mark    91

> dim(df)
[1] 4 2

> str(df)
'data.frame': 4 obs. of 2 variables:
 $ name : Factor w/ 4 levels "ash","jane","mark",..: 1 2 4 3
 $ score: num 67 56 87 91

> nrow(df)
[1] 4

> ncol(df)
[1] 2

Let’s understand the code above. df is the name of data frame. dim() returns the dimension of data frame as 4 rows and 2 columns. str() returns the structure of a data frame i.e. the list of variables stored in the data frame. nrow() and ncol() return the number of rows and number of columns in a data set respectively.
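Besides inspecting a data frame, you will constantly pull columns out of it and filter its rows. Here is a minimal sketch on a similar toy data frame; the name students and the pass mark of 60 are my own assumptions for illustration.

```r
students <- data.frame(name = c("ash", "jane", "paul", "mark"),
                       score = c(67, 56, 87, 91))

# $ extracts a single column as a vector
scores <- students$score

# A logical condition inside [rows, columns] keeps only the matching rows
high <- students[students$score > 60, ]

# Assigning to a new name with $ adds a column
students$passed <- students$score >= 60

print(scores)
print(high)
print(students)
```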

Here you see “name” is a factor variable and “score” is numeric. In data science, a variable can be categorized into two types: Continuous and Categorical.

Continuous variables are those which can take any value, such as 1, 2, 3.5, 4.66 and so on. Categorical variables are those which take only discrete values, such as 2, 5, 11, 15 and so on. In R, categorical values are represented by factors. In df, name is a factor variable having 4 unique levels. Factor or categorical variables are treated specially in a data set. For more explanation, click here. Similarly, you can find techniques to deal with continuous variables here.

Let’s now understand the concept of missing values in R. This is one of the most painful yet crucial parts of predictive modeling. You must be aware of all the techniques to deal with them. The complete explanation of such techniques is provided here.

Missing values in R are represented by NA and NaN. Now we’ll check if a data set has missing values (using the same data frame df).

> df[1:2,2] <- NA #injecting NA at 1st, 2nd row and 2nd column of df
> df
  name score
1  ash    NA
2 jane    NA
3 paul    87
4 mark    91

> is.na(df) #checks the entire data set for NAs and returns logical output
      name score
[1,] FALSE  TRUE
[2,] FALSE  TRUE
[3,] FALSE FALSE
[4,] FALSE FALSE

> table(is.na(df)) #returns a table of logical output

FALSE  TRUE
    6     2

> df[!complete.cases(df),] #returns the rows having missing values
  name score
1  ash    NA
2 jane    NA

Missing values hinder normal calculations in a data set. For example, let’s say we want to compute the mean of score. Since there are two missing values, it can’t be done directly. Let’s see:

> mean(df$score)
[1] NA
> mean(df$score, na.rm = TRUE)
[1] 89

The na.rm = TRUE parameter tells R to ignore the NAs and compute the mean of the remaining values in the selected column (score). To remove rows with NA values in a data frame, you can use na.omit:

> new_df <- na.omit(df)
> new_df
  name score
3 paul    87
4 mark    91
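Dropping rows throws away the other information they hold. An alternative is to fill the gaps instead. The sketch below is one simple option among many, assuming the column mean is a reasonable stand-in: it replaces the missing scores with the mean of the observed ones on a toy copy of the data.

```r
# Toy data frame with two missing scores, mirroring the example above
scores_df <- data.frame(name = c("ash", "jane", "paul", "mark"),
                        score = c(NA, NA, 87, 91))

imputed <- scores_df

# Replace each NA with the mean of the non-missing values
imputed$score[is.na(imputed$score)] <- mean(imputed$score, na.rm = TRUE)

print(imputed)
```

All four rows are kept, at the cost of two scores that are estimates rather than observations.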

Control Structures in R

As the name suggests, a control structure ‘controls’ the flow of code/commands written inside a function. A function is a set of multiple commands written to automate a repetitive coding task.

For example: You have 10 data sets. You want to find the mean of the ‘Age’ column present in every data set. This can be done in 2 ways: either you write the code to compute the mean 10 times, or you simply create a function and pass each data set to it.
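The second approach could look roughly like the sketch below. The function name mean_age and the two tiny data sets are hypothetical, standing in for the 10 data sets mentioned above.

```r
# Hypothetical function: returns the mean of the Age column, ignoring NAs
mean_age <- function(data) {
  mean(data$Age, na.rm = TRUE)
}

# Two toy data sets instead of ten, to keep the example short
datasets <- list(
  first  = data.frame(Age = c(20, 30, 40)),
  second = data.frame(Age = c(18, NA, 22))
)

# sapply() calls mean_age once per data set and collects the results
age_means <- sapply(datasets, mean_age)
print(age_means)
```

The logic lives in one place, so a change such as switching to the median only has to be made once.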

Let’s understand the control structures in R with simple examples:

if, else – This structure is used to test a condition. Below is the syntax:

if (<condition>){
         ##do something
} else {
         ##do something
}


#initialize a variable
N <- 10

#check if this variable * 5 is > 40
if (N * 5 > 40){
       print("This is easy!")
} else {
       print ("It's not easy!")
}
[1] "This is easy!"


for – This structure is used when a loop is to be executed a fixed number of times. It is commonly used for iterating over the elements of an object (list, vector). Below is the syntax:

for (<search condition>){
          #do something
}


#initialize a vector
y <- c(99,45,34,65,76,23)

#print the first 4 numbers of this vector
for(i in 1:4){
     print (y[i])
}
[1] 99
[1] 45
[1] 34
[1] 65


while – It begins by testing a condition, and executes only if the condition is found to be true. Once the loop body is executed, the condition is tested again. Hence, it’s necessary to alter the condition inside the loop so that it doesn’t run forever. Below is an example:

#initialize a condition
Age <- 12

#check if age is less than 17
while(Age < 17){
         print(Age)
         Age <- Age + 1 #this increment eventually makes the condition false and ends the loop
}
[1] 12
[1] 13
[1] 14
[1] 15
[1] 16

There are other control structures as well, but they are used less frequently than the ones explained above. Those structures are:

  1. repeat – It executes an infinite loop
  2. break – It breaks the execution of a loop
  3. next – It skips the current iteration of a loop
  4. return – It exits a function
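All four of these structures can be seen together in one small sketch. The function below is a made-up example of mine: it finds the first even number whose square exceeds a given limit.

```r
# Hypothetical example combining repeat, next, break and return
first_even_square_over <- function(limit) {
  i <- 0
  repeat {                      # loops forever unless we break out
    i <- i + 1
    if (i %% 2 != 0) next       # skip odd numbers
    if (i^2 > limit) break      # stop once the square exceeds the limit
  }
  return(i)                     # exit the function with the result
}

print(first_even_square_over(50))
```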

Note: If you find the section ‘control structures’ difficult to understand, don’t worry. R is supported by various packages to complement the work done by control structures.

Useful R Packages

Out of ~7800 packages listed on CRAN, I’ve listed some of the most powerful and commonly used packages for predictive modeling in this article. Since I’ve already explained the method of installing packages, you can go ahead and install them now. Sooner or later you’ll need them.

Importing Data: R offers a wide range of packages for importing data available in any format such as .txt, .csv, .json, .sql etc. To import large files of data quickly, it is advisable to install and use data.table, readr, RMySQL, sqldf, jsonlite.

Data Visualization: R has built-in plotting commands as well. They are good for creating simple graphs, but become complex when it comes to creating advanced graphics. Hence, you should install ggplot2.

Data Manipulation: R has a fantastic collection of packages for data manipulation. These packages allow you to do basic and advanced computations quickly. These packages are dplyr, plyr, tidyr, lubridate, stringr. Check out this complete tutorial on data manipulation packages in R.

Modeling / Machine Learning: For modeling, the caret package in R is powerful enough to cater to every need for creating a machine learning model. However, you can also install packages algorithm-wise, such as randomForest, rpart, gbm etc.

Note: I’ve only mentioned the commonly used packages. You might like to check this interesting infographic on the complete list of useful R packages.

Till here, you became familiar with the basic work style in R and its associated components. From the next section, we’ll begin with predictive modeling. But before you proceed, I want you to practice what you’ve learnt so far.

Practice Assignment: As a part of this assignment, install the ‘swirl’ package in R. Then type library(swirl) to initiate the package, and complete this interactive R tutorial. If you have followed this article thoroughly, this assignment should be an easy task for you!

3. Exploratory Data Analysis in R

From this section onwards, we’ll dive deep into various stages of predictive modeling. Hence, make sure you understand every aspect of this section. If you find anything difficult to understand, ask me in the comments section below.

Data Exploration is a crucial stage of predictive modeling. You can’t build great and practical models unless you learn to explore the data from beginning to end. This stage forms a concrete foundation for data manipulation (the very next stage). Let’s understand it in R.

In this tutorial, I’ve taken the data set from Big Mart Sales Prediction. Before we start, you must get familiar with these terms:

Response Variable (a.k.a. Dependent Variable): In a data set, the response variable (y) is the one on which we make predictions. In this case, we’ll predict ‘Item_Outlet_Sales’. (Refer to image shown below)

Predictor Variable (a.k.a. Independent Variable): In a data set, predictor variables (Xi) are those using which the prediction is made on the response variable. (Image below).


Test Data: Once the model is built, its accuracy is ‘tested’ on test data. This data usually contains fewer observations than the train data set. Also, it does not include the ‘response variable’.

Right now, you should download the data set. Take a good look at train and test data. Cross check the information shared above and then proceed.

Let’s now begin with importing and exploring data.

#working directory
path <- ".../Data/BigMartSales"

#set working directory
setwd(path)

As a beginner, I’ll advise you to keep the train and test files in your working directory to avoid unnecessary directory troubles. Once the directory is set, we can easily import the .csv files using the commands below.

#Load Datasets (stringsAsFactors = TRUE stores text columns as factors, which was the default before R 4.0)
train <- read.csv("Train_UWu5bXk.csv", stringsAsFactors = TRUE)
test <- read.csv("Test_u94Q5KV.csv", stringsAsFactors = TRUE)

In fact, even prior to loading data in R, it’s a good practice to look at the data in Excel. This helps in strategizing the complete prediction modeling process. To check if the data set has been loaded successfully, look at the R environment. The data can be seen there. Let’s explore the data quickly.

#check dimensions (number of rows & columns) in data set
> dim(train)
[1] 8523 12

> dim(test)
[1] 5681 11

We have 8523 rows and 12 columns in the train data set and 5681 rows and 11 columns in the test data set. This makes sense. Test data should always have one column less (mentioned above, right?). Let’s get deeper into the train data set now.
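To confirm which column the test data lacks, you can compare the column names directly. The sketch below uses two tiny stand-in data frames with made-up values rather than the real files; with the actual data you would pass train and test instead.

```r
# Toy stand-ins for the train and test data (values are illustrative only)
train_demo <- data.frame(Item_MRP = c(249.8, 48.3),
                         Item_Outlet_Sales = c(3735, 443))
test_demo  <- data.frame(Item_MRP = c(141.6, 182.1))

# setdiff() returns the names present in the first set but not in the second
missing_in_test <- setdiff(names(train_demo), names(test_demo))
print(missing_in_test)
```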

#check the variables and their types in train
> str(train)
'data.frame': 8523 obs. of 12 variables:
$ Item_Identifier : Factor w/ 1559 levels "DRA12","DRA24",..: 157 9 663 1122 1298 759 697 739 441 991 ...
$ Item_Weight : num 9.3 5.92 17.5 19.2 8.93 ...
$ Item_Fat_Content : Factor w/ 5 levels "LF","low fat",..: 3 5 3 5 3 5 5 3 5 5 ...
$ Item_Visibility : num 0.016 0.0193 0.0168 0 0 ...
$ Item_Type : Factor w/ 16 levels "Baking Goods",..: 5 15 11 7 10 1 14 14 6 6 ...
$ Item_MRP : num 249.8 48.3 141.6 182.1 53.9 ...
$ Outlet_Identifier : Factor w/ 10 levels "OUT010","OUT013",..: 10 4 10 1 2 4 2 6 8 3 ...
$ Outlet_Establishment_Year: int 1999 2009 1999 1998 1987 2009 1987 1985 2002 2007 ...
$ Outlet_Size : Factor w/ 4 levels "","High","Medium",..: 3 3 3 1 2 3 2 3 1 1 ...
$ Outlet_Location_Type : Factor w/ 3 levels "Tier 1","Tier 2",..: 1 3 1 3 3 3 3 3 2 2 ...
$ Outlet_Type : Factor w/ 4 levels "Grocery Store",..: 2 3 2 1 2 3 2 4 2 2 ...
$ Item_Outlet_Sales : num 3735 443 2097 732 995 ...

Let’s do some quick data exploration.

To begin with, I’ll first check if this data has missing values. This can be done by using:

> table(is.na(train))

 FALSE   TRUE
100813   1463

In the train data set, we have 1463 missing values. Let’s check the variables in which these values are missing. It’s important to find and locate these missing values. Many data scientists have repeatedly advised beginners to pay close attention to missing values in the data exploration stage.

> colSums(is.na(train))
Item_Identifier Item_Weight
0                1463
Item_Fat_Content Item_Visibility
0                 0
Item_Type         Item_MRP
0                 0
Outlet_Identifier Outlet_Establishment_Year
0                 0
Outlet_Size       Outlet_Location_Type
0                 0
Outlet_Type       Item_Outlet_Sales
0                 0

Hence, we see that column Item_Weight has 1463 missing values. Let’s get more inferences from this data.

> summary(train)

Here are some quick inferences drawn from variables in train data set:

  1. Item_Fat_Content has mismatched factor levels.
  2. The minimum value of Item_Visibility is 0. Practically, this is not possible. If an item occupies shelf space in a grocery store, it ought to have some visibility. We’ll treat all 0’s as missing values.
  3. Item_Weight has 1463 missing values (already explained above).
  4. Outlet_Size has an unnamed factor level.

These inferences will help us in treating these variables more accurately.

Graphical Representation of Variables

I’m sure you would understand these variables better when explained visually. Using graphs, we can analyze the data in 2 ways: Univariate Analysis and Bivariate Analysis.

Univariate analysis is done with one variable. Bivariate analysis is done with two variables. Univariate analysis is a lot easier to do. Hence, I’ll skip that part here. I’d recommend you try it at your end. Let’s now experiment with bivariate analysis and carve out hidden insights.

For visualization, I’ll use the ggplot2 package. These graphs will help us understand the distribution and frequency of variables in the data set.

> library(ggplot2)
> ggplot(train, aes(x= Item_Visibility, y = Item_Outlet_Sales)) + geom_point(size = 2.5, color="navy") + xlab("Item Visibility") + ylab("Item Outlet Sales") + ggtitle("Item Visibility vs Item Outlet Sales")


We can see that the majority of sales have been obtained from products having visibility less than 0.2. This suggests that Item_Visibility < 0.2 must be an important factor in determining sales. Let’s plot a few more interesting graphs and explore such hidden stories.

> ggplot(train, aes(Outlet_Identifier, Item_Outlet_Sales)) + geom_bar(stat = "identity", color = "purple") + theme_bw() + theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "black")) + ggtitle("Outlets vs Total Sales")


Here, we infer that OUT027 has contributed the majority of sales, followed by OUT035. OUT010 and OUT019 probably have the least footfall, thereby contributing the least outlet sales.

> ggplot(train, aes(Item_Type, Item_Outlet_Sales)) + geom_bar( stat = "identity") +theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "navy")) + xlab("Item Type") + ylab("Item Outlet Sales")+ggtitle("Item Type vs Sales")


From this graph, we can infer that Fruits and Vegetables contribute the highest amount of outlet sales, followed by snack foods and household products. This information can also be represented using a box plot. The benefit of using a box plot is that you get to see the outliers and the spread of the corresponding levels of a variable (shown below).

> ggplot(train, aes(Item_Type, Item_MRP)) +geom_boxplot() + theme(axis.text.x = element_text(angle = 70, vjust = 0.5, color = "red")) + xlab("Item Type") + ylab("Item MRP") + ggtitle("Item Type vs Item MRP")


The black point you see is an outlier. The mid line you see in the box is the median value of each item type. To know more about boxplots, check this tutorial.
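The numbers behind a box plot can be inspected without drawing anything. As a small sketch on made-up prices, boxplot.stats() from base R returns the five values that define the box and whiskers, plus any points flagged as outliers.

```r
# Illustrative values: four ordinary prices and one extreme one
prices <- c(1, 2, 3, 4, 100)

bp <- boxplot.stats(prices)

# bp$stats holds: lower whisker, lower hinge, median, upper hinge, upper whisker
box_median <- bp$stats[3]

# bp$out holds the points that would be drawn as separate dots
outliers <- bp$out

print(box_median)
print(outliers)
```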

Now, we have an idea of the variables and their importance for the response variable. Let’s now move back to where we started: missing values. Now we’ll impute the missing values.

We saw that the variable Item_Weight has missing values. Item_Weight is a continuous variable. Hence, in this case we can impute missing values with the mean/median of Item_Weight. These are the most commonly used methods of imputing missing values. To explore other methods of this technique, check out this tutorial.
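A median imputation of this kind could be sketched as follows. The data frame weights_demo and its values are made up for illustration; on the real data the same two lines would be applied to the Item_Weight column of the combined data set.

```r
# Toy data frame with one missing weight (values are illustrative only)
weights_demo <- data.frame(Item = c("A", "B", "C", "D"),
                           Item_Weight = c(9.3, NA, 17.5, 8.93))

# Median of the observed weights, ignoring the NA
weight_median <- median(weights_demo$Item_Weight, na.rm = TRUE)

# Fill only the missing entries with that median
weights_demo$Item_Weight[is.na(weights_demo$Item_Weight)] <- weight_median

print(weights_demo)
```

The median is often preferred over the mean here because it is less affected by a few unusually heavy or light items.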

Let’s first combine the data sets. This will save our time, as we won’t need to write separate code for the train and test data sets. To combine the two data frames, we must make sure that they have the same columns, which is not the case.

> dim(train)
[1] 8523 12

> dim(test)
[1] 5681 11

The test data set has one less column (the response variable). Let’s first add the column. We can give this column any value. An intuitive approach would be to extract the mean value of sales from the train data set and use it as a placeholder for the test variable Item_Outlet_Sales. Anyway, let’s keep it simple for now. I’ve taken the value 1. Now, we’ll combine the data sets.

> test$Item_Outlet_Sales <- 1
> combi <- rbind(train, test)

Trouble with Continuous Variables & Categorical Variables

It’s important to learn to deal with continuous and categorical variables separately in a data set. In other words, they need special attention. In this data set, we have only 3 continuous variables and the rest are categorical in nature. If you are still confused, I’ll suggest you once again look at the data set using str() and proceed.

Let’s take up Item_Visibility. In the graph above, we saw that item visibility has zero values as well, which is practically not feasible. Hence, we’ll consider them as missing values and once again make the imputation using the median.

> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0,
                           median(combi$Item_Visibility), combi$Item_Visibility) 

Let’s proceed to categorical variables now. During exploration, we saw there are mismatched levels in some variables which need to be corrected.

> levels(combi$Outlet_Size)[1] <- "Other"
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content,
c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))
> table(combi$Item_Fat_Content)
  Low Fat Regular
  9185    5019

Using the commands above, I’ve assigned the name ‘Other’ to unnamed level in Outlet_Size variable. Rest, I’ve simply renamed the various levels of Item_Fat_Content.
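If you prefer not to load plyr, the same clean-up can be done in base R by reassigning level names. This is only an alternative sketch on a toy factor of my own, not the method used above: giving several levels the same name merges them.

```r
# Toy factor with the same kind of inconsistent labels
fat <- factor(c("LF", "low fat", "Low Fat", "reg", "Regular"))

# Renaming several levels to one shared name merges them into a single level
levels(fat)[levels(fat) %in% c("LF", "low fat")] <- "Low Fat"
levels(fat)[levels(fat) == "reg"] <- "Regular"

print(table(fat))
```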

4. Data Manipulation in R

Let’s call it the advanced level of data exploration. In this section we’ll practically learn about feature engineering and other useful aspects.

Feature Engineering: This component separates an intelligent data scientist from a technically enabled data scientist. You might have access to large machines to run heavy computations and algorithms, but the power delivered by new features just can’t be matched. We create new variables to extract and provide as much ‘new’ information to the model as possible, to help it make accurate predictions.

If you have been thinking all this while, great. But now is the time to think deeper. Look at the data set and ask yourself, what else (factor) could influence Item_Outlet_Sales? Anyhow, the answer is below. But I want you to try it out first, before scrolling down.

1. Count of Outlet Identifiers – There are 10 unique outlets in this data. This variable will give us information on the count of outlets in the data set. The higher the count of an outlet, the more sales it is likely to contribute.

> library(dplyr)
> a <- combi%>%
            group_by(Outlet_Identifier)%>%
            tally()

> head(a)
Source: local data frame [6 x 2]
Outlet_Identifier n
(fctr)           (int)
1 OUT010         925
2 OUT013         1553
3 OUT017         1543
4 OUT018         1546
5 OUT019         880
6 OUT027         1559

> names(a)[2] <- "Outlet_Count"
> combi <- full_join(a, combi, by = "Outlet_Identifier")

As you can see, dplyr package makes data manipulation quite effortless. You no longer need to write long function. In the code above, I’ve simply stored the new data frame in a variable a. Later, the new column Outlet_Count is added in our original ‘combi’ data set.
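For comparison, the same count feature can be built in base R with table(). The sketch below runs on a toy data frame with made-up rows; it illustrates the idea and is not a replacement for the dplyr code above.

```r
# Toy data: six rows spread over three outlets
outlets <- data.frame(
  Outlet_Identifier = c("OUT010", "OUT013", "OUT010", "OUT027", "OUT013", "OUT010")
)

# table() counts how often each outlet appears
counts <- table(outlets$Outlet_Identifier)

# Look up each row's outlet in the table to attach its count
outlets$Outlet_Count <- as.integer(counts[as.character(outlets$Outlet_Identifier)])

print(outlets)
```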

3. Outlet Years – This variable represents how long a particular outlet had been in existence as of the year 2013. Why just 2013? You’ll find the answer in the problem statement. My hypothesis is: the older the outlet, the more the footfall, the larger the base of loyal customers and the larger the outlet sales.

> combi <- combi%>%
           mutate(Outlet_Year = 2013 - Outlet_Establishment_Year)
> head(select(combi, Outlet_Establishment_Year, Outlet_Year))
  Outlet_Establishment_Year Outlet_Year
1                      1999          14
2                      2009           4
3                      1999          14
4                      1998          15
5                      1987          26
6                      2009           4

This suggests that outlets established in 1999 were 14 years old in 2013 and so on.


4. Item Type New – Now, pay attention to Item_Identifiers. We are about to discover a new trend. Look carefully, there is a pattern in the identifiers starting with “FD”,”DR”,”NC”. Now, check the corresponding Item_Types to these identifiers in the data set. You’ll discover, items corresponding to “DR”,  are mostly eatables. Items corresponding to “FD”, are drinks. And, item corresponding to “NC”, are products which can’t be consumed, let’s call them non-consumable. Let’s extract these variables into a new variable representing their counts.

Here I’ll use the substr() and gsub() functions to extract and rename the variables respectively.

> q <- substr(combi$Item_Identifier,1,2)
> q <- gsub("FD","Food",q)
> q <- gsub("DR","Drinks",q)
> q <- gsub("NC","Non-Consumable",q)
> table(q)
   Drinks Food  Non-Consumable
   1317   10201 2686

Let’s now add this information to our data set with a variable name ‘Item_Type_New’.

> combi$Item_Type_New <- q
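
As an alternative to chaining several gsub() calls, the same mapping can be done with a named lookup vector. This is only a sketch on a few made-up identifiers; the names ids and lookup are hypothetical.

```r
# Hypothetical identifiers in the same style as Item_Identifier
ids <- c("FDA15", "DRC01", "NCD19", "FDX07")

# First two characters carry the category prefix
prefix <- substr(ids, 1, 2)

# Named vector used as a lookup table: prefix -> readable label
lookup <- c(FD = "Food", DR = "Drinks", NC = "Non-Consumable")
item_type_new <- unname(lookup[prefix])

print(table(item_type_new))
```

A prefix that is missing from the lookup would come back as NA, which makes unexpected codes easy to spot.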

I’ll leave the rest of the feature engineering intuition to you. You can think of more variables which could add more information to the model. But make sure the variables aren’t correlated. Since they are derived from the same set of variables, there is a high chance for them to be correlated. You can check this in R using the cor() function.
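
Here is one possible way to scan a correlation matrix for strongly correlated pairs. The data is simulated (x1, x2, x3 are invented variables) and the 0.9 cutoff is an arbitrary assumption, not a rule.

```r
set.seed(1)

# Simulated predictors: x2 is almost a linear function of x1
toy <- data.frame(x1 = rnorm(100))
toy$x2 <- -2 * toy$x1 + rnorm(100, sd = 0.01)
toy$x3 <- rnorm(100)

# Correlation matrix; keep only the lower triangle so each pair appears once
cm <- cor(toy)
cm[upper.tri(cm, diag = TRUE)] <- NA

# Positions where the absolute correlation exceeds the chosen cutoff
high <- which(abs(cm) > 0.9, arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r = cm[high],
  stringsAsFactors = FALSE
)
print(high_pairs)
```

Only the x2/x1 pair is reported, with a correlation close to -1.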

Label Encoding and One Hot Encoding

Only one final part of feature engineering is left: Label Encoding and One Hot Encoding.

Label Encoding, in simple words, is the practice of numerically encoding (replacing) different levels of a categorical variable. For instance: in our data set, the variable Item_Fat_Content has 2 levels: Low Fat and Regular. So, we’ll encode Low Fat as 0 and Regular as 1. This will help us convert a factor variable into a numeric variable. This can be simply done using the ifelse() function in R.

> combi$Item_Fat_Content <- ifelse(combi$Item_Fat_Content == "Regular",1,0)
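
For a variable with more than two levels, ifelse() gets clumsy. A common alternative is to go through factor() with an explicit level order. The sketch below shows both approaches on a made-up vector (fat is a hypothetical name).

```r
# Hypothetical values of a two-level categorical variable
fat <- c("Low Fat", "Regular", "Regular", "Low Fat")

# Approach 1: ifelse(), as used above
enc_ifelse <- ifelse(fat == "Regular", 1, 0)

# Approach 2: fix the level order, then convert to integer codes starting at 0
enc_factor <- as.integer(factor(fat, levels = c("Low Fat", "Regular"))) - 1L

print(enc_ifelse)
print(enc_factor)
```

Both give 0 for Low Fat and 1 for Regular; the factor route extends to any number of levels.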

One Hot Encoding, in simple words, is the splitting of a categorical variable into its unique levels, and eventually removing the original variable from the data set. Confused? Here’s an example: let’s take any categorical variable, say, Outlet_Location_Type. It has 3 levels. One hot encoding of this variable will create 3 different variables consisting of 1s and 0s. 1s will represent the existence of a level and 0s will represent its non-existence. Let’s look at a sample:

> sample <- select(combi, Outlet_Location_Type)
> demo_sample <- data.frame(model.matrix(~.-1,sample))
> head(demo_sample)
Outlet_Location_TypeTier.1 Outlet_Location_TypeTier.2 Outlet_Location_TypeTier.3
1             1                         0                        0
2             0                         0                        1
3             1                         0                        0
4             0                         0                        1
5             0                         0                        1
6             0                         0                        1

model.matrix creates a matrix of encoded variables. ~. -1 tells R to encode all variables in the data frame, but suppress the intercept. So, what will happen if you don’t write -1? model.matrix will add an intercept column and treat the first level of the factor as the reference level, so only 2 of the 3 factor levels get their own column.
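
The effect of the -1 is easy to see on a tiny example. This sketch uses a made-up one-column data frame (toy) with the same three tier labels.

```r
# Hypothetical factor with three levels
toy <- data.frame(
  Outlet_Location_Type = factor(c("Tier 1", "Tier 3", "Tier 2", "Tier 3"))
)

# Without intercept: one indicator column per level
full <- model.matrix(~ . - 1, data = toy)

# With intercept: the first level becomes the reference level
ref <- model.matrix(~ ., data = toy)

print(colnames(full))
print(colnames(ref))
```

In the first matrix every row has exactly one 1. In the second, a row belonging to the first level is identified by zeros in both tier columns.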

This was the demonstration of one hot encoding. Hope you have understood the concept now. Let’s now apply this technique to all categorical variables in our data set (excluding ID variable).

> library(dummies)
> combi <- dummy.data.frame(combi, names = c('Outlet_Size','Outlet_Location_Type','Outlet_Type', 'Item_Type_New'), sep='_')

With this, I have shared 2 different methods of performing one hot encoding in R.  Let’s check if encoding has been done.

> str (combi)
$ Outlet_Size_Other : int 0 1 1 0 1 0 0 0 0 0 ...
$ Outlet_Size_High : int 0 0 0 1 0 0 0 0 0 0 ...
$ Outlet_Size_Medium : int 1 0 0 0 0 0 1 1 0 1 ...
$ Outlet_Size_Small : int 0 0 0 0 0 1 0 0 1 0 ...
$ Outlet_Location_Type_Tier 1 : int 1 0 0 0 0 0 0 0 1 0 ...
$ Outlet_Location_Type_Tier 2 : int 0 1 0 0 1 1 0 0 0 0 ...
$ Outlet_Location_Type_Tier 3 : int 0 0 1 1 0 0 1 1 0 1 ...
$ Outlet_Type_Grocery Store : int 0 0 1 0 0 0 0 0 0 0 ...
$ Outlet_Type_Supermarket Type1: int 1 1 0 1 1 1 0 0 1 0 ...
$ Outlet_Type_Supermarket Type2: int 0 0 0 0 0 0 0 1 0 0 ...
$ Outlet_Type_Supermarket Type3: int 0 0 0 0 0 0 1 0 0 1 ...
$ Item_Outlet_Sales : num 1 3829 284 2553 2553 ...
$ Year : num 14 11 15 26 6 9 28 4 16 28 ...
$ Item_Type_New_Drinks : int 1 1 1 1 1 1 1 1 1 1 ...
$ Item_Type_New_Food : int 0 0 0 0 0 0 0 0 0 0 ...
$ Item_Type_New_Non-Consumable : int 0 0 0 0 0 0 0 0 0 0 ...

As you can see, after one hot encoding, the original variables are removed automatically from the data set.

5. Predictive Modeling using Machine Learning

Finally, we’ll drop the columns which have either been converted using other variables or are identifier variables. This can be accomplished using select from dplyr package.

> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Item_Fat_Content,
                            Outlet_Establishment_Year, Item_Type))
> str(combi)

In this section, I’ll cover Regression, Decision Trees and Random Forest. A detailed explanation of these algorithms is outside the scope of this article. These algorithms have been satisfactorily explained in our previous articles. I’ve provided the links for useful resources.

As you can see, we have encoded all our categorical variables. Now, this data set is good to take forward to modeling. Since, we started from Train and Test, let’s now divide the data sets.

> new_train <- combi[1:nrow(train),]
> new_test <- combi[-(1:nrow(train)),]

Linear (Multiple) Regression

Multiple Regression is used when the response variable is continuous in nature and the predictors are many. Had it been categorical, we would have used Logistic Regression. Before you proceed, sharpen your basics of Regression here.

Linear Regression makes the following assumptions:

There exists a linear relationship between response and predictor variables

The predictor (independent) variables are not correlated with each other. Presence of collinearity leads to a phenomenon known as multicollinearity.

The error terms are uncorrelated. Otherwise, it will lead to autocorrelation.

Error terms must have constant variance. Non-constant variance leads to heteroskedasticity.

Let’s now build our first regression model on this data set. R uses the lm() function for regression.

> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)

Adjusted R² measures the goodness of fit of a regression model. The higher the R², the better the model. Our R² = 0.2085. It means we really did something drastically wrong. Let’s figure it out.
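
If you want the number itself rather than reading it off the printed summary, it can be pulled out of the summary object. The sketch below uses R’s built-in mtcars data as a stand-in (not the sales data) and checks the value against the usual formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of rows and p the number of predictors.

```r
# Stand-in model on a built-in data set
fit <- lm(mpg ~ wt + hp, data = mtcars)
s <- summary(fit)

# Values stored in the summary object
r2 <- s$r.squared
adj_r2 <- s$adj.r.squared

# Same quantity computed by hand
n <- nrow(mtcars)
p <- 2  # number of predictors: wt and hp
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

print(c(r2 = r2, adj_r2 = adj_r2, adj_manual = adj_manual))
```

Adjusted R² is never larger than plain R², because it penalizes every extra predictor.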

In our case, I could find that our new variables, i.e. Item count, Outlet Count and Item_Type_New, aren’t helping much. None of these variables is significant. Significant variables are denoted by the ‘*’ sign.

As we know, correlated predictor variables bring down the model accuracy. Let’s find out the amount of correlation present in our predictor variables. This can be simply calculated using:

> cor(new_train)

Alternatively, you can also use corrplot package for some fancy correlation plots. Scrolling through the long list of correlation coefficients, I could find a deadly correlation coefficient:

> cor(new_train$Outlet_Count, new_train$`Outlet_Type_Grocery Store`)
[1] -0.9991203

Outlet_Count is highly correlated (negatively) with Outlet Type Grocery Store. Here are some problems I could find in this model:

  1. We have correlated predictor variables.
  2. We did one hot encoding and label encoding. That’s not necessary, since linear regression handles categorical variables by creating dummy variables intrinsically.
  3. The new variables (item count, outlet count, item type new) created in feature engineering are not significant.
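
The second point can be checked on any data set that has a factor column. This sketch uses the built-in iris data as a stand-in: lm() receives the factor Species directly and builds the dummy variables itself.

```r
# Stand-in example: a factor predictor passed straight to lm()
fit <- lm(Sepal.Length ~ Species, data = iris)

# One coefficient per non-reference level, plus the intercept
print(names(coef(fit)))
```

The first level (setosa) is absorbed into the intercept, and the other two levels each get their own coefficient.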

Let’s try to create a more robust regression model. This time, I’ll be building a simple model without encoding and new features. Below is the entire code:

#load directory
> path <- "C:/Users/manish/desktop/Data/February 2016"

> setwd(path)

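#load data (stringsAsFactors = TRUE keeps text columns as factors in R >= 4.0, which the levels() call below relies on)
> train <- read.csv("train_Big.csv", stringsAsFactors = TRUE)
> test <- read.csv("test_Big.csv", stringsAsFactors = TRUE)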

#create a new variable in test file
> test$Item_Outlet_Sales <- 1

#combine train and test data
> combi <- rbind(train, test)

#impute missing value in Item_Weight
> combi$Item_Weight[is.na(combi$Item_Weight)] <- median(combi$Item_Weight, na.rm = TRUE)

#impute 0 in item_visibility
> combi$Item_Visibility <- ifelse(combi$Item_Visibility == 0, median(combi$Item_Visibility),
                                  combi$Item_Visibility)

#rename level in Outlet_Size
> levels(combi$Outlet_Size)[1] <- "Other"

#rename levels of Item_Fat_Content
> library(plyr)
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("LF" = "Low Fat", "reg" = "Regular"))
> combi$Item_Fat_Content <- revalue(combi$Item_Fat_Content, c("low fat" = "Low Fat"))

#create a new column 2013 - Year
> combi$Year <- 2013 - combi$Outlet_Establishment_Year

#drop variables not required in modeling
> library(dplyr)
> combi <- select(combi, -c(Item_Identifier, Outlet_Identifier, Outlet_Establishment_Year))

#divide data set
> new_train <- combi[1:nrow(train),]
> new_test <- combi[-(1:nrow(train)),]

#linear regression
> linear_model <- lm(Item_Outlet_Sales ~ ., data = new_train)
> summary(linear_model)

Now we have got R² = 0.5623. This teaches us that sometimes all you need is a simple thought process to get high accuracy. Quite a good improvement from the previous model. Next time you work on any model, always remember to start with a simple model.

Let’s check out the regression plots to find more ways to improve this model.

> par(mfrow=c(2,2))
> plot(linear_model)

regression plots

You can zoom these graphs in R Studio at your end. Each of these plots has a different story to tell. But the most important story is told by the Residuals vs Fitted graph.

Residual values are the difference between actual and predicted outcome values. Fitted values are the predicted values. If you look carefully, you’ll see a funnel shaped graph (from right to left). The shape of this graph suggests that our model is suffering from heteroskedasticity (unequal variance in error terms). Had there been constant variance, there would be no pattern visible in this graph.

A common practice to tackle heteroskedasticity is to take the log of the response variable. Let’s do it and check if we can get further improvement.

> linear_model <- lm(log(Item_Outlet_Sales) ~ ., data = new_train)
> summary(linear_model)


And here’s a snapshot of my model output. Congrats! We have got an improved model with R² = 0.72. Now we are on the right path. Once again, you can check the residual plots (you might need to zoom in). You’ll find there is no longer a trend in the residuals vs fitted values plot.
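
Why the log helps can be reproduced on simulated data. In the sketch below the response is generated with multiplicative noise (an assumption chosen to mimic the funnel shape), and spread_ratio() is a hypothetical helper that compares the residual spread for high versus low fitted values. It is a rough check, not a formal test.

```r
set.seed(123)

# Simulated data: the noise multiplies the signal, so the spread grows with x
x <- seq(1, 10, length.out = 200)
y <- exp(0.3 * x + rnorm(200, sd = 0.3))

fit_raw <- lm(y ~ x)        # response on the original scale
fit_log <- lm(log(y) ~ x)   # log-transformed response

# Ratio of residual spread: upper half of fitted values vs lower half
spread_ratio <- function(fit) {
  r <- resid(fit)
  upper <- fitted(fit) > median(fitted(fit))
  sd(r[upper]) / sd(r[!upper])
}

ratio_raw <- spread_ratio(fit_raw)
ratio_log <- spread_ratio(fit_log)
print(c(raw = ratio_raw, log = ratio_log))
```

A ratio near 1 means the residual spread is about the same across the fitted range. The raw model’s ratio is noticeably larger than the log model’s.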


Decision Trees

Before you start, I’d recommend you to glance through the basics of decision tree algorithms. To understand what makes it superior to linear regression, check this tutorial Part 1 and Part 2.

In R, the decision tree algorithm can be implemented using the rpart package. In addition, we’ll use the caret package for doing cross validation. Cross validation is a technique to build robust models which are not prone to overfitting. Read more about Cross Validation.
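
To see what cross validation does under the hood, here is a hand-rolled 5 fold sketch in base R. It uses the built-in mtcars data and a small linear model as stand-ins; caret automates the same loop, and its exact fold assignment will differ from this one.

```r
set.seed(42)
k <- 5

# Randomly assign each row to one of k folds
folds <- sample(rep(1:k, length.out = nrow(mtcars)))

# For each fold: train on the other folds, then score on the held-out fold
fold_rmse <- sapply(1:k, function(i) {
  fit <- lm(mpg ~ wt + hp, data = mtcars[folds != i, ])
  pred <- predict(fit, newdata = mtcars[folds == i, ])
  sqrt(mean((mtcars$mpg[folds == i] - pred)^2))  # RMSE on the held-out rows
})

cv_rmse <- mean(fold_rmse)
print(fold_rmse)
print(cv_rmse)
```

Every row is used for validation exactly once, so the averaged RMSE is a fairer estimate than the error on the training data.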

In R, decision tree uses a complexity parameter (cp). It measures the tradeoff between model complexity and accuracy on training set. A smaller cp will lead to a bigger tree, which might overfit the model. Conversely, a large cp value might underfit the model. Underfitting occurs when the model does not capture underlying trends properly. Let’s find out the optimum cp value for our model with 5 fold cross validation.

#loading required libraries
> library(rpart)
> library(e1071)
> library(rpart.plot)
> library(caret)

#setting the tree control parameters
> fitControl <- trainControl(method = "cv", number = 5)
> cartGrid <- expand.grid(.cp=(1:50)*0.01)

#decision tree
> tree_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "rpart", trControl = fitControl, tuneGrid = cartGrid)
> print(tree_model)

The final value for cp = 0.01. You can also check the table populated in console for more information. The model with cp = 0.01 has the least RMSE. Let’s now build a decision tree with 0.01 as complexity parameter.

> main_tree <- rpart(Item_Outlet_Sales ~ ., data = new_train, control = rpart.control(cp=0.01))
> prp(main_tree)


Here is the tree structure of our model. If you have gone through the basics, you would now understand that this algorithm has marked Item_MRP as the most important variable (being the root node). Let’s check the RMSE of this model and see if this is any better than regression.

#rmse() is assumed to come from the Metrics package
> library(Metrics)
> pre_score <- predict(main_tree, type = "vector")
> rmse(new_train$Item_Outlet_Sales, pre_score)
[1] 1102.774

As you can see, our RMSE has improved from 1140 to 1102.77 with the decision tree. To improve this score further, you can tune the parameters for greater accuracy.
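
When tuning, it helps to see how strongly cp controls tree size. The sketch below fits two trees on the built-in mtcars data (a stand-in, not the sales data) with a small and a very large cp; count_leaves() is a hypothetical helper.

```r
library(rpart)

# Number of terminal nodes in a fitted rpart tree
count_leaves <- function(fit) sum(fit$frame$var == "<leaf>")

# Small cp: splits are kept even if they improve the fit only slightly
small_cp <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 0.01))

# cp = 1: no split can improve the fit enough, so only the root remains
large_cp <- rpart(mpg ~ ., data = mtcars, control = rpart.control(cp = 1))

print(c(small_cp = count_leaves(small_cp), large_cp = count_leaves(large_cp)))
```

The tree with the large cp collapses to a single node (underfitting), while the small cp keeps several leaves.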

Random Forest

Random Forest is a powerful algorithm which holistically deals with missing values, outliers and other non-linearities in the data set. It’s simply a collection of decision trees, hence the name ‘forest’. I’d recommend you to quickly refresh your basics of random forest with this tutorial.

In R, the random forest algorithm can be implemented using the randomForest package. Again, we’ll use the train() function from the caret package for cross validation and finding optimal values of model parameters.

For this problem, I’ll focus on two parameters of random forest: mtry and ntree. ntree is the number of trees to be grown in the forest. mtry is the number of variables considered at every node while building a tree. And, we’ll do a 5 fold cross validation.
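
To get a feel for the two parameters before tuning, here is a small sketch on the built-in mtcars data (a stand-in for the sales data). The values mtry = 3 and ntree = 200 are arbitrary choices for illustration.

```r
library(randomForest)

set.seed(7)

# mtry: variables tried at each split; ntree: number of trees grown
rf <- randomForest(mpg ~ ., data = mtcars, mtry = 3, ntree = 200)

# The fitted object records the settings that were used
print(c(mtry = rf$mtry, ntree = rf$ntree))

# Variable importance: one row per predictor
imp <- importance(rf)
print(imp)
```

A smaller mtry makes the individual trees more different from each other, and more trees generally give a more stable result at the cost of computation time.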

Let’s do it!

#load randomForest library
> library(randomForest)

#set tuning parameters
> control <- trainControl(method = "cv", number = 5)

#random forest model
> rf_model <- train(Item_Outlet_Sales ~ ., data = new_train, method = "parRF",
                    trControl = control, prox = TRUE, allowParallel = TRUE)

#check optimal parameters
> print(rf_model)


If you notice, you’ll see I’ve used method = “parRF”. This is a parallel implementation of random forest, which lets your local machine spend less time on random forest computation. Alternatively, you can also use method = “rf” as a standard random forest function.

Now we’ve got the optimal value of mtry = 15. Let’s use 1000 trees for computation.

#random forest model
> forest_model <- randomForest(Item_Outlet_Sales ~ ., data = new_train, mtry = 15, ntree = 1000)
> print(forest_model)
> varImpPlot(forest_model)

This model throws RMSE = 1132.04 which is not an improvement over decision tree model. Random forest has a feature of presenting the important variables. We see that the most important variable is Item_MRP (also shown by decision tree algorithm).


This model can be further improved by tuning parameters. Also, let’s make our first submission with our best RMSE score, from the decision tree.

> main_predict <- predict(main_tree, newdata = new_test, type = "vector")
> sub_file <- data.frame(Item_Identifier = test$Item_Identifier, Outlet_Identifier = test$Outlet_Identifier,
                         Item_Outlet_Sales = main_predict)
> write.csv(sub_file, 'Decision_tree_sales.csv')

When predicted on out of sample data, our RMSE has come out to be 1174.33. Here are some things you can do to improve this model further:

Since we didn’t use encoding, I encourage you to use one hot encoding and label encoding for the random forest model.

Parameter tuning will help.

Use Gradient Boosting.

Build an ensemble of these models. Read more about Ensemble Modeling.

Do implement the ideas suggested above and share your improvement in the comments section below. Currently, Rank 1 on the Leaderboard has obtained an RMSE score of 1137.71. Beat it!