Chapter 3 Basic Data Analysis in R

Before you go ahead and run the codes in this coursebook, it’s often a good idea to go through some initial setup. Under the Training Objectives section, we’ll outline the syllabus, identify the key objectives and set up expectations for each module. Under the Libraries and Setup section you’ll see some code to initialize our workspace and the libraries we’ll be using for the projects. You may want to make sure that the libraries are installed beforehand by referring back to the packages listed here.

3.1 Preface

3.1.1 Introduction

From the previous course, we have learned how R and its extension Rmarkdown can be a versatile tool to generate a high-quality report in various outputs (PDF, Word, HTML Report, Presentation, etc.) Furthermore, Rmarkdown is distinguished for its ability to incorporate the data analytical process from raw data into charts on the document, without ruining its visual. Proper data analysis will certainly result in high-quality content for your business report. Knowing various data analytical processes in R will greatly enhance our ability to use R for automated reporting, using it to its fullest potential.

3.1.2 Training Objectives

This is the second course of Automate: Business Reporting with R. The objective of this course is to provide participants a comprehensive introduction to understand the basic data analysis in R. We will learn techniques on how to transform raw data into a format that is easier to analyze and using descriptive statistics to explain our data. The syllabus covers:

Data Science Workflow
Basic Programming in R
- Object and Environment
- Data Structures in R
- Data Type in R
Tidying Messy Sheets in R
- Importing Dataset
- Data Cleansing
- Data Transformation

3.1.3 Library and Setup

In this Library and Setup section you’ll see some code to initialize our workspace, and the packages we’ll be using for this project.

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

You will need to use install.packages() to install any packages that are not yet downloaded onto your machine. To install packages, type the command below on your console then press ENTER.

install.packages("tidyverse")
install.packages("readxl")

Then you need to load the package into your workspace using the library() function:

library(tidyverse)
library(readxl)

3.2 Data Science Workflow

Data analysis is highly linked with a buzz-word that many people talk about these days that is Data Science. Data Science is a field that focuses on the study and development of how we can analyze data. It compromises various acts of data analysis which aims to transform data into valuable insights.

There is a certain workflow that applies to a data science project. Below is a good chart that explains the workflow of a data science project.

The first step of a data science work is to import our data into our analytics toolset. There are various analytical tools available, but most of the data science work is usually performed using programming languages. In this workshop, we will be using R.

Once we’ve imported our data, we need to clean our data. Data cleaning means to store our data with the correct labels, values, and structure in the way it should be stored. Data cleaning is important because it helps us focus on our questions about the data, instead of struggling to get our data into the right form for different purposes.

Once we have clean data, the next step is to transform it. Data transformation is the process of converting data or information from one format to another. In data science work, it is usually related to narrowing our observations of interest from the rest of the data. This step is important because it helps us focus on our business questions.

As an example is to see communities of people in Jakarta instead of in Indonesia. This is done by creating new data from our dataset by subsetting, summarizing, or adding new variables or information that explain our data.

Once we have transformed our data, we can perform some exploratory data analysis using descriptive statistics and visualize them through charts to improve our understanding of the data. We may also build a machine learning model to predict future values.

The last step of data science work is communication. It is a critical part of any data analysis project. In this workshop, we will communicate our findings on data through business reports using Rmarkdown.

3.3 Basic Programming in R

The first important step in learning data analysis in R is to recognize how R works, and how it works using RStudio- Integrated Development Environment (IDE) for R. First, we should learn about how R and the features in RStudio work.

As an introduction, let’s write our first R code by typing print("Hello!") into the console (by default, it is located at the bottom-left of RStudio) and press ENTER. Another preferred method is to run it in the chunk of R Markdown; see the green button that says “Run Current Chunk”, or press CTRL + SHIFT + ENTER in the chunk:

# this is a nice comment
print("Hello!")

#> [1] "Hello!"

# print("Hello Algoritma!")

print() is a function in R. Function is code that commands or tells R to do something. In this example, print() tells R to “print” the parameter we put inside the bracket, that is "Hello!".

Also, note the # character before our code on several lines in the chunks above. These characters will be considered as comments, and will not be considered as commands by R. If we run the chunk above, the print("Hello Algoritma!") will be ignored by R

3.3.1 Object and Environment

"Hello!" in the previous chunk, was an object. Objects can be written raw, like the example above, or can be a value that assigned to an object that has a name, using an operator <-:

activity <- "Programming"

print(activity)

#> [1] "Programming"

After assigning it to an object, we can see a list of the objects we have created in the Environment Pane (by default, located in the upper-right of RStudio). We usually assign our data into an object and further analyze it in R.

It is also important to note that R is case-sensitive; for example, an object “Algoritma” and “algoritma” will be considered different:

# use `==` to compare 2 objects
activity == "Programming"

#> [1] TRUE

activity == "programming"

#> [1] FALSE

# try using object `activity` inside the function `paste()`

paste(activity, "is one of the most therapeutic activities.")

#> [1] "Programming is one of the most therapeutic activities."

# there is no object named 'Activity';
# therefore the command below will reveals an error
# print(Activity)

Every object that has a name will be unique, so if we create an object with the same name, it will replace the old object:

activity <- "Coding"

paste(activity, "is one of the most therapeutic activities.")

#> [1] "Coding is one of the most therapeutic activities."

Notice that the word “Programming” has been replaced by “Coding”.

3.3.2 Data Structures in R

The object named activity that we have created before is one of the basic forms of data structures in R, namely vector. The other data structures that we will study are matrix, and data.frame.

3.3.2.1 Vector

Vector is a 1-dimensional object (meaning that it only has length; not in the form of rows & columns). A vector can be either one of these types:

character
numeric
integer
logical

A vector can only store values with the same type:

vector1 <- c("Learning", "Data", "Science", "2019")
vector2 <- c(1, 0, 1)

# we can check the data type using `class()`
class(vector1)

#> [1] "character"

class(vector2)

#> [1] "numeric"

We use the function c() or ‘concatenate’ to merge two or more values/objects. We will see a lot of this function throughout our courses.

Every object has attributes. Some common attributes that we usually inspect are class, length, dim, colnames, etc.

We can use length() to inspect the length or the total number of values in a vector.

length(vector2)

#> [1] 3

We will discuss more about dim and colnames later in this course when we talk about 2-dimensional object.

3.3.2.1.1 Character

A character type vector will often be used to work with text data. For example, we have a list of names that we will send an email to:

name_vec <- c("Samuel Chan", "Tiara Dwiputri", "Ahmad Husain Abdullah")

name_vec

#> [1] "Samuel Chan"           "Tiara Dwiputri"        "Ahmad Husain Abdullah"

We can use the paste() function to make the sentence “Dear XXX,”, programmatically:

paste("Dear ", name_vec, ",")

#> [1] "Dear  Samuel Chan ,"           "Dear  Tiara Dwiputri ,"       
#> [3] "Dear  Ahmad Husain Abdullah ,"

As we see from the example above, we can process text data by assigning them into character vector. There is also a wide possibility for advanced data processing techniques in text analysis.

3.3.2.1.2 Numeric

A numeric type vector often used for numerical data processing, especially mathematical operations. For example, suppose that we have profit data from the last quarter of our business operations (3 months):

profit <- c(150, 87, 95)

profit

#> [1] 150  87  95

We can use mean() to calculate the average of our profits:

mean(profit)

#> [1] 110.6667

R and its extensions have various functions that we can use to perform various kinds of mathematical operations and more complex statistics.

3.3.2.1.3 Integer

An integer type vector is basically the same as the numeric type vector; however, integer type cannot have numbers behind commas or is not a fraction.

int_vec <- c(0L, 1L, 4L, 7L, 10L) 

#`L` is used to prevent R read these data as 'numeric' & label them as 'integer' instead.

int_vec

#> [1]  0  1  4  7 10

Because integers are basically the same as numeric, we can perform mathematical operations using them.

However, the best practice of using integers is to store index information or information that acts as an identifier, such as employee numbers, ID numbers, etc.

3.3.2.1.4 Logical

A logical type vector is better known as boolean terms in general programming languages. This type of vector has 2 kinds of values (TRUE/FALSE; T/F).

This type of vector is usually obtained from other operations, rather than being made by initially. For example, when we have data related to customers that are eligible/not to be given a promo based on their loyalty point:

cust_promo <- c(150, 15, 200, 175)

If for example, the minimum points required to get a promo is 100 (cust_promo >= 100), we can use function ifelse() to classify whether a customer will be given a promo or not:

cust_promo2 <- ifelse(test = cust_promo >= 100, 
                      yes = TRUE, no = FALSE)
cust_promo2

#> [1]  TRUE FALSE  TRUE  TRUE

3.3.2.1.5 Other Data Types: Factor & Date

During data processing in R, we will often use several more advanced data types, namely factor and date.

Factor

Factor is a character type vector that has levels or categorical values. Each level can appear more than once in a given vector and therefore have frequency.

fctr <- c("Type A", "Type A", "Type B", "Type A", "Type C", "Type C")

fctr <- as.factor(fctr) # converting character-type vector into factor

fctr

#> [1] Type A Type A Type B Type A Type C Type C
#> Levels: Type A Type B Type C

Date

Date is a special data type that contains date data in a standardized format.

purchase_date <- c("2019-01-01", "2019-01-01", "2019-01-02")

purchase_date <- as.Date(purchase_date) # converting character-type vector into date (ymd)

purchase_date

#> [1] "2019-01-01" "2019-01-01" "2019-01-02"

3.3.2.2 Matrix

Matrix is one way to store 2-dimensional data in an object. Same as vectors, a matrix can only store objects with the same type. To make a matrix, we can use the function matrix():

mat <- matrix(11:16, nrow = 3, ncol = 2)
mat

#>      [,1] [,2]
#> [1,]   11   14
#> [2,]   12   15
#> [3,]   13   16

Notice how the value in a matrix has index [1, 1] - can be read as [row index, column index]. We can access a certain value from a matrix by its index.

mat[1, ]

#> [1] 11 14

mat[3,1]

#> [1] 13

To see the dimension of a matrix, we can use the dim() function and it will return the number of rows and columns in the matrix.

dim(mat)

#> [1] 3 2

We can also retrieve or set the column names of a matrix-like object by using colnames(), while for the row names using rownames().

3.3.2.3 Data Frame

During our time processing data, we will be frequently faced with data in the form of tables, such as the results of a survey of each employee’s activities or the recapitulation of sales from each store branch. Data in the form of table is also known as data.frame. Similar to matrix, it has 2 dimensions namely rows–of observations–and columns–of variables, but it can store different data types.

For example, we can create a data.frame using a function data.frame():

categories_df <- data.frame(categories = c("OfficeSupplies", "Computers", "Packaging"), category_id = 111:113)

categories_df

#>       categories category_id
#> 1 OfficeSupplies         111
#> 2      Computers         112
#> 3      Packaging         113

In this course, we will be focusing more on this data type.

3.4 Tidying Messy Sheets in R

We are now going to learn techniques on how to transform raw data into a format that is easier for analysis. R provides various functions that we can use to process our dataset; from importing datasets to complex aggregations. These functions are compiled in a package. To further learn about packages, you can see Library and Setup section.

Along with the development of R, we can now use several packages that are currently considered as best practices for processing data in R. These packages are included in tidyverse package. Additionally, as most workers use Microsoft Excel to store their data, we will also use readxl package to import an excel spreadsheet into R.

To load packages into our R session, type:

library(tidyverse)
library(readxl)

3.4.1 Importing Dataset

R can read data from various formats. For example, by default R has provided a function to import data in comma-separated value (.csv) format using read.csv(). Meanwhile, data stored in other formats may need additional packages and functions for importing.

In this example, we can use readxl package to import data stored in excel formats (.xlxs, .xls, etc.). We will use “data-attrition.xlsx” (located in /data of our working directory) as our data input and assign it to an object named attrition (for Sheet 1) and survey (for Sheet 2). This data stores employees attrition data and commonly analyzed by the Human Resource Department to know what factors might influence the employee’s decisions on leaving the company.

attrition <- read_excel(path = "data_input/data-attrition.xlsx",
                             sheet = 1) # sheet index or name

survey <- read_excel(path = "data_input/data-attrition.xlsx",
                          sheet = "survey")

By default, the read_* function will try to detect the data type of each column. We can use the head() to see the first 6 rows of our data to confirm this:

library(kableExtra)

head(attrition, 10)

employee_number	age	gender	department	job_role	job_level	education	education_field	monthly_income	hourly_rate	daily_rate	monthly_rate	percent_salary_hike	performance_rating	standard_hours	over_time	over_18	marital_status	business_travel	distance_from_home	stock_option_level	num_companies_worked	total_working_years	training_times_last_year	years_at_company	years_in_current_role	years_since_last_promotion	years_with_curr_manager	attrition
1	41	female	sales	sales_executive	2	2	life_sciences	5993	94	1102	19479	11	3	80	yes	y	single	travel_rarely	1	0	8	8	0	6	4	0	5	yes
2	49	male	research_development	research_scientist	2	1	life_sciences	5130	61	279	24907	23	4	80	no	y	married	travel_frequently	8	1	1	10	3	10	7	1	7	no
4	37	male	research_development	laboratory_technician	1	2	other	2090	92	1373	2396	15	3	80	yes	y	single	travel_rarely	2	0	6	7	3	0	0	0	0	yes
5	33	female	research_development	research_scientist	1	4	life_sciences	2909	56	1392	23159	11	3	80	yes	y	married	travel_frequently	3	0	1	8	3	8	7	3	0	no
7	27	male	research_development	laboratory_technician	1	1	medical	3468	40	591	16632	12	3	80	no	y	married	travel_rarely	2	1	9	6	3	2	2	2	2	no
8	32	male	research_development	laboratory_technician	1	2	life_sciences	3068	79	1005	11864	13	3	80	no	y	single	travel_frequently	2	0	0	8	2	7	7	3	6	no
10	59	female	research_development	laboratory_technician	1	3	medical	2670	81	1324	9964	20	4	80	yes	y	married	travel_rarely	3	3	4	12	3	1	0	0	0	no
11	30	male	research_development	laboratory_technician	1	1	life_sciences	2693	67	1358	13335	22	4	80	no	y	divorced	travel_rarely	24	1	1	1	2	1	0	0	0	no
12	38	male	research_development	manufacturing_director	3	3	life_sciences	9526	44	216	8787	21	4	80	no	y	single	travel_frequently	23	0	0	10	2	9	7	1	8	no
13	36	male	research_development	healthcare_representative	2	3	medical	5237	94	1299	16577	13	3	80	no	y	married	travel_rarely	27	2	6	17	3	7	7	7	7	no

head(survey, 10)

employee_number	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	3	1	2	1	4
2	2	4	3	3	2
4	2	2	4	3	3
5	3	3	4	3	3
7	3	4	1	3	2
8	3	3	4	2	4
10	4	1	3	2	1
11	3	2	4	3	3
12	2	2	4	3	3
13	3	2	3	2	3

3.4.1.1 Relational Data

In various types of businesses, usually have a lot of data tables that must be combined to answer several business questions that are interested in. Multiple data tables are commonly called relational data because they have relations, not just one single independent data. Here’s an example of relational data looks like:

There are two common verbs that are designed to work with relational data: mutating joins and filtering join. But in this section, we will focus more on examples of mutating joins. Lookups to the Introduction to Relational Data for more examples of join data tables.

Mutating joins adds new variables to one data frame from matching observations in another.

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies across variables from one table to the other. Here are some types of mutating join:

inner join
outer join

left_join
- right_join
- full_join

Below is an illustration of the different types of joins with a Venn diagram:

3.4.1.1.1 Inner Join

inner_join(x, y): All rows from x where there are matching values in y, and all columns from x and y.

We are going to use this type of join for our data frame.

# joining data frame
data_join <- inner_join(x = attrition, y = survey, by = "employee_number")

head(data_join)

employee_number	age	gender	department	job_role	job_level	education	education_field	monthly_income	hourly_rate	daily_rate	monthly_rate	percent_salary_hike	performance_rating	standard_hours	over_time	over_18	marital_status	business_travel	distance_from_home	stock_option_level	num_companies_worked	total_working_years	training_times_last_year	years_at_company	years_in_current_role	years_since_last_promotion	years_with_curr_manager	attrition	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	41	female	sales	sales_executive	2	2	life_sciences	5993	94	1102	19479	11	3	80	yes	y	single	travel_rarely	1	0	8	8	0	6	4	0	5	yes	3	1	2	1	4
2	49	male	research_development	research_scientist	2	1	life_sciences	5130	61	279	24907	23	4	80	no	y	married	travel_frequently	8	1	1	10	3	10	7	1	7	no	2	4	3	3	2
4	37	male	research_development	laboratory_technician	1	2	other	2090	92	1373	2396	15	3	80	yes	y	single	travel_rarely	2	0	6	7	3	0	0	0	0	yes	2	2	4	3	3
5	33	female	research_development	research_scientist	1	4	life_sciences	2909	56	1392	23159	11	3	80	yes	y	married	travel_frequently	3	0	1	8	3	8	7	3	0	no	3	3	4	3	3
7	27	male	research_development	laboratory_technician	1	1	medical	3468	40	591	16632	12	3	80	no	y	married	travel_rarely	2	1	9	6	3	2	2	2	2	no	3	4	1	3	2
8	32	male	research_development	laboratory_technician	1	2	life_sciences	3068	79	1005	11864	13	3	80	no	y	single	travel_frequently	2	0	0	8	2	7	7	3	6	no	3	3	4	2	4

# save data_join

write_csv(data_join, path = "data_input/sales_join.csv")

3.4.1.1.2 Left Join

left_join(x, y): All rows from x and all columns from x and y. Rows in x with no matches in y will have NA values in the joined data frame.

# generete data dummy

band <- data.frame(name = c("Mick", "John", "Paul"), 
                   band = c("Stones", "Beatles", "Beatles"))

instrument <- data.frame(name = c("John", "Paul", "Keith"),
                         plays = c("guitar", "bass", "guitar"))

instrument2 <- data.frame(artist = c("John", "Paul", "Keith"),
                          plays = c("guitar", "bass", "guitar"))

# example left join

left_join(band, instrument, by = "name")

name	band	plays
Mick	Stones	NA
John	Beatles	guitar
Paul	Beatles	bass

3.4.1.1.3 Right Join

right_join(x, y): All rows from y and all columns from x and y. Rows in y with no matches in x will have NA values in the joined data frame.

# example right join

right_join(band, instrument, by = "name")

name	band	plays
John	Beatles	guitar
Paul	Beatles	bass
Keith	NA	guitar

3.4.1.1.4 Full Join

full_join(x, y): All rows and all columns from both x and y. Where there are non-matching values, returns NA.

# example full join

full_join(band, instrument, by = "name")

name	band	plays
Mick	Stones	NA
John	Beatles	guitar
Paul	Beatles	bass
Keith	NA	guitar

3.4.1.2 [Opt] Working with Database

In real life, data sources are stored in many different ways. Some might have it in CSV files, some have it in excel, some in a server, some are retrieved from API. Some useful skills you might want to put is how to connect to data sources from R. Most of structured databases is stored using a relational database, and SQL is a common DBMS to use. In this example, I’m going to show you how to connect to a remote Microsoft SQL database. Please do note that connecting to a database require different driver, and we will use ODBC connection. In most cases, R is compatible to connect to major databases with available drivers, such as Microsoft SQL Server, SAP, SPSS.

The driver we want to use can be loaded using odbc and some other SQL-related packages. If you are interested to try, let’s start by installing the packages first:

# install required sql library
install.packages(c("DBI", "odbc", "RMySQL"))

Then call them using library():

# import required library
library(DBI)
library(odbc)
library(RMySQL)

Once connected, you can make a chunk that runs an SQL query in it that uses the connection object we just made, or just use it in an R session. This is an example of how to make a connection:

# example of connecting through odbc

cn <- DBI::dbConnect(
  odbc::odbc(),
  Driver = "SQL Driver Name", # sql driver name as installed in machine
  Server = "hostname-or-ip-server", # hostname / IP server
  uid = "yourusername", # username
  Database = "yourdatabase", # database
  PWD = rstudioapi::askForPassword("yourpassword"),
  Port = 9999 # your port Number
)

3.4.2 Data Cleansing

After importing our data, we can perform data cleansing. These are some criteria that make up a clean data:

Each variable is placed in its own column, each observation in its own row, and each value in its own cell.

This is a very important aspect of clean data. This helps us perform various functions and analysis correctly. An example of this is in the figure below:

Source: Grolemund & Wickham, 2016.

Standardized column names

R has programming style guides that aim to make R code created by users easier to read, share, and verify. Examples of R programming style guides are Google’s R Style Guide and the Tidyverse Style Guide. Our column names should at least follow this set of style guides. Some examples are naming our columns using lowercase letter and separate it with a dot(.) or underscore(_).
Consistent naming for character-type values (avoids duplicate data)
Numerical values should not contain characters and vice versa
Date-type variables should use either of this format: ymd or ymd-hms

Below is an example of ‘dirty’ data:

Kode SITC 3 digit	01-01-2017	02-01-2017	03-01-2017	04-01-2017	05-01-2017	06-01-2017	07-01-2017	08-01-2017
001 Live animals other than animals of division 03	102.77	95.34	100.09	98.38	104.27	100.43	103.84	111.69
012 Daging lainnya dan sisa daging yang dapat dimakan, segar, dingin atau beku (kecuali daging dan sisa daging yang tidak layak atau tidak cocok untuk konsumsi manusia)	132	161.21	115.02	108.81	107.21	99.99	111.43	102.23
016 Daging dan sisa daging yang dapat dimakan, asin, dalam air garam, kering atau diasap; tepung dan tepung kasar dari daging atau sisanya	299.17	101.4	86.61	29.37	62.57	107.22	176.28	383.32
017 Daging dan sisa daging yang dapat dimakan, diolah atau diawetkan, tidak dirinci	117	179.67	248.48	153.21	121.5	460.65	161.26	126.28
022 Susu dan krim dan produk susu selain mentega atau keju	96.62	129.2	98.67	100.31	107.95	91.81	94.09	91.02
023 Mentega dan lemak lainnya dan minyak yang berasal dari susu	2 721.92	85.01	175.95	7 153.54	63.33	139.16	40.26	102.77
024 Keju dan dadih	74.47	105.21	61.41	85.58	20.64	88.45	97.43	93.82
034 Ikan, segar (hidup atau mati), dingin atau beku	127.81	120.58	137.06	149.04	212.62	136.2	128.64	121.34
035 Ikan, kering, asin atau dalam air garam; ikan asap (dimasak atau tidak sebelum atau selama proses pengasapan); tepung, tepung kasar dan pelet ikan, layak untuk dikonsumsi manusia	141.31	249.47	128.43	160.96	148.65	196.31	155.64	148.07
036 Krustasea, moluska dan invertebrata air, berkulit maupun tidak, segar (hidup atau mati), dingin, beku, kering, asin atau dalam air garam; krustasea, berkulit, dikukus atau direbus, dingin atau tidak , beku, kering, asin atau dalam air garam	157.06	301.47	109.3	181.98	173.3	135.85	127.1	360.75

This data considered as ‘dirty’ because:

the variables were placed in rows instead of columns,
the date-type data assigned as column names (this prevent us from utilizing information related to date)
the numerical value still labeled as characters (chr),
the numerical value doesn’t have consistency in writing/storing their data (spaces used as the separator, decimals are not applied to all values, etc.)

On the contrary, here is an example of a ‘clean’ data,

code	date	export
001 Live animals other than animals of division 03	2017-01-01	102.77
001 Live animals other than animals of division 03	2017-01-02	95.34
001 Live animals other than animals of division 03	2017-01-03	100.09
001 Live animals other than animals of division 03	2017-01-04	98.38
001 Live animals other than animals of division 03	2017-01-05	104.27
001 Live animals other than animals of division 03	2017-01-06	100.43

The above data is already cleaned by changing its data structure using tidyr package. The date information that previously scattered in different columns is now stored in one column date. Additionally, the date information also changed into the correct date data type using lubridate package. These packages are compiled in tidyverse and were simultaneously loaded into our session when we previously loaded tidyverse. Although we will not use all functions and packages provided in tidyverse, we will discuss the basic concept and practices of data cleaning and transformation in R. This understanding will ease your time in exploring many possibilities of data analysis in R in the long run!

Now, let’s look back at our data. We can use the function glimpse() from dplyr package to inspect the structure of our data.

glimpse(data_join)

#> Observations: 1,470
#> Variables: 34
#> $ employee_number            <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20,...
#> $ age                        <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34, 2...
#> $ gender                     <chr> "female", "male", "male", "female", "male", "male", "fema...
#> $ department                 <chr> "sales", "research_development", "research_development", ...
#> $ job_role                   <chr> "sales_executive", "research_scientist", "laboratory_tech...
#> $ job_level                  <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1, 4, ...
#> $ education                  <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2, 4, ...
#> $ education_field            <chr> "life_sciences", "life_sciences", "other", "life_sciences...
#> $ monthly_income             <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 523...
#> $ hourly_rate                <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 93, 5...
#> $ daily_rate                 <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299, ...
#> $ monthly_rate               <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964, 13335, 878...
#> $ percent_salary_hike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 11, 1...
#> $ performance_rating         <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
#> $ standard_hours             <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
#> $ over_time                  <chr> "yes", "no", "yes", "yes", "no", "no", "yes", "no", "no",...
#> $ over_18                    <chr> "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y...
#> $ marital_status             <chr> "single", "married", "single", "married", "married", "sin...
#> $ business_travel            <chr> "travel_rarely", "travel_frequently", "travel_rarely", "t...
#> $ distance_from_home         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, 19, 24, 21, ...
#> $ stock_option_level         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 2, 0, ...
#> $ num_companies_worked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, 1, 0, 1, 2, ...
#> $ total_working_years        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3, 6, 10, 7, ...
#> $ training_times_last_year   <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, 1, 5, 2, 3, ...
#> $ years_at_company           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4, 10, 6, 1, 2...
#> $ years_in_current_role      <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, 9, 2, 0, 8, ...
#> $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, 8, 0, 0, 3, ...
#> $ years_with_curr_manager    <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, 8, 5, 0, 7, ...
#> $ attrition                  <chr> "yes", "no", "yes", "no", "no", "no", "no", "no", "no", "...
#> $ job_involvement            <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 4, 4, 2, ...
#> $ relationship_satisfaction  <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, 3, 4, 2, 3, ...
#> $ environment_satisfaction   <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, 2, 1, 4, 1, ...
#> $ work_life_balance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, ...
#> $ job_satisfaction           <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, 1, 2, 4, 4, ...

The result above shown that our data consist of 1,470 rows (observations) and 35 columns (variables) whereas each column has different data types. This kind of information is very helpful for us to understand our data. We should also understand the description of each column to evaluate whether that column has the correct data type and whether it has valuable information or not. While most of the column names are self-explanatory, below are additional descriptions for certain columns:

attrition: employee’s churn status (yes/no)
education: employee’s education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
employee_count: employee’s count for respective id
employee_number: employee’s id
num_companies_worked: number of companies worked at
stock_option_level: how many company stocks employees can own from the company

From the above result, we can see that our data is quite tidy for its structure. Each variable is placed in its own column, each observation in its own row, and each value in its own cell. Even so, we can also see that some columns are not stored in the correct data type. Column education, maybe better stored as factors for it may not represent a continuous scale but a categorical value. All columns stored as character (chr) should also be stored as factors.

The dplyr package from tidyverse provides various functions for data cleaning such as making/changing columns. Not only that, ‘dplyr’ also provides Grammar of Coding that simplifies our work with codes.

We can change the column data type using the following command and we will breakdown for each steps below.

data_join <- data_join %>% 
  mutate(education = as.factor(education)) %>% 
  mutate_if(is.character, as.factor)

Breakdown

The first thing we need to do is to understand the grammar of coding using tidyverse. If you see at the end of each command from the chunk above, there is a pipe operator %>% that allows users to perform chaining operations. In chaining operations, the data on the left side of the %>% will be the data used for the next function/command.
The first function we used above, is mutate(). This function is used to transform a column or to create a new column. In this case, we use ‘mutate()’ to transform ‘education’, ‘job_level’, and ‘stock_option_level’ into factor data type using function as.factor().
There is also function mutate_if() to transform multiple columns with a certain condition, similar to conditional formatting in Excel. In this case, if it stores a character data type (is.character) then it will be transformed into factor using as.factor().

Now let’s see the first 6 observations of our data to check whether our data is already clean enough.

head(data_join, 10)

employee_number	age	gender	department	job_role	job_level	education	education_field	monthly_income	hourly_rate	daily_rate	monthly_rate	percent_salary_hike	performance_rating	standard_hours	over_time	over_18	marital_status	business_travel	distance_from_home	stock_option_level	num_companies_worked	total_working_years	training_times_last_year	years_at_company	years_in_current_role	years_since_last_promotion	years_with_curr_manager	attrition	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	41	female	sales	sales_executive	2	2	life_sciences	5993	94	1102	19479	11	3	80	yes	y	single	travel_rarely	1	0	8	8	0	6	4	0	5	yes	3	1	2	1	4
2	49	male	research_development	research_scientist	2	1	life_sciences	5130	61	279	24907	23	4	80	no	y	married	travel_frequently	8	1	1	10	3	10	7	1	7	no	2	4	3	3	2
4	37	male	research_development	laboratory_technician	1	2	other	2090	92	1373	2396	15	3	80	yes	y	single	travel_rarely	2	0	6	7	3	0	0	0	0	yes	2	2	4	3	3
5	33	female	research_development	research_scientist	1	4	life_sciences	2909	56	1392	23159	11	3	80	yes	y	married	travel_frequently	3	0	1	8	3	8	7	3	0	no	3	3	4	3	3
7	27	male	research_development	laboratory_technician	1	1	medical	3468	40	591	16632	12	3	80	no	y	married	travel_rarely	2	1	9	6	3	2	2	2	2	no	3	4	1	3	2
8	32	male	research_development	laboratory_technician	1	2	life_sciences	3068	79	1005	11864	13	3	80	no	y	single	travel_frequently	2	0	0	8	2	7	7	3	6	no	3	3	4	2	4
10	59	female	research_development	laboratory_technician	1	3	medical	2670	81	1324	9964	20	4	80	yes	y	married	travel_rarely	3	3	4	12	3	1	0	0	0	no	4	1	3	2	1
11	30	male	research_development	laboratory_technician	1	1	life_sciences	2693	67	1358	13335	22	4	80	no	y	divorced	travel_rarely	24	1	1	1	2	1	0	0	0	no	3	2	4	3	3
12	38	male	research_development	manufacturing_director	3	3	life_sciences	9526	44	216	8787	21	4	80	no	y	single	travel_frequently	23	0	0	10	2	9	7	1	8	no	2	2	4	3	3
13	36	male	research_development	healthcare_representative	2	3	medical	5237	94	1299	16577	13	3	80	no	y	married	travel_rarely	27	2	6	17	3	7	7	7	7	no	3	2	3	2	3

We can see that our attrition data is clean enough and therefore we can continue to the data transformation.

3.4.3 Data Transformation

Data transformation is a process of converting data from one format or structure into another. Data transformation is a useful tool for Explanatory Data Analysis (EDA) which aims to gain insight and improve our understanding of data by looking at a more detailed or narrower perspective, based on our business question. The package dplyr also provides various functions for data transformation such as making/changing columns, subsetting columns/rows, and summarizing data.

To put it into practice, let’s try performing data transformation to our data. Imagine ourselves as an HR staff who wants to evaluate the employee’s attrition status of your company. We would like to know:

Which department gives the highest number of employee attrition?

The following are complete codes that command R to perform the data transformation. We will breakdown each step and explain it in detail.

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup() %>% 
  filter(attrition == "yes") %>% 
  select(-attrition) %>% 
  arrange(desc(count))

department	count
research_development	133
sales	92
human_resources	12

Breakdown

In the example above, the first thing we do is creating a summarise of our data using group_by(), summarise(), and ungroup(). The function group_by will take existing data (columns) and converts it into a ‘grouped table’ where operations are performed ‘by group’. The function summarise() will perform a summary statistic calculation, in this case, count by using n() on the ‘grouped table’ from the previous chaining operation. The function ungroup() remove the grouping condition. These functions are quite similar to the pivot table in Excel.

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup()

attrition	department	count
no	human_resources	51
no	research_development	828
no	sales	354
yes	human_resources	12
yes	research_development	133
yes	sales	92

The next function is filter(). We can use this function to subset our row data based on specific conditions. In this case, we filter (only take) every observation or rows from column attrition which have ‘yes’ as it values.

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup() %>% 
  filter(attrition == "yes")

attrition	department	count
yes	human_resources	12
yes	research_development	133
yes	sales	92

Next we can use select() to remove the column attrition from our data.

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup() %>% 
  filter(attrition == "yes") %>% 
  select(-attrition)

department	count
human_resources	12
research_development	133
sales	92

The final step is to sort our data using arrange(). To sort it descending, we can use the helper function desc():

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup() %>% 
  filter(attrition == "yes") %>% 
  select(-attrition) %>% 
  arrange(desc(count))

department	count
research_development	133
sales	92
human_resources	12

From the result above, we know that the department which gives the highest number of employees attrition is from Research and Development. This department may need further analysis regarding its employee’s attrition.

The above example shows how we can summarise our full data into a simplified version which only displays the number of leaving employees (attrition = “yes”) per department. Others might prefer to display both staying and leaving employees from all departments but with a more compressed structure. For example, we will use some of the code we made earlier but without filtering.

data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup()

Instead, we will transform our data into its wider format using function pivot_wider() from tidyr package. This function will separate the values from column ‘attrition’ into 2 different columns and filling it with values from column ‘count’.

data_pivot <- data_join %>%
  group_by(attrition, department) %>%
  summarise(count = n()) %>%
  ungroup() %>% 
  pivot_wider(names_from = attrition, values_from = count)

data_pivot

department	no	yes
human_resources	51	12
research_development	828	133
sales	354	92

Alternatively, we can transform data from its wide into its long format using pivot_longer():

data_pivot %>% 
  pivot_longer(cols = c(no, yes), # columns to be joined
               names_to = "attrition", # name of new column
               values_to = "count") # name of new column (value)

department	attrition	count
human_resources	no	51
human_resources	yes	12
research_development	no	828
research_development	yes	133
sales	no	354
sales	yes	92

There are many more functions and technical details that we can explore further from tidyverse. If you are interested, you can visit the following articles:

Dive Deeper

The grammar of code from ‘dplyr’ is very powerful for Exploratory Data Analysis (EDA). Using attrition data, try to answer the following questions to sharpen your intuition on data analysis using R:

From the Research and Development department, which job role gives the highest number of employee attrition and what is their average relationship satisfaction? What insight can we take from the result?

# your code here:

Is there gender inequality in your company? Calculate the percentage of the female and male employees from each department!

# your code here:

You can also create your own question and try to answer it to sharpen your analytical skills!

employee_number	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	3	1	2	1	4
2	2	4	3	3	2
4	2	2	4	3	3
5	3	3	4	3	3
7	3	4	1	3	2
8	3	3	4	2	4
10	4	1	3	2	1
11	3	2	4	3	3
12	2	2	4	3	3
13	3	2	3	2	3

employee_number	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	3	1	2	1	4
2	2	4	3	3	2
4	2	2	4	3	3
5	3	3	4	3	3
7	3	4	1	3	2
8	3	3	4	2	4
10	4	1	3	2	1
11	3	2	4	3	3
12	2	2	4	3	3
13	3	2	3	2	3

employee_number	job_involvement	relationship_satisfaction	environment_satisfaction	work_life_balance	job_satisfaction
1	3	1	2	1	4
2	2	4	3	3	2
4	2	2	4	3	3
5	3	3	4	3	3
7	3	4	1	3	2
8	3	3	4	2	4
10	4	1	3	2	1
11	3	2	4	3	3
12	2	2	4	3	3
13	3	2	3	2	3