Chapter 3 Basic Data Analysis in R

Before you go ahead and run the code in this coursebook, it's a good idea to complete some initial setup. The Training Objectives section outlines the syllabus, identifies the key objectives, and sets expectations for each module. The Libraries and Setup section contains code to initialize our workspace and the libraries we'll be using for the projects. Make sure the libraries listed there are installed beforehand.

3.1 Preface

3.1.1 Introduction

In the previous course, we learned how R and its extension Rmarkdown can be a versatile tool for generating high-quality reports in various output formats (PDF, Word, HTML report, presentation, etc.). Rmarkdown is distinguished by its ability to embed the whole analytical process, from raw data to charts, in the document without compromising its presentation. Proper data analysis will certainly result in high-quality content for your business report, and knowing the various data analysis techniques in R will greatly enhance our ability to use R for automated reporting to its fullest potential.

3.1.2 Training Objectives

This is the second course of Automate: Business Reporting with R. The objective of this course is to give participants a comprehensive introduction to basic data analysis in R. We will learn techniques for transforming raw data into a format that is easier to analyze and for using descriptive statistics to explain our data. The syllabus covers:

  • Data Science Workflow

  • Basic Programming in R

    • Object and Environment
    • Data Structures in R
    • Data Type in R
  • Tidying Messy Sheets in R

    • Importing Dataset
    • Data Cleansing
    • Data Transformation

3.1.3 Library and Setup

In this Library and Setup section you’ll see some code to initialize our workspace, and the packages we’ll be using for this project.

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library. R comes with a standard set of packages. Others are available for download and installation. Once installed, they have to be loaded into the session to be used.

You will need to use install.packages() to install any packages that are not yet installed on your machine. To install packages, type the command below in your console and press ENTER.

Then you need to load the package into your workspace using the library() function:
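A minimal sketch of both steps, using the packages this coursebook relies on later in the chapter (tidyverse and readxl):

```r
# run once per machine to install (the install line is commented out so it
# is not re-run every session):
# install.packages(c("tidyverse", "readxl"))

# load the packages into the current session:
library(tidyverse)
library(readxl)
```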


3.2 Data Science Workflow

Data analysis is closely linked with a buzzword many people talk about these days: Data Science. Data Science is a field that focuses on the study and development of how we analyze data. It comprises various data analysis activities that aim to transform data into valuable insights.

There is a certain workflow that applies to a data science project. Below is a good chart that explains the workflow of a data science project.

The first step of data science work is to import our data into our analytics toolset. There are various analytical tools available, but most data science work is performed using programming languages. In this workshop, we will be using R.

Once we’ve imported our data, we need to clean our data. Data cleaning means to store our data with the correct labels, values, and structure in the way it should be stored. Data cleaning is important because it helps us focus on our questions about the data, instead of struggling to get our data into the right form for different purposes.

Once we have clean data, the next step is to transform it. Data transformation is the process of converting data or information from one format to another. In data science work, it is usually related to narrowing our observations of interest from the rest of the data. This step is important because it helps us focus on our business questions.

For example, we might look at communities of people in Jakarta instead of in all of Indonesia. This is done by creating new data from our dataset: subsetting, summarizing, or adding new variables or information that explain our data.

Once we have transformed our data, we can perform some exploratory data analysis using descriptive statistics and visualize them through charts to improve our understanding of the data. We may also build a machine learning model to predict future values.

The last step of data science work is communication. It is a critical part of any data analysis project. In this workshop, we will communicate our findings on data through business reports using Rmarkdown.

3.3 Basic Programming in R

The first important step in learning data analysis in R is to recognize how R works and how to work with it through RStudio, an Integrated Development Environment (IDE) for R.

As an introduction, let’s write our first R code by typing print("Hello!") into the console (by default, it is located at the bottom-left of RStudio) and press ENTER. Another preferred method is to run it in the chunk of R Markdown; see the green button that says “Run Current Chunk”, or press CTRL + SHIFT + ENTER in the chunk:

#> [1] "Hello!"

print() is a function in R. A function is code that commands R to do something. In this example, print() tells R to “print” the argument we put inside the parentheses, namely "Hello!".

Also, note the # character before our code on several lines in the chunks above. Lines starting with # are comments and will not be executed as commands by R. If we run the chunk above, print("Hello Algoritma!") will be ignored by R.
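Putting the two ideas together, a minimal chunk might look like this (the second line is a comment, so R skips it):

```r
print("Hello!")
# print("Hello Algoritma!")   # a comment: R ignores this line entirely
```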

3.3.1 Object and Environment

"Hello!" in the previous chunk, was an object. Objects can be written raw, like the example above, or can be a value that assigned to an object that has a name, using an operator <-:

#> [1] "Programming"

After assigning it to an object, we can see a list of the objects we have created in the Environment Pane (by default, located in the upper-right of RStudio). We usually assign our data into an object and further analyze it in R.

It is also important to note that R is case-sensitive; for example, objects named “Algoritma” and “algoritma” are considered different:

#> [1] TRUE
#> [1] FALSE
#> [1] "Programming is one of the most therapeutic activities."

Every object name is unique, so if we create an object with an existing name, it replaces the old object:

#> [1] "Coding is one of the most therapeutic activities."

Notice that the word “Programming” has been replaced by “Coding”.
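The behaviors described above can be reproduced with a short sketch (the object name and sentence are taken from the outputs shown):

```r
activity <- "Programming"    # assign a value to a named object
activity == "Programming"    # TRUE:  R is case-sensitive, so
activity == "programming"    # FALSE: "Programming" != "programming"

activity <- "Coding"         # same name: the old value is replaced
paste(activity, "is one of the most therapeutic activities.")
```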

3.3.2 Data Structures in R

The object named activity that we created before is one of the basic data structures in R, namely a vector. The other data structures we will study are matrix and data.frame.

3.3.2.1 Vector

A vector is a 1-dimensional object (it only has length; it is not arranged in rows and columns). A vector can be one of these types:

  1. character
  2. numeric
  3. integer
  4. logical

A vector can only store values with the same type:

#> [1] "character"
#> [1] "numeric"

We use the function c() (combine) to merge two or more values/objects into one vector. We will see a lot of this function throughout our courses.

Every object has attributes. Some common attributes that we usually inspect are class, length, dim, colnames, etc.

We can use length() to inspect the length or the total number of values in a vector.

#> [1] 3

We will discuss more about dim and colnames later in this course when we talk about 2-dimensional object.
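A sketch of creating vectors with c() and inspecting their attributes (the name values here are hypothetical; the profit values appear later in this section):

```r
participants <- c("Samuel", "Tiara", "Husain")   # hypothetical names
class(participants)    # "character"
length(participants)   # 3

profit <- c(150, 87, 95)
class(profit)          # "numeric"
```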

3.3.2.1.1 Character

A character type vector will often be used to work with text data. For example, we have a list of names that we will send an email to:

#> [1] "Samuel Chan"           "Tiara Dwiputri"        "Ahmad Husain Abdullah"

We can use the paste() function to make the sentence “Dear XXX,”, programmatically:

#> [1] "Dear  Samuel Chan ,"           "Dear  Tiara Dwiputri ,"       
#> [3] "Dear  Ahmad Husain Abdullah ,"

As we can see from the example above, we can process text data by storing it in a character vector. Character vectors also open up a wide range of more advanced text analysis techniques.
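A sketch of the greeting example (the exact spacing in the output above suggests the original chunk used a slightly different separator, so treat this as an approximation):

```r
guest <- c("Samuel Chan", "Tiara Dwiputri", "Ahmad Husain Abdullah")

# paste() glues its arguments together element-wise, separated by a space
paste("Dear", guest, ",")
```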

3.3.2.1.2 Numeric

A numeric vector is often used for numerical data processing, especially mathematical operations. For example, suppose we have profit data from the last quarter of our business operations (3 months):

#> [1] 150  87  95

We can use mean() to calculate the average of our profits:

#> [1] 110.6667

R and its extensions have various functions that we can use to perform various kinds of mathematical operations and more complex statistics.
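The averaging step above can be sketched as follows, using the three profit values from the text:

```r
profit <- c(150, 87, 95)   # last quarter's monthly profits
mean(profit)               # average profit over the quarter
```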

3.3.2.1.3 Integer

An integer vector is basically the same as a numeric vector, except that it can only store whole numbers, not decimals or fractions.

#> [1]  0  1  4  7 10

Because integers are basically the same as numeric, we can perform mathematical operations using them.

However, the best practice is to use integers to store index information or identifiers, such as employee numbers, ID numbers, etc.
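A sketch using the values shown in the output above (as.integer() coerces numeric values to the integer type):

```r
ids <- as.integer(c(0, 1, 4, 7, 10))
class(ids)    # "integer"
ids * 2L      # arithmetic works just like with numeric vectors
```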

3.3.2.1.4 Logical

A logical vector corresponds to what is known as a boolean in other programming languages. It has 2 possible values (TRUE/FALSE, abbreviated T/F).

This type of vector is usually obtained from other operations rather than created directly. For example, suppose we have data on customers who are or are not eligible for a promo based on their loyalty points:

If, for example, the minimum number of points required to get a promo is 100 (cust_promo >= 100), we can use the function ifelse() to classify whether a customer will be given a promo or not:

#> [1]  TRUE FALSE  TRUE  TRUE
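A sketch matching the logical output above (the loyalty-point values are hypothetical, chosen to reproduce the TRUE/FALSE pattern):

```r
cust_promo <- c(120, 80, 100, 150)   # hypothetical loyalty points

cust_promo >= 100                    # a logical vector, as in the output above
ifelse(cust_promo >= 100, "promo", "no promo")
```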

3.3.2.1.5 Other Data Types: Factor & Date

During data processing in R, we will often use several more advanced data types, namely factor and date.

  • Factor

    A factor is a character vector that has levels, or categorical values. Each level can appear more than once in a given vector and therefore has a frequency.

#> [1] Type A Type A Type B Type A Type C Type C
#> Levels: Type A Type B Type C
  • Date

    Date is a special data type that contains date data in a standardized format.

#> [1] "2019-01-01" "2019-01-01" "2019-01-02"

3.3.2.2 Matrix

A matrix is one way to store 2-dimensional data in an object. As with vectors, a matrix can only store values of the same type. To make a matrix, we can use the function matrix():

#>      [,1] [,2]
#> [1,]   11   14
#> [2,]   12   15
#> [3,]   13   16

Notice that each value in a matrix has an index such as [1, 1], read as [row index, column index]. We can access a particular value in a matrix by its index.

#> [1] 11 14
#> [1] 13

To see the dimension of a matrix, we can use the dim() function and it will return the number of rows and columns in the matrix.

#> [1] 3 2

We can also retrieve or set the column names of a matrix-like object by using colnames(), while for the row names using rownames().
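The matrix shown above, its indexing, and its dimensions can be sketched as:

```r
m <- matrix(11:16, nrow = 3, ncol = 2)   # filled column by column
m
m[1, ]     # first row: 11 14
m[3, 1]    # value at row 3, column 1: 13
dim(m)     # 3 2
```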

3.3.2.3 Data Frame

In our time processing data, we will frequently be faced with data in the form of tables, such as the results of a survey of each employee's activities or a recapitulation of sales from each store branch. Data in the form of a table is also known in R as a data.frame. Similar to a matrix, it has 2 dimensions, namely rows (observations) and columns (variables), but its columns can store different data types.

For example, we can create a data.frame using the function data.frame():

#>       categories category_id
#> 1 OfficeSupplies         111
#> 2      Computers         112
#> 3      Packaging         113

In this course, we will be focusing more on this data type.
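The table shown above can be reproduced with a short sketch (the object name categories_df is introduced here for illustration):

```r
categories_df <- data.frame(
  categories  = c("OfficeSupplies", "Computers", "Packaging"),
  category_id = c(111, 112, 113)
)
categories_df
```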

3.4 Tidying Messy Sheets in R

We are now going to learn techniques for transforming raw data into a format that is easier to analyze. R provides various functions for processing datasets, from importing data to complex aggregations. These functions are compiled in packages; see the Library and Setup section to learn more about packages.

Along with the development of R, we can now use several packages that are currently considered best practice for processing data in R. These packages are bundled in the tidyverse package. Additionally, since many workplaces use Microsoft Excel to store data, we will also use the readxl package to import Excel spreadsheets into R.

To load packages into our R session, type:
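Based on the two packages named above:

```r
library(tidyverse)
library(readxl)
```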

3.4.1 Importing Dataset

R can read data in various formats. For example, by default R provides the function read.csv() to import data in comma-separated values (.csv) format. Data stored in other formats may need additional packages and functions for importing.

In this example, we use the readxl package to import data stored in Excel formats (.xlsx, .xls, etc.). We will use “data-attrition.xlsx” (located in /data of our working directory) as our data input and assign it to objects named attrition (for Sheet 1) and survey (for Sheet 2). This data stores employee attrition records and is commonly analyzed by Human Resources departments to understand which factors might influence employees' decisions to leave the company.

By default, the read_* functions will try to detect the data type of each column. We can use head() to see the first 6 rows of our data to confirm this:
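A sketch of the import step, assuming readxl has been loaded and the file sits at data/data-attrition.xlsx as described above (sheets are referenced by position here; sheet names would also work):

```r
attrition <- read_excel("data/data-attrition.xlsx", sheet = 1)
survey    <- read_excel("data/data-attrition.xlsx", sheet = 2)

head(attrition)
```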

employee_number age gender department job_role job_level education education_field monthly_income hourly_rate daily_rate monthly_rate percent_salary_hike performance_rating standard_hours over_time over_18 marital_status business_travel distance_from_home stock_option_level num_companies_worked total_working_years training_times_last_year years_at_company years_in_current_role years_since_last_promotion years_with_curr_manager attrition
1 41 female sales sales_executive 2 2 life_sciences 5993 94 1102 19479 11 3 80 yes y single travel_rarely 1 0 8 8 0 6 4 0 5 yes
2 49 male research_development research_scientist 2 1 life_sciences 5130 61 279 24907 23 4 80 no y married travel_frequently 8 1 1 10 3 10 7 1 7 no
4 37 male research_development laboratory_technician 1 2 other 2090 92 1373 2396 15 3 80 yes y single travel_rarely 2 0 6 7 3 0 0 0 0 yes
5 33 female research_development research_scientist 1 4 life_sciences 2909 56 1392 23159 11 3 80 yes y married travel_frequently 3 0 1 8 3 8 7 3 0 no
7 27 male research_development laboratory_technician 1 1 medical 3468 40 591 16632 12 3 80 no y married travel_rarely 2 1 9 6 3 2 2 2 2 no
8 32 male research_development laboratory_technician 1 2 life_sciences 3068 79 1005 11864 13 3 80 no y single travel_frequently 2 0 0 8 2 7 7 3 6 no
10 59 female research_development laboratory_technician 1 3 medical 2670 81 1324 9964 20 4 80 yes y married travel_rarely 3 3 4 12 3 1 0 0 0 no
11 30 male research_development laboratory_technician 1 1 life_sciences 2693 67 1358 13335 22 4 80 no y divorced travel_rarely 24 1 1 1 2 1 0 0 0 no
12 38 male research_development manufacturing_director 3 3 life_sciences 9526 44 216 8787 21 4 80 no y single travel_frequently 23 0 0 10 2 9 7 1 8 no
13 36 male research_development healthcare_representative 2 3 medical 5237 94 1299 16577 13 3 80 no y married travel_rarely 27 2 6 17 3 7 7 7 7 no
employee_number job_involvement relationship_satisfaction environment_satisfaction work_life_balance job_satisfaction
1 3 1 2 1 4
2 2 4 3 3 2
4 2 2 4 3 3
5 3 3 4 3 3
7 3 4 1 3 2
8 3 3 4 2 4
10 4 1 3 2 1
11 3 2 4 3 3
12 2 2 4 3 3
13 3 2 3 2 3

3.4.1.1 Relational Data

Many businesses have numerous data tables that must be combined to answer the business questions they are interested in. Multiple related tables are commonly called relational data because it is the relations between tables, not any single table in isolation, that matter. Here's an example of what relational data looks like:

There are two common families of verbs designed to work with relational data: mutating joins and filtering joins. In this section we will focus on mutating joins; look up the Introduction to Relational Data for more examples of joining data tables.

  • Mutating joins add new variables to one data frame from matching observations in another.

A mutating join allows you to combine variables from two tables. It first matches observations by their keys, then copies variables across from one table to the other. Here are some types of mutating join:

  1. inner join
  2. outer join
     • left_join
     • right_join
     • full_join

Below is an illustration of the different types of joins with a Venn diagram:

3.4.1.1.1 Inner Join

inner_join(x, y): All rows from x where there are matching values in y, and all columns from x and y.

We are going to use this type of join for our data frame.

employee_number age gender department job_role job_level education education_field monthly_income hourly_rate daily_rate monthly_rate percent_salary_hike performance_rating standard_hours over_time over_18 marital_status business_travel distance_from_home stock_option_level num_companies_worked total_working_years training_times_last_year years_at_company years_in_current_role years_since_last_promotion years_with_curr_manager attrition job_involvement relationship_satisfaction environment_satisfaction work_life_balance job_satisfaction
1 41 female sales sales_executive 2 2 life_sciences 5993 94 1102 19479 11 3 80 yes y single travel_rarely 1 0 8 8 0 6 4 0 5 yes 3 1 2 1 4
2 49 male research_development research_scientist 2 1 life_sciences 5130 61 279 24907 23 4 80 no y married travel_frequently 8 1 1 10 3 10 7 1 7 no 2 4 3 3 2
4 37 male research_development laboratory_technician 1 2 other 2090 92 1373 2396 15 3 80 yes y single travel_rarely 2 0 6 7 3 0 0 0 0 yes 2 2 4 3 3
5 33 female research_development research_scientist 1 4 life_sciences 2909 56 1392 23159 11 3 80 yes y married travel_frequently 3 0 1 8 3 8 7 3 0 no 3 3 4 3 3
7 27 male research_development laboratory_technician 1 1 medical 3468 40 591 16632 12 3 80 no y married travel_rarely 2 1 9 6 3 2 2 2 2 no 3 4 1 3 2
8 32 male research_development laboratory_technician 1 2 life_sciences 3068 79 1005 11864 13 3 80 no y single travel_frequently 2 0 0 8 2 7 7 3 6 no 3 3 4 2 4

3.4.1.1.3 Right Join

right_join(x, y): All rows from y and all columns from x and y. Rows in y with no matches in x will have NA values in the joined data frame.

name band plays
John Beatles guitar
Paul Beatles bass
Keith NA guitar
3.4.1.1.4 Full Join

full_join(x, y): All rows and all columns from both x and y. Where there are non-matching values, returns NA.

name band plays
Mick Stones NA
John Beatles guitar
Paul Beatles bass
Keith NA guitar
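The name/band/plays tables above match dplyr's built-in demo tables band_members and band_instruments; assuming those, the two joins can be reproduced as:

```r
library(dplyr)

# rows from band_instruments, columns from both tables
right_join(band_members, band_instruments, by = "name")

# all rows from both tables; non-matching cells become NA
full_join(band_members, band_instruments, by = "name")
```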

3.4.1.2 [Opt] Working with Database

In real life, data is stored in many different ways: some in CSV files, some in Excel, some on a server, some retrieved from an API. A useful skill to pick up is how to connect to these data sources from R. Most structured data is stored in relational databases, and SQL is the common language for querying them. In this example, I'm going to show you how to connect to a remote Microsoft SQL Server database. Do note that connecting to a database requires the appropriate driver; here we will use an ODBC connection. In most cases, R can connect to major databases with available drivers, such as Microsoft SQL Server, SAP, and SPSS.

The driver we want to use can be loaded using odbc and some other SQL-related packages. If you are interested in trying this, start by installing the packages first:

Then call them using library():

Once connected, you can write a chunk that runs an SQL query using the connection object we just made, or use the connection directly in an R session. Here is an example of how to make a connection:
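A sketch of such a connection, assuming the DBI and odbc packages are installed; every server detail below (host, database, credentials, table name) is a hypothetical placeholder to be replaced with your own:

```r
library(DBI)
library(odbc)

# hypothetical connection details -- replace with your own
con <- dbConnect(odbc(),
                 Driver   = "SQL Server",
                 Server   = "sql.example.com",
                 Database = "company_db",
                 UID      = "analyst",
                 PWD      = "secret",
                 Port     = 1433)

# run a query and fetch the result as a data.frame
result <- dbGetQuery(con, "SELECT TOP 10 * FROM employees")

# close the connection when done
dbDisconnect(con)
```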

3.4.2 Data Cleansing

After importing our data, we can perform data cleansing. These are some criteria that make up clean data:

  • Each variable is placed in its own column, each observation in its own row, and each value in its own cell.

    This is a very important aspect of clean data. This helps us perform various functions and analysis correctly. An example of this is in the figure below:

Source: Grolemund & Wickham, 2016.

  • Standardized column names

    R has programming style guides that aim to make R code easier to read, share, and verify. Examples of R programming style guides are Google's R Style Guide and the Tidyverse Style Guide. Our column names should at least follow such a style guide; for example, name columns in lowercase and separate words with a dot (.) or underscore (_).

  • Consistent naming for character-type values (avoids duplicate data)

  • Numerical values should not contain characters and vice versa

  • Date-type variables should use one of these formats: ymd or ymd_hms

Below is an example of ‘dirty’ data:

Kode SITC 3 digit 01-01-2017 02-01-2017 03-01-2017 04-01-2017 05-01-2017 06-01-2017 07-01-2017 08-01-2017
001 Live animals other than animals of division 03 102.77 95.34 100.09 98.38 104.27 100.43 103.84 111.69
012 Other meat and edible meat offal, fresh, chilled or frozen (except meat and meat offal unfit or unsuitable for human consumption) 132 161.21 115.02 108.81 107.21 99.99 111.43 102.23
016 Meat and edible meat offal, salted, in brine, dried or smoked; edible flours and meals of meat or meat offal 299.17 101.4 86.61 29.37 62.57 107.22 176.28 383.32
017 Meat and edible meat offal, prepared or preserved, not elsewhere specified 117 179.67 248.48 153.21 121.5 460.65 161.26 126.28
022 Milk and cream and milk products other than butter or cheese 96.62 129.2 98.67 100.31 107.95 91.81 94.09 91.02
023 Butter and other fats and oils derived from milk 2 721.92 85.01 175.95 7 153.54 63.33 139.16 40.26 102.77
024 Cheese and curd 74.47 105.21 61.41 85.58 20.64 88.45 97.43 93.82
034 Fish, fresh (live or dead), chilled or frozen 127.81 120.58 137.06 149.04 212.62 136.2 128.64 121.34
035 Fish, dried, salted or in brine; smoked fish (whether or not cooked before or during the smoking process); flours, meals and pellets of fish, fit for human consumption 141.31 249.47 128.43 160.96 148.65 196.31 155.64 148.07
036 Crustaceans, molluscs and aquatic invertebrates, whether in shell or not, fresh (live or dead), chilled, frozen, dried, salted or in brine; crustaceans, in shell, cooked by steaming or boiling, whether or not chilled, frozen, dried, salted or in brine 157.06 301.47 109.3 181.98 173.3 135.85 127.1 360.75

This data is considered 'dirty' because:

  • the variables were placed in rows instead of columns,
  • date information is stored as column names (which prevents us from utilizing it as date data),
  • the numerical values are still stored as characters (chr),
  • the numerical values are written inconsistently (spaces used as separators, decimals not applied to all values, etc.)

On the contrary, here is an example of 'clean' data:

code date export
001 Live animals other than animals of division 03 2017-01-01 102.77
001 Live animals other than animals of division 03 2017-01-02 95.34
001 Live animals other than animals of division 03 2017-01-03 100.09
001 Live animals other than animals of division 03 2017-01-04 98.38
001 Live animals other than animals of division 03 2017-01-05 104.27
001 Live animals other than animals of division 03 2017-01-06 100.43

The above data was cleaned by changing its structure using the tidyr package. The date information that was previously scattered across columns is now stored in a single date column. Additionally, the date information was converted to the correct date data type using the lubridate package. These packages are part of tidyverse and were loaded into our session when we loaded tidyverse earlier. Although we will not use every function and package provided in tidyverse, we will discuss the basic concepts and practices of data cleaning and transformation in R. This understanding will make it much easier to explore the many possibilities of data analysis in R in the long run!

Now, let's look back at our data. We can use the function glimpse() from the dplyr package to inspect the structure of our data.
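Assuming the imported attrition object is in our session, the inspection is a single call:

```r
glimpse(attrition)
```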

#> Observations: 1,470
#> Variables: 34
#> $ employee_number            <dbl> 1, 2, 4, 5, 7, 8, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20,...
#> $ age                        <dbl> 41, 49, 37, 33, 27, 32, 59, 30, 38, 36, 35, 29, 31, 34, 2...
#> $ gender                     <chr> "female", "male", "male", "female", "male", "male", "fema...
#> $ department                 <chr> "sales", "research_development", "research_development", ...
#> $ job_role                   <chr> "sales_executive", "research_scientist", "laboratory_tech...
#> $ job_level                  <dbl> 2, 2, 1, 1, 1, 1, 1, 1, 3, 2, 1, 2, 1, 1, 1, 3, 1, 1, 4, ...
#> $ education                  <dbl> 2, 1, 2, 4, 1, 2, 3, 1, 3, 3, 3, 2, 1, 2, 3, 4, 2, 2, 4, ...
#> $ education_field            <chr> "life_sciences", "life_sciences", "other", "life_sciences...
#> $ monthly_income             <dbl> 5993, 5130, 2090, 2909, 3468, 3068, 2670, 2693, 9526, 523...
#> $ hourly_rate                <dbl> 94, 61, 92, 56, 40, 79, 81, 67, 44, 94, 84, 49, 31, 93, 5...
#> $ daily_rate                 <dbl> 1102, 279, 1373, 1392, 591, 1005, 1324, 1358, 216, 1299, ...
#> $ monthly_rate               <dbl> 19479, 24907, 2396, 23159, 16632, 11864, 9964, 13335, 878...
#> $ percent_salary_hike        <dbl> 11, 23, 15, 11, 12, 13, 20, 22, 21, 13, 13, 12, 17, 11, 1...
#> $ performance_rating         <dbl> 3, 4, 3, 3, 3, 3, 4, 4, 4, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
#> $ standard_hours             <dbl> 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 80, 8...
#> $ over_time                  <chr> "yes", "no", "yes", "yes", "no", "no", "yes", "no", "no",...
#> $ over_18                    <chr> "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y", "y...
#> $ marital_status             <chr> "single", "married", "single", "married", "married", "sin...
#> $ business_travel            <chr> "travel_rarely", "travel_frequently", "travel_rarely", "t...
#> $ distance_from_home         <dbl> 1, 8, 2, 3, 2, 2, 3, 24, 23, 27, 16, 15, 26, 19, 24, 21, ...
#> $ stock_option_level         <dbl> 0, 1, 0, 0, 1, 0, 3, 1, 0, 2, 1, 0, 1, 1, 0, 1, 2, 2, 0, ...
#> $ num_companies_worked       <dbl> 8, 1, 6, 1, 9, 0, 4, 1, 0, 6, 0, 0, 1, 0, 5, 1, 0, 1, 2, ...
#> $ total_working_years        <dbl> 8, 10, 7, 8, 6, 8, 12, 1, 10, 17, 6, 10, 5, 3, 6, 10, 7, ...
#> $ training_times_last_year   <dbl> 0, 3, 3, 3, 3, 2, 3, 2, 2, 3, 5, 3, 1, 2, 4, 1, 5, 2, 3, ...
#> $ years_at_company           <dbl> 6, 10, 0, 8, 2, 7, 1, 1, 9, 7, 5, 9, 5, 2, 4, 10, 6, 1, 2...
#> $ years_in_current_role      <dbl> 4, 7, 0, 7, 2, 7, 0, 0, 7, 7, 4, 5, 2, 2, 2, 9, 2, 0, 8, ...
#> $ years_since_last_promotion <dbl> 0, 1, 0, 3, 2, 3, 0, 0, 1, 7, 0, 0, 4, 1, 0, 8, 0, 0, 3, ...
#> $ years_with_curr_manager    <dbl> 5, 7, 0, 0, 2, 6, 0, 0, 8, 7, 3, 8, 3, 2, 3, 8, 5, 0, 7, ...
#> $ attrition                  <chr> "yes", "no", "yes", "no", "no", "no", "no", "no", "no", "...
#> $ job_involvement            <dbl> 3, 2, 2, 3, 3, 3, 4, 3, 2, 3, 4, 2, 3, 3, 2, 4, 4, 4, 2, ...
#> $ relationship_satisfaction  <dbl> 1, 4, 2, 3, 4, 3, 1, 2, 2, 2, 3, 4, 4, 3, 2, 3, 4, 2, 3, ...
#> $ environment_satisfaction   <dbl> 2, 3, 4, 4, 1, 4, 3, 4, 4, 3, 1, 4, 1, 2, 3, 2, 1, 4, 1, ...
#> $ work_life_balance          <dbl> 1, 3, 3, 3, 3, 2, 2, 3, 3, 2, 3, 3, 2, 3, 3, 3, 2, 2, 3, ...
#> $ job_satisfaction           <dbl> 4, 2, 3, 3, 2, 4, 1, 3, 3, 3, 2, 3, 3, 4, 3, 1, 2, 4, 4, ...

The result above shows that our data consists of 1,470 rows (observations) and 34 columns (variables), where each column can have a different data type. This kind of information is very helpful for understanding our data. We should also understand the description of each column to evaluate whether it has the correct data type and whether it holds valuable information. While most of the column names are self-explanatory, below are additional descriptions for certain columns:

  • attrition: employee’s churn status (yes/no)
  • education: employee’s education level (1 = Below College, 2 = College, 3 = Bachelor, 4 = Master, 5 = Doctor)
  • employee_count: employee’s count for respective id
  • employee_number: employee’s id
  • num_companies_worked: number of companies worked at
  • stock_option_level: how many company stocks employees can own from the company

From the result above, we can see that our data's structure is quite tidy. Each variable is placed in its own column, each observation in its own row, and each value in its own cell. Even so, some columns are not stored with the correct data type. The column education may be better stored as a factor, since it represents categorical values rather than a continuous scale. All columns stored as character (chr) should also be stored as factors.

The dplyr package from tidyverse provides various functions for data cleaning, such as making or changing columns. Beyond that, dplyr also provides a consistent grammar of data manipulation that simplifies our code.

We can change the column data types using the following command; we will break down each step below.

Breakdown

  • The first thing we need to understand is the grammar of coding with tidyverse. At the end of each command in the chunk above there is a pipe operator %>% that allows users to chain operations. In a chain, the data on the left side of %>% becomes the input of the next function/command.

  • The first function we used above is mutate(). This function is used to transform a column or to create a new one. In this case, we use mutate() to transform education, job_level, and stock_option_level into the factor data type using the function as.factor().

  • There is also the function mutate_if(), which transforms all columns that satisfy a certain condition, similar to conditional formatting in Excel. In this case, every column that stores a character data type (is.character) is transformed into a factor using as.factor().
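A self-contained sketch of the steps above, using a tiny stand-in data frame (the column names follow the attrition data; the values are made up):

```r
library(dplyr)

demo <- data.frame(
  education          = c(2, 1, 4),
  job_level          = c(2, 2, 1),
  stock_option_level = c(0, 1, 0),
  gender             = c("female", "male", "male"),
  stringsAsFactors   = FALSE
)

demo_clean <- demo %>%
  mutate(education          = as.factor(education),
         job_level          = as.factor(job_level),
         stock_option_level = as.factor(stock_option_level)) %>%
  mutate_if(is.character, as.factor)   # every chr column becomes a factor

str(demo_clean)
```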

Now let’s see the first 6 observations of our data to check whether our data is already clean enough.

employee_number age gender department job_role job_level education education_field monthly_income hourly_rate daily_rate monthly_rate percent_salary_hike performance_rating standard_hours over_time over_18 marital_status business_travel distance_from_home stock_option_level num_companies_worked total_working_years training_times_last_year years_at_company years_in_current_role years_since_last_promotion years_with_curr_manager attrition job_involvement relationship_satisfaction environment_satisfaction work_life_balance job_satisfaction
1 41 female sales sales_executive 2 2 life_sciences 5993 94 1102 19479 11 3 80 yes y single travel_rarely 1 0 8 8 0 6 4 0 5 yes 3 1 2 1 4
2 49 male research_development research_scientist 2 1 life_sciences 5130 61 279 24907 23 4 80 no y married travel_frequently 8 1 1 10 3 10 7 1 7 no 2 4 3 3 2
4 37 male research_development laboratory_technician 1 2 other 2090 92 1373 2396 15 3 80 yes y single travel_rarely 2 0 6 7 3 0 0 0 0 yes 2 2 4 3 3
5 33 female research_development research_scientist 1 4 life_sciences 2909 56 1392 23159 11 3 80 yes y married travel_frequently 3 0 1 8 3 8 7 3 0 no 3 3 4 3 3
7 27 male research_development laboratory_technician 1 1 medical 3468 40 591 16632 12 3 80 no y married travel_rarely 2 1 9 6 3 2 2 2 2 no 3 4 1 3 2
8 32 male research_development laboratory_technician 1 2 life_sciences 3068 79 1005 11864 13 3 80 no y single travel_frequently 2 0 0 8 2 7 7 3 6 no 3 3 4 2 4
10 59 female research_development laboratory_technician 1 3 medical 2670 81 1324 9964 20 4 80 yes y married travel_rarely 3 3 4 12 3 1 0 0 0 no 4 1 3 2 1
11 30 male research_development laboratory_technician 1 1 life_sciences 2693 67 1358 13335 22 4 80 no y divorced travel_rarely 24 1 1 1 2 1 0 0 0 no 3 2 4 3 3
12 38 male research_development manufacturing_director 3 3 life_sciences 9526 44 216 8787 21 4 80 no y single travel_frequently 23 0 0 10 2 9 7 1 8 no 2 2 4 3 3
13 36 male research_development healthcare_representative 2 3 medical 5237 94 1299 16577 13 3 80 no y married travel_rarely 27 2 6 17 3 7 7 7 7 no 3 2 3 2 3

We can see that our attrition data is clean enough, so we can continue to the data transformation step.

3.4.3 Data Transformation

Data transformation is the process of converting data from one format or structure into another. It is a useful tool for Exploratory Data Analysis (EDA), which aims to gain insight and improve our understanding of data by looking at it from a more detailed or narrower perspective, based on our business question. The package dplyr also provides various functions for data transformation, such as making/changing columns, subsetting columns/rows, and summarizing data.

To put it into practice, let’s try performing data transformation on our data. Imagine yourself as an HR staff member who wants to evaluate employee attrition in the company. We would like to know:

Which department gives the highest number of employee attrition?

The following is the complete code that tells R to perform the data transformation. We will break down each step and explain it in detail.
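The pipeline looks along these lines. A hypothetical mini data frame is used here so the block runs on its own; on the full attrition data it produces the department counts shown below:

```r
library(dplyr)

# hypothetical mini version of the cleaned attrition data
employee <- data.frame(
  attrition  = c("yes", "no", "yes", "yes", "no"),
  department = c("sales", "sales", "research_development",
                 "research_development", "human_resources")
)

attrition_by_dept <- employee %>%
  group_by(attrition, department) %>%  # group rows by both columns
  summarise(count = n()) %>%           # count rows in each group
  ungroup() %>%                        # drop the grouping
  filter(attrition == "yes") %>%       # keep leaving employees only
  select(-attrition) %>%               # drop the now-constant column
  arrange(desc(count))                 # sort by count, largest first

attrition_by_dept
```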

department count
research_development 133
sales 92
human_resources 12

Breakdown

  • In the example above, the first thing we do is create a summary of our data using group_by(), summarise(), and ungroup(). The function group_by() takes existing data (columns) and converts it into a ‘grouped table’ where operations are performed ‘by group’. The function summarise() then performs a summary statistic calculation, in this case a count using n(), on the ‘grouped table’ from the previous chaining operation. The function ungroup() removes the grouping condition. Together, these functions work much like a pivot table in Excel.
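This summarising step alone can be sketched as follows (again on a hypothetical mini data frame; the full data produces the table below):

```r
library(dplyr)

# hypothetical mini version of the attrition data
employee <- data.frame(
  attrition  = c("yes", "no", "no", "yes"),
  department = c("sales", "sales", "human_resources", "sales")
)

summary_tbl <- employee %>%
  group_by(attrition, department) %>%  # one group per combination
  summarise(count = n()) %>%           # count rows in each group
  ungroup()                            # drop the grouping

summary_tbl
```

Note that summarise() already peels off the innermost grouping level; the explicit ungroup() guarantees that no grouping remains before the next steps.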
attrition department count
no human_resources 51
no research_development 828
no sales 354
yes human_resources 12
yes research_development 133
yes sales 92
  • The next function is filter(). We can use this function to subset rows of our data based on specific conditions. In this case, we keep only the rows whose attrition value is ‘yes’.
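A sketch of this step, starting from the summarised counts above:

```r
library(dplyr)

# the summarised counts from the previous step
summary_tbl <- data.frame(
  attrition  = c("no", "no", "no", "yes", "yes", "yes"),
  department = c("human_resources", "research_development", "sales",
                 "human_resources", "research_development", "sales"),
  count      = c(51, 828, 354, 12, 133, 92)
)

leaving <- summary_tbl %>%
  filter(attrition == "yes")  # keep rows where attrition is "yes"

leaving
```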
attrition department count
yes human_resources 12
yes research_development 133
yes sales 92
  • Next, we can use select() to remove the column attrition from our data; a minus sign in front of a column name drops that column.
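Sketched on the filtered rows from the previous step:

```r
library(dplyr)

leaving <- data.frame(
  attrition  = c("yes", "yes", "yes"),
  department = c("human_resources", "research_development", "sales"),
  count      = c(12, 133, 92)
)

dept_count <- leaving %>%
  select(-attrition)  # the minus sign drops the named column

dept_count
```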
department count
human_resources 12
research_development 133
sales 92
  • The final step is to sort our data using arrange(). By default arrange() sorts in ascending order; to sort in descending order, we wrap the column in the helper function desc():
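Sketched on the two remaining columns:

```r
library(dplyr)

dept_count <- data.frame(
  department = c("human_resources", "research_development", "sales"),
  count      = c(12, 133, 92)
)

sorted <- dept_count %>%
  arrange(desc(count))  # largest count first

sorted
```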
department count
research_development 133
sales 92
human_resources 12

From the result above, we know that the department with the highest number of employee attrition is Research and Development. This department may need further analysis regarding its employee attrition.

The example above shows how we can summarise our full data into a simplified version that displays only the number of leaving employees (attrition = “yes”) per department. Others might prefer to display both staying and leaving employees from all departments, but in a more compact structure. To do that, we will reuse some of the code we wrote earlier, but without the filtering step.

Instead, we will transform our data into a wider format using the function pivot_wider() from the tidyr package. This function spreads the values of the attrition column into 2 different columns and fills them with the values from the count column.
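A sketch of this step, starting from the unfiltered summary of counts:

```r
library(dplyr)
library(tidyr)

summary_tbl <- data.frame(
  attrition  = c("no", "no", "no", "yes", "yes", "yes"),
  department = c("human_resources", "research_development", "sales",
                 "human_resources", "research_development", "sales"),
  count      = c(51, 828, 354, 12, 133, 92)
)

wide <- summary_tbl %>%
  pivot_wider(names_from  = attrition,  # values of attrition become columns
              values_from = count)      # cells are filled from count

wide
```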

department no yes
human_resources 51 12
research_development 828 133
sales 354 92

Alternatively, we can transform data from its wide format back into a long format using pivot_longer():
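Sketched on the wide table from the previous step:

```r
library(dplyr)
library(tidyr)

wide <- data.frame(
  department = c("human_resources", "research_development", "sales"),
  no         = c(51, 828, 354),
  yes        = c(12, 133, 92)
)

long <- wide %>%
  pivot_longer(cols      = c(no, yes),   # columns to gather back together
               names_to  = "attrition",  # their names become this column
               values_to = "count")      # their cells become this column

long
```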

department attrition count
human_resources no 51
human_resources yes 12
research_development no 828
research_development yes 133
sales no 354
sales yes 92

There are many more functions and technical details that we can explore further from tidyverse. If you are interested, you can visit the following articles:

Dive Deeper

The grammar of code from dplyr is very powerful for Exploratory Data Analysis (EDA). Using the attrition data, try to answer the following questions to sharpen your intuition for data analysis in R:

  1. From the Research and Development department, which job role gives the highest number of employee attrition, and what is their average relationship satisfaction? What insight can we draw from the result?
  2. Is there gender inequality in your company? Calculate the percentage of female and male employees in each department!

You can also create your own question and try to answer it to sharpen your analytical skills!