Chapter 4 Statistical Report

library(tidyverse)
library(rsample)
library(caret)

In general statistics reports, there are 2 main parts, descriptive statistics and inferential statistics (analysis test). To provide you how to create a statistical analysis report, we will divide some of the content into:

Statistical Descriptive
- Measure of Central Tendency
- Measure of Spread
- Measure of Relationship between data
Statistical Inferencial: Statistics Model Focused
- Motivation and Definition “Machine learning”
- Employee Attrition Analysis
- How to interpret analysis result
- Convert statistical object into textual report with easystat’s style

4.1 Statistical Descriptive

Statisticians and data scientists use descriptive statistics to summarize and describe a large number of measurements. Many times, this task is accompanied with graphs and plots that help describe the numerical summary of data. When data science is applied in the business context, an example of descriptive statistic is the average number of transactions per month. Another example is the percentage of e-commerce transactions with a voucher code applied. The simple rule is that descriptive statistics do not involve generalizing beyond the data we have obtained, and are merely descriptive of what we have at hand.

4.1.1 Measure of Central Tendency

The measure of central tendency enable us to compare two or more distribution pertaining to the same time period or within the same distribution over time.

attrition <- read.csv("data_input/data-attrition.csv")

head(attrition, 10)

attrition	age	business_travel	daily_rate	department	distance_from_home	education	education_field	employee_count	employee_number	environment_satisfaction	gender	hourly_rate	job_involvement	job_level	job_role	job_satisfaction	marital_status	monthly_income	monthly_rate	num_companies_worked	over_18	over_time	percent_salary_hike	performance_rating	relationship_satisfaction	standard_hours	stock_option_level	total_working_years	training_times_last_year	work_life_balance	years_at_company	years_in_current_role	years_since_last_promotion	years_with_curr_manager
yes	41	travel_rarely	1102	sales	1	2	life_sciences	1	1	2	female	94	3	2	sales_executive	4	single	5993	19479	8	y	yes	11	3	1	80	0	8	0	1	6	4	0	5
no	49	travel_frequently	279	research_development	8	1	life_sciences	1	2	3	male	61	2	2	research_scientist	2	married	5130	24907	1	y	no	23	4	4	80	1	10	3	3	10	7	1	7
yes	37	travel_rarely	1373	research_development	2	2	other	1	4	4	male	92	2	1	laboratory_technician	3	single	2090	2396	6	y	yes	15	3	2	80	0	7	3	3	0	0	0	0
no	33	travel_frequently	1392	research_development	3	4	life_sciences	1	5	4	female	56	3	1	research_scientist	3	married	2909	23159	1	y	yes	11	3	3	80	0	8	3	3	8	7	3	0
no	27	travel_rarely	591	research_development	2	1	medical	1	7	1	male	40	3	1	laboratory_technician	2	married	3468	16632	9	y	no	12	3	4	80	1	6	3	3	2	2	2	2
no	32	travel_frequently	1005	research_development	2	2	life_sciences	1	8	4	male	79	3	1	laboratory_technician	4	single	3068	11864	0	y	no	13	3	3	80	0	8	2	2	7	7	3	6
no	59	travel_rarely	1324	research_development	3	3	medical	1	10	3	female	81	4	1	laboratory_technician	1	married	2670	9964	4	y	yes	20	4	1	80	3	12	3	2	1	0	0	0
no	30	travel_rarely	1358	research_development	24	1	life_sciences	1	11	4	male	67	3	1	laboratory_technician	3	divorced	2693	13335	1	y	no	22	4	2	80	1	1	2	3	1	0	0	0
no	38	travel_frequently	216	research_development	23	3	life_sciences	1	12	4	male	44	2	3	manufacturing_director	3	single	9526	8787	0	y	no	21	4	2	80	0	10	2	3	9	7	1	8
no	36	travel_rarely	1299	research_development	27	3	medical	1	13	3	male	94	3	2	healthcare_representative	3	married	5237	16577	6	y	no	13	3	2	80	2	17	3	2	7	7	7	7

4.1.1.1 Mean

Often times in the exploratory data analysis phase, we want to get a sense of what the most representative score of a particular measurement is. We often simplify this idea by referring to it as the “average”, but there are in fact, three measures of central tendency that you need to have in your statistical toolset.

The most popular measure of central tendency is the mean, which is sometimes represented as \(\bar x\) when computed on a sample and represented as \(\mu\) when computed on a population. Mean is really the sum of all your measurements, divided by the number of measurements, and works best on data that has an even distribution or a normal distribution (don’t worry if the idea of a normal distribution isn’t clear - we’ll get to that in a while!). In R, the mean function will return the mean:

sum(attrition$monthly_income)/length(attrition$monthly_income)

#> [1] 6502.931

mean(attrition$monthly_income)

#> [1] 6502.931

Mean is based on all the items in a series, a change in the value of any item will lead to a change in the value of the mean. So in the case of highly skewed distribution, the mean may get distorted on account of a few items with extreme values. In such a case, it may not appropriate for represent the characteristics of the distribution.

4.1.1.2 Median

The median is the point of value that cuts the distribution into two equal halves such that 50% of the observations are below it. To find this value, we would order the observations and find the middle value that separates the distribution into two equal halves.

median(attrition$monthly_income)

#> [1] 4919

We need to be cautious when applying the mean on data with a skewed distribution because the mean may not be the best candidate for a most representative score compared to other measures of central tendency. For example, a company surveys its employees household income and posted the following monthly household income (IDR, in Mil):

salary <- c(7.8, 7.5, 6, 7.5, 4.5, 105, 45, 7.5, 5.5, 4)
mean(salary)

#> [1] 20.03

median(salary)

#> [1] 7.5

While the median puts that figure at about 7.25, the mean is about 2.67 times higher and is not truly representative of the actual household income. While most of the employees have a combined household earning of less than 8 mil, the mean value of our household income would have believe that the average household income of our employees is in fact more than 20 mil IDR.

The median in this case is a better measure of centrality because it is not sensitive to the outlier data.

If we are in fact, required to compute the mean on data with skewed distribution, another technique to reduce the influence of outlier data is to use a slight variation of the mean, called the Trimmed Mean. The trimmed mean removes a small designated percentage of the largest and smallest values before computing the mean. A trimmed mean that computes the middle 95% of the distribution can be performed in R fairly easily:

# 5% of observations to be trimmed
mean(salary, trim = 0.2)

#> [1] 6.966667

4.1.1.3 Mode

When there are discreet values for a variable, the mode refers to the value that occurs most frequently. This statistic is rarely used in practice

most <- function(x){
  names(sort(-table(x)))[1]
}

most(attrition$education_field)

#> [1] "life_sciences"

Because R do not have a built-in way of computing the mode, we wrote the code above to tabulate our data (using table), multiply the calculation by -1 (and hence giving it the effect of sorting our data in descending order) and then pick the first value.

Dive Deeper

The following data give the savings bank accounts balances of nine sample household selected in a survey.

savings <- c(745, 2000, 1500, 68000, 461, 549, 3750, 1800, 4795)

Find the mean and the median for these data
Do these data contain an outlier? if so, what we can do to dealing with extreme number?
Which of these two summary measures is more appropirate for this series?

4.1.2 Measures of Spread

In the previous chapter, we have explained the measures of central tendency. It may be noted that these measures do not indicate the extent of dispersion or variability in a distribution. Measures of spread measures the extent to which value in a distribution differ from each other. In practice, it is far easier to compute the distance between the values to their mean and when we square each one of these distances and add them all up the average1 of that result is known as variance. Taking the square root of the variance will result in the standard deviation. Just like the mean, standard deviation is the “expected value” of how far the scores deviate from the mean.

4.1.2.1 Variance

hourly_rate <- attrition$hourly_rate
sum((hourly_rate - mean(hourly_rate))^2)/(length(hourly_rate) - 1)

#> [1] 413.2856

var(hourly_rate)

#> [1] 413.2856

4.1.2.2 Standard Deviation

And taking the square root of variance yields the standard deviation:

sqrt(var(hourly_rate))

#> [1] 20.32943

sd(hourly_rate)

#> [1] 20.32943

Variance and standard deviation are always positive when the values are not identical. When there’s no variability, the variance is 0. Because variance and standard deviation are sensitive to every value, they may not be the most “representative” measurement for skewed data.

4.1.2.3 Range

Other measurements of the spread are the range and the interquartile range The range is the distance from our smallest measurement to the largest one:

max(hourly_rate) - min(hourly_rate)

#> [1] 70

diff(range(hourly_rate))

#> [1] 70

4.1.2.4 IQR

The interquartile range is the range computed for the middle 50% of the distribution:

IQR(hourly_rate)

#> [1] 35.75

as.numeric(quantile(hourly_rate, 0.75) - quantile(hourly_rate, 0.25))

#> [1] 35.75

While we can use quantile() to obtain the 3rd and 1st quartile individually, these two figures are also presented together with the 0th (the min()), 50th (the median()), and the 100th (the max()) quartiles: together they are called the five-number summary.

When we call fivenum(), we get this summary that we can use as a measure of variation for even potentially skewed data:

fivenum(attrition$hourly_rate)

#> [1]  30  48  66  84 100

From the above, observe that the absolute lowest profit (the min) approximates 6,600 and the highest profit approximates 8,400 (the max); Observe also that 25% of our transactions make a profit of less than 1.728. Half of the transactions (the middle 50% of the value) make a profit between 1.728 and 29.364 - recall that this range is called the interquartile range (IQR). When we use summary() on continuous data, we’ll get the five number summary and the mean in return:

summary(attrition$hourly_rate)

#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   30.00   48.00   66.00   65.89   83.75  100.00

4.1.2.5 Uses of the measure of spread

Discussion:

In a small business firm, two typist of employed. Typist A and typist B. Here is the data number of typed pages from both typist in 10 days. Which typist shows greater consistency in his output?

typist_a <- c(28.1, 30.4, 34.2, 30.2, 32.2, 34.7, 32.9, 29.9, 33.7, 39.1)
typist_b <- c(72.2, 48.0, 50.4, 32.2, 40.6, 59.7, 63.6, 31.2, 49.6, 31.0)

Solution:

paste("mean typist A:", mean(typist_a))

#> [1] "mean typist A: 32.54"

paste("mean typist B:", mean(typist_b))

#> [1] "mean typist B: 47.85"

paste("Standard Deviation typist A:", round(var(typist_a), digits = 2))

#> [1] "Standard Deviation typist A: 9.86"

paste("Standard Deviation typist B:", round(var(typist_b), digits = 2))

#> [1] "Standard Deviation typist B: 205.51"

These calculation indicate that althoufh typist B types out more pages on average (47.85 per a day), there is a greater variation in his output as compared to that of typist A. We can say this in a different way: Though typist A’s daily output is much less, he is more consistent than typist B.

Discussion 2:

Which financial assets has more votality in their annual price?

price.coins <- c(1.4, 0.4, 0.8, 1.1, 1.8, 2.2, 2.3, 1.2)
price.oil <- c(1.6, 1.2, 1.9, 0.8, 0.6, 1.5, 2.1, 1.5)

The primary measure of votality used by stock traders and financial analysts is standard deviation, and recall that this metric reflects the average amount of an item’s price over a period of time. While the price for our fictional “oil” asset and “coins” asset averaged out to be USD 1.4 over time, which of these two present a higher votality than the other?

4.1.3 Covariance and Correlation

Statistical methods of measures of central tendency, measure of spread are helpful for the purpose of comparison and analysis of distribution involving only one variable. However, decribing the relationship between two or more variables, is another important part of statistics. In many business research situations, the key of to decision making lies in understanding relationship between two or more variables. The statistical methods of Covariance and Correlation are helpful in knowing the relationship between two or more vairbles.

In all these cases involving two or more variables, we may be interested in seeing:

if there is any association between the variables;
if there is an association, is it strong enough to be useful;
if so, what form the relationship between the two variables takes

4.1.3.1 Covariance

When we have two samples, X and Y, of the same size, then the covariance is an estimate of how variation in X is related to the variation in Y. Covariance measures how two variables covary and is represented as:

\[Cov(X,Y) = \frac{1}{n-1}\sum_{i = 1}^{i}(X_i - \mu x)(Y_i - \mu y)\]

years_at_company <- attrition$years_at_company
monthly_income <- attrition$monthly_income

sum((years_at_company - mean(years_at_company))*(monthly_income - mean(monthly_income)))/(length(monthly_income)-1)

#> [1] 14833.73

cor(years_at_company, monthly_income)

#> [1] 0.5142848

Getting a positive covariance means that higher X tends to be associated with larger Y (and vice versa). The covariance of any variable with itself is its variance. Notice also that cov(X,Y) = cov(Y,X). Notice also that cov(X,Y) = cov(Y,X).

cov(years_at_company, years_at_company)

#> [1] 37.53431

var(years_at_company)

#> [1] 37.53431

But there is a problem with covariances: they are hard to compare because variables are sometimes expressed in different units or scales. It is hard to tell if there is an “objectively stronger” variance between UK Pound and Indonesia Rupiah or bitcoin prices and the US dollar because the “scale” at which we measure and compute the covariance on is different.

One solution is to “normalize” the covariance: we divide the covariance by something that encapsulate the scale in both the covariates, leading us up to a value that is bounded to the range of -1 and +1. This is the correlation.

4.1.3.2 Correlation

Whatever units our original variables were in, this transformation will get us a measurement that allow us to compare whether two variables exhibit a correlation stronger than another:

\[Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)*Var(Y)}}\]

cov(years_at_company, monthly_income) / sqrt(var(years_at_company)*var(monthly_income))

#> [1] 0.5142848

And to find the correlation instead of the covariance, we would just use cor() instead. Correlation, unlike covariance, is not sensitive to the units in which our variables X and Y are measured and hence more useful for determining how strong the relationship is between variables:

cor(years_at_company, monthly_income)

#> [1] 0.5142848

Some facts about correlation:

\(Cor(X,Y) == Cor(Y,X)\)
\(-1 <= Cor(X,Y) <= 1\)
Cor(X,Y) is 1 or -1 only when the X and Y observations fall perfectly on a positive or negatively sloped line
Cor(X,Y) = 0 implies no linear relationship

Discussion:

We can think of X and Y as the returns of two stocks (both stocks have a return, which is basically how much money they’re expected to make, and both have risks, which measure how much the return fluctuates; this is an example that Mike Parzen, from the Harvard Statistics Department, often uses in his courses to build intuition). In the Statistics ‘world’, it’s pretty much the norm to think of return as the average, or expectation, of a stock, and risk as the variance. So, say that you were building a portfolio (a compilation of multiple stocks) and wanted to find the risk and return of the entire portfolio.

We could break this down into the ‘individual risks’ of the stocks - the separate Variances Var(A), Var(B), Var(C), etc - and the ‘interactive’ risks of them together - the Covariance term. The individual risk is straightforward enough (just the marginal variance of each stock), but think more about the interactive risks. If two stocks tend to move together, then they are certainly riskier. Finally, given the value of covariance and the correlation between the two stocks, we are more likely to choose 2 stocks that do not have a correlation (0) which means reducing the risk 2 stocks decrease simultaneously.

4.2 Statistics Model

4.2.1 Motivation and Definition “Machine Learning”

Machine learning on a very basic level, refers to a subfield of computer science that “gives computer the ability to learn without being explicitly programmed”, this realization and quote was credited to Arthur Samuel, who coined the term “machine learning” and created the world’s first self-learning program called the Samuel Checkers-playing Program in 1952. When Samuel was about to demonstrate the program, the founder and president of IBM remarked that the demonstration would raise the price of IBM stock by 15 points. It did. In 1961 Samuel challenged the Connecticut state checker champion (4th ranked nationwide) and his program won.

With the advances in machine learning, society as a collective has pushed new boundaries around making machines “smarter”, or less-sensationally, making machines more able to perform tasks without human intervention. The whole notion of making machines perform these tasks that, for a long time in history were done by human brains, is what most people meant when they say “Artificial intelligence”. Compared to machine learning, AI describes a broad concept (“ideal”). Machine learning on the other hand, offers a particular approach to arriving at that “ideal”.

Supervised Machine Learning currently makes up most of the ML that is being used by systems across the world. The input variable (x) is used to connect with the output variable (y) through the use of an algorithm. All of the input, the output, the algorithm, and the scenario are being provided by humans.

Supervised learning: We feed our model training examples (input) and tag each of these example with a corresponding target, and is so doing, allow our model to produce a function that maps our input to its target.

Supervised learning algorithms allows machines to do predictive analytics for a specific target. Supervised learning are used to solve for classification and regression problems. Good examples for the financial industry are credit risk scoring (regression or classification), loan default prediction (classification), and customer lifetime value (regression). Supervised learning is also useful for performing predictive analytics on Employee Attrition data, which we are going to explore in the following section.

Unsupervised learning: If we feed our model training examples (input) without any labels, it is unsupervised learning.

Good examples of unsupervised learning problems in the finance / banking sector include anomaly detection (there is no target variables in anomaly detection, there is not even necessarily any right or wrong answer as to when an observation is an anomaly and how many anomaly exist in our data) and auto segmentation (again, no right or wrong answers as to how many clusters of customer segments is the right amount).

Which of the following do you think is a supervised learning problem?

Training an email spam filter
Find possible patterns from a group of 5000 financial transactions
Discover how many market segments can be drawn from a CRM (customer relationship system)
Categorizing transactions into high / medium / low risk - Classifying blood cell as benign * or malign

4.2.2 Employee Retention Analysis

The objective is to understand what factors contributed most to employee attrition and to create a model that can predict if a certain employee will leave the company or not. The goal also includes helping in formulating different retention strategies on targeted employees. Overall, the implementation of this model will allow management to create better decision-making actions.

Let’s import our data and inspect our variable;

employee <- read.csv("data_input/HR-Employee-Attrition.csv")

glimpse(employee)

#> Observations: 1,602
#> Variables: 27
#> $ Age                      <int> 39, 28, 28, 38, 40, 33, 41, 50, 46, 45, 34, 47, 52, 35, 42,...
#> $ BusinessTravel           <fct> Travel_Frequently, Travel_Rarely, Travel_Rarely, Non-Travel...
#> $ DailyRate                <int> 443, 304, 1451, 573, 658, 722, 1206, 316, 406, 1199, 648, 1...
#> $ DistanceFromHome         <int> 8, 9, 2, 6, 10, 17, 23, 8, 3, 7, 11, 4, 3, 8, 9, 5, 2, 3, 9...
#> $ EnvironmentSatisfaction  <int> 3, 2, 1, 2, 1, 4, 4, 4, 1, 1, 3, 3, 4, 1, 4, 2, 4, 4, 4, 1,...
#> $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Male, Male, Male, M...
#> $ HourlyRate               <int> 48, 92, 67, 79, 67, 38, 80, 54, 52, 77, 56, 92, 39, 72, 93,...
#> $ JobInvolvement           <int> 3, 3, 2, 1, 2, 3, 3, 3, 3, 4, 2, 2, 2, 3, 2, 3, 3, 4, 2, 3,...
#> $ JobLevel                 <int> 1, 2, 1, 2, 3, 4, 3, 1, 4, 2, 2, 3, 3, 1, 5, 2, 4, 2, 3, 1,...
#> $ JobSatisfaction          <int> 3, 4, 2, 4, 2, 3, 3, 2, 3, 3, 2, 2, 3, 4, 4, 4, 2, 4, 2, 1,...
#> $ MaritalStatus            <fct> Married, Single, Married, Divorced, Divorced, Single, Singl...
#> $ MonthlyIncome            <int> 3755, 5253, 3201, 5329, 9705, 17444, 7082, 3875, 17465, 643...
#> $ MonthlyRate              <int> 17872, 20750, 19911, 15717, 20652, 20489, 11591, 9983, 1559...
#> $ NumCompaniesWorked       <int> 1, 1, 0, 7, 2, 1, 3, 7, 3, 4, 4, 8, 2, 1, 8, 6, 1, 1, 1, 2,...
#> $ OverTime                 <fct> No, No, No, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
#> $ PercentSalaryHike        <int> 11, 16, 17, 12, 12, 11, 16, 15, 12, 17, 11, 12, 14, 16, 22,...
#> $ PerformanceRating        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 4,...
#> $ RelationshipSatisfaction <int> 1, 4, 1, 4, 2, 4, 4, 4, 4, 4, 4, 3, 3, 2, 4, 1, 2, 2, 2, 4,...
#> $ StockOptionLevel         <int> 1, 0, 0, 3, 1, 0, 0, 1, 1, 1, 2, 1, 0, 1, 0, 0, 0, 2, 0, 1,...
#> $ TotalWorkingYears        <int> 8, 7, 6, 17, 11, 10, 21, 4, 23, 9, 14, 28, 28, 3, 24, 8, 22...
#> $ TrainingTimesLastYear    <int> 3, 1, 2, 3, 2, 2, 2, 2, 3, 1, 5, 4, 4, 1, 2, 2, 2, 3, 3, 3,...
#> $ WorkLifeBalance          <int> 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 4, 3, 3, 2, 3, 4, 3, 3, 4, 4,...
#> $ YearsAtCompany           <int> 8, 7, 5, 13, 1, 10, 2, 2, 12, 3, 10, 22, 5, 3, 1, 5, 22, 6,...
#> $ YearsInCurrentRole       <int> 3, 5, 3, 11, 0, 8, 0, 2, 9, 2, 9, 11, 4, 2, 0, 4, 10, 5, 7,...
#> $ YearsSinceLastPromotion  <int> 0, 0, 0, 1, 0, 6, 0, 2, 4, 0, 1, 14, 0, 0, 0, 1, 0, 0, 7, 1...
#> $ YearsWithCurrManager     <int> 7, 7, 4, 9, 0, 0, 2, 2, 9, 2, 8, 10, 4, 2, 1, 2, 4, 4, 7, 0...
#> $ Attrition                <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,...

The data we’ve prepared is originally made available in kaggle by lnvardanyan : ibm-hr-analytics-attrition-dataset. The following are the description of some features:

EnvironmentSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
JobInvolvement: 1 Low, 2 Medium, 3 High, 4 Very High
JobSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
PerformanceRating: 1 Low, 2 Good, 3 Excellent, 4 Outstanding
RelationshipSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
WorkLifeBalance: 1 Bad, 2 Good, 3 Better, 4 Best.

4.2.2.1 Modeling Employee Retention for Predictive Analytics

We can use an algorithm called Logistic Regression to build a prediction model for employee retention. In R, we can pass the data and the prediction scenario (formula) to the function glm(). Additionally, we can also perform feature selection to improve the model performance. In this example we use the non-business wise stepwise regression (using the function step()).

model_logit <- glm(formula = Attrition ~., 
                   data = train, 
                   family = "binomial")

stepmodel_logit <- step(model_logit, direction = "backward", trace = FALSE)


tidy(stepmodel_logit)

term	estimate	std.error	statistic	p.value
(Intercept)	0.9922016	0.7347908	1.350318	0.1769138
BusinessTravelTravel_Frequently	2.0879389	0.2517729	8.292945	0.0000000
BusinessTravelTravel_Rarely	1.3420174	0.2300301	5.834096	0.0000000
DailyRate	-0.0002744	0.0001393	-1.969656	0.0488778
DistanceFromHome	0.0304200	0.0070695	4.302997	0.0000169
EnvironmentSatisfaction	-0.4734375	0.0517351	-9.151193	0.0000000
GenderMale	0.4001570	0.1159402	3.451409	0.0005577
JobInvolvement	-0.6302589	0.0815772	-7.725916	0.0000000
JobLevel	0.4604945	0.1790987	2.571177	0.0101353
JobSatisfaction	-0.3511748	0.0524822	-6.691312	0.0000000
MaritalStatusMarried	0.6928134	0.1640395	4.223456	0.0000241
MaritalStatusSingle	1.2064628	0.1667051	7.237108	0.0000000
MonthlyIncome	-0.0001540	0.0000440	-3.495895	0.0004725
NumCompaniesWorked	0.1458214	0.0249731	5.839133	0.0000000
OverTimeYes	1.4736563	0.1225765	12.022337	0.0000000
PercentSalaryHike	-0.0435064	0.0238889	-1.821199	0.0685766
PerformanceRating	0.6463730	0.2508763	2.576461	0.0099817
TotalWorkingYears	-0.0981441	0.0165229	-5.939891	0.0000000
TrainingTimesLastYear	-0.2162572	0.0462549	-4.675337	0.0000029
WorkLifeBalance	-0.1918679	0.0748872	-2.562092	0.0104044
YearsAtCompany	0.0672745	0.0228587	2.943066	0.0032498
YearsInCurrentRole	-0.1037842	0.0301443	-3.442911	0.0005755
YearsSinceLastPromotion	0.1781443	0.0251507	7.083081	0.0000000
YearsWithCurrManager	-0.0914658	0.0288522	-3.170152	0.0015236

4.2.2.2 Model Interpretation

Using the model we have built with stepwise regression, we can create a likelihood table and analyze the contribution of each variable in determining probability. Stepwise regression will calculate a coefficient (estimate) for each variable. The coefficient reflects variable contribution to the prediction result and can be transformed into odds ratio. For example, let’s take a coefficient value from a numerical variable Years with Current Manager that is -0.09. This negative coefficient can be interpreted as:

The longer the years spent with the Current Manager, the smaller employee’s chance to leave.

tidy(stepmodel_logit) %>% 
  mutate(odds_ratio = round(exp(estimate),2),
         p.value = round(p.value, 2),
         estimate = round(estimate, 2)) %>% 
  select(term, estimate, odds_ratio, p.value) %>% 
  arrange(-estimate) %>% 
  filter(p.value < 0.05)

term	estimate	odds_ratio	p.value
BusinessTravelTravel_Frequently	2.09	8.07	0.00
OverTimeYes	1.47	4.37	0.00
BusinessTravelTravel_Rarely	1.34	3.83	0.00
MaritalStatusSingle	1.21	3.34	0.00
MaritalStatusMarried	0.69	2.00	0.00
PerformanceRating	0.65	1.91	0.01
JobLevel	0.46	1.58	0.01
GenderMale	0.40	1.49	0.00
YearsSinceLastPromotion	0.18	1.19	0.00
NumCompaniesWorked	0.15	1.16	0.00
YearsAtCompany	0.07	1.07	0.00
DistanceFromHome	0.03	1.03	0.00
MonthlyIncome	0.00	1.00	0.00
YearsWithCurrManager	-0.09	0.91	0.00
TotalWorkingYears	-0.10	0.91	0.00
YearsInCurrentRole	-0.10	0.90	0.00
WorkLifeBalance	-0.19	0.83	0.01
TrainingTimesLastYear	-0.22	0.81	0.00
JobSatisfaction	-0.35	0.70	0.00
EnvironmentSatisfaction	-0.47	0.62	0.00
JobInvolvement	-0.63	0.53	0.00

Here are a quick summary of the table above:

Overtime has a positive coefficient with the odds ratio yes to no is 5.32. This says that the event of an employee who works overtime and leaves the company is about 5.32 more likely than the employee who is not working overtime.
The variables are linked (directly or indirectly) to work-life-balance (Job Satisfaction, Environment Satisfaction, Job Involvement) have a negative coefficient. We can say that the more satisfied and the higher job involvement the employee, the less likely that employees will leave the company.

4.2.2.3 Predicting

To test whether our model has a good performance, we can check the number of correctly classified/misclassified Attrition status in the unseen data.

prob_logit <- predict(stepmodel_logit, newdata = test, type = "response")
pred_logit <- ifelse(prob_logit > 0.5, "Yes", "No")

pred_logit <- as.factor(pred_logit)

table("prediction" = pred_logit, "actual" = test$Attrition)

#>           actual
#> prediction  No Yes
#>        No  180  21
#>        Yes  66  52

This table above is also known as the confusion matrix.

Observe from the confusion matrix that:

Out of the 73 actual employee leave we classified 52 of them correctly
Out of the 246 employee that stay we classified 184 of them correctly
Out of the 319 cases of attrition in our test set, we classified 236 of them correctly

4.2.3 Convert Statistical Object Like `easystats` Style

easystats is a development packages in R to provide a unifying and consistent framework to tame and harness the scary of R statistical models. We can automate convert an object of R from simple statistical model into textual report that ease our daily works in making interpretation of the data.

4.2.3.1 Correlation Test

get_narative_cor <- function(x, y, xname, yname){
  
  temp <- cor.test(x, y)
  paste0(
    "The Pearson's product-moment correlation between ",
    xname, 
    " and ", 
    yname, 
    " is ", 
    ifelse(temp$estimate > 0, "positive ", "negative "), 
    ifelse(temp$p.value < 0.05, "significant", "but not significant enough"), 
    " with a value ", 
    round(temp$estimate, digits = 2)
  )
  
}

narativecor <- get_narative_cor(x = employee$YearsAtCompany, employee$MonthlyIncome, xname = "Years at Company", yname = "Monthly Income")

The Pearson’s product-moment correlation between Years at Company and Monthly Income is positive significant with a value 0.54

4.2.3.2 Print All Parameter from Model

get_narative_model <- function(model, target){


tidy_estimate <- tidy(model) %>% 
  mutate(term = gsub(term, pattern = "([[:upper:]])", replacement = ' \\1') %>% 
           str_remove(pattern = "[[:punct:]]") %>% 
           str_squish())

text <- paste0(
  "We fitted a logistic regression to predict ", 
  target, 
  ".",
  "",
  " The model Intercepet is at ", 
  round(tidy_estimate$estimate[1], digits = 2),
  ". Within this model: <br>"
)

for (i in 2:nrow(tidy_estimate)) {

  text[i] <- paste0(
    i-1,
    ". The effect of ", 
    tidy_estimate$term[i], 
    " is ", 
    ifelse(tidy_estimate$estimate[i] > 0, "positive", "negative"), 
    " with value: ", 
    round(tidy_estimate$estimate[i], digits = 2), 
    "<br>"
  ) 
}
 
return(text)
 
}

narativemodel <- get_narative_model(model = stepmodel_logit, target = "Attrition Status")

We fitted a logistic regression to predict Attrition Status. The model Intercepet is at 0.99. Within this model:
, 1. The effect of Business Travel Travel Frequently is positive with value: 2.09
, 2. The effect of Business Travel Travel Rarely is positive with value: 1.34
, 3. The effect of Daily Rate is negative with value: 0
, 4. The effect of Distance From Home is positive with value: 0.03
, 5. The effect of Environment Satisfaction is negative with value: -0.47
, 6. The effect of Gender Male is positive with value: 0.4
, 7. The effect of Job Involvement is negative with value: -0.63
, 8. The effect of Job Level is positive with value: 0.46
, 9. The effect of Job Satisfaction is negative with value: -0.35
, 10. The effect of Marital Status Married is positive with value: 0.69
, 11. The effect of Marital Status Single is positive with value: 1.21
, 12. The effect of Monthly Income is negative with value: 0
, 13. The effect of Num Companies Worked is positive with value: 0.15
, 14. The effect of Over Time Yes is positive with value: 1.47
, 15. The effect of Percent Salary Hike is negative with value: -0.04
, 16. The effect of Performance Rating is positive with value: 0.65
, 17. The effect of Total Working Years is negative with value: -0.1
, 18. The effect of Training Times Last Year is negative with value: -0.22
, 19. The effect of Work Life Balance is negative with value: -0.19
, 20. The effect of Years At Company is positive with value: 0.07
, 21. The effect of Years In Current Role is negative with value: -0.1
, 22. The effect of Years Since Last Promotion is positive with value: 0.18
, 23. The effect of Years With Curr Manager is negative with value: -0.09