Chapter 4 Statistical Report

In general, statistical reports have two main parts: descriptive statistics and inferential statistics (analysis and tests). To show you how to create a statistical analysis report, we will divide the content into:

  • Descriptive Statistics

    • Measures of Central Tendency
    • Measures of Spread
    • Measures of Relationship between Data
  • Inferential Statistics: Focus on Statistical Models

    • Motivation and Definition of “Machine Learning”
    • Employee Attrition Analysis
    • How to Interpret Analysis Results
    • Converting Statistical Objects into Textual Reports with easystats

4.1 Descriptive Statistics

Statisticians and data scientists use descriptive statistics to summarize and describe a large number of measurements. Many times, this task is accompanied by graphs and plots that help describe the numerical summary of the data. When data science is applied in the business context, an example of a descriptive statistic is the average number of transactions per month. Another example is the percentage of e-commerce transactions with a voucher code applied. The simple rule is that descriptive statistics do not involve generalizing beyond the data we have obtained, and are merely descriptive of what we have at hand.

4.1.1 Measure of Central Tendency

Measures of central tendency enable us to compare two or more distributions pertaining to the same time period, or to track changes within the same distribution over time.

(Table: the first ten rows of the employee attrition data, one row per employee, across 35 variables: attrition, age, business_travel, daily_rate, department, distance_from_home, education, education_field, employee_count, employee_number, environment_satisfaction, gender, hourly_rate, job_involvement, job_level, job_role, job_satisfaction, marital_status, monthly_income, monthly_rate, num_companies_worked, over_18, over_time, percent_salary_hike, performance_rating, relationship_satisfaction, standard_hours, stock_option_level, total_working_years, training_times_last_year, work_life_balance, years_at_company, years_in_current_role, years_since_last_promotion, and years_with_curr_manager.)

4.1.1.1 Mean

Oftentimes in the exploratory data analysis phase, we want to get a sense of what the most representative score of a particular measurement is. We often simplify this idea by referring to it as the “average”, but there are, in fact, three measures of central tendency that you need to have in your statistical toolset.

The most popular measure of central tendency is the mean, which is sometimes represented as \(\bar x\) when computed on a sample and represented as \(\mu\) when computed on a population. Mean is really the sum of all your measurements, divided by the number of measurements, and works best on data that has an even distribution or a normal distribution (don’t worry if the idea of a normal distribution isn’t clear - we’ll get to that in a while!). In R, the mean function will return the mean:

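The call that produced the output below is not shown in this excerpt; a plausible reconstruction, assuming our data frame is named attrition and the column of interest is monthly_income, is:

sum(attrition$monthly_income) / length(attrition$monthly_income)  # manual: total divided by count
mean(attrition$monthly_income)                                    # built-in mean()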
#> [1] 6502.931
#> [1] 6502.931

Because the mean is based on all the items in a series, a change in the value of any item will lead to a change in the value of the mean. In a highly skewed distribution, the mean may therefore be distorted by a few items with extreme values. In such a case, it may not be appropriate for representing the characteristics of the distribution.

4.1.1.2 Median

The median is the value that cuts the distribution into two equal halves, such that 50% of the observations fall below it. To find this value, we order the observations and take the middle value that separates the distribution into two equal halves.

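For example, still assuming the monthly_income column, the median below could be obtained with:

median(attrition$monthly_income)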
#> [1] 4919

We need to be cautious when applying the mean to data with a skewed distribution, because the mean may not be the best candidate for the most representative score compared to other measures of central tendency. For example, a company surveys its employees’ household income and obtains the following monthly household incomes (IDR, in millions):

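The survey values themselves are not reproduced here; assuming they are stored in a numeric vector called household_income, the two figures below would come from:

mean(household_income)     # pulled upward by the few very high earners
median(household_income)   # the middle household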
#> [1] 20.03
#> [1] 7.5

While the median puts that figure at about 7.5, the mean is about 2.67 times higher and is not truly representative of the actual household income. While most of the employees have a combined household income of less than 8 million, the mean would have us believe that the average household income of our employees is in fact more than 20 million IDR.

The median in this case is a better measure of centrality because it is not sensitive to outliers.

If we are in fact required to compute the mean on data with a skewed distribution, another technique to reduce the influence of outliers is to use a slight variation of the mean, called the trimmed mean. The trimmed mean removes a small designated percentage of the largest and smallest values before computing the mean. A trimmed mean that uses the middle 95% of the distribution can be computed in R fairly easily:

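Using the trim argument of mean(), trimming 2.5% from each tail keeps the middle 95% (the household_income vector is the same assumption as above):

mean(household_income, trim = 0.025)   # drop 2.5% of observations from each tail before averaging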
#> [1] 6.966667

4.1.1.3 Mode

When a variable takes discrete values, the mode refers to the value that occurs most frequently. This statistic is rarely used in practice.

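R has no built-in mode function, so the computation described below could look like this (the education_field column is an assumption):

names(sort(-table(attrition$education_field)))[1]   # tabulate, sort counts descending via the minus sign, take the most frequent level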
#> [1] "life_sciences"

Because R does not have a built-in function for computing the mode, we wrote the code above to tabulate our data (using table()), multiply the counts by -1 (which has the effect of sorting our data in descending order), and then pick the first value.

Dive Deeper

The following data give the savings bank account balances of nine sample households selected in a survey.

  1. Find the mean and the median for these data.
  2. Do these data contain an outlier? If so, what can we do to deal with the extreme value?
  3. Which of these two summary measures is more appropriate for this series?

4.1.2 Measures of Spread

In the previous section, we explained the measures of central tendency. It may be noted that these measures do not indicate the extent of dispersion or variability in a distribution. Measures of spread quantify the extent to which values in a distribution differ from each other. In practice, it is far easier to compute the distance between each value and the mean; when we square each of these distances and add them all up, the average of that result is known as the variance. Taking the square root of the variance gives the standard deviation. Just like the mean, the standard deviation is the “expected value” of how far the scores deviate from the mean.
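To make that definition concrete, here is a small sketch of the manual computation next to the built-in var() function; the choice of the hourly_rate column is an assumption, since the original code chunk is not shown in this excerpt:

x <- attrition$hourly_rate                  # example column (an assumption)
sum((x - mean(x))^2) / (length(x) - 1)      # average of squared distances to the mean, using n - 1
var(x)                                      # built-in equivalent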

4.1.2.2 Standard Deviation

And taking the square root of variance yields the standard deviation:

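The chunk that produced the two identical values below is not shown; assuming the same hourly_rate column, it could have been:

sqrt(var(attrition$hourly_rate))   # square root of the variance
sd(attrition$hourly_rate)          # built-in standard deviation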
#> [1] 20.32943
#> [1] 20.32943

Variance and standard deviation are always positive when the values are not identical. When there’s no variability, the variance is 0. Because variance and standard deviation are sensitive to every value, they may not be the most “representative” measurement for skewed data.

4.1.2.3 Range

Other measures of spread are the range and the interquartile range. The range is the distance from our smallest measurement to the largest one:

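Two equivalent ways to get it, again assuming hourly_rate:

max(attrition$hourly_rate) - min(attrition$hourly_rate)
diff(range(attrition$hourly_rate))   # range() returns c(min, max)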
#> [1] 70
#> [1] 70

4.1.2.4 IQR

The interquartile range is the range computed for the middle 50% of the distribution:

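It can be obtained with IQR() or from the quartiles directly:

IQR(attrition$hourly_rate)
unname(quantile(attrition$hourly_rate, 0.75) - quantile(attrition$hourly_rate, 0.25))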
#> [1] 35.75
#> [1] 35.75

While we can use quantile() to obtain the 3rd and 1st quartiles individually, these two figures are also presented together with the 0th percentile (the min()), the 50th (the median()), and the 100th (the max()): together they are called the five-number summary.

When we call fivenum(), we get this summary, which we can use as a measure of variation even for potentially skewed data:

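For example, on the same assumed hourly_rate column:

fivenum(attrition$hourly_rate)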
#> [1]  30  48  66  84 100

From the above, observe that the absolute lowest value (the min) is 30 and the highest value is 100 (the max). Observe also that 25% of the observations fall below 48, and that the middle 50% of the values lie between 48 and 84 - recall that this range is called the interquartile range (IQR). When we use summary() on continuous data, we’ll get the five-number summary and the mean in return:

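Continuing with the same assumed column:

summary(attrition$hourly_rate)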
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
#>   30.00   48.00   66.00   65.89   83.75  100.00

4.1.2.5 Uses of the measure of spread

Discussion:

  • In a small business firm, two typists are employed: typist A and typist B. Here are the numbers of pages typed by each typist over 10 days. Which typist shows greater consistency in their output?

Solution:

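The daily page counts are not printed in this excerpt; assuming they are stored in numeric vectors typist_a and typist_b, the summaries below could be produced with:

print(paste("mean typist A:", round(mean(typist_a), 2)))
print(paste("mean typist B:", round(mean(typist_b), 2)))
print(paste("Standard Deviation typist A:", round(sd(typist_a), 2)))
print(paste("Standard Deviation typist B:", round(sd(typist_b), 2)))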
#> [1] "mean typist A: 32.54"
#> [1] "mean typist B: 47.85"
#> [1] "Standard Deviation typist A: 9.86"
#> [1] "Standard Deviation typist B: 205.51"

These calculations indicate that although typist B types more pages on average (47.85 per day), there is greater variation in his output compared to that of typist A. We can say this in a different way: though typist A’s daily output is lower, he is more consistent than typist B.

Discussion 2:

Which financial asset has more volatility in its annual price?

The primary measure of volatility used by stock traders and financial analysts is the standard deviation, and recall that this metric reflects the average amount by which an item’s price deviates from its mean over a period of time. While the prices of our fictional “oil” asset and “coins” asset both average out to USD 1.4 over time, which of the two exhibits higher volatility?
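A hypothetical sketch of how such a comparison would be carried out, with made-up price vectors:

oil   <- c(1.2, 1.3, 1.4, 1.5, 1.6)   # made-up annual prices (USD), averaging 1.4
coins <- c(0.6, 1.0, 1.4, 1.8, 2.2)   # made-up annual prices (USD), also averaging 1.4
sd(oil)     # smaller standard deviation: lower volatility
sd(coins)   # larger standard deviation: higher volatility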

4.1.3 Covariance and Correlation

Measures of central tendency and measures of spread are helpful for the comparison and analysis of distributions involving only one variable. However, describing the relationship between two or more variables is another important part of statistics. In many business research situations, the key to decision making lies in understanding the relationship between two or more variables. The statistical methods of covariance and correlation are helpful in understanding the relationship between two or more variables.

In all these cases involving two or more variables, we may be interested in seeing:

  • if there is any association between the variables;
  • if there is an association, is it strong enough to be useful;
  • if so, what form the relationship between the two variables takes

4.1.3.1 Covariance

When we have two samples, X and Y, of the same size, then the covariance is an estimate of how variation in X is related to the variation in Y. Covariance measures how two variables covary and is represented as:

\[Cov(X,Y) = \frac{1}{n-1}\sum_{i = 1}^{n}(X_i - \bar{X})(Y_i - \bar{Y})\]

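The exact variables used are not identified here; one pair from the attrition data that would plausibly yield outputs like the ones below is years_at_company and monthly_income (an assumption):

cov(attrition$years_at_company, attrition$monthly_income)
cor(attrition$years_at_company, attrition$monthly_income)   # the normalized version, discussed below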
#> [1] 14833.73
#> [1] 0.5142848

Getting a positive covariance means that higher X tends to be associated with larger Y (and vice versa). The covariance of any variable with itself is its variance. Notice also that cov(X,Y) = cov(Y,X).

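For example, with the same assumed column:

cov(attrition$years_at_company, attrition$years_at_company)   # covariance of a variable with itself
var(attrition$years_at_company)                               # is simply its variance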
#> [1] 37.53431
#> [1] 37.53431

But there is a problem with covariances: they are hard to compare because variables are sometimes expressed in different units or scales. It is hard to tell whether there is an “objectively stronger” relationship between the UK Pound and the Indonesian Rupiah or between Bitcoin prices and the US Dollar, because the “scale” at which we measure and compute the covariance differs.

One solution is to “normalize” the covariance: we divide the covariance by something that encapsulates the scale of both covariates, leading us to a value that is bounded to the range of -1 to +1. This is the correlation.

4.1.3.2 Correlation

Whatever units our original variables were in, this transformation gives us a measurement that allows us to compare whether one pair of variables exhibits a stronger correlation than another:

\[Corr(X,Y) = \frac{Cov(X,Y)}{\sqrt{Var(X)*Var(Y)}}\]

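Applying the formula directly to the same assumed pair of variables:

cov(attrition$years_at_company, attrition$monthly_income) /
  sqrt(var(attrition$years_at_company) * var(attrition$monthly_income))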
#> [1] 0.5142848

And to find the correlation instead of the covariance, we would just use cor(). Correlation, unlike covariance, is not sensitive to the units in which our variables X and Y are measured, and is hence more useful for determining how strong the relationship between variables is:

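Again with the same assumed pair of variables:

cor(attrition$years_at_company, attrition$monthly_income)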
#> [1] 0.5142848

Some facts about correlation:

  • \(Cor(X,Y) = Cor(Y,X)\)
  • \(-1 \leq Cor(X,Y) \leq 1\)
  • Cor(X,Y) is 1 or -1 only when the X and Y observations fall perfectly on a positively or negatively sloped line
  • Cor(X,Y) = 0 implies no linear relationship

Discussion:

We can think of X and Y as the returns of two stocks (both stocks have a return, which is basically how much money they’re expected to make, and both have risks, which measure how much the return fluctuates; this is an example that Mike Parzen, from the Harvard Statistics Department, often uses in his courses to build intuition). In the Statistics ‘world’, it’s pretty much the norm to think of return as the average, or expectation, of a stock, and risk as the variance. So, say that you were building a portfolio (a compilation of multiple stocks) and wanted to find the risk and return of the entire portfolio.

We could break this down into the ‘individual risks’ of the stocks - the separate variances Var(A), Var(B), Var(C), etc. - and the ‘interactive’ risk of holding them together - the covariance terms. The individual risk is straightforward enough (just the marginal variance of each stock), but think more about the interactive risk: if two stocks tend to move together, then holding them together is certainly riskier. Finally, given the covariance and the correlation between two stocks, we would prefer to choose two stocks that are uncorrelated (correlation close to 0), which reduces the risk of both stocks decreasing simultaneously.
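A small numerical sketch (with simulated returns, purely for illustration) shows how the covariance term enters the portfolio risk, since Var(A + B) = Var(A) + Var(B) + 2 Cov(A, B):

set.seed(100)
stock_a <- rnorm(250, mean = 0.001, sd = 0.02)                   # simulated daily returns of stock A
stock_b <- 0.5 * stock_a + rnorm(250, mean = 0.001, sd = 0.02)   # stock B partially co-moves with A
var(stock_a + stock_b)                                           # total portfolio risk
var(stock_a) + var(stock_b) + 2 * cov(stock_a, stock_b)          # same value, decomposed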

4.2 Statistics Model

4.2.1 Motivation and Definition of “Machine Learning”

Machine learning, on a very basic level, refers to a subfield of computer science that “gives computers the ability to learn without being explicitly programmed”. This realization and quote are credited to Arthur Samuel, who coined the term “machine learning” and created the world’s first self-learning program, the Samuel Checkers-playing Program, in 1952. When Samuel was about to demonstrate the program, the founder and president of IBM remarked that the demonstration would raise the price of IBM stock by 15 points. It did. In 1961 Samuel challenged the Connecticut state checkers champion (ranked 4th nationwide) and his program won.

With the advances in machine learning, society as a collective has pushed new boundaries around making machines “smarter”, or, less sensationally, making machines more able to perform tasks without human intervention. The whole notion of making machines perform tasks that for a long time in history were done by human brains is what most people mean when they say “artificial intelligence”. Compared to machine learning, AI describes a broad concept (an “ideal”). Machine learning, on the other hand, offers a particular approach to arriving at that “ideal”.

Supervised machine learning currently makes up most of the ML being used by systems across the world. The input variable (x) is connected to the output variable (y) through the use of an algorithm. The input, the output, the algorithm, and the scenario are all provided by humans.

Supervised learning: We feed our model training examples (input) and tag each of these examples with a corresponding target, and in so doing allow our model to produce a function that maps our input to its target.

Supervised learning algorithms allow machines to perform predictive analytics for a specific target. Supervised learning is used to solve classification and regression problems. Good examples from the financial industry are credit risk scoring (regression or classification), loan default prediction (classification), and customer lifetime value (regression). Supervised learning is also useful for performing predictive analytics on employee attrition data, which we are going to explore in the following section.

Unsupervised learning: If we feed our model training examples (input) without any labels, it is unsupervised learning.

Good examples of unsupervised learning problems in the finance / banking sector include anomaly detection (there are no target variables in anomaly detection; there is not even necessarily a right or wrong answer as to when an observation is an anomaly or how many anomalies exist in our data) and auto-segmentation (again, no right or wrong answer as to how many clusters of customer segments is the right amount).

Which of the following do you think is a supervised learning problem?

  • Training an email spam filter
  • Finding possible patterns in a group of 5,000 financial transactions
  • Discovering how many market segments can be drawn from a CRM (customer relationship management) system
  • Categorizing transactions into high / medium / low risk
  • Classifying a blood cell as benign or malignant

4.2.2 Employee Retention Analysis

The objective is to understand what factors contributed most to employee attrition and to create a model that can predict if a certain employee will leave the company or not. The goal also includes helping in formulating different retention strategies on targeted employees. Overall, the implementation of this model will allow management to create better decision-making actions.

Let’s import our data and inspect our variables:

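The import step itself is not shown here; a plausible sketch (the file path and the use of dplyr’s glimpse() are assumptions) is:

library(dplyr)
attrition <- read.csv("data_input/attrition.csv", stringsAsFactors = TRUE)  # file name is a placeholder
glimpse(attrition)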
#> Observations: 1,602
#> Variables: 27
#> $ Age                      <int> 39, 28, 28, 38, 40, 33, 41, 50, 46, 45, 34, 47, 52, 35, 42,...
#> $ BusinessTravel           <fct> Travel_Frequently, Travel_Rarely, Travel_Rarely, Non-Travel...
#> $ DailyRate                <int> 443, 304, 1451, 573, 658, 722, 1206, 316, 406, 1199, 648, 1...
#> $ DistanceFromHome         <int> 8, 9, 2, 6, 10, 17, 23, 8, 3, 7, 11, 4, 3, 8, 9, 5, 2, 3, 9...
#> $ EnvironmentSatisfaction  <int> 3, 2, 1, 2, 1, 4, 4, 4, 1, 1, 3, 3, 4, 1, 4, 2, 4, 4, 4, 1,...
#> $ Gender                   <fct> Female, Male, Male, Female, Male, Male, Male, Male, Male, M...
#> $ HourlyRate               <int> 48, 92, 67, 79, 67, 38, 80, 54, 52, 77, 56, 92, 39, 72, 93,...
#> $ JobInvolvement           <int> 3, 3, 2, 1, 2, 3, 3, 3, 3, 4, 2, 2, 2, 3, 2, 3, 3, 4, 2, 3,...
#> $ JobLevel                 <int> 1, 2, 1, 2, 3, 4, 3, 1, 4, 2, 2, 3, 3, 1, 5, 2, 4, 2, 3, 1,...
#> $ JobSatisfaction          <int> 3, 4, 2, 4, 2, 3, 3, 2, 3, 3, 2, 2, 3, 4, 4, 4, 2, 4, 2, 1,...
#> $ MaritalStatus            <fct> Married, Single, Married, Divorced, Divorced, Single, Singl...
#> $ MonthlyIncome            <int> 3755, 5253, 3201, 5329, 9705, 17444, 7082, 3875, 17465, 643...
#> $ MonthlyRate              <int> 17872, 20750, 19911, 15717, 20652, 20489, 11591, 9983, 1559...
#> $ NumCompaniesWorked       <int> 1, 1, 0, 7, 2, 1, 3, 7, 3, 4, 4, 8, 2, 1, 8, 6, 1, 1, 1, 2,...
#> $ OverTime                 <fct> No, No, No, Yes, No, No, Yes, No, No, No, No, Yes, Yes, No,...
#> $ PercentSalaryHike        <int> 11, 16, 17, 12, 12, 11, 16, 15, 12, 17, 11, 12, 14, 16, 22,...
#> $ PerformanceRating        <int> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 4, 3, 3, 4, 3, 4,...
#> $ RelationshipSatisfaction <int> 1, 4, 1, 4, 2, 4, 4, 4, 4, 4, 4, 3, 3, 2, 4, 1, 2, 2, 2, 4,...
#> $ StockOptionLevel         <int> 1, 0, 0, 3, 1, 0, 0, 1, 1, 1, 2, 1, 0, 1, 0, 0, 0, 2, 0, 1,...
#> $ TotalWorkingYears        <int> 8, 7, 6, 17, 11, 10, 21, 4, 23, 9, 14, 28, 28, 3, 24, 8, 22...
#> $ TrainingTimesLastYear    <int> 3, 1, 2, 3, 2, 2, 2, 2, 3, 1, 5, 4, 4, 1, 2, 2, 2, 3, 3, 3,...
#> $ WorkLifeBalance          <int> 3, 3, 1, 3, 2, 3, 3, 3, 3, 3, 4, 3, 3, 2, 3, 4, 3, 3, 4, 4,...
#> $ YearsAtCompany           <int> 8, 7, 5, 13, 1, 10, 2, 2, 12, 3, 10, 22, 5, 3, 1, 5, 22, 6,...
#> $ YearsInCurrentRole       <int> 3, 5, 3, 11, 0, 8, 0, 2, 9, 2, 9, 11, 4, 2, 0, 4, 10, 5, 7,...
#> $ YearsSinceLastPromotion  <int> 0, 0, 0, 1, 0, 6, 0, 2, 4, 0, 1, 14, 0, 0, 0, 1, 0, 0, 7, 1...
#> $ YearsWithCurrManager     <int> 7, 7, 4, 9, 0, 0, 2, 2, 9, 2, 8, 10, 4, 2, 1, 2, 4, 4, 7, 0...
#> $ Attrition                <fct> No, No, No, No, No, No, No, No, No, No, No, No, No, No, No,...

The data we’ve prepared was originally made available on Kaggle by lnvardanyan: ibm-hr-analytics-attrition-dataset. The following are descriptions of some of the features:

  • EnvironmentSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
  • JobInvolvement: 1 Low, 2 Medium, 3 High, 4 Very High
  • JobSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
  • PerformanceRating: 1 Low, 2 Good, 3 Excellent, 4 Outstanding
  • RelationshipSatisfaction: 1 Low, 2 Medium, 3 High, 4 Very High
  • WorkLifeBalance: 1 Bad, 2 Good, 3 Better, 4 Best.

4.2.2.1 Modeling Employee Retention for Predictive Analytics

We can use an algorithm called logistic regression to build a prediction model for employee retention. In R, we can pass the data and the prediction scenario (formula) to the glm() function. Additionally, we can perform feature selection to improve the model’s performance; in this example we use stepwise regression (via the step() function), which selects features based purely on a statistical criterion rather than on business considerations.

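A minimal sketch of that workflow, assuming the data has already been split into attrition_train and attrition_test (the split and the backward direction are assumptions not shown in this excerpt):

model_all  <- glm(Attrition ~ ., data = attrition_train, family = "binomial")   # logistic regression on all predictors
model_step <- step(model_all, direction = "backward", trace = FALSE)            # drop uninformative predictors by AIC
broom::tidy(model_step)                                                          # coefficient table similar to the one below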
term estimate std.error statistic p.value
(Intercept) 0.9922016 0.7347908 1.350318 0.1769138
BusinessTravelTravel_Frequently 2.0879389 0.2517729 8.292945 0.0000000
BusinessTravelTravel_Rarely 1.3420174 0.2300301 5.834096 0.0000000
DailyRate -0.0002744 0.0001393 -1.969656 0.0488778
DistanceFromHome 0.0304200 0.0070695 4.302997 0.0000169
EnvironmentSatisfaction -0.4734375 0.0517351 -9.151193 0.0000000
GenderMale 0.4001570 0.1159402 3.451409 0.0005577
JobInvolvement -0.6302589 0.0815772 -7.725916 0.0000000
JobLevel 0.4604945 0.1790987 2.571177 0.0101353
JobSatisfaction -0.3511748 0.0524822 -6.691312 0.0000000
MaritalStatusMarried 0.6928134 0.1640395 4.223456 0.0000241
MaritalStatusSingle 1.2064628 0.1667051 7.237108 0.0000000
MonthlyIncome -0.0001540 0.0000440 -3.495895 0.0004725
NumCompaniesWorked 0.1458214 0.0249731 5.839133 0.0000000
OverTimeYes 1.4736563 0.1225765 12.022337 0.0000000
PercentSalaryHike -0.0435064 0.0238889 -1.821199 0.0685766
PerformanceRating 0.6463730 0.2508763 2.576461 0.0099817
TotalWorkingYears -0.0981441 0.0165229 -5.939891 0.0000000
TrainingTimesLastYear -0.2162572 0.0462549 -4.675337 0.0000029
WorkLifeBalance -0.1918679 0.0748872 -2.562092 0.0104044
YearsAtCompany 0.0672745 0.0228587 2.943066 0.0032498
YearsInCurrentRole -0.1037842 0.0301443 -3.442911 0.0005755
YearsSinceLastPromotion 0.1781443 0.0251507 7.083081 0.0000000
YearsWithCurrManager -0.0914658 0.0288522 -3.170152 0.0015236

4.2.2.2 Model Interpretation

Using the model we built with stepwise regression, we can create a likelihood table and analyze the contribution of each variable in determining the probability of attrition. Stepwise regression calculates a coefficient (estimate) for each variable. The coefficient reflects the variable’s contribution to the prediction and can be transformed into an odds ratio. For example, let’s take the coefficient of the numerical variable Years with Current Manager, which is -0.09. This negative coefficient can be interpreted as:

The longer the years spent with the current manager, the smaller the employee’s chance of leaving.

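One way the likelihood table below could have been assembled, reusing the model_step object and the broom/dplyr helpers assumed in the sketch above:

library(dplyr)
broom::tidy(model_step) %>%
  filter(term != "(Intercept)") %>%
  mutate(odds_ratio = round(exp(estimate), 2),   # exponentiate the log-odds coefficient
         estimate   = round(estimate, 2),
         p.value    = round(p.value, 2)) %>%
  select(term, estimate, odds_ratio, p.value) %>%
  arrange(desc(estimate))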
term estimate odds_ratio p.value
BusinessTravelTravel_Frequently 2.09 8.07 0.00
OverTimeYes 1.47 4.37 0.00
BusinessTravelTravel_Rarely 1.34 3.83 0.00
MaritalStatusSingle 1.21 3.34 0.00
MaritalStatusMarried 0.69 2.00 0.00
PerformanceRating 0.65 1.91 0.01
JobLevel 0.46 1.58 0.01
GenderMale 0.40 1.49 0.00
YearsSinceLastPromotion 0.18 1.19 0.00
NumCompaniesWorked 0.15 1.16 0.00
YearsAtCompany 0.07 1.07 0.00
DistanceFromHome 0.03 1.03 0.00
MonthlyIncome 0.00 1.00 0.00
YearsWithCurrManager -0.09 0.91 0.00
TotalWorkingYears -0.10 0.91 0.00
YearsInCurrentRole -0.10 0.90 0.00
WorkLifeBalance -0.19 0.83 0.01
TrainingTimesLastYear -0.22 0.81 0.00
JobSatisfaction -0.35 0.70 0.00
EnvironmentSatisfaction -0.47 0.62 0.00
JobInvolvement -0.63 0.53 0.00

Here is a quick summary of the table above:

  1. Overtime has a positive coefficient, with an odds ratio (yes to no) of about 4.37. This says that an employee who works overtime is about 4.37 times more likely to leave the company than an employee who does not work overtime.
  2. The variables linked (directly or indirectly) to work-life balance (Job Satisfaction, Environment Satisfaction, Job Involvement) have negative coefficients. We can say that the more satisfied with and involved in their job an employee is, the less likely that employee is to leave the company.

4.2.2.3 Predicting

To test whether our model has a good performance, we can check the number of correctly classified/misclassified Attrition status in the unseen data.

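A sketch of the prediction step, assuming the held-out attrition_test data frame and a 0.5 probability cutoff:

pred_prob  <- predict(model_step, newdata = attrition_test, type = "response")  # predicted probability of leaving
pred_label <- ifelse(pred_prob > 0.5, "Yes", "No")                              # convert probability to a class label
table(prediction = pred_label, actual = attrition_test$Attrition)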
#>           actual
#> prediction  No Yes
#>        No  180  21
#>        Yes  66  52

The table above is also known as a confusion matrix.

Observe from the confusion matrix that:

  • Out of the 73 employees who actually left, we classified 52 of them correctly
  • Out of the 246 employees who stayed, we classified 180 of them correctly
  • Out of the 319 cases in our test set, we classified 232 of them correctly

4.2.3 Converting Statistical Objects into Textual Reports with easystats

easystats is a collection of R packages that provides a unifying and consistent framework to tame and harness the scary world of R statistical models. We can automatically convert an R object from a simple statistical model into a textual report, which eases our daily work of interpreting the data.
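For example, the report package from the easystats ecosystem can turn a fitted model straight into prose; the model below is a hypothetical illustration, not one used earlier in this chapter:

library(report)
model_example <- lm(MonthlyIncome ~ TotalWorkingYears, data = attrition)  # simple linear model, purely for illustration
report(model_example)                                                     # automated textual description of the model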