Hypothesis Testing - My Coding Marathon

Performing statistical analyses is an important part of many business decisions. It gives the business supporting evidence for, and confidence in, the decisions that it makes. Hypothesis testing is one helpful way to perform statistical analyses that is both simple to carry out and reliable. In this post, I will explain what a hypothesis test is and use an example to show how to perform one. Hopefully, this will provide some guidance for anyone hoping to use hypothesis testing to inform business decisions.

So what exactly is a hypothesis test? A hypothesis test evaluates whether the data you observed are consistent with an assumption made about a population. For example, your assumption may be that taking Vitamin C makes you less likely to catch a cold. You could then use a hypothesis test to check whether the difference in cold rates between people who do and do not take Vitamin C is bigger than chance alone would plausibly produce.

For the purposes of this explanation, I am going to use an example based on data from the Northwind database, which contains data from a fictional company. The database can be downloaded here: [https://docs.microsoft.com/en-us/dotnet/framework/data/adonet/sql/linq/downloading-sample-databases]

The question I will be answering is:

Do USA employees generate significantly higher revenue than UK employees?

There are 4 steps to hypothesis testing that I will follow:

Set up null and alternative hypotheses
Explore the data and capture the necessary data points
Perform the hypothesis test
Interpret the findings and either reject or fail to reject the null hypothesis

Alright, let’s dive into our example!

Defining Hypotheses:

The Null Hypothesis (H0) is typically that there is no difference between the populations being compared, while the Alternative Hypothesis (HA) is our educated guess about the relationship between them. We will also be using an alpha value of .05 as our threshold for statistical significance. This means that we will reject the null hypothesis in favor of the alternative only if results at least as extreme as ours would occur less than 5% of the time when the null hypothesis is true.

In our example, the hypotheses are:

Null Hypothesis (H0) = There is no difference in the revenue generated by employees in the USA vs. in the UK
Alternative Hypothesis (HA) = The revenue generated by USA employees is greater than the revenue generated by UK employees
Significance level (Alpha) = .05
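
To make the meaning of alpha more concrete, here is a minimal simulation sketch. It is not part of the Northwind analysis: the normal distributions, sample sizes, and seed are all assumptions chosen for illustration. Both samples are drawn from the same population, so the null hypothesis is true by construction, and we count how often a test still comes out "significant" at alpha = .05.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable
alpha = 0.05
n_trials = 2000

false_positives = 0
for _ in range(n_trials):
    # Both samples come from the SAME distribution, so H0 is true by construction
    a = rng.normal(loc=100, scale=15, size=20)
    b = rng.normal(loc=100, scale=15, size=20)
    p_value = stats.ttest_ind(a, b, equal_var=False).pvalue
    if p_value < alpha:
        false_positives += 1

false_positive_rate = false_positives / n_trials
print(f"Share of tests that rejected a true H0: {false_positive_rate:.3f}")
```

The printed share should land close to .05, which is exactly the risk of a false alarm that we accept by choosing that alpha.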

Explore The Data:

For our analysis, we will need to gather the employee ID numbers, the country where each employee works, the total number of orders, and the quantity, unit price, and discount amount for each order line (these three will be used to calculate the total revenue per order). This information is spread across 3 tables in our database, so I will use the code below to extract the data and put it into an easy-to-use pandas dataframe. I will group the data by employee ID so that we can get each employee’s total revenue across all orders.

import pandas as pd

# Select the data we want to work with
# (assumes `cur` is an open cursor on the Northwind database)
cur.execute('''SELECT o.EmployeeId, e.Country,
                    SUM(od.Quantity * od.UnitPrice * (1 - od.Discount)) AS Total_Revenue,
                    COUNT(DISTINCT o.Id) AS Total_Orders -- count orders, not order lines
                    FROM [Order] AS o
                    JOIN Employee AS e
                    ON o.EmployeeId = e.Id
                    JOIN OrderDetail AS od
                    ON o.Id = od.OrderId
                    GROUP BY o.EmployeeId, e.Country
                ;''')
employee_revenue = pd.DataFrame(cur.fetchall()) # put data into easy-to-use dataframe
employee_revenue.columns = [x[0] for x in cur.description] # add column labels to dataframe
employee_revenue # display dataframe (in a notebook)
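
If you want to try the query without downloading Northwind, here is a small self-contained sketch. It builds a tiny in-memory SQLite database whose table and column names mirror the ones used in the query above, but every row in it is made up for illustration. It also shows why counting distinct order IDs matters: joining to OrderDetail produces one row per order line, so a plain row count would overstate the number of orders.

```python
import sqlite3
import pandas as pd

# Tiny in-memory stand-in for Northwind (hypothetical rows, illustration only)
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Employee (Id INTEGER PRIMARY KEY, Country TEXT);
CREATE TABLE [Order] (Id INTEGER PRIMARY KEY, EmployeeId INTEGER);
CREATE TABLE OrderDetail (OrderId INTEGER, Quantity INTEGER, UnitPrice REAL, Discount REAL);
INSERT INTO Employee VALUES (1, 'USA'), (2, 'UK');
INSERT INTO [Order] VALUES (10, 1), (11, 1), (12, 2);
INSERT INTO OrderDetail VALUES
    (10, 2, 10.0, 0.0),
    (10, 1, 5.0, 0.5),
    (11, 4, 2.5, 0.0),
    (12, 3, 10.0, 0.1);
""")

query = """
SELECT o.EmployeeId, e.Country,
       SUM(od.Quantity * od.UnitPrice * (1 - od.Discount)) AS Total_Revenue,
       COUNT(DISTINCT o.Id) AS Total_Orders,
       COUNT(*) AS Detail_Lines
FROM [Order] AS o
JOIN Employee AS e ON o.EmployeeId = e.Id
JOIN OrderDetail AS od ON o.Id = od.OrderId
GROUP BY o.EmployeeId, e.Country
ORDER BY o.EmployeeId
"""
df = pd.read_sql_query(query, conn)
print(df)
conn.close()
```

Employee 1 has two orders made up of three order lines (20.00 + 2.50 + 10.00 = 32.50 in revenue), so Total_Orders is 2 while Detail_Lines is 3.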

Once I have the necessary data pulled, I then have to split it into 2 samples: one with the total revenue of each USA employee and one with the total revenue of each UK employee. The code below filters the dataframe down to the rows with a given country and keeps only the revenue column.

USA = employee_revenue[employee_revenue.Country == 'USA']['Total_Revenue']
UK  = employee_revenue[employee_revenue.Country == 'UK']['Total_Revenue']
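
Here is a short sketch of what that filtering does, using a made-up dataframe (the revenue figures are hypothetical, not the Northwind results). The comparison produces a True/False mask with one entry per row, and the mask is then used to pick out the matching rows. A quick groupby summary is a handy sanity check before running any test.

```python
import pandas as pd

# Hypothetical per-employee totals, for illustration only
employee_revenue = pd.DataFrame({
    "EmployeeId": [1, 2, 3, 4, 5],
    "Country": ["USA", "USA", "USA", "UK", "UK"],
    "Total_Revenue": [120.0, 150.0, 135.0, 80.0, 95.0],
})

# Boolean mask: True for the rows where Country is 'USA'
is_usa = employee_revenue["Country"] == "USA"
USA = employee_revenue.loc[is_usa, "Total_Revenue"]
UK = employee_revenue.loc[employee_revenue["Country"] == "UK", "Total_Revenue"]

# Size, mean, and standard deviation of each group
summary = employee_revenue.groupby("Country")["Total_Revenue"].agg(["count", "mean", "std"])
print(summary)
```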

Hypothesis Test:

Now that I have my data split into the two groups that I want to compare, it’s time to perform the hypothesis test! First, I must choose which type of test is best to use. There are a few widely used tests to choose from, all of which determine if there are statistically significant differences between the groups tested:

T-Test – useful for determining whether the means of 2 small samples indicate different underlying population parameters
Z-Test – used when you want to know if your sample comes from a certain population
ANOVA test – useful for comparing more than 2 groups

For the purposes of our example, a t-test makes the most sense to use. Within the t-test, there are two common versions used:

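Student’s t-test - assumes the two samples have equal variances, and is most dependable when the sample sizes are similar too
Welch’s t-test - does not assume equal variances, so it is the safer choice when variances or sample sizes are unequal

To determine which we should use, we can compare the variances and then the sample sizes of our two groups to get either a ‘True’ or ‘False’ response. Keep in mind that an exact equality check is only a rough guide, since two sample variances will almost never match exactly.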

import numpy as np

# np.var uses the population formula (ddof=0) by default
print('Are variances equal?:',np.var(USA) == np.var(UK))
print('Are sample sizes equal?:',len(USA) == len(UK))
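
Since an exact equality check will almost always say the variances differ, a more formal option is Levene's test, which is available as scipy.stats.levene. To be clear, this is a technique I am swapping in for illustration, not something the analysis above relies on, and the numbers below are made up. With only a handful of observations per group the test has little power, so many people simply default to Welch's t-test anyway.

```python
import numpy as np
from scipy import stats

# Hypothetical revenue samples, for illustration only
usa = np.array([120.0, 150.0, 135.0, 160.0, 110.0])
uk = np.array([80.0, 95.0, 90.0, 85.0])

# Sample variances (ddof=1), which are rarely exactly equal
var_usa = np.var(usa, ddof=1)
var_uk = np.var(uk, ddof=1)
print("Sample variances:", var_usa, var_uk)

# Levene's test: H0 is that the groups have equal variances
levene_result = stats.levene(usa, uk)
print("Levene statistic:", levene_result.statistic, "p-value:", levene_result.pvalue)
```

A small p-value here would be evidence that the variances differ. A large one does not prove they are equal, which is another reason Welch's version is a sensible default.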

In our case, both of these return ‘False’ values (meaning they are unequal), so a Welch’s t-test should be used. Next, we need to determine if we should use a 1-tailed or 2-tailed test. A 1-tailed test should be used when the region of rejection is on just one side of the distribution (e.g. USA revenue is greater than UK revenue), while a 2-tailed test has the region of rejection on both sides (e.g. USA revenue is EITHER greater OR less than UK revenue). We will use a 1-tailed test here since our alternative hypothesis is only about USA employee revenue being greater than UK employee revenue.
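
To show what Welch's t-test and the two kinds of tails actually compute, here is a sketch that works through the numbers by hand. The data are hypothetical, and the degrees of freedom use the Welch-Satterthwaite approximation. The 1-tailed p-value is the area in the upper tail beyond our t-statistic, while the 2-tailed p-value counts both tails.

```python
import numpy as np
from scipy import stats

# Hypothetical revenue samples, for illustration only
usa = np.array([120.0, 150.0, 135.0, 160.0, 110.0])
uk = np.array([80.0, 95.0, 90.0, 85.0])

n1, n2 = len(usa), len(uk)
# Squared standard error contributed by each group
se1 = usa.var(ddof=1) / n1
se2 = uk.var(ddof=1) / n2

# Welch's t-statistic: difference in means over the combined standard error
t_stat = (usa.mean() - uk.mean()) / np.sqrt(se1 + se2)

# Welch-Satterthwaite approximation of the degrees of freedom
dof = (se1 + se2) ** 2 / (se1 ** 2 / (n1 - 1) + se2 ** 2 / (n2 - 1))

p_one_tailed = stats.t.sf(t_stat, dof)           # area in the upper tail only
p_two_tailed = 2 * stats.t.sf(abs(t_stat), dof)  # area in both tails

print("t:", t_stat, "df:", dof)
print("1-tailed p:", p_one_tailed, "2-tailed p:", p_two_tailed)
```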

To perform the 1-sided Welch’s t-test, I will use the stats.ttest_ind() function. To use this function, I first need to import the stats module from SciPy.

from scipy import stats

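I can then perform my test and print my results. Passing the ‘equal_var = False’ parameter tells the function not to assume equal variances, which makes it run Welch’s t-test. By default the function returns a 2-sided p-value, so to get the 1-sided p-value I divide it by 2. This shortcut is only valid when the t-statistic points in the direction of the alternative hypothesis (here, a positive t-statistic, meaning the USA mean is higher).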

result2 = stats.ttest_ind(USA,UK, equal_var = False)
print('t-statistic:',result2[0],'p-value:',result2[1]/2)
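
As an alternative to halving the p-value by hand, SciPy 1.6 and later accept an `alternative` argument in stats.ttest_ind(). The sketch below uses made-up data, and it assumes your SciPy version is recent enough. It shows that `alternative='greater'` matches the halved 2-sided p-value when the t-statistic is positive, and that it still gives the right answer when the groups are passed in the opposite order.

```python
from scipy import stats

# Hypothetical revenue samples, for illustration only
usa = [120.0, 150.0, 135.0, 160.0, 110.0]
uk = [80.0, 95.0, 90.0, 85.0]

two_sided = stats.ttest_ind(usa, uk, equal_var=False)
one_sided = stats.ttest_ind(usa, uk, equal_var=False, alternative="greater")

# Halving only works when the t-statistic is in the hypothesized direction
if two_sided.statistic > 0:
    halved = two_sided.pvalue / 2
else:
    halved = 1 - two_sided.pvalue / 2

print("1-sided p-value from SciPy:", one_sided.pvalue)
print("1-sided p-value by halving:", halved)

# Swapping the groups asks "is UK greater than USA?", which the data do not support
reversed_test = stats.ttest_ind(uk, usa, equal_var=False, alternative="greater")
print("1-sided p-value with groups swapped:", reversed_test.pvalue)
```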

Interpret Results:

We get a p-value of ~0.001596, which is less than our alpha=.05 significance level, meaning we can reject our null hypothesis that there is no difference between the revenues generated by USA employees and UK employees. The data therefore support the alternative hypothesis that USA employees generate higher total revenues than UK employees.

The last thing I am going to do is look at the effect size to help us understand the practical significance of our results. In other words, how meaningful is the difference between our two groups? To understand the effect size, I will use Cohen’s d, which expresses the difference between the two group means in units of standard deviation. Larger values of Cohen’s d indicate greater separation between the two groups.

The formula for Cohen’s d is: 𝑑 = (mean of group 1 - mean of group 2) / pooled standard deviation

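import numpy as np

def Cohen_d(group1, group2):

    # Difference between the two group means
    diff = group1.mean() - group2.mean()

    n1, n2 = len(group1), len(group2)
    var1 = group1.var()  # pandas uses the sample variance (ddof=1) by default
    var2 = group2.var()

    # Calculate the pooled variance, weighting each group by its size
    # (a common variant weights by n - 1 instead, which gives a slightly different d)
    pooled_var = (n1 * var1 + n2 * var2) / (n1 + n2)
    
    # Calculate Cohen's d statistic
    d = diff / np.sqrt(pooled_var)
    
    return d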

Cohen_d(USA,UK)
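
For comparison, here is a sketch of the variant that weights each group's variance by its degrees of freedom (n - 1), which is how the pooled standard deviation is often defined. The function name and the toy numbers are my own, chosen so the arithmetic is easy to follow by hand. On small samples it will give a somewhat different value than the function above.

```python
import numpy as np

def cohen_d_pooled(group1, group2):
    """Cohen's d using the degrees-of-freedom-weighted pooled variance."""
    g1 = np.asarray(group1, dtype=float)
    g2 = np.asarray(group2, dtype=float)
    n1, n2 = len(g1), len(g2)

    # Sample variances (ddof=1)
    var1 = g1.var(ddof=1)
    var2 = g2.var(ddof=1)

    # Pooled variance weighted by each group's degrees of freedom
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)

    return (g1.mean() - g2.mean()) / np.sqrt(pooled_var)

# Toy data: means 4 and 2, sample variances 4 and 1, pooled variance 2.5
d_example = cohen_d_pooled([2, 4, 6], [1, 2, 3])
print(round(d_example, 4))  # 1.2649
```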

The general rule-of-thumb guidelines for interpreting d are:

Small effect = .2
Medium effect = .5
Large effect = .8
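
These thresholds can be turned into a small helper for labeling a result. This is only a sketch: the function name is my own, the "negligible" label for values below .2 is an assumption on my part, and it uses the absolute value so that the direction of the difference does not matter.

```python
def label_effect_size(d):
    """Map a Cohen's d value to a rule-of-thumb label."""
    size = abs(d)  # only the magnitude matters for the label
    if size < 0.2:
        return "negligible"  # assumed label for values under the 'small' threshold
    if size < 0.5:
        return "small"
    if size < 0.8:
        return "medium"
    return "large"

print(label_effect_size(0.3))   # small
print(label_effect_size(2.85))  # large
```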

We got a very large Cohen’s d value of ~2.85, indicating that the difference between our two samples is large in practical terms as well as statistically significant.

And with that, we have finished our hypothesis test! We now have evidence to support our hypothesis that USA employees generate higher total revenues than UK employees.