This blog post explores a fundamental statistical tool: the Z test in R. The Z test is a hypothesis test that allows us to draw conclusions about population parameters based on sample data. This tutorial provides step-by-step guidance on conducting one-sample and two-sample Z tests in R, enabling you to make informed decisions and draw meaningful insights from your data.
Using the Z test in R, we can assess the significance of sample statistics and compare sample means. It is best suited to large samples or to cases where the population standard deviation is known.
Now, let us get into Z tests in R, exploring their applications and benefits for data-driven decision-making. The following sections take you through implementing and interpreting one-sample and two-sample Z tests in R.
Table of Contents
- Outline
- Prerequisites
- The Z-Test
- Example in Psychology
- Synthetic data
- One-sample Z-test in R
- One-Sample Z Test in R using the BSDA Package
- Two-Sample Z test in R using the BSDA Package
- Conclusion
- Resources
Outline
The outline of this post is to provide a comprehensive guide to Z-tests in R, covering both one-sample and two-sample scenarios. We will start with the prerequisites. Next, we will introduce the Z-test concept and its relevance in statistical analysis.
Next, we learn about the one-sample Z-test, outlining the hypotheses for this type of test. We explain the step-by-step process, including setting up the hypotheses, calculating the sample mean and standard deviation, computing the Z test statistic, and interpreting the results. A practical example in psychology illustrates its application in real-world scenarios.
Moving on, we explore the two-sample Z-test, explaining the hypotheses specific to this test. We provide an example in psychology that demonstrates how to conduct a two-sample Z-test and interpret the findings.
To facilitate practical learning, we generate synthetic data that mimic real-world scenarios. Next, we will look at calculating the Z statistic using base R. However, there is a more straightforward method: using the BSDA package. Therefore, we will also use this package to perform one-sample and two-sample Z-tests.
Prerequisites
To follow this post, you should have basic knowledge of R; having RStudio installed is recommended but optional. If you plan to generate the synthetic data for practice, you also need to install the dplyr package, a powerful tool for data-wrangling tasks such as selecting columns, renaming factor levels, adding new columns, and renaming columns in R. Updating R before installing any packages helps ensure compatibility and access to the latest features.
For those interested in utilizing the BSDA package for the Z-test examples, you will also need to install it. This package simplifies statistical testing and analysis, providing functions designed explicitly for Z-tests and other hypothesis tests.
The Z-Test
We can use the Z test as a statistical hypothesis test to assess the significance of sample statistics. Moreover, we can use it to make inferences about population parameters. It is widely employed when dealing with large sample sizes or when the population standard deviation is known. The Z test allows us to test hypotheses about population means and proportions, making it a valuable tool for drawing meaningful insights from data.
One Sample Z-Test:
In a one-sample Z test, we compare the mean of a sample to a known population mean or a hypothesized value. The null hypothesis (H0) states that the mean of the population the sample was drawn from equals the hypothesized value. The alternative hypothesis (Ha) states that this population mean differs from the hypothesized value.
Hypotheses for One-Sample Z-Test:
- Null Hypothesis (H0): The population mean equals the hypothesized value (μ = μ0).
- Alternative Hypothesis (Ha): The population mean is not equal to the hypothesized value (μ ≠ μ0).
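For reference, the test statistic behind these hypotheses can be written as below. Here x̄ is the sample mean, μ0 the hypothesized value, n the sample size, and σ the population standard deviation (assumed known, or approximated by the sample standard deviation when the sample is large). Under H0, z approximately follows a standard normal distribution.

```latex
z = \frac{\bar{x} - \mu_0}{\sigma / \sqrt{n}}
```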
Two Sample Z-Test
In a two-sample Z test, we compare the means of two independent samples to determine if there is a significant difference between them. This test is handy when comparing the means of two separate groups, such as treatment versus control or male versus female. Before-versus-after measurements on the same individuals are paired rather than independent, so they call for a paired test instead.
Hypotheses for Two Sample Z-Test:
- Null Hypothesis (H0): The population means of the two groups are equal (μ1 = μ2).
- Alternative Hypothesis (Ha): The population means of the two groups are not equal (μ1 ≠ μ2).
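The corresponding two-sample statistic can be sketched as follows. The symbol Δ0 is notation introduced here for the hypothesized difference in means (0 under the null hypothesis above); x̄1 and x̄2 are the sample means, σ1 and σ2 the population standard deviations, and n1 and n2 the sample sizes.

```latex
z = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}
```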
Example in Psychology
This section showcases how the Z test can be applied in psychology, with one example of a one-sample Z test and one example of a two-sample Z test.
One-Sample Z-Test in Psychology
Suppose a psychologist is conducting a study to determine if a new therapy significantly reduces anxiety levels in individuals. The psychologist collects a sample of 50 participants who have undergone the therapy and records their anxiety levels after the treatment. The psychologist wants to compare the mean anxiety level of the sample to the hypothesized value of the population’s mean anxiety level before the therapy, which is 75.
By performing a one-sample Z test, we can test the hypothesis that the therapy has no significant effect on anxiety levels (H0: μ = 75) against the alternative hypothesis that the treatment does have a significant effect (Ha: μ ≠ 75).
Two-Sample Z-Test in Psychology
Consider a different scenario where we are interested in comparing the anxiety levels of two groups of participants: those who received the new therapy and those who received standard therapy. We collect independent samples from each group and want to determine if there is a significant difference in anxiety levels between the two groups.
Using a two-sample Z test, we can test the null hypothesis that the population means of anxiety levels in both groups are equal (H0: μ1 = μ2) against the alternative hypothesis that the means are different (Ha: μ1 ≠ μ2).
In the next section, we will walk through how to conduct these one-sample and two-sample Z tests in R, demonstrating the practical implementation and interpretation of the results.
Synthetic data
Let us generate synthetic data for the one-sample and two-sample Z tests using the examples from the previous section. For simplicity, we will assume the data follows a normal distribution.
# Load dplyr
library(dplyr)
# Set seed for reproducibility
set.seed(123)
# Generate synthetic data for one-sample Z test
# Hypothesized mean anxiety level before therapy: 75
one_sample_data <- tibble(anxiety_levels = rnorm(50, mean = 75, sd = 10))
# Generate synthetic data for two-sample Z test
# Anxiety levels for group 1 (new therapy): mean = 70, standard deviation = 8
# Anxiety levels for group 2 (standard therapy): mean = 75, standard deviation = 9
group1_data <- tibble(anxiety_levels = rnorm(50, mean = 70, sd = 8))
group2_data <- tibble(anxiety_levels = rnorm(50, mean = 75, sd = 9))
# Combine the two groups into one dataset
two_sample_data <- bind_rows(group1_data %>% mutate(group = "Group 1"),
group2_data %>% mutate(group = "Group 2"))
In the code chunk above, we first load the dplyr package, which provides functions for data manipulation and analysis in R.
To ensure reproducibility, we set the seed value to 123 using set.seed(123). This allows us to generate the same random numbers each time we run the code.
Next, we generate synthetic data for a one-sample Z test by creating a tibble called one_sample_data. We use the rnorm() function to generate 50 random values from a normal distribution with a mean of 75 and a standard deviation of 10. These values represent the anxiety levels of the 50 participants in the sample. Because they are drawn with a mean equal to the hypothesized value of 75, the null hypothesis is true for these data by construction.
Following that, we generate synthetic data for a two-sample Z test. We create two separate tibbles called group1_data and group2_data. We use the rnorm() function for each group to generate 50 random values from a normal distribution with different means and standard deviations. The group1_data tibble represents the anxiety levels of individuals who received the new therapy, with a mean of 70 and a standard deviation of 8. The group2_data tibble represents the anxiety levels of individuals who received the standard therapy, with a mean of 75 and a standard deviation of 9.
After generating the two separate datasets for each group, we combine them into one dataset for the two-sample Z test using the bind_rows() function from dplyr. In this combined dataset, called two_sample_data, we introduce a new column called group using the mutate() function. The group column identifies the membership of each data point, where “Group 1” represents the individuals who received the new therapy, and “Group 2” represents those who received the standard therapy.
The resulting one_sample_data and two_sample_data datasets are now ready for further analysis and for conducting the one- and two-sample Z tests in R. If you have your own data, you can and should use it instead.
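If you prefer not to install dplyr, the same kind of data can be sketched in base R. This is a minimal alternative, not the approach used in the rest of the post: the names one_sample_base and two_sample_base are made up for illustration, and the result is a plain data frame rather than a tibble. Because the seed and the order of the rnorm() calls match the code above, the generated values should be the same.

```r
# Base R sketch: same seed and same order of rnorm() calls as the dplyr version
set.seed(123)

# One-sample data: 50 anxiety scores drawn around the hypothesized mean of 75
one_sample_base <- data.frame(anxiety_levels = rnorm(50, mean = 75, sd = 10))

# Two-sample data: both groups stacked in one data frame with a group label
two_sample_base <- data.frame(
  anxiety_levels = c(rnorm(50, mean = 70, sd = 8),   # new therapy
                     rnorm(50, mean = 75, sd = 9)),  # standard therapy
  group = rep(c("Group 1", "Group 2"), each = 50)
)

# Sanity checks: number of rows per group and the mean of each group
table(two_sample_base$group)
aggregate(anxiety_levels ~ group, data = two_sample_base, FUN = mean)
```

The last two lines are a quick way to confirm that each group has 50 observations and that the group means land near the values used to simulate them.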
One-sample Z-test in R
In this section, we will perform a one-sample Z test using the one_sample_data dataset. The goal is to determine if there is a significant difference between the mean anxiety level of the sample and the hypothesized value of 75, representing the population’s mean anxiety level before the therapy.
To conduct the one-sample Z test in R, we will follow these steps:
- Set up the hypotheses.
- Calculate the sample mean and sample standard deviation.
- Compute the Z test statistic.
- Determine the critical Z value or the p-value.
- Make a decision and interpret the results.
Let us proceed with the analysis:
Step 1: Set up the hypotheses
In the one-sample Z test, our null hypothesis (H0) states that the population mean equals the hypothesized value (μ = 75). The alternative hypothesis (Ha) states that the population mean differs from the hypothesized value (μ ≠ 75).
Step 2: Calculate the sample mean and sample standard deviation
Here is how to calculate the mean and standard deviation in R:
# Calculate the sample mean
sample_mean <- mean(one_sample_data$anxiety_levels)
# Calculate the sample standard deviation
sample_sd <- sd(one_sample_data$anxiety_levels)
In the code chunk above, we first calculate the sample mean and sample standard deviation of the anxiety levels in the one_sample_data dataset.
Here we use R’s $-operator, which allows us to access a specific column in the dataframe. To perform the calculations, we access the anxiety_levels column from the one_sample_data dataframe.
The mean() function calculates the sample mean of the anxiety levels, providing us with the average value of the data points. Similarly, the sd() function calculates the sample standard deviation, indicating the variation or spread in the data.
Step 3: Compute the Z Test statistic
Here is how to calculate the z-statistic in R:
# Hypothesized population mean
population_mean <- 75
# Sample size
sample_size <- nrow(one_sample_data)
# Calculate the Z test statistic
z_stat <- (sample_mean - population_mean) / (sample_sd / sqrt(sample_size))
In the code block above, we set the hypothesized population mean as population_mean, representing the value we want to compare with the sample mean. In this case, we assume the population mean to be 75, reflecting the hypothesized average anxiety level before the therapy.
Next, we calculate the sample size as sample_size. The nrow() function returns the number of rows in the one_sample_data dataframe, which corresponds to the number of observations. The sample size matters because it determines the precision of the estimate: the standard error shrinks as the sample grows.
With the population mean and sample size determined, we calculate the Z test statistic and store it in z_stat. It divides the difference between the sample mean and the hypothesized mean by the standard error of the mean, so it measures how many standard errors the sample mean lies from the hypothesized population mean. Note that we use the sample standard deviation in place of the population standard deviation, which is a reasonable approximation only for large samples.
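The calculation can also be wrapped in a small function so it is easy to reuse. The sketch below is illustrative only: z_statistic() is a hypothetical helper, not part of base R or BSDA, and the scores vector is toy data made up for the example. The sigma argument defaults to the sample standard deviation but can be set to a known population value.

```r
# Hypothetical helper: one-sample Z statistic from a numeric vector.
# sigma defaults to the sample SD, which is only a large-sample approximation.
z_statistic <- function(x, mu0, sigma = sd(x)) {
  standard_error <- sigma / sqrt(length(x))
  (mean(x) - mu0) / standard_error
}

# Toy data (made up for illustration, not the anxiety data from this post)
scores <- c(72, 78, 69, 81, 75, 74, 77, 70)

z_statistic(scores, mu0 = 75)              # sample SD used as sigma
z_statistic(scores, mu0 = 75, sigma = 10)  # assumed known population SD of 10
```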
Step 4: Determine the critical Z value or the p-value
For the one-sample Z test, we can determine the critical Z value using a significance level (α) and the standard normal distribution table or calculate the p-value directly. Here is a Z critical value table: https://www2.math.upenn.edu/~chhays/zscoretable.pdf. Alternatively, we can calculate the p-value ourselves:
# Calculate the corresponding p-value
p_value <- 2 * (1 - pnorm(abs(z_stat)))
# Display the p-value
print(p_value)
In the code block above, we calculate the p-value for the one-sample Z-test using the pnorm() function from base R.
To calculate the p-value, we use the formula 2 * (1 - pnorm(abs(z_stat))). Taking the absolute value with the abs() function lets us work with the upper tail regardless of the sign of the Z test statistic, which is valid because the standard normal distribution is symmetric around zero. The pnorm() function returns the cumulative probability up to a given Z value under the standard normal distribution, so 1 - pnorm(abs(z_stat)) is the probability in the upper tail.
The factor of 2 in the formula accounts for the two-tailed nature of the test. Since we are interested in the probability of observing a Z test statistic at least as extreme as the one we obtained in either tail of the distribution, we multiply the upper-tail probability by 2. After calculating the p-value, we store it in the variable p_value.
Lastly, we use the print() function to display the calculated p-value in the R console. A p-value below the chosen significance level (α) is evidence against the null hypothesis, suggesting that the population mean differs from the hypothesized value.
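Instead of looking up the critical Z value in a table, it can be computed with qnorm(), the quantile function of the normal distribution in base R. The sketch below assumes a significance level of 0.05, a conventional choice rather than something fixed by the test, and uses a made-up value z_example purely for illustration. It also shows an equivalent way to write the two-tailed p-value using the lower tail.

```r
alpha <- 0.05  # significance level (an assumed, conventional choice)

# Two-tailed critical value: reject H0 when |z| exceeds this
z_critical <- qnorm(1 - alpha / 2)
z_critical  # about 1.96

# Equivalent forms of the two-tailed p-value
z_example <- 0.5  # hypothetical Z statistic, for illustration only
p_upper_form <- 2 * (1 - pnorm(abs(z_example)))
p_lower_form <- 2 * pnorm(-abs(z_example))
all.equal(p_upper_form, p_lower_form)
```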
Step 5: Make a decision and interpret the results
Compare the calculated Z test statistic with the critical Z value or compare the p-value with the significance level (α) to decide on the null hypothesis.
If the calculated Z test statistic falls outside the critical Z values, or if the p-value is less than the significance level (α), we reject the null hypothesis and conclude that there is a significant difference between the sample mean anxiety level and the hypothesized population mean. Otherwise, if the calculated Z test statistic falls within the critical Z values, or the p-value is greater than the significance level (α), we fail to reject the null hypothesis: the data do not provide evidence of a difference between the sample mean and the hypothesized population mean. In our example, the Z test is not statistically significant.
By performing these steps, we can effectively conduct a one-sample Z test in R and draw meaningful conclusions about the population based on our sample data.
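The decision rule can be sketched in a few lines of base R. The function z_decision() below is a hypothetical helper written for this illustration, and the two Z values passed to it are made up. It applies both the critical-value rule and the p-value rule so you can see that they lead to the same decision for a two-tailed test.

```r
# Hypothetical helper: apply both decision rules to one or more Z statistics
z_decision <- function(z_stat, alpha = 0.05) {
  p_value <- 2 * pnorm(-abs(z_stat))
  data.frame(
    z = z_stat,
    p_value = p_value,
    reject_by_critical = abs(z_stat) > qnorm(1 - alpha / 2),
    reject_by_p = p_value < alpha
  )
}

# Two made-up Z statistics: one small, one large
z_decision(c(0.26, 2.5))
```

For the first value neither rule rejects the null hypothesis; for the second, both do.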
One-Sample Z Test in R using the BSDA Package
A more straightforward method is to install and use the BSDA package and its z.test() function:
library(BSDA)
# Extract sample mean, sample standard deviation, & sample size from the one_sample_data dataset
sample_mean <- mean(one_sample_data$anxiety_levels)
sample_sd <- sd(one_sample_data$anxiety_levels)
sample_size <- nrow(one_sample_data)
# Hypothesized population mean
population_mean <- 75
# Perform the one-sample Z-test and calculate the p-value
ztest_result <- z.test(x = one_sample_data$anxiety_levels,
sigma.x = sample_sd, mu = population_mean,
alternative = "two.sided")
print(ztest_result)
In the code chunk above, we first load the BSDA package using library(BSDA).
Next, we extract the necessary values for the one-sample Z-test from the one_sample_data dataset. Specifically, we calculate the sample mean, standard deviation, and sample size as described in the previous section. Of these, only sample_sd is passed on, because z.test() computes the mean and sample size from the data itself.
After extracting the data, we set the hypothesized population mean as population_mean, representing the value we want to compare with the sample mean.
Now, with all the required data and the population mean defined, we perform the one-sample Z-test in R using the z.test() function. We pass the sample data one_sample_data$anxiety_levels as x, the sample standard deviation sample_sd as sigma.x, and the hypothesized population mean population_mean as mu, and we set alternative to “two.sided” to indicate a two-tailed test.
Finally, we use the print() function to display the results of the Z-test stored in the ztest_result object. For simplicity, we will use the BSDA package to carry out the two-sample Z test.
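The output of z.test() also includes a confidence interval for the mean. To see where such an interval comes from, here is a minimal base R sketch. The scores vector is toy data made up for the example, and the population standard deviation of 10 is an assumption for illustration. With the same data and the same sigma.x, z.test() should report a matching interval.

```r
# Toy data (made up for illustration, not the anxiety data from this post)
scores <- c(72, 78, 69, 81, 75, 74, 77, 70)
sigma <- 10         # assumed known population standard deviation
conf_level <- 0.95  # confidence level

# Standard error of the mean and the margin of error
standard_error <- sigma / sqrt(length(scores))
margin <- qnorm(1 - (1 - conf_level) / 2) * standard_error

# Confidence interval: sample mean plus or minus the margin of error
ci <- c(lower = mean(scores) - margin, upper = mean(scores) + margin)
ci
```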
Two-Sample Z test in R using the BSDA Package
Here is how to carry out a two-sample Z test in R using the z.test() function:
# Load the BSDA package
library(BSDA)
# Split the anxiety levels into one numeric vector per group
group1 <- two_sample_data$anxiety_levels[two_sample_data$group == "Group 1"]
group2 <- two_sample_data$anxiety_levels[two_sample_data$group == "Group 2"]
# Hypothesized population mean difference (if any)
population_mean_diff <- 0
# Perform the two-sample Z-test and calculate the p-value
# (the sample standard deviations stand in for the population values)
ztest_result <- z.test(x = group1, y = group2,
                       sigma.x = sd(group1), sigma.y = sd(group2),
                       mu = population_mean_diff,
                       alternative = "two.sided")
# Display the results
print(ztest_result)
In the code above, we first load the BSDA package using library(BSDA).
Because z.test() expects a separate numeric vector for each sample, we then split the anxiety_levels column of two_sample_data by the group column, giving group1 (new therapy) and group2 (standard therapy).
Next, we set population_mean_diff to 0. Remember, this represents the hypothesized difference in population means between the two groups.
The z.test() function takes the two samples as inputs using x and y. In this case, x holds the anxiety levels of “Group 1” and y holds the anxiety levels of “Group 2”. The function also needs a standard deviation for each sample, supplied through sigma.x and sigma.y. Since the population standard deviations are unknown here, we use the sample standard deviations, which is a reasonable approximation for samples of this size.
We specify mu as the hypothesized population mean difference, 0 in this example since we want to test if the means differ. The alternative argument is set to “two.sided” to perform a two-tailed test.
The z.test() function will calculate the Z test statistic, the p-value, and the confidence interval for the mean difference between the two groups. The results are stored in the ztest_result object. Finally, we use the print() function to display the results of the two-sample Z-test.
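As a cross-check on the two-sample test, the statistic can also be computed by hand in base R. The sketch below is self-contained and simulates its own two groups, so the numbers will differ from those produced with two_sample_data. The names new_therapy and standard_therapy are made up for illustration, and the sample variances stand in for the unknown population variances.

```r
# Simulate two independent groups (fresh draws, for illustration only)
set.seed(123)
new_therapy <- rnorm(50, mean = 70, sd = 8)
standard_therapy <- rnorm(50, mean = 75, sd = 9)

# Standard error of the difference in means
se_diff <- sqrt(var(new_therapy) / length(new_therapy) +
                var(standard_therapy) / length(standard_therapy))

# Z statistic for a hypothesized difference of 0, and the two-tailed p-value
z_stat <- (mean(new_therapy) - mean(standard_therapy) - 0) / se_diff
p_value <- 2 * pnorm(-abs(z_stat))

c(z = z_stat, p_value = p_value)
```

Running the same vectors through z.test() with sigma.x and sigma.y set to the sample standard deviations should give the same statistic and p-value.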
Conclusion
In this post, you have learned how to carry out a Z test in R. We covered both one-sample and two-sample Z tests, with explanations and practical examples for each.
For one-sample Z tests, we explored setting up hypotheses, calculating sample statistics, and interpreting results. For two-sample Z tests, we compared means between groups. Because calculating the Z statistic and p-value in base R is a bit cumbersome, we also used the BSDA package, which simplifies the testing process.
Now that you have gained valuable insights into Z tests, you can confidently apply this knowledge. With it, you can draw meaningful conclusions from your data and make informed decisions.
Please share this post on social media. Feel free to comment below for any suggestions or help or to let me know if you found the tutorial helpful.
Resources
Here are some tutorials you might find helpful:
- Countif function in R with Base and dplyr
- How to Randomly Select Rows in R – Sample from Dataframe
- Mastering SST & SSE in R: A Complete Guide for Analysts
- How to do a Kruskal-Wallis Test in R
- Test for Normality in R: Three Different Methods & Interpretation
- Binning in R: Create Bins of Continuous Variables
- Coefficient of Variation in R