Essential Statistical Concepts

1. Power

Statistical power is the probability that a test will correctly reject a false null hypothesis (H0). It is influenced by the significance level (alpha), sample size, and effect size.

  • Type I error (α): The probability of rejecting H0 when it is true.

    • \(\alpha = P(\text{reject } H_0 \mid H_0 \text{ is true})\)

    • Fixed by the user (common values: 0.05, 0.01).

  • Type II error (β): The probability of not rejecting H0 when it is false.

    • \(\beta = P(\text{fail to reject } H_0 \mid H_0 \text{ is false})\)

    • Not fixed, depends on α, sample size, and effect size.

  • Power (1-β): The probability of correctly rejecting H0 when it is false.

    • \(1 - \beta = P(\text{reject } H_0 \mid H_0 \text{ is false})\)

Improving Power:

  • Increase sample size.

  • Increase the effect size (e.g., through a stronger manipulation or more precise measurement).

  • Increase the alpha level (at the cost of a higher Type I error rate).

  • Decrease error variance by blocking or using covariates (CV).

library(pwr)

# Solve for the per-group sample size: two-sample t-test,
# medium effect (d = 0.5), 80% power, alpha = 0.05
pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05, type = "two.sample")
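
To see how power responds to sample size directly, base R's power.t.test() can be evaluated at a few per-group sizes (the sizes below are arbitrary; delta = 0.5 with sd = 1 matches the standardized effect used above):

# Power of a two-sample t-test at several per-group sample sizes
sapply(c(20, 50, 100), function(n)
  power.t.test(n = n, delta = 0.5, sd = 1, sig.level = 0.05)$power)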

2. Effect Size

Effect size measures the strength of the relationship between variables. It is crucial for understanding the practical significance of research findings.

  • Types: Cohen’s d, Pearson’s r, eta squared (η²), partial eta squared.

  • Use in ANOVA: η² and partial η² measure the proportion of variance explained by an independent variable.

library(effsize)

# Calculate Cohen's d for two groups
cohen.d(mtcars$mpg[mtcars$cyl == 4], mtcars$mpg[mtcars$cyl == 6])

When to Use η²

  • Simple ANOVA: Use η² when you are conducting a simple one-way ANOVA or when your model includes only one independent variable.

  • Total Variance: Use η² if you are interested in understanding the proportion of total variance explained by an independent variable, including both the effect and error terms.

When to Use Partial η²

  • Complex Models: Use partial η² when you have multiple independent variables or a factorial design. Partial η² provides a clearer picture of the effect size for each variable, controlling for the variance explained by other variables in the model.

  • Control for Other Variables: Use partial η² when you want to control for other factors in the model, thus isolating the effect of a specific variable.

  • Repeated Measures and Mixed Designs: Use partial η² in repeated measures or mixed-design ANOVA, where you want to account for variability within subjects or across other factors.
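
Either measure can be computed directly from a fitted ANOVA. A minimal sketch using the effectsize package (assuming it is installed) and an illustrative two-factor model on mtcars:

library(effectsize)

# Two-factor ANOVA, then eta squared vs. partial eta squared for each term
model_aov <- aov(mpg ~ factor(cyl) + factor(am), data = mtcars)
eta_squared(model_aov, partial = FALSE)
eta_squared(model_aov, partial = TRUE)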

3. Sample Size

Determining the appropriate sample size is essential to ensure sufficient power and reliable results. I highly recommend calculating your sample size before collecting data (it will save you a lot of time and energy!).

  • Depends on: Desired power, effect size, significance level.

  • Sample size calculators: pwr package in R.

# Same call as in the power analysis above: with power specified, pwr solves for the per-group n
pwr.t.test(d = 0.5, power = 0.8, sig.level = 0.05, type = "two.sample")
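
For designs with more than two groups, other pwr functions (with the package already loaded above) work the same way; for example, pwr.anova.test() solves for the per-group n of a one-way ANOVA. The k = 3 groups and f = 0.25 (a medium effect) below are just illustrative values:

# Per-group n for a one-way ANOVA with 3 groups, medium effect, 80% power
pwr.anova.test(k = 3, f = 0.25, sig.level = 0.05, power = 0.8)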

4. Missing Values

Why Missing Data Occurs:

  1. Data Collection Errors: Mistakes or malfunctions during data collection processes.

  2. Nonresponse: Participants choose not to respond to certain questions.

  3. Loss of Data: Data may be lost due to technical issues or other unforeseen circumstances.

Types of Missing Data:

  1. MCAR (Missing Completely at Random): The probability of missingness is the same across all observations.

  2. MAR (Missing at Random): The probability of missingness is related to observed data but not the missing data itself.

  3. MNAR (Missing Not at Random): The probability of missingness is related to the missing data itself.
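
The mechanism can never be fully verified from the observed data alone, but inspecting where values are missing is a useful first step. A minimal sketch on the airquality dataset (used again below), with md.pattern() from the mice package:

library(mice)

# Count missing values per column, then tabulate the missingness patterns
colSums(is.na(airquality))
md.pattern(airquality)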

Implications

  1. Loss of Information: Removing rows with missing data can lead to a significant reduction in the dataset size, potentially losing valuable information.

  2. Bias: If the data is not missing completely at random, deleting rows with missing values can introduce bias into the analysis.

  3. Reduced Statistical Power: Smaller sample sizes due to deletion of rows can lead to reduced power to detect effects.

How You Should Handle Missing Data

Listwise Deletion

Involves removing any row with missing data. It is simple but often not recommended unless data is MCAR.

data("airquality")

# Drop every row that contains at least one NA
clean_data <- na.omit(airquality)

Mean/Median Imputation

Replaces missing values with the mean or median of the observed data. It is simple but can underestimate variability.

# Replace missing Ozone values with the observed mean (this shrinks the variance)
airquality$Ozone[is.na(airquality$Ozone)] <- mean(airquality$Ozone, na.rm = TRUE)

Multiple Imputation

Creates multiple datasets with imputed values and combines the results to account for the uncertainty of the missing data.

library(mice)

# Five imputed datasets via predictive mean matching (pmm); keep the first completed one
imputed_data <- mice(airquality, m = 5, method = 'pmm', seed = 500)
complete_data <- complete(imputed_data, 1)
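
When the imputations are used for inference, the usual workflow is not to analyze a single completed dataset but to fit the model to every imputed dataset and pool the results; with() and pool() from mice handle this (the lm() formula below is only an illustration):

# Fit the same model to each of the 5 imputed datasets and pool the estimates
fit <- with(imputed_data, lm(Ozone ~ Wind + Temp))
summary(pool(fit))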

Model-Based Methods

You can use models to predict and fill in missing values. This can provide more accurate imputations by leveraging relationships between variables.

library(missForest)

# Model-based imputation using random forests
imputed_data_rf <- missForest(airquality)
complete_data_rf <- imputed_data_rf$ximp
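
missForest also returns an out-of-bag estimate of the imputation error, which can serve as a rough quality check:

# Out-of-bag imputation error (NRMSE for continuous variables)
imputed_data_rf$OOBerror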

5. Confounding Variables

Confounding variables are extraneous variables that correlate with both the independent and dependent variables, potentially leading to biased results.

  • Blocking: Group subjects into blocks based on confounding variables (e.g., sex, age group). The blocking variable must be categorical; for continuous confounders, use ANCOVA instead (sketches of both follow this list).

  • Covariate (CV): Adjust group means for a measured continuous variable, such as a pretest score, that is related to the outcome but not of primary interest (ANCOVA).
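
As a quick sketch of blocking, include the block as a factor in the model; the built-in npk dataset (a classic agricultural experiment with a block factor) is used here purely for illustration:

# Randomized block design: block absorbs field-to-field variation before testing N
summary(aov(yield ~ block + N, data = npk))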

library(car)

# Example: ANCOVA with weight (wt) as the covariate and transmission (am) as the group factor
model <- lm(mpg ~ wt + factor(am), data = mtcars)
Anova(model, type = "III")

6. Fixed vs Random Effects

  • Fixed effects: Levels of the independent variables (IVs) are deliberately selected by the researcher, and conclusions apply only to those levels.

    • ex: Researcher selects specific treatment levels of interest.

  • Random effects: Levels are randomly sampled from a larger population so that findings generalize beyond the levels actually studied.

    • ex: Subjects drawn at random from a population (as in the model below).

library(lme4)

# Mixed-effects model: Days as the fixed effect, a random intercept for each Subject
model <- lmer(Reaction ~ Days + (1|Subject), data = sleepstudy)
summary(model)
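
For contrast, treating Subject as a fixed effect would estimate a separate parameter for each specific subject instead of a variance component; a sketch of that specification, shown only to highlight the difference:

# Fixed-effects alternative: one coefficient per subject, no generalization beyond them
model_fixed <- lm(Reaction ~ Days + Subject, data = sleepstudy)
summary(model_fixed)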