Jose Wilhelm Statistical Inference Project

What is GSS?

The General Social Survey (GSS) is a sociological survey created and regularly collected since 1972 by the National Opinion Research Center at the University of Chicago. The GSS collects information and keeps a historical record of the concerns, experiences, attitudes, and practices of residents of the United States. The GSS aims to gather data on contemporary American society to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.

Source: The GSS page on Wikipedia.

Assignment

This project, from the Inferential Statistics course on Coursera, consists of four parts:

Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality.)
Research Question: Come up with a research question that you want to answer and perform inference using the data.
Exploratory Data Analysis (EDA): Perform exploratory data analysis that addresses the research question outlined above. The EDA should contain numerical summaries and visualizations.
Inference: Perform a hypothesis test on the research question to support your conclusions. The hypotheses should be clearly stated and match the research question, the conditions for the hypothesis test are checked in context of the data, the appropriate method is stated and described, the conclusions are correctly interpreted, and if applicable, include the confidence interval and comment on agreement of the results.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(scales)
library(GGally)

Load data

load("gss.Rdata")

Part 1: Data

The researchers use random sampling: respondents are randomly selected from households across the United States. Most of the data were obtained through face-to-face interviews, although telephone interviews were used for individuals who had difficulty scheduling an in-person interview. Computer-assisted personal interviewing began with the 2002 GSS.

Since random sampling was used, conclusions drawn from the analysis of these data can be generalized to the US population. However, since the study is purely observational (a survey, not an experiment with random assignment), we cannot infer causality from the data.

Part 2: Research question

Research question: Is the proportion of women who work higher from 2000 onward than it was before 2000?

We want to assess whether there is enough evidence that women have taken a more active role in the workforce in recent years (2000 and later) than in previous years (before 2000), which would be consistent with progress in gender equality.

Part 3: Exploratory data analysis

We will make use of three variables for this research: year, sex, and wrkstat. We will create a dataset called womwork for this specific research question. We begin by analyzing and discussing all variables.

# Check the time scope of the data
range(gss$year)

## [1] 1972 2012

Looking first at the year variable, we can confirm that the data cover the years 1972 to 2012. We want to split respondents into two groups: those surveyed before 2000 and those surveyed in 2000 or later. For convenience, we will label these groups by century (XX and XXI) in a new variable in our womwork dataset.

# Check the levels of sex categories in the data
str(gss$sex)

##  Factor w/ 2 levels "Male","Female": 2 1 2 2 2 1 1 1 2 2 ...

There are only 2 possible categories for this variable: male and female. We are only interested in the “female” category for this research question.

# Total number of respondents
n_total = gss %>% nrow()

# Explore the distribution of working categories
gss %>% group_by(wrkstat) %>% 
  summarise(count = n(), percent = round((n()/n_total*100), 2))

## # A tibble: 9 x 3
##   wrkstat          count percent
##   <fct>            <int>   <dbl>
## 1 Working Fulltime 28207   49.4 
## 2 Working Parttime  5842   10.2 
## 3 Temp Not Working  1213    2.13
## 4 Unempl, Laid Off  1873    3.28
## 5 Retired           7642   13.4 
## 6 School            1751    3.07
## 7 Keeping House     9387   16.4 
## 8 Other             1132    1.98
## 9 <NA>                14    0.02

There are many different categories in this variable. For analysis purposes, we will filter out the “NA”, “School”, and “Retired” categories because they are out of scope for this research question; they don’t represent a meaningful population for our analysis. Additionally, we will create a new variable called working in our womwork dataset that groups the respondents into these categories:

“yes”: “Working Fulltime” and “Working Parttime”.
“no”: “Temp Not Working”, “Unempl, Laid Off”, “Keeping House”, and “Other”.

Now we create our womwork dataset, applying the filters and groupings discussed above.

#Create data set with required variables and filters
womwork <- gss %>% select(year, sex, wrkstat) %>% 
  filter(sex == 'Female', !is.na(wrkstat), wrkstat != 'School', wrkstat != 'Retired') 

#add century variable in data set to separate respondents by pre and post year 2000
womwork <- womwork %>% mutate(century = ifelse(year < 2000, 'XX', 'XXI'))

#add working variable to separate respondents by either working or not working
womwork <- womwork %>% 
  mutate(working = ifelse(wrkstat == 'Working Fulltime' | wrkstat == 'Working Parttime', 'yes', 'no'))
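The recoding rules can be sanity-checked on a few made-up rows before trusting them on the full data. The sketch below is illustrative only: the `toy` data frame and its values are hypothetical, and it uses base R with `%in%` rather than the dplyr pipeline above, so it runs without `gss.Rdata`.

```r
# Toy stand-in for the three gss columns used here (values are made up).
toy <- data.frame(
  year    = c(1999, 2000, 2012, 1985, 2004, 1972),
  sex     = c("Female", "Female", "Female", "Female", "Female", "Male"),
  wrkstat = c("Working Fulltime", "Keeping House", "Working Parttime",
              "Unempl, Laid Off", "Retired", "Working Fulltime"),
  stringsAsFactors = FALSE
)

# Same filters as above: women only, dropping NA, School and Retired.
keep <- toy$sex == "Female" & !is.na(toy$wrkstat) &
  !(toy$wrkstat %in% c("School", "Retired"))
toy <- toy[keep, ]

# Same recoding rules: the year 2000 itself falls in the 'XXI' group.
toy$century <- ifelse(toy$year < 2000, "XX", "XXI")
toy$working <- ifelse(toy$wrkstat %in% c("Working Fulltime", "Working Parttime"),
                      "yes", "no")

# Cross-tabulate to confirm each wrkstat level lands in the intended group.
table(toy$wrkstat, toy$working)
toy[, c("year", "century")]
```

Running the same `table()` call on womwork itself would confirm that every remaining wrkstat level maps to exactly one value of working.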

Now we are able to explore our data set variables and perform analysis.

#Number of respondents in the XX century
numXX <- womwork %>% filter(century == 'XX') %>% nrow()

#Summary statistics for respondents working status proportion from XX century
womwork %>% filter(century == 'XX') %>% group_by(working) %>%
  summarise(proportion = round((n()/numXX*100), 2))

## # A tibble: 2 x 2
##   working proportion
##   <chr>        <dbl>
## 1 no            43.7
## 2 yes           56.3

#Number of respondents in the XXI century
numXXI <- womwork %>% filter(century == 'XXI') %>% nrow()

#Summary statistics for respondents working status proportion from XXI century
womwork %>% filter(century == 'XXI') %>% group_by(working) %>%
  summarise(proportion = round((n()/numXXI*100), 2))

## # A tibble: 2 x 2
##   working proportion
##   <chr>        <dbl>
## 1 no            32.1
## 2 yes           67.9

There seems to be a difference in the proportion of women working between the two periods: an increase of more than 10 percentage points. However, to determine whether this difference is statistically significant, we will run a hypothesis test later.
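Both sets of proportions, and the gap between them, can also be obtained in one step from a 2x2 table of counts. This is a minimal base R sketch, not the method used above; the counts are copied from the success-failure table in Part 4, and the names `counts`, `props`, and `diff_pp` are introduced here for illustration.

```r
# Counts of women by group, copied from the success-failure table in Part 4.
counts <- matrix(c(8212, 10572,
                   2748,  5803),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(century = c("XX", "XXI"),
                                 working = c("no", "yes")))

# Row proportions: share of 'no' and 'yes' within each century group.
props <- prop.table(counts, margin = 1)
round(100 * props, 2)

# Observed increase in the 'yes' share, in percentage points.
diff_pp <- 100 * (props["XXI", "yes"] - props["XX", "yes"])
round(diff_pp, 2)
```

Expressing the gap in percentage points avoids confusion with a relative (percent) change in the working rate.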

We can observe this difference visually with this graph:

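# Share of women working within each century; stat = "prop" and the `by` aesthetic come from GGally
ggplot(data = womwork) + 
  geom_bar(aes(x = working, y = after_stat(prop), fill = working, by = factor(century)), stat = "prop") + 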
  scale_fill_manual(values = c("coral3", "cornflowerblue")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~century) +
  ggtitle("Change in work status of women", "Employment rate of women over the XX and XXI century") +
  ylab("Percent") +
  xlab("Actively working on a job?")

The difference is visible in the height of the “yes” bar, which is noticeably taller in the XXI panel than in the XX panel.

Part 4: Inference

State Hypotheses

Our null hypothesis states that the proportion of women working is the same before 2000 and from 2000 onward, and that the difference seen in the data is due to chance alone.

The alternative hypothesis states that the proportion of women working is higher from 2000 onward than it was before 2000.

‘p’, the probability of success, is the proportion of ‘yes’ responses in each group. Therefore,

H₀: p_XXI - p_XX = 0

H_A: p_XXI - p_XX > 0

Check conditions

We check for independence within and between groups.

Independence within groups: Sampled females are independent from each other and represent less than 10% of the whole US female population.
Independence between groups: There’s no reason to believe that the females sampled in one group are dependent on those sampled in the other.

To check if the success-failure condition holds for both groups, we will use the following table.

#success-failure condition for confidence intervals
womwork %>% group_by(century, working) %>% summarise(condition = n())

## # A tibble: 4 x 3
## # Groups:   century [2]
##   century working condition
##   <chr>   <chr>       <int>
## 1 XX      no           8212
## 2 XX      yes         10572
## 3 XXI     no           2748
## 4 XXI     yes          5803

Each cell in the condition column is greater than 10, hence the success-failure condition holds and we can assume the sampling distribution of the difference between the two proportions to be nearly normal.

We have checked the conditions required for the confidence interval. Next we check the success-failure condition for the hypothesis test, which requires calculating the pooled proportion, p_pool.

#number of successes
num_succ <- womwork %>% filter(working == 'yes') %>% nrow()

#total number of respondents
num_tot <- womwork %>% nrow()

#p_pool calculation
p_pool <- num_succ / num_tot

#show its value
p_pool

## [1] 0.5990488

Now that we have the value of p_pool, we can check the success-failure condition for the hypothesis test.

#success-failure condition for hypothesis tests
womwork %>% group_by(century) %>% 
  summarise(success = n() * p_pool, failure = n() * (1 - p_pool))

## # A tibble: 2 x 3
##   century success failure
##   <chr>     <dbl>   <dbl>
## 1 XX       11253.   7531.
## 2 XXI       5122.   3429.

Each cell in the table is greater than 10, so the success-failure condition holds and we can assume that, under the null hypothesis, the sampling distribution of the difference between the two sample proportions is nearly normal. Now we have checked all the necessary conditions to run a hypothesis test.

State the method(s) to be used and why and how

We use both a confidence interval and a hypothesis test for the difference between two independent proportions, since both variables are categorical. Because neither variable has more than two levels, a chi-square method is not needed. Moreover, we use theoretical rather than simulation-based methods because the success-failure condition is met.

Perform inference and interpret results

Hypothesis test

We use a default significance level of 5%. Since we are interested in whether the proportion increased, as represented by the greater-than sign in our alternative hypothesis, we run a one-sided hypothesis test.

#hypothesis test
inference(y = working, x = century, data = womwork, type = "ht", 
  statistic = "proportion", success = 'yes', order = c('XXI', 'XX'), 
  method = 'theoretical', null = 0, alternative = "greater")

## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels) 
## n_XXI = 8551, p_hat_XXI = 0.6786
## n_XX = 18784, p_hat_XX = 0.5628
## H0: p_XXI =  p_XX
## HA: p_XXI > p_XX
## z = 18.1146
## p_value = < 0.0001

The p-value is very low (less than 0.0001) and far below our significance level of 0.05, which means that we have very strong evidence against the null hypothesis, so we reject it in favor of the alternative. In other words, if the null hypothesis were true, the probability of observing a difference in proportions at least as large as the one in our data due to chance alone would be less than 0.01%.
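The z statistic reported above can be reproduced by hand from the group counts, which makes the role of the pooled proportion explicit. This is a sketch of the standard pooled two-proportion z-test in base R; it assumes this is the calculation behind the theoretical method used above, and the variable names are introduced here for illustration.

```r
# Group sizes and numbers of 'yes' responses, taken from the tables above.
n_xxi <- 8551;  x_xxi <- 5803
n_xx  <- 18784; x_xx  <- 10572

# Sample proportions of women working in each group.
p_hat_xxi <- x_xxi / n_xxi
p_hat_xx  <- x_xx / n_xx

# Under H0 both groups share one proportion, so the standard error
# is built from the pooled proportion.
p_pool  <- (x_xxi + x_xx) / (n_xxi + n_xx)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_xxi + 1 / n_xx))

# Test statistic and one-sided (upper-tail) p-value.
z <- (p_hat_xxi - p_hat_xx) / se_pool
p_value <- pnorm(z, lower.tail = FALSE)

round(z, 4)
p_value
```

The result should agree with the z value printed by inference() up to rounding.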

Confidence interval

We use a default confidence level of 95% for this analysis.

#Confidence interval
inference(y = working, x = century, data = womwork, type = "ci", 
  statistic = "proportion", success = 'yes', order = c('XXI', 'XX'),
  method = "theoretical")

## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels) 
## n_XXI = 8551, p_hat_XXI = 0.6786
## n_XX = 18784, p_hat_XX = 0.5628
## 95% CI (XXI - XX): (0.1036 , 0.128)

Our confidence interval is (0.1036, 0.128), so we are 95% confident that the increase in the proportion of women working from before 2000 to 2000 and later is between 10.36 and 12.8 percentage points. Additionally, because 0 is not contained in the confidence interval, there’s strong evidence that the two proportions of working women aren’t equal, which agrees with our decision to reject the null hypothesis.
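The interval can be checked by hand as well. Unlike the hypothesis test, the confidence interval uses the unpooled standard error, because no common proportion is assumed. The sketch below is in base R; the cross-check with `prop.test()` (from the stats package, with the continuity correction turned off) is an addition for illustration and is not part of the original analysis.

```r
# Group sizes and numbers of 'yes' responses, taken from the tables above.
n_xxi <- 8551;  x_xxi <- 5803
n_xx  <- 18784; x_xx  <- 10572

p_hat_xxi <- x_xxi / n_xxi
p_hat_xx  <- x_xx / n_xx

# Unpooled standard error of the difference in sample proportions.
se <- sqrt(p_hat_xxi * (1 - p_hat_xxi) / n_xxi +
           p_hat_xx  * (1 - p_hat_xx)  / n_xx)

# 95% interval: point estimate plus or minus z* times the standard error.
z_star   <- qnorm(0.975)
diff_hat <- p_hat_xxi - p_hat_xx
ci <- diff_hat + c(-1, 1) * z_star * se
round(ci, 4)

# Cross-check against base R's two-sample proportion test (no continuity correction).
check <- prop.test(x = c(x_xxi, x_xx), n = c(n_xxi, n_xx), correct = FALSE)
all.equal(ci, as.numeric(check$conf.int))
```

The hand-computed bounds should land close to the (0.1036, 0.128) interval reported by inference().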