The General Social Survey (GSS) is a sociological survey created and regularly collected since 1972 by the National Opinion Research Center at the University of Chicago. The GSS collects information and keeps a historical record of the concerns, experiences, attitudes, and practices of residents of the United States. The GSS aims to gather data on contemporary American society to monitor and explain trends and constants in attitudes, behaviors, and attributes; to examine the structure and functioning of society in general as well as the role played by relevant subgroups; to compare the United States to other societies to place American society in comparative perspective and develop cross-national models of human society; and to make high-quality data easily accessible to scholars, students, policy makers, and others, with minimal cost and waiting.
Source: The GSS page on Wikipedia.
This project, from the Inferential Statistics course on Coursera, consists of 4 parts:
Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality.)
Research Question: Come up with a research question that you want to answer and perform inference using the data.
Exploratory Data Analysis (EDA): Perform exploratory data analysis that addresses the research question outlined above. The EDA should contain numerical summaries and visualizations.
Inference: Perform a hypothesis test on the research question to support your conclusions. The hypotheses should be clearly stated and match the research question, the conditions for the hypothesis test are checked in context of the data, the appropriate method is stated and described, the conclusions are correctly interpreted, and if applicable, include the confidence interval and comment on agreement of the results.
The researchers in the study apply simple random sampling; they randomly select respondents in households across the United States. Most of the data were obtained through face-to-face interviews, although telephone interviews were used for individuals who had difficulty arranging an in-person interview. Computer-assisted personal interviewing began in the 2002 GSS.
Because random sampling was used, conclusions drawn from the analysis of these data can be generalized to the whole US population. However, because this research is purely observational (a survey) and not an experiment with random assignment, we cannot infer causality from the data.
Research question: Is the proportion of women who work higher from year 2000 onward than it was before year 2000?
We want to determine whether there is enough evidence that women have taken a more active role in the workforce in recent years (year 2000 and later) compared with previous years (before year 2000). Such an increase would be consistent with progress in gender equality, although an observational study like this one cannot establish its cause.
We will make use of three variables for this research: year, sex, and wrkstat. We will create a dataset called womwork for this specific research question. We begin by analyzing and discussing all variables.
## [1] 1972 2012
Looking first at the year variable, we can confirm that the survey covers the years 1972 to 2012. We want to split the respondents into two groups: those surveyed before year 2000, and those surveyed in year 2000 or later. For convenience, we label the two groups by century in a new variable in our womwork dataset.
## Factor w/ 2 levels "Male","Female": 2 1 2 2 2 1 1 1 2 2 ...
There are only 2 possible categories for this variable: male and female. We are only interested in the “female” category for this research question.
# Total number of respondents
n_total = gss %>% nrow()
# Explore the distribution of working categories
gss %>% group_by(wrkstat) %>%
summarise(count = n(), percent = round((n()/n_total*100), 2))
## # A tibble: 9 x 3
## wrkstat count percent
## <fct> <int> <dbl>
## 1 Working Fulltime 28207 49.4
## 2 Working Parttime 5842 10.2
## 3 Temp Not Working 1213 2.13
## 4 Unempl, Laid Off 1873 3.28
## 5 Retired 7642 13.4
## 6 School 1751 3.07
## 7 Keeping House 9387 16.4
## 8 Other 1132 1.98
## 9 <NA> 14 0.02
There are many different categories in this variable. For analysis purposes, we will filter out the “NA”, “School”, and “Retired” categories, because they are out of scope for this research question; they don’t represent a meaningful population for our analysis. Additionally, we will create a new variable called working in our womwork dataset that groups the respondents into two categories: yes for respondents working full time or part time, and no for all remaining respondents (temporarily not working, unemployed or laid off, keeping house, and other).
Now we create our womwork dataset, applying the filters and new variables discussed above.
# Create data set with required variables and filters
womwork <- gss %>% select(year, sex, wrkstat) %>%
  filter(sex == 'Female', !is.na(wrkstat), wrkstat != 'School', wrkstat != 'Retired')
# Add century variable to separate respondents into pre-2000 and 2000-onward groups
womwork <- womwork %>% mutate(century = ifelse(year < 2000, 'XX', 'XXI'))
# Add working variable to separate respondents into working and not working
womwork <- womwork %>%
  mutate(working = ifelse(wrkstat == 'Working Fulltime' | wrkstat == 'Working Parttime', 'yes', 'no'))
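The filtering and recoding rules above can be checked on a tiny example. The following is a minimal sketch in base R, assuming a made-up five-row data frame (`toy`) in place of the real `gss` data; the names `toy`, `keep`, and `toy_women` are hypothetical and not part of the original analysis.

```r
# Toy data standing in for gss (values are invented for illustration only)
toy <- data.frame(
  year    = c(1985, 1999, 2000, 2006, 2012),
  sex     = c("Female", "Female", "Female", "Male", "Female"),
  wrkstat = c("Working Fulltime", "Keeping House", "Working Parttime",
              "Working Fulltime", "Retired"),
  stringsAsFactors = FALSE
)

# Keep women whose work status is in scope (drop NA, School, Retired)
keep <- toy$sex == "Female" & !is.na(toy$wrkstat) &
  !(toy$wrkstat %in% c("School", "Retired"))
toy_women <- toy[keep, ]

# Years before 2000 are labelled 'XX'; year 2000 and later are labelled 'XXI'
toy_women$century <- ifelse(toy_women$year < 2000, "XX", "XXI")

# Full-time and part-time both count as working; everything else is 'no'
toy_women$working <- ifelse(
  toy_women$wrkstat %in% c("Working Fulltime", "Working Parttime"),
  "yes", "no")

toy_women[, c("year", "century", "working")]
```

Note that a respondent surveyed in year 2000 lands in the XXI group, which matches the `year < 2000` condition used for womwork.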
Now we are able to explore our data set variables and perform analysis.
#Number of respondents in the XX century
numXX <- womwork %>% filter(century == 'XX') %>% nrow()
#Summary statistics for respondents working status proportion from XX century
womwork %>% filter(century == 'XX') %>% group_by(working) %>%
summarise(proportion = round((n()/numXX*100), 2))
## # A tibble: 2 x 2
## working proportion
## <chr> <dbl>
## 1 no 43.7
## 2 yes 56.3
#Number of respondents in the XXI century
numXXI <- womwork %>% filter(century == 'XXI') %>% nrow()
#Summary statistics for respondents working status proportion from XXI century
womwork %>% filter(century == 'XXI') %>% group_by(working) %>%
summarise(proportion = round((n()/numXXI*100), 2))
## # A tibble: 2 x 2
## working proportion
## <chr> <dbl>
## 1 no 32.1
## 2 yes 67.9
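Both summaries can also be produced in one step. The sketch below assumes only the four cell counts reported in the success-failure table later in this report, so it runs without the `gss` data; the object names `counts` and `pct` are hypothetical.

```r
# Counts of women by century and working status, copied from the
# success-failure table in this report
counts <- matrix(c(8212, 10572,
                   2748,  5803),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(century = c("XX", "XXI"),
                                 working = c("no", "yes")))

# Row-wise proportions (margin = 1), expressed as percentages
pct <- round(prop.table(counts, margin = 1) * 100, 2)
pct

# Change in the share of working women, in percentage points
pct["XXI", "yes"] - pct["XX", "yes"]
```

The last line gives the gap between the two groups as a difference in percentage points, which is the quantity the hypothesis test examines.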
There seems to be a difference in the proportion of women working between the two centuries (an increase of more than 10 percentage points). However, to check whether this difference is statistically significant, we will run a hypothesis test later.
We can observe this difference visually with this graph:
# stat = "prop" and the by aesthetic are assumed to come from the GGally package;
# percent is a label formatter from the scales package
library(GGally)
library(scales)
ggplot(data = womwork) +
  geom_bar(aes(x = working, y = ..prop.., fill = working, by = factor(century)), stat = "prop") +
  scale_fill_manual(values = c("coral3", "cornflowerblue")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~century) +
  ggtitle("Change in work status of women", "Employment rate of women over the XX and XXI century") +
  ylab("Percent") +
  xlab("Actively working on a job?")
The difference shows up as the gap in height between the yes bars of the two panels.
Our null hypothesis states that the proportion of women working is the same before year 2000 and from year 2000 onward, and that the difference seen in the data is due to chance alone.
The alternative hypothesis states that the proportion of women working is higher from year 2000 onward than before year 2000.
‘p’, the probability of success, is the proportion of ‘yes’ responses in each group. Therefore,
H0: p_XXI - p_XX = 0
HA: p_XXI - p_XX > 0
We check for independence within and between groups.
Independence within groups: Respondents were randomly sampled, so the sampled females are independent of each other, and each group represents less than 10% of the whole US female population.
Independence between groups: There’s no reason to believe that the females sampled in one group are dependent on those sampled in the other.
To check if the success-failure condition holds for both groups, we will use the following table.
#success-failure condition for confidence intervals
womwork %>% group_by(century, working) %>% summarise(condition = n())
## # A tibble: 4 x 3
## # Groups: century [2]
## century working condition
## <chr> <chr> <int>
## 1 XX no 8212
## 2 XX yes 10572
## 3 XXI no 2748
## 4 XXI yes 5803
Each cell in the condition column is greater than 10, hence the success-failure condition holds and we can assume the sampling distribution of the difference between the two proportions to be nearly normal.
We have checked the necessary conditions for the confidence interval method. Now we check the success-failure condition for the hypothesis test, for which we need to calculate the pooled proportion, p_pool.
#number of successes
num_suss <- womwork %>% filter(working == 'yes') %>% nrow()
#total number of respondents
num_tot <- womwork %>% nrow()
#p_pool calculation
p_pool = num_suss / num_tot
#show its value
p_pool
## [1] 0.5990488
Now that we have the value of our p_pool variable, we can check the success-failure condition for the hypothesis test.
#success-failure condition for hypothesis tests
womwork %>% group_by(century) %>%
summarise(success = n() * p_pool, failure = n() * (1 - p_pool))
## # A tibble: 2 x 3
## century success failure
## <chr> <dbl> <dbl>
## 1 XX 11253. 7531.
## 2 XXI 5122. 3429.
Each cell in the table is greater than 10, so the success-failure condition holds and we can assume that, under the null hypothesis, the sampling distribution of the difference between the two proportions is nearly normal. We have now checked all the necessary conditions to run a hypothesis test.
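As a cross-check that does not need the `gss` data, the pooled proportion and the expected counts can be recomputed from the four cell counts reported above. This is a minimal sketch; the vector names `n`, `yes`, and `expected` are hypothetical.

```r
# Group sizes and number of working women per group, from the tables above
n   <- c(XX = 18784, XXI = 8551)
yes <- c(XX = 10572, XXI = 5803)

# Pooled proportion: all successes divided by all respondents
p_pool <- sum(yes) / sum(n)
p_pool

# Expected successes and failures in each group under the null hypothesis
expected <- rbind(success = n * p_pool,
                  failure = n * (1 - p_pool))
expected

# Success-failure condition: every expected count should be at least 10
all(expected >= 10)
```

The pooled proportion is used here because the null hypothesis assumes both groups share a single underlying proportion.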
We use both the confidence interval and the hypothesis test methods for two independent proportions, since both the response variable (working) and the explanatory variable (century) are categorical. Because neither variable has more than 2 levels, we don’t need a chi-square method. Moreover, we use theoretical methods rather than simulation methods since the success-failure condition is met.
We use a default significance level of 5%. Since we are interested in whether the proportion increased, as represented by the greater-than sign in our alternative hypothesis, we run a one-sided hypothesis test.
#hypothesis test
inference(y = working, x = century, data = womwork, type = "ht",
statistic = "proportion", success = 'yes', order = c('XXI', 'XX'),
method = 'theoretical', null = 0, alternative = "greater")
## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels)
## n_XXI = 8551, p_hat_XXI = 0.6786
## n_XX = 18784, p_hat_XX = 0.5628
## H0: p_XXI = p_XX
## HA: p_XXI > p_XX
## z = 18.1146
## p_value = < 0.0001
The p-value is very low (less than 0.0001), far below our significance level of 0.05, which means that we have very strong evidence against the null hypothesis, so we reject it in favor of the alternative. In other words, if the null hypothesis were true, the probability of observing a difference in the proportion of working women at least as large as the one in the data would be less than 0.01%.
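The reported test statistic can be reproduced by hand, which makes the role of the pooled proportion explicit. The sketch below assumes the standard pooled two-proportion z-test and uses only the counts shown in the output above; it then compares the result with base R's `prop.test()` without continuity correction, whose chi-square statistic should equal z squared. All object names are hypothetical.

```r
# Group sizes and number of working women, from the inference output above
n_XXI <- 8551;  yes_XXI <- 5803
n_XX  <- 18784; yes_XX  <- 10572

p_hat_XXI <- yes_XXI / n_XXI
p_hat_XX  <- yes_XX / n_XX

# Under H0 both groups share one proportion, so the standard error uses p_pool
p_pool  <- (yes_XXI + yes_XX) / (n_XXI + n_XX)
se_pool <- sqrt(p_pool * (1 - p_pool) * (1 / n_XXI + 1 / n_XX))

# z statistic and one-sided (upper tail) p-value
z <- (p_hat_XXI - p_hat_XX) / se_pool
p_value <- pnorm(z, lower.tail = FALSE)
z
p_value

# Cross-check with base R: the chi-square statistic is z squared
pt <- prop.test(x = c(yes_XXI, yes_XX), n = c(n_XXI, n_XX),
                alternative = "greater", correct = FALSE)
sqrt(unname(pt$statistic))
```

The hand-computed z should agree with the z = 18.1146 reported by inference() up to rounding.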
We use a default confidence level of 95% for this analysis.
#Confidence interval
inference(y = working, x = century, data = womwork, type = "ci",
statistic = "proportion", success = 'yes', order = c('XXI', 'XX'),
method = "theoretical")
## Response variable: categorical (2 levels, success: yes)
## Explanatory variable: categorical (2 levels)
## n_XXI = 8551, p_hat_XXI = 0.6786
## n_XX = 18784, p_hat_XX = 0.5628
## 95% CI (XXI - XX): (0.1036 , 0.128)
Our confidence interval is (0.1036, 0.128), so we are 95% confident that the true increase in the proportion of women working, from before year 2000 to year 2000 and later, is between 10.36 and 12.8 percentage points. Additionally, because 0 is not contained within the confidence interval, the interval agrees with the result of the hypothesis test: there’s strong evidence that the two proportions of working women aren’t equal.
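The interval can likewise be recomputed by hand. This is a minimal sketch, assuming the usual normal-approximation interval for a difference of two proportions, in which each group contributes its own sample proportion to the standard error (no pooling, since no null hypothesis is assumed). The object names are hypothetical.

```r
# Group sizes and number of working women, from the inference output above
n_XXI <- 8551;  yes_XXI <- 5803
n_XX  <- 18784; yes_XX  <- 10572

p_hat_XXI <- yes_XXI / n_XXI
p_hat_XX  <- yes_XX / n_XX

# Unpooled standard error of the difference in proportions
se_diff <- sqrt(p_hat_XXI * (1 - p_hat_XXI) / n_XXI +
                p_hat_XX  * (1 - p_hat_XX)  / n_XX)

# Critical value for a 95% confidence level
z_star <- qnorm(0.975)

# Point estimate plus or minus the margin of error
ci <- (p_hat_XXI - p_hat_XX) + c(-1, 1) * z_star * se_diff
round(ci, 4)
```

The rounded bounds should match the (0.1036, 0.128) interval reported by inference().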