Jose Wilhelm Exploratory Data Analysis Project

What is BRFSS?

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Source: Duke University Data and Visualization Services

Assignment

This project, from the Introduction to Probability and Data with R course on Coursera, consists of three parts:

Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).
Research questions: Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcomed to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.
EDA: Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Additionally, the formatting, organization, and readability of the project is taken into consideration.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(scales)
library(GGally)

Load data

load("brfss2013.RData")

Part 1: Data

The observations in this study are gathered through landline and cellphone interviews, where each household is randomly sampled from all US households. Since random sampling was used, the conclusions can be generalized to the US adult population, or at least to adults in the US who live in a household and have a landline and/or cellphone.

Since this is an observational study and not an experiment, random assignment was not used. We therefore cannot draw causal conclusions from the analysis; we can only describe associations.

Part 2: Research questions

Research question 1: Are people who eat dark green vegetables more likely to have lower BMI?

We want to know whether the BMI and weight of people who eat dark green vegetables frequently differ from those of people who do not.

Research question 2: Is income level related to the amount of hours of sleep per night?

We want to explore whether a relationship exists between income level and hours of sleep.

Research question 3: Is smoking tobacco products related to emotional well-being?

We want to know whether an association exists between smoking tobacco products and emotional health.

Part 3: Exploratory data analysis

Research question 1: Are people who eat dark green vegetables more likely to have lower BMI?

To answer this question we will examine three variables: grenday_, _bmi5cat, and _bmi5. Because of R naming rules, however, the last two names become X_bmi5cat and X_bmi5 respectively. We will store these variables in a data frame called vegbmi for ease of use in this research question. For analysis purposes, we drop the NA values:

# Creating Dataframe with non NAs for question 1
vegbmi <- brfss2013 %>% select(X_bmi5cat, X_bmi5, grenday_) %>%
  filter(!is.na(grenday_), !is.na(X_bmi5cat), !is.na(X_bmi5))

grenday_: It represents the number of times a person eats dark green vegetables per day on average. Discrete numerical variable.
X_bmi5cat: Can take values of ‘underweight’, ‘normal weight’, ‘overweight’, and ‘obese’. Ordinal categorical variable.
X_bmi5: BMI of a person. Continuous numerical variable.
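The renaming mentioned above can be reproduced with base R’s make.names(), which is presumably how the X prefix came about when the data were imported (an assumption; the import step is not shown in this report). A minimal sketch:

```r
# Column names as they appear in the BRFSS documentation (taken from the text above)
raw_names <- c("_bmi5cat", "_bmi5", "grenday_")

# make.names() converts strings into syntactically valid R names;
# a name that starts with "_" gets an "X" prepended
r_names <- make.names(raw_names)
print(r_names)
```

Names that are already valid, such as grenday_, are left unchanged.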

There’s a problem with the data, though. According to the BRFSS documentation, some calculated variables such as grenday_ and X_bmi5 are stored with two implied decimal places: the value is rounded to two decimals and multiplied by 100, so no decimal point appears in the data. For example, a value of 450 in grenday_ means that the person eats dark green vegetables 4.50 times per day, not 450 times. To fix this, we divide all grenday_ and X_bmi5 observations by 100 so that the summaries and graphs are on the real scale:

# Transforming grenday_ and X_bmi5 variables
vegbmi <- vegbmi %>% mutate(grenday_ = grenday_ / 100, X_bmi5 = X_bmi5 / 100)
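To see what this rescaling does, here is a small sketch on made-up values in the stored format. The toy data frame is an illustration only and does not contain real BRFSS rows; it uses base R’s transform() rather than mutate() so that it runs without extra packages.

```r
# Toy values with two implied decimal places (NOT real BRFSS observations)
toy <- data.frame(grenday_ = c(450, 43, 0),
                  X_bmi5   = c(2780, 2665, 3112))

# Divide by 100 to recover the real scale
toy_scaled <- transform(toy,
                        grenday_ = grenday_ / 100,
                        X_bmi5   = X_bmi5 / 100)
print(toy_scaled)
```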

If we examine the range of X_bmi5 we will notice something funny:

# Look for outliers
range(vegbmi$X_bmi5)

## [1]  0.01 97.69

0.01 and 97.69 are extreme outliers. BMI values normally range from about 18 to 35, and a value as low as 12 or as high as 60 is extremely unlikely, so we remove observations outside the 12 to 60 range. Likewise, there are extreme outliers in grenday_ (more than 20 times per day) that we also remove.

# Filter extreme data
vegbmi <- vegbmi %>% filter(X_bmi5 <= 60, X_bmi5 >= 12)
vegbmi <- vegbmi %>% filter(grenday_ <= 20)
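Before discarding rows it can be useful to count how many the filter removes. The sketch below applies the same cutoffs to a made-up data frame (the values are hypothetical, chosen only to exercise each condition) using base R logical indexing.

```r
# Toy data on the real scale (NOT real BRFSS observations)
toy <- data.frame(X_bmi5   = c(0.01, 18.7, 26.65, 41.2, 97.69),
                  grenday_ = c(0.43, 0.50, 25.00, 1.0, 0.20))

# Same cutoffs as above: BMI between 12 and 60, at most 20 servings per day
keep <- toy$X_bmi5 >= 12 & toy$X_bmi5 <= 60 & toy$grenday_ <= 20

n_dropped <- sum(!keep)   # rows that fail at least one condition
toy_clean <- toy[keep, ]  # rows that pass all conditions
print(n_dropped)
print(toy_clean)
```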

Now we can begin to analyze our data.

Let’s take a preliminary look at our variables:

# Show descriptive statistics for grenday_
vegbmi %>% summarise(veg_avg = mean(grenday_), veg_sd = sd(grenday_), 
  veg_med = median(grenday_), veg_iqr = IQR(grenday_))

##     veg_avg    veg_sd veg_med veg_iqr
## 1 0.5477564 0.5698253    0.43    0.69

Most people eat dark green vegetables less than once a day or never: the median is 0.43 times per day. Eating them at least twice a day is unusual, since 2 lies about 2.5 standard deviations above the mean. Because the distribution is right-skewed (the mean exceeds the median and values cannot go below zero), this standard-deviation argument is only a rough guide.
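How far two servings per day lies from the mean can be checked directly from the summary output above. A short worked calculation, using the printed mean and standard deviation:

```r
# Values copied from the summary output above
veg_avg <- 0.5477564
veg_sd  <- 0.5698253

# Distance of 2 servings per day from the mean, in standard deviations
z_two_per_day <- (2 - veg_avg) / veg_sd
print(round(z_two_per_day, 2))
```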

# Show descriptive statistics for X_bmi5
vegbmi %>% 
  summarise(bmi_avg = mean(X_bmi5), bmi_sd = sd(X_bmi5), 
            bmi_med = median(X_bmi5), bmi_iqr = IQR(X_bmi5))

##    bmi_avg   bmi_sd bmi_med bmi_iqr
## 1 27.80245 5.993559   26.65    7.15

The mean BMI of the sample (27.8) falls in the overweight range, and one standard deviation on either side of the mean (roughly 21.8 to 33.8) spans the normal weight, overweight, and obese categories. The underweight range (BMI below 18.5) lies more than 1.5 standard deviations below the mean, so relatively few respondents are expected to fall in it.
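To relate the mean BMI to the categories, the sketch below classifies it with cut(). The cutoffs 18.5, 25, and 30 are the standard adult BMI thresholds; that X_bmi5cat was built with exactly these breaks is an assumption here, not something verified against the BRFSS codebook.

```r
# Standard adult BMI cutoffs (assumed to match how X_bmi5cat was defined)
bmi_breaks <- c(-Inf, 18.5, 25, 30, Inf)
bmi_labels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Values copied from the summary output above
bmi_avg <- 27.80245
bmi_sd  <- 5.993559

# Category of the mean BMI; right = FALSE makes intervals of the form [a, b)
category_of_mean <- as.character(
  cut(bmi_avg, breaks = bmi_breaks, labels = bmi_labels, right = FALSE)
)

# Distance of the underweight threshold from the mean, in standard deviations
z_underweight <- (18.5 - bmi_avg) / bmi_sd

print(category_of_mean)
print(round(z_underweight, 2))
```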

# BMI categories bar plot
ggplot(data = vegbmi) + 
  geom_bar(aes(x = X_bmi5cat, y = (..count../sum(..count..)), fill=X_bmi5cat)) +
  scale_y_continuous(labels = percent) +
  scale_fill_manual(values = c("darkgoldenrod2", "cornflowerblue", "coral2", "darkseagreen")) +
  ggtitle("Body Mass Index (BMI)", "Population proportions") +
  ylab("Percentage") + 
  xlab("BMI Category")

We can see that only a tiny fraction of observations fall in the ‘underweight’ category; the vast majority of individuals are normal weight, overweight, or obese.

Now, turning to the research question, we would like to explore the relationship between dark green vegetable consumption and a person’s BMI.

# Average statistics on grenday_ variable per BMI level
vegbmi %>% group_by(X_bmi5cat) %>% summarise(veg_mean = mean(grenday_), 
  veg_sd = sd(grenday_), veg_med = median(grenday_), veg_iqr = IQR(grenday_))

## # A tibble: 4 x 5
##   X_bmi5cat     veg_mean veg_sd veg_med veg_iqr
##   <fct>            <dbl>  <dbl>   <dbl>   <dbl>
## 1 Underweight      0.558  0.690    0.43    0.86
## 2 Normal weight    0.600  0.599    0.43    0.83
## 3 Overweight       0.540  0.550    0.43    0.57
## 4 Obese            0.497  0.547    0.33    0.53

The relationship between these two variables is not very clear from the summarized data. The mean of grenday_ does appear to be highest for normal weight individuals, and the variability tends to decrease as BMI increases. Let’s look at a plot:

# Dark green vegetables consumption vs BMI plot
ggplot(data = vegbmi) + 
  geom_smooth(aes(x = X_bmi5, y = grenday_), color = "cornflowerblue") +
  ggtitle("Dark green vegetables consumption and BMI") +
  ylab("Dark green vegetables portions per day") +
  xlab("Body Mass Index")

Dark green vegetable consumption starts off low for underweight individuals, reaches its maximum for normal weight individuals, and then decreases slowly through the overweight and obese BMI categories.

This is consistent with the group means above: the greatest average dark green vegetable consumption comes from normal weight individuals with a BMI a bit past 20, where the curve peaks. The curve also suggests that overweight and obese individuals eat more dark green vegetables on average than individuals at the very lowest BMI values. The group means do not show this for the underweight category as a whole (0.558, versus 0.540 for overweight and 0.497 for obese), so the dip is probably driven by the small number of observations at the low end of the BMI range.

Even though an association seems to exist, as indicated by the data and the plot, the difference in mean dark green vegetable consumption is very small: about 0.3 times per day (0.6 - 0.3, the highest and lowest points in the graph). Since this is daily consumption, another way of picturing the difference is to translate it to a weekly one. Multiplying by 7 gives 0.3 × 7 = 2.1, which suggests that the largest difference in consumption is about 2 extra portions of dark green vegetables per week.
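The daily-to-weekly conversion is a one-line calculation. The two endpoints below are the approximate highest and lowest points read off the graph, as stated in the text:

```r
# Approximate highest and lowest points of the smoothed curve (read from the plot)
daily_high <- 0.6
daily_low  <- 0.3

daily_diff  <- daily_high - daily_low  # difference in servings per day
weekly_diff <- daily_diff * 7          # same difference per week
print(weekly_diff)
```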

Research question 2: Is income level related to the amount of hours of sleep per night?

The variables needed for this research question are sleptim1 and X_incomg. Again, the original name of X_incomg is _incomg, but it is changed because of R naming rules for variables. For convenience, the data frame we will use in this research question is called slepincom. We first get rid of NA values.

# Create dataframe with no NA values for question 2
slepincom <- brfss2013 %>% select(sleptim1, X_incomg) %>% 
  filter(!is.na(sleptim1), !is.na(X_incomg))

sleptim1: Average number of hours of sleep in a 24-hour period. Discrete numerical variable (recorded in whole hours).
X_incomg: Income level. Ordinal categorical variable.

There’s some cleaning we need to do to the data first.

# Look for outliers
range(slepincom$sleptim1)

## [1]   1 103

The maximum value of this variable can’t go past 24, so we filter out those impossible values.

# Filter outliers
slepincom <- slepincom %>% filter(sleptim1 <= 24)

Let’s explore both variables in depth.

First, to analyze relative frequencies, let’s observe how many observations we have in our cleaned-up data frame.

# Dimensions of the dataframe
dim(slepincom)

## [1] 415921      2

We have 415921 observations. We will use that number to calculate the relative frequencies of X_incomg.

# Frequency of each income level
slepincom %>% group_by(X_incomg) %>% 
  summarise(percent = round((n() / 415921), 2))

## # A tibble: 5 x 2
##   X_incomg                     percent
##   <fct>                          <dbl>
## 1 Less than $15,000               0.12
## 2 $15,000 to less than $25,000    0.18
## 3 $25,000 to less than $35,000    0.12
## 4 $35,000 to less than $50,000    0.15
## 5 $50,000 or more                 0.43
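Relative frequencies can also be computed without typing the row count by hand, which avoids a stale denominator if the filtering changes. A minimal base R sketch on made-up income labels (the four observations are hypothetical):

```r
# Toy income levels (NOT real BRFSS observations)
income_levels <- c("Less than $15,000",
                   "$15,000 to less than $25,000",
                   "$50,000 or more")
toy_income <- factor(c("Less than $15,000",
                       "$50,000 or more",
                       "$50,000 or more",
                       "$15,000 to less than $25,000"),
                     levels = income_levels)

# table() counts each level; prop.table() divides by the total count
income_share <- round(prop.table(table(toy_income)), 2)
print(income_share)
```

In the dplyr pipeline above, the same idea would be to divide n() by nrow(slepincom) instead of the literal 415921.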

The largest share of respondents (43%) falls in the ‘$50,000 or more’ category, followed by ‘$15,000 to less than $25,000’ (18%).

# Summary statistics for average hours of sleep
slepincom %>% summarise(slep_avg = mean(sleptim1), slep_sd = sd(sleptim1),
  slep_median = median(sleptim1), slep_iqr = IQR(sleptim1))

##   slep_avg  slep_sd slep_median slep_iqr
## 1 7.033819 1.450543           7        2

This data suggests that individuals get about 7 hours of sleep per night on average. The IQR of 2 hours around a median of 7 indicates that at least the middle 50% of respondents sleep within a 2-hour band, most likely between 6 and 8 hours.
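What the IQR does and does not tell us can be illustrated on a small made-up vector of whole-hour sleep times. The quartiles bound the middle half of the data, but with many tied values the share of observations between them can be larger than 50%:

```r
# Toy sleep times in whole hours (NOT real BRFSS observations)
toy_sleep <- c(4, 6, 6, 7, 7, 7, 8, 8, 10)

# First and third quartiles; their difference is the IQR
q <- quantile(toy_sleep, probs = c(0.25, 0.75))
toy_iqr <- IQR(toy_sleep)

# Share of observations lying between the quartiles (inclusive)
share_in_band <- mean(toy_sleep >= q[1] & toy_sleep <= q[2])

print(q)
print(toy_iqr)
print(round(share_in_band, 3))
```

The exact share therefore has to be computed from the data; it cannot be read off the IQR alone.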

Now, to tackle the research question directly, let’s make use of summary statistics.

# Summary statistics on average hours of sleep for each income level
slepincom %>% group_by(X_incomg) %>% summarise(avg_sleep = mean(sleptim1), 
  sd_sleep = sd(sleptim1), median_sleep = median(sleptim1), 
  iqr_sleep = IQR(sleptim1))

## # A tibble: 5 x 5
##   X_incomg                     avg_sleep sd_sleep median_sleep iqr_sleep
##   <fct>                            <dbl>    <dbl>        <int>     <dbl>
## 1 Less than $15,000                 6.91     1.98            7         2
## 2 $15,000 to less than $25,000      7.03     1.69            7         2
## 3 $25,000 to less than $35,000      7.07     1.47            7         2
## 4 $35,000 to less than $50,000      7.07     1.35            7         2
## 5 $50,000 or more                   7.05     1.16            7         2

Not much can be inferred from the median and IQR, which are identical across groups, and the average sleep of every income category is very close to 7 hours. The standard deviation, however, decreases steadily as income rises (from 1.98 to 1.16), which suggests a negative relationship between income level and variability in hours of sleep: the higher the income, the more closely individuals cluster around 7 hours of sleep. Let’s plot this relationship to get a better picture. In the code below, ..prop.. is mapped to y in aes() to show relative frequencies rather than counts, and facet_wrap() lets us generate a separate graph for every X_incomg category.

# Average hours of sleep for each income level plot
ggplot(data = slepincom, aes(x = sleptim1, y = ..prop..)) + 
  geom_bar(fill = "cornflowerblue") + 
  facet_wrap(~X_incomg) +
  scale_y_continuous(labels = percent) +
  ggtitle("How does sleep relate to income?", "Average amount of sleep vs income levels") +
  ylab("Percentage") +
  xlab("Average hours of sleep per night")

Examining the graph closely, we can see that the middle bar, which represents 7 hours of average sleep, grows as income level increases. The data therefore suggest that the higher the income level, the higher the proportion of individuals sleeping 7 hours on average.
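The pattern in the graph can also be checked numerically by computing, for each income level, the share of respondents who sleep exactly 7 hours. The sketch below does this with tapply() on a made-up data frame; the two income labels and all values are hypothetical stand-ins for slepincom.

```r
# Toy stand-in for slepincom (NOT real BRFSS observations)
toy <- data.frame(
  X_incomg = factor(c("low", "low", "low", "low",
                      "high", "high", "high", "high"),
                    levels = c("low", "high")),
  sleptim1 = c(5, 7, 9, 6, 7, 7, 8, 6)
)

# Share of respondents sleeping exactly 7 hours, per income level
share_7h <- tapply(toy$sleptim1 == 7, toy$X_incomg, mean)

# Spread of sleep hours, per income level
sd_by_income <- tapply(toy$sleptim1, toy$X_incomg, sd)

print(share_7h)
print(round(sd_by_income, 2))
```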

Research question 3: Is smoking tobacco products related to emotional well-being?

We will use the variables smoke100, misnervs, mishopls, misrstls, misdeprd, and miswtles. The data frame this time will be called smokement. We take care of getting rid of NA values first.

# Create data frame for question 3
smokement <- brfss2013 %>% 
  select(smoke100, misnervs, mishopls, misrstls, misdeprd, miswtles) %>%
  filter(!is.na(smoke100), !is.na(misnervs), !is.na(mishopls), 
    !is.na(misrstls), !is.na(misdeprd), !is.na(miswtles))

Since this research question involves many variables, we will not analyze each one in depth, to keep the discussion from becoming unnecessarily lengthy. Let’s skim through the variables to see what they measure:

smoke100: Has the individual smoked at least 100 cigarettes in their entire life? Possible answers: Yes/No. Categorical variable.
misnervs: How often felt nervous in the past 30 days. Ordinal categorical variable.
mishopls: How often felt hopeless in the past 30 days. Ordinal categorical variable.
misrstls: How often felt restless or fidgety in the past 30 days. Ordinal categorical variable.
misdeprd: How often felt depressed in the past 30 days. Ordinal categorical variable.
miswtles: How often felt worthless in the past 30 days. Ordinal categorical variable.

# Frequency of each level of feelings of nervousness
smokement %>% 
  group_by(misnervs) %>% 
  summarise(count = n(), percent = round(n() / nrow(smokement) * 100, 2))

## # A tibble: 5 x 3
##   misnervs count percent
##   <fct>    <int>   <dbl>
## 1 All        784    2.21
## 2 Most      1238    3.49
## 3 Some      5147   14.5 
## 4 A little 10565   29.8 
## 5 None     17737   50

The variable misnervs has only the five categories listed above. The variables mishopls, misrstls, misdeprd, and miswtles are categorized in the same way.

For this analysis, we will compare smoke100 with each of the other variables to look for an association between them. Let’s first examine its relationship with misnervs (feelings of nervousness).

# Smoking and feelings of nervousness plot
ggplot(data = smokement) +
  geom_bar(aes(smoke100, fill = smoke100, y = ..prop.., by = misnervs), stat = "prop") +
  scale_fill_manual(values = c("cornflowerblue", "coral3")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~misnervs, nrow = 1) +
  ggtitle("Smoking and frequency of feelings of nervousness", "How often did you feel nervous in the past 30 days?") +
  ylab("Percentage") +
  xlab("Smoked at least 100 cigarettes?")

The graph shows that, among respondents who felt nervous all or most of the time, a larger proportion have smoked at least 100 cigarettes, while among those who felt nervous a little or none of the time, a larger proportion have not. The two smoking categories are roughly even at ‘some’. It appears that smoking is associated with more frequent feelings of nervousness.
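A numerical counterpart to this graph is a table of row proportions: within each level of misnervs, the share of respondents who have and have not smoked. This assumes the plot’s percentages are computed within each feeling level, as the by aesthetic suggests. The data frame below is made up for illustration and uses only two of the five levels.

```r
# Toy stand-in for smokement (NOT real BRFSS observations)
toy <- data.frame(
  misnervs = factor(c("All", "All", "All", "None", "None", "None", "None"),
                    levels = c("All", "None")),
  smoke100 = factor(c("Yes", "Yes", "No", "Yes", "No", "No", "No"),
                    levels = c("Yes", "No"))
)

# margin = 1 makes each row (feeling level) sum to 1
smoke_by_feeling <- prop.table(table(toy$misnervs, toy$smoke100), margin = 1)
print(round(smoke_by_feeling, 2))
```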

Now smoke100 with mishopls (feelings of hopelessness).

# smoking and feelings of hopelessness plot
ggplot(data = smokement) +
  geom_bar(aes(smoke100, fill = smoke100, y = ..prop.., by = mishopls), stat = "prop") +
  scale_fill_manual(values = c("cornflowerblue", "coral3")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~mishopls, nrow = 1) +
  ggtitle("Smoking and frequency of feelings of hopelessness", "How often did you feel hopeless in the past 30 days?") +
  ylab("Percentage") +
  xlab("Smoked at least 100 cigarettes?")

This graph shows almost the same pattern as the previous one, and our previous statement still holds: smoking appears to be associated with more frequent feelings of hopelessness.

We will see that, from this point on, all the following graphs display the same relationship between the emotional health variable and smoke100.
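Because the same comparison is repeated for every emotional health variable, it can also be summarized in one pass. The sketch below loops over variable names with sapply() and computes the share of smokers among respondents who answered ‘All’. The data frame is made up and includes only two of the five variables.

```r
# Toy stand-in for smokement (NOT real BRFSS observations)
toy <- data.frame(
  smoke100 = c("Yes", "Yes", "No", "No", "Yes", "No"),
  misnervs = c("All", "All", "None", "None", "None", "All"),
  mishopls = c("All", "None", "None", "None", "All", "None"),
  stringsAsFactors = FALSE
)

feeling_vars <- c("misnervs", "mishopls")

# For each variable: among respondents answering "All", the share of smokers
smoker_share_all <- sapply(feeling_vars, function(v) {
  mean(toy$smoke100[toy[[v]] == "All"] == "Yes")
})
print(round(smoker_share_all, 2))
```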

Now smoke100 with misrstls (feelings of restlessness).

# Smoking and feelings of restlessness plot
ggplot(data = smokement) +
  geom_bar(aes(smoke100, fill = smoke100, y = ..prop.., by = misrstls), stat = "prop") +
  scale_fill_manual(values = c("cornflowerblue", "coral3")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~misrstls, nrow = 1) +
  ggtitle("Smoking and frequency of feelings of restlessness", "How often did you feel restless in the past 30 days?") +
  ylab("Percentage") +
  xlab("Smoked at least 100 cigarettes?")

Now smoke100 with misdeprd (feelings of depression).

# Smoking and feelings of depression plot
ggplot(data = smokement) +
  geom_bar(aes(smoke100, fill = smoke100, y = ..prop.., by = misdeprd), stat = "prop") +
  scale_fill_manual(values = c("cornflowerblue", "coral3")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~misdeprd, nrow = 1) +
  ggtitle("Smoking and frequency of feelings of depression", "How often did you feel depressed in the past 30 days?") +
  ylab("Percentage") +
  xlab("Smoked at least 100 cigarettes?")

And finally, smoke100 with miswtles (feelings of worthlessness).

# Smoking and feelings of worthlessness plot
ggplot(data = smokement) +
  geom_bar(aes(smoke100, fill = smoke100, y = ..prop.., by = miswtles), stat = "prop") +
  scale_fill_manual(values = c("cornflowerblue", "coral3")) +
  scale_y_continuous(labels = percent) +
  facet_wrap(~miswtles, nrow = 1) +
  ggtitle("Smoking and frequency of feelings of worthlessness", "How often did you feel worthless in the past 30 days?") +
  ylab("Percentage") +
  xlab("Smoked at least 100 cigarettes?")

Our final conclusion for this research question, as suggested by the graphs, is that individuals who have smoked at least 100 cigarettes appear to be less emotionally healthy than those who haven’t. These individuals seem to report more frequent feelings of nervousness, depression, hopelessness, restlessness, and worthlessness. Because the data are observational, this is an association and not evidence that smoking causes these feelings.