What is BRFSS?

The Behavioral Risk Factor Surveillance System (BRFSS) is a collaborative project between all of the states in the United States (US) and participating US territories and the Centers for Disease Control and Prevention (CDC). BRFSS is an ongoing surveillance system designed to measure behavioral risk factors for the non-institutionalized adult population (18 years of age and older) residing in the US. The BRFSS objective is to collect uniform, state-specific data on preventive health practices and risk behaviors that are linked to chronic diseases, injuries, and preventable infectious diseases that affect the adult population. Factors assessed by the BRFSS in 2013 include tobacco use, HIV/AIDS knowledge and prevention, exercise, immunization, health status, healthy days — health-related quality of life, health care access, inadequate sleep, hypertension awareness, cholesterol awareness, chronic health conditions, alcohol consumption, fruits and vegetables consumption, arthritis burden, and seatbelt use.

Source: Duke University Data and Visualization Services

Assignment

This project, from the Introduction to Probability and Data with R course on Coursera, consists of three parts:

  1. Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality).

  2. Research questions: Come up with at least three research questions that you want to answer using these data. You should phrase your research questions in a way that matches up with the scope of inference your dataset allows for. Make sure that at least two of these questions involve at least three variables. You are welcome to create new variables based on existing ones. With each question include a brief discussion (1-2 sentences) as to why this question is of interest to you and/or your audience.

  3. EDA: Perform exploratory data analysis (EDA) that addresses each of the three research questions you outlined above. Your EDA should contain numerical summaries and visualizations. Each R output and plot should be accompanied by a brief interpretation.

Additionally, the formatting, organization, and readability of the project are taken into consideration.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(scales)
library(GGally)

Load data

load("brfss2013.RData")

Part 1: Data

The observations in this study are gathered through telephone interviews over landlines and cell phones, with the households to be called selected at random from all US households. Because random sampling was used, the findings can be generalized to the US adult population, or more precisely to non-institutionalized adults living in the US who have a landline telephone or a cell phone.

Since this is an observational study and not an experiment, random assignment was not used, so we can’t infer a causal relationship from any association found in the data.


Part 2: Research questions

Research question 1: Are people who eat dark green vegetables more likely to have lower BMI?

We want to know whether the BMI and weight of people who eat dark green vegetables frequently differ from those of people who do not.

Research question 2: Is income level related to the number of hours of sleep per night?

We want to explore a possible existing relationship between income levels and hours of sleep.

Research question 3: Is smoking tobacco products related to emotional well-being?

We want to know whether an association exists between smoking tobacco products and emotional health.


Part 3: Exploratory data analysis

Research question 1: Are people who eat dark green vegetables more likely to have lower BMI?

To answer this question we will examine three variables: grenday_, _bmi5cat, and _bmi5. Because syntactic R names cannot start with an underscore, the last two appear in the data frame as X_bmi5cat and X_bmi5, respectively. For ease of use, we copy these variables into a data frame called vegbmi and drop the rows with NA values:

# Creating Dataframe with non NAs for question 1
vegbmi <- brfss2013 %>% select(X_bmi5cat, X_bmi5, grenday_) %>%
  filter(!is.na(grenday_), !is.na(X_bmi5cat), !is.na(X_bmi5))

  • grenday_: Number of times per day, on average, that a person eats dark green vegetables. Discrete numerical variable.
  • X_bmi5cat: Takes the values ‘underweight’, ‘normal weight’, ‘overweight’, and ‘obese’. Ordinal categorical variable.
  • X_bmi5: BMI of a person. Continuous numerical variable.
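
The renaming described above can be reproduced with base R's make.names(), which import functions commonly rely on to produce syntactically valid column names. This is a small illustrative sketch; how the course's brfss2013.RData file was actually built is an assumption here.

```r
# Raw BRFSS variable names as written in the text above
raw_names <- c("_bmi5cat", "_bmi5", "grenday_")

# make.names() prepends "X" to names that do not start with a letter
# (or a dot); names that are already valid are left unchanged.
valid_names <- make.names(raw_names)
print(valid_names)
```

A trailing underscore, as in grenday_, is allowed, which is why only the first two names change.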

There’s a problem with the data, though. According to the BRFSS documentation, some calculated variables, such as grenday_ and X_bmi5, are stored with two implied decimal places: the value is rounded to two decimals and then multiplied by 100 so that it can be stored as a whole number. A value of 450 in grenday_ therefore means that the person eats dark green vegetables 4.50 times per day, not 450. To fix this, we divide all grenday_ and X_bmi5 observations by 100 so that the summaries and graphs are on the correct scale:

# Transforming grenday_ and X_bmi5 variables
vegbmi <- vegbmi %>% mutate(grenday_ = grenday_ / 100, X_bmi5 = X_bmi5 / 100)

If we examine the range of X_bmi5, we notice something odd:

# Look for outliers
range(vegbmi$X_bmi5)
## [1]  0.01 97.69

The values 0.01 and 97.69 are extreme outliers. BMI values normally range from about 18 to 35, and a value below 12 or above 60 is extremely unlikely, so we remove observations outside that range. Likewise, there are extreme outliers in grenday_, so we keep only observations of at most 20 times per day.

# Filter extreme data
vegbmi <- vegbmi %>% filter(X_bmi5 <= 60, X_bmi5 >= 12)
vegbmi <- vegbmi %>% filter(grenday_ <= 20)
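
Before dropping rows, it can be useful to check how many observations the filters affect. The sketch below applies the same conditions to a small, made-up data frame (the values are hypothetical, not real BRFSS records) and counts the rows that would be removed.

```r
# Hypothetical toy values standing in for vegbmi (not real BRFSS data)
toy <- data.frame(
  X_bmi5   = c(0.01, 18.2, 26.7, 31.4, 97.69),
  grenday_ = c(0.43, 0.50, 25.00, 0.33, 1.00)
)

# TRUE for rows that pass both filters used above
keep <- toy$X_bmi5 >= 12 & toy$X_bmi5 <= 60 & toy$grenday_ <= 20

n_dropped     <- sum(!keep)   # number of rows that would be removed
share_dropped <- mean(!keep)  # the same, as a fraction of all rows
print(c(dropped = n_dropped, share = share_dropped))
```

With the real data, the same logical vector would be built from vegbmi before the filter() calls are run.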

Now we can begin to analyze our data.

Let’s take a preliminary look at our variables:

# Show descriptive statistics for grenday_
vegbmi %>% summarise(veg_avg = mean(grenday_), veg_sd = sd(grenday_), 
  veg_med = median(grenday_), veg_iqr = IQR(grenday_))
##     veg_avg    veg_sd veg_med veg_iqr
## 1 0.5477564 0.5698253    0.43    0.69

Most people eat dark green vegetables less than once a day or never. Given the variability, eating them at least twice a day is uncommon: two times a day is roughly 2.5 standard deviations above the mean. Because the distribution is right-skewed (the values cannot be negative and the standard deviation is about as large as the mean), this is only a rough indication.
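
How far "twice a day" lies from the mean can be checked directly from the summary values printed above. This is a quick arithmetic sketch; the variable names are introduced here for illustration.

```r
# Summary values copied from the output above
veg_avg <- 0.5477564
veg_sd  <- 0.5698253

# Distance of "twice a day" from the mean, in standard deviations
z_twice <- (2 - veg_avg) / veg_sd
print(round(z_twice, 2))  # about 2.55
```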

# Show descriptive statistics for X_bmi5
vegbmi %>% 
  summarise(bmi_avg = mean(X_bmi5), bmi_sd = sd(X_bmi5), 
            bmi_med = median(X_bmi5), bmi_iqr = IQR(X_bmi5))
##    bmi_avg   bmi_sd bmi_med bmi_iqr
## 1 27.80245 5.993559   26.65    7.15

It is interesting to note that both the mean (27.8) and the median (26.65) BMI fall in the overweight range, so the typical respondent in the sample is overweight, and that the range within one standard deviation of the mean (roughly 21.8 to 33.8) spans the normal weight, overweight, and obese categories. A randomly chosen person from the sample is also unlikely to fall in the underweight category, since the underweight threshold of 18.5 lies about 1.5 standard deviations below the mean.
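
The category reading of this summary can be checked with a few lines of base R. The sketch below assumes the standard adult BMI cutoffs (18.5, 25, and 30), which is an assumption about how X_bmi5cat was derived, and reuses the mean, median, and SD printed above.

```r
# Assumed cutoffs: standard adult BMI categories
# (underweight < 18.5, normal 18.5 to < 25, overweight 25 to < 30, obese >= 30)
bmi_breaks <- c(-Inf, 18.5, 25, 30, Inf)
bmi_labels <- c("Underweight", "Normal weight", "Overweight", "Obese")

# Summary values copied from the output above
bmi_avg <- 27.80245
bmi_med <- 26.65
bmi_sd  <- 5.993559

# right = FALSE makes each interval closed on the left: [18.5, 25), ...
categories <- cut(c(bmi_avg, bmi_med), breaks = bmi_breaks,
                  labels = bmi_labels, right = FALSE)
print(as.character(categories))

# Distance of the underweight cutoff from the mean, in standard deviations
z_underweight <- (18.5 - bmi_avg) / bmi_sd
print(round(z_underweight, 2))  # about -1.55
```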

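# BMI categories bar plot (share of respondents in each category)
# after_stat() replaces the deprecated ..count.. notation (ggplot2 >= 3.3.0)
ggplot(data = vegbmi) + 
  geom_bar(aes(x = X_bmi5cat, y = after_stat(count / sum(count)), fill = X_bmi5cat)) +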
  scale_y_continuous(labels = percent) +
  scale_fill_manual(values = c("darkgoldenrod2", "cornflowerblue", "coral2", "darkseagreen")) +
  ggtitle("Body Mass Index (BMI)", "Population proportions") +
  ylab("Percentage") + 
  xlab("BMI Category")

We can see that only a tiny fraction of observations fall in the ‘underweight’ category, and that the vast majority of individuals fall in the normal weight, overweight, or obese categories.

Now, returning to the question, we would like to explore the relationship between dark green vegetable consumption and a person’s BMI.
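
A numerical summary to accompany the bar plot could look like the following sketch. It uses a small, made-up factor in place of vegbmi$X_bmi5cat, so the printed shares are illustrative only.

```r
# Hypothetical toy categories standing in for vegbmi$X_bmi5cat
bmi_levels <- c("Underweight", "Normal weight", "Overweight", "Obese")
toy_cat <- factor(
  c("Normal weight", "Overweight", "Obese", "Overweight",
    "Normal weight", "Obese", "Overweight", "Underweight",
    "Normal weight", "Overweight"),
  levels = bmi_levels
)

# Counts per category, then each category's share of all observations
counts <- table(toy_cat)
shares <- prop.table(counts)
print(round(shares, 2))
```

With the real data, the same two calls would be applied to vegbmi$X_bmi5cat.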

# Average statistics on grenday_ variable per BMI level
vegbmi %>% group_by(X_bmi5cat) %>% summarise(veg_mean = mean(grenday_), 
  veg_sd = sd(grenday_), veg_med = median(grenday_), veg_iqr = IQR(grenday_))
## # A tibble: 4 x 5
##   X_bmi5cat     veg_mean veg_sd veg_med veg_iqr
##   <fct>            <dbl>  <dbl>   <dbl>   <dbl>
## 1 Underweight      0.558  0.690    0.43    0.86
## 2 Normal weight    0.600  0.599    0.43    0.83
## 3 Overweight       0.540  0.550    0.43    0.57
## 4 Obese            0.497  0.547    0.33    0.53

The relationship between these two variables is not very clear from the summarized data: the mean of grenday_ does appear to be highest for normal weight individuals, and the variability tends to decrease as BMI increases. Let’s take a closer look with a plot:

# Dark green vegetables consumption vs BMI plot
ggplot(data = vegbmi) + 
  geom_smooth(aes(x = X_bmi5, y = grenday_), color = "cornflowerblue") +
  ggtitle("Dark green vegetables consumption and BMI") +
  ylab("Dark green vegetables portions per day") +
  xlab("Body Mass Index")

Dark green vegetable consumption starts off low for underweight individuals, reaches its maximum for normal weight individuals, and then decreases slowly through the overweight and obese BMI categories.

This is consistent with our initial guess: the greatest average dark green vegetable consumption seems to come from normal weight individuals with a BMI a bit above 20, the point where the curve peaks. What might also come as a surprise is that the curve suggests that overweight and obese individuals tend to eat more dark green vegetables than individuals at the lowest BMI values. The grouped means above do not show this for the underweight category as a whole (0.558, compared with 0.540 for overweight and 0.497 for obese), so the pattern in the curve seems to be driven by the very low end of the BMI range.

Even though an association seems to exist, as suggested by the data and the plot, the difference in mean dark green vegetable consumption appears to be very small: only about 0.3 times per day (0.6 - 0.3, the highest and lowest points in the graph). However, since this is daily consumption, another way of picturing the difference is to translate it to a weekly one. Multiplying by 7 gives 0.3 * 7 = 2.1, which suggests that the largest difference in consumption is around 2 extra dark green vegetable portions per week.
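
The weekly conversion is a one-line calculation. The sketch below runs it for the two curve values quoted in the text and, for comparison, for the category means from the grouped summary; the variable names are introduced here for illustration.

```r
# Daily difference read off the smoothed curve (values quoted in the text)
daily_diff_curve  <- 0.6 - 0.3
weekly_diff_curve <- daily_diff_curve * 7
print(weekly_diff_curve)  # about 2.1 portions per week

# Same conversion using the category means from the grouped summary
veg_mean <- c(Underweight = 0.558, `Normal weight` = 0.600,
              Overweight = 0.540, Obese = 0.497)
weekly_diff_groups <- (max(veg_mean) - min(veg_mean)) * 7
print(round(weekly_diff_groups, 2))  # about 0.72 portions per week
```

The gap between category means is smaller than the gap between the extremes of the curve, because each category averages over a wide range of BMI values.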