Jose Wilhelm Linear Regression Project

Assignment

This project, from the Linear Regression and Modeling course on Coursera, consists of 6 parts:

Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability / causality.)
Research question: Come up with a research question about the data to perform a multiple linear regression on.
Exploratory Data Analysis (EDA): Perform exploratory data analysis that addresses the research question outlined above. The EDA should contain numerical summaries and visualizations.
Modeling: Develop a multiple linear regression model to predict a numerical variable in the dataset. Discuss which variables to consider and which to exclude. Carry out a model selection process and perform model diagnostics. Lastly, interpret the model coefficients.
Prediction: Pick a movie from 2016 (a new movie that is not in the sample) and do a prediction for this movie using the model you developed and the predict function in R. Also quantify the uncertainty around this prediction using an appropriate interval.
Conclusion: Summarize your findings from the previous sections and include ideas for possible future research.

Setup

Load packages

library(ggplot2)
library(dplyr)
library(statsr)
library(GGally)

Load data

load("movies.Rdata")

Part 1: Data

The dataset consists of 651 randomly selected movies produced and released before 2016. There are 32 variables in the dataset which include data from the movies like title, genre, runtime, main actors, director, year of release, or ratings.

Since random sampling was used, conclusions drawn from the dataset can be generalized to all movies released before 2016. Because this is an observational study and no experiment was conducted, we can’t infer causal relationships from this dataset.

Part 2: Research question

Research question: What variables are associated and might be used to predict the number of IMDB votes of a movie?

It may be interesting to find out which movie characteristics lead people to vote on IMDB, whether the vote is positive or negative.

Part 3: Exploratory data analysis

Our response variable of interest, and thus the final target of our analysis, is imdb_num_votes. Let’s examine this variable.

movies %>% summarise(median = median(imdb_num_votes), 
  Q1 = quantile(imdb_num_votes, 1/4), Q3 = quantile(imdb_num_votes, 3/4), 
  IQR = IQR(imdb_num_votes), avg = mean(imdb_num_votes), 
  min = min(imdb_num_votes), max = max(imdb_num_votes))

## # A tibble: 1 x 7
##   median    Q1     Q3   IQR    avg   min    max
##    <int> <dbl>  <dbl> <dbl>  <dbl> <int>  <int>
## 1  15116 4546. 58300. 53755 57533.   180 893008

These numerical summaries show that half of the movies have between 180 (the minimum) and 15116 (the median) votes, while the other half lie between the median and the maximum of 893008. The maximum is so large relative to the median, and the mean (57533) so far above it, that we can expect an extremely right-skewed distribution. The following graph confirms this (movies with 100000 votes or more were filtered out of the plot for readability):

movies %>% filter(imdb_num_votes < 100000) %>%
  ggplot(aes(x = imdb_num_votes)) + 
    geom_density(color = "cornflowerblue", fill = "cornflowerblue") + 
    ylab("Density") + 
    xlab("Number of votes") +
    ggtitle("IMDB number of votes distribution") +
    theme(axis.text.y = element_blank(), axis.ticks = element_blank())

This strong right skew makes sense: most movies have a number of votes close to the median, and only a few, the most popular ones, get an enormous number of votes, such as 500000 or more.
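A quick numerical companion to the density plot is the ratio of the mean to the median, which is about 3.8 here (57533 / 15116). The sketch below is an illustration only: `sim_votes` is simulated data standing in for imdb_num_votes, and `skew_ratio` is a hypothetical helper, not part of the analysis. It shows how that ratio flags right skew and how a log10 scale pulls it back toward 1.

```r
# Simulated right-skewed counts; a hypothetical stand-in for imdb_num_votes.
set.seed(1)
sim_votes <- round(rlnorm(1000, meanlog = 9.6, sdlog = 1.7))

# A mean well above the median is a quick numerical sign of right skew.
skew_ratio <- function(x) mean(x) / median(x)

raw_ratio <- skew_ratio(sim_votes)
log_ratio <- skew_ratio(log10(sim_votes + 1))  # +1 guards against log10(0)

c(raw = raw_ratio, log10 = log_ratio)
```

On the real data, the same idea would apply to `movies$imdb_num_votes`; a ratio near 1 on the log scale would suggest that a log transformation is worth considering.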

Part 4: Modeling

Variables selection

Our model will initially consist of several variables: title_type, genre, runtime, mpaa_rating, thtr_rel_year, a ratings variable like imdb_rating or audience_score, top200_box, and whether the movie, its director or its actors won or were nominated for an Oscar.

Some variables, like studio, the actor variables, or director, are not included in the initial model because they have so many distinct values that the model would become unwieldy:

#Number of levels in `studio`
str(movies$studio)

##  Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...

#Structure of `actor1` (character vector, one name per movie)
str(movies$actor1)

##  chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" ...

#Structure of `director` (character vector, one name per movie)
str(movies$director)

##  chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" ...

Other variables like title, imdb_url and rt_url are likewise not included because they don’t provide predictive information.

To build our initial model, we first need to weed out predictors that are highly correlated with each other (collinear), since collinearity makes the coefficient estimates unreliable. We begin by comparing the three score variables and looking at their correlations: imdb_rating, critics_score and audience_score.

#Compare the 3 score variables for correlation (columns 13, 16 and 18, selected by name)
ggpairs(movies, columns = c("imdb_rating", "critics_score", "audience_score"))

These variables are highly correlated with each other, so we keep only imdb_rating: it has the highest correlation with the other two, which makes it the most representative of the three.
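The correlations displayed by ggpairs can also be read as plain numbers with base R's `cor()`. The sketch below is a hedged illustration on simulated scores (the `scores` data frame and its values are made up, not taken from `movies`); it also picks the variable with the highest average correlation with the others, which is the criterion described above.

```r
# Simulated scores driven by one shared "quality" signal (illustration only).
set.seed(2)
quality <- rnorm(200)
scores <- data.frame(
  imdb_rating    = 6.5 + 1.0 * quality + rnorm(200, sd = 0.4),
  critics_score  = 58 + 25 * quality + rnorm(200, sd = 12),
  audience_score = 62 + 18 * quality + rnorm(200, sd = 8)
)

# Pairwise Pearson correlations; complete.obs drops rows with missing values.
cor_mat <- cor(scores, use = "complete.obs")
round(cor_mat, 2)

# Average correlation of each variable with the other two.
avg_cor <- (rowSums(cor_mat) - 1) / (ncol(cor_mat) - 1)
most_representative <- names(which.max(avg_cor))
most_representative
```

With the real data, the same call would take `movies[, c("imdb_rating", "critics_score", "audience_score")]` as input.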

To decide what to do with critics_rating and audience_rating, let’s compare them with their score counterpart:

#Comparing audience_rating with audience_score
ggplot(movies, aes(audience_rating, audience_score)) + 
  geom_boxplot(color = "black") +
  xlab("Audience rating category") +
  ylab("Audience score") +
  ggtitle("Comparing audience score and rating")

We can see that audience_score is a strong predictor of audience_rating: the two carry largely the same information, so we don’t include audience_rating in our initial model.

We apply the same procedure to critics_rating and critics_score:

#comparing critics_rating with critics_score
ggplot(movies, aes(critics_rating, critics_score)) + 
  geom_boxplot() +
  xlab("Critics rating category") +
  ylab("Critics score") +
  ggtitle("Comparing critics score and rating")

We reach the same conclusion for critics_score and critics_rating: because critics_score is a significant predictor of critics_rating, we choose the former over the latter.

This leaves critics_score and audience_score standing in for their rating counterparts and, as we saw in the ggpairs plot, imdb_rating in turn stands in for both scores.

For simplicity, we will create a variable in the dataset called oscar, with values yes or no. It is yes if any of best_pic_nom, best_pic_win, best_actor_win, best_actress_win or best_dir_win is yes (whether an actor, the director, or the movie itself won or was nominated for an Oscar), and no otherwise. We will use it in our model.

#implementing 'oscar' variable in the dataset
movies <- movies %>% 
  mutate(oscar = ifelse(best_pic_nom == 'yes' | 
  best_pic_win == 'yes' | best_actor_win == 'yes' | best_actress_win == 'yes' |
  best_dir_win == 'yes', 'yes', 'no'))
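It is worth checking a derived flag like this on a few rows before using it. The following sketch uses a made-up four-row data frame (`toy`, hypothetical values) with the same five award columns, builds the flag with a base R equivalent of the chained `|` conditions, and tabulates the result.

```r
# Toy rows with the five award columns from the report (values are made up).
award_cols <- c("best_pic_nom", "best_pic_win", "best_actor_win",
                "best_actress_win", "best_dir_win")
toy <- data.frame(
  best_pic_nom     = c("no", "yes", "no",  "no"),
  best_pic_win     = c("no", "no",  "no",  "no"),
  best_actor_win   = c("no", "no",  "yes", "no"),
  best_actress_win = c("no", "no",  "no",  "no"),
  best_dir_win     = c("no", "no",  "yes", "no")
)

# "yes" if any award column is "yes" in that row, otherwise "no".
toy$oscar <- ifelse(rowSums(toy[award_cols] == "yes") > 0, "yes", "no")

# Tabulate the new flag to confirm both values occur as expected.
table(toy$oscar)
```

On the real data, `table(movies$oscar)` after the mutate step gives the same kind of sanity check.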

Now we can build the initial model.

Model selection

#Fitting the first multiple regression model
first_model <- lm(imdb_num_votes ~ title_type + genre + runtime + mpaa_rating +
  thtr_rel_year + imdb_rating + oscar + top200_box, data = movies)

These variables make up the initial parameters of our first model. Now we take a look at the summary output of this model:

#summary output for the first model
summary(first_model)

## 
## Call:
## lm(formula = imdb_num_votes ~ title_type + genre + runtime + 
##     mpaa_rating + thtr_rel_year + imdb_rating + oscar + top200_box, 
##     data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -191155  -47381  -12610   23091  639481 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -5632297.0   713485.3  -7.894 1.31e-14 ***
## title_typeFeature Film            47143.0    33227.1   1.419  0.15645    
## title_typeTV Movie                31854.5    52353.7   0.608  0.54311    
## genreAnimation                     8169.5    35313.5   0.231  0.81712    
## genreArt House & International   -74188.9    27088.9  -2.739  0.00634 ** 
## genreComedy                      -12388.3    14915.2  -0.831  0.40653    
## genreDocumentary                 -61366.9    35725.3  -1.718  0.08634 .  
## genreDrama                       -51744.6    12919.9  -4.005 6.94e-05 ***
## genreHorror                       -8680.7    22332.4  -0.389  0.69763    
## genreMusical & Performing Arts   -91401.0    30545.1  -2.992  0.00288 ** 
## genreMystery & Suspense          -29932.0    16678.4  -1.795  0.07319 .  
## genreOther                        26597.0    25395.5   1.047  0.29536    
## genreScience Fiction & Fantasy    28701.6    31837.6   0.901  0.36767    
## runtime                            1305.4      214.5   6.087 2.01e-09 ***
## mpaa_ratingNC-17                  22458.9    67729.5   0.332  0.74030    
## mpaa_ratingPG                      3985.0    24774.7   0.161  0.87226    
## mpaa_ratingPG-13                  22205.0    26006.6   0.854  0.39353    
## mpaa_ratingR                      17062.3    24937.0   0.684  0.49409    
## mpaa_ratingUnrated               -42078.3    28986.5  -1.452  0.14710    
## thtr_rel_year                      2628.4      356.0   7.383 4.96e-13 ***
## imdb_rating                       42496.0     3908.5  10.873  < 2e-16 ***
## oscaryes                          11049.2     8647.7   1.278  0.20183    
## top200_boxyes                    157460.9    24222.0   6.501 1.63e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89230 on 627 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3889, Adjusted R-squared:  0.3674 
## F-statistic: 18.14 on 22 and 627 DF,  p-value: < 2.2e-16

Our first multiple regression model is a reasonable start: it has an adjusted R-squared of 0.3674.

For model selection, we will use backward elimination based on p-values. We use the p-value as the selection criterion mostly because we care about identifying which variables are statistically significant predictors, and because the selection process is simpler.
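One practical wrinkle is that summary() prints one p-value per level of a categorical variable, so a factor such as genre has no single p-value. Base R's `drop1()` with `test = "F"` reports one F-test per term instead. The sketch below is only an illustration of one elimination step: it uses the built-in mtcars data rather than the movies data, and the 0.05 cutoff and the names `cars_df`, `fit` and `reduced` are assumptions.

```r
# Built-in mtcars stands in for the movies data (illustration only).
cars_df <- transform(mtcars, cyl = factor(cyl), gear = factor(gear))
fit <- lm(mpg ~ wt + hp + cyl + gear, data = cars_df)

# One F-test per term: the p-value for dropping that whole variable,
# including all levels of a factor at once.
term_tests <- drop1(fit, test = "F")
term_tests

# Candidate for removal: the term with the largest p-value.
p_vals <- term_tests[["Pr(>F)"]][-1]          # first row is "<none>"
names(p_vals) <- rownames(term_tests)[-1]
worst <- names(which.max(p_vals))

# Refit without that term only if it is above the (assumed) 0.05 cutoff.
reduced <- if (p_vals[worst] > 0.05) {
  update(fit, as.formula(paste(". ~ . -", worst)))
} else {
  fit
}
formula(reduced)
```

Repeating this step until no term exceeds the cutoff gives a backward elimination that treats each categorical variable as a whole.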

We start from the summary output of the full model and drop, one at a time, variables whose coefficients are not statistically significant at the 5% level. We begin with oscar, whose single coefficient has a p-value of 0.20.

#Second model with 'oscar' variable eliminated
model_2 <- lm(imdb_num_votes ~ title_type + genre + runtime + mpaa_rating +
  thtr_rel_year + imdb_rating + top200_box, data = movies)

summary(model_2)

## 
## Call:
## lm(formula = imdb_num_votes ~ title_type + genre + runtime + 
##     mpaa_rating + thtr_rel_year + imdb_rating + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184497  -47251  -12913   24701  634300 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -5632862.6   713844.3  -7.891 1.34e-14 ***
## title_typeFeature Film            48311.5    33231.2   1.454 0.146501    
## title_typeTV Movie                32745.9    52375.4   0.625 0.532057    
## genreAnimation                    10464.5    35285.6   0.297 0.766897    
## genreArt House & International   -74848.3    27097.6  -2.762 0.005910 ** 
## genreComedy                      -11519.2    14907.2  -0.773 0.439974    
## genreDocumentary                 -60620.8    35738.5  -1.696 0.090338 .  
## genreDrama                       -50344.5    12879.8  -3.909 0.000103 ***
## genreHorror                       -8811.1    22343.4  -0.394 0.693459    
## genreMusical & Performing Arts   -91936.6    30557.6  -3.009 0.002729 ** 
## genreMystery & Suspense          -27783.6    16601.7  -1.674 0.094720 .  
## genreOther                        28128.5    25380.0   1.108 0.268158    
## genreScience Fiction & Fantasy    28407.5    31852.8   0.892 0.372821    
## runtime                            1382.4      205.9   6.713 4.28e-11 ***
## mpaa_ratingNC-17                  25437.7    67723.5   0.376 0.707333    
## mpaa_ratingPG                      4995.8    24774.5   0.202 0.840256    
## mpaa_ratingPG-13                  23171.4    26008.7   0.891 0.373318    
## mpaa_ratingR                      17512.1    24947.0   0.702 0.482957    
## mpaa_ratingUnrated               -42608.4    28998.1  -1.469 0.142239    
## thtr_rel_year                      2623.8      356.2   7.366 5.54e-13 ***
## imdb_rating                       42788.0     3903.7  10.961  < 2e-16 ***
## top200_boxyes                    157313.2    24233.9   6.491 1.73e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89270 on 628 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3873, Adjusted R-squared:  0.3668 
## F-statistic:  18.9 on 21 and 628 DF,  p-value: < 2.2e-16

Next we remove title_type, since neither of its levels is significant:

#Third model without 'title_type'
model_3 <- lm(imdb_num_votes ~ genre + runtime + mpaa_rating + thtr_rel_year +
  imdb_rating + top200_box, data = movies)

#Summary statistics for the third model
summary(model_3)

## 
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + mpaa_rating + 
##     thtr_rel_year + imdb_rating + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -184808  -46896  -12612   24905  635187 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -5567717.0   712451.5  -7.815 2.31e-14 ***
## genreAnimation                    11245.5    35288.2   0.319 0.750079    
## genreArt House & International   -73326.3    27078.7  -2.708 0.006955 ** 
## genreComedy                      -12702.9    14886.5  -0.853 0.393808    
## genreDocumentary                -102633.1    20567.6  -4.990 7.82e-07 ***
## genreDrama                       -50219.7    12867.0  -3.903 0.000105 ***
## genreHorror                       -8444.0    22340.6  -0.378 0.705582    
## genreMusical & Performing Arts  -106826.8    28783.7  -3.711 0.000224 ***
## genreMystery & Suspense          -27661.9    16604.7  -1.666 0.096228 .  
## genreOther                        27579.6    25250.2   1.092 0.275140    
## genreScience Fiction & Fantasy    28344.3    31858.4   0.890 0.373969    
## runtime                            1392.5      205.8   6.766 3.03e-11 ***
## mpaa_ratingNC-17                  26365.1    67730.7   0.389 0.697213    
## mpaa_ratingPG                      5688.5    24772.3   0.230 0.818452    
## mpaa_ratingPG-13                  24216.6    26002.6   0.931 0.352048    
## mpaa_ratingR                      18376.4    24944.8   0.737 0.461590    
## mpaa_ratingUnrated               -46110.9    28844.7  -1.599 0.110412    
## thtr_rel_year                      2615.7      356.2   7.343 6.46e-13 ***
## imdb_rating                       42346.9     3880.7  10.912  < 2e-16 ***
## top200_boxyes                    157763.1    24236.4   6.509 1.54e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 89290 on 630 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3851, Adjusted R-squared:  0.3666 
## F-statistic: 20.77 on 19 and 630 DF,  p-value: < 2.2e-16

This time we remove mpaa_rating, none of whose levels is significant:

#fourth model without 'mpaa_rating'
model_4 <- lm(imdb_num_votes ~ genre + runtime + thtr_rel_year + imdb_rating +
  top200_box, data = movies)

#Summary statistics for this fourth model
summary(model_4)

## 
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212262  -46896  -11777   23283  637620 
## 
## Coefficients:
##                                  Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    -5522449.1   664416.7  -8.312 5.73e-16 ***
## genreAnimation                    -1071.6    32302.8  -0.033 0.973546    
## genreArt House & International   -87609.7    26766.0  -3.273 0.001121 ** 
## genreComedy                      -10286.6    14908.4  -0.690 0.490456    
## genreDocumentary                -137055.0    18536.3  -7.394 4.52e-13 ***
## genreDrama                       -48440.6    12741.7  -3.802 0.000158 ***
## genreHorror                      -11351.8    22048.0  -0.515 0.606824    
## genreMusical & Performing Arts  -115073.3    28880.6  -3.984 7.55e-05 ***
## genreMystery & Suspense          -23494.2    16445.9  -1.429 0.153620    
## genreOther                        23218.3    25360.5   0.916 0.360260    
## genreScience Fiction & Fantasy    26518.0    32121.7   0.826 0.409371    
## runtime                            1412.2      201.9   6.995 6.75e-12 ***
## thtr_rel_year                      2602.7      331.0   7.862 1.62e-14 ***
## imdb_rating                       41197.4     3863.3  10.664  < 2e-16 ***
## top200_boxyes                    155811.6    24211.7   6.435 2.43e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90120 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3687, Adjusted R-squared:  0.3548 
## F-statistic: 26.49 on 14 and 635 DF,  p-value: < 2.2e-16

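In this fourth model every remaining variable is statistically significant (for genre, at least some of its levels are), so all of them serve as meaningful predictors of imdb_num_votes and this becomes our final model. Note that the adjusted R-squared dropped slightly along the way, from 0.3674 in the first model to 0.3548.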

Model diagnostics

In order to check if our model is reliable, we need to check four conditions:

Check for linearity

To check this condition, we plot the model’s residuals against each of the numerical explanatory variables: runtime, thtr_rel_year and imdb_rating. We look for random scatter of the residuals around 0. We begin with runtime.

#Check linearity: residuals vs 'runtime'
ggplot(model_4, aes(x = runtime, y = .resid)) +
  geom_jitter(color="deepskyblue3") +
  geom_hline(yintercept = 0, linetype = "dashed", color="brown4") + 
  ylab("Residuals") + 
  xlab("Movie duration") +
  ggtitle("Check linearity of runtime")

This plot makes sense, although it might not be obvious at first. The residuals are concentrated around a runtime of 100 instead of being spread evenly, but this is expected since most movies last about 100 minutes. The residuals are otherwise randomly scattered around 0 with few outliers, so we consider this condition met for this variable.

#check linearity of residuals vs 'thtr_rel_year'
ggplot(model_4, aes(x = thtr_rel_year, y = .resid)) + 
  geom_jitter(color="deepskyblue3") +
  geom_hline(yintercept = 0, linetype = "dashed", color="brown4") + 
  ylab("Residuals") + 
  xlab("Year of release") +
  ggtitle("Check linearity of release year in theaters")

The residuals seem evenly distributed around 0 with few outliers, so this condition also holds for thtr_rel_year.

#check linearity of residuals vs 'imdb_rating'
ggplot(model_4, aes(x = imdb_rating, y = .resid)) + 
  geom_jitter(color="deepskyblue3") +
  geom_hline(yintercept = 0, linetype = "dashed", color="brown4") + 
  ylab("Residuals") + 
  xlab("IMDB rating") +
  ggtitle("Check linearity of IMDB ratings")

We run into a problem here: the residuals are clearly not evenly distributed around 0. For low imdb_rating values, the residuals only take positive values; for ratings between 5 and 8, the residuals are mostly negative; and for ratings around 8 and higher, the residuals take very high values. The linearity condition therefore doesn’t hold for imdb_rating.

We decide to carry on and use this variable as it is because the violation of this condition isn’t severe, but it is important to be aware of this behavior. Had the departure of the residuals from 0 been more extreme, we would have had to correct it.

Check for normal residuals

To check this condition, we plot the residuals in a histogram and in a normal probability plot and see whether their distribution is nearly normal and centered at 0.

#Check for normal residuals
#Histogram of residuals' distribution
ggplot(model_4, aes(.resid)) + 
  geom_histogram(bins=60, fill="cornflowerblue") + 
  xlab("Residuals") + 
  ylab("Frequency") + 
  ggtitle("Residuals distribution histogram")

#Normal probability plot of residuals
ggplot(model_4, aes(sample = .resid)) + 
  geom_qq(color="deepskyblue3") + 
  geom_qq_line(color="brown4") +
  xlab("Theoretical quantiles") + 
  ylab("Residuals") + 
  ggtitle("Residuals normal probability plot")

The residuals appear to be right skewed, but only a minority of observations depart from the line. Since this is a large dataset, and because most of the residuals closely follow the normal probability line, we consider this condition sufficiently satisfied for our purposes.

Constant variability of residuals

We plot the residuals against the predicted values of our response variable (y-hat), which in this case are the predicted imdb_num_votes. In these plots we check whether the residuals are equally variable around 0 for low and high values of y-hat. The second plot is the same as the first but uses the absolute value of the residuals, which is helpful to check for outliers:

#Plot for checking constant variability of residuals
ggplot(model_4, aes(x = .fitted, y = .resid)) + 
  geom_point(color="deepskyblue3") +
  geom_hline(yintercept = 0, linetype = "dashed", color="brown4") + 
  xlab("Fitted values") + 
  ylab("Residuals") + 
  ggtitle("Residuals plot for constant variability")

#Same plot but with absolute value residuals
ggplot(model_4, aes(x = .fitted, y = abs(.resid))) + 
  geom_point(color="deepskyblue3") +
  geom_hline(yintercept = 0, linetype = "dashed", color="brown4") + 
  xlab("Fitted values") + 
  ylab("Absolute residuals") + 
  ggtitle("Absolute value of residuals for constant variability")

The variability of the residuals in these plots is clearly not constant. The residuals plot seems to show a trend along a line with negative slope. Moreover, a fan shape is evident: as the fitted value increases, so does the variability of the residuals. This condition is therefore not satisfied.

At this point, we can deduce that a linear model is not suitable for predicting imdb_num_votes because the constant variability condition on the residuals was not met; we elaborate on this point in the conclusion section.
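A common remedy for a fan-shaped residual plot is to model the logarithm of the response instead of the response itself. The sketch below does not use the movies data: it simulates a response with multiplicative noise (all names and parameters are hypothetical) and compares a crude fan-shape measure, the correlation between absolute residuals and fitted values, before and after the transformation. It is an illustration of the idea, not a result for imdb_num_votes.

```r
# Simulated data with multiplicative noise (hypothetical, not the movies data).
set.seed(3)
x <- runif(500, 0, 4)
y <- exp(1 + 0.5 * x + rnorm(500, sd = 0.5))

fit_raw <- lm(y ~ x)        # response on its original scale
fit_log <- lm(log(y) ~ x)   # response on the log scale

# Crude fan-shape measure: correlation between |residual| and fitted value.
# Values near 0 suggest roughly constant spread.
fan <- function(fit) cor(abs(resid(fit)), fitted(fit))

c(raw = fan(fit_raw), log = fan(fit_log))
```

Whether a log transformation would fix the diagnostics for the real model would have to be checked by refitting it and redrawing the plots above.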

However, for the sake of completeness in this assignment, we will carry on with the last model diagnostic, the interpretation of the model’s coefficients and the prediction section.

Independent Residuals

Again, we are interested in having random residuals scattered around 0. The plot to use for this purpose is residuals vs index, which is the order in which the data is sampled or put into the dataset.

#Residuals vs index
options(scipen=10000)
plot(model_4$residuals, col="deepskyblue4", ylab="Residuals", 
     xlab="Movie data entry", main="Check independence of residuals")
abline(h=0, lty=2)

This condition is completely satisfied as the vast majority of residuals are randomly scattered around 0 horizontally throughout the plot.

Model coefficients interpretation

Let’s take another look at our final model summary.

#Final model summary output
summary(model_4)

## 
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + thtr_rel_year + 
##     imdb_rating + top200_box, data = movies)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -212262  -46896  -11777   23283  637620 
## 
## Coefficients:
##                                  Estimate Std. Error t value
## (Intercept)                    -5522449.1   664416.7  -8.312
## genreAnimation                    -1071.6    32302.8  -0.033
## genreArt House & International   -87609.7    26766.0  -3.273
## genreComedy                      -10286.6    14908.4  -0.690
## genreDocumentary                -137055.0    18536.3  -7.394
## genreDrama                       -48440.6    12741.7  -3.802
## genreHorror                      -11351.8    22048.0  -0.515
## genreMusical & Performing Arts  -115073.3    28880.6  -3.984
## genreMystery & Suspense          -23494.2    16445.9  -1.429
## genreOther                        23218.3    25360.5   0.916
## genreScience Fiction & Fantasy    26518.0    32121.7   0.826
## runtime                            1412.2      201.9   6.995
## thtr_rel_year                      2602.7      331.0   7.862
## imdb_rating                       41197.4     3863.3  10.664
## top200_boxyes                    155811.6    24211.7   6.435
##                                            Pr(>|t|)    
## (Intercept)                    0.000000000000000573 ***
## genreAnimation                             0.973546    
## genreArt House & International             0.001121 ** 
## genreComedy                                0.490456    
## genreDocumentary               0.000000000000452323 ***
## genreDrama                                 0.000158 ***
## genreHorror                                0.606824    
## genreMusical & Performing Arts 0.000075478316115214 ***
## genreMystery & Suspense                    0.153620    
## genreOther                                 0.360260    
## genreScience Fiction & Fantasy             0.409371    
## runtime                        0.000000000006751055 ***
## thtr_rel_year                  0.000000000000016209 ***
## imdb_rating                    < 0.0000000000000002 ***
## top200_boxyes                  0.000000000242746229 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 90120 on 635 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.3687, Adjusted R-squared:  0.3548 
## F-statistic: 26.49 on 14 and 635 DF,  p-value: < 0.00000000000000022

The intercept of this model, -5522449.1, is the hypothetical number of votes on IMDB of a movie that lasts 0 minutes, was released in year 0, has an IMDB rating of 0, doesn’t appear on the Top 200 Box Office list on BoxOfficeMojo, and whose genre is Action & Adventure.

Of course, the value of this intercept doesn’t offer any useful information and it only serves to adjust the height of the line.

With all else being equal, the coefficient of runtime means that for each additional minute of the movie, we expect the movie to get 1412.2 more votes on IMDB on average.

Likewise, with all else being equal, the coefficient of thtr_rel_year means that for each year later a movie is released in theaters, we expect its number of votes on IMDB to increase by 2602.7 on average.

Following the same reasoning as before, each additional whole point on imdb_rating is expected to increase the number of votes on IMDB by 41197.4 on average.

The categorical variables genre and top200_box shift the intercept up or down depending on the level they take: the genre of the movie and whether or not it appears on the Top 200 Box Office list on BoxOfficeMojo. The size of each shift is the corresponding coefficient in the summary table. For the reference levels, genre “Action & Adventure” and top200_box “no”, the shift is zero. For example, with all else being equal, a Documentary is expected to get 137055 fewer votes on average than an Action & Adventure movie, and a movie on the Top 200 list 155811.6 more votes than one that is not.

Part 5: Prediction

As a first note, we must be careful about predicting outside the range of the data used to fit the model. Predicting for a movie released in 2016 is an extrapolation, because the model was built on movies up to 2015. However, 2016 is so close to 2015 that we don’t expect the model to behave very differently. It is still an extrapolation, though, and should be treated with caution.

For the prediction section, we use the movie Hush, whose values were taken directly from its IMDb page. We look up the values of the model’s variables for this movie:

imdb_num_votes (response variable): 107005
genre: “Horror”
runtime: 82
thtr_rel_year: 2016
imdb_rating: 6.6
top200_box: “no”

So we make an observation in a new data frame with the movie data:

# Hush dataframe
hush <- data.frame(genre="Horror", runtime=82, thtr_rel_year=2016, 
  imdb_rating=6.6, top200_box="no")

Using this model, we can predict the response variable for this movie:

# Predict number of votes on IMDb for Hush
predict(model_4, hush, interval="prediction", level=0.95)

##        fit       lwr    upr
## 1 101024.5 -80440.08 282489

The predicted value, 101024.5, is close to the actual number of votes on IMDb, 107005, so the point prediction was quite accurate. The prediction interval means that, according to the model, we are 95% confident that a movie with these characteristics gets between -80440.08 and 282489 votes on IMDb, a range that contains the actual value. The lower bound, however, is not meaningful in this context because the number of votes cannot be negative; the lowest possible value is zero. The interval is also very wide, which reflects the large residual standard error of the model (90120).
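To see where the fitted value comes from, the prediction can be rebuilt by hand from the coefficients in the summary table. The sketch below copies the rounded coefficients printed in this report into a vector (`b` and `by_hand` are names chosen here for illustration) and applies them to the values for Hush.

```r
# Coefficients copied (rounded) from the summary(model_4) output in this report.
b <- c(intercept     = -5522449.1,
       genreHorror   = -11351.8,
       runtime       = 1412.2,
       thtr_rel_year = 2602.7,
       imdb_rating   = 41197.4,
       top200_boxyes = 155811.6)

# Hush: Horror, 82 minutes, released in 2016, rating 6.6, not in the Top 200.
by_hand <- unname(b["intercept"] + b["genreHorror"] * 1 + b["runtime"] * 82 +
  b["thtr_rel_year"] * 2016 + b["imdb_rating"] * 6.6 + b["top200_boxyes"] * 0)
by_hand
```

The hand calculation comes to roughly 100945.5, within about 80 votes of the 101024.5 returned by predict. The small gap is presumably due to the rounding of the printed coefficients, since even 0.05 of rounding in the thtr_rel_year coefficient is multiplied by 2016.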

Part 6: Conclusion

From the study of the model, the research question and the overall analysis of the variables involved in this model, we conclude that this model is not reliable for making accurate predictions.

Revisiting the diagnostic plots, we found three important violations of the model conditions: (1) imdb_rating does not have a linear relationship with the response, as its residuals show a clear pattern, (2) the distribution of the residuals is right skewed and (3) the residuals don’t have constant variability. We therefore conclude that a linear model is not suitable for making predictions of imdb_num_votes with the explanatory variables chosen.

Although the prediction in the previous section was accurate, this was most likely due to chance, as the model is not expected to make consistently accurate predictions.

There seems to be a trend, so a non-linear model that explains imdb_num_votes accurately may exist and could be investigated further. Likewise, there are likely ways to transform the variables so that they meet the conditions checked in the model diagnostics. Both are beyond the scope of this course and are left as ideas for future research.

On the plus side, we did partially answer the research question by determining which variables are statistically significant predictors of imdb_num_votes: the explanatory variables in our final model, namely genre, runtime, thtr_rel_year, imdb_rating and top200_box.