This project, from the Linear Regression and Modeling course on Coursera, consists of 6 parts:
Data: Describe how the observations in the sample are collected, and the implications of this data collection method on the scope of inference (generalizability and causality).
Research question: Come up with a research question about the data to perform a multiple linear regression on.
Exploratory Data Analysis (EDA): Perform exploratory data analysis that addresses the research question outlined above. The EDA should contain numerical summaries and visualizations.
Modeling: Develop a multiple linear regression model to predict a numerical variable in the dataset. Discuss which variables to consider and which to exclude. Carry out a model selection process and perform model diagnostics. Lastly, interpret the model coefficients.
Prediction: Pick a movie from 2016 (a new movie that is not in the sample) and make a prediction for this movie using the model you developed and the predict function in R. Also quantify the uncertainty around this prediction using an appropriate interval.
Conclusion: Summarize your findings from the previous sections and include ideas for possible future research.
The dataset consists of 651 randomly selected movies produced and released before 2016. There are 32 variables in the dataset which include data from the movies like title, genre, runtime, main actors, director, year of release, or ratings.
Since random sampling was used, conclusions drawn from the dataset can be generalized to all movies released before 2016. Because this is an observational study and no experiment was conducted, we can’t infer causal relationships from this dataset.
Research question: What variables are associated and might be used to predict the number of IMDB votes of a movie?
It may be interesting to find out which movie characteristics make people leave a vote on IMDB, whether positive or negative.
Our response variable of interest, and thus the final target of our analysis, is imdb_num_votes. Let’s examine this variable.
movies %>% summarise(median = median(imdb_num_votes),
Q1 = quantile(imdb_num_votes, 1/4), Q3 = quantile(imdb_num_votes, 3/4),
IQR = IQR(imdb_num_votes), avg = mean(imdb_num_votes),
min = min(imdb_num_votes), max = max(imdb_num_votes))
## # A tibble: 1 x 7
## median Q1 Q3 IQR avg min max
## <int> <dbl> <dbl> <dbl> <dbl> <int> <int>
## 1 15116 4546. 58300. 53755 57533. 180 893008
These numerical summaries show that half of the movies have between 180 (the minimum) and 15116 (the median) votes, while the other half fall between the median and the maximum of 893008. The maximum is so large compared with the rest that we can expect this variable to be extremely right skewed, which is consistent with the mean (57533) sitting far above the median. The following graph shows this (movies with 100000 votes or more were filtered out of the plot for better readability):
movies %>% filter(imdb_num_votes < 100000) %>%
ggplot(aes(x = imdb_num_votes)) +
geom_density(color = "cornflowerblue", fill = "cornflowerblue") +
ylab("Density") +
xlab("Number of votes") +
ggtitle("IMDB number of votes distribution") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
This strong right skew makes sense: most movies receive a number of votes close to the median, and only a few, the most popular ones, collect an enormous number of votes such as 500000 or more.
Our model will initially consist of several variables: title_type, genre, runtime, mpaa_rating, thtr_rel_year, a ratings variable such as imdb_rating or audience_score, top200_box, and whether the movie, its director or its actors won or were nominated for an Oscar.
Some variables, like studio, the actor variables, or director, are not included in the initial model because they have so many distinct values that the model would become very messy:
## Factor w/ 211 levels "20th Century Fox",..: 91 202 167 34 13 163 147 118 88 84 ...
## chr [1:651] "Gina Rodriguez" "Sam Neill" "Christopher Guest" ...
## chr [1:651] "Michael D. Olmos" "Rob Sitch" "Christopher Guest" ...
Other variables like title, imdb_url and rt_url are likewise not included because they don’t provide predictive information.
To build our initial model, we first need to weed out explanatory variables that are highly correlated with each other, since collinear predictors make the coefficient estimates unreliable. We begin by comparing the score variables and looking at their correlations: imdb_rating, critics_score and audience_score.
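The correlation check described here can be sketched with base R's cor(). This is only an illustration: the helper name score_correlations and the toy values are assumptions invented for the example, and the report itself relied on a ggpairs plot of the full movies data.

```r
# Hypothetical helper: pairwise Pearson correlations among the three score variables.
score_correlations <- function(df) {
  cor(df[, c("imdb_rating", "critics_score", "audience_score")],
      use = "complete.obs")
}

# Toy values invented purely to make the sketch runnable; not taken from the dataset.
toy <- data.frame(
  imdb_rating    = c(5.5, 6.2, 7.1, 7.8, 8.4),
  critics_score  = c(30, 45, 70, 85, 95),
  audience_score = c(40, 50, 72, 80, 91)
)

# Symmetric 3 x 3 matrix with 1s on the diagonal; off-diagonal values near 1
# would signal that the variables carry largely the same information.
round(score_correlations(toy), 2)
```

With the real data, the call would be score_correlations(movies), assuming movies is already loaded.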
These variables are highly correlated with each other, so we decide to keep only imdb_rating, because it has the highest correlation with the other two and is therefore the most representative among them.
To decide what to do with critics_rating and audience_rating, let’s compare them with their score counterparts:
#Comparing audience_rating with audience_score
ggplot(movies, aes(audience_rating, audience_score)) +
geom_boxplot(color = "black") +
xlab("Audience rating category") +
ylab("Audience score") +
ggtitle("Comparing audience score and rating")
We can see that audience_score clearly separates the audience_rating categories, so audience_rating adds little information beyond the score; therefore, we don’t include audience_rating in our initial model.
Likewise, we follow the same procedure with critics_rating and critics_score:
#comparing critics_rating with critics_score
ggplot(movies, aes(critics_rating, critics_score)) +
geom_boxplot() +
xlab("Critics rating category") +
ylab("Critics score") +
ggtitle("Comparing critics score and rating")
We reach the same conclusion for critics_score and critics_rating: because critics_score clearly separates the critics_rating categories, we choose the former over the latter.
We are left with critics_score and audience_score standing in for their rating counterparts and, recalling the ggpairs plot, we chose imdb_rating as the single representative of these score variables.
For simplicity, we create a variable in the dataset called oscar, with values of either yes or no. It is yes if any of best_pic_nom, best_pic_win, best_actor_win, best_actress_win or best_dir_win is yes (whether an actor, the director, or the movie itself won or was nominated for an Oscar), and no otherwise. We will use it in our model.
#implementing 'oscar' variable in the dataset
movies <- movies %>%
mutate(oscar = ifelse(best_pic_nom == 'yes' |
best_pic_win == 'yes' | best_actor_win == 'yes' | best_actress_win == 'yes' |
best_dir_win == 'yes', 'yes', 'no'))
Now we can make the first initial model.
#elaborating the first multiple regression model
first_model <- lm(imdb_num_votes ~ title_type + genre + runtime + mpaa_rating +
thtr_rel_year + imdb_rating + oscar + top200_box, data = movies)
These variables make up the initial parameters of our first model. Now we take a look at the summary output of this model:
##
## Call:
## lm(formula = imdb_num_votes ~ title_type + genre + runtime +
## mpaa_rating + thtr_rel_year + imdb_rating + oscar + top200_box,
## data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -191155 -47381 -12610 23091 639481
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5632297.0 713485.3 -7.894 1.31e-14 ***
## title_typeFeature Film 47143.0 33227.1 1.419 0.15645
## title_typeTV Movie 31854.5 52353.7 0.608 0.54311
## genreAnimation 8169.5 35313.5 0.231 0.81712
## genreArt House & International -74188.9 27088.9 -2.739 0.00634 **
## genreComedy -12388.3 14915.2 -0.831 0.40653
## genreDocumentary -61366.9 35725.3 -1.718 0.08634 .
## genreDrama -51744.6 12919.9 -4.005 6.94e-05 ***
## genreHorror -8680.7 22332.4 -0.389 0.69763
## genreMusical & Performing Arts -91401.0 30545.1 -2.992 0.00288 **
## genreMystery & Suspense -29932.0 16678.4 -1.795 0.07319 .
## genreOther 26597.0 25395.5 1.047 0.29536
## genreScience Fiction & Fantasy 28701.6 31837.6 0.901 0.36767
## runtime 1305.4 214.5 6.087 2.01e-09 ***
## mpaa_ratingNC-17 22458.9 67729.5 0.332 0.74030
## mpaa_ratingPG 3985.0 24774.7 0.161 0.87226
## mpaa_ratingPG-13 22205.0 26006.6 0.854 0.39353
## mpaa_ratingR 17062.3 24937.0 0.684 0.49409
## mpaa_ratingUnrated -42078.3 28986.5 -1.452 0.14710
## thtr_rel_year 2628.4 356.0 7.383 4.96e-13 ***
## imdb_rating 42496.0 3908.5 10.873 < 2e-16 ***
## oscaryes 11049.2 8647.7 1.278 0.20183
## top200_boxyes 157460.9 24222.0 6.501 1.63e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89230 on 627 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3889, Adjusted R-squared: 0.3674
## F-statistic: 18.14 on 22 and 627 DF, p-value: < 2.2e-16
Our first multiple regression model is not too shabby; we have an Adjusted R-squared of 0.3674.
For model selection we use backward elimination based on p-values. We use the p-value as the selection criterion mostly because we care about identifying which variables are statistically significant predictors, and because the selection process is much simpler.
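One step of p-value backward elimination can be sketched as follows. The example uses R's built-in mtcars data as a stand-in, because the movies data is not bundled with R, and it relies on drop1() with an F test, which reports a single p-value per variable even for multi-level factors. This is one possible way to run a step, not necessarily the procedure followed in this report.

```r
# Stand-in full model on the built-in mtcars data (not the movies data).
cars <- transform(mtcars, cyl = factor(cyl))
full_model <- lm(mpg ~ wt + hp + cyl + qsec, data = cars)

# One F test per term: a factor such as cyl gets a single p-value for all its levels.
step_table <- drop1(full_model, test = "F")
print(step_table)

# Candidate for removal: the term with the largest p-value.
# The "<none>" row has an NA p-value, which which.max() ignores.
p_values <- step_table[["Pr(>F)"]]
names(p_values) <- rownames(step_table)
weakest <- names(which.max(p_values))
weakest

# Refit without that term; repeat until every remaining term is significant.
reduced_model <- update(full_model, as.formula(paste(". ~ . -", weakest)))
```

Testing whole terms rather than individual dummy coefficients avoids having to decide by eye whether a factor with a mix of significant and non-significant levels should stay.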
We start from the summary output of the full model and remove a variable that is not statistically significant. We begin with oscar, whose p-value is 0.202.
#Second model with 'oscar' variable eliminated
model_2 <- lm(imdb_num_votes ~ title_type + genre + runtime + mpaa_rating +
thtr_rel_year + imdb_rating + top200_box, data = movies)
##
## Call:
## lm(formula = imdb_num_votes ~ title_type + genre + runtime +
## mpaa_rating + thtr_rel_year + imdb_rating + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184497 -47251 -12913 24701 634300
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5632862.6 713844.3 -7.891 1.34e-14 ***
## title_typeFeature Film 48311.5 33231.2 1.454 0.146501
## title_typeTV Movie 32745.9 52375.4 0.625 0.532057
## genreAnimation 10464.5 35285.6 0.297 0.766897
## genreArt House & International -74848.3 27097.6 -2.762 0.005910 **
## genreComedy -11519.2 14907.2 -0.773 0.439974
## genreDocumentary -60620.8 35738.5 -1.696 0.090338 .
## genreDrama -50344.5 12879.8 -3.909 0.000103 ***
## genreHorror -8811.1 22343.4 -0.394 0.693459
## genreMusical & Performing Arts -91936.6 30557.6 -3.009 0.002729 **
## genreMystery & Suspense -27783.6 16601.7 -1.674 0.094720 .
## genreOther 28128.5 25380.0 1.108 0.268158
## genreScience Fiction & Fantasy 28407.5 31852.8 0.892 0.372821
## runtime 1382.4 205.9 6.713 4.28e-11 ***
## mpaa_ratingNC-17 25437.7 67723.5 0.376 0.707333
## mpaa_ratingPG 4995.8 24774.5 0.202 0.840256
## mpaa_ratingPG-13 23171.4 26008.7 0.891 0.373318
## mpaa_ratingR 17512.1 24947.0 0.702 0.482957
## mpaa_ratingUnrated -42608.4 28998.1 -1.469 0.142239
## thtr_rel_year 2623.8 356.2 7.366 5.54e-13 ***
## imdb_rating 42788.0 3903.7 10.961 < 2e-16 ***
## top200_boxyes 157313.2 24233.9 6.491 1.73e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89270 on 628 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3873, Adjusted R-squared: 0.3668
## F-statistic: 18.9 on 21 and 628 DF, p-value: < 2.2e-16
Now we remove title_type, since neither of its levels is statistically significant:
#Third model without 'title_type'
model_3 <- lm(imdb_num_votes ~ genre + runtime + mpaa_rating + thtr_rel_year +
imdb_rating + top200_box, data = movies)
##
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + mpaa_rating +
## thtr_rel_year + imdb_rating + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -184808 -46896 -12612 24905 635187
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5567717.0 712451.5 -7.815 2.31e-14 ***
## genreAnimation 11245.5 35288.2 0.319 0.750079
## genreArt House & International -73326.3 27078.7 -2.708 0.006955 **
## genreComedy -12702.9 14886.5 -0.853 0.393808
## genreDocumentary -102633.1 20567.6 -4.990 7.82e-07 ***
## genreDrama -50219.7 12867.0 -3.903 0.000105 ***
## genreHorror -8444.0 22340.6 -0.378 0.705582
## genreMusical & Performing Arts -106826.8 28783.7 -3.711 0.000224 ***
## genreMystery & Suspense -27661.9 16604.7 -1.666 0.096228 .
## genreOther 27579.6 25250.2 1.092 0.275140
## genreScience Fiction & Fantasy 28344.3 31858.4 0.890 0.373969
## runtime 1392.5 205.8 6.766 3.03e-11 ***
## mpaa_ratingNC-17 26365.1 67730.7 0.389 0.697213
## mpaa_ratingPG 5688.5 24772.3 0.230 0.818452
## mpaa_ratingPG-13 24216.6 26002.6 0.931 0.352048
## mpaa_ratingR 18376.4 24944.8 0.737 0.461590
## mpaa_ratingUnrated -46110.9 28844.7 -1.599 0.110412
## thtr_rel_year 2615.7 356.2 7.343 6.46e-13 ***
## imdb_rating 42346.9 3880.7 10.912 < 2e-16 ***
## top200_boxyes 157763.1 24236.4 6.509 1.54e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 89290 on 630 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3851, Adjusted R-squared: 0.3666
## F-statistic: 20.77 on 19 and 630 DF, p-value: < 2.2e-16
This time we remove mpaa_rating from the model, since none of its levels is statistically significant:
#fourth model without 'mpaa_rating'
model_4 <- lm(imdb_num_votes ~ genre + runtime + thtr_rel_year + imdb_rating +
top200_box, data = movies)
##
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + thtr_rel_year +
## imdb_rating + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212262 -46896 -11777 23283 637620
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -5522449.1 664416.7 -8.312 5.73e-16 ***
## genreAnimation -1071.6 32302.8 -0.033 0.973546
## genreArt House & International -87609.7 26766.0 -3.273 0.001121 **
## genreComedy -10286.6 14908.4 -0.690 0.490456
## genreDocumentary -137055.0 18536.3 -7.394 4.52e-13 ***
## genreDrama -48440.6 12741.7 -3.802 0.000158 ***
## genreHorror -11351.8 22048.0 -0.515 0.606824
## genreMusical & Performing Arts -115073.3 28880.6 -3.984 7.55e-05 ***
## genreMystery & Suspense -23494.2 16445.9 -1.429 0.153620
## genreOther 23218.3 25360.5 0.916 0.360260
## genreScience Fiction & Fantasy 26518.0 32121.7 0.826 0.409371
## runtime 1412.2 201.9 6.995 6.75e-12 ***
## thtr_rel_year 2602.7 331.0 7.862 1.62e-14 ***
## imdb_rating 41197.4 3863.3 10.664 < 2e-16 ***
## top200_boxyes 155811.6 24211.7 6.435 2.43e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90120 on 635 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3687, Adjusted R-squared: 0.3548
## F-statistic: 26.49 on 14 and 635 DF, p-value: < 2.2e-16
In this fourth model every remaining variable is a statistically significant predictor of imdb_num_votes (for genre, at least one level is significant), so it becomes our final model.
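To check whether our model is reliable, we need to check four conditions: linear relationships between the numerical explanatory variables and the response, nearly normal residuals, constant variability of residuals, and independent residuals.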
To check the linearity condition, we plot the model’s residuals against each of the numerical explanatory variables: runtime, thtr_rel_year and imdb_rating. We look for random scatter of the residuals around 0. We begin with runtime.
#Check linearity: residuals vs 'runtime'
ggplot(model_4, aes(x = runtime, y = .resid)) +
geom_jitter(color="deepskyblue3") +
geom_hline(yintercept = 0, linetype = "dashed", color="brown4") +
ylab("Residuals") +
xlab("Movie duration") +
ggtitle("Check linearity of runtime")
This plot makes sense, although it might not be obvious at first. The residuals are concentrated around a runtime of 100 rather than spread evenly along the axis, but this is expected since most movies last about 100 minutes. The residuals are otherwise randomly scattered around 0 with few outliers, so we can say that this condition holds for this variable.
#check linearity of residuals vs 'thtr_rel_year'
ggplot(model_4, aes(x = thtr_rel_year, y = .resid)) +
geom_jitter(color="deepskyblue3") +
geom_hline(yintercept = 0, linetype = "dashed", color="brown4") +
ylab("Residuals") +
xlab("Year of release") +
ggtitle("Check linearity of release year in theaters")
The residuals seem evenly distributed around 0 with few outliers, so this condition also holds for thtr_rel_year.
#check linearity of residuals vs 'imdb_rating'
ggplot(model_4, aes(x = imdb_rating, y = .resid)) +
geom_jitter(color="deepskyblue3") +
geom_hline(yintercept = 0, linetype = "dashed", color="brown4") +
ylab("Residuals") +
xlab("IMDB rating") +
ggtitle("Check linearity of IMDB ratings")
We run into a problem here: the residuals are clearly not evenly distributed around 0. For low imdb_rating values the residuals are only positive, for ratings between 5 and 8 they are mostly negative, and for ratings of around 8 and higher they are very large. The linearity condition therefore doesn’t hold for imdb_rating.
We decide to carry on and use this variable as it is because the violation of this condition isn’t severe, but it is important to be aware of this behavior. Had the pattern in the residuals been more extreme, we would have had to take action to correct it.
To check the nearly normal residuals condition, we plot the residuals in a histogram and in a normal probability plot and see whether their distribution is roughly normal and centered at 0.
#Check for normal residuals
#Histogram of residuals' distribution
ggplot(model_4, aes(.resid)) +
geom_histogram(bins=60, fill="cornflowerblue") +
xlab("Residuals") +
ylab("Frequency") +
ggtitle("Residuals distribution histogram")
#Normal probability plot of residuals
ggplot(model_4, aes(sample = .resid)) +
geom_qq(color="deepskyblue3") +
geom_qq_line(color="brown4") +
xlab("Theoretical quantiles") +
ylab("Residuals") +
ggtitle("Residuals normal probability plot")
The residuals appear to be right skewed, but only a minority of the observations are responsible. Since this is a large dataset, and because most of the residuals lie close to the normal probability line, we consider this condition sufficiently satisfied for our purposes.
To check the constant variability condition, we plot the residuals against the predicted values of our response variable imdb_num_votes, or y-hat. In the plots we check whether the residuals are equally variable around 0 for low and high values of y-hat. The second plot is the same as the first but uses the absolute value of the residuals, which is helpful to check for outliers:
#Plot for checking constant variability of residuals
ggplot(model_4, aes(x = .fitted, y = .resid)) +
geom_point(color="deepskyblue3") +
geom_hline(yintercept = 0, linetype = "dashed", color="brown4") +
xlab("Fitted values") +
ylab("Residuals") +
ggtitle("Residuals plot for constant variability")
#Same plot but with absolute value residuals
ggplot(model_4, aes(x = .fitted, y = abs(.resid))) +
geom_point(color="deepskyblue3") +
geom_hline(yintercept = 0, linetype = "dashed", color="brown4") +
xlab("Fitted values") +
ylab("Absolute residuals") +
ggtitle("Absolute value of residuals for constant variability")
The variability of the residuals in this plot is clearly not constant. The residuals plot seems to show a downward trend, along a line with negative slope. Moreover, a fan shape is evident: as the fitted value increases, so does the variability of the residuals. This condition is therefore not satisfied.
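A rough numeric companion to the fan-shape reading of such a plot is the rank correlation between the absolute residuals and the fitted values. The sketch below runs on simulated data whose noise grows with the predictor; every name and value in it is an assumption made for illustration, and nothing is taken from the movies data.

```r
set.seed(1)  # simulated data only, so that the sketch is reproducible
n <- 300
x <- runif(n, min = 1, max = 10)
y <- 2 * x + rnorm(n, sd = x)   # noise grows with x: a built-in fan shape
toy_model <- lm(y ~ x)

# If variability were constant, |residuals| would show no trend against fitted values.
spread_trend <- cor(abs(resid(toy_model)), fitted(toy_model), method = "spearman")
spread_trend
```

A value clearly above 0 says the spread widens as the fitted value grows. The same two lines could be applied to model_4 in place of toy_model.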
At this point, we can deduce that a linear model is not suitable for predicting imdb_num_votes because the constant variability condition on the residuals was not met; we elaborate on this point in the conclusions section.
However, for the sake of completeness in this assignment, we will carry on with the last model diagnostic, the interpretation of the model’s coefficients, and the prediction section.
To check the independent residuals condition, we are again interested in residuals randomly scattered around 0. The plot to use for this purpose is residuals vs index, where the index is the order in which the data were sampled or entered into the dataset.
#Residuals vs index
options(scipen=10000)
plot(model_4$residuals, col="deepskyblue4", ylab="Residuals",
xlab="Movie data entry", main="Check independence of residuals")
abline(h=0, lty=2)
This condition is completely satisfied as the vast majority of residuals are randomly scattered around 0 horizontally throughout the plot.
Let’s take another look at our final model summary.
##
## Call:
## lm(formula = imdb_num_votes ~ genre + runtime + thtr_rel_year +
## imdb_rating + top200_box, data = movies)
##
## Residuals:
## Min 1Q Median 3Q Max
## -212262 -46896 -11777 23283 637620
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) -5522449.1 664416.7 -8.312
## genreAnimation -1071.6 32302.8 -0.033
## genreArt House & International -87609.7 26766.0 -3.273
## genreComedy -10286.6 14908.4 -0.690
## genreDocumentary -137055.0 18536.3 -7.394
## genreDrama -48440.6 12741.7 -3.802
## genreHorror -11351.8 22048.0 -0.515
## genreMusical & Performing Arts -115073.3 28880.6 -3.984
## genreMystery & Suspense -23494.2 16445.9 -1.429
## genreOther 23218.3 25360.5 0.916
## genreScience Fiction & Fantasy 26518.0 32121.7 0.826
## runtime 1412.2 201.9 6.995
## thtr_rel_year 2602.7 331.0 7.862
## imdb_rating 41197.4 3863.3 10.664
## top200_boxyes 155811.6 24211.7 6.435
## Pr(>|t|)
## (Intercept) 0.000000000000000573 ***
## genreAnimation 0.973546
## genreArt House & International 0.001121 **
## genreComedy 0.490456
## genreDocumentary 0.000000000000452323 ***
## genreDrama 0.000158 ***
## genreHorror 0.606824
## genreMusical & Performing Arts 0.000075478316115214 ***
## genreMystery & Suspense 0.153620
## genreOther 0.360260
## genreScience Fiction & Fantasy 0.409371
## runtime 0.000000000006751055 ***
## thtr_rel_year 0.000000000000016209 ***
## imdb_rating < 0.0000000000000002 ***
## top200_boxyes 0.000000000242746229 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 90120 on 635 degrees of freedom
## (1 observation deleted due to missingness)
## Multiple R-squared: 0.3687, Adjusted R-squared: 0.3548
## F-statistic: 26.49 on 14 and 635 DF, p-value: < 0.00000000000000022
The intercept in this model, -5522449.1, is the hypothetical number of votes on IMDB for a movie that lasts 0 minutes, was released in year 0, has an IMDB rating of 0, doesn’t appear on the Top 200 Box Office list on BoxOfficeMojo, and whose genre is Action & Adventure.
Of course, the value of the intercept doesn’t offer any useful information here; it only serves to adjust the height of the line.
With all else being equal, the coefficient of runtime means that for each additional minute of the movie, we expect the movie to get 1412.2 more votes on IMDB on average.
Likewise, with all else being equal, the coefficient of thtr_rel_year means that for each year later that a movie is released in theaters, we expect the number of votes on IMDB to increase by 2602.7 on average.
Following the same reasoning, each additional whole point of imdb_rating is expected to increase the number of votes on IMDB by 41197.4 on average.
The categorical variables genre and top200_box shift the intercept up or down depending on the level they take, that is, on the genre of the movie and on whether or not it appears on the Top 200 Box Office list on BoxOfficeMojo. The size of each shift is given by the corresponding coefficient in the summary table. When genre is “Action & Adventure” and top200_box is “no” (the reference levels), the shift is zero.
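Why the reference level contributes nothing can be seen from the dummy coding that lm() builds internally. The sketch below uses model.matrix() on a three-value factor. The vector is made up for illustration, although its level names mirror genres from the dataset.

```r
# Three illustrative genres; the first level listed is the reference level.
genre <- factor(c("Action & Adventure", "Horror", "Drama"),
                levels = c("Action & Adventure", "Drama", "Horror"))

# Design matrix: an intercept column plus one 0/1 column per non-reference level.
mm <- model.matrix(~ genre)
mm

# Row 1 (Action & Adventure) has 0 in every genre column, so only the intercept applies.
# Row 2 (Horror) has a 1 under genreHorror, which switches that coefficient on.
```

Each genre coefficient in the summary is therefore a difference from Action & Adventure, with all other variables held fixed.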
As a first note, we must be careful about predicting outside the range of the data used to fit the model. Predicting for a movie released in 2016 is an extrapolation, because the model was fitted on movies released up to 2015. However, 2016 is very close to 2015, so we don’t expect this to cause significant problems. We still have to keep in mind that it is an extrapolation.
For the prediction section, we use the movie Hush, whose values were taken directly from its IMDb page. We look up the values of the variables used in the model for this movie:
imdb_num_votes (response variable): 107005
genre: “Horror”
runtime: 82
thtr_rel_year: 2016
imdb_rating: 6.6
top200_box: “no”
So we make an observation in a new data frame with the movie data:
# Hush dataframe
hush <- data.frame(genre="Horror", runtime=82, thtr_rel_year=2016,
imdb_rating=6.6, top200_box="no")
Using this model, we can predict the response variable for this movie:
# Predict number of voters on imdb for hush movie
predict(model_4, hush, interval="prediction", level=0.95)
## fit lwr upr
## 1 101024.5 -80440.08 282489
The predicted value, 101024.5, is pretty close to the real number of votes on IMDb, 107005, so the point prediction was quite accurate. The lower and upper values mean that we are 95% confident that a movie with these characteristics has between -80440.08 and 282489 votes on IMDb, an interval that contains the real value. The lower bound, however, is not meaningful in this context, since the number of votes cannot be lower than zero.
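As a sanity check, the fitted value can be reproduced by hand from the coefficients printed in the final model summary. The sketch below copies those coefficients as rounded in the summary, so the result differs slightly from what predict() returns.

```r
# Coefficients copied (rounded) from the final model summary.
b <- c(intercept = -5522449.1, genreHorror = -11351.8, runtime = 1412.2,
       thtr_rel_year = 2602.7, imdb_rating = 41197.4, top200_boxyes = 155811.6)

# Hush: Horror, 82 minutes, released in 2016, rated 6.6, not in the Top 200 box office.
manual_fit <- b[["intercept"]] + b[["genreHorror"]] +
  b[["runtime"]] * 82 + b[["thtr_rel_year"]] * 2016 +
  b[["imdb_rating"]] * 6.6 + b[["top200_boxyes"]] * 0
manual_fit  # about 100945.5
```

The gap of roughly 80 votes from the 101024.5 reported by predict() comes from rounding: thtr_rel_year is multiplied by 2016, so even a small rounding error in its coefficient is magnified.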
From the study of the model, the research question and the overall analysis of the variables involved in this model, we conclude that this model is not reliable for making accurate predictions.
Revisiting the diagnostic plots, we recall that (1) the residuals do not show a linear pattern against imdb_rating, (2) the distribution of the residuals is right skewed, and (3) the residuals don’t have constant variability. These are 3 important violations of the model conditions, so we conclude that a linear model is not suitable for making predictions of imdb_num_votes with the explanatory variables chosen.
Though the value predicted in the previous section was indeed accurate, this was most likely due to chance alone, as the model is not expected to make consistently accurate predictions.
There seems to be a trend, and perhaps there exists a non-linear model that can explain imdb_num_votes accurately and that can be investigated further. Likewise, there must be methods for transforming the variables so that they meet the conditions checked in the model diagnostics. However, both are beyond the scope of this course, though they can be explored further by expert statisticians.
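One commonly used treatment for a strictly positive, right-skewed response is to model its logarithm instead. The sketch below shows the idea on simulated data only. It is an assumption that the same treatment would help with imdb_num_votes (whose minimum of 180 at least guarantees the logarithm is defined), and this was not tested in this report.

```r
set.seed(42)  # simulated data only; none of these values come from the movies data
n <- 500
x <- rnorm(n)
y <- exp(8 + 0.5 * x + rnorm(n))   # right-skewed, strictly positive response

# Sample skewness of a numeric vector (close to 0 for a symmetric distribution).
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

raw_fit <- lm(y ~ x)          # model on the original scale
log_fit <- lm(log10(y) ~ x)   # model on the log10 scale

skew_raw <- skewness(resid(raw_fit))
skew_log <- skewness(resid(log_fit))
c(raw = skew_raw, log10 = skew_log)
```

In this simulation the residuals are strongly right skewed on the original scale and roughly symmetric on the log scale. On real data, the same diagnostics used in this report would need to be rerun on the transformed model, and predictions would have to be transformed back to the vote scale.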
On the plus side, we did partially answer the research question by determining which variables are statistically significant predictors of imdb_num_votes: they are the ones selected as explanatory variables in our final model.