Movie website scraping part 1

Problem Statement

1. I will fit the yearly data with regression models. The dataset comes from Box Office Mojo. Specifically, the total gross, the number of movies, and the average ticket price are each fitted against time with a linear model, while the number of tickets sold is fitted with two separate linear segments, one through 2004 and one from 2005 onward. The fitted models can then be used to predict future values and give film workers an overall picture of the future movie market.

2. I want to compare directors to see who generates a higher return on investment. To do this, I will use a regression model to estimate the relationship between budget and box office revenue for each director. This can help investors decide whom to invest in and what profit to expect. This analysis is presented in part 2.

Results

Interesting findings

The total gross, number of movies, and average ticket price all show an overall increasing trend.
The number of tickets sold dropped sharply in 2005 and has trended downward since then.
The average ticket price has risen faster since 2007.
The total gross fell in 2005, and its growth recovered in 2006.

Conjectures

The most likely reason for the decline in tickets sold is the rise of online video providers: YouTube was founded in 2005, Hulu was founded in 2007, and Netflix began streaming films online in 2007. Film workers should therefore pay more attention to the international market.

Future work:

Study international movie market trends, so that film workers can find faster-growing markets to invest in.
Analyze online movie service providers, and combine the results with the findings here to see the trend of the whole movie market.

The detailed process is presented in the following.

Take-away points

When scraping some websites, errors can occur because the site's HTML does not follow a consistent coding pattern.
Unicode characters sometimes show up in the scraped text.
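
Both take-aways can be guarded against with a little defensive code. The sketch below is one possible approach, not the code used in this post: `clean_cell` and `row_to_dict` are hypothetical helper names, and the sample values are made up. It normalizes Unicode (for example, non-breaking spaces inside a title) and skips rows whose cell count does not match the header.

```python
import unicodedata


def clean_cell(text):
    """Normalize Unicode and trim whitespace in one scraped cell."""
    # NFKC maps compatibility characters such as the non-breaking
    # space (U+00A0) to their plain equivalents.
    return unicodedata.normalize("NFKC", text).strip()


def row_to_dict(header, cells):
    """Map cells to header names, or return None for a malformed row."""
    if len(cells) != len(header):
        return None
    return {name: clean_cell(cell) for name, cell in zip(header, cells)}


# Hypothetical rows: one well-formed, one with a missing cell.
header = ["Year", "movie_1"]
print(row_to_dict(header, ["2016", "Finding\u00a0Dory "]))
print(row_to_dict(header, ["only one cell"]))
```

Returning None for a malformed row leaves it to the caller to decide whether to skip the row or log it.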

Presentation slides

Detailed Process

Preparation in Python 3

1. Import some necessary packages

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import statsmodels.api as sm
# special matplotlib argument for improved plots
from matplotlib import rcParams
import re

Yearly data analysis

Scraping data from Box Office Mojo

url = 'http://www.boxofficemojo.com/yearly/'
# Send a browser-like User-Agent header so the request is not rejected.
headers = {
    'User-Agent': 'Mozilla/5.0'
}
response = requests.get(url, headers = headers)
# Check whether the request succeeded
response.status_code

If status_code shows 200, it means the request is successful.
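
Instead of inspecting status_code by eye, the check can be built into a small helper. This is only a sketch: `fetch_html` and its `session` parameter are names I made up for illustration, and the 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10.0, session=None):
    """Return the page body, raising an exception on a 4xx/5xx status."""
    # Use the given session if there is one, else the requests module itself.
    http = session if session is not None else requests
    response = http.get(
        url,
        headers={"User-Agent": "Mozilla/5.0"},
        timeout=timeout,
    )
    # raise_for_status() raises requests.HTTPError for 4xx/5xx responses,
    # so an error page is never parsed as if it were data.
    response.raise_for_status()
    return response.text
```

With this helper, `page = fetch_html(url)` either returns the HTML or fails loudly, which is easier to notice than a forgotten status check.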

page = response.text
soup = BeautifulSoup(page)
tables = soup.find_all('table')

Next, we check each table for a property (such as an id or class) that identifies the one we want. Unfortunately, no such property exists, so the tables have to be inspected by eye. After this step we know that tables[2] is the one we want.
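
A hard-coded index breaks as soon as the page layout changes. One alternative, sketched here under assumptions, is to pick the table by the labels it contains. `find_table_with_headers` is a hypothetical helper, and the HTML is a made-up stand-in for the real page, whose actual column labels I have not verified.

```python
from bs4 import BeautifulSoup


def find_table_with_headers(soup, required):
    """Return the first <table> whose text contains every required label."""
    for table in soup.find_all("table"):
        text = table.get_text(" ", strip=True)
        if all(label in text for label in required):
            return table
    return None


# Made-up HTML standing in for the real page.
html = """
<table><tr><td>Navigation</td></tr></table>
<table>
  <tr><td>Year</td><td>Total Gross</td><td>Tickets Sold</td></tr>
  <tr><td>2015</td><td>$11,128.5</td><td>1,320.1</td></tr>
</table>
"""
soup = BeautifulSoup(html, "html.parser")
table = find_table_with_headers(soup, ["Year", "Total Gross"])
first_row = [td.get_text() for td in table.find("tr").find_all("td")]
print(first_row)
```

On pages that nest tables for layout, the outer table also contains the labels and is found first, so the match may need to be narrowed to the innermost table.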

movie_data = []
movie_list = tables[2]
# The header could also be scraped, but sometimes we want to supply clearer names.
# Note: 'Change1' appears twice, so for each row the second value (the change in
# tickets sold) overwrites the first (the change in total gross).
header = ['Year','Total_gross','Change1', 'Tickets_sold', 'Change1', 'num_movies','total_screens',
         'Ave_price', 'Ave_cost', 'movie_1'] 
rows = movie_list.find_all("tr")
# Skip the first row, which holds the table's own header
for row in rows[1:]:
    row_dict={}
    for i,cell in enumerate(row.find_all("td")):
        row_dict[header[i]] = cell.get_text()
    movie_data.append(row_dict)
movies_df = pd.DataFrame(movie_data)

Next we clean the table: strip the special characters and convert the string columns to numeric types.

# Drop the leading "$" and the thousands separators before converting
movies_df.Total_gross = movies_df.Total_gross.str[1:]
movies_df.Total_gross = movies_df.Total_gross.str.replace(',', '')
movies_df.Total_gross = pd.to_numeric(movies_df.Total_gross)
movies_df.Ave_price = movies_df.Ave_price.str.lstrip("$")
movies_df.Ave_price = pd.to_numeric(movies_df.Ave_price)
movies_df.Year = pd.to_numeric(movies_df.Year)
movies_df.Tickets_sold = movies_df.Tickets_sold.str.replace(',', '')
movies_df.Tickets_sold = pd.to_numeric(movies_df.Tickets_sold)
movies_df.num_movies = pd.to_numeric(movies_df.num_movies)
movies_df.head()

	Ave_cost	Ave_price	Change1	Tickets_sold	Total_gross	Year	movie_1	num_movies	total_screens
0	-	8.58	-	715.3	6137.0	2016	Finding Dory	362	-
1	-	8.43	+4.1%	1320.1	11128.5	2015	Star Wars: The Force Awakens	701	-
2	-	8.17	-5.6%	1268.2	10360.8	2014	American Sniper	702	-
3	-	8.13	-1.3%	1343.6	10923.6	2013	Catching Fire	688	-
4	-	7.96	+6.1%	1361.5	10837.4	2012	The Avengers	667	-
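
The per-column cleaning repeats the same strip-and-convert pattern, so it can be wrapped in one function. The following is a minimal sketch, assuming the raw cells look like "$6,137.0" or "+4.1%". `to_number` is a hypothetical helper and the sample frame is made up.

```python
import pandas as pd


def to_number(series):
    """Strip '$', ',' and '%' from a string column and convert to float."""
    cleaned = series.str.replace(r"[$,%]", "", regex=True).str.strip()
    # errors="coerce" turns placeholders such as "-" into NaN
    # instead of raising an exception.
    return pd.to_numeric(cleaned, errors="coerce")


# Made-up sample in the assumed raw format.
sample = pd.DataFrame({
    "Total_gross": ["$6,137.0", "$11,128.5"],
    "Change": ["-", "+4.1%"],
})
sample["Total_gross"] = to_number(sample["Total_gross"])
sample["Change"] = to_number(sample["Change"])
print(sample)
```

Coercing "-" to NaN keeps the column numeric, at the cost of hiding any cell that failed to parse for some other reason.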

Remember to save the data to a local file.

df1 = movies_df.drop(0)         # drop the 2016 row, which does not cover a full year
movies_df = movies_df.drop(0)   # apply the same drop to the frame that gets saved
movies_df.to_csv('df_yearly.txt', index = False)

Linear Regression to fit data

Total gross

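lm = LinearRegression()
y1 = df1.Total_gross
X1 = df1.Year
# add_constant prepends a 'const' column of ones. LinearRegression fits its own
# intercept, so the column is redundant here, but X1 is reused in this form below.
X1 = sm.add_constant(X1)
lm.fit(X1, y1)
# Evaluate the fitted line at every year in the observed range
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y1, '.b', label = 'data')
plt.title('Total gross')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
# score() returns the coefficient of determination R^2
print(lm.score(X1, y1))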

0.96855021574507782
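
The number printed by lm.score is the coefficient of determination R^2. To make that concrete, here is a small sketch that computes it by hand with NumPy. `r_squared` is a hypothetical helper, and the numbers are made up rather than taken from Box Office Mojo.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)          # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total variation
    return 1.0 - ss_res / ss_tot


# Made-up yearly values, fitted with a straight line.
years = np.array([2000, 2001, 2002, 2003, 2004], dtype=float)
gross = np.array([7.5, 8.1, 9.2, 9.2, 9.4])
slope, intercept = np.polyfit(years, gross, 1)
print(r_squared(gross, slope * years + intercept))
```

A value near 1 means the line explains most of the variation in the data, while predicting the mean for every year gives exactly 0.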

Number of movies

lm = LinearRegression()
y2 = df1.num_movies
lm.fit(X1, y2)
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y2, '.b', label = 'data')
plt.title('Number of movies')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print(lm.score(X1, y2))

0.58986278863620789

Average price

lm = LinearRegression()
y3 = df1.Ave_price
lm.fit(X1, y3)
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y3, '.b', label = 'data')
plt.title('Average Price')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print(lm.score(X1, y3))

0.96172045677804519
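
Since the goal stated at the top is prediction, it is worth showing how a fitted linear model could be extrapolated to future years. This sketch uses made-up values that lie exactly on a line, so the result is easy to check. It is not a forecast from the real data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up yearly values that follow an exact line (7.0 in 2010, +0.2 per year).
years = np.arange(2010, 2016)
values = 7.0 + 0.2 * (years - 2010)

model = LinearRegression()
# scikit-learn expects a 2-D feature matrix, hence the reshape.
model.fit(years.reshape(-1, 1), values)

# Predict for years outside the fitted range.
future_years = np.array([2016, 2017, 2018])
forecast = model.predict(future_years.reshape(-1, 1))
for year, value in zip(future_years, forecast):
    print(year, round(float(value), 2))
```

Extrapolating a straight line assumes the past trend simply continues, so forecasts more than a few years out should be read with caution.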

Number of tickets sold

lm1 = LinearRegression()
lm2 = LinearRegression()
y4 = df1.Tickets_sold
X2 = X1
# Split the data at the 2005 drop: one model through 2004, one from 2005 onward
X2_2004 = X2[X2.Year <= 2004]
y4_2004 = y4[X2.Year <= 2004]
X2_2005 = X2[X2.Year > 2004]
y4_2005 = y4[X2.Year > 2004]
lm1.fit(X2_2004, y4_2004)
lm2.fit(X2_2005, y4_2005)
# Evaluate each fitted line over the years of its own segment
line_X1 = np.arange(X2_2004.Year.min(), X2_2004.Year.max() + 1)
line_X1 = sm.add_constant(line_X1)
line_X1 = pd.DataFrame(line_X1)
line_y1 = lm1.predict(line_X1)
line_X2 = np.arange(X2_2005.Year.min(), X2_2005.Year.max() + 1)
line_X2 = sm.add_constant(line_X2)
line_X2 = pd.DataFrame(line_X2)
line_y2 = lm2.predict(line_X2)
plt.plot(line_X1.iloc[:, 1], line_y1, '-r', label = 'Through 2004', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y4, '.b', label = 'data')
plt.plot(line_X2.iloc[:, 1], line_y2, '-g', label = 'From 2005', alpha = 0.6)
plt.title('Tickets sold')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
# Average of the two segments' R^2 scores
print((lm1.score(X2_2004, y4_2004) + lm2.score(X2_2005, y4_2005))/2)

0.655714846952
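
Averaging the two R^2 values is a rough summary, because each score is measured against its own segment's mean. An alternative is to pool the residuals of both segments and compare them with the variation of the whole series. The sketch below does this with NumPy on made-up data. `piecewise_r_squared` is a hypothetical helper, and this is not how the number above was computed.

```python
import numpy as np


def piecewise_r_squared(x, y, breakpoint):
    """Fit one line on each side of `breakpoint` and return a single R^2."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    fitted = np.empty_like(y)
    # Each side needs at least two points for a line fit.
    for mask in (x <= breakpoint, x > breakpoint):
        slope, intercept = np.polyfit(x[mask], y[mask], 1)
        fitted[mask] = slope * x[mask] + intercept
    ss_res = np.sum((y - fitted) ** 2)       # pooled residuals of both fits
    ss_tot = np.sum((y - y.mean()) ** 2)     # variation of the whole series
    return 1.0 - ss_res / ss_tot


# Made-up series that rises through 2004 and falls from 2005 onward.
years = np.arange(2000, 2010)
tickets = np.where(years <= 2004,
                   1.40 + 0.03 * (years - 2000),
                   1.50 - 0.02 * (years - 2005))
print(piecewise_r_squared(years, tickets, 2004))
```

Because the made-up series is exactly piecewise linear, the pooled score comes out at essentially 1. On real data it gives a single number that can be compared directly with the R^2 of a one-line fit over all years.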

Jason Tian
