Jason Tian

Suck out the marrow of data


Self-Defined Classification Functions in Python (Part 1)

In this post, I will write down some regex rules that I used in my finance NLP program.

Find surnames and other names from Hong Kong name list

In the Hong Kong name list, the pattern is that surnames are written in all capital letters, while the other name words start with a capital letter followed by lower-case characters.

Three examples are D'AQUINO Thomas Paul, DOWSLEY James William D'Altera and E Meng.

import re

find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b')
find_othername = re.compile(r'\b[A-Z]{1}\'[A-Z]{1}[a-z]+\b|\b[A-Z]{1}[a-z]+\b|\b [A-Z]{1}-?\b')
fullname = "D'AQUINO Thomas Paul"
find_surname.findall(fullname)
find_othername.findall(fullname)
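For reference, running the snippet above yields the following (a quick local check, shown for illustration):

print(find_surname.findall(fullname))    # ["D'AQUINO"]
print(find_othername.findall(fullname))  # ['Thomas', 'Paul']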

Issues

Find surname in LIE-A-CHEONG Tai Chong David

When I added this pattern to the previous regex, I found one problem.

find_surname = re.compile(r'\b[A-Z]+\'?[A-Z]+\b|^[A-Z]{1}\b|\b-[A-Z]{1}-\b')
find_surname.findall("LIE-A-CHEONG Tai Chong David")

The output will be ['', '', '', '-A-', '']

The conflict was caused by using | and () together.
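One possible workaround (a sketch I have not verified against the full name list) is to drop the empty-matching alternatives and treat apostrophes and hyphens as connectors inside a single pattern, with a negative lookahead so that mixed-case parts like D'Altera are not reduced to a bare initial:

# Assumption: a surname is a maximal run of capital-letter blocks joined by ' or -,
# not followed by any further letter (so "D'Altera" is skipped, "D'AQUINO" is kept).
find_surname2 = re.compile(r"\b[A-Z]+(?:['-][A-Z]+)*(?![A-Za-z'-])")
find_surname2.findall("LIE-A-CHEONG Tai Chong David")   # ['LIE-A-CHEONG']
find_surname2.findall("D'AQUINO Thomas Paul")           # ["D'AQUINO"]
find_surname2.findall("E Meng")                         # ['E']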

Elasticsearch Notes

These notes store some unstructured tips from my study of Elasticsearch.

Exact search in an array

Generate two docs and search for “Tony Ma” in the tg (tags) field:

PUT twitter/blog/1
{ "Company" : "Tencent",
  "tg" : ["Tony Ma", "Tencent"]
}

PUT twitter/blog/2
{ "Company" : "Alibaba",
  "tg": ["Tony", "Ma"]
}

GET twitter/blog/_search?q=tg:"Tony Ma"

Returns:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.5063205,
    "hits": [
      {
        "_index": "twitter",
        "_type": "blog",
        "_id": "1",
        "_score": 0.5063205,
        "_source": {
          "Company": "Tencent",
          "tg": [
            "Tony Ma",
            "Tencent"
          ]
        }
      }
    ]
  }
}

Remarks:

  1. The field names should be different in different types.
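The same phrase search can also be issued from Python (my own sketch using the requests library; adjust the host, and add auth if x-pack security is enabled):

import requests

r = requests.get('http://localhost:9200/twitter/blog/_search',
                 params={'q': 'tg:"Tony Ma"'})
print(r.json()['hits']['total'])   # 1 -- only the doc whose array contains the exact phrase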

Elasticsearch Installation

I installed Elasticsearch and related tools on a Mac. Firstly you can refer to this website; however, there are still some tricks we need to pay attention to.

1. Check if your java is up-to-date.

2. Download and extract Elasticsearch in the directory you want:

curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-5.2.2.tar.gz
tar -xvf elasticsearch-5.2.2.tar.gz

3. Download and extract Kibana (do not download from start page):

wget https://artifacts.elastic.co/downloads/kibana/kibana-5.2.2-darwin-x86_64.tar.gz
tar -xzf kibana-5.2.2-darwin-x86_64.tar.gz

4. Install X-Pack (for security and authentication) in Elasticsearch install dir:

bin/elasticsearch-plugin install x-pack

5. Install X-Pack in Kibana install dir:

bin/kibana-plugin install x-pack

6. Run Elasticsearch in the default port 9200:

# Under the dir elasticsearch-5.2.2
./bin/elasticsearch -Ecluster.name=my_cluster_name -Enode.name=my_node_name

7. Run Kibana in the default port 5601:

# Under the dir kibana-5.2.2-darwin-x86_64
./bin/kibana

Remarks:

  1. The default username of Elasticsearch for xpack is elastic and password is changeme
  2. The default username of Kibana for xpack is elastic and password is changeme
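A quick way to confirm the node is up and the default credentials work (a small sketch of my own, assuming Elasticsearch runs locally on the default port):

import requests

r = requests.get('http://localhost:9200/', auth=('elastic', 'changeme'))
print(r.status_code)                    # 200 if the credentials are accepted
print(r.json()['version']['number'])    # e.g. 5.2.2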

XGBoost Installation under Anaconda

I tried to install the newest XGBoost version, xgboost 0.6a2, under the environment:

Python 3.5.1 :: Anaconda custom (x86_64)

As recommended in the document, I firstly try:

  1. brew tap homebrew/versions
  2. brew install gcc --without-multilib
  3. pip install xgboost

However, I got one error which I could not fix:

Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/4v/r8x7ssn916792z84sf7chx740000gn/T/pip-build-m6rki5l0/xgboost/

Then I tried the other option in the XGBoost documentation, which is to clone the XGBoost GitHub repo and then install it:

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost
cp make/config.mk ./config.mk
make -j4

Install system-wide, which requires root permission:

cd python-package
sudo python setup.py install

You will however need the Python distutils module for this to work. It is often part of the core Python package, or it can be installed using your package manager, e.g. in Debian use

sudo apt-get install python-setuptools

If you recompiled xgboost, then you need to reinstall it to make the new library take effect.

Note: you may need to run brew tap homebrew/versions and brew install gcc --without-multilib if you did not install gcc-5 or gcc-6 before.
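After either route, a quick import check confirms the install (my own sketch):

import xgboost as xgb

print(xgb.__version__)        # version string of the installed package
clf = xgb.XGBClassifier()     # constructing a model verifies the compiled library loads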

Some lessons we learnt from our first Kaggle

Presentation Slides

Self-Defined Classification Functions in Python (Part 2)

In this part, I will introduce some self-defined functions to choose parameters for classifiers. These functions form a pipeline that runs from feature adjustment (transforming categorical variables into dummy variables), through stratified sampling and choosing the best parameter combination, to printing out accuracy for both the training and test sets.

Base functions

import pandas as pd
import xgboost as xgb
from sklearn.grid_search import GridSearchCV
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import confusion_matrix

Choose Best Parameters

def cv_optimize(clf, parameters, X, y, stratified = None, n_jobs=1, n_folds=5, score_func=None):
    if stratified != None:
        gs = GridSearchCV(clf, param_grid=parameters, cv=stratified, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    gs.fit(X, y)
    print ("BEST", gs.best_params_, gs.best_score_, gs.grid_scores_)
    best = gs.best_estimator_
    return best

Important Arguments

  • clf - original classifier
  • parameters - grid to search over
  • X - usually your training X matrix
  • y - usually your training y
  • stratified - stratified index from sklearn.cross_validation.StratifiedKFold
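A minimal usage sketch of cv_optimize (my own example on the iris data, using the same older sklearn API as above; the toy names below are illustrative):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

iris = load_iris()
X_toy = iris.data
y_toy = (iris.target == 0) * 1        # binary target, as in do_classify

params = {'C': [0.01, 0.1, 1, 10]}
best_lr = cv_optimize(LogisticRegression(), params, X_toy, y_toy, n_folds=5)
print(best_lr.score(X_toy, y_toy))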

Main Function

def do_classify(clf, indf, featurenames, targetname, target1val, parameters = None, mask=None, random_state = 30,
                reuse_split=None, stratified = None, dummies = False, score_func=None, n_folds=4, n_jobs=1):
    y=(indf[targetname].values==target1val)*1
    if dummies:
        X = indf[featurenames]
        X_train = X.iloc[stratified[0],:]
        # Note: stratification is hard-coded on the contbr_st (contributor state) column
        # from the election donation data set used in the example below.
        stratified_train = StratifiedKFold(X_train.contbr_st, n_folds=n_folds, random_state = random_state)
        X = pd.get_dummies(X, prefix = '', prefix_sep = '').values
    else:
        X = indf[featurenames].values
    if stratified != None:
        stratified = list(stratified)
        print('using stratified sampling')
        Xtrain, ytrain = X[stratified[0]], y[stratified[0]]
        Xtest, ytest = X[stratified[1]], y[stratified[1]]
    if mask != None:
        print("using mask")
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if parameters != None:
        if stratified != None:
            clf = cv_optimize(clf, parameters, Xtrain, ytrain, stratified = stratified_train,
                              n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
        else:
            clf = cv_optimize(clf, parameters, Xtrain, ytrain,
                  n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)

    clf=clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print ("############# based on standard predict ################")
    print ("Accuracy on training data: %0.2f" % (training_accuracy))
    print ("Accuracy on test data:     %0.2f" % (test_accuracy))
    print (confusion_matrix(ytest, clf.predict(Xtest)))
    print ("########################################################")
    return clf, Xtrain, ytrain, Xtest, ytest

Important arguments

  • indf - Input dataframe
  • featurenames - vector of names of predictors
  • targetname - name of column you want to predict (e.g. 0 or 1, ‘M’ or ‘F’, ‘yes’ or ‘no’)
  • target1val - particular value you want to have as a 1 in the target
  • mask - boolean vector indicating test set (~mask is training set) (we’ll use this to test different classifiers on the same test-train splits)

  • stratified - list that stores stratified index, normally we can get this from list(StratifiedKFold(, n_folds=5))[0]. If it is defined, then inside cv_optimize will also apply stratified sampling

  • dummies - If True, the categorical features will be transformed as dummy variables
  • score_func - we’ve used the accuracy as a way of scoring algorithms but this can be more general later on

  • n_folds - number of folds for cross validation
  • n_jobs - used for parallelization
  • Note: stratified and mask cannot be applied simultaneously

Remarks

  • stratified in this model is for stratifying samples based on one specific feature, not on the target. It deals with the problem of a highly imbalanced categorical feature.
  • If your target is highly imbalanced, one option is to set class_weight = 'balanced' inside the sklearn classifiers.

One Example

Here is one example from one of my projects, predicting party preference in the 2016 US presidential election based on a donation data set. When I did the feature engineering, there was one tricky issue I needed to tackle. After I ruled out all missing data, the numbers of donors varied widely across states. For example, California has about 60,000 donors, whereas Nevada only has 10. However, we still do not want to lose any states when we split the data into training and test sets. In this case, the best choice is a stratified sampling strategy.

gbm = xgb.XGBClassifier()
skf = list(StratifiedKFold(df_clean.contbr_st, n_folds=5, random_state=30))[0]   #Stratified sampling first, then make variable dummies.
featurenames = ['contbr_st', 'employer_categorized', 'salary', 'MedianPrice', 'Gender',
           'is_Retired', 'is_Unemployed_NotRetired', 'is_Self_Employed']
targetname = 'party'
param = {"max_depth": [7, 9, 11], "n_estimators": [300, 400, 500], 'learning_rate': [0.05, 0.08, 0.1]}
gbm_fitted, Xtrain, ytrain, Xtest, ytest = do_classify(gbm, df_clean, featurenames, targetname, 1, stratified = skf,
                       parameters = param,dummies = True, n_jobs = 4, n_folds = 4)

Self-Defined Classification Functions in Python (Part 1)

Visualization

import numpy as np
import pandas as pd
from sklearn.cross_validation import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
import matplotlib as mpl
import matplotlib.cm as cm
import matplotlib.pyplot as plt
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
from matplotlib.colors import ListedColormap

Binary Classification results based on two features

cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
def classify_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.1, psize=10):
    h = .02
    X=np.concatenate((Xtr, Xte))
    x_min, x_max = X[:, 0].min() - .5, X[:, 0].max() + .5
    y_min, y_max = X[:, 1].min() - .5, X[:, 1].max() + .5
    xx, yy = np.meshgrid(np.linspace(x_min, x_max, 100),
                         np.linspace(y_min, y_max, 100))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    ZZ = Z.reshape(xx.shape)
    if mesh:
        plt.pcolormesh(xx, yy, ZZ, cmap = colorscale, alpha=alpha, axes=ax)
    # color the points by their true labels (defined here so the mesh=True path works too)
    showtr = ytr
    showte = yte
    ax.scatter(Xtr.iloc[:, 0], Xtr.iloc[:, 1], c=showtr-1, cmap=cdiscrete, s=psize, alpha=alpha, edgecolor="k")
    # and testing points
    ax.scatter(Xte.iloc[:, 0], Xte.iloc[:, 1], c=showte-1, cmap=cdiscrete, alpha=alpha, marker="*", s=psize+30)
    ax.set_xlim(xx.min(), xx.max())
    ax.set_ylim(yy.min(), yy.max())
    return ax,xx,yy

Parameters

  • ax: figure information
  • Xtr: pandas.DataFrame

    Training set of features where features are in the columns

  • Xte: DataFrame

    Testing set of features where features are in the columns

  • ytr: DataFrame

    Training set of target

  • yte: DataFrame

    Testing set of target

  • clf: fitted classifier
  • mesh: plot the mesh or not. If False, it only plots the data points.
  • colorscale: specify the Colormap for mesh plot

    Colormap is the base class used to convert numbers into RGBA colors.

  • cdiscrete: specify the Colormap for points
  • alpha: transparency
  • psize: size of points

Returns

  1. ax:figure information
  2. xx: x axis coordinates
  3. yy: y axis coordinates

Demo:

dfhw=pd.read_csv("https://dl.dropboxusercontent.com/u/75194/stats/data/01_heights_weights_genders.csv")
df=dfhw.sample(500, replace=False)
df.Gender = (df.Gender=="Male") * 1
X_train, X_test, y_train, y_test = train_test_split(
    df[['Height', 'Weight']], df.Gender, test_size=0.4, random_state=42)

clflog = LogisticRegression()
clf = clflog.fit(X_train, y_train)
plt.figure(figsize = (15,13))
ax=plt.gca()
classify_plot(ax, X_train, X_test, y_train, y_test, clf, alpha = 0.5)
plt.show()

Probability boundaries

cm = plt.cm.RdBu
def classify_plot_prob(ax, Xtr, Xte, ytr, yte, clf, colorscale=cmap_light, cdiscrete=cmap_bold, ccolor=cm, psize=10, alpha=0.1, prob=True):
    ax,xx,yy = classify_plot(ax, Xtr, Xte, ytr, yte, clf, mesh=False, colorscale=colorscale, cdiscrete=cdiscrete, psize=psize, alpha=alpha)
    if prob:
        Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])[:, 1]
    else:
        Z = clf.decision_function(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    plt.contourf(xx, yy, Z, cmap=ccolor, alpha=.2, axes=ax)
    cs2 = plt.contour(xx, yy, Z, cmap=ccolor, alpha=.6, axes=ax)
    plt.clabel(cs2, fmt = '%2.1f', colors = 'k', fontsize=14, axes=ax)
    return ax 

Parameters

  1. ccolor: specify the Colormap for contour plot
  2. prob: use probability or decision function

Demo:

plt.figure(figsize = (15,13))
ax=plt.gca()
classify_plot_prob(ax, X_train, X_test, y_train, y_test, clf, alpha = 0.5)
plt.show()

Tree Classifier plot

def plot_tree(ax, Xtr, Xte, ytr, yte, clf, plot_train = True, plot_test = True, lab = ['Feature 1', 'Feature 2'],
               mesh=True, colorscale=cmap_light, cdiscrete=cmap_bold, alpha=0.3, psize=10):
    # Create a meshgrid as our test data
    plt.figure(figsize=(15,10))
    plot_step= 0.05
    xmin, xmax= Xtr[:,0].min(), Xtr[:,0].max()
    ymin, ymax= Xtr[:,1].min(), Xtr[:,1].max()
    xx, yy = np.meshgrid(np.arange(xmin, xmax, plot_step), np.arange(ymin, ymax, plot_step) )

    # Re-cast every coordinate in the meshgrid as a 2D point
    Xplot= np.c_[xx.ravel(), yy.ravel()]
    # Predict the class
    Z = clf.predict( Xplot )

    # Re-shape the results
    Z= Z.reshape( xx.shape )
    cs = ax.contourf(xx, yy, Z, cmap= colorscale, alpha=0.3)
  
    # Overlay training samples
    if (plot_train == True):
        ax.scatter(Xtr[:, 0], Xtr[:, 1], c=ytr-1, cmap=cdiscrete, alpha=alpha,edgecolor="k") 
    # and testing points
    if (plot_test == True):
        ax.scatter(Xte[:, 0], Xte[:, 1], c=yte-1, cmap=cdiscrete, alpha=alpha, marker="*")

    ax.set_xlabel(lab[0])
    ax.set_ylabel(lab[1])
    ax.set_title("Boundary for decision tree classifier",fontsize=15)

Demo

from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
plt.figure(figsize = (10,10))
ax=plt.gca()
plot_tree(ax, np.array(X_train), np.array(X_test), np.array(y_train), np.array(y_test), 
           clf = rf, lab = ['height', 'weight'], alpha = 1)

Colormaps

Colormaps are very useful when we try to convert numbers into colors.

If the colormap has listed values like ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF']), it maps each number onto one of these three colors. For example, if \(X = [x_1, … , x_n]\) has values from 0 to 1, then [0, 0.33] indicates color #FFAAAA, (0.33, 0.66] indicates color #AAFFAA and (0.66, 1] indicates color #AAAAFF.

If the colormap is continuous like plt.cm.RdBu, then the numbers are mapped onto that continuous range. The following is one example:
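A small sketch of the idea (illustrative, not the original figure): push the same numbers through a discrete and a continuous colormap and inspect the resulting RGBA values.

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap

cmap_discrete = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_continuous = plt.cm.RdBu

x = np.array([0.1, 0.5, 0.9])
print(cmap_discrete(x))      # three fixed colors, one per bin
print(cmap_continuous(x))    # smoothly interpolated colors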

Precision vs. Recall

def pr_curve(truthvec, scorevec, digit_prec=2):
    threshvec = np.unique(np.round(scorevec,digit_prec))
    numthresh = len(threshvec)
    tpvec = np.zeros(numthresh)
    fpvec = np.zeros(numthresh)
    fnvec = np.zeros(numthresh)

    for i in range(numthresh):
        thresh = threshvec[i]
        tpvec[i] = sum(truthvec[scorevec>=thresh])
        fpvec[i] = sum(1-truthvec[scorevec>=thresh])
        fnvec[i] = sum(truthvec[scorevec<thresh])
    recallvec = tpvec/(tpvec + fnvec)
    precisionvec = tpvec/(tpvec + fpvec)
    plt.plot(precisionvec,recallvec)
    plt.axis([0, 1, 0, 1])
    plt.xlabel('precision')
    plt.ylabel('recall')
    return (recallvec, precisionvec, threshvec)

Demo

rf = RandomForestClassifier()
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
plt.figure(figsize = (10,10))
pr_curve(y_test, y_pred)

ROC Curve

This function can present the thresholds in the ROC figure.

def make_roc(name, clf, ytest, xtest, ax=None, labe=5, proba=True, skip=0):
    initial=False
    if not ax:
        ax=plt.gca()
        initial=True
    if proba:
        fpr, tpr, thresholds=roc_curve(ytest, clf.predict_proba(xtest)[:,1])
    else:
        fpr, tpr, thresholds=roc_curve(ytest, clf.decision_function(xtest))
    roc_auc = auc(fpr, tpr)
    if skip:
        l=fpr.shape[0]
        ax.plot(fpr[0:l:skip], tpr[0:l:skip], '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    else:
        ax.plot(fpr, tpr, '.-', alpha=0.3, label='ROC curve for %s (area = %0.2f)' % (name, roc_auc))
    label_kwargs = {}
    label_kwargs['bbox'] = dict(
        boxstyle='round,pad=0.3', alpha=0.2,
    )
    for k in range(0, fpr.shape[0],labe):
        #from https://gist.github.com/podshumok/c1d1c9394335d86255b8
        threshold = str(np.round(thresholds[k], 2))
        ax.annotate(threshold, (fpr[k], tpr[k]), **label_kwargs)
    if initial:
        ax.plot([0, 1], [0, 1], 'k--')
        ax.set_xlim([0.0, 1.0])
        ax.set_ylim([0.0, 1.05])
        ax.set_xlabel('False Positive Rate')
        ax.set_ylabel('True Positive Rate')
        ax.set_title('ROC')
    ax.legend(loc="lower right")
    print('AUC: ', roc_auc)
    return ax

Parameters

  • name: string, indicating the name of model
  • clf: fitted classifier
  • labe: the distance between two adjacent labels.
  • proba: some classifiers, like KNN, Logistic regression, need to set proba = True
  • skip: the step size when plot ROC

Demo

plt.figure(figsize = (10,10))
make_roc('Random Forest', rf, y_test, X_test)
plt.show()

Remarks

  1. The first two functions are only for logistic regression, KNN, SVM and Random Forest.
  2. When you try to feed SVM into classify_plot_prob, you need to set probability = True in SVC().
  3. The second function may present ugly figures for KNN and Random Forest.

K Dimension Tree and K Nearest Neighbors

K Dimension Tree

In some cases we may be not interested in the question “is X in this container,” but rather “what value in the container is X most similar to?” At a high level, a kd-tree is a generalization of a binary search tree that stores points in k-dimensional space.

This is quoted from this nice introduction, and you can find a more detailed explanation in it. Here I just excerpt some of its contents.

When we consider a balanced k-d tree, we always use the median value as the split boundary.

Nearest-Neighbor

  • The first target should be in the smallest cell: we randomly choose one point in that cell, then construct a candidate hypersphere centered at the test point and running through the guess point. The nearest neighbor to the test point must lie inside this hypersphere.
  • If this hypersphere is fully to one side of a splitting hyperplane, then all points on the other side of the splitting hyperplane cannot be contained in the sphere and thus cannot be the nearest neighbor.
  • To determine whether the candidate hypersphere crosses a splitting hyperplane that compares coordinate i, we check whether \(|b_i - a_i| < r\).
  • If this hypersphere crosses some boundaries, we need to go back to the parent node (possibly multiple levels up). Then get a point to compute the distance and compare it with the previous distance.

This algorithm can be shown to run in O(log n) time on a balanced kd-tree with n data points provided that those points are randomly distributed. In the worst case, though, the entire tree might have to be searched. However, in low-dimensional spaces, such as the Cartesian plane or three-dimensional space, this is rarely the case.

k-Nearest Neighbor Searches

We will apply a k-d tree with a bounded priority queue (or BPQ for short) to tackle this problem. A bounded priority queue is similar to a regular priority queue, except that there is a fixed upper bound on the number of elements that can be stored in the BPQ. Whenever a new element is added to the queue, if the queue is at capacity, the element with the highest priority value is ejected from the queue.
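A quick illustration (not from the original article) of k-nearest-neighbor queries on a kd-tree, using scipy's implementation on random 2-D points:

import numpy as np
from scipy.spatial import cKDTree

points = np.random.rand(1000, 2)       # 1000 random points in 2-D space
tree = cKDTree(points)                 # build the kd-tree

query = np.array([0.5, 0.5])
dist, idx = tree.query(query, k=3)     # 3 nearest neighbors (the BPQ bound)
print(dist, points[idx])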

Generators in Python

Definition

A generator function is defined like a normal function, but whenever it needs to generate a value, it does so with the yield keyword rather than return. If the body of a def contains yield, the function automatically becomes a generator function (even if it also contains a return statement). There’s nothing else we need to do to create one.

Just remember that a generator is a special type of iterator.

In Python, “functions” with these capabilities are called generators, and they’re incredibly useful. generators (and the yield statement) were initially introduced to give programmers a more straightforward way to write code responsible for producing a series of values. Previously, creating something like a random number generator required a class or module that both generated values and kept track of state between calls. With the introduction of generators, this became much simpler.

To better understand the problem generators solve, let’s take a look at an example. Throughout the example, keep in mind the core problem being solved: generating a series of values.

Outside of Python, all but the simplest generators would be referred to as coroutines.

Stop criteria

  1. If the generator function calls return.
  2. If the generator is exhausted: after all values have been yielded, calling next() again raises a StopIteration error (see the short sketch below).
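A tiny sketch of both stop conditions (my own example):

def count_to(n):
    for i in range(n):
        yield i                 # stop condition 2: values eventually run out
    return                      # stop condition 1: an explicit return ends the generator

g = count_to(2)
print(next(g))   # 0
print(next(g))   # 1
try:
    next(g)      # exhausted: raises StopIteration
except StopIteration:
    print('generator exhausted')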

Examples

Basic function

fibSeries = [0,1]
def fib1(n):
    for i in range(2,n):
        fibSeries.append(fibSeries[i-1]+fibSeries[i-2])
    return fibSeries[n-1]

The complexity is O(N).

Recursive function

def fib2(n):
    if n==1:
        return 0
    elif n==2:
        return 1
    else:
        return fib2(n-1)+fib2(n-2)

The issue with this implementation is that each call spawns two more calls, which becomes exponential. Think of fib2(20). It will call fib2(18) and fib2(19). fib2(19) will call fib2(18) and fib2(17). So fib2(18) gets called 2 times, fib2(17) 3 times, etc., and the count goes up exponentially. Running the same code with a counter shows that fib2(20) causes the function to be called over 13K times. Try fib2(40)… the code wouldn't even complete.
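A quick way to see the blow-up (my own sketch): count the calls with a global counter.

calls = 0
def fib2_counted(n):
    global calls
    calls += 1
    if n <= 2:
        return n - 1            # same convention as fib2: fib(1)=0, fib(2)=1
    return fib2_counted(n - 1) + fib2_counted(n - 2)

fib2_counted(20)
print(calls)   # 13529 calls for n = 20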

Generators

def temp():
    a,b = 0,1
    yield a
    yield b
    while True:
        a, b = b, a + b
        yield b
def fib3(n):
    res = []
    for i,f in enumerate(temp()):
        if i > n - 1: break    # stop after the first n Fibonacci numbers
        res.append(f)
    return res[-1]

Running time

%timeit fib1(10)
%timeit fib2(10)
%timeit fib3(10)

Output:

100000 loops, best of 3: 2.24 µs per loop
10000 loops, best of 3: 26.3 µs per loop
100000 loops, best of 3: 3.31 µs per loop

So the basic function is the best, the generator function comes next and the recursive function is the worst.

This Fibonacci sequence is not a good problem for a generator function, so I will provide another, fancier example. This example is shamelessly stolen from this blog.

import random

def get_data():
    """Return 3 random integers between 0 and 9"""
    return random.sample(range(10), 3)

def consume():
    """Displays a running average across lists of integers sent to it"""
    running_sum = 0
    data_items_seen = 0

    while True:
        data = yield
        data_items_seen += len(data)
        running_sum += sum(data)
        print('The running average is {}'.format(running_sum / float(data_items_seen)))

def produce(consumer):
    """Produces a set of values and forwards them to the pre-defined consumer
    function"""
    while True:
        data = get_data()
        print('Produced {}'.format(data))
        consumer.send(data)
        yield

if __name__ == '__main__':
    consumer = consume()
    consumer.send(None)
    producer = produce(consumer)

    for _ in range(10):
        print('Producing...')
        next(producer)

Tips

  • generators are used to generate a series of values
  • yield is like the return of generator functions
  • The only other thing yield does is save the “state” of a generator function
  • A generator is just a special type of iterator
  • Like iterators, we can get the next value from a generator using next()
  • It is common to use while True: inside generator function.

SQL

Today in the Metis bootcamp we learned SQL. I read the book Learning SQL some months ago, but I still need more practice with it, so I am taking some notes here to remind myself of the basic syntax. This practice is based on this website.

Some resources:

  1. Visual Explanation of SQL Joins
  2. SQL Quick Guide

The typical SQL database structure:

What is the name of the customer who has the most orders?

select c.CustomerID ,COUNT(*) AS count 
from orders o 
join Customers c 
on c.CustomerID=o.CustomerID
group by o.customerID
order by count DESC;

What supplier has the highest average product price?

select s.SupplierName, avg(price) 
from products p
join suppliers s on p.supplierID=s.supplierID
group by 1

How many different countries are there customers from? (Hint: Consider DISTINCT.)

select count(distinct country) as total_countries from customers

What category appears in the most orders?

select categoryName, count(distinct orderID)
from orderDetails o
join products p on p.productID = o.ProductID
join categories c on c.categoryID=p.categoryID
group by 1
order by 2 DESC
limit 1

What was the total cost for each order?

select orderID, sum(product_price)
from (select orderID, quantity*price as product_price
from orderdetails o
join products p on p.productID=o.productID)
group by 1

What employee made the most sales (by total cost)?

SELECT e.FirstName, e.LastName, e.employeeID, sum(product_price)
from 
  (select orderID, sum(quantity*price) as product_price
  from orderdetails as o
  join products as p on p.productID=o.productID
  group by 1) x
join orders o on x.orderID=o.orderID
join employees e on e.employeeID=o.employeeID
group by 1,2 
order by 4 desc

What Employees have BS degrees? (Hint: Look at the LIKE operator.)

SELECT firstName||" "||LastName as name from employees
where Notes like '% BS %'

What supplier of three or more products has the highest average product price? (Hint: Look at the HAVING operator.)

select supplierName, avg_price from
 (select supplierID, avg(price) as avg_price
 from products 
 group by 1 
 having count(*) >2) x
join suppliers s on x.supplierID=s.supplierID
order by avg_price DESC

Remark:

  1. ||" "||: Combine two columns.

Set up Jupyter Notebook on AWS

After one day of intense work on setting up IPython on AWS, I think it is worth writing the whole process down. There are too many details to keep track of. Most of the contents here are shamelessly stolen from Andrew Blevins.

Contents Table

  1. Add the inbound port when setting up the instance
  2. Add user and security issue (not necessary)
  3. Download Anaconda and software updates
  4. Setting up Jupyter Notebook
  5. Release Jupyter Notebook from terminal

Add the inbound port when setting up the instance

When setting up the instance (I used an Ubuntu instance), we need to add more ports when setting up the security group:

Adding User and Security

Open AWS server on your local terminal

ssh -i .ssh/aws_key.pem ubuntu@11.11.11.11

Remarks

  1. aws_key.pem is the downloaded password key from aws when we create the instance.
  2. 11.11.11.11 is the AWS instance public IP.
  3. This is a public account so it is not safe.

Add new user

ubuntu@ip-172-31-60-68:/home$ sudo adduser jason

Note: pick a password (save it in an easy-to-find place !! ); enter through all the other questions (name fields, etc.)

Delete user

$ sudo userdel -r olduser

User privileges

Make yourself special by granting yourself root privileges: type sudo visudo. This will open up nano (a text editor) to edit the sudoers file. Find the line that says root ALL=(ALL:ALL) ALL. Give yourself a line beneath that which says [username] ALL=(ALL:ALL) ALL.

User privilege specification
root     ALL=(ALL:ALL) ALL
jason  ALL=(ALL:ALL) ALL

Save file in nano editor: Ctrl-o then Enter when asked for the file name.
Exit file from nano editor: Ctrl-x

Setting up User Account

Now you have a user account, but you can’t just log in with a password. Passwords aren’t secure enough.
Copy your GitHub public key (from your local machine) ~/.ssh/id_rsa.pub into the authorized keys file on your remote machine. Your GitHub public key may have a different name. You need to know the password for this SSH key. Please note that this password may not be the same one you use to log in to your GitHub account. If you cannot remember it, you need to regenerate your public SSH key:

Jason@news-MacBook-Pro:~/.ssh$ ssh-keygen -t rsa

Remarks:

  1. After this command you need to rename the public key file or use the default id_rsa.
  2. You will set your password again.

Create the authorized_keys file

On your remote machine (AWS):

  1. create the directory
  2. then copy key from local machine (all characters inside id_rsa.pub) to remote machine.
sudo mkdir -p /home/my_cool_username/.ssh/
sudo vi /home/my_cool_username/.ssh/authorized_keys

When you paste the key please be careful that you may miss the first character ‘s’.

My example:

1)  get output from your (local machine) public key file like this:
jason$ pwd
/Users/jason/.ssh
jason$ cat id_rsa.pub

2) Copy everything (Command c)

3) On your AWS machine:  
after you run:
$ sudo nano /home/jason/.ssh/authorized_keys

To paste in the current window:  Command v
then hit  
ctrl o (to save)  
enter
ctrl x (to exit)

Change login name

Open a new terminal on your local machine under the root directory. Then create a file .ssh/config, in which .ssh is the folder where you put your password key id_rsa.

$ vim .ssh/config
#Then inside config file paste the following lines:
Host machine_name_goes_here
Hostname 123.234.123.234
User my_cool_username

My example:

Host aws_ipython
     HostName 54.172.80.98
     User jason

Remarks

  1. Host is the AWS instance name.
  2. HostName is the public IP of the AWS instance
  3. User is the new user name you just created.

Now you can log in to your remote machine with ssh machine_name_goes_here.

Send a file from your local machine to your remote machine

$ scp cool_file.png machine_name_goes_here:~

Type exit on your remote machine and open a shell on your local machine:

$ ssh -i .ssh/id_rsa machine_name_goes_here

If you do not change the default user name, you can use the following commands:

scp -i <location of aws key> melville-moby_dick.txt ubuntu@11.11.11.11:~/data/

Remarks:

  1. Please be careful about .ssh/id_rsa, it is not .ssh/id_rsa.pub!!
  2. You need to enter your github password for SSH.
  3. Next time you only need to type ssh machine_name_goes_here to login.

Download Anaconda and software updates

After we use the new user name to log in to AWS, we need to update Python and install Jupyter Notebook. The easiest way is to download Anaconda.

Download Anaconda

jason@ip-172-31-60-68:~$ wget http://repo.continuum.io/archive/Anaconda3-4.2.0-Linux-x86_64.sh
jason@ip-172-31-60-68:~$ bash Anaconda3-4.2.0-Linux-x86_64.sh

After a few minutes the install will finish and tell you to put the folder that was just created at the top of your $PATH. Modify your .bashrc

jason@ip-172-31-60-68:~$ vim .bashrc

Inside .bashrc put the following line:

# added by Anaconda3 4.2.0 installer
export PATH="/home/ubuntu/anaconda3/bin:$PATH"

Then source .bashrc from your new user's home directory.

jason@ip-172-31-60-68:~$ source .bashrc

You may need to download the newest version.

Confirm the version of python

jason@ip-172-31-60-68:~$ python --version

Update software

jason@ip-172-31-60-68:~$ sudo apt-get update

apt-get Package Management Tool
Read more about apt-get at the above link.

Setting up Jupyter Notebook

On your remote machine:

$ ipython
In [1]:from IPython.lib import passwd
In [2]:passwd()

Remarks:

  1. You need to set your password for Jupyter Notebook.
  2. Your output will have a string that starts with 'sha1:'. Copy this string somewhere for later use.

Then you need to do this:

$ cd ~
$ mkdir .certs
$ cd .certs
$ sudo openssl req -x509 -nodes -days 365 -newkey rsa:1024 -keyout mycert.pem -out mycert.pem

Change the jupyter configuration

jason@ip-172-31-60-68:~$ mkdir .jupyter
jason@ip-172-31-60-68:~$ cd ~/.jupyter/
jason@ip-172-31-60-68:~$ vi jupyter_notebook_config.py

Then you need to copy the following code into this file, starting at the second line (under # Configuration file for jupyter-notebook.).

c = get_config()

# Kernel config
c.IPKernelApp.pylab = 'inline'  # if you want  plotting support always in your notebook

# Notebook config
c.NotebookApp.certfile = '/home/my_cool_username/.certs/mycert.pem' #location of your certificate file
c.NotebookApp.ip = '*'
c.NotebookApp.open_browser = False  #so that the ipython notebook does not open up a browser by default
c.NotebookApp.password = 'sha1:68c136a5b064...'  #the encrypted password we generated in ipython
# Set the port to match the port we opened in the security group
c.NotebookApp.port = 8888

Remarks:

  1. This is for Python 3+.
  2. Be careful about c.NotebookApp.certfile line. You need to enter your own path.

Then let’s run it!

$ cd ~
$ mkdir Notebooks
$ cd Notebooks
$ jupyter notebook

After this you need to open your browser and go to https://11.11.11.11:8888 (https because of the certificate we generated), where 11.11.11.11 is your AWS instance public IP.

Release Jupyter Notebook from terminal

Sometimes we may shut down the terminal by mistake, and then all the running notebooks will be affected. We can use the following approach to avoid this situation:

$ nohup jupyter notebook

Output:

nohup: ignoring input and appending output to ‘nohup.out’

You cannot do anything else in this terminal now, but you can safely close it.

Kill this Jupyter Notebook

Open a new terminal and login your remote AWS machine.

$ cd Notebooks     #this folder is where you run your Jupyter Notebook last time
$ lsof nohup.out

If your program is still running, you can see something like this:

COMMAND    PID  USER   FD   TYPE DEVICE SIZE/OFF   NODE NAME
jupyter-n 3034 jason    1w   REG  202,1      838 269167 nohup.out
jupyter-n 3034 jason    2w   REG  202,1      838 269167 nohup.out

Then you can kill the Jupyter Notebook by this:

kill -9 3034

Remarks

  1. 3034 is the PID.

Movie website scraping part 2

Comparison of movie directors

Movie investors always want to make the investment that ends up with the highest return. The director is one of the most significant factors in achieving this goal. Therefore, it is worth predicting the profit for a given budget based on each director's historical record.

At this time, I only analyze four famous science fiction directors: James Cameron, Steven Spielberg, Quentin Tarantino and Francis Ford Coppola.

Results

Box office vs. budget

  • From this figure, if the budget is less than $100 million Steven is the best, whereas if the budget is larger than $100 million James is outstanding.

Ratio of box office and budget

  • From this ratio figure, if the budget is less than $55 million Steven is the best, whereas if the budget is larger than $55 million James is outstanding.

Takeaway points

  1. Be careful with the outliers; rule them out before applying regression.
  2. Find the right function to fit the data.
  3. The ratio cannot be negative, so we need to consider this when choosing a function to fit the data. When I fitted the ratio for Steven, I applied a logarithm function.

Further works

  1. Scrape more directors and add some more features, like genre, producer and editor.
  2. The ideal final product would be a web page where you can input features and it outputs a ranked list of directors.

Detailed process

Scrape data from wiki

James Cameron

#https://en.wikipedia.org/wiki/James_Cameron_filmography
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

director = 'James_Cameron'
url = 'https://en.wikipedia.org/wiki/' + director + '_filmography'
#Apply this header to avoid being denied.
headers = {
    'user-Agent': 'Mozilla/5.0'
}
response = requests.get(url, headers = headers)
page = response.text
soup = BeautifulSoup(page, "lxml")
tables = soup.find_all('table')
#Check the tables visually and then get what we want.
tables = tables[0]

James_data = []
rows = tables.find_all('tr')
for row in rows[2:]:
    row_dict={}
    th = row.find('th')
    th_ref = th.find('a')
    if th_ref is not None:
        year = row.find('td')
        if year.findNextSibling().get_text() == 'Yes':   #Make sure he is the director
            row_dict['Title'] = th_ref.get_text()
            row_dict['Year'] = year.get_text()
            row_dict['url'] = th_ref['href']
            James_data.append(row_dict)
James_df = pd.DataFrame(James_data)
James_df.head(2)
  Title Year url
0 Xenogenesis 1978 /wiki/Xenogenesis_(film)
1 Piranha II: The Spawning 1981 /wiki/Piranha_II:_The_Spawning

Now we have a table that contains all the movies James directed, each with a hyperlink so that we can find detailed information. Next I define a function to scrape the budget and box office values based on this table.

def budget_boxOffice(director_movies):
    director_data = []
    header = ['Budget', 'Box_office']
    for i in range(director_movies.shape[0]):
        url_movie = 'https://en.wikipedia.org' + director_movies.url[i]
        headers = {
        'user-Agent': 'Mozilla/5.0'
        }
        row_dict={}
        response = requests.get(url_movie, headers = headers)
        page = response.text
        soup = BeautifulSoup(page, "lxml")
        tables = soup.find(class_ = 'infobox vevent')
        rows = tables.find_all('tr')
        if rows[-1].th.get_text() == 'Box office':
            row_dict['Box office'] = rows[-1].td.get_text()
            if rows[-2].th.get_text() != 'Budget':
                row_dict['Budget'] = None
            else:
                row_dict['Budget'] = rows[-2].td.get_text()
        else:
            row_dict['Box office'] = None
            row_dict['Budget'] = rows[-1].td.get_text()
        director_data.append(row_dict)
    df2 = pd.DataFrame(director_data)
    return df2

Apply this function and merge the result with original table.

James_df2 = budget_boxOffice(James_df)
James_df = James_df.merge(James_df2, left_index = True, right_index = True)
James_df = James_df.dropna(axis = 0)
James_df.head(2)
  Title Year url Box office Budget
2 The Terminator 1984 /wiki/The_Terminator $78.3 million[4] $6.4 million[4]
3 Aliens 1986 /wiki/Aliens_(film) $131.1–183.3 million[5][7] $17–18 million[5][6]

Then we need to define another function to clean this table.

def clean_num(bucks):
    bucks = bucks.replace(',', '')
    nums = re.findall('[0-9.]{2,}', bucks)   #find the number who has more than 2 digits
    if len(nums) == 0:
        nums = re.findall('[0-9.]+ [a-z]+', bucks)     #find the number + character combination
        if len(nums) == 0:
            return None
        else:
            nums = re.findall('\d', nums[0])
    mil = re.findall('million', bucks)
    bil = re.findall('billion', bucks)
    if len(mil) > 0:
        unit = 'million'
    elif len(bil) > 0:
        unit = 'billion'
    else:
        unit = 'digit'
    if unit == 'million':
        nums = [np.float(x) for x in nums]
    if unit == 'billion':
        nums = [np.float(x) * 1000 for x in nums]
    if unit == 'digit':
        nums = [np.float(x) / 1000000 for x in nums]            
    res = np.mean(nums)
    return res
    
James_df['Box office'] = James_df['Box office'].apply(clean_num)
James_df['Budget'] = James_df['Budget'].apply(clean_num)
James_df.head(2)
  Title Year url Box office Budget
2 The Terminator 1984 /wiki/The_Terminator 78.300000 6.4
3 Aliens 1986 /wiki/Aliens_(film) 157.200000 17.5

Steven Spielberg

#https://en.wikipedia.org/wiki/James_Cameron_filmography
director = 'Steven_Spielberg'
url = 'https://en.wikipedia.org/wiki/' + director + '_filmography'
#Apply this header to avoid being denied.
headers = {
    'user-Agent': 'Mozilla/5.0'
}
response = requests.get(url, headers = headers)
page = response.text
soup = BeautifulSoup(page, "lxml")
tables = soup.find_all('table')
tables = tables[2]

In this table, some years span multiple movies, so we need to take care of this situation.

Steven_data = []
header = ['Title', 'Year', 'url']
rows = tables.find_all('tr')
for row in rows[1:]:
    row_dict={}
    td = row.find_all('td')
    td0 = td[0].get_text()
    check_td0 = re.findall('[0-9]{4,}' ,td0)    #Find number that has more than 3 digits
    if check_td0 != []:
        year = td[0].get_text()
        if (td[1].a is not None) and td[2].get_text() == 'Yes':
            row_dict['Title'] = td[1].get_text()
            row_dict['Year'] = year
            row_dict['url'] = td[1].a['href']
            Steven_data.append(row_dict)
    else:                                       #The first cell is not year so it has spanned year.
        if (td[0].a is not None) and td[1].get_text() == 'Yes':
            row_dict['Title'] = td[0].get_text()
            row_dict['Year'] = year
            row_dict['url'] = td[0].a['href']
            Steven_data.append(row_dict)

Steven_df = pd.DataFrame(Steven_data)
Steven_df.Year = pd.to_numeric(Steven_df.Year)
Steven_df = Steven_df[Steven_df.Year < 2016]    #We only want movies that have already been screened.
Steven_df.head(2)
  Title Year url
0 Amblin’ 1968 /wiki/Amblin%27
1 Duel 1971 /wiki/Duel_(1971_film)

Scrape box office and budget values. Merge and clean table.

Steven_df2 = budget_boxOffice(Steven_df)
Steven_df = Steven_df.merge(Steven_df2, left_index = True, right_index = True)
Steven_df = Steven_df.dropna(axis = 0)
Steven_df['Box office'] = Steven_df['Box office'].apply(clean_num)
Steven_df['Budget'] = Steven_df['Budget'].apply(clean_num)
Steven_df.head(2)
  Title Year url Box office Budget
2 The Sugarland Express 1974 /wiki/The_Sugarland_Express 12.8 3
3 Jaws 1975 /wiki/Jaws_(film) 470.7 9

Save data sets.

James_df.to_csv('James.txt', index = False)
Steven_df.to_csv('Steven.txt', index = False)

Regression show time

Now I will apply regression to find the fitted lines of box office vs. budget and rate of return vs. budget.

Box office vs. budget

Firstly, let’s plot all the points to visually check their relationship.

From this plot, we can guess James Cameron has a quadratic trend and the rest have linear trend.

James Cameron

import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.linear_model import LinearRegression

lm = LinearRegression()
y_James = James_df['Box office']
X_James = James_df['Budget']
X_James = sm.add_constant(X_James)
X_James['Budget_sqr'] = X_James.Budget ** 2
lm.fit(X_James, y_James)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_X['Budget_sqr'] = line_X.iloc[:, 1] ** 2
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(X_James.iloc[:, 1], y_James, '.b')
plt.title('James Cameron')
plt.xlim(0, 250)
plt.ylim(0, 3000)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Steven Spielberg

lm = LinearRegression()
y_Steven = Steven_df['Box office']
X_Steven = Steven_df['Budget']
X_Steven = sm.add_constant(X_Steven)
lm.fit(X_Steven, y_Steven)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(X_Steven.iloc[:, 1], y_Steven, '.b')
plt.title('Steven Spielberg')
plt.xlim(0, 250)
plt.ylim(0, 3000)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Quentin Tarantino

lm = LinearRegression()
y_Quentin = Quentin_df['Box office']
X_Quentin = Quentin_df['Budget']
X_Quentin = sm.add_constant(X_Quentin)
lm.fit(X_Quentin, y_Quentin)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(X_Quentin.iloc[:, 1], y_Quentin, '.b')
plt.title('Quentin Tarantino')
plt.xlim(0, 250)
plt.ylim(0, 3000)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Francis_Ford_Coppola

lm = LinearRegression()
y_Francis = Francis_df['Box office']
X_Francis = Francis_df['Budget']
X_Francis = sm.add_constant(X_Francis)
lm.fit(X_Francis, y_Francis)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(X_Francis.iloc[:, 1], y_Francis, '.b')
plt.title('Francis Ford Coppola')
plt.xlim(-10, 250)
plt.ylim(-10, 3000)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

The rate of return on investment vs. budget

Firstly, check scatter plot.

James Cameron

lm = LinearRegression()
y_James = James_df['Box office'] / James_df['Budget']
X_James = James_df['Budget']
X_James = sm.add_constant(X_James)
lm.fit(X_James, y_James)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(X_James.iloc[:, 1], y_James, '.b')
plt.title('James Cameron')
plt.xlim(0, 250)
plt.ylim(-1, 20)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Steven Spielberg

lm = LinearRegression()
y_Steven = Steven_df['Box office'] / Steven_df['Budget']
ind = y_Steven < 40
y_Steven = y_Steven[ind]
X_Steven = Steven_df['Budget']
X_Steven = X_Steven[ind]
X_Steven = sm.add_constant(X_Steven)
X_Steven['Budget_log'] = np.log(X_Steven.Budget + 1)
lm.fit(X_Steven[['const', 'Budget_log']], y_Steven)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
line_X['Budget_log'] = np.log(line_X.iloc[:, 1] + 1)
line_y = lm.predict(line_X.iloc[:, [0,2]])
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(Steven_df['Budget'], Steven_df['Box office'] / Steven_df['Budget'], '.b')
plt.title('Steven Spielberg')
plt.xlim(-10, 250)
plt.ylim(-1, 80)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Quentin Tarantino

lm = LinearRegression()
y_Quentin = Quentin_df['Box office'] / Quentin_df['Budget']
ind = y_Quentin < 20
y_Quentin = y_Quentin[ind]
X_Quentin = Quentin_df['Budget']
X_Quentin = X_Quentin[ind]
X_Quentin = sm.add_constant(X_Quentin)
#X_Quentin['Budget_log'] = np.log(X_Quentin.Budget + 1)
#lm.fit(X_Quentin[['const', 'Budget_log']], y_Quentin)
lm.fit(X_Quentin, y_Quentin)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
#line_X['Budget_log'] = np.log(line_X.iloc[:, 1] + 1)

line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(Quentin_df['Budget'], Quentin_df['Box office'] / Quentin_df['Budget'], '.b')
plt.title('Quentin Tarantino')
plt.xlim(-10, 250)
plt.ylim(-1, 40)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Francis Ford Coppola

lm = LinearRegression()
y_Francis = Francis_df['Box office'] / Francis_df['Budget']
ind = y_Francis < 30
y_Francis = y_Francis[ind]
X_Francis = Francis_df['Budget']
X_Francis = X_Francis[ind]
X_Francis = sm.add_constant(X_Francis)
#X_Francis['Budget_log'] = np.log(X_Francis.Budget + 1)
#lm.fit(X_Francis[['const', 'Budget_log']], y_Francis)
lm.fit(X_Francis, y_Francis)
line_X = np.arange(0, 250)
line_X = sm.add_constant(line_X)
line_X = pd.DataFrame(line_X)
#line_X['Budget_log'] = np.log(line_X.iloc[:, 1] + 1)
#line_y = lm.predict(line_X.iloc[:, [0,2]])
line_y = lm.predict(line_X)
plt.plot(line_X.iloc[:, 1], line_y, '-k', alpha = 0.6)
plt.plot(Francis_df['Budget'], Francis_df['Box office'] / Francis_df['Budget'], '.b')
plt.title('Francis Ford Coppola')
plt.xlim(-10, 250)
plt.ylim(-1, 40)
fig = plt.gcf()
fig.set_size_inches(8, 8)
plt.show()

Linear Regression

Standard Process

  1. Feature engineering
  2. Split the data into train and test sets (random or not)
  3. Based on the train data set, apply cross validation to choose optimal tuning parameter for regularized models, like lasso, ridge and elastic net.
  4. Based on entire training data set, fit all regularized models with best tuning parameter and fit the base linear model as well.
  5. Apply the test data set to select the best model.
  6. Check more statistics.
  7. Check specific features and go back to adjust model.

Preparation in Python 3

1. Import some necessary packages

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_diabetes
from sklearn.linear_model import Lasso, Ridge, ElasticNet, LinearRegression
from sklearn.cross_validation import cross_val_score, train_test_split, KFold
from sklearn.grid_search import GridSearchCV
from sklearn.metrics import r2_score, mean_squared_error
import statsmodels.api as sm

2. Toy data set

diabetes = load_diabetes()

This data set has 10 features diabetes.data and one target diabetes.target. Besides, there is no description of this data set, so we need to do some feature engineering first.

Feature Engineering

Remember that the first thing for feature engineering is to imagine you are the expert in that specific area that your data comes from. From expert perspective, you can select features based on professional experience (just google it as much as you can).

1. Check whether this data set is normalized

X = diabetes['data']
X = DataFrame(X)
X.describe()

We can tell from this description that each feature has a mean around 0 and the same standard deviation. Therefore, the predictors are normalized.

In other data sets, X may not be centered, and then we need to add the intercept. In sklearn, there is an argument to add the intercept, fit_intercept = True. The alternative way is:

sm.add_constant(X)

2. Check the correlation of features and target

diabetes_df = DataFrame(diabetes.data)
diabetes_df.columns = ["X" + str(col) for col in diabetes_df.columns]
diabetes_df["target"] = diabetes.target

3. Plot the scatter points of each feature against target

From the pair plot we can tell the relationship between the features and the target as well as the relationships among the features. Some features may not have a linear relationship with the target; for those we can try to guess which familiar function has a curve similar to the nonlinear relationship. If we detect some strongly correlated features, we should be really careful. When two features are linearly correlated, we need to remove one of them from the analysis. When two features have a nonlinear relationship, we still need to guess the function that describes this relationship best.

sns.pairplot(diabetes_df)

We can tell from this pair plot figure that there are two categorical features and it is not necessary to apply polynomial regression.

Split the data

Since we need to compare the results from multiple models, we need to split the data into train and test sets.

X_train, X_holdout, y_train, y_holdout = train_test_split(diabetes.data, diabetes.target, test_size=0.1, random_state=42)

In this case, we need to shuffle the data before we split it. However, in some other cases we need to apply different strategies. If the data is a time series, then we may not need to shuffle it. If the target is categorical, then we need to use stratified k-fold.

Cross Validation

Now we need to apply cross validation to choose tuning parameter in regularized models.

1. Split train data into multiple folds

kfold = KFold(len(X_train), n_folds=5, shuffle=True, random_state=0)

In this case, we need to shuffle the data before we split it into folds (even if the data is a time series). If the target is categorical, then we need to use stratified k-fold:

from sklearn.cross_validation import StratifiedKFold
# Toy example (variables named *_demo so they do not overwrite the diabetes split above)
X_demo = np.array([[1, 2], [3, 4], [1, 2], [3, 4]])
y_demo = np.array([0, 0, 1, 1])
skf = StratifiedKFold(y_demo, n_folds=2)
for train_index, test_index in skf:
   print("TRAIN:", train_index, "TEST:", test_index)
   X_tr, X_te = X_demo[train_index], X_demo[test_index]
   y_tr, y_te = y_demo[train_index], y_demo[test_index]
   #Then fit and test your model.

TRAIN: [1 3] TEST: [0 2]

TRAIN: [0 2] TEST: [1 3]

2. Select best tuning parameter

Apply grid search and cross validation to find the optimal tuning parameters for the lasso, ridge and elastic net models. Compared with the lasso (L1) penalty, the ridge (L2) penalty penalizes small coefficients less and large coefficients more. Elastic net tries to balance these two penalties.

def cv_optimize(clf, parameters, X, y, n_folds, n_jobs=1, score_func=None):
    if score_func:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, n_jobs=n_jobs, cv=n_folds)
    gs.fit(X, y)
    print ("BEST", gs.best_params_, gs.best_score_)
    df = pd.DataFrame(gs.grid_scores_)
    #Plot mean score vs. alpha
    for param in parameters.keys():
        df[param] = df.parameters.apply(lambda val: val[param])
    plt.semilogx(df.alpha, df.mean_validation_score)
    return gs

1) Lasso

params = {
    "alpha": np.logspace(-4, -.1, 20)
}
#If we do not know whether we need to normalize the data first, we can do this:
#params = {
#    "alpha": np.logspace(-4, -.1, 20), "normalize": [False, True]
#}
lm_lasso = cv_optimize(Lasso(), params, X_train, y_train, kfold)

2) Ridge

params = {
    "alpha": np.logspace(-4, -.1, 20)
}
lm_ridge = cv_optimize(Ridge(), params, X_train, y_train, kfold)

3) Elastic Net

params = {
    "alpha": np.logspace(-4, -.1, 20)
}
lm_elastic = cv_optimize(ElasticNet(), params, X_train, y_train, kfold)

Fit models with best parameters

Now we will fit the linear regression models with or without penalties.

lm_lasso_best = Lasso(alpha = lm_lasso.best_params_['alpha']).fit(X_train, y_train)
lm_ridge_best = Ridge(alpha = lm_ridge.best_params_['alpha']).fit(X_train, y_train)
lm_elasticnet_best = ElasticNet(alpha = lm_elastic.best_params_['alpha']).fit(X_train, y_train)
lm_base = LinearRegression().fit(X_train, y_train)
y_lasso = lm_lasso_best.predict(X_holdout)
y_ridge = lm_ridge_best.predict(X_holdout)
y_elasticnet = lm_elasticnet_best.predict(X_holdout)
y_base = lm_base.predict(X_holdout)
mse_lasso = np.mean((y_lasso - y_holdout) ** 2)
mse_ridge = np.mean((y_ridge - y_holdout) ** 2)
mse_elasticnet = np.mean((y_elasticnet - y_holdout) ** 2)
mse_base = np.mean((y_base - y_holdout) ** 2)

print("Linear Regression: R squred score is ", r2_score(y_holdout, y_base), " and mean squred error is ", mse_base)
print("Lasso Regression: R squred score is ", r2_score(y_holdout, y_lasso), " and mean squred error is ", mse_lasso)
print("Ridge Regression: R squred score is ", r2_score(y_holdout, y_ridge), " and mean squred error is ", mse_ridge)
print("Elastic Net Regression: R squred score is ", r2_score(y_holdout, y_elasticnet), " and mean squred error is ", mse_elasticnet)

Then we can choose the best model by R squared score and mean squared error.

More statistics

In many situations, we want to check more statistics of the linear regression. We can apply the t statistic to select features with the base linear regression. A smaller p-value for the t statistic means that the feature is more significant. Typically, when the p-value is larger than 0.05 we say that the feature is not important. To do this, we need to use the package statsmodels.api.

#X_train = sm.add_constant(X_train)      #if data is not centered.
model = sm.OLS(y_train, X_train)
results = model.fit()
results.summary()

This table can provide a lot of statistical analysis information. From this table, if we set significance level as 0.05 there are only two significant features. However we still cannot simply decide to get rid of those unimportant features.

Check specific features

If some features are ruled out by regularization regression or have very large p-value, we need to plot the points and fitted line to make a double check. It may be caused by noisy features, non-linear relationship or correlated features.

Xt = X_train[:,2]    #Check the third feature.
X_line = np.linspace(min(Xt),max(Xt), 10).reshape(10,1)
X_line = np.tile(X_line, [1, 10])
y_train_pred = lm_base.predict(X_line)    #lm_base is the base linear model fitted above
plt.scatter(Xt, y_train, alpha=0.2)
plt.plot(X_line, y_train_pred)
plt.show()

Some further discussion about normalization (centering and scaling):

  1. Centering is equivalent to including an intercept term.
  2. When you are trying to sum or average variables that are on different scales, perhaps to create a composite score, scaling matters: without it, one variable may dominate the sum purely because of its scale, which may be undesirable.
  3. To simplify calculations and notation. For example, the sample covariance matrix of a data matrix whose columns have been centered by their sample means is simply proportional to \(X'X\). Similarly, if a univariate random variable \(X\) has been mean-centered, then \(var(X)=E(X^2)\) and the variance can be estimated from a sample by the sample mean of the squared observed values.
  4. PCA can only be interpreted as the singular value decomposition of a data matrix when the columns have first been centered by their means.

Note that scaling is not necessary for the last two points, and centering may not be necessary for the second one, so the two do not need to go hand in hand at all times. For regression without regularization, centering and scaling are not necessary, but centering helps with interpreting the coefficients. For regularized regression, centering (or adding an intercept) is necessary because we want the coefficients to have comparable magnitudes.
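As a concrete illustration of this point, here is a minimal sketch of centering and scaling with scikit-learn's StandardScaler before a regularized fit; the scaler is fitted on the training data only, and the alpha value is purely illustrative:

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Lasso

scaler = StandardScaler()                      # centers and scales each feature column
X_train_std = scaler.fit_transform(X_train)    # fit on the training data only
X_holdout_std = scaler.transform(x_holdout)    # reuse the training mean and std

lasso_std = Lasso(alpha=0.01).fit(X_train_std, y_train)   # alpha chosen only for illustration
print(lasso_std.coef_)                         # coefficients are now on a comparable scale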

Remarks and Tips

  1. If the data set is very large, grid search with cross validation will cost too much time, so we may use a small number of folds and a small candidate set of parameters.
  2. Before splitting the data and fitting the model, make sure your features are arranged in columns. Especially when you have only one feature, you may need np.reshape() to get a 2-D column.
  3. If the data set was not centered, we need to set fit_intercept = True inside each model.
  4. sklearn and statsmodels.api apply different algorithms to fit the linear model, so they will output different results.
  5. Another way to use the regularization models is to apply them to select features, then use the base linear regression to analyze the data with the selected features (a minimal sketch follows this list).
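Here is the sketch mentioned in the last remark, assuming the same X_train and y_train as above: use a Lasso model only as a feature selector and then refit an ordinary linear regression on the selected columns; the alpha value is purely illustrative.

from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso, LinearRegression

# Use Lasso purely as a feature selector: keep features with non-zero coefficients.
selector = SelectFromModel(Lasso(alpha=0.01))
X_train_sel = selector.fit_transform(X_train, y_train)
print(selector.get_support())      # boolean mask of the kept features

# Refit the base linear regression on the selected features only.
lm_selected = LinearRegression().fit(X_train_sel, y_train)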

Movie website scraping part 1

Problem Statement

1. I will fit the yearly data with regression. The dataset is from Box Office Mojo. Specifically, the total gross, number of movies, and average ticket price will be fitted against time with linear models, and the number of tickets sold will be fitted with a quadratic function. Based on the fitted models we can predict future values to help film workers get a whole picture of the future movie market.

2. I want to compare directors to see who can make the most return on investment. To achieve this goal, I will use a regression model to get the relationship between budget and box office revenue for each director. This can give investors some information about whom to invest in and what the estimated profit is. This is presented in part 2.

Results

Interesting findings

  1. The total gross, the number of movies, and the average ticket price have an overall increasing trend.
  2. The number of tickets sold had a big decline in 2005 and has kept decreasing since then.
  3. The average ticket price has increased faster since 2007.
  4. The total gross fell in 2005 and the growth rate recovered in 2006.

Conjectures

The most likely reason for the decline in tickets sold is the rise of online movie providers: YouTube was founded in 2005, Hulu in 2006, and Netflix started to provide films online in 2007. So film workers should pay more attention to the international market.

Future works:

  1. Study the international movie market trend, so that film workers can find other faster-growing markets to invest in.
  2. Analyze online movie service providers, and combine those results with what we got here to see the whole movie market trend.

The detailed process is presented in the following.

Take-away points

  1. When scraping some websites, errors may occur because the site does not follow a perfectly consistent coding pattern; see the defensive-parsing sketch below.
  2. Unicode issues will pop up sometimes.
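A small defensive-parsing sketch for both points (the helper name is only illustrative): wrap the per-cell extraction so that a missing tag does not raise, and normalize the text to plain unicode before storing it.

import unicodedata

def safe_text(cell):
    # Return the cleaned text of a BeautifulSoup cell, or '' if the cell is missing.
    try:
        text = cell.get_text(strip=True)
    except AttributeError:     # cell is None or not a tag
        return ''
    # Normalize unicode, e.g. turn non-breaking spaces into ordinary ones.
    return unicodedata.normalize('NFKC', text)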

Presentation slides

Detailed Process

Preparation in Python 3

1. Import some necessary packages

import pandas as pd
from pandas import DataFrame
import numpy as np
import matplotlib.pyplot as plt
from bs4 import BeautifulSoup
import requests
from sklearn.linear_model import LinearRegression
import seaborn as sns
sns.set_style("whitegrid")
sns.set_context("poster")
import statsmodels.api as sm
# special matplotlib argument for improved plots
from matplotlib import rcParams
import re

Yearly data analysis

Scraping data from Box Office Mojo

url = 'http://www.boxofficemojo.com/yearly/'
# Apply this header to avoid being denied.
headers = {
    'User-Agent': 'Mozilla/5.0'
}
response = requests.get(url, headers = headers)
# Check whether your request is successful.
response.status_code

If status_code shows 200, it means the request is successful.
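Alternatively, requests can raise an exception on any non-2xx response, which saves checking the code by hand:

response.raise_for_status()    # raises requests.exceptions.HTTPError for 4xx/5xx responses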

page = response.text
soup = BeautifulSoup(page, "html.parser")   # specify a parser to avoid the bs4 warning
tables = soup.find_all('table')

Then we need to check each table to see whether any property identifies the table we want. Unfortunately, there is no such property, so we have to inspect the tables visually; one way to do that check is sketched below. After this step we know tables[2] is what we want.
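A small sketch over the tables list from above: print a short preview of each table instead of dumping them all.

# Print the first row of every table to spot the one holding the yearly data.
for i, t in enumerate(tables):
    first_row = t.find('tr')
    if first_row is not None:
        print(i, first_row.get_text(' ', strip=True)[:80])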

movie_data = []
movie_list = tables[2]
# The header can also be scraped, but sometimes we want to give better column names.
# Note: 'Change1' appears twice, so the second change column overwrites the first in
# row_dict; rename one of them (e.g. 'Change2') if you want to keep both.
header = ['Year','Total_gross','Change1', 'Tickets_sold', 'Change1', 'num_movies','total_screens',
         'Ave_price', 'Ave_cost', 'movie_1']
rows = [row for row in movie_list.find_all("tr")]
for row in rows[1:]:
    row_dict = {}
    for i, cell in enumerate(row.findAll("td")):
        row_dict[header[i]] = cell.get_text()
    movie_data.append(row_dict)
movies_df = pd.DataFrame(movie_data)

Then we need to clean this table: convert some string columns to numeric and remove special characters.

movies_df.Total_gross = movies_df.Total_gross.str[1:]    # drop the leading $ sign
movies_df.Total_gross = movies_df.Total_gross.str.replace(',', '')
movies_df.Total_gross = pd.to_numeric(movies_df.Total_gross)
movies_df.Ave_price = movies_df.Ave_price.str.lstrip("$")
movies_df.Ave_price = pd.to_numeric(movies_df.Ave_price)
movies_df.Year = pd.to_numeric(movies_df.Year)
movies_df.Tickets_sold = movies_df.Tickets_sold.str.replace(',', '')
movies_df.Tickets_sold = pd.to_numeric(movies_df.Tickets_sold)
movies_df.num_movies = pd.to_numeric(movies_df.num_movies)
movies_df.head()
  Ave_cost Ave_price Change1 Tickets_sold Total_gross Year movie_1 num_movies total_screens
0 - 8.58 - 715.3 6137.0 2016 Finding Dory 362 -
1 - 8.43 +4.1% 1320.1 11128.5 2015 Star Wars: The Force Awakens 701 -
2 - 8.17 -5.6% 1268.2 10360.8 2014 American Sniper 702 -
3 - 8.13 -1.3% 1343.6 10923.6 2013 Catching Fire 688 -
4 - 7.96 +6.1% 1361.5 10837.4 2012 The Avengers 667 -

Remember to write this data to a local file.

df1 = movies_df.drop(0)           # drop the 2016 row: it does not cover the whole year
movies_df = movies_df.drop(0)
movies_df.to_csv('df_yearly.txt', index = False)

Linear Regression to fit data

Total gross

lm = LinearRegression()
y1 = df1.Total_gross
X1 = df1.Year
X1 = sm.add_constant(X1)
lm.fit(X1, y1)
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y1, '.b', label = 'data')
plt.title('Total gross')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print(lm.score(X1, y1))

0.96855021574507782

Number of movies

lm = LinearRegression()
y2 = df1.num_movies
lm.fit(X1, y2)
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y2, '.b', label = 'data')
plt.title('Number of movies')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print(lm.score(X1, y2))

0.58986278863620789

Average price

lm = LinearRegression()
y3 = df1.Ave_price
lm.fit(X1, y3)
line_X = np.arange(df1.Year.min(), df1.Year.max() + 1)
line_X = sm.add_constant(line_X)
line_y = lm.predict(line_X)
plt.plot(line_X[:, 1], line_y, '-r', label = 'Linear regressor', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y3, '.b', label = 'data')
plt.title('Average Price')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print(lm.score(X1, y3))

0.96172045677804519

Number of tickets sold

lm1 = LinearRegression()
lm2 = LinearRegression()
y4 = df1.Tickets_sold
X2 = X1
X2_2004 = X2[X2.Year <= 2004]
y4_2004 = y4[X2.Year <= 2004]
X2_2005 = X2[X2.Year > 2004]
y4_2005 = y4[X2.Year > 2004]
lm1.fit(X2_2004, y4_2004)
lm2.fit(X2_2005, y4_2005)
line_X1 = np.arange(X2_2004.Year.min(), X2_2004.Year.max() + 1)
line_X1 = sm.add_constant(line_X1)
line_X1 = pd.DataFrame(line_X1)
line_y1 = lm1.predict(line_X1)
line_X2 = np.arange(X2_2005.Year.min(), X2_2005.Year.max() + 1)
line_X2 = sm.add_constant(line_X2)
line_X2 = pd.DataFrame(line_X2)
line_y2 = lm2.predict(line_X2)
plt.plot(line_X1.iloc[:, 1], line_y1, '-r', label = 'Before 2005', alpha = 0.6)
plt.plot(X1.iloc[:, 1], y4, '.b', label = 'data')
plt.plot(line_X2.iloc[:, 1], line_y2, '-g', label = 'After 2005', alpha = 0.6)
plt.title('Tickets sold')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()
print((lm1.score(X2_2004, y4_2004) + lm2.score(X2_2005, y4_2005))/2)

0.655714846952
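The problem statement mentioned fitting the number of tickets sold with a quadratic function; as an alternative to the two-piece linear fit above, here is a minimal sketch with numpy.polyfit on the same df1 data:

# Quadratic fit of tickets sold against year.
coeffs = np.polyfit(df1.Year, y4, deg=2)               # coefficients of a*x^2 + b*x + c
years = np.arange(df1.Year.min(), df1.Year.max() + 1)
plt.plot(years, np.polyval(coeffs, years), '-r', label = 'Quadratic fit', alpha = 0.6)
plt.plot(df1.Year, y4, '.b', label = 'data')
plt.title('Tickets sold (quadratic fit)')
plt.xlim(1978, 2017)
plt.legend(loc = 2)
plt.show()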