Jason Tian

Suck out the marrow of data

Self-Defined Classification Functions in Python (Part 2)

In this part, I will introduce some self-defined functions for choosing parameters for classifiers. Together these functions form a single pipeline that goes from feature adjustment (transforming categorical variables into dummy variables) and stratified sampling, through choosing the best parameter combination, to printing out the accuracy for both the training set and the test set.

Base functions

import pandas as pd
import xgboost as xgb
from sklearn.grid_search import GridSearchCV          # pre-0.18 scikit-learn module name
from sklearn.cross_validation import StratifiedKFold  # pre-0.18 scikit-learn module name
from sklearn.metrics import confusion_matrix          # needed by do_classify below

Choose Best Parameters

def cv_optimize(clf, parameters, X, y, stratified=None, n_jobs=1, n_folds=5, score_func=None):
    """Grid-search over `parameters` and return the best estimator, refit on (X, y)."""
    if stratified is not None:
        # use the caller-supplied stratified folds for cross validation
        gs = GridSearchCV(clf, param_grid=parameters, cv=stratified, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    gs.fit(X, y)
    print("BEST", gs.best_params_, gs.best_score_, gs.grid_scores_)
    best = gs.best_estimator_
    return best

Important Arguments

  • clf - original classifier
  • parameters - grid to search over
  • X - usually your training X matrix
  • y - usually your training y
  • stratified - stratified index from sklearn.cross_validation.StratifiedKFold

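As a quick illustration, here is a minimal usage sketch. The classifier, the parameter grid, and the Xtrain/ytrain arrays are illustrative assumptions, not part of the original pipeline; the imports follow the same pre-0.18 sklearn API used above.

from sklearn.linear_model import LogisticRegression
from sklearn.cross_validation import StratifiedKFold

logreg = LogisticRegression()
param_grid = {'C': [0.01, 0.1, 1, 10]}

# plain 5-fold grid search on an existing training set (Xtrain, ytrain)
best_clf = cv_optimize(logreg, param_grid, Xtrain, ytrain, n_folds=5)

# or pass a StratifiedKFold index so the CV folds respect a chosen stratification
skf_train = StratifiedKFold(ytrain, n_folds=5, random_state=30)
best_clf = cv_optimize(logreg, param_grid, Xtrain, ytrain, stratified=skf_train)
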
Main Function

def do_classify(clf, indf, featurenames, targetname, target1val, parameters=None, mask=None, random_state=30,
                reuse_split=None, stratified=None, dummies=False, score_func=None, n_folds=4, n_jobs=1):
    # binary target: 1 where the target column equals target1val, 0 otherwise
    y = (indf[targetname].values == target1val) * 1
    if dummies:
        X = indf[featurenames]
        # build an inner stratified index on the training rows before dummifying;
        # the stratifying column (contbr_st) is hard-coded for the donation data set
        X_train = X.iloc[stratified[0], :]
        stratified_train = StratifiedKFold(X_train.contbr_st, n_folds=n_folds, random_state=random_state)
        X = pd.get_dummies(X, prefix='', prefix_sep='').values
    else:
        X = indf[featurenames].values
    if stratified is not None:
        stratified = list(stratified)
        print('using stratified sampling')
        Xtrain, ytrain = X[stratified[0]], y[stratified[0]]
        Xtest, ytest = X[stratified[1]], y[stratified[1]]
    if mask is not None:
        print("using mask")
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if parameters is not None:
        if stratified is not None:
            # note: stratified_train is only built in the dummies branch above,
            # so the stratified option currently assumes dummies=True
            clf = cv_optimize(clf, parameters, Xtrain, ytrain, stratified=stratified_train,
                              n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
        else:
            clf = cv_optimize(clf, parameters, Xtrain, ytrain,
                              n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)

    clf = clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data:     %0.2f" % (test_accuracy))
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    return clf, Xtrain, ytrain, Xtest, ytest

Important arguments

  • indf - Input dataframe
  • featurenames - vector of names of predictors
  • targetname - name of the column you want to predict (its values might be 0 or 1, ‘M’ or ‘F’, ‘yes’ or ‘no’)
  • target1val - particular value you want to have as a 1 in the target
  • mask - boolean vector indicating the training set (~mask is the test set, matching X[mask], X[~mask] in the code above); we’ll use this to test different classifiers on the same train-test split (see the sketch after this list)

  • stratified - list that stores the stratified index; normally we can get it from list(StratifiedKFold(<stratifying column>, n_folds=5))[0]. If it is given, cv_optimize will also apply stratified sampling internally

  • dummies - If True, the categorical features will be transformed as dummy variables
  • score_func - we’ve used the accuracy as a way of scoring algorithms but this can be more general later on

  • n_folds - number of folds for cross validation
  • n_jobs - used for parallelization
  • Note: stratified and mask cannot be applied simultaneously
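
For the mask option, a simple random boolean split works. This sketch is only illustrative: it assumes a dataframe df plus featurenames, targetname, and a parameter grid param defined as in the example further down, and uses a random forest purely as a stand-in classifier.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

np.random.seed(30)
mask = np.random.rand(len(df)) < 0.7   # True -> row goes to the training set

rf = RandomForestClassifier()
rf_fitted, Xtrain, ytrain, Xtest, ytest = do_classify(rf, df, featurenames, targetname, 1,
                                                      parameters=param, mask=mask)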

Remarks

  • stratified in this model is for stratifying samples based on one specific feature, not on the target. It deals with the problem of a highly imbalanced categorical feature.
  • If your target itself is highly imbalanced, one option is to set class_weight = 'balanced' inside the sklearn classifiers, as in the sketch below.
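
A minimal sketch of that option; the two classifiers here are just examples, and any sklearn estimator that accepts class_weight works the same way.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# re-weight classes inversely to their frequency in the training data
clf_lr = LogisticRegression(class_weight='balanced')
clf_rf = RandomForestClassifier(class_weight='balanced')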

One Example

Here is an example from one of my projects: predicting party preference in the 2016 US presidential election from a donation data set. During feature engineering there was one tricky issue to tackle. After ruling out all missing data, the number of donors varies widely across states: California has about 60,000 donors, whereas Nevada has only 10. Still, we do not want to lose any state when we split the data into training and test sets. In this case, the best choice is a stratified sampling strategy.

gbm = xgb.XGBClassifier()
skf = list(StratifiedKFold(df_clean.contbr_st, n_folds=5, random_state=30))[0]   # stratified sampling first, then make the categorical variables dummies
featurenames = ['contbr_st', 'employer_categorized', 'salary', 'MedianPrice', 'Gender',
           'is_Retired', 'is_Unemployed_NotRetired', 'is_Self_Employed']
targetname = 'party'
param = {"max_depth": [7, 9, 11], "n_estimators": [300, 400, 500], 'learning_rate': [0.05, 0.08, 0.1]}
gbm_fitted, Xtrain, ytrain, Xtest, ytest = do_classify(gbm, df_clean, featurenames, targetname, 1, stratified = skf,
                                                       parameters = param, dummies = True, n_jobs = 4, n_folds = 4)
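
Once do_classify returns, the fitted model and the stratified splits can be reused. For instance (classification_report comes from sklearn.metrics; nothing else is assumed beyond the variables returned above):

from sklearn.metrics import classification_report

# hyper-parameters the grid search settled on (max_depth, n_estimators, learning_rate, ...)
print(gbm_fitted.get_params())

# per-class precision / recall / F1 on the held-out stratified test split
print(classification_report(ytest, gbm_fitted.predict(Xtest)))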