In this part, I introduce a few self-defined functions for choosing classifier parameters. Together they form a single pipeline that handles feature adjustment (transforming categorical variables into dummy variables), stratified sampling, and the search for the best parameter combination, and then prints the accuracy on both the training set and the test set.
Base functions
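# Note: sklearn.grid_search and sklearn.cross_validation were removed in scikit-learn 0.20
# (superseded by sklearn.model_selection); the code in this post uses the older API.
import pandas as pd
from sklearn.grid_search import GridSearchCV
import xgboost as xgb
from sklearn.cross_validation import StratifiedKFold
from sklearn.metrics import confusion_matrix  # used by do_classify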
Choose Best Parameters
def cv_optimize(clf, parameters, X, y, stratified=None, n_jobs=1, n_folds=5, score_func=None):
    # Use the supplied stratified folds if given; otherwise fall back to plain n_folds CV.
    if stratified is not None:
        gs = GridSearchCV(clf, param_grid=parameters, cv=stratified, n_jobs=n_jobs, scoring=score_func)
    else:
        gs = GridSearchCV(clf, param_grid=parameters, cv=n_folds, n_jobs=n_jobs, scoring=score_func)
    gs.fit(X, y)
    print("BEST", gs.best_params_, gs.best_score_, gs.grid_scores_)
    # best_estimator_ is already refit on all of X, y with the best parameters.
    best = gs.best_estimator_
    return best
Important Arguments
- clf: original classifier
- parameters: grid to search over
- X: usually your training X matrix
- y: usually your training y
- stratified: stratified index from sklearn.cross_validation.StratifiedKFold
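For readers on a current scikit-learn release, the same search can be sketched with `sklearn.model_selection`. This is a minimal sketch rather than the function used in this post: the name `cv_optimize_modern`, the single `cv` argument, and the toy data are assumptions made for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold


def cv_optimize_modern(clf, parameters, X, y, cv=5, n_jobs=1, score_func=None):
    """Grid-search `parameters` for `clf`; `cv` is a fold count, a splitter, or index pairs."""
    gs = GridSearchCV(clf, param_grid=parameters, cv=cv, n_jobs=n_jobs, scoring=score_func)
    gs.fit(X, y)
    # In current releases, per-candidate scores live in gs.cv_results_ (grid_scores_ is gone).
    print("BEST", gs.best_params_, gs.best_score_)
    return gs.best_estimator_


# Toy data standing in for the real training matrix (hypothetical).
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=0)
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=30)
best_clf = cv_optimize_modern(
    LogisticRegression(max_iter=1000), {"C": [0.1, 1.0, 10.0]}, X_demo, y_demo, cv=folds
)
```

A splitter object passed as `cv` stratifies on the `y` given to `fit`. To stratify on a feature column instead, as this post does, one option is to precompute the folds with `list(folds.split(X, feature_column))` and pass that list of index pairs as `cv`.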
Main Function
def do_classify(clf, indf, featurenames, targetname, target1val, parameters=None, mask=None, random_state=30,
                reuse_split=None, stratified=None, dummies=False, score_func=None, n_folds=4, n_jobs=1):
    # Encode the target as 1 where it equals target1val and 0 elsewhere.
    y = (indf[targetname].values == target1val) * 1
    stratified_train = None  # inner CV folds; stays None unless built below
    if stratified is not None:
        stratified = list(stratified)
    if dummies:
        X = indf[featurenames]
        if stratified is not None:
            # Build the inner folds on the raw state column before it is dummy-encoded.
            # Note: the stratifying column (contbr_st) is hard-coded here.
            X_train = X.iloc[stratified[0], :]
            stratified_train = StratifiedKFold(X_train.contbr_st, n_folds=n_folds, random_state=random_state)
        X = pd.get_dummies(X, prefix='', prefix_sep='').values
    else:
        X = indf[featurenames].values
    if stratified is not None:
        print('using stratified sampling')
        Xtrain, ytrain = X[stratified[0]], y[stratified[0]]
        Xtest, ytest = X[stratified[1]], y[stratified[1]]
    if mask is not None:
        print("using mask")
        Xtrain, Xtest, ytrain, ytest = X[mask], X[~mask], y[mask], y[~mask]
    if parameters is not None:
        if stratified_train is not None:
            clf = cv_optimize(clf, parameters, Xtrain, ytrain, stratified=stratified_train,
                              n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
        else:
            clf = cv_optimize(clf, parameters, Xtrain, ytrain,
                              n_jobs=n_jobs, n_folds=n_folds, score_func=score_func)
    clf = clf.fit(Xtrain, ytrain)
    training_accuracy = clf.score(Xtrain, ytrain)
    test_accuracy = clf.score(Xtest, ytest)
    print("############# based on standard predict ################")
    print("Accuracy on training data: %0.2f" % (training_accuracy))
    print("Accuracy on test data: %0.2f" % (test_accuracy))
    print(confusion_matrix(ytest, clf.predict(Xtest)))
    print("########################################################")
    return clf, Xtrain, ytrain, Xtest, ytest
Important arguments
- indf: input dataframe
- featurenames: vector of names of predictors
- targetname: name of the column you want to predict (e.g. 0 or 1, ‘M’ or ‘F’, ‘yes’ or ‘no’)
- target1val: the particular value you want to have as a 1 in the target
- mask: boolean vector indicating the training set (~mask is the test set), matching how the function indexes X[mask] and X[~mask] (we’ll use this to test different classifiers on the same train-test splits)
- stratified: list that stores the stratified train and test indices; normally we can get this from list(StratifiedKFold(, n_folds=5))[0]. If it is defined, stratified sampling is also applied inside cv_optimize
- dummies: if True, the categorical features are transformed into dummy variables
- score_func: we’ve used accuracy as the way of scoring algorithms, but this can be more general later on
- n_folds: number of folds for cross validation
- n_jobs: used for parallelization

stratified and mask cannot be applied simultaneously.
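To make the `dummies`, `target1val`, and `mask` arguments concrete, here is a small sketch of what each one does to the data. The toy dataframe and its values are hypothetical, and `dtype=int` is added only so the resulting matrix is numeric on recent pandas versions.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one categorical and one numeric feature.
toy = pd.DataFrame({
    "contbr_st": ["CA", "NV", "CA", "NV"],
    "salary": [50, 60, 70, 80],
    "party": ["D", "R", "D", "D"],
})

# dummies=True: expand categorical columns; the empty prefix keeps the raw category names.
X = pd.get_dummies(toy[["contbr_st", "salary"]], prefix="", prefix_sep="", dtype=int)
print(list(X.columns))

# target1val="D": rows equal to that value become 1, all others 0.
y = (toy["party"].to_numpy() == "D") * 1

# mask: True rows are indexed as the training set, ~mask rows as the test set,
# which mirrors the X[mask], X[~mask] indexing inside do_classify.
mask = np.array([True, True, False, True])
Xtrain, Xtest = X.values[mask], X.values[~mask]
ytrain, ytest = y[mask], y[~mask]
```

Reusing the same `mask` across several classifiers keeps their train and test rows identical, so their accuracies are directly comparable.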
Remarks
- stratified in this model stratifies samples on one specific feature, not on the target. It addresses the problem of a highly imbalanced categorical feature.
- If your target is highly imbalanced, one option is to set class_weight = 'balanced' inside the sklearn classifiers.
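As a rough illustration of the second remark, the sketch below shows the weights that `'balanced'` produces for an imbalanced target and how the option is passed to a classifier. The data are synthetic and `LogisticRegression` is only a stand-in for whichever sklearn classifier you use.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced target: 90 negatives, 10 positives.
y_demo = np.array([0] * 90 + [1] * 10)
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(100, 3)) + y_demo[:, None]  # shift positives so the classes differ

# 'balanced' weights each class by n_samples / (n_classes * class_count).
weights = compute_class_weight(class_weight="balanced", classes=np.array([0, 1]), y=y_demo)
print(dict(zip([0, 1], weights)))

# The same weighting is applied internally when the option is set on the classifier.
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_demo, y_demo)
```

The rare class receives the larger weight, so its misclassifications count more during fitting.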
One Example
Here is an example from one of my projects, which predicts party preference in the 2016 US presidential election from a donation data set. During feature engineering there was one tricky issue to tackle. After I removed all rows with missing data, the number of donors per state varied enormously. For example, California has about 60,000 donors, whereas Nevada has only 10. We still do not want to lose any state when we split the data into a training set and a test set, so stratified sampling on the state column is the natural choice here.
gbm = xgb.XGBClassifier()
# Stratified sampling on the state column first, then make dummy variables.
skf = list(StratifiedKFold(df_clean.contbr_st, n_folds=5, random_state=30))[0]
featurenames = ['contbr_st', 'employer_categorized', 'salary', 'MedianPrice', 'Gender',
                'is_Retired', 'is_Unemployed_NotRetired', 'is_Self_Employed']
targetname = 'party'
param = {"max_depth": [7, 9, 11], "n_estimators": [300, 400, 500], 'learning_rate': [0.05, 0.08, 0.1]}
gbm_fitted, Xtrain, ytrain, Xtest, ytest = do_classify(gbm, df_clean, featurenames, targetname, 1, stratified=skf,
                                                       parameters=param, dummies=True, n_jobs=4, n_folds=4)
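The stratified split in the example above relies on the older `StratifiedKFold(labels, n_folds=...)` signature. As a sketch of the same idea with the current `sklearn.model_selection` API, the snippet below stratifies a toy frame on its state column; the frame and its counts are made up for illustration and are not the donation data.

```python
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Hypothetical toy frame: one large state and one very small one.
toy = pd.DataFrame({
    "contbr_st": ["CA"] * 50 + ["NV"] * 10,
    "salary": range(60),
})

# Stratify on the state column (not on the target), then keep the first fold as the split.
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=30)
train_idx, test_idx = next(splitter.split(toy, toy["contbr_st"]))

train_states = toy.iloc[train_idx]["contbr_st"].value_counts()
test_states = toy.iloc[test_idx]["contbr_st"].value_counts()
print(train_states.to_dict(), test_states.to_dict())
```

Each state contributes about one fifth of its rows to the test fold, so even the small state appears on both sides of the split.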