From Decision Trees to Random Forests: Principles and Implementation of Tree-Based Algorithms

from sklearn.preprocessing import LabelEncoder
from sklearn_pandas import DataFrameMapper

# Label-encode each categorical column; default=None passes the remaining columns through unchanged.
mapper = DataFrameMapper([
    ('AgeGroup', LabelEncoder()),
    ('Education', LabelEncoder()),
    ('Workclass', LabelEncoder()),
    ('MaritalStatus', LabelEncoder()),
    ('Occupation', LabelEncoder()),
    ('Relationship', LabelEncoder()),
    ('Race', LabelEncoder()),
    ('Sex', LabelEncoder()),
    ('Income', LabelEncoder()),
], df_out=True, default=None)

# Rebuild the column order so it matches the mapper's output.
cols = list(df_train_set.columns)
cols.remove("Income")
cols = cols[:-3] + ["Income"] + cols[-3:]

df_train = mapper.fit_transform(df_train_set.copy())
df_train.columns = cols
df_test = mapper.transform(df_test_set.copy())
df_test.columns = cols

# Separate the features from the target column.
cols.remove("Income")
x_train, y_train = df_train[cols].values, df_train["Income"].values
x_test, y_test = df_test[cols].values, df_test["Income"].values

Now that the training and test data are in the right form, we can build our first model!

from sklearn.tree import DecisionTreeClassifier

treeClassifier = DecisionTreeClassifier()
treeClassifier.fit(x_train, y_train)
treeClassifier.score(x_test, y_test)

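Even this simplest classifier, with no optimization at all, reaches 83.5% accuracy. For classification problems, the confusion matrix is a good way to assess a model's accuracy. With the following code we can plot the confusion matrix of any tree-based model.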

import itertools

import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import confusion_matrix

def plot_confusion_matrix(cm, classes, normalize=False):
    """
    This function plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    cmap = plt.cm.Blues
    title = "Confusion Matrix"
    if normalize:
        # Divide each row by its total so every row sums to 1.
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        cm = np.around(cm, decimals=3)
    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=45)
    plt.yticks(tick_marks, classes)
    # Use white text on dark cells and black text on light cells.
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, cm[i, j],
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")
    plt.tight_layout()
    plt.ylabel('True label')
    plt.xlabel('Predicted label')

Now we can look at the confusion matrix of our first model:

# Plot the feature importances of rclf, a fitted tree-based model that is not defined in this section.
importances = rclf.feature_importances_
indices = np.argsort(importances)
cols = [cols[x] for x in indices]
plt.figure(figsize=(10,6))
plt.title('Feature Importances')
plt.barh(range(len(indices)), importances[indices], color='b', align='center')
plt.yticks(range(len(indices)), cols)
plt.xlabel('Relative Importance')

y_pred = treeClassifier.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

We find that accuracy on the majority class (<=50K) is 90.5%, while accuracy on the minority class (>50K) is only 60.8%.
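The per-class figures are read off the diagonal of the row-normalized confusion matrix, which is what normalize=True produces in the plotting function. The sketch below illustrates that arithmetic in plain Python. The counts are invented purely for illustration and are not the article's results; normalize_rows and overall_accuracy are hypothetical helper names.

```python
# Hypothetical counts for illustration only; these are NOT the article's results.
# Rows are true labels, columns are predicted labels, in the order [<=50K, >50K].
counts = [
    [900, 100],  # true <=50K: 900 predicted correctly, 100 predicted as >50K
    [150, 350],  # true >50K: 150 predicted as <=50K, 350 predicted correctly
]

def normalize_rows(matrix):
    """Divide every entry by its row total so that each row sums to 1."""
    return [[value / sum(row) for value in row] for row in matrix]

def overall_accuracy(matrix):
    """Fraction of all samples that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total

normalized = normalize_rows(counts)
# The diagonal of the normalized matrix is the accuracy within each true class.
per_class = [normalized[i][i] for i in range(len(normalized))]

print("per-class accuracy:", per_class)
print("overall accuracy:", overall_accuracy(counts))
```

Because the majority class contributes more samples, the overall accuracy sits closer to the majority-class figure, which is why a single accuracy number can hide weak performance on the minority class.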

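Let's look at how to tune this simple classifier. We can use GridSearchCV() with 5-fold cross-validation to tune the most important parameters of the tree classifier.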

from sklearn.model_selection import GridSearchCV
parameters = {'max_features':(None, 9, 6),'max_depth':(None, 24, 16),'min_samples_split': (2, 4, 8),'min_samples_leaf': (16, 4, 12)}

clf = GridSearchCV(treeClassifier, parameters, cv=5, n_jobs=4)
clf.fit(x_train, y_train)

(0.85934092933263717,
 0.85897672133161351,
 {'max_depth': 16,
  'max_features': 9,
  'min_samples_leaf': 16,
  'min_samples_split': 8})

After tuning, we find that accuracy has risen to 85.9%. The output above also shows the parameters of the best model. Now let's look at the confusion matrix of the tuned model:

y_pred = clf.predict(x_test)
cfm = confusion_matrix(y_test, y_pred, labels=[0, 1])
plt.figure(figsize=(10,6))
plot_confusion_matrix(cfm, classes=["<=50K", ">50K"], normalize=True)

After tuning, we find that prediction accuracy has improved for both classes.
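To get a feel for what this grid search costs, the following sketch enumerates the same parameter grid in plain Python and counts the model fits implied by 5-fold cross-validation. It only mimics how an exhaustive grid is expanded; it does not call scikit-learn, and the variable names candidates and n_fits are chosen here for illustration.

```python
from itertools import product

# The same grid that is passed to GridSearchCV in the article.
parameters = {
    'max_features': (None, 9, 6),
    'max_depth': (None, 24, 16),
    'min_samples_split': (2, 4, 8),
    'min_samples_leaf': (16, 4, 12),
}
cv_folds = 5

# An exhaustive grid search tries every combination of the listed values.
names = list(parameters)
candidates = [dict(zip(names, values)) for values in product(*parameters.values())]

# Each candidate is trained once per cross-validation fold.
n_fits = len(candidates) * cv_folds

print(len(candidates), "candidates,", n_fits, "cross-validation fits")
print("first candidate:", candidates[0])
```

With its default refit=True setting, GridSearchCV additionally refits the best candidate on the whole training set. The count grows multiplicatively with every extra parameter value, which is a reason to keep each list short.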

Limitations of Decision Trees

Decision trees have many advantages, for example:

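They are easy to understand and easy to interpret.
They can be visualized.
They require little data preparation. Note, however, that the sklearn.tree module does not support missing values.
The cost of using a decision tree (predicting data) is logarithmic in the number of data points used to train the tree.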
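The logarithmic prediction cost mentioned in the last item can be illustrated with a toy example. The sketch below assumes a perfectly balanced tree over one numeric feature, built by repeatedly splitting at the median; real trees are not guaranteed to be balanced, so this shows the idea rather than a property of every fitted tree. build_tree and count_comparisons are hypothetical helpers written for this illustration.

```python
import math

def build_tree(sorted_values):
    """Build a balanced 1-D tree by splitting at the median (illustrative only)."""
    if len(sorted_values) == 1:
        return {"leaf": sorted_values[0]}
    mid = len(sorted_values) // 2
    return {
        "threshold": sorted_values[mid],
        "left": build_tree(sorted_values[:mid]),
        "right": build_tree(sorted_values[mid:]),
    }

def count_comparisons(tree, x):
    """Walk from the root to a leaf and count the threshold tests made."""
    steps = 0
    while "leaf" not in tree:
        tree = tree["left"] if x < tree["threshold"] else tree["right"]
        steps += 1
    return steps

for n in (16, 1024, 65536):
    tree = build_tree(list(range(n)))
    print(n, "training points:", count_comparisons(tree, n // 3),
          "comparisons, log2(n) =", math.log2(n))
```

In this balanced setting, multiplying the training set size by 4096 (from 16 to 65536 points) only raises the number of comparisons per prediction from 4 to 16.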

However, these models are often not used directly. Some common drawbacks of decision trees are:

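The trees that are built can be overly complex and fail to generalize well from the data.
Small changes in the data can produce a completely different tree, so decision trees can be unstable.
In practice, decision tree learning algorithms are usually based on heuristics such as the greedy algorithm, which makes a locally optimal decision at each node. Such algorithms cannot guarantee that they return the globally optimal decision tree.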
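The last point can be made concrete with a small sketch of greedy split selection. It assumes Gini impurity as the split criterion (the default criterion of scikit-learn's DecisionTreeClassifier) and uses a hand-written best_split helper, which is a hypothetical name and not library code. On XOR-style data, no single split reduces the impurity, so a purely greedy learner sees no useful first step even though a two-level tree separates the classes perfectly.

```python
def gini(labels):
    """Gini impurity of a list of 0/1 labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)

def best_split(points, labels):
    """Return (gain, feature, threshold) of the best single axis-aligned split."""
    parent = gini(labels)
    best = (0.0, None, None)
    for feature in range(len(points[0])):
        values = sorted({p[feature] for p in points})
        # Candidate thresholds: midpoints between consecutive distinct values.
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [y for p, y in zip(points, labels) if p[feature] <= threshold]
            right = [y for p, y in zip(points, labels) if p[feature] > threshold]
            weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
            gain = parent - weighted
            if gain > best[0]:
                best = (gain, feature, threshold)
    return best

points = [(0, 0), (0, 1), (1, 0), (1, 1)]

# Labels that depend on the first feature only: one split removes all impurity.
print(best_split(points, [0, 0, 1, 1]))

# XOR labels: every single split leaves the impurity unchanged (gain 0).
print(best_split(points, [0, 1, 1, 0]))
```

Splitting on either feature anyway and then splitting each half on the other feature would classify the XOR data perfectly, but a learner that only looks one step ahead has no signal telling it to do so.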

Address: http://www.17bianji.com/lsqh/36558.html
