Introduction
The goal here is to use a random forest to identify the main features of a dataset.
1. Theory
Random forests are a highly flexible machine learning method with broad application prospects, from marketing to healthcare and insurance. They can be used to model marketing simulations and to track customer acquisition, retention, and churn, and also to predict disease risk and patient susceptibility.
Depending on how the individual learners are generated, current ensemble methods fall roughly into two categories: sequential methods, where strong dependencies exist between the individual learners and they must be generated one after another, and parallel methods, where no strong dependencies exist and the learners can be generated simultaneously. Boosting is representative of the former; Bagging and Random Forest are representative of the latter.
A random forest builds a Bagging ensemble with decision trees as the base learners, and additionally introduces random attribute selection (i.e., random feature selection) into the training of each tree.
Put simply, a random forest is an ensemble of decision trees, but with two differences:
(1) Differences in sample selection: each tree is trained on a bootstrap sample drawn at random (with replacement) from the training set.
(2) Differences in feature selection: the n candidate split features of each tree are chosen at random from the full feature set (n is a parameter we have to tune ourselves).
As a simple intuition: to predict salary, say, we build several decision trees over features such as job, age, and house, then feed the features of the instance we want to predict (teacher, 39, suburb) through each tree to get that tree's probability for the target value (salary < 5000 vs. salary >= 5000), and combine the trees' answers into the predicted probability (e.g., P(salary < 5000) = 0.3).
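To make that concrete, here is a minimal sketch of forest voting; the data and the salary/feature encoding below are invented for illustration, not the author's example:

# Toy illustration of forest voting (hypothetical data)
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 3)                       # three made-up numeric features (think job, age, house, encoded)
y = (X[:, 0] + X[:, 1] > 1.0).astype(int)  # pretend 1 means salary >= 5000 and 0 means salary < 5000

forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)
sample = np.array([[0.2, 0.5, 0.8]])
# Each tree votes; predict_proba averages the votes into class probabilities,
# e.g. something like [[0.7, 0.3]] means P(salary < 5000) = 0.7 for this sample
print(forest.predict_proba(sample))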
A random forest can do both regression and classification. It handles large datasets well, and it helps estimate which variables are most important to the underlying data being modeled.
Parameter notes:
The two most important parameters are n_estimators and max_features.
n_estimators: the number of trees in the forest. In theory, more is better, but computation time grows accordingly, and in practice the best predictions are reached at a reasonable number of trees rather than the largest possible one.
max_features: the size of the random feature subset considered when splitting a node. The smaller the subset, the faster the variance decreases, but the faster the bias increases. A good rule of thumb: use max_features=n_features for regression problems and max_features=sqrt(n_features) for classification problems.
For good results you typically want max_depth=None together with min_samples_split=2.
Also remember to cross-validate. Note that random forests use bootstrap=True, while extra-trees use bootstrap=False.
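As a sketch of putting these recommendations into practice (the grid values below are illustrative assumptions, not tuned settings), a cross-validated search over the two key parameters might look like this:

# Hypothetical tuning sketch: cross-validate n_estimators and max_features
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

iris = load_iris()
param_grid = {
    'n_estimators': [10, 50, 100],   # more trees help, but cost more time
    'max_features': ['sqrt', None],  # sqrt(n_features) is the usual classification default; None means all features
}
search = GridSearchCV(RandomForestClassifier(max_depth=None, min_samples_split=2,
                                             bootstrap=True, random_state=0),
                      param_grid, cv=5)
search.fit(iris.data, iris.target)
print(search.best_params_, search.best_score_)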
2. Random forest in Python
2.1 Demo1
Basic random forest functionality.
# Random forest demo
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris

iris = load_iris()
# The 4 iris attributes are sepal length, sepal width, petal length, petal width;
# the label is the species: setosa, versicolour, virginica
print(iris['target'].shape)
rf = RandomForestRegressor()  # default parameter settings
rf.fit(iris.data[:150], iris.target[:150])  # train the model
# pick two samples to predict
instance = iris.data[[100, 109]]
print(instance)
print('instance 0 prediction;', rf.predict(instance[[0]]))
print('instance 1 prediction;', rf.predict(instance[[1]]))
print(iris.target[100], iris.target[109])
Output:
(150,)
[[ 6.3 3.3 6. 2.5]
[ 7.2 3.6 6.1 2.5]]
instance 0 prediction; [ 2.]
instance 1 prediction; [ 2.]
2 2
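One caveat: iris is really a classification dataset, so a RandomForestClassifier is arguably a more natural fit than the regressor above. A minimal variant of the same demo (same two samples):

# Same demo with a classifier: predictions are class labels plus class probabilities
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

iris = load_iris()
clf = RandomForestClassifier(random_state=0)
clf.fit(iris.data, iris.target)
instance = iris.data[[100, 109]]
print(clf.predict(instance))        # predicted class labels
print(clf.predict_proba(instance))  # per-class probabilities from the trees' votes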
2.2 Demo2
A comparison of three methods.
# random forest test
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)

clf = DecisionTreeClassifier(max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

clf = RandomForestClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())

clf = ExtraTreesClassifier(n_estimators=10, max_depth=None, min_samples_split=2, random_state=0)
scores = cross_val_score(clf, X, y)
print(scores.mean())
Output:
0.979408793821
0.999607843137
0.999898989899
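Both ensembles clearly beat the single tree here. A related point: because bootstrap=True gives every tree a different subsample, a random forest can also estimate its own generalization accuracy from the rows each tree never saw. A sketch on the same synthetic data (oob_score=True is the only addition; the tree count is raised so every row gets out-of-bag votes):

# Out-of-bag estimate sketch on the same make_blobs data
from sklearn.datasets import make_blobs
from sklearn.ensemble import RandomForestClassifier

X, y = make_blobs(n_samples=10000, n_features=10, centers=100, random_state=0)
clf = RandomForestClassifier(n_estimators=100, oob_score=True, bootstrap=True, random_state=0)
clf.fit(X, y)
print(clf.oob_score_)  # accuracy estimated on out-of-bag samples, no separate test set needed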
2.3 Demo3: feature selection
# Random forest 2: feature selection
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score, ShuffleSplit

iris = load_iris()
X = iris["data"]
Y = iris["target"]
names = iris["feature_names"]
rf = RandomForestRegressor()
scores = []
for i in range(X.shape[1]):
    # score each feature on its own by cross-validated R^2
    score = cross_val_score(rf, X[:, i:i + 1], Y, scoring="r2",
                            cv=ShuffleSplit(n_splits=3, test_size=0.3))
    scores.append((round(np.mean(score), 3), names[i]))
print(sorted(scores, reverse=True))
Output:
[(0.89300000000000002, 'petal width (cm)'), (0.82099999999999995, 'petal length (cm)'), (0.13, 'sepal length (cm)'), (-0.79100000000000004, 'sepal width (cm)')]
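For comparison, a fitted forest also exposes impurity-based importances directly via feature_importances_; a short sketch that should produce a similar ranking (petal features first):

# Built-in importances as a cross-check of the per-feature scores above
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestRegressor

iris = load_iris()
rf = RandomForestRegressor(random_state=0)
rf.fit(iris.data, iris.target)
for name, imp in sorted(zip(iris.feature_names, rf.feature_importances_),
                        key=lambda t: t[1], reverse=True):
    print(name, round(imp, 3))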
2.4 Demo4: a random forest from scratch
I originally wanted to build the random forest with the following code, but the program kept running without responding and still needs debugging (most likely because get_best_spilt tries every row of a node as a candidate split point, which is quadratic in the number of samples).
# Random forest 4
# coding: utf-8
import csv
from random import seed
from random import randrange
from math import sqrt


def loadCSV(filename):
    # Load the data, storing each row in a list
    dataSet = []
    with open(filename, 'r') as file:
        csvReader = csv.reader(file)
        for line in csvReader:
            dataSet.append(line)
    return dataSet


# Convert every column except the label column to float
def column_to_float(dataSet):
    featLen = len(dataSet[0]) - 1
    for data in dataSet:
        for column in range(featLen):
            data[column] = float(data[column].strip())


# Randomly split the dataset into n folds for cross-validation:
# one fold is the test set, the other folds are the training set
def spiltDataSet(dataSet, n_folds):
    fold_size = int(len(dataSet) / n_folds)
    dataSet_copy = list(dataSet)
    dataSet_spilt = []
    for i in range(n_folds):
        fold = []
        while len(fold) < fold_size:  # while (not if) keeps looping until the fold is full; an if would only run once
            index = randrange(len(dataSet_copy))
            fold.append(dataSet_copy.pop(index))  # pop() removes an element (default: the last) and returns it
        dataSet_spilt.append(fold)
    return dataSet_spilt


# Build a data subsample
def get_subsample(dataSet, ratio):
    subdataSet = []
    lenSubdata = round(len(dataSet) * ratio)  # round() returns a float here
    while len(subdataSet) < lenSubdata:
        index = randrange(len(dataSet))  # sample with replacement
        subdataSet.append(dataSet[index])
    return subdataSet


# Split the dataset on a feature value
def data_spilt(dataSet, index, value):
    left = []
    right = []
    for row in dataSet:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Compute the split cost (Gini impurity)
def spilt_loss(left, right, class_values):
    loss = 0.0
    for class_value in class_values:
        left_size = len(left)
        if left_size != 0:  # guard against division by zero
            prop = [row[-1] for row in left].count(class_value) / float(left_size)
            loss += (prop * (1.0 - prop))
        right_size = len(right)
        if right_size != 0:
            prop = [row[-1] for row in right].count(class_value) / float(right_size)
            loss += (prop * (1.0 - prop))
    return loss


# Randomly pick n features and choose the best split among them
def get_best_spilt(dataSet, n_features):
    features = []
    class_values = list(set(row[-1] for row in dataSet))
    b_index, b_value, b_loss, b_left, b_right = 999, 999, 999, None, None
    while len(features) < n_features:
        index = randrange(len(dataSet[0]) - 1)
        if index not in features:
            features.append(index)
    for index in features:  # find the column index best suited to be the node (lowest loss)
        for row in dataSet:
            left, right = data_spilt(dataSet, index, row[index])  # left and right branches if this value is the node
            loss = spilt_loss(left, right, class_values)
            if loss < b_loss:  # keep the split with the smallest cost
                b_index, b_value, b_loss, b_left, b_right = index, row[index], loss, left, right
    return {'index': b_index, 'value': b_value, 'left': b_left, 'right': b_right}


# Decide the output label of a leaf (majority vote)
def decide_label(data):
    output = [row[-1] for row in data]
    return max(set(output), key=output.count)


# Sub-split: recursively build out the leaf nodes
def sub_spilt(root, n_features, max_depth, min_size, depth):
    left = root['left']
    right = root['right']
    del(root['left'])
    del(root['right'])
    if not left or not right:
        root['left'] = root['right'] = decide_label(left + right)
        return
    if depth > max_depth:
        root['left'] = decide_label(left)
        root['right'] = decide_label(right)
        return
    if len(left) < min_size:
        root['left'] = decide_label(left)
    else:
        root['left'] = get_best_spilt(left, n_features)
        sub_spilt(root['left'], n_features, max_depth, min_size, depth + 1)
    if len(right) < min_size:
        root['right'] = decide_label(right)
    else:
        root['right'] = get_best_spilt(right, n_features)
        sub_spilt(root['right'], n_features, max_depth, min_size, depth + 1)


# Build a decision tree
def build_tree(dataSet, n_features, max_depth, min_size):
    root = get_best_spilt(dataSet, n_features)
    sub_spilt(root, n_features, max_depth, min_size, 1)
    return root


# Predict the result for a test row
def predict(tree, row):
    if row[tree['index']] < tree['value']:
        if isinstance(tree['left'], dict):
            return predict(tree['left'], row)
        else:
            return tree['left']
    else:
        if isinstance(tree['right'], dict):
            return predict(tree['right'], row)
        else:
            return tree['right']


# Aggregate the trees' votes (bagging)
def bagging_predict(trees, row):
    predictions = [predict(tree, row) for tree in trees]
    return max(set(predictions), key=predictions.count)


# Create the random forest
def random_forest(train, test, ratio, n_features, max_depth, min_size, n_trees):
    trees = []
    for i in range(n_trees):
        subTrain = get_subsample(train, ratio)  # draw a subsample of the training folds for this tree
        tree = build_tree(subTrain, n_features, max_depth, min_size)
        trees.append(tree)
    predict_values = [bagging_predict(trees, row) for row in test]
    return predict_values


# Compute accuracy
def accuracy(predict_values, actual):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predict_values[i]:
            correct += 1
    return correct / float(len(actual))


if __name__ == '__main__':
    seed(1)
    dataSet = loadCSV(r'G:\0研究生\tianchiCompetition\训练小样本2.csv')
    column_to_float(dataSet)
    n_folds = 5
    max_depth = 15
    min_size = 1
    ratio = 1.0
    # n_features = sqrt(len(dataSet) - 1)
    n_features = 15
    n_trees = 10
    folds = spiltDataSet(dataSet, n_folds)  # split the dataset into folds first
    scores = []
    for fold in folds:
        train_set = folds[:]  # train_set = folds would only copy the reference, so changing train_set would also change folds; L[:] copies a sequence (D.copy() copies a dict, list(L) also makes a copy)
        train_set.remove(fold)  # select the training folds
        train_set = sum(train_set, [])  # flatten the fold lists into a single train_set list
        test_set = []
        for row in fold:
            row_copy = list(row)
            row_copy[-1] = None
            test_set.append(row_copy)
        actual = [row[-1] for row in fold]
        predict_values = random_forest(train_set, test_set, ratio, n_features, max_depth, min_size, n_trees)
        accur = accuracy(predict_values, actual)
        scores.append(accur)
    print('Trees is %d' % n_trees)
    print('scores:%s' % scores)
    print('mean score:%s' % (sum(scores) / float(len(scores))))
2.5 Classifying the sonar data (CART)
# CART on the sonar dataset
from random import seed
from random import randrange
from csv import reader


# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset


# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())


# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()
    dataset_copy = list(dataset)
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))
        dataset_split.append(fold)
    return dataset_split


# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0


# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores


# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right


# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini


# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    print({'index': b_index, 'value': b_value})
    return {'index': b_index, 'value': b_value, 'groups': b_groups}


# Create a terminal node value
def to_terminal(group):
    outcomes = [row[-1] for row in group]
    return max(set(outcomes), key=outcomes.count)


# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)


# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root


# Make a prediction with a decision tree
def predict(node, row):
    if row[node['index']] < node['value']:
        if isinstance(node['left'], dict):
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']


# Classification and Regression Tree Algorithm
def decision_tree(train, test, max_depth, min_size):
    tree = build_tree(train, max_depth, min_size)
    predictions = list()
    for row in test:
        prediction = predict(tree, row)
        predictions.append(prediction)
    return predictions


# Test CART on the sonar dataset
seed(1)
# load and prepare data
filename = r'G:\0pythonstudy\决策树\sonar.all-data.csv'
dataset = load_csv(filename)
# convert string attributes to floats
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
# evaluate algorithm
n_folds = 5
max_depth = 5
min_size = 10
scores = evaluate_algorithm(dataset, decision_tree, n_folds, max_depth, min_size)
print('Scores: %s' % scores)
print('Mean Accuracy: %.3f%%' % (sum(scores) / float(len(scores))))
Output:
{'index': 38, 'value': 0.0894}
{'index': 36, 'value': 0.8459}
{'index': 50, 'value': 0.0024}
{'index': 15, 'value': 0.0906}
{'index': 16, 'value': 0.9819}
{'index': 10, 'value': 0.0785}
{'index': 16, 'value': 0.0886}
{'index': 38, 'value': 0.0621}
{'index': 5, 'value': 0.0226}
{'index': 8, 'value': 0.0368}
{'index': 11, 'value': 0.0754}
{'index': 0, 'value': 0.0239}
{'index': 8, 'value': 0.0368}
{'index': 29, 'value': 0.1671}
{'index': 46, 'value': 0.0237}
{'index': 38, 'value': 0.0621}
{'index': 14, 'value': 0.0668}
{'index': 4, 'value': 0.0167}
{'index': 37, 'value': 0.0836}
{'index': 12, 'value': 0.0616}
{'index': 7, 'value': 0.0333}
{'index': 33, 'value': 0.8741}
{'index': 16, 'value': 0.0886}
{'index': 8, 'value': 0.0368}
{'index': 33, 'value': 0.0798}
{'index': 44, 'value': 0.0298}
Scores: [48.78048780487805, 70.73170731707317, 58.536585365853654, 51.21951219512195, 39.02439024390244]
Mean Accuracy: 53.659%
Key points:
1. Load a CSV file
from csv import reader

# Load a CSV file
def load_csv(filename):
    file = open(filename, "r")
    lines = reader(file)
    dataset = list(lines)
    return dataset

filename = r'G:\0pythonstudy\决策树\sonar.all-data.csv'
dataset = load_csv(filename)
print(dataset)
2. Convert the data to float format
# Convert string column to float
def str_column_to_float(dataset, column):
    for row in dataset:
        row[column] = float(row[column].strip())

# convert string attributes to floats
for i in range(len(dataset[0]) - 1):
    str_column_to_float(dataset, i)
3. Convert the class strings in the last column to the integers 0 and 1
def str_column_to_int(dataset, column):
    class_values = [row[column] for row in dataset]  # build a list of the class labels
    unique = set(class_values)  # set() keeps the distinct elements of the list
    print(unique)
    lookup = dict()  # a dict mapping each label to an integer
    for i, value in enumerate(unique):
        lookup[value] = i
    for row in dataset:
        row[column] = lookup[row[column]]
    print(lookup['M'])
4. Split the dataset into K folds
# Split a dataset into k folds
def cross_validation_split(dataset, n_folds):
    dataset_split = list()  # start with an empty list
    dataset_copy = list(dataset)
    print(len(dataset_copy))
    print(len(dataset))
    fold_size = int(len(dataset) / n_folds)
    for i in range(n_folds):
        fold = list()
        while len(fold) < fold_size:
            index = randrange(len(dataset_copy))
            fold.append(dataset_copy.pop(index))  # .pop() removes (effectively moves) each element, so the k folds never overlap
        dataset_split.append(fold)
    return dataset_split

n_folds = 5
folds = cross_validation_split(dataset, n_folds)  # k disjoint folds for training
5. Compute the accuracy
# Calculate accuracy percentage
def accuracy_metric(actual, predicted):
    correct = 0
    for i in range(len(actual)):
        if actual[i] == predicted[i]:
            correct += 1
    return correct / float(len(actual)) * 100.0  # the accuracy expression for binary classification
6. Binary split on a column
# Split a dataset based on an attribute and an attribute value
def test_split(index, value, dataset):
    left, right = list(), list()  # initialize two empty lists
    for row in dataset:
        if row[index] < value:
            left.append(row)
        else:
            right.append(row)
    return left, right  # two lists that partition the rows on column `index`, with `value` as the boundary
7. Use the Gini index to find the best split point
# Calculate the Gini index for a split dataset
def gini_index(groups, class_values):
    gini = 0.0
    for class_value in class_values:
        for group in groups:
            size = len(group)
            if size == 0:
                continue
            proportion = [row[-1] for row in group].count(class_value) / float(size)
            gini += (proportion * (1.0 - proportion))
    return gini

# Select the best split point for a dataset
def get_split(dataset):
    class_values = list(set(row[-1] for row in dataset))
    b_index, b_value, b_score, b_groups = 999, 999, 999, None
    for index in range(len(dataset[0]) - 1):
        for row in dataset:
            groups = test_split(index, row[index], dataset)
            gini = gini_index(groups, class_values)
            if gini < b_score:
                b_index, b_value, b_score, b_groups = index, row[index], gini, groups
    print({'index': b_index, 'value': b_value, 'score': b_score})
    return {'index': b_index, 'value': b_value, 'groups': b_groups}
This code computes the Gini index directly from its definition, which is easy to follow. Finding the best split point is harder to read: there are two nested loops, one over the columns and one over the rows, and on each iteration the best (lowest) Gini value found so far is updated.
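A tiny hand-worked check of the definition, using the gini_index function above on an invented split of five rows (classes 0 and 1):

# Invented split: left holds classes [0, 0, 1], right holds [1, 1]
left = [[0], [0], [1]]  # each row's last element is its class label
right = [[1], [1]]
# left contributes 2/3*(1/3) + 1/3*(2/3) = 4/9; right is pure and contributes 0
print(gini_index([left, right], [0, 1]))  # about 0.444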
8. Growing the decision tree
# Create child splits for a node or make terminal
def split(node, max_depth, min_size, depth):
    left, right = node['groups']
    del(node['groups'])
    # check for a no split
    if not left or not right:
        node['left'] = node['right'] = to_terminal(left + right)
        return
    # check for max depth
    if depth >= max_depth:
        node['left'], node['right'] = to_terminal(left), to_terminal(right)
        return
    # process left child
    if len(left) <= min_size:
        node['left'] = to_terminal(left)
    else:
        node['left'] = get_split(left)
        split(node['left'], max_depth, min_size, depth + 1)
    # process right child
    if len(right) <= min_size:
        node['right'] = to_terminal(right)
    else:
        node['right'] = get_split(right)
        split(node['right'], max_depth, min_size, depth + 1)
This uses recursion to keep generating left and right subtrees.
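The recursion bottoms out in a nested dict, which is the tree representation that the predict function in point 10 below walks. A hand-built example of the structure (all numbers invented):

# A two-level tree in the same dict format (invented numbers)
tree = {'index': 0, 'value': 0.5,
        'left': 0,                           # leaf: predict class 0
        'right': {'index': 1, 'value': 0.3,  # internal node testing feature 1
                  'left': 1,                 # leaf: predict class 1
                  'right': 0}}               # leaf: predict class 0
# For a row like [0.9, 0.1]: 0.9 >= 0.5 -> go right; 0.1 < 0.3 -> left leaf -> class 1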
9. Building the decision tree
# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)
    split(root, max_depth, min_size, 1)
    return root

tree = build_tree(train_set, max_depth, min_size)
print(tree)
10. Predicting the test set
# Build a decision tree
def build_tree(train, max_depth, min_size):
    root = get_split(train)  # get the best split point: index, value, groups
    split(root, max_depth, min_size, 1)
    return root

# Make a prediction with a decision tree
def predict(node, row):
    print(row[node['index']])
    print(node['value'])
    if row[node['index']] < node['value']:  # feed the test row into the trained split point and descend the left or right subtree to compare further
        if isinstance(node['left'], dict):  # a dict means an internal node, so recurse
            return predict(node['left'], row)
        else:
            return node['left']
    else:
        if isinstance(node['right'], dict):
            return predict(node['right'], row)
        else:
            return node['right']

tree = build_tree(train_set, max_depth, min_size)
predictions = list()
for row in test_set:
    prediction = predict(tree, row)
    predictions.append(prediction)
11. Evaluating the decision tree
# Evaluate an algorithm using a cross validation split
def evaluate_algorithm(dataset, algorithm, n_folds, *args):
    folds = cross_validation_split(dataset, n_folds)
    scores = list()
    for fold in folds:
        train_set = list(folds)
        train_set.remove(fold)
        train_set = sum(train_set, [])
        test_set = list()
        for row in fold:
            row_copy = list(row)
            test_set.append(row_copy)
            row_copy[-1] = None
        predicted = algorithm(train_set, test_set, *args)
        actual = [row[-1] for row in fold]
        accuracy = accuracy_metric(actual, predicted)
        scores.append(accuracy)
    return scores
That's all for this article. I hope it helps with your learning, and thanks for your continued support of 服务器之家.
Original link: http://blog.csdn.net/rosefun96/article/details/78833477