
Summary of RF: The Random Forests algorithm is a classifier based primarily on two methods: bagging and the random subspace method. The study of error estimates for bagged classifiers in Breiman [1996b] gives empirical evidence that the out-of-bag estimate is as accurate as using a test set of the same

There are n such subsets (one for each data record in the original dataset T).

If you want to classify some input data D = {x1, x2, ..., xM}, you pass it through each tree and produce S outputs (one for each tree) which can

The random forests technique involves sampling the input data with replacement (bootstrap sampling), which is the first step of bagging. So for each bootstrap dataset Ti you create a tree Ki.

This is called bootstrapping (en.wikipedia.org/wiki/Bootstrapping_(statistics)). Bagging is the process of taking bootstrap samples and then aggregating the models learned on each of them.
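A minimal sketch of drawing one bootstrap dataset Ti from T, assuming T is a plain Python list. The helper name `bootstrap_sample` and the toy records are hypothetical and used only for illustration.

```python
import random

def bootstrap_sample(records, rng):
    """Draw len(records) records uniformly at random, with replacement."""
    n = len(records)
    indices = [rng.randrange(n) for _ in range(n)]
    return indices, [records[i] for i in indices]

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
T = ["r0", "r1", "r2", "r3", "r4", "r5"]  # hypothetical toy dataset
indices, Ti = bootstrap_sample(T, rng)

# Records of T that were never drawn are "out of bag" for this Ti.
out_of_bag = sorted(set(range(len(T))) - set(indices))
```

Ti always has the same size as T, but it may repeat some records and leave others out.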

Now, RF creates S trees and uses m random features (m = sqrt(M) or m = floor(ln M + 1)) out of the M possible features to create any tree.
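The two rules for m can be sketched as follows. The function names are hypothetical, and rounding sqrt(M) down to an integer is an assumption, since the text does not say how a non-integer square root is handled. Following the text, the feature subset here is drawn once per tree; many implementations redraw it at every split instead.

```python
import math
import random

def subspace_size(M, rule="sqrt"):
    """Number of features m to use out of M, per the two rules in the text."""
    if rule == "sqrt":
        return max(1, math.isqrt(M))  # assumption: round sqrt(M) down
    if rule == "log":
        return math.floor(math.log(M) + 1)
    raise ValueError("rule must be 'sqrt' or 'log'")

def random_subspace(M, m, rng):
    """Pick m distinct feature indices out of range(M)."""
    return sorted(rng.sample(range(M), m))

M = 16
m = subspace_size(M, "sqrt")
features = random_subspace(M, m, random.Random(0))
```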

The error estimated on these out-of-bag samples is the out-of-bag (OOB) error. The OOB error is the mean prediction error on each training sample xᵢ, using only the trees that did not have xᵢ in their bootstrap sample.[1] Subsampling allows one to define an out-of-bag

xiM} yi is the label (or output or class).

This is called the random subspace method. Each of these is called a bootstrap dataset. The final prediction is a majority vote on this set.
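The majority vote over the S tree outputs might look like this. It is a sketch only: the function name and the example labels are hypothetical, and ties are broken in favour of the label seen first, which is an arbitrary choice.

```python
from collections import Counter

def majority_vote(tree_outputs):
    """Return the label predicted by the largest number of trees."""
    return Counter(tree_outputs).most_common(1)[0][0]

# One output per tree, here for S = 5 hypothetical trees.
outputs = ["a", "b", "a", "a", "b"]
prediction = majority_vote(outputs)
```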


Because the sampling is with replacement, every dataset Ti can contain duplicate data records and can be missing several data records from the original dataset.

Out Of Bag Score

Suppose we decide to have S trees in our forest. We first create S datasets, each the same size as the original, by random resampling of the data in T.
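How many records a bootstrap dataset is expected to miss can be worked out directly. A given record is skipped by one draw with probability 1 - 1/n, so it is absent from all n draws with probability (1 - 1/n)^n, which approaches 1/e (about 37%) as n grows. The sketch below checks this; the function name is hypothetical.

```python
import math

def expected_missing_fraction(n):
    """Probability that a given record is absent from a bootstrap of size n."""
    return (1 - 1 / n) ** n

small = expected_missing_fraction(10)
large = expected_missing_fraction(1000)
limit = math.exp(-1)  # limiting value as n grows
```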

As the forest is built, each tree can thus be tested (similar to leave-one-out cross-validation) on the samples not used in building that tree.

Note that this subset is a set of bootstrap datasets that do not contain a particular record from the original dataset. The OOB classifier is the aggregation of votes ONLY over those Tk that do not contain (xi, yi).
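The OOB procedure described above can be sketched end to end. To keep the example self-contained, `fit_stub` is a hypothetical stand-in for growing a real tree (a 1-nearest-neighbour rule on a single feature), and the toy dataset is invented for illustration. Only the bookkeeping, voting over the models whose bootstrap dataset lacks (xi, yi), is the point.

```python
import random
from collections import Counter

def fit_stub(sample):
    """Placeholder for tree Ki: 1-nearest-neighbour on one numeric feature."""
    def predict(x):
        return min(sample, key=lambda rec: abs(rec[0] - x))[1]
    return predict

def oob_error(T, S, rng):
    n = len(T)
    models = []
    for _ in range(S):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap dataset Tk
        models.append((set(idx), fit_stub([T[i] for i in idx])))

    wrong = counted = 0
    for i, (x, y) in enumerate(T):
        # Vote ONLY with models whose bootstrap did not contain record i.
        votes = [predict(x) for in_bag, predict in models if i not in in_bag]
        if not votes:
            continue  # record was in every bootstrap, so it has no OOB vote
        counted += 1
        if Counter(votes).most_common(1)[0][0] != y:
            wrong += 1
    return wrong / counted if counted else float("nan")

# Hypothetical toy data: (feature, label) pairs.
T = [(0.0, "a"), (0.1, "a"), (0.2, "a"), (0.9, "b"), (1.0, "b"), (1.1, "b")]
err = oob_error(T, S=25, rng=random.Random(0))
```

Because every record is scored only by models that never saw it, no separate test set is needed for this estimate.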