RIPPER
This page contains my reading notes on Cohen's RIPPER paper, Fast Effective Rule Induction (Cohen, 1995).
Some of the knowledge is also drawn from:
- Incremental Reduced Error Pruning
- MDL and Categorical Theories (Continued)
- A blog post on his RIPPER Python package
## Introduction
- In this paper, Cohen first implements his own version of IREP (Incremental Reduced Error Pruning) with some minor differences, adding support for multi-class problems and missing attribute values.
- Then he proposes several major changes to IREP and names the improved version IREP*.
- Finally, based on IREP*, he proposes a new rule mining algorithm called RIPPER (Repeated Incremental Pruning to Produce Error Reduction).
## IREP (Cohen version)
IREP algorithm
The characteristics of IREP are twofold:
1. Separate and conquer: the instances covered by a newly found rule are removed from the training set, so in the next iteration a new rule is learned only on the training instances not yet covered by the previously found rules.
2. Integration of pre-pruning and post-pruning:
   1. Pre-pruning: some training examples are deliberately ignored during training (early stopping condition).
   2. Post-pruning: the model is first trained to fit the training set perfectly, and then parts of it are deleted afterwards (branch cutting).
Function: IREP.
Input: the training set \mathcal{D} with binary labels and all possible features \mathcal{F}.
Output: the learned rule set \mathcal{R}.
1. Initialize an empty rule set \mathcal{R}.
2. While there are still positive instances in \mathcal{D}:
   1. Randomly choose 2/3 of \mathcal{D} as the growing set \mathcal{G}; the remaining 1/3 becomes the pruning set \mathcal{P}.
   2. R = GrowRule(\mathcal{G}).
   3. R = PruneRule(\mathcal{P}, R).
   4. If the accuracy of R on \mathcal{P} is < 0.5: break.
   5. Add R to \mathcal{R}.
   6. Remove instances covered by R from \mathcal{D}.
3. Return \mathcal{R}.
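As a concrete reference, here is a minimal Python sketch of this outer loop. The (feature, op, value) rule representation and the helper names are my own choices, not from the paper; grow_rule and prune_rule stand in for the GrowRule and PruneRule procedures described below.

```python
import random

# A rule is a list of (feature, op, value) conditions with op in {"==", "<=", ">="};
# an instance is an (x, y) pair where x is a dict of feature values and y is 0 or 1.

def covers(rule, x):
    """True if instance x satisfies every condition of the rule."""
    checks = {"==": lambda a, b: a == b,
              "<=": lambda a, b: a <= b,
              ">=": lambda a, b: a >= b}
    return all(checks[op](x[feat], val) for feat, op, val in rule)

def accuracy(rule, data):
    """Fraction of instances covered by the rule that are positive."""
    covered = [y for x, y in data if covers(rule, x)]
    return sum(covered) / len(covered) if covered else 0.0

def irep(dataset, grow_rule, prune_rule):
    """IREP outer loop: grow, prune, test, add, then separate and conquer."""
    rule_set, data = [], list(dataset)
    while any(y == 1 for _, y in data):
        random.shuffle(data)
        cut = (2 * len(data)) // 3
        grow_set, prune_set = data[:cut], data[cut:]   # 2/3 grow, 1/3 prune
        rule = prune_rule(prune_set, grow_rule(grow_set))
        if accuracy(rule, prune_set) < 0.5:            # Cohen's 50% stopping condition
            break
        rule_set.append(rule)
        # Separate and conquer: drop the instances the new rule covers.
        data = [(x, y) for x, y in data if not covers(rule, x)]
    return rule_set
```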
The minor differences between Cohen's implementation of IREP and the original version are:
1. The stopping condition: the original IREP stops when the accuracy of the learned rule is below the accuracy of the empty rule, rather than below 50%.
2. The PruneRule algorithm, which is detailed later.
Grow a Rule
In each iteration, the condition (feature f with value v) with the maximum FOIL information gain is added to the rule; the iterations terminate when the rule no longer covers any negative instances in the growing set.
Function: GrowRule.
Input: the growing set \mathcal{G} with binary labels and all possible features \mathcal{F}.
Output: the unpruned rule R.
1. Initialize an empty rule R.
2. Until all instances in \mathcal{G} that satisfy R are positive (the accuracy of R on \mathcal{G} is 1) or there is no feature left to add:
   1. For every feature f \in \mathcal{F} not in R and every possible value v \in \mathcal{V}(f):
      1. Create a temporary rule R_{t} by copying the current R.
      2. Add (f, v) to R_{t}.
      3. Calculate FOIL's information gain of R_{t}, \operatorname{Foil}(R, R_{t}), on \mathcal{G}.
   2. Let R_{t}^{max} be the candidate with the maximum value of \operatorname{Foil}(R, R_{t}).
   3. R = R_{t}^{max}.
3. Return R.
[Support for categorical and continuous features]: the definition of \mathcal{V}(f) differs for categorical and numerical features.
1. For a categorical feature f_{c}, \mathcal{V}(f_{c}) is the collection of all possible values that f_{c} can take.
2. For a numerical feature f_{n}, \mathcal{V}(f_{n}) is the Cartesian product of \{\leq, \geq\} and all values of f_{n} that appear in the training set. For example, if the values of the feature age that appear in the training set are \{10, 20, 30\}, then \mathcal{V}(\text{age}) = \{\leq 10, \geq 10, \leq 20, \geq 20, \leq 30, \geq 30\}.
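A small sketch of how \mathcal{V}(f) could be enumerated in code; the function name and the (feature, op, value) tuples are illustrative conventions of mine, not the paper's.

```python
def candidate_conditions(feature, observed_values, categorical):
    """Enumerate V(f) as (feature, op, value) tests from the training set."""
    if categorical:
        # Categorical feature: one equality test per possible value.
        return [(feature, "==", v) for v in sorted(set(observed_values))]
    # Numerical feature: a <= test and a >= test for every observed value.
    conds = []
    for v in sorted(set(observed_values)):
        conds += [(feature, "<=", v), (feature, ">=", v)]
    return conds

# candidate_conditions("age", [10, 20, 30], categorical=False) yields
# [("age", "<=", 10), ("age", ">=", 10), ("age", "<=", 20), ..., ("age", ">=", 30)]
```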
[FOIL's information gain]: it measures how much the information needed to signal that a covered instance is positive decreases from R_{old} to R_{new}, weighted by the number of positive instances covered by R_{new}.
\operatorname{Foil}(R_{old}, R_{new}) = P(R_{new}) \left( \log_{2} \left( \frac{P(R_{new})}{P(R_{new}) + N(R_{new})} \right) - \log_{2} \left( \frac{P(R_{old})}{P(R_{old}) + N(R_{old})} \right) \right)
where P(R) (N(R)) is the number of positive (negative) instances covered by R.
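The gain and the greedy growing loop could look roughly like the sketch below; it reuses covers() and the condition tuples from the earlier sketches, and the helper names are mine.

```python
from math import log2

def foil_gain(p_old, n_old, p_new, n_new):
    """FOIL's information gain from R_old to R_new, given the positive and
    negative counts each rule covers on the growing set."""
    if p_new == 0:
        return float("-inf")                   # a rule covering no positives is useless
    gain = log2(p_new / (p_new + n_new))
    if p_old > 0:
        gain -= log2(p_old / (p_old + n_old))
    return p_new * gain

def grow_rule(grow_set, conditions):
    """Greedily add the condition with the highest FOIL gain until the rule
    covers no negatives in the growing set (or no condition is left to add)."""
    rule = []
    while True:
        pos = [x for x, y in grow_set if y == 1 and covers(rule, x)]
        neg = [x for x, y in grow_set if y == 0 and covers(rule, x)]
        available = [c for c in conditions if c not in rule]
        if not neg or not available:           # rule is pure, or nothing left to add
            return rule
        best = max(available, key=lambda c: foil_gain(
            len(pos), len(neg),
            sum(1 for x in pos if covers([c], x)),
            sum(1 for x in neg if covers([c], x))))
        rule.append(best)
```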
Prune a Rule
PruneRule considers deleting any final sequence of conditions from the rule (in reverse order of how they were grown) and chooses the deletion that maximizes the Rule-Value metric on the pruning set.
Function: PruneRule.
Input: the pruning set \mathcal{P} and the unpruned rule R.
Output: the pruned rule R_{p}.
1. Initialize R_{p} = R.
2. For each (f, v)_{i} \in R, from the last added to the first:
   1. Remove (f, v)_{i} from R.
   2. If \operatorname{Value}(R) \geq \operatorname{Value}(R_{p}): R_{p} = R.
3. Return R_{p}.
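A sketch of this deletion search; value is whichever Rule-Value metric is in use (the IREP and IREP* versions are given below), and the interface is my own.

```python
def prune_rule(prune_set, rule, value):
    """Consider dropping every final sequence of conditions (including
    dropping all of them) and keep the prefix with the best metric value."""
    best, best_value = rule, value(rule, prune_set)
    for k in range(len(rule) - 1, -1, -1):
        candidate = rule[:k]                   # delete the last len(rule) - k conditions
        v = value(candidate, prune_set)
        if v >= best_value:                    # >= means ties go to the shorter rule
            best, best_value = candidate, v
    return best
```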
The original implementation of IREP only considers the “deletions of a single final condition”.
[IREP Rule-Value metric]:
\operatorname{Value}(R) = \frac{P(R) + (N - N(R))}{P + N}
where P (N) is the total number of positive (negative) instances and P(R) (N(R)) is the number of positive (negative) instances covered by R.
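In code, with the covers() helper from the earlier sketch, this metric is simply the accuracy of the rule used as a classifier on the pruning set (the name and interface are mine, for illustration):

```python
def irep_value(rule, prune_set):
    """IREP Rule-Value metric: (p + (N - n)) / (P + N) on the pruning set."""
    P = sum(1 for _, y in prune_set if y == 1)
    N = sum(1 for _, y in prune_set if y == 0)
    p = sum(1 for x, y in prune_set if y == 1 and covers(rule, x))
    n = sum(1 for x, y in prune_set if y == 0 and covers(rule, x))
    return (p + (N - n)) / (P + N) if (P + N) else 0.0
```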
## IREP* as an improved version of IREP
The support for multi-class problems and missing values allows IREP to be applied to a wide range of benchmarks, and Cohen further improves his implementation of IREP in two respects: the stopping condition and the pruning (Rule-Value) metric.
New Rule-Value metric
The IREP Rule-Value metric is sometimes highly unintuitive. Assuming P and N are both fixed at 3000, the IREP Rule-Value metric prefers R_{1} over R_{2} in the following example, even though R_{2} is obviously more predictive:
- R_{1}: P(R_{1}) = 2000, N(R_{1}) = 1000, \operatorname{Value}(R_{1}) = \frac{2000 + (3000 - 1000)}{6000} = \frac{4000}{6000}
- R_{2}: P(R_{2}) = 1000, N(R_{2}) = 1, \operatorname{Value}(R_{2}) = \frac{1000 + (3000 - 1)}{6000} = \frac{3999}{6000}
Cohen's replacement metric, used in IREP*, doesn't have this issue.
[IREP* Rule-Value metric]:
\operatorname{Value}(R) = \frac{P(R) - N(R)}{P(R) + N(R)}
where P(R) is the number of positive instances covered by R and N(R) is the number of negative instances covered by R.
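The corresponding sketch, again using the earlier covers() helper. On the example above it ranks the rules the intuitive way: R_{1} scores (2000 - 1000)/3000 = 1/3, while R_{2} scores (1000 - 1)/1001 ≈ 0.998.

```python
def irep_star_value(rule, prune_set):
    """IREP* Rule-Value metric: (p - n) / (p + n) over the covered instances."""
    p = sum(1 for x, y in prune_set if y == 1 and covers(rule, x))
    n = sum(1 for x, y in prune_set if y == 0 and covers(rule, x))
    return (p - n) / (p + n) if (p + n) else 0.0
```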
New Stopping condition
IREP stops adding rules when the most recently learned rule has poor (< 50%) accuracy on the pruning set. This estimate often makes the algorithm stop too early, especially when the rule has very low coverage (for example, the algorithm stops if the rule covers only 2 instances and 1 of them is negative).
IREP* defines the stopping condition based on the total description length value of the currently learned rule set on the pruning set.
- Calculate the total description length of \mathcal{R}: \operatorname{MDL}(\mathcal{R}).
- If \operatorname{MDL}(\mathcal{R}) > \operatorname{MDL}_{min} + d: break
- If \operatorname{MDL}(\mathcal{R}) < \operatorname{MDL}_{min}: \operatorname{MDL}_{min} = \operatorname{MDL}(\mathcal{R})
where \operatorname{MDL}(\mathcal{R}) is the total description length of the rule set \mathcal{R}, \operatorname{MDL}_{min} is the smallest total description length observed so far, and d is a hyperparameter with a default value of 64 in the paper's experiments.
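A sketch of how this bookkeeping could be wired into the rule-adding loop. learn_one_rule stands in for one grow-and-prune step and description_length for \operatorname{MDL}(\mathcal{R}) as defined below; both names, and the choice to discard the rule that triggers the stop, are my assumptions rather than details from the paper.

```python
def irep_star_loop(dataset, learn_one_rule, description_length, d=64):
    """Add rules until the total description length exceeds the best seen by d bits."""
    rule_set, data = [], list(dataset)
    mdl_min = float("inf")
    while any(y == 1 for _, y in data):
        rule = learn_one_rule(data)
        rule_set.append(rule)
        mdl = description_length(rule_set, data)
        if mdl > mdl_min + d:       # rule set became too expensive to describe: stop
            rule_set.pop()          # assumption: drop the rule that triggered the stop
            break
        mdl_min = min(mdl_min, mdl)
        data = [(x, y) for x, y in data if not covers(rule, x)]
    return rule_set
```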
[MDL Principle (Minimum Description Length Principle)]: From the Machine Learning perspective, each model derived from the dataset can be characterized by a description length, which is defined as the number of bits required to encode the model and the data from which it was learned. MDL Principle states that the model with the minimum description length is generally preferred to avoid over-fitting.
The description length consists of the model description length (theory cost) and the exceptions description length (exceptions cost).
- Model description length measures the complexity of the model. A higher model description length means the model is more complex and thus more prone to over-fitting.
- Exceptions description length measures the degree to which the model fails to fit the data. The larger the exceptions description length, the more error-prone the model is.
For RIPPER, the description length of a rule set is defined as the sum of the model description length of each rule plus the exceptions description length of the whole rule set:
\operatorname{MDL}(\mathcal{R}) = \sum_{R_{i} \in \mathcal{R}} \operatorname{MDL}_{M}(R_{i}) + \operatorname{MDL}_{E}(\mathcal{R})
Model description length of each rule calculates how many bits are needed to encode a rule:
\operatorname{MDL}_{M}(R) = 0.5(k\log_{2}\frac{1}{p} + (n - k) \log_2\frac{1}{1 - p} + \lVert k \rVert)
where k is the number of features in the rule, n is the number of all features, and p = \frac{k}{n}. \lVert k \rVert = \log_{2}(k) is the number of bits required to encode the number k. The 0.5 factor is to “account for possible redundancies”.
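A direct transcription of this formula; the guards for k = 0 and k = n are my own additions, since the log terms degenerate in those cases.

```python
from math import log2

def model_description_length(k, n):
    """Bits to encode a rule that uses k of the n possible features, times 0.5."""
    if k == 0:
        return 0.0                              # empty rule: nothing to encode
    p = k / n
    bits = log2(k)                              # ||k||: bits to send the integer k
    bits += k * log2(1 / p)                     # identify which k features are used
    if n > k:
        bits += (n - k) * log2(1 / (1 - p))     # ... and which n - k are not
    return 0.5 * bits                           # 0.5 accounts for possible redundancies
```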
Exceptions description length evaluates the errors of the rule set on a given dataset:
\operatorname{MDL}_{E}(\mathcal{R}) = \log_{2}{P(\mathcal{R}) \choose \mathit{FP}(\mathcal{R})} + \log_{2}{N(\mathcal{R}) \choose \mathit{FN}(\mathcal{R})}
where P(\mathcal{R}) is the number of instances covered by the rule set \mathcal{R} (i.e. predicted positive), N(\mathcal{R}) is the number of instances not covered by \mathcal{R} (predicted negative), \mathit{FP}(\mathcal{R}) is the number of false positives among the covered instances, and \mathit{FN}(\mathcal{R}) is the number of false negatives among the uncovered instances.
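A sketch of the exceptions cost. The argument names follow the reading above: FP is bounded by the number of covered instances and FN by the number of uncovered ones.

```python
from math import comb, log2

def exceptions_description_length(covered, fp, uncovered, fn):
    """Bits to identify the fp errors among the covered instances and the fn
    errors among the uncovered ones: log2 C(covered, fp) + log2 C(uncovered, fn)."""
    return log2(comb(covered, fp)) + log2(comb(uncovered, fn))
```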
## RIPPER
TODO: Many of the implementation details are from this public Github implementation, since the original paper doesn’t elaborate on how exactly they are implemented.
RIPPER = IREP* + a post-processing optimization
RIPPER further improves on IREP* by post-pruning the rules generated by IREP*.
Function: RIPPER.
Input: a training set \mathcal{D}.
Output: an optimized rule set \mathcal{R}_{o}.
1. Run IREP* on \mathcal{D} to get \mathcal{R}.
2. \mathcal{R}_{o} = Optimize(\mathcal{D}, \mathcal{R}). Note that \mathcal{D} here is a copy of the original training set (no instances removed by IREP*).
3. While there are still positive instances in \mathcal{D}:
   1. R = GrowRule*(\mathcal{D}).
   2. Remove instances covered by R from \mathcal{D}.
   3. Add R to \mathcal{R}_{o}.
4. Return \mathcal{R}_{o}.
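A compact sketch of this outer procedure. irep_star, optimize, and grow_rule_star are assumed callables for the steps above, and I assume Optimize also returns the instances it left uncovered (matching the in-place removal in its pseudocode below).

```python
def ripper(dataset, irep_star, optimize, grow_rule_star):
    """IREP*, then the Optimize pass, then extra rules to cover leftover positives."""
    rules = irep_star(list(dataset))
    optimized, remaining = optimize(list(dataset), rules)
    while any(y == 1 for _, y in remaining):
        rule = grow_rule_star(remaining)
        optimized.append(rule)
        remaining = [(x, y) for x, y in remaining if not covers(rule, x)]
    return optimized
```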
The Optimize procedure proposes 2 more versions of each rule learned by IREP* and selects the best of the three, using the MDL metric, to add to the final rule set.
Function: Optimize.
Input: a training set \mathcal{D} and a rule set \mathcal{R}.
Output: an optimized rule set \mathcal{R}_{o}.
1. Initialize an empty rule set \mathcal{R}_{o}.
2. For each R_{i} \in \mathcal{R}, in the order the rules were learned:
   1. Randomly choose 2/3 of \mathcal{D} as the growing set \mathcal{G}; the remaining 1/3 becomes the pruning set \mathcal{P}.
   2. Grow the replacement rule from scratch using FOIL's information gain: \hat{R}_{i} = GrowRule(\mathcal{G}).
   3. Form a new rule set by replacing R_{i} in \mathcal{R} with \hat{R}_{i}: \hat{\mathcal{R}} = \{R_{1}, \dots, \hat{R}_{i}, \dots, R_{n}\}.
   4. Prune the replacement rule: \hat{R}_{i} = PruneRule(\mathcal{P}, \hat{R}_{i}), but using the accuracy of the rule set \hat{\mathcal{R}} on \mathcal{P} instead of the Rule-Value metric as the objective to maximize.
   5. Update \hat{R}_{i} in \hat{\mathcal{R}} with the pruned version.
   6. Grow the revision rule starting from the current rule R_{i}, again using FOIL's information gain: \bar{R}_{i} = GrowRule(\mathcal{G}).
   7. Form a new rule set by replacing R_{i} in \mathcal{R} with \bar{R}_{i}: \bar{\mathcal{R}} = \{R_{1}, \dots, \bar{R}_{i}, \dots, R_{n}\}.
   8. Prune the revision rule: \bar{R}_{i} = PruneRule(\mathcal{P}, \bar{R}_{i}), again using the accuracy of the rule set \bar{\mathcal{R}} on \mathcal{P} as the objective to maximize.
   9. Update \bar{R}_{i} in \bar{\mathcal{R}} with the pruned version.
   10. The best of the 3 versions is the one whose corresponding rule set has the smallest minimum total description length: R_{i}^{best} = \arg \min_{R \in \{R_{i}, \hat{R}_{i}, \bar{R}_{i}\}} \operatorname{MTDL}(\{R_{1}, \dots, R, \dots, R_{n}\}).
   11. Add R_{i}^{best} to \mathcal{R}_{o}.
   12. Remove instances covered by R_{i}^{best} from \mathcal{D}.
   13. If there are no positive instances left in \mathcal{D}: break.
3. Return \mathcal{R}_{o}.
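The sketch below mirrors these steps in Python. The interfaces are mine: grow_rule(grow_set, start) grows from an initial rule (empty for the replacement, R_{i} for the revision), prune_for_set prunes against whole-rule-set accuracy on the pruning set, and mtdl computes the minimum total description length.

```python
import random

def optimize(data, rules, grow_rule, prune_for_set, mtdl):
    """For each rule, build a replacement and a revision, prune both against
    whole-rule-set accuracy, and keep the variant with the smallest MTDL."""
    optimized, remaining = [], list(data)
    for i, rule in enumerate(rules):
        random.shuffle(remaining)
        cut = (2 * len(remaining)) // 3
        grow_set, prune_set = remaining[:cut], remaining[cut:]

        variants = [rule]
        for start in ([], rule):                       # replacement, then revision
            cand = grow_rule(grow_set, start)
            rule_set = rules[:i] + [cand] + rules[i + 1:]
            variants.append(prune_for_set(prune_set, cand, rule_set))

        # Keep the variant whose rule set has the smallest minimum total DL.
        best = min(variants,
                   key=lambda r: mtdl(rules[:i] + [r] + rules[i + 1:], remaining))
        optimized.append(best)
        remaining = [(x, y) for x, y in remaining if not covers(best, x)]
        if not any(y == 1 for _, y in remaining):
            break
    return optimized, remaining
```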
[Minimum total description length]: Given a rule set, the minimum total description length is the description length of the rule set after deleting the rules that increase the total description length of the rule set.
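One simple way to realize this definition, assuming the \operatorname{MDL}(\mathcal{R}) function described earlier; the greedy delete-while-it-helps loop is my reading of "deleting the rules that increase the total description length".

```python
def minimum_total_description_length(rules, data, description_length):
    """Delete rules as long as doing so lowers MDL, then return the final MDL."""
    kept = list(rules)
    best = description_length(kept, data)
    improved = True
    while improved:
        improved = False
        for i in range(len(kept)):
            trial = kept[:i] + kept[i + 1:]
            dl = description_length(trial, data)
            if dl < best:
                kept, best, improved = trial, dl, True
                break
    return best
```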
RIPPER2 and RIPPERk
RIPPER2 runs Optimize once more (plus the step of covering any remaining positives) on the output of RIPPER, while RIPPERk repeats this optimization round k times.
Function: RIPPER2.
Input: a training set \mathcal{D}.
Output: an optimized rule set \mathcal{R}_{o}.
1. Run RIPPER on \mathcal{D} to get \mathcal{R}.
2. \mathcal{R}_{o} = Optimize(\mathcal{D}, \mathcal{R}). Note that \mathcal{D} here is again a copy of the original training set (no instances removed by the previous run).
3. While there are still positive instances in \mathcal{D}:
   1. R = GrowRule*(\mathcal{D}).
   2. Remove instances covered by R from \mathcal{D}.
   3. Add R to \mathcal{R}_{o}.
4. Return \mathcal{R}_{o}.
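A sketch of the wrapper, where ripper is the procedure above and optimize_round stands for one Optimize pass followed by covering any remaining positives; the names and loop structure are my own reading of RIPPERk.

```python
def ripper_k(dataset, ripper, optimize_round, k=2):
    """RIPPERk: run RIPPER once, then apply k - 1 further optimization rounds
    (k = 2 corresponds to RIPPER2)."""
    rules = ripper(dataset)
    for _ in range(k - 1):
        rules = optimize_round(dataset, rules)
    return rules
```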