CG Decision Rules
This page contains my reading notes on learning decision rule sets via column generation (CG).
Problem formulation
These papers try to find a decision rule set as a binary classifier by formulating an integer programming problem.
A decision rule set is a binary classifier that only takes binary inputs.
The decision rule set is in DNF form (OR-of-ANDs): each rule is an AND over a subset of the binary inputs, and all rules are connected by an OR.
If an instance satisfies any of the rules, it is predicted to be positive; otherwise, it is predicted to be negative.
Learning a decision rule set can be formulated as a large integer programming (IP) problem.
A rule is a subset of input features. If there are m binary input features, a rule can be represented by a binary vector of length m (1 means the feature is included in the rule, 0 otherwise).
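To make the representation concrete, here is a minimal sketch (all names are mine, not from the papers) of evaluating a single rule and a DNF rule set on a binary instance:

```python
def satisfies(x, rule):
    """True iff binary instance x satisfies the rule, i.e. x has a 1 in
    every feature the rule includes (an AND over the included features)."""
    return all(xj == 1 for xj, rj in zip(x, rule) if rj == 1)


def predict(x, rule_set):
    """DNF prediction: positive (+1) if any rule is satisfied, else -1."""
    return 1 if any(satisfies(x, r) for r in rule_set) else -1


# m = 4 features; (1, 0, 1, 0) encodes the rule "x1 AND x3".
rules = [(1, 0, 1, 0), (0, 1, 0, 0)]
print(predict((1, 0, 1, 1), rules))  # 1: satisfies the first rule
print(predict((0, 0, 1, 1), rules))  # -1: satisfies neither rule
```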
Let \mathcal{K} denote the set of all possible rules. \mathcal{K} contains 2^{m} - 1 rules in total (the all-zero vector is not a valid rule).
Given a training set with a set of positive instances \mathcal{P} and a set of negative instances \mathcal{N}, we can minimize a loss function that characterizes the classification error.
0-1 Loss
The 0-1 loss is the most direct loss: it counts the number of false positives and false negatives.
\begin{aligned} \min \quad & \sum_{i \in \mathcal{P}} \mathcal{L}_{i} + \sum_{i \in \mathcal{N}} \mathcal{L}_{i} \\ \text{s.t. } \quad & \mathcal{L}_{i} + \sum_{k \in \mathcal{K}_{i}} w_{k} \geq 1, && i \in \mathcal{P} \\ & w_{k} \leq \mathcal{L}_{i}, && i \in \mathcal{N}, k \in \mathcal{K}_{i} \\ & w_{k}, \mathcal{L}_{i} \in \{0, 1\}, && k \in \mathcal{K}, i \in \mathcal{P} \cup \mathcal{N} \\ \end{aligned}
Apart from \mathcal{K}, \mathcal{P}, \mathcal{N} defined above, some new terms are used:
Let w_{k} \in \{0, 1\} be a variable indicating whether rule k \in \mathcal{K} is selected in the rule set.
Let \mathcal{L}_{i} \in \{0, 1\} be a variable indicating whether an instance i \in \mathcal{P} \cup \mathcal{N} is misclassified.
Let \mathcal{K}_{i} \subset \mathcal{K} be the set of rules satisfied by instance i.
Explanation:
Objective term \sum_{i \in \mathcal{P}} \mathcal{L}_{i} is the number of false negatives and \sum_{i \in \mathcal{N}} \mathcal{L}_{i} is the number of false positives, so the objective directly minimizes the total number of classification errors.
Constraint 1 defines the misclassification of a positive instance (false negative) by forcing \mathcal{L}_{i} to take value 1 if no rule that is satisfied by the instance i \in \mathcal{P} is in the rule set.
Constraint 2 defines the misclassification of a negative instance (false positive) by forcing \mathcal{L}_{i} to take value 1 if any rule that is satisfied by the instance i \in \mathcal{N} is in the rule set.
Constraint 3 defines the type of each variable.
The first constraint contributes \lvert \mathcal{P} \rvert inequalities, while the second contributes up to \lvert \mathcal{N} \rvert \cdot \lvert \mathcal{K} \rvert (assuming every negative instance satisfies every possible rule). The second constraint family is therefore very expensive; it can be avoided using the Hamming loss introduced below.
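To make the formulation concrete, here is a minimal sketch of the 0-1 loss IP using PuLP. It brute-forces all 2^m - 1 rules, so it is only usable for toy m; avoiding this enumeration is exactly what motivates column generation. All function and variable names here are mine, not from the papers.

```python
from itertools import product

import pulp  # pip install pulp


def satisfies(x, rule):
    """True iff binary instance x satisfies the rule (an AND of its features)."""
    return all(xj == 1 for xj, rj in zip(x, rule) if rj == 1)


def fit_rule_set_01(P, N, m):
    """Solve the 0-1 loss IP by enumerating all 2^m - 1 candidate rules."""
    rules = [r for r in product([0, 1], repeat=m) if any(r)]  # the set K
    prob = pulp.LpProblem("zero_one_loss", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{k}", cat="Binary") for k in range(len(rules))]
    data = list(P) + list(N)
    L = [pulp.LpVariable(f"L_{i}", cat="Binary") for i in range(len(data))]
    # K_i: indices of the rules satisfied by instance i.
    K = [[k for k, r in enumerate(rules) if satisfies(x, r)] for x in data]
    prob += pulp.lpSum(L)  # objective: total number of misclassifications
    for i in range(len(P)):  # false negative: no selected rule covers i
        prob += L[i] + pulp.lpSum(w[k] for k in K[i]) >= 1
    for i in range(len(P), len(data)):  # false positive: a selected rule covers i
        for k in K[i]:
            prob += w[k] <= L[i]
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [rules[k] for k in range(len(rules)) if w[k].value() == 1]


# Toy example: the positives are exactly the instances with x1 = x2 = 1.
P = [(1, 1, 0), (1, 1, 1)]
N = [(0, 1, 0), (1, 0, 1), (0, 0, 0)]
print(fit_rule_set_01(P, N, m=3))  # e.g. [(1, 1, 0)], the rule "x1 AND x2"
```

Note the nested loop over negative instances and their satisfied rules: that is the \lvert \mathcal{N} \rvert \cdot \lvert \mathcal{K} \rvert constraint family discussed above.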
Hamming Loss
The Hamming loss also counts false positives and false negatives, but differently on the negative side: each false negative still incurs a unit loss, while each negative instance incurs a loss equal to the number of selected rules it satisfies (rather than a single unit per false positive).
\begin{aligned} \min \quad & \sum_{i \in \mathcal{P}} \mathcal{L}_{i} + \sum_{i \in \mathcal{N}} \sum_{k \in \mathcal{K}_{i}} w_{k} \\ \text{s.t. } \quad & \mathcal{L}_{i} + \sum_{k \in \mathcal{K}_{i}} w_{k} \geq 1, && i \in \mathcal{P} \\ & w_{k}, \mathcal{L}_{i} \in \{0, 1\}, && k \in \mathcal{K}, i \in \mathcal{P} \\ \end{aligned}
Explanation:
Objective term \sum_{i \in \mathcal{P}} \mathcal{L}_{i} counts the number of false negatives (same as in the 0-1 loss), while \sum_{i \in \mathcal{N}} \sum_{k \in \mathcal{K}_{i}} w_{k} counts, for each negative instance, the number of selected rules it satisfies.
Constraint 1 defines a false negative (same as the 1st constraint of the 0-1 loss).
Constraint 2 defines the type of each variable (as in the 0-1 loss, except that \mathcal{L}_{i} is now only needed for i \in \mathcal{P}).
The Hamming loss eliminates the 2nd constraint of the 0-1 loss by replacing the count of false positives with the number of selected rules covering each negative instance (a Hamming-distance-style penalty), thus greatly reducing the problem size.
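The same enumeration-based sketch as above, rewritten for the Hamming loss: the per-(negative instance, rule) constraints disappear, and the coverage of negatives is charged directly in the objective. Again, all names are mine.

```python
from itertools import product

import pulp


def satisfies(x, rule):
    return all(xj == 1 for xj, rj in zip(x, rule) if rj == 1)


def fit_rule_set_hamming(P, N, m):
    """Solve the Hamming loss IP by enumerating all 2^m - 1 candidate rules."""
    rules = [r for r in product([0, 1], repeat=m) if any(r)]
    prob = pulp.LpProblem("hamming_loss", pulp.LpMinimize)
    w = [pulp.LpVariable(f"w_{k}", cat="Binary") for k in range(len(rules))]
    L = [pulp.LpVariable(f"L_{i}", cat="Binary") for i in range(len(P))]
    K_pos = [[k for k, r in enumerate(rules) if satisfies(x, r)] for x in P]
    K_neg = [[k for k, r in enumerate(rules) if satisfies(x, r)] for x in N]
    # Objective: false negatives, plus one unit for every (negative instance,
    # selected rule covering it) pair -- the Hamming penalty.
    prob += pulp.lpSum(L) + pulp.lpSum(w[k] for Ki in K_neg for k in Ki)
    # Only the false-negative constraint family remains.
    for i, Ki in enumerate(K_pos):
        prob += L[i] + pulp.lpSum(w[k] for k in Ki) >= 1
    prob.solve(pulp.PULP_CBC_CMD(msg=False))
    return [rules[k] for k in range(len(rules)) if w[k].value() == 1]
```

Compared with the 0-1 loss sketch, this model has \lvert \mathcal{N} \rvert fewer variables and up to \lvert \mathcal{N} \rvert \cdot \lvert \mathcal{K} \rvert fewer constraints. A rule set with zero loss under one objective also has zero loss under the other, though nonzero optima can differ.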
Fairness metrics and constraints
- Given a set of protected features (e.g. sex, red), the training instances can be divided into different groups \mathcal{G} (e.g. 4 groups: \mathcal{G} = \{sex=male, sex=female, red=True, red=False\}).
- Two types of fairness metrics:
- Equality of opportunity requires the false negative rate to be equal for all groups.
- Equalized odds requires both the false positive and the false negative rate to be equal for all groups; it is a stricter version of equality of opportunity.
- The maximum violation can be used to measure the unfairness of a classifier d: \Delta(d) = \max_{g, g' \in \mathcal{G}} \lvert \mathbb{P}(d(X)=-1 \mid Y=1, G=g) - \mathbb{P}(d(X)=-1 \mid Y=1, G=g') \rvert. This gives the maximum difference in false negative rate (equality of opportunity) between any two groups g and g'; a sketch computing it empirically appears after this list.
- Assuming \mathcal{G} = \{1, 2\} (there are only two groups), we can add the following constraints to the Hamming loss formulation to impose fairness:
- Equality of opportunity (false negative rate): \frac{1}{\lvert \mathcal{P}_{1} \rvert} \sum_{i \in \mathcal{P}_{1}} \mathcal{L}_{i} - \frac{1}{\lvert \mathcal{P}_{2} \rvert} \sum_{i \in \mathcal{P}_{2}} \mathcal{L}_{i} \leq \epsilon_{1} \quad \text{and} \quad \frac{1}{\lvert \mathcal{P}_{2} \rvert} \sum_{i \in \mathcal{P}_{2}} \mathcal{L}_{i} - \frac{1}{\lvert \mathcal{P}_{1} \rvert} \sum_{i \in \mathcal{P}_{1}} \mathcal{L}_{i} \leq \epsilon_{1}
- Equalized odds (false positive rate): \frac{1}{\lvert \mathcal{N}_{1} \rvert} \sum_{i \in \mathcal{N}_{1}} \sum_{k \in \mathcal{K}_{i}} w_{k} - \frac{1}{\lvert \mathcal{N}_{2} \rvert} \sum_{i \in \mathcal{N}_{2}} \sum_{k \in \mathcal{K}_{i}} w_{k} \leq \epsilon_{2} \quad \text{and} \quad \frac{1}{\lvert \mathcal{N}_{2} \rvert} \sum_{i \in \mathcal{N}_{2}} \sum_{k \in \mathcal{K}_{i}} w_{k} - \frac{1}{\lvert \mathcal{N}_{1} \rvert} \sum_{i \in \mathcal{N}_{1}} \sum_{k \in \mathcal{K}_{i}} w_{k} \leq \epsilon_{2}, where in both cases \mathcal{P}_{1} denotes the positive instances in group 1 (similarly for \mathcal{P}_{2}, \mathcal{N}_{1}, \mathcal{N}_{2}), and \epsilon_{1}, \epsilon_{2} are constants bounding the maximum allowed unfairness in false negatives and false positives, respectively. A code sketch of these pieces follows this list.
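Minimal sketches for the fairness pieces above (all names mine): the empirical maximum violation \Delta(d) for equality of opportunity, and the two \lvert \mathcal{G} \rvert = 2 equality-of-opportunity constraints added to the Hamming-loss model from the previous sketch. The equalized-odds constraints are analogous, with sums of w_k over the rules satisfied by each negative group.

```python
import pulp


def max_violation(y_true, y_pred, groups):
    """Empirical Delta(d): the largest false-negative-rate gap between any
    two groups. Labels are +1/-1; assumes every group has >= 1 positive."""
    fnr = {}
    for g in set(groups):
        pos = [p for t, p, gg in zip(y_true, y_pred, groups) if t == 1 and gg == g]
        fnr[g] = sum(1 for p in pos if p == -1) / len(pos)
    return max(abs(fnr[g] - fnr[h]) for g in fnr for h in fnr)


def add_equality_of_opportunity(prob, L, P1_idx, P2_idx, eps1):
    """Add the two linear constraints bounding the FNR gap between groups
    1 and 2 by eps1. L are the false-negative variables of the Hamming
    model; P1_idx/P2_idx index the positive instances in each group."""
    fnr1 = pulp.lpSum(L[i] for i in P1_idx) * (1.0 / len(P1_idx))
    fnr2 = pulp.lpSum(L[i] for i in P2_idx) * (1.0 / len(P2_idx))
    prob += fnr1 - fnr2 <= eps1
    prob += fnr2 - fnr1 <= eps1
```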