The Azimuth Project
Receiver Operating Characteristic analysis (Rev #1)

Idea

When building any system that maps inputs to a classification output, and machine learning classifiers in particular, there are multiple kinds of error (false positives, false negatives, etc.). It is desirable to be able to say which classifiers are “better” even without fully committing to the relative importance of the various kinds of error. Receiver Operating Characteristic (ROC) analysis is a method for doing this to the extent possible.

Details

Note: for simplicity, we describe the case where the class prior probabilities are equal, i.e. a data item is equally likely to belong to each class; analogous results hold in the case of unequal priors.

Kinds of errors

Consider a system where each item has a feature vector $f \in F$ and a class $c \in C$. Given a classifier $\xi : F \rightarrow C$, on a data set $\{(f_i, c_i)\}$ it will in general make some misclassifications, where $\xi(f_j) = d_j \ne c_j$; each is an instance of misclassifying $c_j$ as $d_j$. In the two-class case, to which we specialise until further notice, these get the special names false positive (fp) ($d = true$ when $c = false$) and false negative (fn) ($d = false$ when $c = true$), alongside the correct classifications true positive (tp) and true negative (tn). Note that, treating each of the four as a rate conditional on the true class (for example, tp is the fraction of truly positive items that are classified as positive), in the limit there are the relations

$$tp + fn = 1 \quad \text{and} \quad tn + fp = 1 \qquad (1)$$

so that one has some freedom in terms of which variables to use.
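As a small illustration of these quantities, here is a minimal Python sketch. It assumes that each rate is normalised by the number of items in the corresponding true class, as is usual in ROC analysis, so that the true positive and false negative rates sum to one, as do the true negative and false positive rates. The function name `confusion_rates` and the toy data are hypothetical and chosen only for this example.

```python
from collections import Counter


def confusion_rates(true_labels, predicted_labels):
    """Return (tp, fp, fn, tn) as rates conditional on the true class.

    Assumption: labels are booleans, with True meaning "positive".
    """
    # Count each (true class c, predicted class d) pair.
    counts = Counter(zip(true_labels, predicted_labels))
    positives = counts[(True, True)] + counts[(True, False)]
    negatives = counts[(False, True)] + counts[(False, False)]
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one item of each true class")
    tp = counts[(True, True)] / positives    # c = true,  d = true
    fn = counts[(True, False)] / positives   # c = true,  d = false
    fp = counts[(False, True)] / negatives   # c = false, d = true
    tn = counts[(False, False)] / negatives  # c = false, d = false
    return tp, fp, fn, tn


# Hypothetical toy data: four positive and four negative items (equal priors).
c = [True, True, True, True, False, False, False, False]
d = [True, True, True, False, True, False, False, False]

tp, fp, fn, tn = confusion_rates(c, d)
print(tp, fp, fn, tn)  # 0.75 0.25 0.25 0.75
```

Because of the two sum constraints, only two of the four numbers are independent; a common choice is the pair (fp, tp), which is what a ROC plot uses as its axes.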

Mixing classifiers