Decision trees

Seminar Scientific Computing - Summer Term 2011

Decision tree is a technique which is often used for the classification of data. For simplification we will only consider binary trees. Suppose a sample of objects which should be classified, is given. Each node of the decision tree contains one feature which classifies all incoming objects into two classes: objects for which the value of the feature is TRUE and objects for which the value of the feature is FALSE. The number of leaves of the decision tree describe its complexity. Each leaf

contains classified objects. Some of the objects may be misclassified. The sum of all misclassified objects through all leaves is the error of the decision tree. The smaller the error the better decision tree. Trees with a large number of leaves are expensive and that is the reason why the smaller number of leaves the better. The problem is to construct a decision tree that contains as few misclassified objects as possible and has the lowest complexity. One should find a balance between complexity and accuracy. In general, the higher the complexity the better the accuracy is. The full examination of all decision trees is an NP problem, thus good heuristic algorithms are needed. The area of application of decision trees is huge. Examples are classification of diseases in medicine, classification of documents and images in science and industry etc.

Contact