Conventional Evaluation of Classification Models Unreliable

There is rapidly growing interest in 'intelligent' computer-based methods that use various classes of measurement signals, for instance from different patient samples, to build a model for classifying new observations.

"Especially in applications in which faulty classification decisions can lead to catastrophic consequences, such as choosing the wrong form of therapy for treating cancer, it is extremely important to be able to make a reliable estimate of the performance of the classification model," explains Mats Gustafsson who co-directed the new study.

To evaluate the performance of a classification model, one normally tests it on a set of examples that were never involved in the model's design. However, as the researchers note, tens of thousands of test examples are seldom available for this kind of evaluation. In biomedicine, for example, the patient samples needed are often expensive and difficult to collect, especially when studying a rare disease. Many methods have been proposed to work around this shortage, and since the 1980s two have dominated research: cross-validation and resampling/bootstrapping.
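In code, the two dominant approaches look roughly as follows. This is a minimal sketch using scikit-learn on a synthetic dataset; the dataset size, classifier, and number of bootstrap rounds are illustrative assumptions, not details taken from the study.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset standing in for, e.g., scarce patient samples.
X, y = make_classification(n_samples=40, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# Cross-validation: split the data into k folds, train on k-1 folds,
# test on the held-out fold, and average the k accuracy estimates.
cv_scores = cross_val_score(clf, X, y, cv=5)
print(f"5-fold CV accuracy: {cv_scores.mean():.2f} +/- {cv_scores.std():.2f}")

# Bootstrap: resample the data with replacement, train on the resample,
# and test on the examples that were left out ("out-of-bag").
rng = np.random.RandomState(0)
boot_scores = []
for _ in range(200):
    idx = rng.choice(len(y), size=len(y), replace=True)
    oob = np.setdiff1d(np.arange(len(y)), idx)
    if len(oob) == 0 or len(np.unique(y[idx])) < 2:
        continue  # skip degenerate resamples
    clf.fit(X[idx], y[idx])
    boot_scores.append(clf.score(X[oob], y[oob]))
print(f"Bootstrap (out-of-bag) accuracy: {np.mean(boot_scores):.2f}")
```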

In the new study, the researchers use both theory and computer simulations to show that this methodology is worthless in practice when the total number of examples is small relative to the natural variation among observations. What counts as a small number depends in turn on the problem being studied; in other words, one cannot tell from the data alone whether the number of examples is sufficient.
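To see concretely what this instability looks like, consider a toy simulation in the spirit of the study's argument; the data-generating distribution, classifier, and sample sizes below are my own illustrative assumptions. Many equally valid small datasets drawn from the same population give wildly different cross-validation estimates of the very same underlying accuracy.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)

def draw_dataset(n):
    """Draw n examples: balanced labels, overlapping Gaussian classes."""
    y = np.tile([0, 1], n // 2)
    X = rng.normal(loc=y[:, None] * 1.0, scale=1.5, size=(n, 5))
    return X, y

# Repeatedly sample a dataset from the SAME population and see how much
# the 5-fold cross-validation accuracy estimate moves around.
for n in (20, 2000):
    estimates = []
    for _ in range(100):
        X, y = draw_dataset(n)
        score = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
        estimates.append(score.mean())
    print(f"n={n:5d}: CV estimates spread from "
          f"{min(estimates):.2f} to {max(estimates):.2f}")

# Typical outcome: the n=20 estimates span a far wider range than the
# n=2000 estimates, even though the underlying accuracy is identical.
```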

"Our main conclusion is that this methodology cannot be depended on at all, and that it therefore needs to be immediately replaced by Bayesian methods, for example, which can deliver reliable measures of the uncertainty that exists," says Gustafsson.; Source: Uppsala University