Data classification using genetic programming.
Genetic programming (GP) is an evolutionary algorithm from the field of artificial intelligence that evolves a population of trees representing programs, which are then used to solve problems. This dissertation investigates the use of GP for data classification. In machine learning, data classification is the process of allocating a class label to an instance of data, and a classifier is created to perform these allocations. Several studies have investigated the use of GP to solve data classification problems and have shown that GP is able to create classifiers with high classification accuracies. However, certain aspects have not previously been investigated, and this dissertation examines five of them.
The first area was how discretisation could be incorporated into a GP algorithm. An adaptive discretisation algorithm was proposed, and it outperformed certain existing methods. The second was a comparison of GP representations for binary data classification. The findings indicated that, of the representations examined (arithmetic trees, decision trees, and logical trees), decision trees performed best. The third was the use of the encapsulation genetic operator and its effect on data classification. The findings revealed that incorporating encapsulation improved both training and test results. The fourth was an analysis of several hybridisations of a GP algorithm with a genetic algorithm in order to evolve a population of ensembles. Four methods were proposed, and these outperformed certain existing GP and ensemble methods. The fifth was an ensemble construction method for classification in which GP evolved a single ensemble. The proposed method improved training and test accuracy when compared to the standard GP algorithm.
The methods proposed in this dissertation were tested on publicly available data sets, and the results were subjected to statistical tests to determine the effectiveness of the proposed approaches.
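As a purely illustrative aid, the sketch below shows how an arithmetic tree, one of the three representations compared for binary classification, could act as a classifier. It is not the dissertation's implementation. The function set, the nested-tuple encoding, the terminal names such as "x0", the threshold at zero, and the toy data are all assumptions made for this example, and the evolutionary search itself is omitted.

```python
import operator

# Function set for arithmetic trees (an assumption; the source does not list one).
FUNCTIONS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def evaluate(tree, instance):
    """Recursively evaluate an arithmetic tree on one instance (a list of floats)."""
    if isinstance(tree, tuple):
        # Internal node: (operator symbol, left subtree, right subtree).
        op, left, right = tree
        return FUNCTIONS[op](evaluate(left, instance), evaluate(right, instance))
    if isinstance(tree, str):
        # Terminal such as "x0": the attribute at index 0 of the instance.
        return instance[int(tree[1:])]
    # Any other leaf is treated as a numeric constant.
    return tree

def classify(tree, instance):
    """Map the numeric output to a binary label: 1 if non-negative, else 0."""
    return 1 if evaluate(tree, instance) >= 0 else 0

def accuracy(tree, data, labels):
    """Fraction of instances whose predicted label matches the true label."""
    correct = sum(classify(tree, x) == y for x, y in zip(data, labels))
    return correct / len(labels)

# Hypothetical individual encoding (x0 + x1) - 1.0, with made-up toy data.
tree = ("-", ("+", "x0", "x1"), 1.0)
data = [[0.2, 0.3], [0.9, 0.8], [0.1, 0.1], [0.7, 0.6]]
labels = [0, 1, 0, 1]

print(accuracy(tree, data, labels))
```

In a GP run, an accuracy measure of this kind would typically serve as the fitness used to select trees for crossover and mutation. Decision trees and logical trees would need different node types and a different mapping from tree output to class label.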