A Comparison of Four Wrapper Feature Selection Techniques Using Neural Networks, Decision Trees, Support Vector Machines, and a Bayesian Classifier

Samuel Oporto Diaz
Instituto Tecnológico de Educación Superior de Monterrey
Monterrey, Mexico
soporto@wiphala.net

Iván Aquino Morales
Universidad Nacional de Ingeniería, Dept. Ingeniería de Sistemas
Lima, Perú
ivaqmo@yahoo.es

Jacqueline K. Chávez Cuzcano
Universidad Nacional de Ingeniería, Dept. Ingeniería de Sistemas
Lima, Perú
karinajcc@yahoo.com

César O. Pérez Pinche
Universidad Nacional de Ingeniería, Dept. Ingeniería de Sistemas
Lima, Perú
cesaruni24@yahoo.com.mx

Resumen

La selección de características consiste en la búsqueda del subconjunto óptimo de características que disminuya el error de un algoritmo de aprendizaje. Existen tres tipos de algoritmos de selección de características: los de filtro, los envolventes y los híbridos. Los de filtro escogen el subconjunto de características independientemente del algoritmo de aprendizaje, los envolventes usan el algoritmo de aprendizaje para escoger el mejor subconjunto de características y los híbridos son una combinación de los dos anteriores. En este trabajo realizamos una comparación de cuatro algoritmos de selección de características envolventes para clasificación, que difieren en la estrategia de búsqueda: Búsqueda Aleatoria Optimizada, Búsqueda Mejor Primero, Búsqueda Genética y Búsqueda Aleatoria. Para medir la calidad de un subconjunto usamos el error del clasificador. Los clasificadores usados son: Red Neuronal de Retropropagación, Árbol de Decisión C4.5, Máquina de Vector de Soporte y el clasificador bayesiano NaiveBayes. En los experimentos usamos tres bases de datos extraídas del Repositorio UCI. En estas pruebas se observa que la Búsqueda Aleatoria Optimizada produce, en promedio, el menor error de clasificación y que la Búsqueda Genética elimina la mayor cantidad de características.

Palabras clave: selección de características, búsqueda aleatoria del subconjunto de características, minería de datos

Abstract

Feature selection is the search for the subset of features that minimizes the error rate of a learning algorithm. There are three types of feature selection algorithms: filter, wrapper, and hybrid. Filter algorithms choose the feature subset independently of the learning algorithm, wrapper algorithms use the learning algorithm itself to choose the best subset, and hybrid algorithms combine the two schemes. In this paper we compare four wrapper feature selection algorithms for classification that differ in their search strategy: Optimized Random Search, Best-First Search, Genetic Search, and Random Search. The quality of a subset is measured by the classification error rate. The classifiers used are a backpropagation neural network, the C4.5 decision tree, a support vector machine, and the NaiveBayes Bayesian classifier. The experiments use three datasets taken from the UCI Repository. In these experiments, Optimized Random Search yields, on average, the lowest classification error rate, and Genetic Search removes the largest number of features.

Keywords: feature selection, random search over feature subsets, data mining
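
As a rough illustration of the wrapper scheme described in the abstract, the following minimal Python sketch scores randomly drawn feature subsets by the error of a classifier trained on them and keeps the best one. Everything in it is an assumption made for illustration and is not taken from the paper: the synthetic data, the 1-nearest-neighbour classifier (a stand-in for the four classifiers actually compared), the single holdout split, and the names make_data, error_rate and random_wrapper_search. It corresponds only loosely to the plain Random Search, not to the optimized variant.

```python
import random

def make_data(n, rng):
    """Synthetic data (assumed): features 0-1 carry the class signal, 2-5 are noise."""
    rows = []
    for _ in range(n):
        label = rng.randint(0, 1)
        informative = [rng.gauss(2.0 * label, 1.0) for _ in range(2)]
        noise = [rng.gauss(0.0, 3.0) for _ in range(4)]
        rows.append((informative + noise, label))
    return rows

def error_rate(train, test, subset):
    """Holdout error of a 1-nearest-neighbour classifier restricted to `subset`."""
    def dist(a, b):
        return sum((a[i] - b[i]) ** 2 for i in subset)
    wrong = 0
    for x, y in test:
        # Predict the label of the closest training point.
        _, predicted = min(((dist(x, xt), yt) for xt, yt in train),
                           key=lambda pair: pair[0])
        wrong += predicted != y
    return wrong / len(test)

def random_wrapper_search(train, test, n_features, n_iter, rng):
    """Draw random subsets and keep the one with the lowest classifier error."""
    best_subset = tuple(range(n_features))  # start from the full feature set
    best_error = error_rate(train, test, best_subset)
    for _ in range(n_iter):
        subset = tuple(i for i in range(n_features) if rng.random() < 0.5)
        if not subset:
            continue  # an empty subset cannot be evaluated
        err = error_rate(train, test, subset)
        # Prefer lower error; break ties in favour of fewer features.
        if (err, len(subset)) < (best_error, len(best_subset)):
            best_subset, best_error = subset, err
    return best_subset, best_error

rng = random.Random(0)
data = make_data(200, rng)
train, valid = data[:120], data[120:]
baseline = error_rate(train, valid, tuple(range(6)))
subset, err = random_wrapper_search(train, valid, 6, 40, rng)
print("all features:", baseline, "selected:", subset, err)
```

Because the same holdout set both guides the search and reports the error, the selected subset's error is optimistic; a fair comparison would estimate it on data not used during the search.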