Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?

Comparación entre SVM y regresión logística: ¿cuál es más recomendable para discriminar?

DIEGO ALEJANDRO SALAZAR1, JORGE IVÁN VÉLEZ2, JUAN CARLOS SALAZAR3

1Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. MSc student. Email: diasalazarbl@unal.edu.co
2Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Researcher. Email: jorgeivanvelez@gmail.com
3Universidad Nacional de Colombia, Escuela de Estadística, Medellín, Colombia. Universidad Nacional de Colombia, Grupo de Investigación en Estadística, Medellín, Colombia. Associate professor. Email: jcsalaza@unal.edu.co

Abstract

The classification of individuals is a common problem in applied statistics. If X is a data set corresponding to a sample from a specific population in which observations belong to g different categories, the goal of classification methods is to determine to which of these categories a new observation belongs. When g=2, logistic regression (LR) is one of the most widely used classification methods. More recently, Support Vector Machines (SVM) have become an important alternative. In this paper, the fundamentals of LR and SVM are described, and the question of which one discriminates better is addressed using statistical simulation. An application with real data from a microarray experiment is presented as an illustration.

Key words: Classification, Genetics, Logistic regression, Simulation, Support vector machines.
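
The simulation design used in the paper is not given in this abstract, so the following is only a minimal sketch of how such a comparison could be set up, not the authors' procedure. It assumes two bivariate normal classes with identity covariance, a linear decision rule for both methods, and plain (sub)gradient fitting in place of a library solver. The function names, sample sizes, step sizes, and the seed are all illustrative choices.

```python
import math
import random


def simulate(n_per_class, rng, mean=1.0):
    """Draw two bivariate normal classes (identity covariance),
    centred at (-mean, -mean) for label 0 and (+mean, +mean) for label 1."""
    data = []
    for label, mu in ((0, -mean), (1, mean)):
        for _ in range(n_per_class):
            data.append(((rng.gauss(mu, 1.0), rng.gauss(mu, 1.0)), label))
    return data


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(data, epochs=300, step=0.5):
    """Fit LR by batch gradient ascent on the average log-likelihood."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g1 = g2 = gb = 0.0
        for (x1, x2), y in data:
            r = y - sigmoid(w1 * x1 + w2 * x2 + b)  # residual y - p
            g1 += r * x1
            g2 += r * x2
            gb += r
        w1 += step * g1 / n
        w2 += step * g2 / n
        b += step * gb / n
    return w1, w2, b


def fit_linear_svm(data, epochs=300, step=0.05, lam=0.01):
    """Fit a linear soft-margin SVM by batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - s * (w.x + b))), with s in {-1, +1}."""
    w1 = w2 = b = 0.0
    n = len(data)
    for _ in range(epochs):
        g1, g2, gb = lam * w1, lam * w2, 0.0
        for (x1, x2), y in data:
            s = 1.0 if y == 1 else -1.0  # recode labels 0/1 to -1/+1
            if s * (w1 * x1 + w2 * x2 + b) < 1.0:  # margin violated
                g1 -= s * x1 / n
                g2 -= s * x2 / n
                gb -= s / n
        w1 -= step * g1
        w2 -= step * g2
        b -= step * gb
    return w1, w2, b


def error_rate(model, data):
    """Misclassification rate of the linear rule: predict 1 if w.x + b > 0."""
    w1, w2, b = model
    wrong = sum((w1 * x1 + w2 * x2 + b > 0) != (y == 1) for (x1, x2), y in data)
    return wrong / len(data)


rng = random.Random(2012)      # fixed seed so the run is reproducible
train = simulate(200, rng)     # 200 training observations per class
test = simulate(1000, rng)     # 1000 test observations per class
lr_error = error_rate(fit_logistic(train), test)
svm_error = error_rate(fit_linear_svm(train), test)
print(f"LR  test error: {lr_error:.3f}")
print(f"SVM test error: {svm_error:.3f}")
```

A single run like this says little on its own. Repeating it over many seeds, sample sizes, and class distributions would give the kind of error-rate comparison the abstract refers to.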

Resumen

La clasificación de individuos es un problema muy común en el trabajo estadístico aplicado. Si X es un conjunto de datos de una población en la que sus elementos pertenecen a g clases, el objetivo de los métodos de clasificación es determinar a cuál de ellas pertenecerá una nueva observación. Cuando g=2, uno de los métodos más utilizados es la regresión logística. Recientemente, las Máquinas de Soporte Vectorial se han convertido en una alternativa importante. En este trabajo se exponen los principios básicos de ambos métodos y se da respuesta a la pregunta de cuál es más recomendable para discriminar, vía simulación. Finalmente, se presenta una aplicación con datos provenientes de un experimento con microarreglos.

Palabras clave: clasificación, genética, máquinas de soporte vectorial, regresión logística, simulación.

Full text available in PDF

[Received September 2011. Accepted February 2012]

This article can be cited in LaTeX using the following BibTeX reference:

@ARTICLE{RCEv35n2a03,
    AUTHOR  = {Salazar, Diego Alejandro and V\'elez, Jorge Iv\'an and Salazar, Juan Carlos},
    TITLE   = {{Comparison between SVM and Logistic Regression: Which One is Better to Discriminate?}},
    JOURNAL = {Revista Colombiana de Estadística},
    YEAR    = {2012},
    volume  = {35},
    number  = {2},
    pages   = {223-237}
}