Predicting qualitative phenotypes from microarray data - the Eadgene pig data set

Robert-Granie, Christele, Le Cao, Kim-Anh and Sancristobal, Magali (2009). Predicting qualitative phenotypes from microarray data - the Eadgene pig data set. In: Dirk-Jan de Koning, BMC Proceedings. EADGENE and SABRE Post-analyses Workshop, Lelystad, The Netherlands. 12–14 November, 2008. doi:10.1186/1753-6561-3-S4-S13

Author Robert-Granie, Christele
Le Cao, Kim-Anh
Sancristobal, Magali
Title of paper Predicting qualitative phenotypes from microarray data - the Eadgene pig data set
Conference name EADGENE and SABRE Post-analyses Workshop
Conference location Lelystad, The Netherlands
Conference dates 12–14 November, 2008
Proceedings title BMC Proceedings
Place of Publication United Kingdom
Publisher BioMed Central
Publication Year 2009
Sub-type Fully published paper
DOI 10.1186/1753-6561-3-S4-S13
Open Access Status DOI
ISSN 1753-6561
Editor Dirk-Jan de Koning
Volume 3
Issue Supp. 4
Language eng
Formatted Abstract/Summary

The aim of this work was to study the performance of two predictive statistical tools on a data set that was given to all participants of the Eadgene-SABRE Post Analyses Working Group, namely the Pig data set of Hazard et al. (2008). The data consisted of the expression levels of 3686 genes measured on 24 animals, partitioned into 2 genotypes and 2 treatments. The objective was to find, within the whole set of genes, biomarkers that characterize the genotypes and the treatments.


We first considered the Random Forest approach, which enables the selection of predictive variables. We then compared classical Partial Least Squares regression (PLS) with a novel approach called sparse PLS, a variant of PLS that incorporates a lasso penalization and so allows for the selection of a subset of variables.
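
To make the variable-selection idea concrete, the sketch below shows how a lasso-style soft threshold turns a dense first PLS loading vector into a sparse one. It is a minimal illustration, not the implementation used in the paper: it handles only a single two-class response and a single component, the data are synthetic (24 samples, 200 simulated genes, 5 of them made informative by construction), and the function names and the penalty value (half the largest absolute covariance) are assumptions chosen for the example.

```python
import random


def covariances(X, y):
    """Unnormalised covariance X^T y between each centred column of X and y."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    return [sum((X[i][j] - means[j]) * y[i] for i in range(n)) for j in range(p)]


def soft_threshold(value, penalty):
    """Lasso-style shrinkage: pull value toward zero, clipping at exactly zero."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def first_loading(cov, penalty):
    """First loading vector: penalty=0 gives the ordinary PLS direction,
    penalty>0 zeroes out weakly covarying variables (the sparse PLS idea)."""
    shrunk = [soft_threshold(c, penalty) for c in cov]
    norm = sum(w * w for w in shrunk) ** 0.5
    return [w / norm for w in shrunk] if norm > 0 else shrunk


# Synthetic stand-in for a "many genes, few animals" data set (not the pig data).
rng = random.Random(0)
n_animals, n_genes, n_informative = 24, 200, 5
y = [1.0] * 12 + [-1.0] * 12  # two balanced classes, already centred
X = [
    [rng.gauss(0.0, 1.0) + (3.0 * y[i] if j < n_informative else 0.0)
     for j in range(n_genes)]
    for i in range(n_animals)
]

cov = covariances(X, y)
dense = first_loading(cov, penalty=0.0)
# Hypothetical penalty: half of the largest absolute covariance.
sparse = first_loading(cov, penalty=0.5 * max(abs(c) for c in cov))
selected = [j for j, w in enumerate(sparse) if w != 0.0]
print("non-zero loadings (PLS):", sum(1 for w in dense if w != 0.0))
print("selected variables (sparse PLS sketch):", selected)
```

In practice the penalty (or, equivalently, the number of variables to keep) would be tuned, for example by cross-validation, rather than fixed in advance as it is here.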


All methods performed well on this data set. Sparse PLS outperformed PLS in terms of prediction performance and improved the interpretability of the results.


We recommend the use of machine learning methods such as Random Forest and multivariate methods such as sparse PLS for prediction purposes. Both approaches are well suited to transcriptomic data, where the number of features is much greater than the number of individuals.
Subjects 06 Biological Sciences
0104 Statistics
Q-Index Code EX
Q-Index Status Provisional Code
Additional Notes Article number: S13

Document type: Conference Paper
Collection: Institute for Molecular Bioscience - Publications
Created: Thu, 16 Sep 2010, 14:31:58 EST by Laura McTaggart on behalf of Institute for Molecular Bioscience