Concept selection for phenotypes and diseases using learn to rank

Collier, Nigel, Oellrich, Anika and Groza, Tudor (2015) Concept selection for phenotypes and diseases using learn to rank. Journal of Biomedical Semantics, 6(24). doi:10.1186/s13326-015-0019-z

Author Collier, Nigel
Oellrich, Anika
Groza, Tudor
Title Concept selection for phenotypes and diseases using learn to rank
Journal name Journal of Biomedical Semantics
ISSN 2041-1480
Publication date 2015-06-01
Year available 2015
Sub-type Article (original research)
DOI 10.1186/s13326-015-0019-z
Open Access Status DOI
Volume 6
Issue 24
Total pages 12
Place of publication London, United Kingdom
Publisher BioMed Central
Collection year 2016
Language eng
Formatted abstract
Background Phenotypes form the basis for determining the existence of a disease against the given evidence. Much of this evidence though remains locked away in text – scientific articles, clinical trial reports and electronic patient records (EPR) – where authors use the full expressivity of human language to report their observations.

Results In this paper we exploit a combination of off-the-shelf tools for extracting a machine-understandable representation of phenotypes and other related concepts that concern the diagnosis and treatment of diseases. These are tested against a gold standard EPR collection that has been annotated with Unified Medical Language System (UMLS) concept identifiers: the ShARE/CLEF 2013 corpus for disorder detection. We evaluate four pipelines as stand-alone systems and then attempt to optimise semantic-type-based performance using several learn-to-rank (LTR) approaches: three pairwise and one listwise. We observed that Apache cTAKES tended to outperform the other stand-alone systems overall, with strong recall (R = 0.57), but its precision was low (P = 0.09), leading to a low-to-moderate F1 measure (F1 = 0.16). Moreover, there is substantial variation in system performance across semantic types for disorders. For example, the concept Findings (T033) seemed to be very challenging for all systems. Combining systems within LTR improved F1 substantially (F1 = 0.24), particularly for Disease or syndrome (T047) and Anatomical abnormality (T190). Whilst recall is improved markedly, precision remains a challenge (P = 0.15, R = 0.59).
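The F1 figures quoted in the abstract can be checked against the reported precision and recall. The sketch below assumes the abstract's "F1 measure" is the standard balanced F-measure (the harmonic mean of precision and recall); the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        # Avoid division by zero when both inputs are zero.
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs as reported in the abstract.
ctakes_f1 = f1_score(0.09, 0.57)  # best stand-alone system (Apache cTAKES)
ltr_f1 = f1_score(0.15, 0.59)     # systems combined within learn-to-rank

print(round(ctakes_f1, 2), round(ltr_f1, 2))  # prints: 0.16 0.24
```

Under that assumption, both reported F1 values (0.16 and 0.24) agree with the reported precision and recall to two decimal places.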
Q-Index Code C1
Q-Index Status Confirmed Code
Institutional Status UQ

Document type: Journal Article
Sub-type: Article (original research)
Collections: Official 2016 Collection
School of Information Technology and Electrical Engineering Publications
Citation counts: Cited 3 times in Thomson Reuters Web of Science
Cited 1 time in Scopus
Created: Tue, 25 Aug 2015, 00:45:56 EST by System User on behalf of Scholarly Communication and Digitisation Service