Assessing the impact of case sensitivity and term information gain on biomedical concept recognition

Groza, Tudor and Verspoor, Karin (2015) Assessing the impact of case sensitivity and term information gain on biomedical concept recognition. PLoS ONE, 10(3). doi:10.1371/journal.pone.0119091

Author Groza, Tudor
Verspoor, Karin
Title Assessing the impact of case sensitivity and term information gain on biomedical concept recognition
Journal name PLoS ONE
ISSN 1932-6203
Publication date 2015-03-19
Sub-type Article (original research)
DOI 10.1371/journal.pone.0119091
Open Access Status DOI
Volume 10
Issue 3
Total pages 22
Place of publication San Francisco, CA, United States
Publisher Public Library of Science
Collection year 2016
Language eng
Formatted abstract
Concept recognition (CR) is a foundational task in the biomedical domain. It supports the important process of transforming unstructured resources into structured knowledge. To date, several CR approaches have been proposed, most of which focus on a particular set of biomedical ontologies. Their underlying mechanisms vary from shallow natural language processing and dictionary lookup to specialized machine learning modules. However, no prior approach considers the influence of the case sensitivity characteristics and the term distribution of the underlying ontology on the CR process. This article proposes a framework that models the CR process as an information retrieval task in which both case sensitivity and the information gain associated with tokens in lexical representations (e.g., term labels, synonyms) are central components of a strategy for generating term variants. The case sensitivity of a given ontology is assessed based on the distribution of so-called case sensitive tokens in its terms, while information gain is modelled using a combination of divergence from randomness and mutual information. An extensive evaluation has been carried out using the CRAFT corpus. Experimental results show that case sensitivity awareness leads to an increase of up to 0.07 F1 against a non-case sensitive baseline on the Protein Ontology and GO Cellular Component. Similarly, the use of information gain leads to an increase of up to 0.06 F1 against a standard baseline in the case of GO Biological Process and Molecular Function and GO Cellular Component. Overall, subject to the underlying token distribution, these methods lead to valid complementary strategies for augmenting term label sets to improve concept recognition.
Q-Index Code C1
Q-Index Status Provisional Code
Institutional Status UQ

Document type: Journal Article
Collections: Official 2016 Collection
School of Information Technology and Electrical Engineering Publications
Citation counts: Thomson Reuters Web of Science: cited 0 times
Scopus: cited 2 times
Created: Tue, 07 Apr 2015, 00:23:54 EST by System User on behalf of School of Information Technol and Elec Engineering