Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses

Miotto, Olivo, Tan, Tin Wee and Brusic, Vladimir (2008). Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses. In: Shoba Ranganathan, Michael Gribskov and Tin Wee Tan, Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007): Proceedings. InCoB2007: Sixth International Conference on Bioinformatics, Clear Water Bay, Kowloon, Hong Kong; Hanoi, Viet Nam; Nansha IT Park, Pearl River Delta, China, (S7.1-S7.14). 27-31 August, 2007. doi:10.1186/1471-2105-9-S1-S7


Author Miotto, Olivo; Tan, Tin Wee; Brusic, Vladimir
Title of paper Rule-based knowledge aggregation for large-scale protein sequence analysis of influenza A viruses
Conference name InCoB2007: Sixth International Conference on Bioinformatics
Conference location Clear Water Bay, Kowloon, Hong Kong; Hanoi, Viet Nam; Nansha IT Park, Pearl River Delta, China
Conference dates 27-31 August, 2007
Convener Asia Pacific Bioinformatics Network (APBioNet)
Proceedings title Asia Pacific Bioinformatics Network (APBioNet) Sixth International Conference on Bioinformatics (InCoB2007): Proceedings
Journal name BMC Bioinformatics
Place of Publication London, United Kingdom
Publisher BioMed Central
Publication Year 2008
Year available 2007
Sub-type Fully published paper
DOI 10.1186/1471-2105-9-S1-S7
Open Access Status DOI
ISSN 1471-2105
Editor Shoba Ranganathan; Michael Gribskov; Tin Wee Tan
Volume 9
Issue Supp. 1
Start page S7.1
End page S7.14
Total pages 14
Language eng
Formatted Abstract/Summary
Background: The explosive growth of biological data provides opportunities for new statistical and comparative analyses of large information sets, such as alignments comprising tens of thousands of sequences. In such studies, sequence annotations frequently play an essential role, and reliable results depend on metadata quality. However, the semantic heterogeneity and annotation inconsistencies in biological databases greatly increase the complexity of aggregating and cleaning metadata. Manual curation of datasets, traditionally favoured by life scientists, is impractical for studies involving thousands of records. In this study, we investigate quality issues that affect major public databases, and quantify the effectiveness of an automated metadata extraction approach that combines structural and semantic rules. We applied this approach to more than 90,000 influenza A records, to annotate sequences with protein name, virus subtype, isolate, host, geographic origin, and year of isolation.

Results: Over 40,000 annotated influenza A protein sequences were collected by combining information from more than 90,000 documents from NCBI public databases. Metadata values were automatically extracted, aggregated and reconciled from several document fields by applying user-defined structural rules. For each property, values were recovered from ≥88.8% of records, with accuracy exceeding 96% in most cases. Because of semantic heterogeneity, each property required up to six different structural rules to be combined. Significant quality differences between databases were found: GenBank documents yield values more reliably than documents extracted from GenPept. Using a simple set of semantic rules and a reasoner, we reconstructed relationships between sequences from the same isolate, thus identifying 7,640 isolates. Validation of isolate metadata against a simple ontology highlighted more than 400 inconsistencies, leading to over 3,000 property value corrections.
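
The idea of combining several structural rules per property can be illustrated with a minimal sketch. This is not the authors' implementation: the field names (subtype, organism, definition, collection_date), the regular expressions, and the extract helper are all hypothetical, chosen only to show how an ordered list of rules might recover one value from whichever document field happens to carry it. The sample record follows the standard influenza strain naming pattern, for example A/Hong Kong/1/1968(H3N2).

```python
import re
from typing import Optional, Pattern, Sequence, Tuple

# A rule is (field name, pattern with exactly one capture group).
# All field names and patterns below are hypothetical illustrations.
Rule = Tuple[str, Pattern[str]]

SUBTYPE_RULES: Sequence[Rule] = [
    ("subtype", re.compile(r"^(H\d+N\d+)$")),        # dedicated field, if present
    ("organism", re.compile(r"\((H\d+N\d+)\)")),     # subtype embedded in strain name
    ("definition", re.compile(r"\((H\d+N\d+)\)")),   # last resort: free-text definition
]

YEAR_RULES: Sequence[Rule] = [
    ("collection_date", re.compile(r"\b((?:19|20)\d{2})\b")),
    ("organism", re.compile(r"/((?:19|20)\d{2})\s*\(")),  # year before the subtype
]


def extract(record: dict, rules: Sequence[Rule]) -> Optional[str]:
    """Return the value from the first rule that matches, or None."""
    for field, pattern in rules:
        value = record.get(field)
        if not value:
            continue  # field missing or empty: fall through to the next rule
        match = pattern.search(value)
        if match:
            return match.group(1)
    return None


# Hypothetical record with no dedicated subtype or date field.
record = {
    "organism": "Influenza A virus (A/Hong Kong/1/1968(H3N2))",
    "definition": "hemagglutinin [Influenza A virus (A/Hong Kong/1/1968(H3N2))]",
}

subtype = extract(record, SUBTYPE_RULES)
year = extract(record, YEAR_RULES)
print(subtype, year)
```

Ordering the rules from the most specific field to the least specific one is a design choice of this sketch; a reconciliation step that compares values from all matching rules, as the abstract describes, would need every match rather than only the first.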

Conclusion: To overcome the quality issues inherent in public databases, automated knowledge aggregation with embedded intelligence is needed for large-scale analyses. Our results show that user-controlled intuitive approaches, based on a combination of simple rules, can reliably automate various curation tasks, reducing the need for manual corrections to approximately 5% of the records. Emerging semantic technologies possess desirable features to support today's knowledge aggregation tasks, with the potential to bring immediate benefits to this field.
Q-Index Code C1
Q-Index Status Provisional Code
Institutional Status UQ

Document type: Conference Paper
Collection: School of Agriculture and Food Sciences
 
Citation counts: Web of Science (Thomson Reuters): cited 7 times; Scopus: cited 6 times
Created: Wed, 27 Nov 2013, 08:51:34 EST by System User on behalf of School of Land, Crop and Food Sciences