CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes: supplemental material

Parks, Donovan H., Imelfort, Michael, Skennerton, Connor T., Hugenholtz, Philip and Tyson, Gene W. (2015): CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes: supplemental material. Cold Spring Harbor Laboratory Press. Dataset. doi:10.14264/uql.2016.841

Attached Files
Name                           Description              MIME type                    Size      Downloads
Supplemental_information.docx  Full text (open access)  application/vnd.openxmlf...  2.21MB    0
Supp_Material.zip              Full text (open access)  application/zip              293.14KB  0
Supp_Table_S16.xls             Full text (open access)  application/vnd.ms-excel     752KB     0
Supp_Table_S19.xls             Full text (open access)  application/vnd.ms-excel     736.5KB   0
Supp_Table_S20.xls             Full text (open access)  application/vnd.ms-excel     95KB      0
Supp_Table_S21.xls             Full text (open access)  application/vnd.ms-excel     66KB      0
Supp_Table_S22.xls             Full text (open access)  application/vnd.ms-excel     41KB      0
 
Project name Toward a complete view of life on earth via single cell genomics
Project description
ARC DP120103498 - Genome sequencing has revolutionised biology, but for most microorganisms this revolution has not arrived because the majority cannot be grown in the laboratory. This project will address this grand challenge by targeted sequencing of single cells from the environment that will fill in many major gaps in the microbial tree of life.

ARC DP1093175 - Microorganisms underpin life on Earth, but our understanding of their diversity and activity is limited by our inability to grow most of them in the laboratory. Recently, new techniques have emerged that allow access to the genetic information of all microorganisms by directly sequencing DNA and RNA from the environment. In this research we will further develop these frontier technologies, promoting this new area of science in Australia. We will apply these techniques to microbial communities involved in wastewater treatment in order to understand the interactions between microorganisms and the viruses that infect them. Understanding this interaction will have important implications for optimising these treatment processes.
Contact name Tyson, Gene W.
Contact email g.tyson@uq.edu.au
Creator name (role) Parks, Donovan H. (Investigator)
Imelfort, Michael (Investigator)
Skennerton, Connor T. (Investigator)
Hugenholtz, Philip (Investigator)
Tyson, Gene W. (Investigator)
Dataset name CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes: supplemental material
Dataset description
Supplementary Results
Refinement for Gene Loss and Duplication
Estimates under Opal Stop Codon Recodings

Supplementary Methods
Identification of Trusted Reference Genomes
Refining Marker Sets for Lineage-specific Gene Loss and Duplication
Determination of Coding Table
Systematic Bias of Completeness and Contamination Estimates

Supplemental Figure S1. Distribution of the 104 bacterial and 281 gammaproteobacterial marker genes around the E. coli K12 genome.

Supplemental Figure S2. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the random contig model.

Supplemental Figure S3. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the inverse length model.

Supplemental Figure S4. Maximum-likelihood genome tree inferred from 5656 reference genomes.

Supplemental Figure S5. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the random fragment model using a window size of 20 kbp.

Supplemental Figure S6. Error in completeness and contamination estimates on simulated genomes with varying levels of completeness and contamination generated under the inverse length model.

Supplemental Figure S7. Error in completeness and contamination estimates on simulated genomes from different phyla.

Supplemental Figure S8. Bias in completeness and contamination estimates when modelled as a binomial distribution.

Supplemental Figure S9. GC-distribution plots of the HMP Capnocytophaga sp. oral taxon 329 genome.

Supplemental Figure S10. Phylogenetic placement of the two genomes (Cluster 0 and Cluster 1) identified within the HMP Capnocytophaga sp. oral taxon 329 genome.

Supplemental Figure S11. Completeness estimates for 90 putative population genomes recovered from an acetate-amended aquifer.

Supplemental Figure S12. Contamination estimates for 90 putative population genomes recovered from an acetate-amended aquifer.

Supplemental Figure S13. Identification of the 213 marker genes within the Meyerdierks et al. (2010) ANME-1 genome.

Supplemental Figure S14. Refining a marker set for lineage-specific gene loss and duplication.

Supplemental Tables
Supplemental Table S1. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using different universal- and domain-specific marker gene sets.

Supplemental Table S2. Number of marker genes and marker sets for taxonomic groups with ≥ 20 reference genomes.

Supplemental Table S3. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).


Supplemental Table S4. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).

Supplemental Table S5. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker genes treated individually (IM) or organized into collocated marker sets (MS).

Supplemental Table S6. Phylogenetically informative marker genes used to infer the reference genome tree along with matching PhyloSift genes.

Supplemental Table S7. Phylogenetically informative genes used in PhyloSift without a matching CheckM gene.

Supplemental Table S8. Mean absolute error of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).

Supplemental Table S9. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).

Supplemental Table S10. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms), the lineage-specific marker set selected by CheckM (sms), and the best performing lineage-specific marker set (bms).


Supplemental Table S11. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker set selected by CheckM (sms).

Supplemental Table S12. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker sets selected by CheckM (sms).


Supplemental Table S13. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates determined using domain-specific marker sets (dms) and the lineage-specific marker sets selected by CheckM (sms).

Supplemental Table S14. Taxonomic rank of the selected lineage-specific marker set used for evaluating the quality of genomes at different degrees of taxonomic novelty.

Supplemental Table S15. Mean absolute error and standard deviation of completeness (comp.) and contamination (cont.) estimates for simulated genomes at different degrees of taxonomic novelty.

Supplemental Table S16. Lineage-specific completeness and contamination estimates for isolate genomes from large-scale sequencing initiatives. (see Excel file)

Supplemental Table S17. Completeness and contamination estimates of the Lactobacillus gasseri MV-22 genome for increasingly basal lineage-specific marker sets.

Supplemental Table S18. Bacterial marker genes identified within the HMP Lactobacillus gasseri genomes. Markers missing from a genome or present in multiple copies are highlighted with a grey background.

Supplemental Table S19. Lineage-specific completeness and contamination estimates for genomes annotated as finished at IMG, along with predicted translation tables and calculated coding density. (see Excel file)

Supplemental Table S20. Lineage-specific completeness and contamination estimates for single-cell genomes from the GEBA-MDM initiative along with traditional assembly statistics. (see Excel file)

Supplemental Table S21. Lineage-specific completeness and contamination estimates for population genomes, plasmids, and phage recovered from metagenomic datasets along with traditional assembly statistics. (see Excel file)

Supplemental Table S22. Completeness and contamination estimates for population genomes recovered from an acetate-amended aquifer determined using domain-level and lineage-specific marker sets. (see Excel file)


Access conditions Open Access
Licencing and terms of access UQ Terms & Conditions Permitted Re-use with Acknowledgement Licence
ANZSRC Field of Research (FoR) Code 060309 Phylogeny and Comparative Analysis
060504 Microbial Ecology
DOI 10.14264/uql.2016.841
Grant ID DP120103498
DP1093175
Type of data Text
Figures
Spreadsheets
Python files
Software required CheckM v0.9.4
Language eng
Keyword Isolates
Single cells
Metagenomic data
Genome quality
CheckM
Geographic co-ordinates (longitude, latitude) 153.050537,-27.352253
Collection type Dataset
Publisher Cold Spring Harbor Laboratory Press
Publication Year 2015
Copyright notice 2015, The University of Queensland
Additional Notes CheckM is open source software available at http://ecogenomics.github.io/CheckM

Document type: Data Collection
Collections: Research Data Collections
School of Chemistry and Molecular Biosciences
 
Created: Wed, 05 Oct 2016, 19:23:32 EST by Anthony Yeates