How to use decontam to remove microbiome contamination

Decontam is a bioinformatics decontamination tool applicable to both amplicon and metagenomic sequencing data. It leverages the differing relative abundances of contaminants in control samples compared to experimental samples, and in low-biomass samples compared to high-biomass samples. Contaminants are a consistent and pervasive issue in microbial measurements. They can distort taxonomic assignments and relative abundance calculations, and can even lead to the archiving of spurious taxa, which in turn leads to incorrect assumptions and interpretations. This is of particular concern in low-biomass environments, where the impact of contaminants is heightened by the small amount of starting biological material from the target community. Decontam seeks to remedy this by removing contaminants at the feature level before relative abundance calculation, and it has been shown to be effective in identifying contaminants in datasets of various structures (Davis et al., 2018). Here we show how to use Decontam through the q2-quality-control plugin.

QIIME 2 Basics

If you’re completely new to QIIME 2, we recommend reading Getting Started with QIIME 2 to familiarize yourself with concepts that may be helpful. To install QIIME 2 or MOSHPIT, both of which will include the q2-quality-control decontam functionality by default, follow the instructions here.

Installation and base actions

The q2-quality-control decontam functionality consists of two main actions: decontam-identify and decontam-score-viz.

The decontam-identify action produces an artifact containing the decontam scores for each feature in the dataset. This artifact is then passed into the decontam-score-viz visualizer to display the distribution of decontam scores and help determine where the threshold should be set to eliminate the majority of contaminants while retaining non-contaminant features. To then remove contaminants based on this threshold, actions in the q2-feature-table plugin can be used.

Tutorial and Walkthrough

Step 0: Access the tutorial data

This tutorial’s example commands reference the following example data. To download the example feature table, representative sequences, and metadata, run the commands below.

wget -O 'table.qza' \
  'https://amplicon-docs.qiime2.org/en/stable/data/decontam-howto/table.qza'
wget -O 'rep-seqs.qza' \
  'https://amplicon-docs.qiime2.org/en/stable/data/decontam-howto/rep-seqs.qza'
wget -O 'sample-metadata.tsv' \
  'https://amplicon-docs.qiime2.org/en/stable/data/decontam-howto/sample-metadata.tsv'

Step 1: Identify suspected contaminants

The first step is to run decontam-identify, which runs the functionality of the base decontam application. You will need to decide which method of decontamination to use: frequency, prevalence, or combined. An in-depth explanation of each method can be found here. Each decontam method requires its own metadata to run appropriately.

The frequency method requires that each sample processed has corresponding DNA concentration information. It identifies contaminants on the premise that contaminants will be in greater relative abundance in low-concentration samples than in high-concentration samples.

The prevalence method requires control samples to be included in the dataset being analyzed. These control samples must be identified via a metadata column that differentiates them from experimental samples. This method works on the premise that contaminants will be present more often (that is, more prevalent) in control samples than in experimental samples.

The combined method, as the name suggests, uses facets of both methods and combines them into a composite decontam score.

Just as each method has its own metadata requirements, each method also has its own parameters. Parameters that are unique to the frequency method have the prefix freq, and parameters that are unique to the prevalence method have the prefix prev. The combined method uses all prevalence and frequency parameters.
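The metadata requirements described above can be checked before running decontam-identify. The following is a minimal shell sketch and not part of QIIME 2. It builds a small hypothetical metadata table (the sample IDs and values are invented; the column names Concentration and Sample_or_Control and the indicator value control are the ones used in this tutorial's commands), then counts samples that lack a numeric concentration and samples marked as controls.

```shell
# Hypothetical stand-in for sample-metadata.tsv; a real file will have more rows and columns.
meta="$(mktemp)"
printf 'sample-id\tConcentration\tSample_or_Control\n' > "$meta"
printf 'S1\t12.5\tsample\n' >> "$meta"
printf 'S2\t0.8\tsample\n' >> "$meta"
printf 'B1\t0.1\tcontrol\n' >> "$meta"

# Print the 1-based index of a named column in the header row (nothing if absent).
col_index() {
  awk -F'\t' -v name="$2" 'NR == 1 { for (i = 1; i <= NF; i++) if ($i == name) print i; exit }' "$1"
}

conc_col="$(col_index "$meta" Concentration)"
ctrl_col="$(col_index "$meta" Sample_or_Control)"

# frequency method: every sample needs a numeric concentration value.
# A "#q2:types" directive row, if present, is skipped.
missing_conc="$(awk -F'\t' -v c="$conc_col" \
  'NR > 1 && $1 != "#q2:types" && $c !~ /^[0-9.]+$/ { n++ } END { print n + 0 }' "$meta")"

# prevalence method: at least one sample must carry the control indicator.
n_controls="$(awk -F'\t' -v c="$ctrl_col" \
  'NR > 1 && $c == "control" { n++ } END { print n + 0 }' "$meta")"

echo "samples missing a concentration: $missing_conc"
echo "control samples: $n_controls"
rm -f "$meta"
```

If the first count is above zero, the frequency and combined methods are likely to need attention to the metadata; if the second count is zero, the prevalence and combined methods have no controls to compare against.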

Arguments/Parameters:

Option 1: frequency Method

To run decontam-identify with the frequency method, perform the following. The frequency method is the only decontamination method that can be used when there are no control samples.

qiime quality-control decontam-identify \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-method frequency \
  --p-freq-concentration-column Concentration \
  --o-decontam-scores freq-decontam-scores.qza
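To build intuition for what the frequency method looks for, the sketch below compares two simple models on invented toy data (the feature names, concentrations, and relative abundances are all hypothetical). Under a contaminant-like model, a feature's relative abundance is inversely proportional to DNA concentration; under a sample-like model, it is independent of concentration. The sketch simply reports which model leaves the smaller residual sum of squares on the log scale. It is a simplified illustration of the premise, not decontam's implementation, and it does not compute a decontam score.

```shell
# Hypothetical toy data: feature, DNA concentration, relative abundance of that feature.
data="$(mktemp)"
cat > "$data" <<'EOF'
featA 10 0.010
featA 5  0.020
featA 1  0.100
featB 10 0.200
featB 5  0.210
featB 1  0.190
EOF

result="$(awk '
{
  lf = log($3); lc = log($2)
  n[$1]++
  # Sample-like model: log(freq) = b (constant across samples).
  s0[$1] += lf;      q0[$1] += lf * lf
  # Contaminant-like model: log(freq) = a - log(conc).
  s1[$1] += lf + lc; q1[$1] += (lf + lc) ^ 2
}
END {
  for (f in n) {
    # Residual sum of squares around the fitted constant for each model.
    rss0 = q0[f] - s0[f] ^ 2 / n[f]
    rss1 = q1[f] - s1[f] ^ 2 / n[f]
    print f, (rss1 < rss0 ? "contaminant-like" : "sample-like")
  }
}' "$data" | sort)"

echo "$result"
rm -f "$data"
```

Here featA becomes ten times more abundant as the concentration drops tenfold, so the contaminant-like model fits it better, while featB stays near 0.2 regardless of concentration.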

Option 2: prevalence Method

To run decontam-identify with the prevalence method, perform the following.

qiime quality-control decontam-identify \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-method prevalence \
  --p-prev-control-column Sample_or_Control \
  --p-prev-control-indicator control \
  --o-decontam-scores prev-decontam-scores.qza
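The idea behind the prevalence method can be illustrated with a small presence/absence tally. The sketch below uses invented toy counts (feature names and read counts are hypothetical; the sample and control labels follow the Sample_or_Control column used above) and compares the fraction of controls in which a feature is present against the fraction of experimental samples in which it is present. This is only an illustration of the premise: it compares raw fractions and does not reproduce the statistical scoring that decontam performs.

```shell
# Hypothetical toy data: feature, Sample_or_Control value, read count in that sample.
counts="$(mktemp)"
cat > "$counts" <<'EOF'
featA control 40
featA control 12
featA sample 0
featA sample 3
featA sample 0
featB control 0
featB control 0
featB sample 25
featB sample 90
featB sample 14
EOF

prevalence="$(awk '
{
  total[$1, $2]++                    # samples of this type seen for the feature
  if ($3 > 0) present[$1, $2]++      # samples of this type where the feature is present
  feats[$1] = 1
}
END {
  for (f in feats) {
    pc = present[f, "control"] / total[f, "control"]
    ps = present[f, "sample"] / total[f, "sample"]
    printf "%s control=%.2f sample=%.2f %s\n", f, pc, ps,
      (pc > ps ? "contaminant-like" : "sample-like")
  }
}' "$counts" | sort)"

echo "$prevalence"
rm -f "$counts"
```

featA appears in every control but in only one of three experimental samples, which is the pattern the prevalence method treats as evidence of contamination.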

Option 3: combined Method

To run decontam-identify with the combined method, perform the following.

qiime quality-control decontam-identify \
  --i-table table.qza \
  --m-metadata-file sample-metadata.tsv \
  --p-method combined \
  --p-freq-concentration-column Concentration \
  --p-prev-control-column Sample_or_Control \
  --p-prev-control-indicator control \
  --o-decontam-scores comb-decontam-scores.qza

Step 2: Visualize suspected contaminants

The second step in the q2-quality-control decontam workflow is decontam-score-viz. This step visualizes the decontam scores as a histogram and as a table of per-feature information.

This action was designed to assist in investigating contaminants and in choosing the threshold to use when removing them. To select an appropriate threshold for the data, identify the contaminant feature distribution within the histogram of decontam scores. Decontam score distributions are typically bimodal, meaning that there are two peaks within the distribution: one corresponds to a sub-distribution of contaminant features and the other to a sub-distribution of true features. A lower decontam score indicates more evidence that the feature is a contaminant; a higher decontam score indicates less evidence that the feature is a contaminant. Below is the histogram from the decontam-score-viz action using the example data provided in this tutorial.

[Figure: histogram of decontam scores for the example data]

The left side of the histogram (scores from 0 to roughly 0.1 or 0.15) shows a partial normal distribution with a small amplitude. This shape is typical of the contaminant feature distribution within the overall decontam score histogram when the dataset has a low amount of contaminants. If a dataset has a larger amount of contaminants, the partial normal distribution over the lower decontam scores will increase in amplitude. Buttons in the visualization allow the features identified as non-contaminants or contaminants to be downloaded as separate fasta files for later investigation.

[Figure: fasta download buttons in the visualization]

To investigate specific features within the experiment, the following table is provided in the visualization. It is located below the histogram and the fasta download buttons:

[Figure: per-feature table in the visualization]

Arguments/Parameters:

To run decontam-score-viz, perform the following.

qiime quality-control decontam-score-viz \
  --i-decontam-scores first:comb-decontam-scores.qza \
  --i-table first:table.qza \
  --i-rep-seqs rep-seqs.qza \
  --p-threshold 0.1 \
  --p-no-weighted \
  --p-bin-size 0.05 \
  --o-visualization decontam-score-viz.qzv
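To make the roles of the threshold and bin size concrete, the sketch below bins a handful of invented scores by hand. It is not how the visualizer is implemented; the feature names and scores are hypothetical, NA stands in for a feature that could not be scored, and the values 0.05 and 0.1 simply mirror --p-bin-size and --p-threshold in the command above.

```shell
# Hypothetical toy scores: feature, decontam score (NA = no score available).
scores="$(mktemp)"
cat > "$scores" <<'EOF'
featA 0.02
featB 0.04
featC 0.07
featD 0.48
featE 0.52
featF 0.53
featG 0.58
featH 0.91
featI NA
EOF

# Count scored features per bin; each output line is "<lower bin edge> <count>".
# The small epsilon guards against floating-point error for scores on a bin edge.
hist="$(awk -v w=0.05 '
$2 ~ /^[0-9.]+$/ { bin[int($2 / w + 1e-9)]++ }
END { for (b in bin) printf "%.2f %d\n", b * w, bin[b] }' "$scores" | sort)"

# Count scored features that fall below the candidate threshold.
below="$(awk -v t=0.1 '$2 ~ /^[0-9.]+$/ && $2 < t { n++ } END { print n + 0 }' "$scores")"

echo "$hist"
echo "features below threshold: $below"
rm -f "$scores"
```

In this toy data the low-score group (three features below 0.1) is clearly separated from the cluster around 0.5, which is the kind of gap to look for when choosing a threshold from the real histogram.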

Step 3: Remove suspected contaminants

To remove the contaminant features, actions in the q2-feature-table plugin can be used. Below, SQLite syntax is used in the where parameter to filter features. The steps remove features, and their corresponding sequences, whose decontam scores (column p) are at or below the 0.1 threshold; features without a score are retained. Removing these features reduces the influence of contaminants on downstream analysis. More information about the q2-feature-table plugin is available here.

qiime feature-table filter-features \
  --i-table table.qza \
  --m-metadata-file comb-decontam-scores.qza \
  --p-where '[p]>0.1 OR [p] IS NULL' \
  --o-filtered-table filtered-table.qza
qiime feature-table filter-seqs \
  --i-data rep-seqs.qza \
  --i-table filtered-table.qza \
  --o-filtered-data filtered-rep-seqs.qza
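
The effect of the where clause '[p]>0.1 OR [p] IS NULL' can be previewed on a small table. The sketch below mirrors that condition with awk on invented data (feature names and scores are hypothetical, and NA is used here as a stand-in for a missing score, on the assumption that unscored features appear as missing values). It lists the features that would be kept.

```shell
# Hypothetical toy scores: feature, decontam score p (NA = missing).
scores="$(mktemp)"
cat > "$scores" <<'EOF'
featA 0.02
featB 0.1
featC 0.48
featD NA
EOF

# Keep a feature when its score is missing or strictly greater than the threshold.
kept="$(awk -v t=0.1 '$2 == "NA" || $2 > t { print $1 }' "$scores")"

echo "$kept"
rm -f "$scores"
```

Note that featB, whose score is exactly 0.1, is dropped because the comparison is strict, while featD is kept because it has no score to compare.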
References
  1. Davis, N. M., Proctor, D. M., Holmes, S. P., Relman, D. A., & Callahan, B. J. (2018). Simple statistical identification and removal of contaminant sequences in marker-gene and metagenomics data. Microbiome, 6(1). doi:10.1186/s40168-018-0605-2