How to cluster sequences into OTUs

De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2 using vsearch. Definitions of the clustering approaches can be found in Rideout et al. (2014).

These operations begin with FeatureTable[Frequency] and FeatureData[Sequence] artifacts, such as those generated by dada2 denoise-paired, deblur denoise-16S, or vsearch dereplicate-sequences.

Here we’ll illustrate open-reference clustering. The other two approaches should be straightforward to model on that command.

Obtain the data

First, download the sample metadata and two data artifacts. The artifacts are a “demux artifact” (i.e., SampleData[PairedEndSequencesWithQuality], the input type that dada2 denoise-paired expects) and a collection of reference sequences for use in open-reference clustering.

wget -O 'sample-metadata.tsv' \
  'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/sample-metadata.tsv'
wget -O 'demux.qza' \
  'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/demux.qza'
wget -O 'reference-seqs.qza' \
  'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/reference-seqs.qza'
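
Before moving on, it can help to confirm that the downloads succeeded. The following is a minimal sketch rather than part of the original guide: the `check_download` helper is a hypothetical name introduced here, and the block only calls `qiime tools peek` (which prints an artifact's UUID, type, and data format) if the `qiime` command is available.

```shell
# Hypothetical helper (not part of QIIME 2): report whether a downloaded
# file exists and is non-empty.
check_download() {
  if [ -s "$1" ]; then
    echo "ok: $1"
  else
    echo "missing or empty: $1"
    return 1
  fi
}

# Check each of the three files downloaded above.
for f in sample-metadata.tsv demux.qza reference-seqs.qza; do
  check_download "$f" || true
done

# If QIIME 2 is installed, peek at each artifact that is present.
if command -v qiime >/dev/null 2>&1; then
  for f in demux.qza reference-seqs.qza; do
    if [ -s "$f" ]; then
      qiime tools peek "$f"
    fi
  done
fi
```

If a file is reported as missing or empty, re-running the corresponding wget command is usually the simplest fix.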

The data used in this guide were sequenced on an Illumina MiSeq and were originally published in Meilander et al. (2024). The data used here are subsampled to 10% of the original input sequences so that the commands run quickly. You can find the full dataset in the study’s Artifact Repository.

Generate the input artifacts

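Open-reference clustering in QIIME 2 begins with a FeatureTable[Frequency] artifact, a FeatureData[Sequence] artifact containing the sequences that define the features in that table, and a FeatureData[Sequence] artifact containing the reference sequences.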

To generate the FeatureTable[Frequency] and a corresponding FeatureData[Sequence], we’ll use DADA2’s denoise-paired action.

qiime dada2 denoise-paired \
  --i-demultiplexed-seqs demux.qza \
  --p-trim-left-f 0 \
  --p-trunc-len-f 250 \
  --p-trim-left-r 0 \
  --p-trunc-len-r 250 \
  --o-representative-sequences asv-seqs.qza \
  --o-table asv-table.qza \
  --o-denoising-stats denoising-stats.qza \
  --o-base-transition-stats base-transition-stats.qza
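
As an optional check that is not part of the original guide, the denoising statistics can be rendered as a table to see how many reads each sample retained through denoising. This sketch assumes the denoise-paired command above has finished. The output name `denoising-stats.qzv` is an illustrative choice, and the guard simply skips the step when `qiime` or the input file is absent.

```shell
# Only run if QIIME 2 is installed and the denoising step has produced output.
if command -v qiime >/dev/null 2>&1 && [ -f denoising-stats.qza ]; then
  # Render the per-sample denoising statistics as a viewable table.
  qiime metadata tabulate \
    --m-input-file denoising-stats.qza \
    --o-visualization denoising-stats.qzv
  stats_status="tabulated"
else
  stats_status="skipped"
  echo "qiime or denoising-stats.qza not found; skipping"
fi
```

The resulting visualization can be opened with `qiime tools view denoising-stats.qzv` or at view.qiime2.org.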

Cluster features

We now have all of the artifacts that we need to cluster the sequences. In open-reference clustering, each input (i.e., query) sequence is searched against a reference collection of sequences (i.e., the subject sequences). If the current query sequence matches a subject sequence at greater than or equal to the user-specified percent identity threshold (we’ll use 85% here), the query sequence is mapped to that subject sequence. If the query sequence doesn’t match a subject sequence at the specified threshold, it becomes the centroid of a new OTU and that sequence is added to the reference collection of sequences. We run this as follows:

qiime vsearch cluster-features-open-reference \
  --i-table asv-table.qza \
  --i-sequences asv-seqs.qza \
  --i-reference-sequences reference-seqs.qza \
  --p-perc-identity 0.85 \
  --o-clustered-sequences otu-seqs.qza \
  --o-clustered-table otu-table.qza \
  --o-new-reference-sequences new-reference-seqs.qza

The outputs from cluster-features-open-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One of the FeatureData[Sequence] artifacts represents the clustered sequences, while the other artifact represents the new reference sequences, composed of the reference sequences used for input, as well as the input sequences that were added to the reference. The new reference sequences could be used for iterative open-reference clustering, as described in Rideout et al. (2014).

These outputs can be used for all downstream analyses. For example, let’s summarize the OTU table.

qiime feature-table summarize-plus \
  --i-table otu-table.qza \
  --m-metadata-file sample-metadata.tsv \
  --o-summary clustered-table.qzv \
  --o-sample-frequencies sample-frequencies.qza \
  --o-feature-frequencies otu-frequencies.qza
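
The de novo and closed-reference approaches mentioned at the start of this guide can be modeled on the open-reference command. The sketch below reuses the same inputs and the same 85% identity threshold. The output file names (`otu-table-dn.qza`, `otu-seqs-cr.qza`, and so on) are illustrative choices rather than names used by the guide, and the guard skips both commands when `qiime` or the input artifacts are not available.

```shell
# Only run if QIIME 2 is installed and the inputs from the earlier steps exist.
if command -v qiime >/dev/null 2>&1 \
    && [ -f asv-table.qza ] && [ -f asv-seqs.qza ] && [ -f reference-seqs.qza ]; then
  # De novo: input sequences are clustered against one another,
  # so no reference sequences are supplied.
  qiime vsearch cluster-features-de-novo \
    --i-table asv-table.qza \
    --i-sequences asv-seqs.qza \
    --p-perc-identity 0.85 \
    --o-clustered-table otu-table-dn.qza \
    --o-clustered-sequences otu-seqs-dn.qza

  # Closed-reference: input sequences are matched only against the reference;
  # sequences without a match are written to a separate artifact.
  qiime vsearch cluster-features-closed-reference \
    --i-table asv-table.qza \
    --i-sequences asv-seqs.qza \
    --i-reference-sequences reference-seqs.qza \
    --p-perc-identity 0.85 \
    --o-clustered-table otu-table-cr.qza \
    --o-clustered-sequences otu-seqs-cr.qza \
    --o-unmatched-sequences unmatched-cr.qza
  clustering_status="ran"
else
  clustering_status="skipped"
  echo "qiime or input artifacts not found; skipping"
fi
```

Unlike open-reference clustering, neither command produces a new reference collection: de novo clustering takes no reference at all, and closed-reference clustering reports the unmatched sequences instead of adding them to the reference.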
References
  1. Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584
  2. Rideout, J. R., He, Y., Navas-Molina, J. A., Walters, W. A., Ursell, L. K., Gibbons, S. M., Chase, J., McDonald, D., Gonzalez, A., Robbins-Pianka, A., Clemente, J. C., Gilbert, J. A., Huse, S. M., Zhou, H.-W., Knight, R., & Caporaso, J. G. (2014). Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2, e545. 10.7717/peerj.545
  3. Meilander, J., Herman, C., Manley, A., Augustine, G., Birdsell, D., Bolyen, E., Celona, K. R., Coffey, H., Cocking, J., Donoghue, T., Draves, A., Erickson, D., Foley, M., Gehret, L., Hagen, J., Hepp, C., Ingram, P., John, D., Kadar, K., … Caporaso, J. G. (2024). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. arXiv. 10.48550/ARXIV.2411.04148
  4. Caporaso, J. G., & Meilander, J. (2025). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis (supporting data). Zenodo. 10.5281/ZENODO.13887456
  5. Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581. 10.1038/nmeth.3869