De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2 using vsearch. Definitions of the clustering approaches can be found in Rideout et al. (2014).
These operations begin with FeatureTable[Frequency] and FeatureData[Sequence] artifacts, for example as could be generated by dada2 denoise-paired, deblur denoise-16S, or vsearch dereplicate-sequences.
Here we’ll illustrate open-reference clustering; the other two approaches should be straightforward to model on that command.
Obtain the data
First, download sample metadata and a few data artifacts.
The data artifacts that we’ll download are a “demux artifact” (i.e., SampleData[PairedEndSequencesWithQuality], as required by the denoise-paired step below) and a collection of reference sequences for use in open-reference clustering.
wget -O 'sample-metadata.tsv' \
'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/sample-metadata.tsv'
from qiime2 import Metadata
from urllib import request
url = 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/sample-metadata.tsv'
fn = 'sample-metadata.tsv'
request.urlretrieve(url, fn)
sample_metadata_md = Metadata.load(fn)
library(reticulate)
Metadata <- import("qiime2")$Metadata
request <- import("urllib")$request
url <- 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/sample-metadata.tsv'
fn <- 'sample-metadata.tsv'
request$urlretrieve(url, fn)
sample_metadata_md <- Metadata$load(fn)
sample_metadata = use.init_metadata_from_url(
'sample-metadata',
'https://www.dropbox.com/scl/fi/irosimbb1aud1aa7frzxf/sample-metadata.tsv?rlkey=f45jpxzajjz9xx9vpvfnf1zjx&st=nahafuvy&dl=1')
wget -O 'demux.qza' \
'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/demux.qza'
from qiime2 import Artifact
url = 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/demux.qza'
fn = 'demux.qza'
request.urlretrieve(url, fn)
demux = Artifact.load(fn)
Artifact <- import("qiime2")$Artifact
url <- 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/demux.qza'
fn <- 'demux.qza'
request$urlretrieve(url, fn)
demux <- Artifact$load(fn)
demux = use.init_artifact_from_url(
'demux',
'https://www.dropbox.com/scl/fi/hpsl1hxa0kj3njhes7p64/demux-10p.qza?rlkey=e5brlu9xn4qcrqaan11z2oi7d&st=r9or2kur&dl=1')
wget -O 'reference-seqs.qza' \
'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/reference-seqs.qza'
url = 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/reference-seqs.qza'
fn = 'reference-seqs.qza'
request.urlretrieve(url, fn)
reference_seqs = Artifact.load(fn)
url <- 'https://amplicon-docs.qiime2.org/en/stable/data/cluster-reads-into-otus/reference-seqs.qza'
fn <- 'reference-seqs.qza'
request$urlretrieve(url, fn)
reference_seqs <- Artifact$load(fn)
reference_seqs = use.init_artifact_from_url(
'reference_seqs',
'https://data.qiime2.org/2025.4/tutorials/otu-clustering/85_otus.qza')
The data used in this guide were sequenced on an Illumina MiSeq and originally published in Meilander et al. (2024). The data used here are subsampled to 10% of the original input sequences so the commands can be run quickly. You can find the full dataset in the study’s Artifact Repository.
Generate the input artifacts
Open-reference clustering in QIIME 2 begins with:
- a FeatureTable[Frequency] and a corresponding FeatureData[Sequence], and
- another FeatureData[Sequence] containing the reference sequences to cluster against.
To generate the FeatureTable[Frequency] and the corresponding FeatureData[Sequence], we’ll use DADA2’s denoise-paired action.
qiime dada2 denoise-paired \
--i-demultiplexed-seqs demux.qza \
--p-trim-left-f 0 \
--p-trunc-len-f 250 \
--p-trim-left-r 0 \
--p-trunc-len-r 250 \
--o-representative-sequences asv-seqs.qza \
--o-table asv-table.qza \
--o-denoising-stats denoising-stats.qza \
--o-base-transition-stats base-transition-stats.qza
import rachis.plugins.dada2.actions as dada2_actions
asv_table, asv_seqs, denoising_stats, base_transition_stats = dada2_actions.denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250,
)
dada2_actions <- import("rachis.plugins.dada2.actions")
action_results <- dada2_actions$denoise_paired(
demultiplexed_seqs=demux,
trim_left_f=0L,
trunc_len_f=250L,
trim_left_r=0L,
trunc_len_r=250L,
)
asv_seqs <- action_results$representative_sequences
asv_table <- action_results$table
denoising_stats <- action_results$denoising_stats
base_transition_stats <- action_results$base_transition_stats
asv_seqs, asv_table, denoising_stats, base_transition_stats = use.action(
use.UsageAction(plugin_id='dada2',
action_id='denoise_paired'),
use.UsageInputs(demultiplexed_seqs=demux,
trim_left_f=0,
trunc_len_f=250,
trim_left_r=0,
trunc_len_r=250),
use.UsageOutputNames(representative_sequences='asv_seqs',
table='asv_table',
denoising_stats='denoising_stats',
base_transition_stats='base_transition_stats'))
Cluster features
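The trim-left and trunc-len parameters passed to denoise-paired above control how much of each read is kept before denoising. The following is a minimal sketch of that positional logic only, assuming the behavior documented for DADA2 (truncate at a fixed position, discard reads shorter than that position, then remove bases from the 5' end). The function name `trim_and_truncate` is hypothetical and is not part of QIIME 2 or DADA2, and the sketch does not perform any denoising.

```python
from typing import Optional


def trim_and_truncate(read: str, trim_left: int, trunc_len: int) -> Optional[str]:
    """Apply trim-left / trunc-len style positional filtering to one read.

    Hypothetical helper for illustration; not the DADA2 implementation.
    Returns None when the read would be discarded.
    """
    if trunc_len > 0:
        # Reads shorter than the truncation position are discarded.
        if len(read) < trunc_len:
            return None
        # Truncate the 3' end at the requested position.
        read = read[:trunc_len]
    # Remove the requested number of bases from the 5' end.
    return read[trim_left:]


# Toy reads (made up for this example).
kept = trim_and_truncate("ACGTACGT", trim_left=0, trunc_len=5)
discarded = trim_and_truncate("ACG", trim_left=0, trunc_len=5)
trimmed = trim_and_truncate("ACGTACGT", trim_left=2, trunc_len=6)
untouched = trim_and_truncate("ACGT", trim_left=0, trunc_len=0)
```

With the settings used in this guide (trim-left of 0 and trunc-len of 250 for both reads), no bases are removed from the 5' end, reads are cut at position 250, and reads shorter than 250 bases are dropped.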
We now have all of the artifacts that we need to cluster the sequences. In open-reference clustering, each input (i.e., query) sequence is searched against a reference collection of sequences (i.e., the subject sequences). If the query sequence matches a subject sequence at or above the user-specified percent identity threshold (we’ll use 85% here), the query sequence is mapped to that subject sequence. If the query sequence doesn’t match any subject sequence at that threshold, it becomes the centroid of a new OTU and is added to the reference collection of sequences. We run this as follows:
qiime vsearch cluster-features-open-reference \
--i-table asv-table.qza \
--i-sequences asv-seqs.qza \
--i-reference-sequences reference-seqs.qza \
--p-perc-identity 0.85 \
--o-clustered-sequences otu-seqs.qza \
--o-clustered-table otu-table.qza \
--o-new-reference-sequences new-reference-seqs.qza
import rachis.plugins.vsearch.actions as vsearch_actions
otu_table, otu_seqs, new_reference_seqs = vsearch_actions.cluster_features_open_reference(
table=asv_table,
sequences=asv_seqs,
reference_sequences=reference_seqs,
perc_identity=0.85,
)
vsearch_actions <- import("rachis.plugins.vsearch.actions")
action_results <- vsearch_actions$cluster_features_open_reference(
table=asv_table,
sequences=asv_seqs,
reference_sequences=reference_seqs,
perc_identity=0.85,
)
otu_seqs <- action_results$clustered_sequences
otu_table <- action_results$clustered_table
new_reference_seqs <- action_results$new_reference_sequences
clustered_sequences, clustered_table, new_reference_sequences = use.action(
use.UsageAction(plugin_id='vsearch',
action_id='cluster_features_open_reference'),
use.UsageInputs(table=asv_table,
sequences=asv_seqs,
reference_sequences=reference_seqs,
perc_identity=0.85),
use.UsageOutputNames(clustered_sequences='otu_seqs',
clustered_table='otu_table',
new_reference_sequences='new_reference_seqs'))
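As a rough illustration of the clustering logic described at the start of this section, the following pure-Python sketch maps each query to a reference when it meets the identity threshold, promotes unmatched queries to new centroids, and sums feature counts per OTU. It is a toy under stated assumptions, not how vsearch works: identity here is a position-by-position comparison of equal-length sequences (real tools align sequences first), queries are processed in dictionary order, and all function names and data are hypothetical.

```python
def percent_identity(a: str, b: str) -> float:
    """Fraction of matching positions between two equal-length sequences.

    Simplifying assumption: no alignment; unequal lengths score 0.
    """
    if len(a) != len(b) or not a:
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def cluster_open_reference(queries, table, references, threshold):
    """Toy open-reference clustering.

    queries:    {feature_id: sequence}
    table:      {sample_id: {feature_id: count}}
    references: {reference_id: sequence}
    Returns (otu_table, otu_seqs, new_references).
    """
    new_references = dict(references)  # grows as new centroids are added
    assignment = {}                    # feature_id -> OTU id
    for feature_id, seq in queries.items():
        best_id, best_score = None, 0.0
        for ref_id, ref_seq in new_references.items():
            score = percent_identity(seq, ref_seq)
            if score > best_score:
                best_id, best_score = ref_id, score
        if best_id is not None and best_score >= threshold:
            # Query maps to an existing subject sequence.
            assignment[feature_id] = best_id
        else:
            # Query becomes the centroid of a new OTU and joins the reference.
            new_references[feature_id] = seq
            assignment[feature_id] = feature_id

    # Collapse the feature table: counts of features in the same OTU are summed.
    otu_table = {}
    for sample_id, counts in table.items():
        collapsed = {}
        for feature_id, count in counts.items():
            otu_id = assignment[feature_id]
            collapsed[otu_id] = collapsed.get(otu_id, 0) + count
        otu_table[sample_id] = collapsed

    otu_seqs = {otu_id: new_references[otu_id]
                for otu_id in sorted(set(assignment.values()))}
    return otu_table, otu_seqs, new_references


# Made-up example data.
references = {"ref1": "ACGTACGTAC"}
queries = {
    "asv1": "ACGTACGTAA",  # 9/10 positions match ref1
    "asv2": "TTTTTTTTTT",  # no match at 0.85: becomes a new centroid
    "asv3": "TTTTTTTTTA",  # 9/10 positions match the new centroid asv2
}
table = {
    "s1": {"asv1": 5, "asv2": 3, "asv3": 2},
    "s2": {"asv1": 1, "asv3": 4},
}
otu_table, otu_seqs, new_references = cluster_open_reference(
    queries, table, references, threshold=0.85)
```

In this toy run, asv1 is absorbed into ref1, asv2 founds a new OTU, and asv3 is absorbed into asv2, so the three returned values loosely mirror the clustered table, clustered sequences, and new reference sequences.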
The outputs from cluster-features-open-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts.
One of the FeatureData[Sequence] artifacts represents the clustered sequences, while the other artifact represents the new reference sequences, composed of the reference sequences used for input, as well as the input sequences that were added to the reference.
The new reference sequences could be used for iterative open-reference clustering, as described in Rideout et al. (2014).
These outputs can be used for all downstream analyses. For example, let’s summarize the OTU table.
qiime feature-table summarize \
--i-table otu-table.qza \
--m-metadata-file sample-metadata.tsv \
--o-summary clustered-table.qzv \
--o-sample-frequencies sample-frequencies.qza \
--o-feature-frequencies otu-frequencies.qza
import rachis.plugins.feature_table.actions as feature_table_actions
otu_frequencies, sample_frequencies, clustered_table_viz = feature_table_actions.summarize(
table=otu_table,
metadata=sample_metadata_md,
)
feature_table_actions <- import("rachis.plugins.feature_table.actions")
action_results <- feature_table_actions$summarize(
table=otu_table,
metadata=sample_metadata_md,
)
clustered_table_viz <- action_results$summary
sample_frequencies <- action_results$sample_frequencies
otu_frequencies <- action_results$feature_frequencies
use.action(
use.UsageAction(plugin_id='feature_table',
action_id='summarize'),
use.UsageInputs(table=clustered_table,
metadata=sample_metadata),
use.UsageOutputNames(summary='clustered_table',
sample_frequencies='sample_frequencies',
feature_frequencies='otu_frequencies'))
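To clarify what the sample-frequencies and feature-frequencies outputs of the summarize step represent, here is a small sketch that computes both from a toy table held in plain dictionaries. The function name `summarize_frequencies` and the example IDs are hypothetical; this is not the q2-feature-table implementation, only the underlying arithmetic (per-sample totals and per-feature totals).

```python
def summarize_frequencies(table):
    """Compute per-sample and per-feature total counts.

    table: {sample_id: {feature_id: count}} (toy stand-in for a feature table)
    Returns (sample_frequencies, feature_frequencies).
    """
    # Total count observed in each sample.
    sample_frequencies = {
        sample_id: sum(counts.values()) for sample_id, counts in table.items()
    }
    # Total count of each feature across all samples.
    feature_frequencies = {}
    for counts in table.values():
        for feature_id, count in counts.items():
            feature_frequencies[feature_id] = (
                feature_frequencies.get(feature_id, 0) + count
            )
    return sample_frequencies, feature_frequencies


# Made-up OTU table with two samples and two OTUs.
toy_otu_table = {
    "sample-1": {"otu-a": 5, "otu-b": 5},
    "sample-2": {"otu-a": 1, "otu-b": 4},
}
sample_freqs, feature_freqs = summarize_frequencies(toy_otu_table)
```

Per-sample totals like these are what one would typically inspect when choosing a rarefaction depth, while per-feature totals show which OTUs dominate the dataset.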
- Rognes, T., Flouri, T., Nichols, B., Quince, C., & Mahé, F. (2016). VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4, e2584. 10.7717/peerj.2584
- Rideout, J. R., He, Y., Navas-Molina, J. A., Walters, W. A., Ursell, L. K., Gibbons, S. M., Chase, J., McDonald, D., Gonzalez, A., Robbins-Pianka, A., Clemente, J. C., Gilbert, J. A., Huse, S. M., Zhou, H.-W., Knight, R., & Caporaso, J. G. (2014). Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2, e545. 10.7717/peerj.545
- Meilander, J., Herman, C., Manley, A., Augustine, G., Birdsell, D., Bolyen, E., Celona, K. R., Coffey, H., Cocking, J., Donoghue, T., Draves, A., Erickson, D., Foley, M., Gehret, L., Hagen, J., Hepp, C., Ingram, P., John, D., Kadar, K., … Caporaso, J. G. (2024). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis. arXiv. 10.48550/ARXIV.2411.04148
- Caporaso, J. G., & Meilander, J. (2025). Upcycling Human Excrement: The Gut Microbiome to Soil Microbiome Axis (supporting data). Zenodo. 10.5281/ZENODO.13887456
- Callahan, B. J., McMurdie, P. J., Rosen, M. J., Han, A. W., Johnson, A. J. A., & Holmes, S. P. (2016). DADA2: high-resolution sample inference from Illumina amplicon data. Nature Methods, 13(7), 581. 10.1038/nmeth.3869