Plugin Overview¶
A plugin to generate kmers from biological sequences.
- version: 2026.4.0
- website: https://github.com/bokulich-lab/q2-kmerizer
- user support: Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org
Actions¶
| Name | Type | Short Description |
|---|---|---|
| seqs-to-kmers | method | Generate kmers from sequences. |
| core-metrics | pipeline | Kmer counting and core diversity metrics (non-phylogenetic) |
kmerizer seqs-to-kmers¶
Generate kmers from biological sequences.
Citations¶
Pedregosa et al., 2011
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence | ProteinSequence] Biological sequences to kmerize. [required]
- table:
FeatureTable[Frequency] Frequencies of sequences per sample. [required]
Parameters¶
- kmer_size:
Int Length of kmers to generate. [default: 16]
- tfidf:
Bool If True, kmers will be scored using TF-IDF and output frequencies will be weighted by scores. If False, kmers are counted without TF-IDF scores. [default: False]
- max_df:
Float%Range(0, 1, inclusive_end=True)|Int Ignore kmers that have a frequency strictly higher than the given threshold. A float represents a proportion of sequences; an integer represents an absolute count. [default: 1.0]
- min_df:
Float%Range(0, 1)|Int Ignore kmers that have a frequency strictly lower than the given threshold. A float represents a proportion of sequences; an integer represents an absolute count. [default: 1]
- max_features:
Int If not None, build a vocabulary that only considers the top max_features kmers ordered by frequency (or TF-IDF score). [optional]
- norm:
Str%Choices('None', 'l1', 'l2') Normalization procedure applied to TF-IDF scores. Ignored if tfidf=False. l2: Sum of squares of vector elements is 1. l1: Sum of absolute values of vector elements is 1. [default: 'None']
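The interaction between the thresholds, TF-IDF scoring, and normalization can be illustrated with a minimal pure-Python sketch. This is not the plugin's implementation. It assumes that a kmer's "frequency" for `min_df` and `max_df` is the number of sequences containing it, that float thresholds are multiplied by the number of sequences, and that TF-IDF uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, as in scikit-learn's defaults (Pedregosa et al., 2011). The function names `kmers`, `doc_count`, and `score_kmers` are hypothetical, scores are computed per sequence only, and `max_features` is omitted for brevity.

```python
import math
from collections import Counter


def kmers(sequence, kmer_size):
    """Return all overlapping substrings of length kmer_size."""
    return [sequence[i:i + kmer_size]
            for i in range(len(sequence) - kmer_size + 1)]


def doc_count(threshold, n_sequences):
    """Resolve a threshold: a float is a proportion, an int an absolute count."""
    if isinstance(threshold, float):
        return threshold * n_sequences
    return threshold


def score_kmers(sequences, kmer_size, tfidf=False,
                min_df=1, max_df=1.0, norm="None"):
    """Hypothetical helper: count kmers per sequence, filter, optionally score."""
    counts = [Counter(kmers(seq, kmer_size)) for seq in sequences]
    n = len(sequences)
    # Document frequency: number of sequences that contain each kmer.
    df = Counter(kmer for c in counts for kmer in c)
    low, high = doc_count(min_df, n), doc_count(max_df, n)
    # Drop kmers strictly below min_df or strictly above max_df.
    vocabulary = sorted(k for k, d in df.items() if low <= d <= high)

    rows = []
    for c in counts:
        row = [float(c[k]) for k in vocabulary]
        if tfidf:
            # Assumed smoothed IDF weighting of the raw counts.
            row = [v * (math.log((1 + n) / (1 + df[k])) + 1)
                   for v, k in zip(row, vocabulary)]
            if norm == "l1":
                length = sum(abs(v) for v in row)
            elif norm == "l2":
                length = math.sqrt(sum(v * v for v in row))
            else:
                length = 0.0  # norm='None': leave scores unscaled
            if length > 0:
                row = [v / length for v in row]
        rows.append(row)
    return vocabulary, rows


sequences = ["ACGT", "ACGA", "ACGG", "TTTT"]
vocab_all, rows_all = score_kmers(sequences, 3)
vocab_common, _ = score_kmers(sequences, 3, min_df=2)    # ["ACG"]
vocab_rare, _ = score_kmers(sequences, 3, max_df=0.5)    # "ACG" is dropped
_, rows_l2 = score_kmers(sequences, 3, tfidf=True, norm="l2")
```

In this toy input, `ACG` occurs in three of four sequences, so `min_df=2` keeps only that kmer, while `max_df=0.5` (at most two sequences) removes it.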
Outputs¶
- kmer_table:
FeatureTable[Frequency] Frequencies of kmers per sample. [required]
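How sequences and a per-sample frequency table could combine into a kmer table is sketched below in plain Python. The aggregation rule is an assumption: each sequence's kmer counts are weighted by that sequence's frequency in the sample and then summed. The names `count_kmers` and `kmerize_table`, the dictionary-based table layout, and the toy data are all hypothetical.

```python
from collections import Counter


def count_kmers(sequence, kmer_size):
    """Count all overlapping substrings of length kmer_size."""
    return Counter(sequence[i:i + kmer_size]
                   for i in range(len(sequence) - kmer_size + 1))


def kmerize_table(sequences, table, kmer_size):
    """Assumed aggregation: kmer counts weighted by per-sample sequence frequency.

    sequences: {sequence_id: sequence string}
    table:     {sample_id: {sequence_id: frequency}}
    """
    per_sequence = {sid: count_kmers(seq, kmer_size)
                    for sid, seq in sequences.items()}
    kmer_table = {}
    for sample, freqs in table.items():
        totals = Counter()
        for sid, freq in freqs.items():
            if freq == 0:
                continue  # absent sequences contribute no kmers
            for kmer, n in per_sequence[sid].items():
                totals[kmer] += n * freq
        kmer_table[sample] = totals
    return kmer_table


sequences = {"seq1": "ACGTAC", "seq2": "CGTACG"}
table = {
    "sampleA": {"seq1": 2, "seq2": 0},
    "sampleB": {"seq1": 1, "seq2": 3},
}
result = kmerize_table(sequences, table, kmer_size=4)
# sampleB: CGTA appears once in each sequence, so 1*1 + 1*3 = 4
```

Because different sequences can share kmers, two samples with no sequences in common may still overlap in the resulting kmer table.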
kmerizer core-metrics¶
Generate kmer counts from sequences and apply a collection of diversity metrics (non-phylogenetic) to compare samples.
Inputs¶
- sequences:
FeatureData[Sequence | RNASequence | ProteinSequence] Biological sequences to kmerize. [required]
- table:
FeatureTable[Frequency] Frequencies of sequences per sample. [required]
Parameters¶
- sampling_depth:
Int%Range(1, None) The total frequency that each sample should be rarefied to prior to computing diversity metrics. [required]
- metadata:
Metadata The sample metadata to use in the emperor plots. [required]
- kmer_size:
Int Length of kmers to generate. [default: 16]
- tfidf:
Bool If True, kmers will be scored using TF-IDF and output frequencies will be weighted by scores. If False, kmers are counted without TF-IDF scores. [default: False]
- max_df:
Float%Range(0, 1, inclusive_end=True)|Int Ignore kmers that have a frequency strictly higher than the given threshold. A float represents a proportion of sequences; an integer represents an absolute count. [default: 1.0]
- min_df:
Float%Range(0, 1)|Int Ignore kmers that have a frequency strictly lower than the given threshold. A float represents a proportion of sequences; an integer represents an absolute count. [default: 1]
- max_features:
Int If not None, build a vocabulary that only considers the top max_features kmers ordered by frequency (or TF-IDF score). [optional]
- with_replacement:
Bool Rarefy with replacement by sampling from the multinomial distribution instead of rarefying without replacement. [default: False]
- n_jobs:
Int%Range(1, None)|Str%Choices('auto') [beta methods only] The number of concurrent jobs to use in performing this calculation. May not exceed the number of available physical cores. If n_jobs = 'auto', one job will be launched for each identified CPU core on the host. [default: 1]
- pc_dimensions:
Int Number of principal coordinate dimensions to keep for plotting. [default: 3]
- color_by:
Str Categorical measure from the input Metadata that should be used for color-coding the scatterplot. [optional]
- norm:
Str%Choices('None', 'l1', 'l2') Normalization procedure applied to TF-IDF scores. Ignored if tfidf=False. l2: Sum of squares of vector elements is 1. l1: Sum of absolute values of vector elements is 1. [default: 'None']
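The difference between the two rarefaction modes controlled by `sampling_depth` and `with_replacement` can be sketched for a single sample as follows. This is a pure-Python illustration under stated assumptions, not the pipeline's code: counts are integers, a sample whose total frequency is below the sampling depth is dropped (returned as `None` here), and the function name `rarefy` and its `seed` argument are hypothetical.

```python
import random
from collections import Counter


def rarefy(sample_counts, sampling_depth, with_replacement=False, seed=None):
    """Subsample one sample's integer counts to exactly sampling_depth."""
    rng = random.Random(seed)
    if sum(sample_counts.values()) < sampling_depth:
        return None  # assumption: samples that are too shallow are dropped

    features = list(sample_counts)
    if with_replacement:
        # Multinomial sampling: each draw picks a feature with probability
        # proportional to its count, independently of earlier draws.
        weights = [sample_counts[f] for f in features]
        draws = rng.choices(features, weights=weights, k=sampling_depth)
    else:
        # Without replacement: draw individual observations from the pool,
        # so no feature can exceed its original count.
        pool = [f for f in features for _ in range(sample_counts[f])]
        draws = rng.sample(pool, sampling_depth)
    return dict(Counter(draws))


counts = {"AAA": 5, "CCC": 3, "GGG": 2}
without = rarefy(counts, 6, seed=0)
with_repl = rarefy(counts, 6, with_replacement=True, seed=0)
too_shallow = rarefy(counts, 11)
```

Both modes return a table whose frequencies sum to the sampling depth. Only sampling with replacement can yield a count for a feature that is larger than its original count.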
Outputs¶
- rarefied_table:
FeatureTable[Frequency] The resulting rarefied feature table. [required]
- kmer_table:
FeatureTable[Frequency] Frequencies of kmers per sample. [required]
- observed_features_vector:
SampleData[AlphaDiversity] Vector of Observed Kmers values by sample. [required]
- shannon_vector:
SampleData[AlphaDiversity] Vector of Shannon diversity values by sample. [required]
- jaccard_distance_matrix:
DistanceMatrix Matrix of Jaccard distances between pairs of samples. [required]
- bray_curtis_distance_matrix:
DistanceMatrix Matrix of Bray-Curtis dissimilarities between pairs of samples. [required]
- jaccard_pcoa_results:
PCoAResults PCoA matrix computed from Jaccard distances between samples. [required]
- bray_curtis_pcoa_results:
PCoAResults PCoA matrix computed from Bray-Curtis dissimilarities between samples. [required]
- scatterplot:
Visualization Scatterplot of results. Axes can be selected to display alpha diversity results or PCoA coordinates computed from Jaccard or Bray-Curtis. [required]
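For readers unfamiliar with the metrics named in these outputs, the following sketch computes them from per-sample kmer counts using their textbook definitions. It is illustrative only and does not reproduce the pipeline's code. The function names and toy samples are hypothetical, and the base-2 logarithm for Shannon diversity is an assumption.

```python
import math


def observed_features(counts):
    """Number of kmers with a non-zero frequency in the sample."""
    return sum(1 for v in counts.values() if v > 0)


def shannon(counts, base=2):
    """Shannon diversity: -sum(p * log(p)) over observed kmers."""
    total = sum(counts.values())
    return -sum((v / total) * math.log(v / total, base)
                for v in counts.values() if v > 0)


def jaccard(a, b):
    """Jaccard distance on presence/absence: 1 - |shared| / |union|."""
    present_a = {k for k, v in a.items() if v > 0}
    present_b = {k for k, v in b.items() if v > 0}
    union = present_a | present_b
    if not union:
        return 0.0
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a - b| / (sum(a) + sum(b))."""
    keys = set(a) | set(b)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys) / total


sample_a = {"AAA": 2, "CCC": 2}
sample_b = {"CCC": 2, "GGG": 2}
n_observed = observed_features(sample_a)   # 2
h = shannon(sample_a)                      # two equally frequent kmers: 1 bit
d_jaccard = jaccard(sample_a, sample_b)    # 1 shared of 3 total: 2/3
d_bc = bray_curtis(sample_a, sample_b)     # (2 + 0 + 2) / 8 = 0.5
```

Jaccard uses only the presence or absence of each kmer, whereas Bray-Curtis also reflects differences in kmer frequency.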
- Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.