kmerizer

Plugin Overview

A plugin to generate kmers from biological sequences.

version: 2026.4.0
website: https://github.com/bokulich-lab/q2-kmerizer
user support:
Please post to the QIIME 2 forum for help with this plugin: https://forum.qiime2.org

Actions

Name            Type      Short Description
seqs-to-kmers   method    Generate kmers from sequences.
core-metrics    pipeline  Kmer counting and core diversity metrics (non-phylogenetic)


kmerizer seqs-to-kmers

Generate kmers from biological sequences.

Citations

Pedregosa et al., 2011

Inputs

sequences: FeatureData[Sequence | RNASequence | ProteinSequence]

Biological sequences to kmerize. [required]

table: FeatureTable[Frequency]

Frequencies of sequences per sample. [required]

Parameters

kmer_size: Int

Length of kmers to generate. [default: 16]
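As a rough illustration of what kmer_size controls, the sketch below slides a window of length kmer_size across one sequence and counts every overlapping kmer. The function name `count_kmers` is hypothetical and not part of the plugin; the plugin's own implementation may differ.

```python
from collections import Counter


def count_kmers(sequence: str, kmer_size: int) -> Counter:
    """Count overlapping kmers of length kmer_size in one sequence."""
    if kmer_size < 1:
        raise ValueError("kmer_size must be a positive integer")
    # A sequence of length n yields n - kmer_size + 1 overlapping kmers
    # (none at all if the sequence is shorter than kmer_size).
    return Counter(
        sequence[i:i + kmer_size]
        for i in range(len(sequence) - kmer_size + 1)
    )


counts = count_kmers("ACGTACGT", kmer_size=4)
print(counts)
```

Larger values of kmer_size give more specific kmers, but fewer of them per sequence.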

tfidf: Bool

If True, kmers will be scored using TF-IDF and output frequencies will be weighted by scores. If False, kmers are counted without TF-IDF scores. [default: False]

max_df: Float % Range(0, 1, inclusive_end=True) | Int

Ignore kmers that have a frequency strictly higher than the given threshold. If a float, the parameter represents a proportion of sequences; if an integer, it represents an absolute count. [default: 1.0]

min_df: Float % Range(0, 1) | Int

Ignore kmers that have a frequency strictly lower than the given threshold. If a float, the parameter represents a proportion of sequences; if an integer, it represents an absolute count. [default: 1]

max_features: Int

If not None, build a vocabulary that only considers the top max_features kmers ordered by frequency (or TF-IDF score). [optional]
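The interaction of min_df, max_df and max_features can be sketched as follows. This sketch assumes that "frequency" in the threshold descriptions means the number of sequences containing a kmer (the document-frequency convention used by scikit-learn vectorizers), and that max_features ranks kmers by total count. The helper names `resolve_threshold` and `select_vocabulary` are hypothetical.

```python
from collections import Counter


def resolve_threshold(value, n_seqs):
    """Read floats as a proportion of sequences and integers as counts."""
    return value * n_seqs if isinstance(value, float) else value


def select_vocabulary(seq_kmers, min_df=1, max_df=1.0, max_features=None):
    """Filter kmers by document frequency, then keep the most frequent ones."""
    n_seqs = len(seq_kmers)
    # Document frequency: number of sequences containing each kmer.
    df = Counter(kmer for counts in seq_kmers for kmer in counts)
    low = resolve_threshold(min_df, n_seqs)
    high = resolve_threshold(max_df, n_seqs)
    # Total count of each surviving kmer across all sequences.
    totals = Counter()
    for counts in seq_kmers:
        for kmer, count in counts.items():
            # Drop kmers strictly below `low` or strictly above `high`.
            if low <= df[kmer] <= high:
                totals[kmer] += count
    # most_common(None) returns everything, so None means "no limit".
    return [kmer for kmer, _ in totals.most_common(max_features)]


# Toy per-sequence 2-mer counts for three hypothetical sequences.
seq_kmers = [
    Counter({"AC": 2, "CG": 1}),
    Counter({"AC": 1, "GT": 5}),
    Counter({"AC": 1, "CG": 3}),
]
print(select_vocabulary(seq_kmers, min_df=2))        # "GT" is too rare
print(select_vocabulary(seq_kmers, max_df=0.9))      # "AC" is too common
print(select_vocabulary(seq_kmers, max_features=1))  # single top kmer
```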

norm: Str % Choices('None', 'l1', 'l2')

Normalization procedure applied to TF-IDF scores. Ignored if tfidf=False. l2: Sum of squares of vector elements is 1. l1: Sum of absolute values of vector elements is 1. [default: 'None']
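To make the tfidf and norm options concrete, here is a minimal sketch of TF-IDF weighting followed by optional normalization. It assumes the smoothed formulation that scikit-learn (Pedregosa et al., 2011) uses by default, idf = ln((1 + n) / (1 + df)) + 1, and assumes that normalization is applied to each sequence's score vector; whether the plugin uses exactly these settings is not stated on this page. The function name `tfidf_scores` is hypothetical.

```python
import math
from collections import Counter


def tfidf_scores(seq_kmers, norm="None"):
    """Weight per-sequence kmer counts by smoothed IDF, then normalize."""
    n_seqs = len(seq_kmers)
    # Document frequency: number of sequences containing each kmer.
    df = Counter(kmer for counts in seq_kmers for kmer in counts)
    # Smoothed IDF (assumed formulation): ln((1 + n) / (1 + df)) + 1.
    idf = {k: math.log((1 + n_seqs) / (1 + d)) + 1 for k, d in df.items()}
    scored = []
    for counts in seq_kmers:
        row = {k: c * idf[k] for k, c in counts.items()}
        if norm == "l1":
            scale = sum(abs(v) for v in row.values())
        elif norm == "l2":
            scale = math.sqrt(sum(v * v for v in row.values()))
        elif norm == "None":
            scale = 1.0
        else:
            raise ValueError(f"unknown norm: {norm!r}")
        # Skip the division for empty rows, whose scale is zero.
        if scale:
            row = {k: v / scale for k, v in row.items()}
        scored.append(row)
    return scored


# "AC" occurs in both toy sequences, "GT" in only one, so "GT" is
# up-weighted relative to its raw count.
toy = [Counter({"AC": 2, "GT": 1}), Counter({"AC": 1})]
raw = tfidf_scores(toy)
l1 = tfidf_scores(toy, norm="l1")
l2 = tfidf_scores(toy, norm="l2")
```

With norm='l1' each row of scores sums to 1, and with norm='l2' the squared scores sum to 1, matching the parameter description.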

Outputs

kmer_table: FeatureTable[Frequency]

Frequencies of kmers per sample. [required]
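One plausible reading of how the two inputs combine into kmer_table is that each sequence's kmer counts are multiplied by that sequence's frequency in a sample and summed over all sequences in the sample. The sketch below assumes this reading and ignores TF-IDF and vocabulary filtering; the plugin source is the authority on the actual computation. The function `kmers_per_sample` and the toy identifiers are hypothetical.

```python
from collections import Counter


def kmers_per_sample(sequences, table, kmer_size):
    """Sum per-sequence kmer counts, weighted by sequence frequency per sample."""
    result = {}
    for sample_id, seq_freqs in table.items():
        sample_counts = Counter()
        for seq_id, freq in seq_freqs.items():
            seq = sequences[seq_id]
            for i in range(len(seq) - kmer_size + 1):
                # Each kmer occurrence contributes the sequence's frequency.
                sample_counts[seq[i:i + kmer_size]] += freq
        result[sample_id] = sample_counts
    return result


# Hypothetical toy inputs: two sequences observed across two samples.
sequences = {"seq1": "ACGT", "seq2": "CGTT"}
table = {"sampleA": {"seq1": 2, "seq2": 1}, "sampleB": {"seq2": 3}}
kmer_table = kmers_per_sample(sequences, table, kmer_size=3)
print(kmer_table["sampleA"])
```

Because both toy sequences contain CGT, that kmer accumulates counts from both in sampleA, which is how related sequences come to share features in the kmer table.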


kmerizer core-metrics

Generate kmer counts from sequences and apply a collection of diversity metrics (non-phylogenetic) to compare samples.

Inputs

sequences: FeatureData[Sequence | RNASequence | ProteinSequence]

Biological sequences to kmerize. [required]

table: FeatureTable[Frequency]

Frequencies of sequences per sample. [required]

Parameters

sampling_depth: Int % Range(1, None)

The total frequency that each sample should be rarefied to prior to computing diversity metrics. [required]

metadata: Metadata

The sample metadata to use in the Emperor plots. [required]

kmer_size: Int

Length of kmers to generate. [default: 16]

tfidf: Bool

If True, kmers will be scored using TF-IDF and output frequencies will be weighted by scores. If False, kmers are counted without TF-IDF scores. [default: False]

max_df: Float % Range(0, 1, inclusive_end=True) | Int

Ignore kmers that have a frequency strictly higher than the given threshold. If a float, the parameter represents a proportion of sequences; if an integer, it represents an absolute count. [default: 1.0]

min_df: Float % Range(0, 1) | Int

Ignore kmers that have a frequency strictly lower than the given threshold. If a float, the parameter represents a proportion of sequences; if an integer, it represents an absolute count. [default: 1]

max_features: Int

If not None, build a vocabulary that only considers the top max_features kmers ordered by frequency (or TF-IDF score). [optional]

with_replacement: Bool

Rarefy with replacement by sampling from the multinomial distribution instead of rarefying without replacement. [default: False]

n_jobs: Int % Range(1, None) | Str % Choices('auto')

[beta methods only] - The number of concurrent jobs to use in performing this calculation. May not exceed the number of available physical cores. If n_jobs = 'auto', one job will be launched for each identified CPU core on the host. [default: 1]

pc_dimensions: Int

Number of principal coordinate dimensions to keep for plotting. [default: 3]

color_by: Str

Categorical measure from the input Metadata that should be used for color-coding the scatterplot. [optional]

norm: Str % Choices('None', 'l1', 'l2')

Normalization procedure applied to TF-IDF scores. Ignored if tfidf=False. l2: Sum of squares of vector elements is 1. l1: Sum of absolute values of vector elements is 1. [default: 'None']
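The difference between the two rarefaction modes selected by sampling_depth and with_replacement can be sketched for a single sample as follows. This is an illustration using the Python standard library (Python 3.9 or later for the `counts` argument of `random.sample`), not the pipeline's implementation; the function name `rarefy` and the `seed` argument are hypothetical, and this sketch raises an error for samples shallower than sampling_depth rather than handling them as the pipeline does.

```python
import random
from collections import Counter


def rarefy(counts, sampling_depth, with_replacement=False, seed=0):
    """Subsample one sample's feature counts to a fixed total frequency."""
    rng = random.Random(seed)
    features = list(counts)
    weights = [counts[f] for f in features]
    if with_replacement:
        # Multinomial draw: each pick is proportional to observed frequency,
        # so a feature may end up with more counts than it started with.
        picks = rng.choices(features, weights=weights, k=sampling_depth)
    else:
        if sampling_depth > sum(weights):
            raise ValueError("sample has fewer observations than sampling_depth")
        # Draw individual observations without putting them back.
        picks = rng.sample(features, k=sampling_depth, counts=weights)
    return Counter(picks)


sample = {"ACGT": 5, "CGTA": 3, "GTAC": 2}
without = rarefy(sample, sampling_depth=6)
with_repl = rarefy(sample, sampling_depth=6, with_replacement=True)
```

Both modes return exactly sampling_depth observations; only the without-replacement mode guarantees that no feature exceeds its original count.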

Outputs

rarefied_table: FeatureTable[Frequency]

The resulting rarefied feature table. [required]

kmer_table: FeatureTable[Frequency]

Frequencies of kmers per sample. [required]

observed_features_vector: SampleData[AlphaDiversity]

Vector of Observed Kmers values by sample. [required]

shannon_vector: SampleData[AlphaDiversity]

Vector of Shannon diversity values by sample. [required]
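For reference, the two alpha diversity measures can be sketched for one sample's kmer counts as below. Observed kmers is the number of kmers with a non-zero count, and Shannon diversity is H = -sum(p_i * log(p_i)) over the kmer proportions p_i. The logarithm base of 2 is an assumption of this sketch; the base used by the pipeline is not stated on this page.

```python
import math


def observed_features(counts):
    """Number of features (kmers) with a non-zero count."""
    return sum(1 for c in counts if c > 0)


def shannon(counts, base=2.0):
    """Shannon entropy of the non-zero proportions (base is an assumption)."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


# Toy kmer counts for one sample: proportions 0.25, 0.25, 0 and 0.5.
sample = [4, 4, 0, 8]
print(observed_features(sample))
print(shannon(sample))
```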

jaccard_distance_matrix: DistanceMatrix

Matrix of Jaccard distances between pairs of samples. [required]

bray_curtis_distance_matrix: DistanceMatrix

Matrix of Bray-Curtis dissimilarities between pairs of samples. [required]
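The two beta diversity measures differ in what they use from the kmer table: Jaccard distance considers only kmer presence or absence, while Bray-Curtis dissimilarity also uses abundances. A minimal sketch of the standard definitions for one pair of samples follows; returning 0 for two empty samples is a choice made here, not something this page specifies.

```python
def jaccard_distance(a, b):
    """1 - (shared features) / (features present in either sample)."""
    present_a = {i for i, x in enumerate(a) if x > 0}
    present_b = {i for i, x in enumerate(b) if x > 0}
    union = present_a | present_b
    if not union:
        return 0.0  # two empty samples: defined as 0 in this sketch
    return 1 - len(present_a & present_b) / len(union)


def bray_curtis(a, b):
    """sum(|a_i - b_i|) / sum(a_i + b_i)."""
    total = sum(a) + sum(b)
    if total == 0:
        return 0.0  # two empty samples: defined as 0 in this sketch
    return sum(abs(x - y) for x, y in zip(a, b)) / total


# Toy kmer count vectors for two samples over the same four kmers.
s1 = [2, 0, 3, 1]
s2 = [1, 4, 0, 1]
print(jaccard_distance(s1, s2))
print(bray_curtis(s1, s2))
```

Both measures range from 0 (identical composition) to 1 (no shared kmers).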

jaccard_pcoa_results: PCoAResults

PCoA matrix computed from Jaccard distances between samples. [required]

bray_curtis_pcoa_results: PCoAResults

PCoA matrix computed from Bray-Curtis dissimilarities between samples. [required]

scatterplot: Visualization

Scatterplot of results. Axes can be selected to display alpha diversity results or PCoA coordinates computed from Jaccard or Bray-Curtis. [required]

References
  1. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, É. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12(Oct), 2825–2830.