How to import data for use with QIIME 2

A QIIME 2 analysis almost always starts with importing data, and unfortunately this can be one of the most difficult steps for users. Importing creates a QIIME 2 artifact from data in another format, such as a .biom file or one or more .fastq files. You can learn why this is needed, and why it’s hard, in Why importing is necessary.

The first step in determining how to import your data is figuring out what artifact class you’re trying to import. The following sections present importing of different artifact classes, ordered by how common we think they are to import in practice[1]. You don’t need to read all of the sections below; instead, jump to the section that describes the data that you’re trying to import. We also don’t present all of the importable artifact classes here, as many of them are rarely imported (they instead represent intermediary data used by QIIME 2). Using the command line interface, you can run qiime tools list-types to see a list of all importable artifact classes.

As always, we’re here to help on the forum if you get stuck. There are lots of existing discussions about importing. If you haven’t found an answer to your question after searching, post a description of the data that you’re trying to import and we’ll help you figure out how to proceed.

Importing “fastq sequencing data”

Most users begin their QIIME 2 analysis with “raw sequencing data” in the form of .fastq.gz files.

Raw microbiome data typically exists in one of two forms: multiplexed or demultiplexed. In multiplexed data, sequences from all samples are grouped together in one or more files. In demultiplexed data, sequences are separated into different files based on the sample they are derived from. Data can be demultiplexed before it’s delivered to you, or it can be delivered still multiplexed, in which case you can use QIIME 2 to demultiplex the data.

Demultiplexed sequence data

“Fastq manifest” formats

We recommend importing demultiplexed data using a fastq manifest file. This format is not specific to any sequencing instrument, but rather is generally used for importing demultiplexed fastq data. This file should be easy for your sequencing center to generate for you, so you can ask them to provide it or you can generate one yourself.

A fastq manifest file is a type of sample metadata file that maps sample identifiers to one or two absolute filepaths pointing at .fastq.gz (or .fastq) files, depending on whether you’re importing data from a single- or paired-end run.

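The following example presents a fastq manifest file for single-end read data. A manifest for paired-end read data has the same layout but lists two filepaths per sample, one for the forward reads and one for the reverse reads.

Single-end reads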
sample-id absolute-filepath
sample-1  /scratch/microbiome/sample1_R1.fastq.gz
sample-2  /scratch/microbiome/sample2_R1.fastq.gz

You’re almost certainly interested in one of two variants of this format: one for single-end read data and one for paired-end read data.
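If you need to generate a paired-end manifest yourself, the mapping from sample identifiers to filepaths can be sketched in a few lines of shell. This is a minimal sketch, not a QIIME 2 tool: make_paired_manifest is a hypothetical helper, and it assumes your files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz in a single directory. The column names forward-absolute-filepath and reverse-absolute-filepath are the ones we'd expect for a paired-end manifest, but they are not shown in the example above, so confirm them against the format documentation for your QIIME 2 version.

```shell
# Hypothetical helper: write a paired-end fastq manifest for one directory.
# Assumes files are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
make_paired_manifest() {
    reads_dir=$1
    manifest=$2

    # Manifests need absolute filepaths, so resolve the directory first.
    abs_dir=$(CDPATH= cd -- "$reads_dir" && pwd) || return 1

    # Header row (tab-separated); column names are an assumption, see above.
    printf 'sample-id\tforward-absolute-filepath\treverse-absolute-filepath\n' > "$manifest"

    for fwd in "$abs_dir"/*_R1.fastq.gz; do
        # Skip the unexpanded pattern when no files match.
        [ -e "$fwd" ] || continue

        sample=$(basename "$fwd" _R1.fastq.gz)
        rev="$abs_dir/${sample}_R2.fastq.gz"

        # Only write samples that have both forward and reverse reads.
        if [ ! -e "$rev" ]; then
            echo "warning: no reverse reads for $sample, skipping" >&2
            continue
        fi

        printf '%s\t%s\t%s\n' "$sample" "$fwd" "$rev" >> "$manifest"
    done
}
```

For example, make_paired_manifest /scratch/microbiome fq-manifest.tsv would write one row for each sample in that directory that has both read files. Sample identifiers are taken directly from the file names, so rename files first if the identifiers need to differ.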

The following import commands should allow you to import your demultiplexed sequences, assuming you have a fastq manifest file named fq-manifest.tsv.

Single-end reads, PHRED 33
qiime tools import \
 --type 'SampleData[SequencesWithQuality]' \
 --input-path fq-manifest.tsv \
 --output-path demux.qza \
 --input-format SingleEndFastqManifestPhred33V2
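For paired-end reads, the type and the input format change together. The following is a sketch under stated assumptions: the manifest lists a forward and a reverse filepath for each sample, quality scores are PHRED 33, and demux-paired.qza is a hypothetical output name. The format name PairedEndFastqManifestPhred33V2 does not appear in the command above, so confirm it against your QIIME 2 version. The checks around the command are only there so the snippet fails gently when QIIME 2 or the manifest is missing.

```shell
manifest=fq-manifest.tsv     # paired-end manifest, named as in the text above
output=demux-paired.qza      # hypothetical output name
artifact_type='SampleData[PairedEndSequencesWithQuality]'
input_format=PairedEndFastqManifestPhred33V2   # assumed paired-end counterpart

if ! command -v qiime >/dev/null 2>&1; then
    echo "qiime not found; activate your QIIME 2 environment first" >&2
elif [ ! -f "$manifest" ]; then
    echo "manifest $manifest not found" >&2
else
    qiime tools import \
      --type "$artifact_type" \
      --input-path "$manifest" \
      --output-path "$output" \
      --input-format "$input_format"
fi
```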

Casava 1.8 paired-end demultiplexed fastq

The Casava 1.8 paired-end demultiplexed fastq format is a format for demultiplexed sequence data that is very specific to the software used to create it. There are two .fastq.gz files for each sample in the study: one containing the forward reads and one containing the reverse reads for that sample. The file name includes the sample identifier. The forward and reverse read file names for a single sample might look like sample-1_15_L001_R1_001.fastq.gz and sample-1_15_L001_R2_001.fastq.gz, respectively. The underscore-separated fields in this file name are:

  1. the sample identifier,

  2. the barcode sequence or a barcode identifier,

  3. the lane number,

  4. the direction of the read (i.e. R1 or R2), and

  5. the set number.
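To check whether your file names follow this layout, the five fields can be split out with a small shell function. This is an illustration only: parse_casava_name is a hypothetical helper, not part of QIIME 2, and peeling fields off from the right (so that underscores inside the sample identifier are tolerated) is an assumption of this sketch rather than a statement about how QIIME 2 parses names.

```shell
# Hypothetical helper: split a Casava 1.8 style file name into its five fields.
# Prints: sample identifier, barcode, lane, read direction, set number
# (tab-separated).
parse_casava_name() {
    name=$(basename "$1" .fastq.gz)

    # Peel fields off the right-hand side, one underscore at a time.
    set_number=${name##*_}; name=${name%_*}
    direction=${name##*_};  name=${name%_*}
    lane=${name##*_};       name=${name%_*}
    barcode=${name##*_};    name=${name%_*}

    # Whatever is left is treated as the sample identifier.
    sample_id=$name

    printf '%s\t%s\t%s\t%s\t%s\n' \
        "$sample_id" "$barcode" "$lane" "$direction" "$set_number"
}
```

Calling parse_casava_name sample-1_15_L001_R1_001.fastq.gz prints sample-1, 15, L001, R1, and 001 as tab-separated fields. If a file name yields fields that don't look like a lane, a read direction, and a set number, the data probably isn't in this format.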

If you’re lucky enough to have data that is in this format exactly, you can import it as follows (assuming it’s in a directory called my-sequence-data):

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path my-sequence-data/ \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux.qza

More often, if you have demultiplexed sequence data, you’ll need to import using a fastq manifest file.

Importing “-omics feature tables”

FeatureTable is almost certainly the most frequently used artifact class in QIIME 2. There are many actions that work on feature tables, and most support arbitrary individual[2] -omics data types. If you generate a feature table outside of QIIME 2, you can import it for use with these actions.

The main thing you need to know to import your feature table into QIIME 2 is the type of data it contains. Counts of features (e.g., ASVs, genes, pathways, proteins, metabolites, ...) on a per sample basis are described with the subclass Frequency. Relative frequencies (i.e., fractions such that the sum across all features in a sample is 1.0) are described with the subclass RelativeFrequency. There are others, but those are the most common.

Importing from .biom (v2.1.0, default)

BIOM v2.1.0 is currently the default input format, so a feature table containing frequencies (i.e., FeatureTable[Frequency]) can be imported without specifying --input-format:

qiime tools import \
  --input-path feature-table-v210.biom \
  --type 'FeatureTable[Frequency]' \
  --output-path feature-table.qza
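If your table contains relative frequencies rather than counts, only the --type value changes. A minimal sketch, assuming a BIOM v2.1.0 file; both file names below are hypothetical, and the surrounding checks only keep the snippet from failing noisily when QIIME 2 or the input file is missing:

```shell
input=relative-table-v210.biom        # hypothetical input file name
output=relative-feature-table.qza     # hypothetical output file name
artifact_type='FeatureTable[RelativeFrequency]'

if ! command -v qiime >/dev/null 2>&1; then
    echo "qiime not found; activate your QIIME 2 environment first" >&2
elif [ ! -f "$input" ]; then
    echo "input file $input not found" >&2
else
    qiime tools import \
      --input-path "$input" \
      --type "$artifact_type" \
      --output-path "$output"

    # Optional sanity check: report the type and format of the new artifact.
    qiime tools peek "$output"
fi
```

Choosing the type that matches the data matters because actions declare which kinds of feature table they accept, so a table of fractions imported as Frequency may be passed to methods that expect counts.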

Importing from .biom (v1.0.0)

BIOM v1.0.0 is not the default input format, so it must be specified explicitly with --input-format:

qiime tools import \
  --input-path feature-table-v100.biom \
  --type 'FeatureTable[Frequency]' \
  --output-path feature-table.qza \
  --input-format BIOMV100Format

Importing from other feature table formats

If you have a feature table in a format that is not one of the two listed above, we’re working on additional options for importing. In the meantime, see forum posts on this topic.

Importing metadata (tl;dr: metadata doesn’t get imported)

Sample metadata and feature metadata don’t need to be imported, but rather can be loaded and used directly from .tsv files. To learn more about metadata in QIIME 2, refer to Using QIIME 2’s Metadata file format.

Footnotes
  1. It would be possible for us to compile this information because imports are recorded in data provenance, and we could assess that across the user community when provenance is parsed by QIIME 2 View. To date though, we’ve never collected usage information (or any other data) through QIIME 2 View.

  2. If your feature table integrates different -omics data types, specialized methods may be required. It’s best to reach out on the forum for input here.