[UGENE-6254] Add "Classify Sequences with MetaPhlAn2" workflow element

Details

Type: New Feature
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: 1.31
Fix Version/s: 1.32
Component/s: NGS, Workflow
Labels:
- classification
Environment:

Linux 64-bit, macOS

Story Points:
2
Epic Link:
MetaPhlAn2
Sprint:
DEV-32-2, DEV-32-3
Affect Type:
Userdefined

Description

Input data

There is one input port.

Port name in GUI: Input sequences

Port description:

Provide URL(s) to FASTQ or FASTA file(s). For SE reads or contigs, use the "Input URL 1" slot only. For PE reads, pass the "left" reads to "Input URL 1" and the "right" reads to "Input URL 2". See also the "Input data" parameter of the element.

Port ID in UWL: in
Slot #1:
- Name in GUI: Input URL1
- Slot ID in UWL: url1
- Slot type: String
Slot #2:
- Name in GUI: Input URL2
- Slot ID in UWL: url2
- Slot type: String

Output data

There is one output port.

Port name in GUI: MetaPhlAn2 Classification

Port description:

A map of sequence names with the associated taxonomy IDs, produced from MetaPhlAn2 "bowtie2out" output.

Port ID in UWL: out
Slot:
- Name in GUI: Taxonomy classification data
- ID in UWL: tax_data
- Slot type: Taxonomy classification
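
The conversion from the "bowtie2out" file to the output map could be sketched as follows. This is only an illustration under stated assumptions: the "bowtie2out" file is assumed to contain two tab-separated columns (sequence name and marker name), `marker_to_taxid` is a hypothetical lookup table that would have to be derived from the database, and using 0 for sequences without a known taxonomy ID is also an assumption.

```python
from typing import Dict, Iterable


def parse_bowtie2out(lines: Iterable[str]) -> Dict[str, str]:
    """Map each sequence name to the marker it was aligned to.

    Assumes two tab-separated columns per line: sequence name, marker name.
    If a sequence name occurs more than once, the first hit is kept.
    """
    hits: Dict[str, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        read_name, sep, marker = line.partition("\t")
        if not sep:
            continue  # skip lines without a tab separator
        hits.setdefault(read_name, marker)
    return hits


def classify(hits: Dict[str, str], marker_to_taxid: Dict[str, int]) -> Dict[str, int]:
    """Translate marker hits into taxonomy IDs.

    marker_to_taxid is a hypothetical lookup table; markers that are not
    in it are reported with taxonomy ID 0 (an assumption of this sketch).
    """
    return {read: marker_to_taxid.get(marker, 0) for read, marker in hits.items()}
```

In a real element the lines would come from the file passed to "--bowtie2out", for example `parse_bowtie2out(open(path))`.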

Element description

Description on the Scene:
```
Classify sequences from unset with MetaPhlAn2, use unset database
```
Here the first "unset" corresponds to the input port value, and the second "unset" corresponds to the Bowtie2 index base name of the database (similarly to the "Classify Sequences with CLARK" element).

Description in the Property Editor:

MetaPhlAn2 (METAgenomic PHyLogenetic ANalysis) is a tool for profiling the composition of microbial communities (bacteria, archaea, eukaryotes, and viruses) from whole-metagenome shotgun sequencing data.

The tool relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).

Parameters

Input data

Values: "SE reads or contigs" (default), "PE reads"

Description:

To classify single-end (SE) reads or contigs obtained by de novo assembly of reads, set this parameter to "SE reads or contigs".

To classify paired-end (PE) reads, set the value to "PE reads".

One or two slots of the input port are used depending on the value of the parameter. Pass URL(s) to data to these slots.

The input files should be in FASTA or FASTQ format. See the "Input file format" parameter.

Function: for "SE reads or contigs", one input slot is added and a command is run for each input file (see below). For "PE reads", two slots are added and a command is run for each pair of input files (see below).
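
The rule "one command per file or per pair of files" could be sketched as follows. The function name `plan_runs` and the requirement that both PE lists have the same length are assumptions made for illustration, not part of the element specification.

```python
from typing import List, Optional, Sequence, Tuple

SE_MODE = "SE reads or contigs"
PE_MODE = "PE reads"


def plan_runs(
    mode: str,
    urls1: Sequence[str],
    urls2: Optional[Sequence[str]] = None,
) -> List[Tuple[str, ...]]:
    """Return one tuple of input files per command to be run."""
    if mode == SE_MODE:
        # Only the first slot is used: one command per input file.
        return [(url,) for url in urls1]
    if mode == PE_MODE:
        # Both slots are used: one command per pair of files.
        if urls2 is None or len(urls1) != len(urls2):
            raise ValueError("PE reads require the same number of files in both slots")
        return list(zip(urls1, urls2))
    raise ValueError(f"Unknown input data mode: {mode}")
```

For example, `plan_runs(PE_MODE, ["a_R1.fastq"], ["a_R2.fastq"])` yields a single run that receives both files.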

Input file format
- Values: "FASTA" (default), "FASTQ"
- Description:
```
Set the type of the input file (--input_type). Each input file usually contains many sequences to classify.
```
- Function:
  - in case of "FASTA", add "--input_type fasta" to the command
  - in case of "FASTQ", add "--input_type fastq" to the command

Database

Values: a folder URL

Description:

A path to a folder with the MetaPhlAn2 database: Bowtie2 index files, built from reference genomes, and a *.pkl file (--mpa_pkl, --bowtie2db).

By default, the "mpa_v20_m200" database is used (if it has been downloaded). The database was built on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).

Function:
- add the "--mpa_pkl" argument set to "the folder path/file name.pkl"
- add the "--bowtie2db" argument set to "the folder path"
Validation:
- The folder contains exactly six index files.
- The folder contains exactly one *.pkl file.
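
The validation rules above could be sketched as follows. The sketch assumes that the index files are recognized by the ".bt2" extension, which is the usual extension of Bowtie2 index files; the function name and the error messages are hypothetical.

```python
from pathlib import Path
from typing import List


def validate_database(folder: str) -> List[str]:
    """Return a list of validation errors; an empty list means the folder is valid."""
    root = Path(folder)
    if not root.is_dir():
        return [f"Not a folder: {folder}"]

    errors: List[str] = []
    # Assumption: Bowtie2 index files are identified by the ".bt2" extension.
    index_files = list(root.glob("*.bt2"))
    pkl_files = list(root.glob("*.pkl"))

    if len(index_files) != 6:
        errors.append(f"Expected exactly 6 index files, found {len(index_files)}")
    if len(pkl_files) != 1:
        errors.append(f"Expected exactly 1 *.pkl file, found {len(pkl_files)}")
    return errors
```

Returning all errors at once, rather than stopping at the first one, lets the element report every problem with the folder in a single validation pass.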

Number of threads
- Values: from 1 to the value of the "Optimize for CPU count" setting, specified on the "Resources" tab of the UGENE Application Settings.
- Description:
```
The number of CPUs to use for parallelizing the mapping (--nproc).
```
- Function: add "--nproc N" argument to the command, where N is the specified value.

Bowtie2 output file
- Values: "Auto" or a file name
- Description:
```
The file for saving the output of Bowtie2 (--bowtie2out). In case of PE reads, one file is created for each pair of files.
```
- Function: add "--bowtie2out" argument. By default, the name of the output file is determined based on the input file name or pair of files (the standard way in UGENE, see other classification elements, for example). The generated name should be "samplename_bowtie2out.txt".
Output file
- Values: "Auto" or a file name
- Description:
```
The tab-separated output file of the predicted taxon relative abundances.
```
- Function: specify the output file ("> filename") at the end of the command. By default, the name is generated automatically like "samplename_profile.txt".
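
The "Auto" names for both output files could be generated along these lines. The actual UGENE naming routine is not described here, so the way the sample name is derived (file name without extension for one file, common prefix without a trailing separator and "R" for a pair) is an assumption chosen to reproduce the example names "small_stool" and "small_P00134".

```python
import os
import re
from typing import Sequence, Tuple


def sample_name(input_files: Sequence[str]) -> str:
    """Derive a sample name from one input file or from a pair of files."""
    stems = [os.path.splitext(os.path.basename(f))[0] for f in input_files]
    if len(stems) == 1:
        return stems[0]
    # Heuristic (assumption): use the common prefix of the pair and drop a
    # trailing read marker such as "-R", "_R" or a bare separator.
    prefix = os.path.commonprefix(stems)
    return re.sub(r"[-_.]?R?$", "", prefix) or stems[0]


def auto_output_names(input_files: Sequence[str], out_dir: str) -> Tuple[str, str]:
    """Return the (Bowtie2 output file, profile output file) paths."""
    name = sample_name(input_files)
    return (
        os.path.join(out_dir, name + "_bowtie2out.txt"),
        os.path.join(out_dir, name + "_profile.txt"),
    )
```

The heuristic also strips a trailing "R" from a sample name that really ends with it, so a production implementation would need a stricter rule.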

Temporary folder

Add the "--tmp_dir" argument to the command, with a value that corresponds to the folder for temporary files specified in the UGENE Application Settings.

Command

In case of SE reads, the full command looks as follows:

./metaphlan2.py /path/small_stool.fa --nproc 4 --tmp_dir /path --input_type fasta --bowtie2out /path/small_stool_bowtie2out.txt --mpa_pkl /path/mpa_v20_m200/mpa_v20_m200.pkl --bowtie2db /path/mpa_v20_m200 > /path/small_stool_profile.txt

In case of PE reads:

./metaphlan2.py /path/small_P00134-R1.fastq,/path/small_P00134-R2.fastq --nproc 4 --tmp_dir /path/tmp --input_type fastq --bowtie2out /path/small_P00134_bowtie2out.txt --mpa_pkl /path/mpa_v20_m200/mpa_v20_m200.pkl --bowtie2db /path/mpa_v20_m200 > /path/small_P00134_profile.txt
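
Assembling these commands from the element parameters could be sketched as follows. The function names, the parameter names, and the default executable path are hypothetical; only the command-line arguments themselves are taken from the examples above.

```python
import subprocess
from typing import List, Sequence


def build_command(
    inputs: Sequence[str],
    input_type: str,
    db_dir: str,
    pkl_name: str,
    bowtie2out: str,
    nproc: int,
    tmp_dir: str,
    executable: str = "./metaphlan2.py",
) -> List[str]:
    """Build the argument list for one run (one file or one pair of files)."""
    return [
        executable,
        ",".join(inputs),  # PE files are passed as one comma-separated value
        "--nproc", str(nproc),
        "--tmp_dir", tmp_dir,
        "--input_type", input_type,
        "--bowtie2out", bowtie2out,
        "--mpa_pkl", f"{db_dir}/{pkl_name}",
        "--bowtie2db", db_dir,
    ]


def run(command: Sequence[str], profile_path: str) -> int:
    """Run the command and redirect its standard output to the profile file."""
    with open(profile_path, "w") as out:
        return subprocess.run(list(command), stdout=out, check=False).returncode
```

Passing the arguments as a list and redirecting standard output from Python avoids shell quoting problems with paths that contain spaces.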

Sample data, output data, and the database can be found on the file server in the folder "/data/test_data/UGENE-6254".

Test plan

See tests "Tests: Metagenomics > MetaPhlAn > Workflow element > Core".

Issue Links

relates to
- UGENE-6264 Online installer: MetaPhlAn2 component and license, mpa_v20_m200 database (Closed)
- UGENE-6265 Use "mpa_v20_m200" (if available) in the MetaPhlAn2 workflow element (Closed)
- UGENE-6294 Add MetaPhlAn2 to sample workflows (Closed)
