UGENE-6254

Add "Classify Sequences with MetaPhlAn2" workflow element


    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.31
    • Fix Version/s: 1.32
    • Component/s: NGS, Workflow
    • Labels:
    • Environment:

      Linux 64-bit, macOS

    • Story Points:
      2
    • Epic Link:
    • Sprint:
      DEV-32-2, DEV-32-3
    • Affect Type:
      Userdefined

      Description

      Input data

      There is one input port.

      • Port name in GUI: Input sequences
      • Port description:
        URL(s) to FASTQ or FASTA file(s) should be provided. For SE reads or contigs, use the "Input URL 1" slot only. For PE reads, pass the "left" reads to "Input URL 1" and the "right" reads to "Input URL 2". See also the "Input data" parameter of the element.
        
      • Port ID in UWL: in
      • Slot #1:
        • Name in GUI: Input URL 1
        • Slot ID in UWL: url1
        • Slot type: String
      • Slot #2:
        • Name in GUI: Input URL 2
        • Slot ID in UWL: url2
        • Slot type: String

      Output data

      There is one output port.

      • Port name in GUI: MetaPhlAn2 Classification
      • Port description:
        A map of sequence names with the associated taxonomy IDs, produced from MetaPhlAn2 "bowtie2out" output.
        
      • Port ID in UWL: out
      • Slot:
        • Name in GUI: Taxonomy classification data
        • ID in UWL: tax_data
        • Slot type: Taxonomy classification
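
As a rough illustration of the map described for the output port, the sketch below reads a "bowtie2out" file into a dictionary. It assumes that each non-empty line holds a sequence name and a marker name separated by a tab. That layout, the function name `read_bowtie2out`, and the sample values are assumptions made for illustration only. Translating markers into taxonomy IDs (which relies on the database *.pkl file) is not shown.

```python
import io


def read_bowtie2out(lines):
    """Map each sequence name to the marker it was aligned to.

    Assumes 'sequence_name<TAB>marker_name' lines (an assumption,
    not a confirmed description of the MetaPhlAn2 format).
    """
    mapping = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue  # skip blank lines
        name, marker = line.split("\t", 1)
        mapping[name] = marker
    return mapping


# Hypothetical sample content, not real MetaPhlAn2 output.
sample = io.StringIO("read_1\tmarker_A\nread_2\tmarker_B\n")
classification = read_bowtie2out(sample)
```

A dictionary keeps the lookup by sequence name simple; the real element would store taxonomy IDs rather than marker names as values.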

      Element description

      • Description on the Scene:
        Classify sequences from unset with MetaPhlAn2, use unset database
        

        Here the first "unset" corresponds to the input port value and the second "unset" corresponds to the Bowtie2 index base name of the database (similar to the "Classify Sequences with CLARK" element).

      • Description in the Property Editor:
        MetaPhlAn2 (METAgenomic PHyLogenetic ANalysis) is a tool for profiling the composition of microbial communities (bacteria, archaea, eukaryotes, and viruses) from whole-metagenome shotgun sequencing data.
        
        The tool relies on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).
        

      Parameters

      1. Input data
        • Values: "SE reads or contigs" (default), "PE reads"
        • Description:
          To classify single-end (SE) reads or contigs obtained by de novo assembly of reads, set this parameter to "SE reads or contigs".
          
          To classify paired-end (PE) reads, set the value to "PE reads".
          
          One or two slots of the input port are used depending on the value of the parameter. Pass the URL(s) of the data to these slots.
          
          The input files should be in FASTA or FASTQ format. See the "Input file format" parameter.
          
        • Function: for "SE reads or contigs", one input slot is added and a command is run for each input file (see below). For "PE reads", two slots are added and a command is run for each pair of input files (see below).
      2. Input file format
        • Values: "FASTA" (default), "FASTQ"
        • Description:
          Set the type of the input file (--input_type). Each input file usually contains many sequences to classify.
          
        • Function:
          • in case of "FASTA" add "--input_type fasta" to the command
          • in case of "FASTQ" add "--input_type fastq" to the command
      3. Database
        • Values: a folder URL
        • Description:
          A path to a folder with a MetaPhlAn2 database: Bowtie2 index files built from reference genomes, and a *.pkl file (--mpa_pkl, --bowtie2db).
          
          By default, the "mpa_v20_m200" database is provided (if it has been downloaded). The database was built on ~1M unique clade-specific marker genes identified from ~17,000 reference genomes (~13,500 bacterial and archaeal, ~3,500 viral, and ~110 eukaryotic).
          
        • Function:
          • add the "--mpa_pkl" argument with the value "the folder path/file name.pkl"
          • add the "--bowtie2db" argument with the value "the folder path"
        • Validation:
          • The folder contains exactly six index files.
          • The folder contains exactly one *.pkl file.
      4. Number of threads
        • Values: from 1 to the "Optimize for CPU count" value specified on the "Resources" tab of the UGENE Application Settings.
        • Description:
          The number of CPUs to use for parallelizing the mapping (--nproc).
          
        • Function: add "--nproc N" argument to the command, where N is the specified value.
      5. Bowtie2 output file
        • Values: "Auto" or a file name
        • Description:
          The file for saving the output of Bowtie2 (--bowtie2out). In case of PE reads, one file is created for each pair of files.
          
        • Function: add the "--bowtie2out" argument. By default, the output file name is derived from the name of the input file or pair of files (the standard way in UGENE; see other classification elements, for example). The generated name should be "samplename_bowtie2out.txt".
      6. Output file
        • Values: "Auto" or a file name
        • Description:
          The tab-separated output file of the predicted taxon relative abundances.
          
        • Function: specify the output file ("> filename") at the end of the command. By default, the name is generated automatically like "samplename_profile.txt".
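
The validation rules for the "Database" parameter can be sketched as follows. This is a minimal illustration, not the UGENE implementation: the function name `validate_database` is hypothetical, and the sketch assumes that the Bowtie2 index files are recognized by the ".bt2" extension.

```python
import tempfile
from pathlib import Path


def validate_database(folder):
    """Return (ok, pkl_path) for a MetaPhlAn2 database folder.

    The folder is accepted only if it holds exactly six Bowtie2 index
    files (assumed here to end in '.bt2') and exactly one '.pkl' file.
    """
    folder = Path(folder)
    index_files = sorted(folder.glob("*.bt2"))
    pkl_files = sorted(folder.glob("*.pkl"))
    ok = len(index_files) == 6 and len(pkl_files) == 1
    return ok, (pkl_files[0] if ok else None)


# Build a throwaway folder that mimics the default database layout.
with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "mpa_v20_m200"
    db.mkdir()
    for suffix in ("1", "2", "3", "4", "rev.1", "rev.2"):
        (db / f"mpa_v20_m200.{suffix}.bt2").touch()
    incomplete_ok, _ = validate_database(db)  # the .pkl file is still missing
    (db / "mpa_v20_m200.pkl").touch()
    complete_ok, pkl = validate_database(db)
    pkl_name = pkl.name
```

Returning the path of the single *.pkl file alongside the check result lets the caller reuse it for the "--mpa_pkl" argument.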

      Temporary folder

      Add the "--tmp_dir" argument to the command, with a value that corresponds to the folder for temporary files specified in the UGENE Application Settings.

      Command

      In case of SE reads, the full command looks as follows:

      ./metaphlan2.py /path/small_stool.fa --nproc 4 --tmp_dir /path --input_type fasta --bowtie2out /path/small_stool_bowtie2out.txt --mpa_pkl /path/mpa_v20_m200/mpa_v20_m200.pkl --bowtie2db /path/mpa_v20_m200 > /path/small_stool_profile.txt
      

      In case of PE reads:

      ./metaphlan2.py /path/small_P00134-R1.fastq,/path/small_P00134-R2.fastq --nproc 4 --tmp_dir /path/tmp --input_type fastq --bowtie2out /path/small_P00134_bowtie2out.txt --mpa_pkl /path/mpa_v20_m200/mpa_v20_m200.pkl --bowtie2db /path/mpa_v20_m200 > /path/small_P00134_profile.txt
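
The way the parameters map onto the command line can be sketched as below. This is only an illustration under stated assumptions: the function `build_command` and its argument names are hypothetical, and the rule used to derive the sample name (file name without extension, minus a trailing "-R1" for PE reads) is a guess that merely reproduces the example names above. UGENE's actual naming logic may differ.

```python
from pathlib import PurePosixPath


def build_command(inputs, input_type, db_dir, pkl_name, threads, tmp_dir, out_dir):
    """Return (argv, profile_path) for one input file or one pair of files."""
    # Hypothetical naming rule: sample name = first file name without its
    # extension, minus a trailing "-R1" when a pair of files is given.
    sample = PurePosixPath(inputs[0]).stem
    if len(inputs) == 2 and sample.endswith("-R1"):
        sample = sample[:-3]
    out_dir = PurePosixPath(out_dir)
    argv = [
        "./metaphlan2.py",
        ",".join(inputs),  # PE files are joined with a comma
        "--nproc", str(threads),
        "--tmp_dir", str(tmp_dir),
        "--input_type", input_type,
        "--bowtie2out", str(out_dir / f"{sample}_bowtie2out.txt"),
        "--mpa_pkl", str(PurePosixPath(db_dir) / pkl_name),
        "--bowtie2db", str(db_dir),
    ]
    # The profile is written to stdout, so the caller redirects it here.
    return argv, out_dir / f"{sample}_profile.txt"


argv, profile = build_command(
    ["/path/small_P00134-R1.fastq", "/path/small_P00134-R2.fastq"],
    "fastq", "/path/mpa_v20_m200", "mpa_v20_m200.pkl", 4, "/path/tmp", "/path",
)
```

Keeping the arguments in a list rather than one string avoids quoting problems when paths contain spaces; the redirect target is returned separately because it is not an argument of the tool itself.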
      

      Sample data, output data and database can be found on the file server in folder "/data/test_data/UGENE-6254".

      Test plan

      See tests "Tests: Metagenomics > MetaPhlAn > Workflow element > Core".

              People

              Assignee:
              dsukhomlinov Dmitrii Sukhomlinov
              Reporter:
              oigl Olga Golosova
              Assigned Tester:
              Svetlana Samoilenko