[UGENE-6035] Add "Ensemble Classification Data" workflow element - Jira

Details

Type: New Feature
Status: Closed
Priority: Blocker
Resolution: Fixed
Affects Version/s: virogenesis
Fix Version/s: 1.31
Component/s: NGS, Workflow
Labels:
- formats
- new_tool

Story Points: 3
Epic Link: VIROGENESIS-II
Sprint: DEV-30-4, DEV-30-5, DEV-30-6
Affect Type: Userdefined

Description

Element name and description

Name of the element: "Ensemble Classification Data"
Description of the element on the Scene: "Ensemble classification data from other elements into unset."
The "unset" value corresponds to the name of the output file.
Description of the element in the Property Editor:
"The element ensembles data, produced by classification tools (Kraken, CLARK, DIAMOND), into a single file in CSV format. This file can be used as input for the WEVOTE classifier."

Input data

There is one input port:

Item	Value
Port name in GUI	Input taxonomy data
Port description	Three input slots are available for taxonomy classification data. At least first and second slots should be connected to classification data slots.
Port ID in UWL	in
Number of slots	3
Slot #1 name in GUI	Input tax data 1
Slot #1 ID in UWL	tax_data1
Slot #1 data type	Taxonomy classification
Slot #2 name in GUI	Input tax data 2
Slot #2 ID in UWL	tax_data2
Slot #2 data type	Taxonomy classification
Slot #3 name in GUI	Input tax data 3
Slot #3 ID in UWL	tax_data3
Slot #3 data type	Taxonomy classification

Output data

There is one output port:

Item	Value
Port name in GUI	Ensembled classification
Port description	URL to the CSV file with ensembled classification data.
Port ID in UWL	out
Number of slots	1
Slot #1 name in GUI	Output URL
Slot #1 ID in UWL	url
Slot #1 data type	string

Parameters

There is one parameter "Output file". In GUI it is a line edit with the browse button. The value is mandatory ("Required"). The default value is "ensemble.csv". The parameter description is the following:

Specify the output file. The classification data are stored in CSV format with the following columns:
    1) a sequence name
    2) taxID from the first tool
    3) taxID from the second tool
    4) optionally, taxID from the third tool

Data processing by the element

The element takes input taxonomy data (i.e. maps from sequence names to taxIDs) from two or three slots. Datasets are not taken into account. The data are processed per file.
It sorts all input sequence names alphabetically.
It creates a CSV file (using the name specified in the parameters) with the following column structure:
- seq_name
- taxID_of_seq_from_slot1
- taxID_of_seq_from_slot2
- taxID_of_seq_from_slot3 (if specified)
It shows the CSV file as the output on the WD dashboard and passes the file URL to the output port.
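
The processing steps above can be illustrated with a minimal Python sketch. This is not the UGENE implementation: the function names `ensemble_rows` and `write_ensemble_csv` are hypothetical, the sample taxIDs are made up for illustration, and two points are assumptions that this issue does not state explicitly: that no header row is written, and that sequence names missing from any connected slot are skipped (as the warning message in the "Error messages" section suggests). "Alphabetical" ordering is approximated here by Python's default string ordering.

```python
import csv
import io


def ensemble_rows(*tax_maps):
    """Return rows [seq_name, taxID_slot1, taxID_slot2, ...] sorted by name.

    Assumption: only names present in every supplied map are kept.
    """
    if len(tax_maps) not in (2, 3):
        raise ValueError("expected taxonomy data from two or three slots")
    common = set(tax_maps[0])
    for tax_map in tax_maps[1:]:
        common &= set(tax_map)
    return [[name] + [m[name] for m in tax_maps] for name in sorted(common)]


def write_ensemble_csv(rows, stream):
    """Write the rows as CSV without a header row (an assumption)."""
    writer = csv.writer(stream, lineterminator="\n")
    writer.writerows(rows)


# Illustrative data only; "seq_c" is absent from slot1 and is skipped.
slot1 = {"seq_b": 562, "seq_a": 1280}
slot2 = {"seq_a": 1279, "seq_b": 562, "seq_c": 9606}

rows = ensemble_rows(slot1, slot2)
buffer = io.StringIO()
write_ensemble_csv(rows, buffer)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

In the real element the rows would be written to the file named by the "Output file" parameter, and that file's URL would be passed to the `out` port.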

Error messages

In case the first or the second slot is not set:

Show an error in the WD Error list:

It is required to input taxonomy data for at least the first and the second slot.
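
A minimal Python sketch of this validation, for illustration only. The slot IDs `tax_data1` and `tax_data2` come from the input port table above; the function `validate_slots` and the dictionary of connection flags are hypothetical and do not reflect the actual Workflow Designer API.

```python
REQUIRED_SLOTS_ERROR = (
    "It is required to input taxonomy data for at least "
    "the first and the second slot."
)


def validate_slots(connected):
    """Return a list of error messages for the WD Error list.

    `connected` maps a slot ID to True if the slot is set (hypothetical shape).
    """
    errors = []
    if not (connected.get("tax_data1") and connected.get("tax_data2")):
        errors.append(REQUIRED_SLOTS_ERROR)
    return errors


print(validate_slots({"tax_data1": True, "tax_data3": True}))
```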

In case there are sequences present in one of the maps, but not present in another one:

Generate a "TRACE" message like:

Taxonomy data for "seq_name" is found in "file1", but not found in "file2" and "file3".

Generate an "INFO" message in the log and a warning message on the WD dashboard:
```
Different taxonomy data do not match. Some sequence names were skipped. 
```
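
The mismatch reporting can be sketched in Python as follows. This is an illustration under assumptions, not the element's code: `mismatch_messages` and `quote_join` are hypothetical helpers, and the way several file names are joined with "and" is inferred from the single example message above.

```python
MISMATCH_WARNING = (
    "Different taxonomy data do not match. Some sequence names were skipped."
)


def quote_join(names):
    """Format file names as "a" and "b" (joining rule is an assumption)."""
    return " and ".join('"{}"'.format(name) for name in names)


def mismatch_messages(named_maps):
    """Build one TRACE message per sequence name missing from some map.

    `named_maps` is a list of (file_name, {seq_name: taxID}) pairs.
    """
    all_names = set()
    for _, tax_map in named_maps:
        all_names.update(tax_map)
    messages = []
    for seq_name in sorted(all_names):
        found = [f for f, m in named_maps if seq_name in m]
        missing = [f for f, m in named_maps if seq_name not in m]
        if missing:
            messages.append(
                'Taxonomy data for "{}" is found in {}, but not found in {}.'.format(
                    seq_name, quote_join(found), quote_join(missing)
                )
            )
    return messages


named = [("file1", {"s1": 1, "s2": 2}), ("file2", {"s1": 1}), ("file3", {"s1": 3})]
traces = mismatch_messages(named)
for line in traces:
    print(line)
if traces:
    print(MISMATCH_WARNING)
```

If at least one TRACE message is produced, the single INFO/warning message would be emitted once rather than per sequence.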

Sample data

See, for example, the files "HC1.fasta" and "HC1_ensemble.csv" on the file server (in the ".../virogenesis/tools_testing/wevote_without_classifiers" folder). The second file was produced from the first one by running "run_WEVOTE_PIPELINE.sh" with:

CLARK-l with the "bacteria" database that goes with the tool.
Kraken with the "MiniKraken" database.

Issue Links

relates to

UGENE-6036 Add "Improve Classification with WEVOTE" workflow element

Closed

People

Assignee: Aleksey Tiunov [X] (Inactive)

Reporter: Olga Golosova

Assigned Tester: Eugenia Pushkova [X] (Inactive)

Watchers: 2

Dates

Created: 19/Feb/18 6:45 AM

Updated: 18/Apr/18 8:32 PM

Resolved: 29/Mar/18 7:48 AM