Element name and description
- Name of the element: "Ensemble Classification Data"
- Description of the element on the Scene: "Ensemble classification data from other elements into unset."
  The "unset" value corresponds to the name of the output file.
- Description of the element in the Property Editor:
"The element ensembles data, produced by classification tools (Kraken, CLARK, DIAMOND), into a single file in CSV format. This file can be used as input for the WEVOTE classifier."
Input data
There is one input port:
Item | Value |
---|---|
Port name in GUI | Input taxonomy data |
Port description | Three input slots are available for taxonomy classification data. At least first and second slots should be connected to classification data slots. |
Port ID in UWL | in |
Number of slots | 3 |
Slot #1 name in GUI | Input tax data 1 |
Slot #1 ID in UWL | tax_data1 |
Slot #1 data type | Taxonomy classification |
Slot #2 name in GUI | Input tax data 2 |
Slot #2 ID in UWL | tax_data2 |
Slot #2 data type | Taxonomy classification |
Slot #3 name in GUI | Input tax data 3 |
Slot #3 ID in UWL | tax_data3 |
Slot #3 data type | Taxonomy classification |
Output data
There is one output port:
Item | Value |
---|---|
Port name in GUI | Ensembled classification |
Port description | URL to the CSV file with ensembled classification data. |
Port ID in UWL | out |
Number of slots | 1 |
Slot #1 name in GUI | Output URL |
Slot #1 ID in UWL | url |
Slot #1 data type | string |
Parameters
There is one parameter, "Output file". In the GUI it is a line edit with a browse button. The value is mandatory ("Required"). The default value is "ensemble.csv". The parameter description is the following:
Specify the output file. The classification data are stored in CSV format with the following columns:
1) a sequence name
2) taxID from the first tool
3) taxID from the second tool
4) optionally, taxID from the third tool
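As an illustration of the column layout described above, the following is a minimal sketch of a reader for such a file. The function name `read_ensemble` is hypothetical, and the sketch assumes a comma delimiter, no header row, and integer taxIDs; none of these details are confirmed by this specification.

```python
import csv
import io


def read_ensemble(stream):
    """Parse rows of the form: seq_name, taxID1, taxID2[, taxID3]."""
    records = []
    for row in csv.reader(stream):
        if not row:
            continue  # ignore blank lines
        if len(row) not in (3, 4):
            raise ValueError("expected 3 or 4 columns, got {}".format(len(row)))
        # Assumption: taxIDs are integers.
        records.append((row[0], [int(tax_id) for tax_id in row[1:]]))
    return records


# Made-up sample content with the optional third taxID column present.
sample = io.StringIO("seq_a,1280,1279,1280\nseq_b,562,562,0\n")
records = read_ensemble(sample)
```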
Data processing by the element
- The element takes input taxonomy data (i.e. maps of sequence names to taxIDs) from two or three slots. Datasets are not taken into account. The data are processed per file.
- It sorts all input sequence names alphabetically.
- It creates a CSV file (using the name specified in the parameters) with the following column structure:
  - seq_name
  - taxID_of_seq_from_slot1
  - taxID_of_seq_from_slot2
  - taxID_of_seq_from_slot3 (if specified)
- It shows the CSV file as an output on the WD dashboard and passes the file URL to the output port.
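The processing steps above can be sketched roughly as follows. This is not the element's actual implementation: the function names `ensemble` and `write_csv` are hypothetical, the sort uses Python's default string ordering, no header row is written, and sequence names missing from any connected slot are assumed to be skipped (as the warning message in the "Error messages" section suggests).

```python
import csv
import io


def ensemble(tax_maps):
    """Merge two or three {seq_name: taxID} maps into sorted rows."""
    if len(tax_maps) not in (2, 3):
        raise ValueError("expected two or three taxonomy maps")
    # Assumption: keep only sequence names present in every connected slot.
    common = set(tax_maps[0])
    for tax_map in tax_maps[1:]:
        common &= set(tax_map)
    # One row per sequence: name, then the taxID from each slot in order.
    return [[name] + [m[name] for m in tax_maps] for name in sorted(common)]


def write_csv(rows, stream):
    """Write the rows as comma-separated lines (no header row assumed)."""
    csv.writer(stream, lineterminator="\n").writerows(rows)


# Made-up input data for two connected slots.
slot1 = {"seq_b": 562, "seq_a": 1280}
slot2 = {"seq_a": 1279, "seq_b": 562}
rows = ensemble([slot1, slot2])
buf = io.StringIO()
write_csv(rows, buf)
csv_text = buf.getvalue()
```

With a third slot connected, each row would simply gain a fourth column.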
Error messages
In case the first or the second slot is not set:
- Show an error in the WD Error list:
It is required to input taxonomy data for at least the first and the second slot.
In case a sequence is present in one of the maps but not in another one:
- Generate a "TRACE" message like:
Taxonomy data for "seq_name" is found in "file1", but not found in "file2" and "file3".
- Generate an "INFO" message in the log and a warning message on the WD dashboard:
Different taxonomy data do not match. Some sequence names were skipped.
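The mismatch check could be sketched as below. The helper names `find_mismatches` and `join_quoted` are hypothetical, and the wording for cases other than the quoted example (for instance, a sequence found in two files and missing from one) is an assumption extrapolated from that single message.

```python
def join_quoted(names):
    """Format names as: "file1" and "file2"."""
    return " and ".join('"{}"'.format(name) for name in names)


def find_mismatches(named_maps):
    """named_maps: list of (file_name, {seq_name: taxID}) pairs, in slot order.

    Returns one TRACE-style message per sequence that is missing
    from at least one of the maps.
    """
    all_names = set()
    for _, tax_map in named_maps:
        all_names |= set(tax_map)
    messages = []
    for seq_name in sorted(all_names):
        found = [f for f, m in named_maps if seq_name in m]
        missing = [f for f, m in named_maps if seq_name not in m]
        if missing:
            messages.append(
                'Taxonomy data for "{}" is found in {}, but not found in {}.'.format(
                    seq_name, join_quoted(found), join_quoted(missing)
                )
            )
    return messages


# Made-up example: "seq_x" is classified only in the first file.
maps = [
    ("file1", {"seq_x": 1, "seq_y": 2}),
    ("file2", {"seq_y": 2}),
    ("file3", {"seq_y": 3}),
]
messages = find_mismatches(maps)
```

If the returned list is non-empty, the single summary message ("Different taxonomy data do not match. Some sequence names were skipped.") would be reported once, in addition to the per-sequence messages.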
Sample data
See, for example, the files "HC1.fasta" and "HC1_ensemble.csv" on the file server (in the ".../virogenesis/tools_testing/wevote_without_classifiers" folder). The second file was produced from the first one by running "run_WEVOTE_PIPELINE.sh" with:
- CLARK-l with the "bacteria" database that goes with the tool.
- Kraken with the "MiniKraken" database.
Relates to: UGENE-6036 Add "Improve Classification with WEVOTE" workflow element (Closed)