The main objective of this epic is to make working with the different file formats of reference and input data in the VIROGENESIS framework smooth and convenient.
Archive file formats
As of Stage I (UGENE-5879), all data has to be unpacked before it can be used. To improve efficiency and save a substantial amount of disk space, make it possible to work directly with the 7z and GZ archive formats.
Note that 7z is already used by the UGENE installer. On some data it is 5 times more efficient than GZ. GZ is a common format for storing reference data and NGS FASTQ files.
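As an illustration of working with GZ data without unpacking it first, here is a minimal Python sketch. It is not UGENE code: the helper names `open_maybe_gzip` and `count_fastq_reads` are hypothetical, and the sketch assumes plain 4-line FASTQ records. Only the standard `gzip` module is used, so 7z is not covered here.

```python
import gzip
from pathlib import Path
from typing import TextIO

# Every gzip stream starts with these two bytes.
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path) -> TextIO:
    """Open a text file for reading, decompressing on the fly if it is gzipped.

    Detection relies on the gzip magic bytes rather than the file extension.
    """
    path = Path(path)
    with path.open("rb") as handle:
        magic = handle.read(2)
    if magic == GZIP_MAGIC:
        return gzip.open(path, "rt", encoding="ascii")
    return path.open("rt", encoding="ascii")


def count_fastq_reads(path) -> int:
    """Count reads in a FASTQ file (assumption: exactly 4 lines per record)."""
    with open_maybe_gzip(path) as handle:
        return sum(1 for _ in handle) // 4
```

With such a helper the same call works for `reads.fastq` and `reads.fastq.gz`, so no unpacked copy has to be kept on disk.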
Other format issues
There are a few issues with the formats, for example:
- The original CLARK tool supports only the "one FASTA per file" format, which makes it harder to use with data downloaded from the NCBI FTP.
- Kraken requires a specific header format for input sequences when building a database, so RefSeq (or other sequence data) currently requires additional processing. It would be better to modify the headers on the fly.
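Both issues can be sketched in a few lines of Python. This is an illustration under assumptions, not the framework's implementation: the function names and the `record_NNNNNN.fasta` naming scheme are hypothetical, "one FASTA per file" is read here as one sequence record per file, and the `kraken:taxid|<id>` tag is the form described in the Kraken manual for custom database sequences. Whether that is the exact header requirement meant above is an assumption.

```python
from pathlib import Path
from typing import Iterable, Iterator, List, Tuple


def iter_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (header without '>', sequence lines) for each FASTA record."""
    header = None
    seq: List[str] = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, seq
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, seq


def add_kraken_taxid(header: str, taxid: int) -> str:
    """Append a 'kraken:taxid|<id>' tag to the sequence ID part of a header."""
    seq_id, _, description = header.partition(" ")
    if "kraken:taxid|" in seq_id:
        return header  # already tagged, leave untouched
    tagged = f"{seq_id}|kraken:taxid|{taxid}"
    return f"{tagged} {description}" if description else tagged


def split_fasta(lines: Iterable[str], out_dir) -> List[Path]:
    """Write each record of a multi-FASTA stream to its own file."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for index, (header, seq) in enumerate(iter_fasta(lines), start=1):
        # Hypothetical naming scheme: record_000001.fasta, record_000002.fasta, ...
        target = out_dir / f"record_{index:06d}.fasta"
        target.write_text(f">{header}\n" + "\n".join(seq) + "\n", encoding="ascii")
        written.append(target)
    return written
```

Because `iter_fasta` is a generator over lines, it can be fed directly from an open (possibly compressed) file handle, so headers are rewritten while streaming and no intermediate copy of RefSeq is needed. The taxonomy ID for each record would have to come from the taxonomy data, which this sketch does not cover.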
Data sources
The data are:
- User input data
- Taxonomy
- RefSeq
  - As the default CLARK database
  - As data for building databases for Kraken and CLARK
- UniProt (for DIAMOND)
Note that MiniKraken (for Kraken) will be skipped for now, as it is not as big as the other data packages.
Version control
Data version control should be supported in the framework, and the data update process should be automated.
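One possible shape for the automated check, as a hedged Python sketch: compare the versions of the locally installed data packages against a list of available versions and report what needs updating. The function name, the package names, and the date-style version strings are all hypothetical. How the framework actually stores or publishes versions is not specified here.

```python
from typing import Dict, List


def packages_to_update(installed: Dict[str, str], available: Dict[str, str]) -> List[str]:
    """Return names of data packages that are missing locally or differ in version."""
    return sorted(
        name
        for name, version in available.items()
        if installed.get(name) != version
    )


# Hypothetical example data: package name -> version label.
installed = {"taxonomy": "2018-03-01", "refseq_bacterial": "2018-01-15"}
available = {
    "taxonomy": "2018-04-01",
    "refseq_bacterial": "2018-01-15",
    "uniprot": "2018-02-28",
}
outdated = packages_to_update(installed, available)
```

Here `outdated` contains `taxonomy` (newer version available) and `uniprot` (not installed yet), while `refseq_bacterial` is left alone.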
Relates to
- UGENE-5879 VIROGENESIS, Stage 1 (Closed)
- UGENE-6091 VEME 2018 issues (Closed)
- UGENE-5925 VIROGENESIS, Stage 2 (Closed)
- UGENE-5957 Re-pack RefSeq bacterial genomes for CLARK and change default database (Closed)
- UGENE-6052 Hidden parameters in WD wizards (Closed)
- UGENE-5959 Integrate 7zip, support it for building CLARK database (Closed)
- UGENE-5894 Online installer: "Filter by classification" component (Closed)
- UGENE-6013 Rename folder to "minikraken" (Closed)
- UGENE-5930 Input paired-end reads for "Classify Sequences with DIAMOND" (Closed)
- UGENE-6018 7zip: RefSeq for Kraken (Closed)
- UGENE-6019 7zip: Kraken database (e.g. MiniKraken) (Closed)
- UGENE-6022 7zip: input data for tools (Closed)