  1. UGENE
  2. UGENE-6782

Extracting data from a GTF file by sequences into another GTF file



    • Sprint:
      DEV-35-2, DEV-35-3, DEV-35-4


      Annotations for several chromosomes are often stored in one shared GTF file, where different lines in the file have different sequence names.

      It is very hard to work with this data in the current UGENE version 34, because such GTF files take too long to load.

      This issue proposes a workaround: the ability to extract annotations for specific sequences only into a new GTF file.

      More precisely, the following should be done:

      1. If a user opens a GTF file and the file is big enough, the procedure described below is applied. The proposed "big enough" criterion is a size limit of 30 Mb for a *.gtf file or 2 Mb for a *.gtf.gz file. For example, when the "chr1" lines from "hg19.knownGene.gtf.gz" are saved into a separate file, that file takes ~26 Mb and loads into UGENE fast enough on my computer.
      2. A smaller file is opened in the usual way.
      3. When the file is big enough, a dialog appears:
        • Title: "Open File or Extract Data by Sequences"
        • Buttons: "Help", "Open file", "Extract by sequences", "Cancel"
        • Content:
          The size of "file_name.gtf" is XX Mb. It may take some time to parse the data.
          The file probably contains annotations that belong to different sequences (chromosomes). If only some of the sequences are of interest, it is recommended to extract the annotations that belong to these sequences into a separate file.
          Using the smaller file with the extracted annotations reduces the time needed to load the data.
      4. If "Open file" is clicked, the GTF file is opened in the usual way.
      5. If "Cancel" is clicked, the GTF file opening is cancelled.
      6. If "Extract by sequences" is clicked, another dialog appears:
        • Title: "Extract Annotations by Sequences".
        • Buttons: "Help", "Extract", "Cancel".
        • Content:
          • a table with columns "Sequence name", "Number of annotations" and checkboxes.
          • "Save extracted annotations to file" parameter with a text field for the GTF file path and a browse ("...") button. The file browse dialog should have a filter for "*.gtf" files.
      7. When the dialog is opened, the input GTF file is scanned: all lines are read, but they are not fully parsed yet. Only the beginning of each line, which contains the sequence name, is read. The task collects the sequence names and the number of lines for each sequence name.
      8. A progress icon is shown in the table while the calculation is in progress.
      9. The "Cancel" button is always available. If it is clicked, the task is cancelled (if it is still running), the dialog closes, and the GTF file is not opened.
      10. After the sequence names are loaded, the user can check the required items.
      11. The "Extract" button is disabled until the list of available sequences is collected. When it is clicked, a new GTF file is created that contains the lines of the checked sequences.
      12. If the task is interrupted externally, e.g. because the input GTF file is deleted or modified, the task is stopped and an error notification pops up:
        Extracting of annotations from "file_name" failed.
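
The size check, the sequence name scan, and the extraction described in the steps above can be illustrated with a minimal Python sketch. This is not UGENE code; it only shows the intended logic. All function names (`is_big_gtf`, `count_sequences`, `extract_sequences`) are hypothetical. The sketch assumes that "Mb" means 1024 * 1024 bytes, that a file exactly at the limit counts as big, and that lines starting with "#" are comments. The sequence name is taken as the first tab-separated field of a line, as in the GTF format.

```python
import gzip
import os
from collections import Counter

# Limits proposed in the issue. Assumption: "Mb" is 1024 * 1024 bytes.
GTF_LIMIT_BYTES = 30 * 1024 * 1024
GTF_GZ_LIMIT_BYTES = 2 * 1024 * 1024


def is_big_gtf(path, size_bytes=None):
    """Return True if the file is big enough to show the first dialog."""
    if size_bytes is None:
        size_bytes = os.path.getsize(path)
    if path.lower().endswith(".gtf.gz"):
        return size_bytes >= GTF_GZ_LIMIT_BYTES
    return size_bytes >= GTF_LIMIT_BYTES


def open_text(path):
    """Open a plain or gzip-compressed GTF file for reading as text."""
    opener = gzip.open if path.lower().endswith(".gz") else open
    return opener(path, "rt", encoding="utf-8")


def sequence_name(line):
    """Return the first tab-separated field, or None for blank/comment lines."""
    if not line.strip() or line.startswith("#"):
        return None
    return line.split("\t", 1)[0]


def count_sequences(path):
    """Collect sequence names and the number of lines per sequence name."""
    counts = Counter()
    with open_text(path) as src:
        for line in src:
            name = sequence_name(line)
            if name is not None:
                counts[name] += 1
    return counts


def extract_sequences(src_path, dst_path, wanted):
    """Copy the lines of the checked sequences into a new GTF file."""
    wanted = set(wanted)
    written = 0
    try:
        with open_text(src_path) as src, \
                open(dst_path, "w", encoding="utf-8") as dst:
            for line in src:
                if sequence_name(line) in wanted:
                    dst.write(line)
                    written += 1
    except OSError as err:
        # Mirrors the error notification text proposed in the issue.
        raise RuntimeError(
            f'Extracting of annotations from "{src_path}" failed.'
        ) from err
    return written
```

In this sketch the scan reads only the first field of each line, so the table of sequence names can be filled without parsing the annotations themselves. Progress reporting and cancellation are left out.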




            kir Kirill Rasputin
            oigl Olga Golosova