
This package provides a command line interface.

Retrieve a reference

To retrieve a reference, specify its id with the --id option.

$ mutalyzer_retriever --id "NG_012337.1"
##sequence-region NG_012337.1 1 15948
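
The first line of the default output above is a GFF3 sequence-region directive, which carries the sequence id together with its start and end coordinates. A minimal Python sketch for reading that line could look as follows; the function name parse_sequence_region is illustrative and not part of this package.

```python
def parse_sequence_region(line):
    """Parse a GFF3 '##sequence-region <seqid> <start> <end>' directive."""
    fields = line.split()
    if len(fields) != 4 or fields[0] != "##sequence-region":
        raise ValueError(f"not a sequence-region directive: {line!r}")
    # Coordinates are integers; the id is kept as text.
    return fields[1], int(fields[2]), int(fields[3])


if __name__ == "__main__":
    print(parse_sequence_region("##sequence-region NG_012337.1 1 15948"))
```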

Retrieve a reference model

To retrieve the reference model, add --parse (-p). Optionally, set the indentation with --indent.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2
  "annotations": {
    "id": "NG_012337.1",
    "type": "record",

Output directory and split the model

Specify an output directory with --output.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2 --output .
$ less NG_012337.1
  "annotations": {
    "id": "NG_012337.1",
    "type": "record",

Split the model between annotations and sequence with --split.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2 --output . --split
$ less NG_012337.1.annotations
  "id": "NG_012337.1",
  "type": "record",
$ less NG_012337.1.sequence

Choose the retrieval source

By default, all sources are tried in the following order: LRG, NCBI, Ensembl. The reference is retrieved from the first source in which it is found. A specific source can be selected with --source (-s).

$ mutalyzer_retriever --id "NG_012337.1" -s ncbi
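
The default behaviour described above (try each source in order and stop at the first hit) can be sketched in Python as follows. This is only an illustration of the idea, not the package's actual implementation; the function retrieve_from_first_source and the stand-in fetchers are hypothetical.

```python
def retrieve_from_first_source(reference_id, fetchers):
    """Return (source_name, content) from the first source that has the reference.

    fetchers is an ordered list of (name, callable) pairs. Each callable
    returns the content, or None when the reference is not found there.
    """
    for name, fetch in fetchers:
        content = fetch(reference_id)
        if content is not None:
            return name, content
    raise LookupError(f"{reference_id} not found in any source")


if __name__ == "__main__":
    # Stand-in fetchers for illustration only; real ones would query each service.
    fetchers = [
        ("lrg", lambda ref: None),
        ("ncbi", lambda ref: f"raw content for {ref}"),
        ("ensembl", lambda ref: None),
    ]
    print(retrieve_from_first_source("NG_012337.1", fetchers)[0])  # prints "ncbi"
```

Restricting the list to a single entry corresponds to selecting one source explicitly.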

Choose the retrieval file type

For NCBI and Ensembl, the default retrieved file type is gff3. A fasta file can also be retrieved with --type (-t).

$ mutalyzer_retriever --id "NG_012337.1" -t fasta
>NG_012337.1 Homo sapiens succinate dehydrogenase complex, ...

If --parse (-p) is added to the previous command, the sequence model is obtained (no annotations are included).

$ mutalyzer_retriever --id "NG_012337.1" -t fasta -p

For the moment, this is not the case when --parse (-p) is used in combination with -t gff3.

Raw genbank files can be retrieved from NCBI with -t genbank, but they cannot be parsed to obtain a model.
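
When the tool is called from a script, the options described in the sections above can be assembled programmatically. The following Python sketch builds the argument list and runs it with the standard subprocess module. The helper names build_command and run are hypothetical, only flags mentioned in this document are used, and it is assumed that the result is written to standard output when no output directory is given.

```python
import subprocess


def build_command(reference_id, parse=False, indent=None, source=None, file_type=None):
    """Build the argument list for the options described above."""
    command = ["mutalyzer_retriever", "--id", reference_id]
    if source is not None:
        command += ["-s", source]
    if file_type is not None:
        command += ["-t", file_type]
    if parse:
        command.append("-p")
    if indent is not None:
        command += ["--indent", str(indent)]
    return command


def run(command):
    """Run the command and return its standard output as text."""
    result = subprocess.run(command, capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Equivalent to: mutalyzer_retriever --id "NG_012337.1" -t fasta -p
    print(build_command("NG_012337.1", parse=True, file_type="fasta"))
```

Passing the arguments as a list avoids shell quoting issues with the reference id.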

Parse local files

To obtain a model from local files (gff3 together with fasta, or lrg), use the from_file command.

$ mutalyzer_retriever from_file -h
usage: mutalyzer_retriever from_file [-h]
                                     [--paths PATHS [PATHS ...]]

optional arguments:
  -h, --help            show this help message and exit
  --paths PATHS [PATHS ...]
                        both gff3 and fasta paths or just an lrg
  --is_lrg              there is one file which is lrg

An example with gff3 and fasta is as follows.

$ mutalyzer_retriever from_file --paths NG_012337.1.gff3 NG_012337.1.fasta
{"annotations": {"id": "NG_012337.1", "type": "record", "location": ...

For an lrg file, add the --is_lrg flag.

$ mutalyzer_retriever from_file --paths LRG_417 --is_lrg
{"annotations": {"type": "record", "id": "LRG_417", "location": ...
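
The model printed by the commands above is JSON, so it can be inspected with standard tools. A minimal Python sketch, assuming only the "annotations", "id" and "type" keys visible in the output shown here (the helper name summarize_model is hypothetical):

```python
import json


def summarize_model(model_text):
    """Return the id and type stored under 'annotations' in a model."""
    model = json.loads(model_text)
    annotations = model["annotations"]
    return annotations["id"], annotations["type"]


if __name__ == "__main__":
    # A trimmed, hand-written stand-in for real output, for illustration only.
    example = '{"annotations": {"type": "record", "id": "LRG_417"}}'
    print(summarize_model(example))
```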

Retrieve the NCBI reference models from FTP

To start from scratch, i.e., connect to the FTP location to retrieve the assembly versions and download the annotation files, run the ncbi_assemblies command without options.

$ mutalyzer_retriever ncbi_assemblies
- local output directory set up to ./models

To restrict processing to specific reference ids, assuming that the input files are already present in the downloads/ directory, use --ref_id_start together with --input and --downloaded.

$ mutalyzer_retriever ncbi_assemblies --input downloads/ --ref_id_start NC_000023 --downloaded
- local output directory set up to ./models
- processing 109 from 20180213, (GRCh38.p12, GCF_000001405.38)
  - NC_000023.11
- processing 105.20220307 from 20220307, (GRCh37.p13, GCF_000001405.25)
  - NC_000023.10
- writing ./models/NC_000023.10