Usage

This package provides a command line interface.

Configuration

If you plan to retrieve NCBI references, then create a configuration file where you include the email address used to communicate with the NCBI:

$ echo EMAIL = your.email@address.com > config.txt

Optionally, include the NCBI API key:

$ echo NCBI_API_KEY = your_NCBI_key >> config.txt

Finally:

export MUTALYZER_SETTINGS="$(pwd)/config.txt"

Retrieve a reference

To retrieve a reference mention its id with the --id option.

$ mutalyzer_retriever --id "NG_012337.1"
##sequence-region NG_012337.1 1 15948
...

Retrieve a reference model

To retrieve the reference model add --parse (-p). Optionally, choose the preferred indentation with --indent.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2
{
  "annotations": {
    "id": "NG_012337.1",
    "type": "record",
...

Output directory and split the model

Specify an output directory with --output.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2 --output .
$ less NG_012337.1
{
  "annotations": {
    "id": "NG_012337.1",
    "type": "record",
...

Split the model between annotations and sequence with --split.

$ mutalyzer_retriever --id "NG_012337.1" -p --indent 2 --output . --split
$ less NG_012337.1.annotations
{
  "id": "NG_012337.1",
  "type": "record",
...
$ less NG_012337.1.sequence
GGGCTTGGTTCTACCATATCTCTACTTTGTGTTTATGTTTGTGTATGCATGTACTCCAAAGTCTT
...

Choose the retrieval source

By default all the sources are accessed (in the following order: LRG, NCBI, Esembl) and the reference is retrieved from the first one found. However, a specific source can be specified with -source (-s).

$ mutalyzer_retriever --id "NG_012337.1" -s ncbi
...

Choose the retrieval file type

For NCBI and Ensembl the default retrieved reference is gff3. However, a fasta file can be also retrieved with --type (-t).

$ mutalyzer_retriever --id "NG_012337.1" -t fasta
>NG_012337.1 Homo sapiens succinate dehydrogenase complex, ...
GGGCTTGGTTCTACCATATCTCTACTTTGTGTTTATGTTTGTGTATGCATGTACTCCAA...
...

If --parse (p) is added to the previous command, the sequence model is obtained (no annotations are included).

$ mutalyzer_retriever --id "NG_012337.1" -t fasta -p
{"sequence": {"seq": "GGGCTTGGTTCTACCATATCTCTACTTT

For the moment, this is not the case when --parse (p) is used in combination with -t gff3.

Raw genbank files can be retrieved from NCBI with -t genbank, but they cannot be parsed to obtain a model.

Parse local files

To obtain a model from local files (gff3 with fasta and lrg) use the from_file command.

$ mutalyzer_retriever from_file -h
usage: mutalyzer_retriever from_file [-h]
                                     [--paths PATHS [PATHS ...]]
                                     [--is_lrg]

optional arguments:
  -h, --help            show this help message and exit
  --paths PATHS [PATHS ...]
                        both gff3 and fasta paths or just an lrg
  --is_lrg              there is one file which is lrg

An example with gff3 and fasta is as follows.

$ mutalyzer_retriever from_file --paths NG_012337.1.gff3 NG_012337.1.fasta
{"annotations": {"id": "NG_012337.1", "type": "record", "location": ...
...

For an lrg file the --is_lrg flag needs to be added.

$ mutalyzer_retriever from_file --paths LRG_417 --is_lrg
{"annotations": {"type": "record", "id": "LRG_417", "location": ...

Retrieve the NCBI reference models from FTP

Starting from scratch, i.e., connect to the FTP location to retrieve the assembly versions and to download the annotations files. Please note that the following command will retrieve, besides the chromosomes (NC_), also the contigs (NT_) and the scaffolds (NW_).

$ mutalyzer_retriever ncbi_assemblies
Downloading assembly releases:
 - assembly: GRCh37
   - dir: ncbi_annotation_releases/GRCh37/20190906
   - dir: ncbi_annotation_releases/GRCh37/20220307
   - dir: ncbi_annotation_releases/GRCh37/20240902
 - assembly: GRCh38
   - dir: ncbi_annotation_releases/GRCh38/20180213
   ...
   - dir: ncbi_annotation_releases/GRCh38/20240823
 - assembly: T2T-CHM13v2
   - dir: ncbi_annotation_releases/T2T-CHM13v2/20230315
   - dir: ncbi_annotation_releases/T2T-CHM13v2/20231002
   - dir: ncbi_annotation_releases/T2T-CHM13v2/20240823
Get annotation models:
- get from: GRCh38, date: 20180213
  - NC_000001.11
  - NT_187361.1
  ...

To restrict only to specific reference ids and assuming that the input files are already present in the ./ncbi_annotation_releases (default) directory:

$ mutalyzer_retriever ncbi_assemblies --input ncbi_annotation_releases --ref_id_start NC_000023 --downloaded
Using downloaded releases from:
 ./ncbi_annotation_releases
Get annotation models:
- get from: GRCh38, date: 20180213
  - NC_000023.11
...
- get from: GRCh38, date: 20240823
  - NC_000023.11
- get from: GRCh37, date: 20190906
  - NC_000023.10
- get from: GRCh37, date: 20220307
  - NC_000023.10
- get from: GRCh37, date: 20240902
  - NC_000023.10
- get from: T2T-CHM13v2, date: 20230315
- get from: T2T-CHM13v2, date: 20231002
- get from: T2T-CHM13v2, date: 20240823
- writing ./ncbi_annotation_models/NC_000023.11.annotations
- writing ./ncbi_annotation_models/NC_000023.10.annotations

To restrict only to a specific reference id and an assembly id, with the input files already present in the ./ncbi_annotation_releases directory, and to download also the sequences (--include_sequence) in the same directory:

$ mutalyzer_retriever ncbi_assemblies --ref_id_start NC_0 --assembly_id_start GRCh37 --downloaded --include_sequence
Using downloaded releases from:
 ./ncbi_annotation_releases
Get annotation models:
- get from: GRCh37, date: 20190906
  - NC_000001.10
  ...
  - NC_000024.9
- get from: GRCh37, date: 20220307
  - NC_000001.10
  ...
  - NC_000024.9
- get from: GRCh37, date: 20240902
  - NC_000001.10
  ...
  - NC_000024.9
- writing ./ncbi_annotation_models/NC_000001.10.annotations
  ...
- writing ./ncbi_annotation_models/NC_000024.9.annotations
Downloading the sequences:
- get the sequence for NC_000001.10
- writing ./ncbi_annotation_models/NC_000023.10.sequence
...
- get the sequence for NC_000023.10
- writing ./ncbi_annotation_models/NC_000023.10.sequence