Seq/NCBI/Taxonomy Database

The Taxonomy database maintained by the National Center for Biotechnology Information (NCBI) contains taxonomic information closely tied to NCBI's sequence databases [1],[2]. There are several ways to access the NCBI Taxonomy database, including a web browser [3],[4], FTP, and command-line tools.

NCBI Datasets Command-Line Interface (CLI) Tool

For example, suppose we want to study the virus that causes feline acquired immunodeficiency syndrome. Its entry can be found in the NCBI Taxonomy Browser [6], where the identifier NCBI:txid11673 is uniquely linked to the species.

To install the NCBI Datasets CLI tools, follow the instructions in reference [7].

Following the instructions in the references, download the taxonomy data package for txid 11673 [8], then unzip it and verify its checksums [9] in a terminal:

datasets download taxonomy taxon 11673 --filename dltxid11673.zip
unzip dltxid11673.zip -d dltxid11673
cd dltxid11673
md5sum -c md5sum.txt

Output:

Downloading: dltxid11673.zip    2.96kB valid data package
Validating package files [================================================] 100% 5/5
Archive:  dltxid11673.zip
  inflating: dltxid11673/README.md   
  inflating: dltxid11673/ncbi_dataset/data/taxonomy_report.jsonl  
  inflating: dltxid11673/ncbi_dataset/data/taxonomy_summary.tsv  
  inflating: dltxid11673/ncbi_dataset/data/dataset_catalog.json  
  inflating: dltxid11673/md5sum.txt  
ncbi_dataset/data/taxonomy_report.jsonl: OK
ncbi_dataset/data/taxonomy_summary.tsv: OK
ncbi_dataset/data/dataset_catalog.json: OK
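
The `md5sum -c` check can also be reproduced in Python, which may help on systems where the `md5sum` utility is not available. The sketch below is only an illustration and is not part of the NCBI tooling: the function name `verify_md5_manifest` is hypothetical, and it assumes each line of `md5sum.txt` has the usual `md5sum` layout of a hex digest followed by a relative path.

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(manifest_path):
    """Check each file listed in an md5sum-style manifest.

    Returns a dict mapping each listed relative path to True (digest matches)
    or False (digest differs or the file is missing).
    """
    manifest = Path(manifest_path)
    base_dir = manifest.parent  # listed paths are taken relative to the manifest
    results = {}
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        # Assumed line format: "<hex digest>  <relative path>"
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # md5sum prefixes binary-mode paths with '*'
        target = base_dir / rel_path
        if not target.is_file():
            results[rel_path] = False
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        results[rel_path] = actual == expected.lower()
    return results


if __name__ == "__main__":
    # Directory name taken from the unzip command above.
    manifest = Path("dltxid11673/md5sum.txt")
    if manifest.exists():
        for path, ok in verify_md5_manifest(manifest).items():
            print(f"{path}: {'OK' if ok else 'FAILED'}")
```

Returning a dictionary rather than printing directly keeps the check easy to reuse in a script that should stop when any file fails verification.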

The downloaded TSV and JSONL files are included here as an example.

To view the example TSV file:

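import pandas as pd

# Load the tab-separated summary file.
dt = pd.read_csv('../mtbp3/data/supp_seq/taxonomy_summary.tsv', delimiter='\t')
# Transpose so that the single record prints as one field per row.
print(dt.transpose())

Output: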
                                                       0
Query                                              11673
Taxid                                              11673
Tax name                   Feline immunodeficiency virus
Authority                                            NaN
Rank                                             SPECIES
Basionym                                             NaN
Basionym authority                                   NaN
Curator common name                                  NaN
Has type material                                     no
Group name                                       viruses
Superkingdom name                                Viruses
Superkingdom taxid                                 10239
Kingdom name                                Pararnavirae
Kingdom taxid                                    2732397
Phylum name                               Artverviricota
Phylum taxid                                     2732409
Class name                               Revtraviricetes
Class taxid                                      2732514
Order name                                  Ortervirales
Order taxid                                      2169561
Family name                                 Retroviridae
Family taxid                                       11632
Genus name                                    Lentivirus
Genus taxid                                        11646
Species name               Feline immunodeficiency virus
Species taxid                                      11673
Scientific name is formal                           True
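
The summary stores the lineage as pairs of `<Rank> name` and `<Rank> taxid` columns, so it can be reshaped into one row per rank. The following is a minimal sketch under a few assumptions: the helper name `lineage_rows` is hypothetical, the standard-library `csv` module is used instead of pandas so the example runs without extra dependencies, and `sample_tsv` is an illustrative subset of the columns and values printed above, not the full file.

```python
import csv
import io

# Rank prefixes as they appear in the column names of the summary above.
RANKS = ["Superkingdom", "Kingdom", "Phylum", "Class",
         "Order", "Family", "Genus", "Species"]


def lineage_rows(tsv_text):
    """Return (rank, name, taxid) tuples for the first data row of the TSV text."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    row = next(reader)
    lineage = []
    for rank in RANKS:
        name = row.get(f"{rank} name", "")
        taxid = row.get(f"{rank} taxid", "")
        if name:  # skip ranks that are absent or empty for this taxon
            lineage.append((rank.lower(), name, int(taxid) if taxid else None))
    return lineage


# Illustrative subset of the columns and values shown above.
sample_tsv = (
    "Taxid\tFamily name\tFamily taxid\tGenus name\tGenus taxid\t"
    "Species name\tSpecies taxid\n"
    "11673\tRetroviridae\t11632\tLentivirus\t11646\t"
    "Feline immunodeficiency virus\t11673\n"
)

for rank, name, taxid in lineage_rows(sample_tsv):
    print(f"{rank:<8} {taxid:>6}  {name}")
```

To apply the same helper to the downloaded file, its contents can be read as text and passed in place of `sample_tsv`.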

To view the JSONL file:

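import json

# Each line of a JSONL file is a separate JSON document; read the first one.
with open('../mtbp3/data/supp_seq/taxonomy_report.jsonl', 'r') as file:
    line = file.readline()

# Parse the record and pretty-print it with two-space indentation.
print(json.dumps(json.loads(line), indent=2))

Output: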
{
  "taxonomy": {
    "taxId": 11673,
    "rank": "SPECIES",
    "currentScientificName": {
      "name": "Feline immunodeficiency virus",
      "notes": [
        {
          "name": "ICTV Status",
          "note": "Name is currently accepted by the International Committee on Taxonomy of Viruses.",
          "noteClassifier": "ictv_accepted"
        }
      ]
    },
    "groupName": "viruses",
    "classification": {
      "superkingdom": {
        "name": "Viruses",
        "id": 10239
      },
      "kingdom": {
        "name": "Pararnavirae",
        "id": 2732397
      },
      "phylum": {
        "name": "Artverviricota",
        "id": 2732409
      },
      "class": {
        "name": "Revtraviricetes",
        "id": 2732514
      },
      "order": {
        "name": "Ortervirales",
        "id": 2169561
      },
      "family": {
        "name": "Retroviridae",
        "id": 11632
      },
      "genus": {
        "name": "Lentivirus",
        "id": 11646
      },
      "species": {
        "name": "Feline immunodeficiency virus",
        "id": 11673
      }
    },
    "parents": [
      1,
      10239,
      2559587,
      2732397,
      2732409,
      2732514,
      2169561,
      11632,
      327045,
      11646
    ],
    "children": [
      11648,
      289357,
      36373,
      36372,
      36371,
      45409,
      31676,
      11675,
      11674
    ],
    "counts": [
      {
        "type": "COUNT_TYPE_ASSEMBLY",
        "count": 1
      },
      {
        "type": "COUNT_TYPE_GENE",
        "count": 5
      },
      {
        "type": "COUNT_TYPE_PROTEIN_CODING",
        "count": 5
      }
    ],
    "genomicMoltype": "ssRNA-RT",
    "currentScientificNameIsFormal": true
  },
  "query": [
    "11673"
  ]
}
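
Because a JSONL file holds one JSON document per line, a report covering several taxa can be processed record by record. The sketch below is a hedged illustration: `read_jsonl` and `lineage` are hypothetical helper names, and `sample` is an abbreviated record that reuses field names and values from the report above (most fields and most child taxids are omitted).

```python
import json


def read_jsonl(text):
    """Parse JSON Lines text into a list of records, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def lineage(record):
    """Return (rank, name, id) tuples from a record's classification block."""
    classification = record["taxonomy"].get("classification", {})
    # json.loads keeps keys in file order, so ranks stay in the order written.
    return [(rank, node["name"], node["id"]) for rank, node in classification.items()]


# Abbreviated record using field names and values from the report above.
sample = json.dumps({
    "taxonomy": {
        "taxId": 11673,
        "rank": "SPECIES",
        "classification": {
            "family": {"name": "Retroviridae", "id": 11632},
            "genus": {"name": "Lentivirus", "id": 11646},
            "species": {"name": "Feline immunodeficiency virus", "id": 11673},
        },
        "children": [11648, 289357],
    },
    "query": ["11673"],
})

for record in read_jsonl(sample + "\n"):
    tax = record["taxonomy"]
    print(tax["taxId"], tax["rank"], len(tax.get("children", [])), "children listed")
    for rank, name, tax_id in lineage(record):
        print(f"  {rank:<8} {tax_id:>6}  {name}")
```

For the downloaded report, the file contents can be read as text and passed to `read_jsonl` in place of the sample string.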

Reference