Seq/NCBI/Taxonomy Database
The taxonomy database maintained by the National Center for Biotechnology Information (NCBI) contains taxonomy information closely related to its sequence database [1],[2]. There are multiple ways to access the NCBI taxonomy database, including a web browser [3],[4], FTP, and command-line tools.
NCBI Datasets Command-Line Interface (CLI) Tool
For example, if we are interested in studying the virus causing feline acquired immunodeficiency syndrome, we can find the information in the NCBI Taxonomy Browser [6].
The identifier NCBI:txid11673 is uniquely linked to the species.
To install the NCBI CLI tools, follow the instructions in reference [7].
Then follow the instructions in the references to download [8], unzip, and verify [9] the data package for txid 11673 in a terminal:
datasets download taxonomy taxon 11673 --filename dltxid11673.zip
unzip dltxid11673.zip -d dltxid11673
cd dltxid11673
md5sum -c md5sum.txt
Output:
Downloading: dltxid11673.zip 2.96kB valid data package
Validating package files [================================================] 100% 5/5
Archive: dltxid11673.zip
inflating: dltxid11673/README.md
inflating: dltxid11673/ncbi_dataset/data/taxonomy_report.jsonl
inflating: dltxid11673/ncbi_dataset/data/taxonomy_summary.tsv
inflating: dltxid11673/ncbi_dataset/data/dataset_catalog.json
inflating: dltxid11673/md5sum.txt
ncbi_dataset/data/taxonomy_report.jsonl: OK
ncbi_dataset/data/taxonomy_summary.tsv: OK
ncbi_dataset/data/dataset_catalog.json: OK
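On systems where md5sum is not available, the verification step above can be approximated in Python. The following is a minimal sketch, not part of the NCBI tooling: the function name verify_md5_manifest is hypothetical, and it assumes each line of md5sum.txt has the usual "<hex digest>  <relative path>" layout, with paths relative to the directory that holds the manifest.

```python
import hashlib
from pathlib import Path


def verify_md5_manifest(manifest_path):
    """Check every entry of an md5sum-style manifest (hypothetical helper).

    Returns a dict mapping each listed relative path to True (digest matches)
    or False (digest differs or the file is missing).
    """
    manifest = Path(manifest_path)
    base_dir = manifest.parent  # assumption: listed paths are relative to this directory
    results = {}
    for raw in manifest.read_text().splitlines():
        if not raw.strip():
            continue  # skip blank lines
        # Assumed layout: "<hex digest>  <path>"; a "*" may prefix the path.
        expected, rel_path = raw.split(maxsplit=1)
        rel_path = rel_path.strip().lstrip("*")
        target = base_dir / rel_path
        if not target.is_file():
            results[rel_path] = False
            continue
        actual = hashlib.md5(target.read_bytes()).hexdigest()
        results[rel_path] = actual == expected.lower()
    return results


if __name__ == "__main__":
    manifest = Path("dltxid11673/md5sum.txt")  # location produced by the unzip step
    if manifest.is_file():
        for path, ok in verify_md5_manifest(manifest).items():
            print(f"{path}: {'OK' if ok else 'FAILED'}")
```

Returning a dictionary rather than printing directly makes it easy to fail a script early when any entry is False.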
The downloaded tsv and jsonl files are included here as examples.
To view the example tsv file:
import pandas as pd
dt = pd.read_csv('../mtbp3/data/supp_seq/taxonomy_summary.tsv', delimiter='\t')
print(dt.transpose())
0
Query 11673
Taxid 11673
Tax name Feline immunodeficiency virus
Authority NaN
Rank SPECIES
Basionym NaN
Basionym authority NaN
Curator common name NaN
Has type material no
Group name viruses
Superkingdom name Viruses
Superkingdom taxid 10239
Kingdom name Pararnavirae
Kingdom taxid 2732397
Phylum name Artverviricota
Phylum taxid 2732409
Class name Revtraviricetes
Class taxid 2732514
Order name Ortervirales
Order taxid 2169561
Family name Retroviridae
Family taxid 11632
Genus name Lentivirus
Genus taxid 11646
Species name Feline immunodeficiency virus
Species taxid 11673
Scientific name is formal True
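The paired "<Rank> name" and "<Rank> taxid" columns shown above can be folded into an ordered lineage. The sketch below is illustrative only: it uses the standard csv module on an inline sample that copies a subset of the columns and values printed above, so it runs without the data file. The helper name lineage_from_row and the RANKS list are assumptions made for this example.

```python
import csv
import io

# Inline sample: a subset of the columns printed above, with the same values.
COLUMNS = [
    ("Taxid", "11673"),
    ("Tax name", "Feline immunodeficiency virus"),
    ("Rank", "SPECIES"),
    ("Superkingdom name", "Viruses"),
    ("Superkingdom taxid", "10239"),
    ("Kingdom name", "Pararnavirae"),
    ("Kingdom taxid", "2732397"),
    ("Phylum name", "Artverviricota"),
    ("Phylum taxid", "2732409"),
    ("Class name", "Revtraviricetes"),
    ("Class taxid", "2732514"),
    ("Order name", "Ortervirales"),
    ("Order taxid", "2169561"),
    ("Family name", "Retroviridae"),
    ("Family taxid", "11632"),
    ("Genus name", "Lentivirus"),
    ("Genus taxid", "11646"),
    ("Species name", "Feline immunodeficiency virus"),
    ("Species taxid", "11673"),
]
SAMPLE_TSV = (
    "\t".join(name for name, _ in COLUMNS) + "\n"
    + "\t".join(value for _, value in COLUMNS) + "\n"
)

# Rank prefixes in the order they appear in the summary columns.
RANKS = ["Superkingdom", "Kingdom", "Phylum", "Class",
         "Order", "Family", "Genus", "Species"]


def lineage_from_row(row):
    """Return [(rank, name, taxid), ...] for ranks that are filled in (hypothetical helper)."""
    lineage = []
    for rank in RANKS:
        name = row.get(f"{rank} name")
        taxid = row.get(f"{rank} taxid")
        if name and taxid:  # skip ranks that are absent or empty
            lineage.append((rank.lower(), name, int(taxid)))
    return lineage


rows = list(csv.DictReader(io.StringIO(SAMPLE_TSV), delimiter="\t"))
lineage = lineage_from_row(rows[0])
for rank, name, taxid in lineage:
    print(f"{rank:<12} {name} ({taxid})")
```

To apply the same helper to the real file, the io.StringIO object can be replaced with an open handle on taxonomy_summary.tsv.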
To view the jsonl file:
import json
with open('../mtbp3/data/supp_seq/taxonomy_report.jsonl', 'r') as file:
    line = file.readline()
    print(json.dumps(json.loads(line), indent=2))
{
"taxonomy": {
"taxId": 11673,
"rank": "SPECIES",
"currentScientificName": {
"name": "Feline immunodeficiency virus",
"notes": [
{
"name": "ICTV Status",
"note": "Name is currently accepted by the International Committee on Taxonomy of Viruses.",
"noteClassifier": "ictv_accepted"
}
]
},
"groupName": "viruses",
"classification": {
"superkingdom": {
"name": "Viruses",
"id": 10239
},
"kingdom": {
"name": "Pararnavirae",
"id": 2732397
},
"phylum": {
"name": "Artverviricota",
"id": 2732409
},
"class": {
"name": "Revtraviricetes",
"id": 2732514
},
"order": {
"name": "Ortervirales",
"id": 2169561
},
"family": {
"name": "Retroviridae",
"id": 11632
},
"genus": {
"name": "Lentivirus",
"id": 11646
},
"species": {
"name": "Feline immunodeficiency virus",
"id": 11673
}
},
"parents": [
1,
10239,
2559587,
2732397,
2732409,
2732514,
2169561,
11632,
327045,
11646
],
"children": [
11648,
289357,
36373,
36372,
36371,
45409,
31676,
11675,
11674
],
"counts": [
{
"type": "COUNT_TYPE_ASSEMBLY",
"count": 1
},
{
"type": "COUNT_TYPE_GENE",
"count": 5
},
{
"type": "COUNT_TYPE_PROTEIN_CODING",
"count": 5
}
],
"genomicMoltype": "ssRNA-RT",
"currentScientificNameIsFormal": true
},
"query": [
"11673"
]
}
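Because each line of a jsonl file is one standalone JSON document, a report can be reduced to the fields of interest one record at a time. The following sketch assumes the record layout shown above; the helper name summarize_report is hypothetical, and the inline sample copies only part of the printed record so that the example runs without the data file.

```python
import json

# Inline sample: an abridged copy of the record printed above, as one JSONL line.
SAMPLE_LINE = json.dumps({
    "taxonomy": {
        "taxId": 11673,
        "rank": "SPECIES",
        "currentScientificName": {"name": "Feline immunodeficiency virus"},
        "classification": {
            "family": {"name": "Retroviridae", "id": 11632},
            "genus": {"name": "Lentivirus", "id": 11646},
            "species": {"name": "Feline immunodeficiency virus", "id": 11673},
        },
        "counts": [
            {"type": "COUNT_TYPE_ASSEMBLY", "count": 1},
            {"type": "COUNT_TYPE_GENE", "count": 5},
            {"type": "COUNT_TYPE_PROTEIN_CODING", "count": 5},
        ],
        "genomicMoltype": "ssRNA-RT",
    },
    "query": ["11673"],
})


def summarize_report(line):
    """Extract a few fields from one taxonomy report line (hypothetical helper)."""
    tax = json.loads(line)["taxonomy"]
    return {
        "tax_id": tax["taxId"],
        "name": tax["currentScientificName"]["name"],
        "rank": tax.get("rank"),
        # Keep the rank order in which the classification object lists them.
        "lineage": [
            (rank, node["name"], node["id"])
            for rank, node in tax.get("classification", {}).items()
        ],
        # Map count type to count, e.g. "COUNT_TYPE_GENE" -> 5.
        "counts": {c["type"]: c["count"] for c in tax.get("counts", [])},
    }


summary = summarize_report(SAMPLE_LINE)
print(summary["tax_id"], summary["name"], summary["rank"])
for rank, name, taxid in summary["lineage"]:
    print(f"  {rank}: {name} ({taxid})")
print(summary["counts"])
```

For a file with several records, the same function can be called on each non-empty line read from the open file.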