Get lineage of an organism from NCBI taxonomy data

A Python script to obtain the complete taxonomic lineage of an organism from NCBI taxonomy data.

This is a method I followed to get an organism’s complete lineage from NCBI taxonomy data. The procedure involves:

  1. Creating a MySQL database
  2. Initialising the database with the BioSQL schema
  3. Downloading and importing NCBI taxonomy data into the database
  4. Using a python script to query the database and return organism’s complete lineage

Much of the information in this post comes from the documentation of the BioSQL and BioPython projects. In addition, I wrote a Python script to query the database and output results.

Here is how the script works:

python3 lineager.py -n Bos taurus

Sample output:

2018-07-14 04:47 INFO     Processing organism name provided
                          at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus     

Instead of name, an input file containing names of organisms can be provided using the -f option.

Requirements

These steps are done in BioVM but can be replicated on any system where the following programs can be installed.

  • MySQL (5.7.22) for the database.
  • Perl (5.22.1) and the DBI and DBD modules for initialising the database with the biosql schema and for the load_taxonomy.pl script.
  • Python (3.5.2) and the MySQLdb module (1.3.7) for the lineager.py script.

All these steps are done as a user (here it is biovm). sudo privileges are required for installing packages, creating the database and the database user.

Install MySQL:

sudo apt install -y mysql-server

During installation, you will be prompted to set a password for the mysql admin user (-u root in mysql commands). You will need this password to connect and create the database later.

After installation, you can check if the MySQL server is up and running using:

sudo netstat -tlpn | grep mysql
tcp        0      0 127.0.0.1:3306          0.0.0.0:*               LISTEN      3113/mysqld

Install the Perl DBI, DBD modules and the Python MySQL client libraries required for interacting with the MySQL database:

sudo apt install -y libdbi-perl libdbd-mysql-perl \
  libmysqlclient-dev python3-mysqldb

Create the biosql database

Login as the mysql admin user:

mysql -u root -p

Create database. Note the use of backticks (`) instead of single quotes for the database name:

CREATE DATABASE `biosql` COLLATE 'utf8_general_ci';

Create database user and grant permissions to the biosql database.

Note: set a strong password under IDENTIFIED BY

CREATE USER `biovm`@`localhost` IDENTIFIED BY 'your-password-here';
GRANT ALL PRIVILEGES ON `biosql`.* TO `biovm`@`localhost`;

Quit mysql shell by typing exit.

For the database connection to work in the lineager.py script, create a file .my.cnf in the user’s HOME directory with the following content:

[client]
user = biovm
password = your-password-here
database = biosql

Download the latest release of the BioSQL schema:

wget -c https://github.com/biosql/biosql/archive/biosql-release-1-0-1.tar.gz

Decompress the archive, change directory:

tar zxvf biosql-release-1-0-1.tar.gz
cd biosql-biosql-release-1-0-1

Initialise database with BioSQL schema

Edit file biosqldb-mysql.sql in the sql directory and remove TYPE=INNODB from all CREATE TABLE statements. See footnote1 for details.

Initialise database by executing the SQL script:

mysql -u biovm -D biosql < sql/biosqldb-mysql.sql

Import NCBI taxonomy data

While still in the biosql-biosql-release-1-0-1 directory, create a directory to store taxonmy data:

mkdir taxdata

Download taxonomy database, verify integrity of the downloaded archive and then uncompress the archive:

cd taxdata
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5

md5sum -c taxdump.tar.gz.md5
# this should return
taxdump.tar.gz: OK

# uncompress archive
tar zxvf taxdump.tar.gz

# Move to parent directory
cd ..

Import taxonomy data into the biosql database. For the script to connect to the database, the DBI_PASSWORD environment variable will need to be set. The password is the one set for the database user earlier:

export DBI_PASSWORD='your-password-here'

Now load taxonomy data:

perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser biovm

This will take some time to complete:

Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... updating new parent IDs
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.

To view options the script supports, install the perl-doc program and then use --help:

perl scripts/load_ncbi_taxonomy.pl --help

Run lineager.py to get lineage

Clone git repository:

cd
git clone --depth 1 https://gitlab.com/vimalkvn/lineager

Run script:

cd lineager
python3 lineager.py -n Escherichia coli

If you have a file input.txt containing names like this:

Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana

then run the script like this:

python3 lineager.py -f input.txt

At the end of the run, an output file lineage.csv will be generated in the same directory containing lineage information for all organisms in the input file.

Footnotes


  1. Without this, the following error will occur when attempting to initialise the database:

    ERROR 1064 (42000) at line 49: You have an error
    in your SQL syntax; check the manual that corresponds
    to your MySQL server version for the right syntax
    to use near 'TYPE=INNODB' at line 8.
    
    This happens because the TYPE=INNODB clause was
    changed to ENGINE=INNODB since MySQL version 5.5. As
    INNODB is the default, it is not necessary to specify
    it in the CREATE TABLE statements.
    

Comments