A Python script to obtain the complete taxonomic lineage of an organism from NCBI taxonomy data.
This is a method I followed to get an organism’s complete lineage from NCBI taxonomy data. The procedure involves:
- Creating a MySQL database
- Initialising the database with the BioSQL schema
- Downloading and importing NCBI taxonomy data into the database
- Using a python script to query the database and return organism’s complete lineage
Much of the information in this post comes from the documentation of the BioSQL and BioPython projects. In addition, I wrote a Python script to query the database and output results.
Here is how the script works:
python3 lineager.py -n Bos taurus
Sample output:
2018-07-14 04:47 INFO Processing organism name provided
at the command line: Bos taurus
Organism,Lineage
Bos taurus,cellular organisms;Eukaryota;Opisthokonta;Metazoa;
Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;
Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;
Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;
Eutheria;Boreoeutheria;Laurasiatheria;Cetartiodactyla;
Ruminantia;Pecora;Bovidae;Bovinae;Bos;Bos taurus
Instead of name, an input file containing names of organisms
can be provided using the -f
option.
Requirements
These steps are done in BioVM but can be replicated on any system where the following programs can be installed.
- MySQL (5.7.22) for the database.
- Perl (5.22.1) and the DBI and DBD modules
for initialising the database with the biosql schema
and for the
load_taxonomy.pl
script. - Python (3.5.2) and the MySQLdb module (1.3.7) for the
lineager.py
script.
All these steps are done as a user (here it is biovm). sudo privileges are required for installing packages, creating the database and the database user.
Install MySQL:
sudo apt install -y mysql-server
During installation, you will be prompted to set a
password for the mysql admin user (-u root
in mysql commands). You will need this password to connect and create the database later.
After installation, you can check if the MySQL server is up and running using:
sudo netstat -tlpn | grep mysql
tcp 0 0 127.0.0.1:3306 0.0.0.0:* LISTEN 3113/mysqld
Install the Perl DBI, DBD modules and the Python MySQL client libraries required for interacting with the MySQL database:
sudo apt install -y libdbi-perl libdbd-mysql-perl \
libmysqlclient-dev python3-mysqldb
Create the biosql database
Login as the mysql admin user:
mysql -u root -p
Create database. Note the use of backticks (`) instead of single quotes for the database name:
CREATE DATABASE `biosql` COLLATE 'utf8_general_ci';
Create database user and grant permissions to the biosql database.
Note: set a strong password under IDENTIFIED BY
CREATE USER `biovm`@`localhost` IDENTIFIED BY 'your-password-here';
GRANT ALL PRIVILEGES ON `biosql`.* TO `biovm`@`localhost`;
Quit mysql shell by typing exit
.
For the database connection to work in the lineager.py
script,
create a file .my.cnf
in the user’s HOME directory with
the following content:
[client]
user = biovm
password = your-password-here
database = biosql
Download the latest release of the BioSQL schema:
wget -c https://github.com/biosql/biosql/archive/biosql-release-1-0-1.tar.gz
Decompress the archive, change directory:
tar zxvf biosql-release-1-0-1.tar.gz
cd biosql-biosql-release-1-0-1
Initialise database with BioSQL schema
Edit file biosqldb-mysql.sql
in the sql
directory
and remove TYPE=INNODB
from all CREATE TABLE
statements.
See footnote1 for details.
Initialise database by executing the SQL script:
mysql -u biovm -D biosql < sql/biosqldb-mysql.sql
Import NCBI taxonomy data
While still in the biosql-biosql-release-1-0-1
directory,
create a directory to store taxonmy data:
mkdir taxdata
Download taxonomy database, verify integrity of the downloaded archive and then uncompress the archive:
cd taxdata
wget -c https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz
wget https://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.md5
md5sum -c taxdump.tar.gz.md5
# this should return
taxdump.tar.gz: OK
# uncompress archive
tar zxvf taxdump.tar.gz
# Move to parent directory
cd ..
Import taxonomy data into the biosql database. For
the script to connect to the database, the DBI_PASSWORD
environment variable will need to be set. The password
is the one set for the database user earlier:
export DBI_PASSWORD='your-password-here'
Now load taxonomy data:
perl scripts/load_ncbi_taxonomy.pl --dbname biosql --dbuser biovm
This will take some time to complete:
Loading NCBI taxon database in taxdata:
... retrieving all taxon nodes in the database
... reading in taxon nodes from nodes.dmp
... insert / update / delete taxon nodes
... updating new parent IDs
... (committing nodes)
... rebuilding nested set left/right values
... reading in taxon names from names.dmp
... deleting old taxon names
... inserting new taxon names
... cleaning up
Done.
To view options the script supports, install the
perl-doc program and then use --help
:
perl scripts/load_ncbi_taxonomy.pl --help
Run lineager.py to get lineage
Clone git repository:
cd
git clone --depth 1 https://gitlab.com/vimalkvn/lineager
Run script:
cd lineager
python3 lineager.py -n Escherichia coli
If you have a file input.txt
containing names like this:
Canis lupus familiaris
Bos taurus
Escherichia
AMBIGUOUS
Arabidopsis thaliana
then run the script like this:
python3 lineager.py -f input.txt
At the end of the run, an output file lineage.csv
will be
generated in the same directory containing lineage information for all organisms in the input file.
Footnotes
-
Without this, the following error will occur when attempting to initialise the database:
ERROR 1064 (42000) at line 49: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'TYPE=INNODB' at line 8. This happens because the TYPE=INNODB clause was changed to ENGINE=INNODB since MySQL version 5.5. As INNODB is the default, it is not necessary to specify it in the CREATE TABLE statements.
Comments