Blast Documentation¶
This module uses NCBI’s standalone blast to generate blastn results. The results are parsed for the best hit, which are used to get accession numbers.
What is BLAST?¶
Per NCBI, the Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
We use NCBI’s blastn task to generate a best hit in order to infer
orthology which is under the umbrella of comparative genetics/genomics
Comparative genetics/genomics is a field of biological research in which
the genome sequences of different species human, mouse, and a wide
variety of other organisms from bacteria to chimpanzees are compared.
Using this package, we compared these genes of interest across a group of species.
How do we configure and run blast?¶
Running blast is the most complex aspect of this package, but we’ve found a way to simplify the automation of blasting while also limiting blast searches by taxonomy id.
Before you use this function, you need for NCBI Blast+ to be
installed and in your path. Download the latest standalone blast
executables from
here.
We are currently using version 2.8.1.
Our Blast Methods¶
NCBI’s blastn can be configured (using its parameters) in a number
of different ways (i.e. local or remote use and with seqidlists or
taxids). For typical orthology analyses, it’s important to take
advantage of the speed and efficiency of NCBI’s newest preformatted
blast databases
(blastdbv5). In order to
do that, we’ve implemented a method (1) that uses taxids (taxonomic
groups — species level and higher level taxa). View more about our
methods below.
Method |
Description |
|---|---|
1 |
Local blast using taxids. Utilizes local databases
( |
2 |
Remote blast using an entrez query. Uses entrez species name and query |
None |
A single query method not useful for orthology inference |
Our Custom Accession File Format¶
We use a specifically formatted accession file with our headers as
Tier, Gene, Organism to store blast output and input. This
allows for distinguishing genes by families or features. The Tier
header can be omitted, but the other headers are requirements. The
Accession numbers are stored in a .csv file. The following table is
an example of how we format our blast input file.
Tier |
Gene |
Homo_sapiens |
Macaca_mulatta |
Mus_musculus |
Rattus_norvegicus |
|---|---|---|---|---|---|
1 |
ADRA1A |
NM_000680.3 |
|||
2 |
ADRA1B |
NM_000679.3 |
|||
3 |
ADRA1D |
NM_000678.3 |
|||
4 |
ADRA2A |
NM_000681.3 |
|||
Immune |
ADRA2B |
NM_000682.6 |
|||
Addiction |
CHRM1 |
NM_000738.2 |
|||
Ugly |
CHRM2 |
NM_000739.2 |
|||
Other |
CHRM3 |
NM_000740.2 |
|||
GPCR |
CHRM5 |
NM_012125.3 |
|||
Isoforms |
CNR1 |
NM_016083.4 |
The .csv file requires some manual configuration, and, while tedious, it is also currently fundamental for the API.
Below we have defined the headers:
Tier: The target genes need a ranking or categorization based on the experiment. These can be user defined or a preset tier system can be used. In the future the different tiers will allow the user to control the order that each gene is processed.
Gene: The genes are HGNC aliases for the target genes of interest. In the future we will be able to process the HGNC .csv file to further automate the creation of this template file.
Query: The query organism is placed into the 3rd column of the .csv file. In the example Homo sapiens is used. Each taxa is a string in the format of “Genus_species”. The query organism also has to have accession numbers for each gene. It is therefore highly important to pick a well annotated species for accurate analysis.
Examples¶
The main class to use is OrthoBlastN in order to run blast. In order
to run OrthoBlastN without using our database management features,
the BLASTDB paths must be set in your environment.
Performing Blast & Post-Blast Analysis¶
from OrthoEvol.Orthologs.Blast import OrthoBlastN
# Use an existing list of gpcr genes
gpcr_blastn = OrthoBlastN(project="orthology-gpcr", method=1,
save_data=True, acc_file="gpcr.csv",
copy_from_package=True)
# View the list of genes
gpcr_blastn.gene_list
# View the blast dataframe
gpcr_blastn.df
# Start the blast
gpcr_blastn.run()
# Use your own accessions file.
# You don't need to copy from package to use your own genes
my_blastn = OrthoBlastN(project="orthology-project", method=1,
save_data=True, acc_file="mygenes.csv",
copy_from_package=False)
my_blastn.run()
Customing with BaseBlastN¶
from OrthoEvol.Orthologs.Blast import BaseBlastN
# This is more pythonic with YAML loading
blastconfig = {
"project": "test",
"method": 1,
"taxon_file": None,
"go_list": None,
"post_blast": True,
"template": None,
"save_data": True,
"copy_from_package": False,
"acc_file": "test_blast.csv",
"project_path": None,
"proj_mana": None,
"ref_species": "Homo_sapiens"
}
test_blast = BaseBlastN(**blastconfig)
test_blast.configure(test_blast.blast_human, auto_start=True)