Blast Documentation
This module uses NCBI’s standalone blast to generate blastn results. The results are parsed for the best hit, which is then used to get accession numbers.
What is BLAST?
Per NCBI, the Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.
We use NCBI’s blastn task to generate a best hit in order to infer orthology, which falls under the umbrella of comparative genetics/genomics. Comparative genetics/genomics is a field of biological research in which the genome sequences of different species (human, mouse, and a wide variety of other organisms from bacteria to chimpanzees) are compared.
Using this package, we compared our genes of interest across a group of species.
How do we configure and run blast?
Running blast is the most complex aspect of this package, but we’ve found a way to simplify the automation of blasting while also limiting blast searches by organism.
Before you use this function, NCBI Blast+ must be installed and in your path. Download the latest standalone blast executables from here.
The story of 2 blast methods - Seqids vs Windowmasking
Seqids
Windowmasking
We have perfected the method of using a windowmasker file for each taxonomy id of the organisms that we are analyzing. The blastn executable can filter a query sequence using the windowmasker data files. This option can be used to mask interspersed repeats that may lead to spurious matches. The windowmasker data files should be downloaded from the NCBI FTP site.
For information on how to set up a window masker database, read our setup tutorial.
On the command line, a windowmasker search looks like this:
blastn -query input -db database -window_masker_taxid 9606 -out results.txt
This requires a WINDOW_MASKER_PATH variable to be set in your environment variables.
In python:
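A minimal sketch of the same call from Python, using only the standard library's `subprocess` module. The helper names `build_blastn_command` and `run_blastn` are hypothetical and not part of this package, and the query, database, and output names are the placeholders from the command above:

```python
import os
import shutil
import subprocess


def build_blastn_command(query, db, taxid, out):
    """Return the blastn argument list for a windowmasker-filtered search."""
    return [
        "blastn",
        "-query", str(query),
        "-db", str(db),
        "-window_masker_taxid", str(taxid),
        "-out", str(out),
    ]


def run_blastn(query, db, taxid, out):
    """Run blastn after checking the prerequisites described above."""
    # blastn must be installed and discoverable on the PATH.
    if shutil.which("blastn") is None:
        raise RuntimeError("blastn was not found on the PATH")
    # The windowmasker data files are located through this variable.
    if "WINDOW_MASKER_PATH" not in os.environ:
        raise RuntimeError("WINDOW_MASKER_PATH is not set")
    subprocess.run(build_blastn_command(query, db, taxid, out), check=True)


# Build (but do not run) the command for Homo sapiens, taxonomy id 9606.
cmd = build_blastn_command("input", "database", 9606, "results.txt")
print(" ".join(cmd))
```

Passing the arguments as a list avoids shell quoting problems, and `check=True` raises an exception if blastn exits with an error.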
In addition to using windowmasker data files, we also use a specifically formatted accession file with our headers as Tier, Gene, Organism to store blast output and input. This allows for distinguishing genes by families or features. The Tier header can be omitted, but the other headers are requirements.
The Accession numbers are stored in a .csv file. The following table is an example of how we format our blast input file.
Tier | Gene | Homo_sapiens | Macaca_mulatta | Mus_musculus | Rattus_norvegicus |
---|---|---|---|---|---|
1 | ADRA1A | NM_000680.3 | | | |
2 | ADRA1B | NM_000679.3 | | | |
3 | ADRA1D | NM_000678.3 | | | |
4 | ADRA2A | NM_000681.3 | | | |
Good | ADRA2B | NM_000682.6 | | | |
Bad | CHRM1 | NM_000738.2 | | | |
Ugly | CHRM2 | NM_000739.2 | | | |
Other | CHRM3 | NM_000740.2 | | | |
GPCR | CHRM5 | NM_012125.3 | | | |
Isoforms | CNR1 | NM_016083.4 | | | |
The .csv file requires some manual configuration, and, while tedious, it is also currently fundamental for the API.
Below we have defined the headers:
- Tier: The target genes need a ranking or categorization based on the experiment. These can be user defined or a preset tier system can be used. In the future the different tiers will allow the user to control the order that each gene is processed.
- Gene: The genes are HGNC aliases for the target genes of interest. In the future we will be able to process the HGNC .csv file to further automate the creation of this template file.
- Query: The query organism is placed into the 3rd column of the .csv file. In the example, Homo sapiens is used. Each taxon is a string in the format of “Genus_species”. The query organism must also have an accession number for each gene, so it is important to pick a well-annotated species for accurate analysis.
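The header rules above can be sketched as a small loader. This is an illustration only, built on the standard library's `csv` module: `load_accessions` is a hypothetical helper, not part of the package API, and it assumes the query organism is the first column after Tier and Gene. The sample rows are taken from the table above.

```python
import csv
import io

# Two rows from the example table, written as the .csv file would store them.
SAMPLE = """Tier,Gene,Homo_sapiens,Macaca_mulatta,Mus_musculus,Rattus_norvegicus
1,ADRA1A,NM_000680.3,,,
2,ADRA1B,NM_000679.3,,,
"""


def load_accessions(handle):
    """Return (query_organism, {gene: {organism: accession or None}})."""
    reader = csv.DictReader(handle)
    fields = reader.fieldnames or []
    # Tier may be omitted, but Gene and at least one organism are required.
    if "Gene" not in fields:
        raise ValueError("the accession file needs a Gene column")
    organisms = [name for name in fields if name not in ("Tier", "Gene")]
    if not organisms:
        raise ValueError("the accession file needs at least one organism column")
    query = organisms[0]
    table = {}
    for row in reader:
        # The query organism must have an accession number for every gene.
        if not row[query]:
            raise ValueError(f"{row['Gene']} has no accession for {query}")
        table[row["Gene"]] = {org: row[org] or None for org in organisms}
    return query, table


query, table = load_accessions(io.StringIO(SAMPLE))
print(query, table["ADRA1A"])
```

Empty cells are returned as `None`, which marks the organisms that still need a blast hit for that gene.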
Examples
The main class used to run blast is OrthoBlastN. To run OrthoBlastN without using our database management features, the BLASTDB and WINDOW_MASKER_PATH paths must be set.
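A quick way to confirm both variables are set before creating the class is sketched below. The helper `missing_blast_paths` is hypothetical and only inspects the environment; it does not check that the directories contain valid data.

```python
import os

# The two variables named above.
REQUIRED_VARIABLES = ("BLASTDB", "WINDOW_MASKER_PATH")


def missing_blast_paths(environ=None):
    """Return the names of required variables that are unset or empty."""
    if environ is None:
        environ = os.environ
    return [name for name in REQUIRED_VARIABLES if not environ.get(name)]


missing = missing_blast_paths()
if missing:
    print("Set these variables before running blast:", ", ".join(missing))
```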
Performing Blast & Post-Blast Analysis
from OrthoEvol.Orthologs.Blast import OrthoBlastN
import os

# Create a blast configuration dictionary
blast_cfg = {
    "taxon_file": None,
    "go_list": None,
    "post_blast": True,
    "template": None,
    "save_data": True,
    "copy_from_package": True,
    "MAF": 'MAFV3.2.csv'
}

# Use the current working directory as the project path
path = os.getcwd()
myblast = OrthoBlastN(proj_mana=None, project="blast-test", project_path=path, **blast_cfg)

# If you want to immediately start blasting, set auto_start to True
myblast.blast_config(myblast.blast_human, 'Homo_sapiens', auto_start=False)
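As noted at the top of this page, blast results are parsed for the best hit. The sketch below illustrates that idea only. It assumes tabular output (`-outfmt 6`, where the subject id is the 2nd column and the bit score is the 12th), which this page does not confirm the package uses. `best_hit` is a hypothetical helper, and the sample rows are made-up placeholders.

```python
import csv
import io


def best_hit(handle):
    """Return (subject_id, bitscore) for the highest-scoring row, or None."""
    best = None
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and comment lines.
        if not row or row[0].startswith("#"):
            continue
        bitscore = float(row[11])
        if best is None or bitscore > best[1]:
            best = (row[1], bitscore)
    return best


# Placeholder rows in the 12-column tabular layout (not real results).
SAMPLE = (
    "query_1\tsubject_A\t98.5\t200\t3\t0\t1\t200\t10\t209\t1e-100\t350\n"
    "query_1\tsubject_B\t90.0\t180\t18\t0\t1\t180\t5\t184\t1e-60\t220\n"
)
print(best_hit(io.StringIO(SAMPLE)))
```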
Making the API available with Accession data
TODO: This is unfinished.
from OrthoEvol.Orthologs.CompGenetics import CompGenAnalysis