Blast Documentation

This module uses NCBI’s standalone BLAST to generate blastn results. The results are parsed for the best hit, which is then used to get accession numbers.

What is BLAST?

Per NCBI, the Basic Local Alignment Search Tool (BLAST) finds regions of local similarity between sequences. The program compares nucleotide or protein sequences to sequence databases and calculates the statistical significance of matches. BLAST can be used to infer functional and evolutionary relationships between sequences as well as help identify members of gene families.

We use NCBI’s blastn task to generate a best hit in order to infer orthology, which falls under the umbrella of comparative genetics/genomics. Comparative genetics/genomics is a field of biological research in which the genome sequences of different species (human, mouse, and a wide variety of other organisms, from bacteria to chimpanzees) are compared.

Using this package, we compared our genes of interest across a group of species.
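The best-hit idea described above can be sketched in a few lines of Python. This is only an illustration, not the package's own parser: it assumes the results were written in BLAST+ tabular format (`-outfmt 6`, with its default twelve columns) and treats the hit with the highest bit score as the best hit. The function name `best_hits` and the subject accessions in the sample rows are made up for the example.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
OUTFMT6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(tabular_text):
    """Return {query id: row dict}, keeping the highest-bitscore hit per query."""
    best = {}
    reader = csv.DictReader(
        io.StringIO(tabular_text), fieldnames=OUTFMT6_COLUMNS, delimiter="\t"
    )
    for row in reader:
        # Skip comment lines such as those written by -outfmt 7.
        if row["qseqid"].startswith("#"):
            continue
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


# Made-up rows for illustration only; these are not real BLAST results.
example = (
    "NM_000680.3\tXM_000001.1\t95.0\t1400\t70\t0\t1\t1400\t1\t1400\t0.0\t2100\n"
    "NM_000680.3\tXM_000002.1\t88.0\t1300\t150\t6\t1\t1300\t1\t1300\t0.0\t1500\n"
)
hits = best_hits(example)
print(hits["NM_000680.3"]["sseqid"])
```

The subject ID of each best hit is the accession number that would be recorded for that gene and species.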

How do we configure and run blast?

Running blast is the most complex aspect of this package, but we’ve found a way to simplify the automation of blasting while also limiting blast searches by taxonomy id.

Before you use this function, NCBI BLAST+ must be installed and on your PATH. Download the latest standalone BLAST executables from NCBI. We are currently using version 2.8.1.
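A quick way to confirm the requirement above is to ask Python whether `blastn` can be found on your PATH. This is a minimal sketch using only the standard library; the helper name `find_blastn` is ours and is not part of this package.

```python
import shutil


def find_blastn(executable="blastn"):
    """Return the full path to the executable, or None if it is not on PATH."""
    return shutil.which(executable)


path = find_blastn()
if path is None:
    print("blastn was not found on PATH; install NCBI BLAST+ first.")
else:
    print(f"Found blastn at {path}")
```

If the executable is found, running `blastn -version` in a terminal shows which BLAST+ release is installed.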

Our Blast Methods

NCBI’s blastn can be configured through its parameters in a number of different ways (e.g., local or remote use, and with seqidlists or taxids). For typical orthology analyses, it’s important to take advantage of the speed and efficiency of NCBI’s newest preformatted BLAST databases (blastdbv5). To do that, we’ve implemented a method (1) that uses taxids, which identify taxonomic groups at the species level or higher. View more about our methods below.

Method	Description
1	Local blast using taxids. Utilizes local databases (`refseq_rna_v5`).
2	Remote blast using an Entrez query. Uses the Entrez species name and query.
None	A single query method that is not useful for orthology inference.
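To make the difference between methods 1 and 2 concrete, the sketch below builds the kind of blastn command line each one implies. It is an assumption about how such commands could look, not the package's actual implementation: the helper `blastn_command`, the file names, the remote database name `refseq_rna`, and the choice of `-outfmt 6` are all illustrative. The flags themselves (`-db`, `-query`, `-out`, `-outfmt`, `-taxids`, `-remote`, `-entrez_query`) are standard BLAST+ options.

```python
def blastn_command(method, query, out, taxid=None, organism=None):
    """Build an argument list for blastn (illustrative; covers methods 1 and 2 only)."""
    cmd = ["blastn", "-query", query, "-out", out, "-outfmt", "6"]
    if method == 1:
        # Local search against a v5 database, limited to one taxonomy ID.
        if taxid is None:
            raise ValueError("method 1 needs a taxid")
        cmd += ["-db", "refseq_rna_v5", "-taxids", str(taxid)]
    elif method == 2:
        # Remote search on NCBI's servers, limited by an Entrez query.
        if organism is None:
            raise ValueError("method 2 needs an organism name")
        cmd += ["-db", "refseq_rna", "-remote",
                "-entrez_query", f"{organism}[Organism]"]
    else:
        raise ValueError("this sketch only covers methods 1 and 2")
    return cmd


# 10090 is the NCBI taxonomy ID for Mus musculus (house mouse).
local_cmd = blastn_command(1, "ADRA1A.fasta", "ADRA1A_mouse.tsv", taxid=10090)
remote_cmd = blastn_command(2, "ADRA1A.fasta", "ADRA1A_mouse.tsv",
                            organism="Mus musculus")
print(" ".join(local_cmd))
print(" ".join(remote_cmd))
```

Either list could be passed to `subprocess.run` once BLAST+ is installed and, for method 1, the local database is available.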

Our Custom Accession File Format

We use a specifically formatted accession file, with the headers Tier, Gene, and Organism, to store blast input and output. This allows genes to be distinguished by family or feature. The Tier header can be omitted, but the other headers are required. The accession numbers are stored in a .csv file. The following table is an example of how we format our blast input file.

Tier	Gene	Homo_sapiens
1	ADRA1A	NM_000680.3
2	ADRA1B	NM_000679.3
3	ADRA1D	NM_000678.3
4	ADRA2A	NM_000681.3
Immune	ADRA2B	NM_000682.6
Addiction	CHRM1	NM_000738.2
Ugly	CHRM2	NM_000739.2
Other	CHRM3	NM_000740.2
GPCR	CHRM5	NM_012125.3
Isoforms	CNR1	NM_016083.4

The .csv file requires some manual configuration. While tedious, this file is currently fundamental to the API.

Below we have defined the headers:

Tier: The target genes need a ranking or categorization based on the experiment. These can be user defined, or a preset tier system can be used. In the future, the different tiers will allow the user to control the order in which each gene is processed.
Gene: The genes are HGNC aliases for the target genes of interest. In the future, we will be able to process the HGNC .csv file to further automate the creation of this template file.
Query: The query organism is placed in the third column of the .csv file (the Organism header). In the example, Homo sapiens is used. Each taxon is a string in the format “Genus_species”. The query organism must also have an accession number for each gene, so it is important to pick a well-annotated species for accurate analysis.
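The header rules above can be checked before starting a blast run. The sketch below is not part of the package; it assumes a comma-separated file laid out like the example table, and the helper name `read_accessions` is hypothetical. It treats Tier as optional, requires Gene, and takes the first remaining column as the query organism.

```python
import csv
import io


def read_accessions(handle):
    """Return ({gene: accession}, query organism) from an accession .csv handle."""
    reader = csv.DictReader(handle)
    headers = reader.fieldnames or []
    if "Gene" not in headers:
        raise ValueError("the accession file needs a Gene header")
    # Tier is optional; any other non-Gene column is an organism.
    organisms = [h for h in headers if h not in ("Tier", "Gene")]
    if not organisms:
        raise ValueError("the accession file needs a query organism column")
    query = organisms[0]
    accessions = {row["Gene"]: row[query] for row in reader}
    return accessions, query


# The first rows of the example table, written as comma-separated text.
sample = (
    "Tier,Gene,Homo_sapiens\n"
    "1,ADRA1A,NM_000680.3\n"
    "2,ADRA1B,NM_000679.3\n"
)
accessions, query = read_accessions(io.StringIO(sample))
print(query, accessions)
```

For a file on disk, the same function can be called with `open("mygenes.csv", newline="")` in place of the in-memory sample.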

Examples

The main class for running blast is OrthoBlastN. To run OrthoBlastN without using our database management features, the BLASTDB paths must be set in your environment.
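One way to satisfy the BLASTDB requirement from inside Python is to set the environment variable before creating the blast object. This is a minimal sketch: `BLASTDB` is the variable BLAST+ itself reads to locate databases, but the directory used here (`~/blastdb`) is a hypothetical location and should be replaced with wherever your databases actually live.

```python
import os

# Hypothetical database directory; replace with your own location.
blastdb_dir = os.path.join(os.path.expanduser("~"), "blastdb")

# Processes started from this Python session inherit the variable.
os.environ["BLASTDB"] = blastdb_dir
print(os.environ["BLASTDB"])
```

Setting the variable in your shell profile instead has the same effect and persists across sessions.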

Performing Blast & Post-Blast Analysis

from OrthoEvol.Orthologs.Blast import OrthoBlastN


# Use an existing list of GPCR genes that ships with the package
gpcr_blastn = OrthoBlastN(project="orthology-gpcr", method=1,
                          save_data=True, acc_file="gpcr.csv",
                          copy_from_package=True)

# View the list of genes
gpcr_blastn.gene_list

# View the blast dataframe
gpcr_blastn.df

# Start the blast
gpcr_blastn.run()

# Use your own accessions file.
# You don't need to copy from the package to use your own genes.
my_blastn = OrthoBlastN(project="orthology-project", method=1,
                        save_data=True, acc_file="mygenes.csv",
                        copy_from_package=False)

my_blastn.run()

Customizing with BaseBlastN

from OrthoEvol.Orthologs.Blast import BaseBlastN

# This configuration would be more pythonic if loaded from a YAML file
blastconfig = {
    "project": "test",
    "method": 1,
    "taxon_file": None,
    "go_list": None,
    "post_blast": True,
    "template": None,
    "save_data": True,
    "copy_from_package": False,
    "acc_file": "test_blast.csv",
    "project_path": None,
    "proj_mana": None,
    "ref_species": "Homo_sapiens"
}


test_blast = BaseBlastN(**blastconfig)
test_blast.configure(test_blast.blast_human, auto_start=True)