12. References (enrichment)
12.1. Enrichment tools
- class GSEA(gene_sets={})[source]
Constructor
- Parameters:
species --
- compute_enrichment(gene_list, background=None, verbose=False, outdir=None)[source]
- Parameters:
gene_list -- list of genes (e.g. genes with significant fold change)
background -- expected background of the species. Should be number of genes for the species of interest.
verbose --
outdir (str) -- a temporary directory to store reports and intermediate results
- gene_sets
attribute to store the gene sets to be checked for enrichment
- class KEGGPathwayEnrichment(gene_lists, organism, progress=True, mapper=None, background=None, padj_cutoff=0.05, preload_directory=None, convert_input_gene_to_upper_case=False, color_node_with_annotation='Name', used_genes=None)[source]
KEGG Pathways enrichment from DGE results
DGE = Differentially Gene Expression
Current input is the output of the RNADiff analysis. This is a file than can be read by RNADiffResults
This class takes as input a dictionary with 3 lists of genes: 'up', 'down' and 'all'.
For example, using the output of the RNA-seq DGE, use:
from sequana import RNADiffResults r = RNADiffResults("rnadiff.csv") r._log2_fc = 1 gl = r.get_gene_lists(annot_col="gene_name") gl['KO_vs_cont"] ke = KEGGPathwayEnrichment(gl, "mmu") # mmu for mouse
this calls ke.compute_enrichment() that stores the up, down and all results in the attribute
enrichment
as a dictionary.You can now plot the results:
ke.barplot('down')
and save enriched pathways as follows:
up = ke.save_significant_pathways("up") down = ke.save_significant_pathways("down") up.to_csv("kegg_pathway_up_regulated.csv") down.to_csv("kegg_pathway_down_regulated.csv")
This class works like a charm for ecoli with GFF that uses gene names.
For mus musculus, organism is mmu (not **mus*). you will need to have a mapping of the Ensembl ID into KEGG IDs (actually gene name).
You can perform the conversion using BioServices/BioMart. We have implemented a simple function inside Sequana:
from sequana import Mart conv = Mart("mmusculus_gene_ensembl") df = conf.query() conf.save(df)
You can then import the dataframe, as follows using the mapper argument:
import pandas as pd df = pd.read_csv("biomart.csv") df = df.rename({"external_gene_name":"name", "ensembl_gene_id": "ensembl"}, axis=1) df = df.set_index("ensembl", inplace=True) ke = KEGGPathwayEnrichment("path_to_rnadiff", "mmu", mapper=df) ke.scatterplot('down') savefig("B4052_T1vsT0_KE_scatterplot_down.png") ke.scatterplot('up') savefig("B4052_T1vsT0_KE_scatterplot_up.png")
Pathways are loaded from KEGG which may take some time. For development or production, you may want to save the KEGG pathways in a given place. Note however, that the pathways that are enriched will still be downloaded for the final annotation and image creation
To save the pathways locally and load them later do as follows (here for human):
ke = KEGGPathwayEnrichment({}, organism="hsa") ke.save_pathways("all_pathways/") # load them back next time ke = KEGGPathwayEnrichment({}, organism="hsa", preload_directory="all_pathways")
In some cases, the input identifiers are converted into names thanks to the input mapper (csv file). Yet, if the external name are from one species and you use another species in kegg, the kegg names may be upper case while your species' name are in lower case. In such situations, you may set input identifiers are upper case setting the convert_input_gene_to_upper_case parameter to True
- find_pathways_by_gene(gene_name)[source]
Returns pathways that contain the gene name
ke.find_pathways_by_gene("ysgA")
- save_project(tag, outdir='.')[source]
Save tables and visualisations of the complete enrichmment analysis.
- class Mart(dataset="mmusculus_gene_ensembl") # you could choose hsapiens_gene_ensembl for instance df = conv.query() df.set_index("ensembl_gene_id") conv.save(df)[source]
The file can now be loaded in e.g.
KeggPathwayEnrichment
as a mapper of the ensemble identifier to external names understood by Kegg.For fungi (cryptococcus), you can use:
m = Mart(host="fungi.ensembl.org", dataset="cneoformans_eg_gene", mart="fungi_mart") m.query(['cneoformans_eg_gene'])
See more information on https://bioservices.readthedocs.io
- property dataset
- class PantherEnrichment(gene_lists, taxon, requests_per_sec=10, padj_threshold=0.05, log2_fc_threshold=0, fc_threshold=None, enrichment_fdr=0.05, annot_col='Name')[source]
By default, we keep only the genes with a adjusted pvalue <= 0.05. The fold_change threshold is on a log2 scale and set to 0 (meaning no filtering). Only one value is requested and used to filter out positive and negative fold change (on the log2 scale). In other word, a log2 fold change threshold of 2 means that we filter out values between -2 and 2.
If you prefer to work in natural scale, just set the parameter fc_threshold, which overwrite the log2_fc_threshold parameter.
pe = PantherEnrichment("input_file.tsv", taxon=10090, log2_fc_threshold=1) # compute enrichment for genes down and up pe.compute_enrichment() # Results for up case is stored in pe.enrichment # then, we plot the most mportat go terms df_up = pe.plot_go_terms("up") df = pe.plot_go_terms("up", pe.MF) pe.save_chart(df, "chart_MF_up.png") # all 3 main ontology df = pe.plot_go_terms("up") pe.save_chart(df, "chart_up.png")
e.stats contains some statistics. One important is the list of unmapped genes. The results from the GO enrichment are stored in the attributes enrichment. There, we have again adjusted p-value and a fold enrichment, which can in turn be filtered or not.
You can retrieve the cleaned data using the get_data method.
You can also plot the GO terms that are significantly enriched using:
e.plot_go_terms(['MF', 'CC', 'BP])
This function returns the dataframe used during the plotting.
If you want to look at the up regulated genes only:
e.compute_enrichment(pe.mygenes["up"], 83333) df = e.plot_go_terms(['MF', 'CC', 'BP'], log=False, include_negative_enrichment=False, fontsize=8, sort_by='fold_enrichment', show_pvalues=True, fdr_threshold=0.05)
The number of genes is limited to about 3100 depending (don't ask me why, this seem to be a hard-coded limitation on PantherDB website). In such case, you should add a filter e.g on padj or fold change
rnadiff if provided, superseeds the input filename. This is useful for debugging
- compute_enrichment(taxid=None, ontologies=None, enrichment_test='FISHER', correction='FDR')[source]
- Parameters:
enrichment_test -- Fisher or Binomial
correction -- FDR or Bonferonni
The field number_in_reference indicates from the reference, the number of genes that have a given ontolgy term. For instance, 998 genes have the term. This is stored in number_in_reference. If the reference contains 4391 genes, and you provided 49 genes , the expected number of genes that have this ontology term is 49*998/4391 that is 11.1369, which is stored in "expected.
Now, if you actually find that 14 out of 49 genes have the term, you need to compare the numbers 11.1369 and 14. Are they really different ? The ratio 14 / 11.1369 is stored in fold_enrichment. The pvalue and FDR are stored as well.
Some genes may be missing If you provide 50 genes, you may end up with only 45 being mapped onto panther db database. This may explain some differences with the expected value.
Fold enrichment is the number_in_list / expected ratio. Another close metric is the fractional difference: (observed - expected) / expected. This metric is slighlty less than the fold enrichment
To get the number f genes from the database, simply take one output, and compute number_in_reference * number_in_list / expected
The fold enrichment is also called odd-ratio.
- get_data(category, ontologies, include_negative_enrichment=True, fdr=0.05)[source]
From all input GO term that have been found and stored in enrichment[ONTOLOGY]['result'], we keep those with fdr<0.05. We also exclude UNCLASSIFIED entries. The final dataframe is returned
pe.get_data(""up", MF")
- get_functional_classification(mygenes, taxon)[source]
Mapping information from pantherDB for the lisf of genes
We also store uniprot persistent id
- save_chart("B4052-V1.T1vsT0.complete.xls", fc_threshold=5, padj_threshold=0.05) df = pe.plot_go_terms("down", log=True, compute_levels=False) pe.save_chart(df, "chart.png")[source]
- property taxon