transposonmapper.statistics
- transposonmapper.statistics.apply_stats(variable_a_array, variable_b_array, significance_threshold, volcano_df)
Computes the statistical measures for the volcano plot.
- Parameters
variable_a_array (array) – The values (number of insertions or reads) of the replicates of one library
variable_b_array (array) – The values (number of insertions or reads) of the replicates of the other library
significance_threshold (float) – Significance threshold. By default the volcano function passes its own default value, 0.01.
- Returns
A dataframe containing all the info for the volcano plot.
- Return type
dataframe
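As a rough illustration of the kind of computation described here, the sketch below runs an independent two-sample t-test per gene with scipy.stats.ttest_ind and flags genes whose p-value falls below the threshold. The array layout (replicates along axis 0, genes along axis 1), the direction of the significance test, and the helper name sketch_stats are assumptions made for illustration; they are not taken from the transposonmapper source.

```python
import numpy as np
from scipy import stats


def sketch_stats(variable_a_array, variable_b_array, significance_threshold=0.01):
    """Hypothetical helper: per-gene t-test between two sets of replicates.

    Both inputs are assumed to have shape (n_replicates, n_genes).
    """
    a = np.asarray(variable_a_array, dtype=float)
    b = np.asarray(variable_b_array, dtype=float)

    # Independent two-sample t-test, computed for every gene (column) at once.
    t_stat, p_value = stats.ttest_ind(a, b, axis=0)

    # Assumption: a gene counts as significant when its p-value is below the threshold.
    significant = p_value < significance_threshold
    return t_stat, p_value, significant


# Made-up counts: 3 replicates per library, 2 genes.
library_a = [[100, 50], [102, 60], [98, 40]]
library_b = [[200, 55], [202, 45], [198, 50]]
t_stat, p_value, significant = sketch_stats(library_a, library_b)
```

In this made-up data the first gene differs strongly between the libraries and the second gene has identical means, so only the first is flagged.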
- transposonmapper.statistics.dataframe_from_pergenefile(pergenefile, verbose=True)
This function creates a dataframe with the information from a pergene.txt file.
The gene_essentiality column is created based on the genes present in the Cerevisiae_EssentialGenes_List_1.txt and Cerevisiae_EssentialGenes_List_2.txt files. The number of reads per insertion (Nreadsperinsrt) is determined by dividing the read_per_gene column by the tn_per_gene column.
Author: Gregory van Beek
- Parameters
pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module
verbose (bool, optional) – By default True
- Returns
- A dataframe where each row is a single gene, with the following columns:
gene_names
gene_essentiality
tn_per_gene
read_per_gene
Nreadsperinsrt
- Return type
dataframe
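The shape of such a dataframe can be sketched with pandas as follows. The gene names, counts, the boolean encoding of gene_essentiality, and the choice to assign 0 to genes without insertions are all assumptions for illustration; the actual function may encode these differently.

```python
import pandas as pd

# Made-up values for three genes (the gene names are only placeholders).
gene_names = ["GENE_A", "GENE_B", "GENE_C"]
tn_per_gene = [10, 4, 0]
read_per_gene = [200, 40, 0]
essential = {"GENE_A"}  # hypothetical set of essential gene names

df = pd.DataFrame(
    {
        "gene_names": gene_names,
        "gene_essentiality": [name in essential for name in gene_names],
        "tn_per_gene": tn_per_gene,
        "read_per_gene": read_per_gene,
    }
)

# Reads per insertion as the plain ratio of the two columns.
# Assumption: genes without insertions get 0 rather than NaN or inf.
ratio = df["read_per_gene"] / df["tn_per_gene"].where(df["tn_per_gene"] > 0)
df["Nreadsperinsrt"] = ratio.fillna(0.0)
```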
- transposonmapper.statistics.essential_genes(genenames_list, lines)
Provides a list of essential genes.
- Parameters
genenames_list (list) – A list with all gene names that were mapped to the reference genome
lines (int) – Number of genes in total
- Returns
List of essential genes
- Return type
list
- transposonmapper.statistics.info_from_datasets(datafiles_list_a, datafiles_list_b, variable, normalize)
Reads the information contained in the datafiles for the volcano plot.
- Parameters
datafiles_list_a (list of str) – List of the absolute paths of all the replicates from the reference library.
datafiles_list_b (list of str) – List of the absolute paths of all the replicates from the experimental library.
variable (str) – The quantity on which the volcano plot is based. For example: tn_per_gene, read_per_gene or Nreadsperinsrt
normalize (bool) – If True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_)
- Returns
variable_a_array (numpy.array)
variable_b_array (numpy.array)
volcano_df (pandas.core.frame.DataFrame)
tnread_gene_a (pandas.core.frame.DataFrame)
tnread_gene_b (pandas.core.frame.DataFrame)
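The normalization option can be pictured with the small NumPy sketch below. It divides each gene's count by the total count of its own dataset, which is one common reading of "normalized based on the total count in each dataset"; the exact scaling used by info_from_datasets is not stated here, so treat the formula and the array layout as assumptions.

```python
import numpy as np

# Made-up counts: rows are datasets (replicates), columns are genes.
counts = np.array(
    [
        [10.0, 30.0, 60.0],  # replicate 1, total 100
        [5.0, 5.0, 40.0],    # replicate 2, total 50
    ]
)

# Divide each gene's count by the total count of its own dataset.
totals = counts.sum(axis=1, keepdims=True)
normalized = counts / totals
```

After this step every row sums to 1, so replicates with different sequencing depth become comparable.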
- transposonmapper.statistics.make_datafile(path_a, filelist_a, path_b, filelist_b)
Assembles the names of the datafiles to analyze.
- Parameters
path_a (str) – Path of the files corresponding to the reference library
filelist_a (list of str) – List of the filenames of the different replicates from the reference library. Each library needs at least two replicates, so the list must contain at least two files.
path_b (str) – Path of the files corresponding to the experimental library
filelist_b (list of str) – List of the filenames of the different replicates from the experimental library. Each library needs at least two replicates, so the list must contain at least two files.
- Returns
Complete paths of the reference and the experimental libraries
- Return type
str
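A minimal sketch of this kind of path assembly is shown below. The helper name sketch_make_datafile, the directory and file names, the ValueError on too few replicates, and the return of two lists are assumptions for illustration, not the library's actual behaviour.

```python
import os


def sketch_make_datafile(path_a, filelist_a, path_b, filelist_b):
    """Hypothetical helper: join each directory with its replicate filenames."""
    for name, filelist in (("a", filelist_a), ("b", filelist_b)):
        # The documentation asks for at least two replicates per library.
        if len(filelist) < 2:
            raise ValueError(f"library {name} needs at least two replicate files")

    datafiles_list_a = [os.path.join(path_a, f) for f in filelist_a]
    datafiles_list_b = [os.path.join(path_b, f) for f in filelist_b]
    return datafiles_list_a, datafiles_list_b


# Placeholder directory and file names.
paths_a, paths_b = sketch_make_datafile(
    "data/wt", ["wt_1_pergene.txt", "wt_2_pergene.txt"],
    "data/mutant", ["mut_1_pergene.txt", "mut_2_pergene.txt"],
)
```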
- transposonmapper.statistics.read_pergene_file(pergenefile)
Reads the content of the pergene file, one of the outputs of transposonmapper.
- Parameters
pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module
- Returns
list – Gene names list
list – Insertion list
list – Reads list
- transposonmapper.statistics.reads_per_insertion(tnpergene_list, readpergene_list, lines)
Computes the reads per insertion for each gene. If the number of insertions is higher than 5, the value is reads/(insertions-1); otherwise the reads per insertion is set to 0.
- Parameters
tnpergene_list (list) – A list with the number of insertions per gene
readpergene_list (list) – A list with the number of reads per gene
lines (int) – Number of genes mapped to in the reference genome
- Returns
A list containing the reads per insertion for each gene.
- Return type
list
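The documented formula can be sketched in plain Python as below. The helper name sketch_reads_per_insertion is hypothetical and, unlike the real function, it does not take the lines argument; the input counts are made up.

```python
def sketch_reads_per_insertion(tnpergene_list, readpergene_list):
    """Hypothetical re-implementation of the documented formula."""
    reads_per_insertion = []
    for insertions, reads in zip(tnpergene_list, readpergene_list):
        if insertions > 5:
            reads_per_insertion.append(reads / (insertions - 1))
        else:
            # Genes with 5 or fewer insertions are assigned 0.
            reads_per_insertion.append(0)
    return reads_per_insertion


# Made-up counts for three genes.
result = sketch_reads_per_insertion([11, 5, 6], [100, 80, 50])
```

The second gene has exactly 5 insertions, which is not higher than 5, so it falls in the zero branch.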
- transposonmapper.statistics.volcano(path_a, filelist_a, path_b, filelist_b, variable='read_per_gene', significance_threshold=0.01, normalize=True, trackgene_list=[], figure_title='')
This script creates a volcanoplot to show the significance of fold change between two datasets. It is based on this website:
- Code for showing gene name when hovering over datapoint is based on:
The t-test measures how many standard deviations the measured mean lies from the baseline mean, while taking into account that the standard deviation of the mean can change as more data are collected. This function creates a volcano plot that shows the fold change between two libraries and the corresponding p-values.
The fold change is the mean of dataset b (experimental set) divided by the mean of dataset a (reference set). The datasets can be of different lengths. The p-value is determined with Student's t-test (scipy.stats.ttest_ind).
Note
The fold change is determined by the ratio between the reference and the experimental dataset. When one of the datasets is 0, this gives false results for the fold change. To prevent this, genes with 0 insertions are set to have 5 insertions, and genes with 0 reads are set to have 25 reads. These values were determined in discussion with the Kornmann lab.
Created on Tue Feb 16 14:06:48 2021
@author: gregoryvanbeek
- Parameters
path_a (str) – paths to location of the datafiles for library a
filelist_a (list of str) – List of the names of the datafiles for library a located in path_a. Each file is a pergene.txt file, which is one of the outputs of the transposonmapper function. The pergene file must be tab separated, not comma separated. A comma-separated file can be converted to a tab-separated one on the command line: cat oldfile.txt | tr ',' '\t' > newfile.txt
path_b (str) – paths to location of the datafiles for library b
filelist_b (list of str) – List of the names of the datafiles for library b located in path_b. Each file is a pergene.txt file, which is one of the outputs of the transposonmapper function. The pergene file must be tab separated, not comma separated. A comma-separated file can be converted to a tab-separated one on the command line: cat oldfile.txt | tr ',' '\t' > newfile.txt
variable (str, optional) – tn_per_gene, read_per_gene or Nreadsperinsrt, by default 'read_per_gene'
significance_threshold (float, optional) – Threshold value above which the fold change is regarded significant, only for plotting, by default 0.01
normalize (bool, optional) – Whether to normalize variable. If set to True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_), by default True
trackgene_list (list, optional) – A list of gene name(s) to highlight in the plot (e.g. ['cdc42', 'nrp1']), by default []
figure_title (str, optional) – The title of the figure if not empty, by default ''
- Returns
dataframe –
A dataframe containing:
gene_names
fold change
t statistic
p value
whether p value is above threshold
figure –
Volcano plot with the log2 fold change between the two libraries and the -log10 p-value.
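The zero replacement from the note and the two plot axes can be sketched with NumPy as follows. All numbers are invented, the variable names are hypothetical, and applying the replacement to per-gene mean read counts (rather than at some other stage) is an assumption about where the real code does it.

```python
import numpy as np

# Made-up mean read counts per gene for the reference (a) and experimental (b) library.
mean_a = np.array([100.0, 0.0, 50.0])
mean_b = np.array([400.0, 50.0, 0.0])

# Genes with 0 reads are set to 25 reads, as described in the note,
# so that neither the division nor the logarithm sees a zero.
READ_FLOOR = 25.0
safe_a = np.where(mean_a == 0, READ_FLOOR, mean_a)
safe_b = np.where(mean_b == 0, READ_FLOOR, mean_b)

# Fold change: experimental over reference, placed on a log2 axis.
log2_fold_change = np.log2(safe_b / safe_a)

# p-values are placed on the y axis as -log10(p); these p-values are invented.
p_values = np.array([0.001, 0.5, 0.01])
neg_log10_p = -np.log10(p_values)
```

On the log2 axis a doubling maps to +1 and a halving to -1, so increases and decreases of the same magnitude sit symmetrically around 0.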