transposonmapper.statistics

transposonmapper.statistics.apply_stats(variable_a_array, variable_b_array, significance_threshold, volcano_df)

Computes the statistical measures for the volcano plot.

Parameters

variable_a_array (array) – The values (# of insertions or reads) of the replicates of one library
variable_b_array (array) – The values (# of insertions or reads) of the replicates of the other library
significance_threshold (float) – Significance threshold. When called from the volcano function, this takes the value passed there, which defaults to 0.01

Returns

A dataframe containing all the info for the volcano plot.

Return type

dataframe

transposonmapper.statistics.dataframe_from_pergenefile(pergenefile, verbose=True)

This function creates a dataframe with the information from a pergene.txt file.

The gene_essentiality column is created based on the genes present in the Cerevisiae_EssentialGenes_List_1.txt and Cerevisiae_EssentialGenes_List_2.txt files. The number of reads per insertion (Nreadsperinsrt) is determined by dividing the read_per_gene column by the tn_per_gene column.

Author: Gregory van Beek

Parameters

pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module
verbose (bool, optional) – [description], by default True

Returns

Output is a dataframe where each row is a single gene and with the following columns:

gene_names
gene_essentiality
tn_per_gene
read_per_gene
Nreadsperinsrt

Return type

dataframe

transposonmapper.statistics.essential_genes(genenames_list, lines)

Provides a list of essential genes.

Parameters

genenames_list (list) – A list with all gene names that were mapped to the reference genome
lines (int) – Number of genes in total

Returns

List of essential genes

Return type

list

transposonmapper.statistics.info_from_datasets(datafiles_list_a, datafiles_list_b, variable, normalize)

Reads the information contained in the datafiles for the volcano plot.

Parameters

datafiles_list_a (list of str) – List of the absolute paths of all the replicates from the reference library.
datafiles_list_b (list of str) – List of the absolute paths of all the replicates from the experimental library.
variable (str) – The quantity on which the volcano plot is based. For example: tn_per_gene, read_per_gene or Nreadsperinsrt
normalize (bool) – If set to True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_)

Returns

variable_a_array (numpy.array)
variable_b_array (numpy.array)
volcano_df (pandas.core.frame.DataFrame)
tnread_gene_a (pandas.core.frame.DataFrame)
tnread_gene_b (pandas.core.frame.DataFrame)
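
As a rough illustration of the normalize option, the sketch below divides each gene's count by the total count of its dataset. The helper name normalize_counts and the sample numbers are hypothetical, and the exact scaling used by the library may differ.

```python
def normalize_counts(counts):
    """Divide every per-gene count by the dataset total (assumed normalization)."""
    total = sum(counts)
    if total == 0:
        # Avoid dividing by zero for a dataset without any counts.
        return [0.0 for _ in counts]
    return [c / total for c in counts]


# Hypothetical read counts for three genes in one replicate.
replicate = [10, 30, 60]
normalized = normalize_counts(replicate)
```

After this step the values of one replicate sum to 1, so replicates with different sequencing depths become comparable.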

transposonmapper.statistics.make_datafile(path_a, filelist_a, path_b, filelist_b)

Assembles the names of the datafiles to analyze.

Parameters

path_a (str) – Path of the files corresponding to the reference library
filelist_a (list of str) – List of the filenames of the different replicates from the reference library. Each library needs at least two replicates, so the list must contain at least two files.
path_b (str) – Path of the files corresponding to the experimental library
filelist_b (list of str) – List of the filenames of the different replicates from the experimental library. Each library needs at least two replicates, so the list must contain at least two files.

Returns

Complete paths of the reference and the experimental libraries

Return type

str
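
A minimal sketch of how a directory and a list of file names can be combined into complete paths, assuming a plain os.path.join per file. The helper name join_paths, the directories and the file names are hypothetical and only illustrate the idea.

```python
import os


def join_paths(path, filelist):
    """Return the full path of every replicate file in one library."""
    return [os.path.join(path, name) for name in filelist]


# Hypothetical directories and file names, two replicates per library.
datafiles_a = join_paths("data/wt", ["wt_1_pergene.txt", "wt_2_pergene.txt"])
datafiles_b = join_paths("data/mutant", ["mut_1_pergene.txt", "mut_2_pergene.txt"])
```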

transposonmapper.statistics.read_pergene_file(pergenefile)

Reads the content of the pergene file, one of the outputs of Transposonmapper.

Parameters

pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module

Returns

list – Gene names list
list – Insertion list
list – Reads list
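
The sketch below shows one way to split a tab-separated per-gene table into the three lists described above. It assumes a single header line followed by rows of gene name, insertion count and read count; that layout, the helper name parse_pergene and the sample content are assumptions, not the documented file format.

```python
import csv
import io


def parse_pergene(handle, has_header=True):
    """Split a tab-separated per-gene table into three parallel lists."""
    reader = csv.reader(handle, delimiter="\t")
    if has_header:
        next(reader, None)  # skip the assumed header line
    gene_names, insertions, reads = [], [], []
    for row in reader:
        if len(row) < 3:
            continue  # ignore blank or incomplete lines
        gene_names.append(row[0])
        insertions.append(int(row[1]))
        reads.append(int(row[2]))
    return gene_names, insertions, reads


# Hypothetical file content; a real file would be opened with open(path).
sample = "gene\tinsertions\treads\nYAL001C\t12\t340\nYAL002W\t0\t0\n"
genes, tn_per_gene, read_per_gene = parse_pergene(io.StringIO(sample))
```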

transposonmapper.statistics.reads_per_insertion(tnpergene_list, readpergene_list, lines)

Computes the reads per insertion using the formula reads/(insertions-1) if the number of insertions is higher than 5; otherwise the reads per insertion is set to 0.

Parameters

tnpergene_list (list) – A list with all insertions
readpergene_list (list) – A list of the reads
lines (int) – Number of genes mapped to in the reference genome

Returns

A list containing all the reads per insertions per gene.

Return type

list
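
The rule stated above can be sketched in a few lines of plain Python. The function name reads_per_insertion_sketch and the sample counts are hypothetical; the sketch only mirrors the documented formula and omits the lines argument.

```python
def reads_per_insertion_sketch(tnpergene_list, readpergene_list):
    """Apply reads / (insertions - 1) for genes with more than 5 insertions."""
    result = []
    for insertions, reads in zip(tnpergene_list, readpergene_list):
        if insertions > 5:
            result.append(reads / (insertions - 1))
        else:
            # Genes with 5 or fewer insertions get 0 reads per insertion.
            result.append(0)
    return result


# Hypothetical counts for three genes.
values = reads_per_insertion_sketch([11, 5, 0], [200, 90, 0])
```

Here the first gene gives 200 / (11 - 1) = 20.0, while the other two genes fall at or below the cutoff of 5 insertions and are set to 0.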

transposonmapper.statistics.volcano(path_a, filelist_a, path_b, filelist_b, variable='read_per_gene', significance_threshold=0.01, normalize=True, trackgene_list=[], figure_title='')

Creates a volcano plot to show the significance of the fold change between two datasets. It is based on these websites:

https://towardsdatascience.com/inferential-statistics-series-t-test-using-numpy-2718f8f9bf2f

https://www.statisticshowto.com/independent-samples-t-test/

Code for showing gene name when hovering over datapoint is based on:

https://stackoverflow.com/questions/7908636/possible-to-make-labels-appear-when-hovering-over-a-point-in-matplotlib

The t-test measures the number of standard deviations the measured mean is from the baseline mean, while taking into account that the standard deviation of the mean can change as more data is collected. The function creates a volcano plot that shows the fold change between two libraries and the corresponding p-values.

The fold change is determined by the mean of dataset b (experimental set) divided by the mean of dataset a (reference set). The datasets can be of different lengths. The p-value is determined with the independent-samples Student's t-test (scipy.stats.ttest_ind).
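
To make the two quantities concrete, the sketch below computes the log2 fold change and the equal-variance (pooled) t statistic for a single gene using only the standard library. It is an illustration under assumptions, not the library's implementation: the helper name fold_change_and_t and the replicate values are hypothetical, and the p-value is left out because it requires the t distribution, which scipy.stats.ttest_ind provides.

```python
import math
import statistics


def fold_change_and_t(reference, experimental):
    """Return (log2 fold change, Student's t statistic) for one gene."""
    mean_a = statistics.mean(reference)
    mean_b = statistics.mean(experimental)
    log2_fc = math.log2(mean_b / mean_a)  # experimental over reference

    n_a, n_b = len(reference), len(experimental)
    # Pooled variance, i.e. the equal-variance form of the two-sample t-test.
    pooled_var = (
        (n_a - 1) * statistics.variance(reference)
        + (n_b - 1) * statistics.variance(experimental)
    ) / (n_a + n_b - 2)
    # Sign convention chosen here: positive when the experimental mean is larger.
    t_stat = (mean_b - mean_a) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return log2_fc, t_stat


# Hypothetical counts for one gene: two reference and three experimental replicates.
log2_fc, t_stat = fold_change_and_t([100, 120], [400, 440, 480])
```

The two lists have different lengths on purpose, since the replicate counts of the two libraries do not need to match.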

Note

The fold change is determined by the ratio between the experimental and the reference dataset. When one of the datasets is 0, this gives false results for the fold change. To prevent this, the genes with 0 insertions are set to have 5 insertions, and the genes with 0 reads are set to have 25 reads. These values were determined in discussion with the Kornmann lab.
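
A minimal sketch of this zero replacement, assuming it is applied element by element to the per-gene counts. The helper name replace_zeros and the sample counts are hypothetical.

```python
def replace_zeros(values, replacement):
    """Swap zero counts for a small pseudo-count so ratios stay finite."""
    return [replacement if v == 0 else v for v in values]


# Pseudo-counts quoted in the note: 5 insertions and 25 reads.
tn_per_gene = replace_zeros([0, 12, 7], replacement=5)
read_per_gene = replace_zeros([0, 340, 0], replacement=25)
```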

Created on Tue Feb 16 14:06:48 2021
@author: gregoryvanbeek

Parameters

path_a (str) – paths to location of the datafiles for library a
filelist_a (list of str) – List of the names of the datafiles for library a located in path_a. The type of file here is the pergene.txt file, which is one of the outputs from the transposonmapper function. The pergene file must be TAB separated and NOT COMMA separated. If it is comma separated, it can be converted to tab separated on the command line with: cat oldfile.txt | tr '[,]' '[\t]' > newfile.txt
path_b (str) – paths to location of the datafiles for library b
filelist_b (list of str) – List of the names of the datafiles for library b located in path_b. The type of file here is the pergene.txt file, which is one of the outputs from the transposonmapper function. The pergene file must be TAB separated and NOT COMMA separated. If it is comma separated, it can be converted to tab separated on the command line with: cat oldfile.txt | tr '[,]' '[\t]' > newfile.txt
variable (str, optional) – tn_per_gene, read_per_gene or Nreadsperinsrt, by default 'read_per_gene'
significance_threshold (float, optional) – Threshold value above which the fold change is regarded significant, only for plotting, by default 0.01
normalize (bool, optional) – Whether to normalize variable. If set to True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_), by default True
trackgene_list (list, optional) – Enter a list of gene name(s) which will be highlighted in the plot (e.g. ['cdc42', 'nrp1']), by default []
figure_title (str, optional) – The title of the figure if not empty, by default ""

Returns

dataframe –

A dataframe containing:
- gene_names
- fold change
- t statistic
- p value
- whether p value is above threshold
figure –
- volcanoplot with the log2 fold change between the two libraries and the -log10 p-value.
