transposonmapper.statistics

transposonmapper.statistics.apply_stats(variable_a_array, variable_b_array, significance_threshold, volcano_df)[source]

Computes the statistical measures for the volcano plot.

Parameters
  • variable_a_array (array) – The values (# of insertions or reads) of the replicates of one library

  • variable_b_array (array) – The values (# of insertions or reads) of the replicates of the other library

  • significance_threshold (float) – Significance threshold. The volcano function passes its own default value, which is 0.01

Returns

A dataframe containing all the info for the volcano plot.

Return type

dataframe

transposonmapper.statistics.dataframe_from_pergenefile(pergenefile, verbose=True)[source]

This function creates a dataframe with the information from a pergene.txt file.

The gene_essentiality column is determined from the genes present in the Cerevisiae_EssentialGenes_List_1.txt and Cerevisiae_EssentialGenes_List_2.txt files. The number of reads per insertion (Nreadsperinsrt) is determined by dividing the read_per_gene column by the tn_per_gene column.

Author: Gregory van Beek

Parameters
  • pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module

  • verbose (bool, optional) – [description], by default True

Returns

Output is a dataframe where each row is a single gene and with the following columns:
  • gene_names

  • gene_essentiality

  • tn_per_gene

  • read_per_gene

  • Nreadsperinsrt

Return type

dataframe
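As a minimal sketch of the two derived columns described above, the following pure-Python snippet flags genes that appear in an essential-gene set and divides reads by insertions. The gene names, the counts, the essential set, and the choice to report 0 for genes without insertions are illustrative assumptions, not the library's implementation.

```python
# Hypothetical per-gene records: (gene name, insertions, reads).
records = [("GENE_A", 10, 200), ("GENE_B", 0, 0), ("GENE_C", 4, 50)]

# Hypothetical essential-gene set standing in for the two list files.
essential = {"GENE_B"}


def build_rows(records, essential):
    """Return one dict per gene with the columns listed above."""
    rows = []
    for name, tn, reads in records:
        rows.append({
            "gene_names": name,
            "gene_essentiality": name in essential,
            "tn_per_gene": tn,
            "read_per_gene": reads,
            # Assumption: report 0 instead of dividing by zero insertions.
            "Nreadsperinsrt": reads / tn if tn else 0,
        })
    return rows


rows = build_rows(records, essential)
```

Each dict mirrors one row of the returned dataframe; a list like this could be passed to pandas.DataFrame if a dataframe is needed.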

transposonmapper.statistics.essential_genes(genenames_list, lines)[source]

Returns a list of essential genes.

Parameters
  • genenames_list (list) – A list with all gene names that were mapped to the reference genome

  • lines (int) – Number of genes in total

Returns

List of essential genes

Return type

list

transposonmapper.statistics.info_from_datasets(datafiles_list_a, datafiles_list_b, variable, normalize)[source]

Reads the information contained in the data files for the volcano plot.

Parameters
  • datafiles_list_a (list of str) – List of the absolute paths of all the replicates from the reference library.

  • datafiles_list_b (list of str) – List of the absolute paths of all the replicates from the experimental library.

  • variable (str) – The quantity on which the volcano plot is based, for example tn_per_gene, read_per_gene or Nreadsperinsrt

  • normalize (bool) – If set to True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_)

Returns

  • variable_a_array (numpy.array)

  • variable_b_array (numpy.array)

  • volcano_df (pandas.core.frame.DataFrame)

  • tnread_gene_a (pandas.core.frame.DataFrame)

  • tnread_gene_b (pandas.core.frame.DataFrame)
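One plausible reading of the normalize option is that every per-gene value is divided by the total of its own dataset. The sketch below shows that reading only; the exact scaling used by the library is not stated here, and the function name and the replicate values are hypothetical.

```python
def normalize_by_total(counts):
    """Divide each per-gene count by the dataset total (assumed scheme)."""
    total = sum(counts)
    if total == 0:
        # Assumption: an empty dataset normalizes to all zeros.
        return [0.0 for _ in counts]
    return [count / total for count in counts]


# Hypothetical per-gene counts for two replicates of one library.
replicate_1 = [10, 30, 60]
replicate_2 = [5, 5, 10]

# Each replicate is normalized independently, using its own total.
normalized = [normalize_by_total(rep) for rep in (replicate_1, replicate_2)]
```

Normalizing per file makes replicates with different sequencing depths comparable before their means are compared.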

transposonmapper.statistics.make_datafile(path_a, filelist_a, path_b, filelist_b)[source]

Assembles the names of the data files to analyze.

Parameters
  • path_a (str) – Path of the files corresponding to the reference library

  • filelist_a (list of str) – List of the filenames of the different replicates from the reference library. Each library needs at least two replicates, so the list must contain at least two files.

  • path_b (str) – Path of the files corresponding to the experimental library

  • filelist_b (list of str) – List of the filenames of the different replicates from the experimental library. Each library needs at least two replicates, so the list must contain at least two files.

Returns

Complete paths of the reference and the experimental libraries

Return type

str
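The path assembly described above can be sketched with the standard library. The helper name, the directory, and the filenames are hypothetical, and raising ValueError for fewer than two replicates is an assumption about how the stated minimum could be enforced.

```python
import os


def join_paths(path, filelist):
    """Prefix every replicate filename with the library directory."""
    if len(filelist) < 2:
        # Assumption: enforce the documented minimum of two replicates.
        raise ValueError("each library needs at least two replicate files")
    return [os.path.join(path, name) for name in filelist]


# Hypothetical directory and filenames for the reference library.
datafiles_a = join_paths("data/wt", ["wt_rep1_pergene.txt", "wt_rep2_pergene.txt"])
```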

transposonmapper.statistics.read_pergene_file(pergenefile)[source]

Reads the content of the pergene file, one of the outputs of Transposonmapper.

Parameters

pergenefile (str) – Absolute path to the pergene.txt file, one of the outputs of the transposonmapper module

Returns

  • list – Gene names list

  • list – Insertion list

  • list – Reads list
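To illustrate the three returned lists, here is a minimal parser sketch. It assumes a tab-separated file with a single header line and three columns (gene name, insertions, reads); that layout, the function name, and the sample content are assumptions made for illustration.

```python
import csv
import io


def parse_pergene(handle):
    """Split a tab-separated per-gene table into three parallel lists."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # Assumption: the first line is a header.
    gene_names, tn_per_gene, read_per_gene = [], [], []
    for row in reader:
        if len(row) < 3:
            continue  # Skip blank or incomplete lines.
        gene_names.append(row[0])
        tn_per_gene.append(int(row[1]))
        read_per_gene.append(int(row[2]))
    return gene_names, tn_per_gene, read_per_gene


# Hypothetical file content, held in memory instead of on disk.
sample = io.StringIO("gene\tinsertions\treads\nGENE_A\t10\t200\nGENE_B\t0\t0\n")
names, insertions, reads = parse_pergene(sample)
```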

transposonmapper.statistics.reads_per_insertion(tnpergene_list, readpergene_list, lines)[source]

Computes the reads per insertion for each gene. If a gene has more than 5 insertions, the value is reads/(insertions-1); otherwise the reads per insertion is set to 0.

Parameters
  • tnpergene_list (list) – A list with all insertions

  • readpergene_list (list) – A list of the reads

  • lines (int) – Number of genes mapped in the reference genome

Returns

A list containing the reads per insertion for each gene.

Return type

list
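The rule stated above can be written out as a short sketch. The function name and the example values are hypothetical, and the lines argument is omitted here because zip already pairs the two lists.

```python
def reads_per_insertion_sketch(tnpergene_list, readpergene_list, min_insertions=5):
    """Apply reads / (insertions - 1) only above the insertion cutoff."""
    result = []
    for insertions, reads in zip(tnpergene_list, readpergene_list):
        if insertions > min_insertions:
            result.append(reads / (insertions - 1))
        else:
            # Genes with 5 or fewer insertions are reported as 0.
            result.append(0)
    return result


# Hypothetical genes with 10, 5 and 6 insertions.
values = reads_per_insertion_sketch([10, 5, 6], [180, 100, 50])
# values == [20.0, 0, 10.0]
```

The gene with exactly 5 insertions yields 0 because the condition is strictly "higher than 5".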

transposonmapper.statistics.volcano(path_a, filelist_a, path_b, filelist_b, variable='read_per_gene', significance_threshold=0.01, normalize=True, trackgene_list=[], figure_title='')[source]

This script creates a volcanoplot to show the significance of fold change between two datasets. It is based on this website:

Code for showing gene name when hovering over datapoint is based on:

The t-test measures the number of standard deviations the measured mean is from the baseline mean, while taking into account that the standard deviation of the mean can change as more data are collected. This creates a volcano plot that shows the fold change between two libraries and the corresponding p-values.

The fold change is determined by the mean of dataset b (experimental set) divided by the mean of dataset a (reference set). The datasets can be of different lengths. The p-value is determined with Student's t-test (scipy.stats.ttest_ind).
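For a single gene, the fold change and the t-statistic described above can be sketched in pure Python. This is not the library's code: the function name and replicate values are hypothetical, the pooled-variance formula corresponds to the equal-variance form of Student's t-test, and the sign convention (a minus b) is an assumption. The p-value is left out; scipy.stats.ttest_ind returns it together with the statistic.

```python
import math
import statistics


def fold_change_and_t(values_a, values_b):
    """Return (log2 fold change of b over a, Student's t statistic)."""
    mean_a = statistics.mean(values_a)
    mean_b = statistics.mean(values_b)
    # Fold change: experimental mean (b) divided by reference mean (a).
    log2_fc = math.log2(mean_b / mean_a)

    n_a, n_b = len(values_a), len(values_b)
    # Pooled sample variance; the two sets may differ in length.
    pooled_var = (
        (n_a - 1) * statistics.variance(values_a)
        + (n_b - 1) * statistics.variance(values_b)
    ) / (n_a + n_b - 2)
    t_stat = (mean_a - mean_b) / math.sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return log2_fc, t_stat


# Hypothetical read counts of one gene in three replicates per library.
log2_fc, t_stat = fold_change_and_t([100, 120, 110], [210, 230, 220])
```

In this example the experimental mean is twice the reference mean, so the log2 fold change is 1.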

Note

The fold change is determined by the ratio between the reference and the experimental dataset. When one of the datasets is 0, this gives false results for the fold change. To prevent this, genes with 0 insertions are set to have 5 insertions, and genes with 0 reads are set to have 25 reads. These values were determined in discussion with the Kornmann lab.
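The zero replacement in this note can be sketched as follows. The replacement values 5 and 25 come from the note; the function name, the lookup table, and the example counts are hypothetical.

```python
# Replacement values from the note: 5 insertions, 25 reads.
PSEUDO_COUNTS = {"tn_per_gene": 5, "read_per_gene": 25}


def replace_zeros(values, variable):
    """Swap zero counts for the replacement value of the given variable."""
    fill = PSEUDO_COUNTS[variable]
    return [fill if value == 0 else value for value in values]


# Hypothetical counts containing zeros.
insertions = replace_zeros([0, 12, 0], "tn_per_gene")
reads = replace_zeros([0, 300], "read_per_gene")
# insertions == [5, 12, 5], reads == [25, 300]
```

With no zeros left in either dataset, the ratio of the means is always defined.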

  • Created on Tue Feb 16 14:06:48 2021

  • @author: gregoryvanbeek

Parameters
  • path_a (str) – Path to the location of the data files for library a

  • filelist_a (list of str) – List of the names of the data files for library a, located in path_a. Each file is a pergene.txt file, which is one of the outputs of the transposonmapper function. The pergene file must be tab-separated, not comma-separated. A comma-separated file can be converted to tab-separated on the command line with: cat oldfile.txt | tr ',' '\t' > newfile.txt

  • path_b (str) – Path to the location of the data files for library b

  • filelist_b (list of str) – List of the names of the data files for library b, located in path_b. Each file is a pergene.txt file, which is one of the outputs of the transposonmapper function. The pergene file must be tab-separated, not comma-separated. A comma-separated file can be converted to tab-separated on the command line with: cat oldfile.txt | tr ',' '\t' > newfile.txt

  • variable (str, optional) – tn_per_gene, read_per_gene or Nreadsperinsrt, by default 'read_per_gene'

  • significance_threshold (float, optional) – Threshold used to decide whether the fold change of a gene is regarded as significant, only for plotting, by default 0.01

  • normalize (bool, optional) – Whether to normalize variable. If set to True, each gene is normalized based on the total count in each dataset (i.e. each file in filelist_), by default True

  • trackgene_list (list, optional) – A list of gene names to highlight in the plot (e.g. ['cdc42', 'nrp1']), by default []

  • figure_title (str, optional) – The title of the figure, used if not empty, by default ""

Returns

  • dataframe

    A dataframe containing:

    • gene_names

    • fold change

    • t statistic

    • p value

    • whether p value is above threshold

  • figure

    • volcanoplot with the log2 fold change between the two libraries and the -log10 p-value.