transposonmapper.processing

transposonmapper.processing.binned_list(allcounts_list, bar_width)[source]

A binned list for a histogram of the counts

Parameters
  • allcounts_list (numpy.ndarray) – Output of the counts_genome function

  • bar_width (float) – Number of base pairs per bin. It can be set as a function of the length of the genome, e.g. bar_width=l_genome/1000

Returns

Binned list

Return type

list
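The binning step can be illustrated with a minimal sketch. It assumes that binning means summing the per-position counts in consecutive windows of bar_width base pairs; the helper name `bin_counts` is hypothetical and this is not the package's implementation.

```python
import numpy as np


def bin_counts(allcounts, bar_width):
    """Sum per-position counts in consecutive bins of `bar_width` base pairs.

    Hypothetical helper for illustration; not part of transposonmapper.
    """
    allcounts = np.asarray(allcounts)
    width = int(bar_width)
    if width < 1:
        raise ValueError("bar_width must be at least 1 bp")
    # Start index of every bin; the last bin may be shorter than `width`.
    starts = np.arange(0, len(allcounts), width)
    return [int(allcounts[s:s + width].sum()) for s in starts]


counts = np.array([0, 2, 0, 1, 5, 0, 0, 3, 1, 0])
print(bin_counts(counts, bar_width=4))  # [3, 8, 1]
```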

transposonmapper.processing.build_dataframe(dna_dict, start_chr, end_chr, insrt_in_chrom_list, reads_in_chrom_list, genomicregions_list, chrom)[source]

Main function that builds the dataframe with the characteristics of all genes

Parameters
  • dna_dict (dict) – 1st output of the function intergenic_regions

  • start_chr (int) – 2nd output of the function gene_location

  • end_chr (int) – 3rd output of the function gene_location

  • insrt_in_chrom_list (list) – 1st output of the function read_wig_file

  • reads_in_chrom_list (list) – 2nd output of the function read_wig_file

  • genomicregions_list (list) – All the annotated genomic regions, 2nd output of the intergenic_regions function

  • chrom (str) – Name of the chromosome, as a roman numeral, from which to extract the information.

Returns

  • dna_df2 (dataframe) – Dataframe containing information about the selected chromosome. This includes the following columns:

    • Feature name

    • Standard name of the feature

    • Aliases of feature name (if any)

    • Feature type (e.g. gene, telomere, centromere, etc. If None, this region is not defined)

    • Chromosome

    • Position of feature type in terms of bp relative to chromosome.

    • Length of region in terms of basepairs

    • Number of insertions in region

    • Number of insertions in truncated region where truncated region is the region without the first and last 100bp.

    • Number of reads in region

    • Number of reads in truncated region.

    • Number of reads per insertion (defined by Nreads/Ninsertions)

    • Number of reads per insertion in truncated region (defined by Nreads_truncatedgene/Ninsertions_truncatedgene)

    NOTE: truncated regions are only determined for genes. For the other regions the truncated region values are the same as the non-truncated region values.
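The relation between the full-region and truncated-region columns can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper `region_stats`, the inclusive coordinates, the keys `reads_per_insertion` and `reads_per_insertion_truncated`, and the choice to report 0.0 when a region has no insertions are all hypothetical.

```python
def region_stats(insertions, reads, start, end, is_gene, trim=100):
    """Count insertions and reads in [start, end] and in the truncated region.

    Hypothetical helper; `insertions` and `reads` are parallel lists.
    """
    def count(lo, hi):
        selected = [r for p, r in zip(insertions, reads) if lo <= p <= hi]
        n_ins, n_reads = len(selected), sum(selected)
        # Assumption: report 0.0 when there are no insertions.
        per_insertion = n_reads / n_ins if n_ins else 0.0
        return n_ins, n_reads, per_insertion

    full = count(start, end)
    # Only genes are truncated; other regions reuse the full-region values.
    truncated = count(start + trim, end - trim) if is_gene else full
    return {
        "Ninsertions": full[0],
        "Nreads": full[1],
        "reads_per_insertion": full[2],
        "Ninsertions_truncatedgene": truncated[0],
        "Nreads_truncatedgene": truncated[1],
        "reads_per_insertion_truncated": truncated[2],
    }


stats = region_stats([50, 150, 500, 950], [10, 20, 40, 30], 1, 1000, is_gene=True)
print(stats["Ninsertions"], stats["Ninsertions_truncatedgene"])  # 4 2
```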

transposonmapper.processing.checking_features(feature_orf_dict, chrom, gene_position_dict, verbose)[source]

Checking input values

Parameters
  • feature_orf_dict (dict) – last output of the gene_location function

  • chrom (str) – Name of the chromosome, as a roman numeral, from which to extract the information.

  • gene_position_dict (dict) – output of the read_pergene_file function

  • verbose (bool) – If True, warning messages are allowed

transposonmapper.processing.chromosome_name_bedfile(bed_file)[source]

This function returns some properties of the chromosomes in a bed file. The input can be either of two options:

  • The full path to a bed file, in which case the function opens the file itself.

  • A list of the lines in the bed file. This requires reading the bed file before calling this function and passing all of its lines as a list; the function then does not open the bed file again.

Returns three dictionaries (in this order):

  • The names of the chromosomes as used in the bed file (keys are the roman numerals I to XVI and values are the names used in the bed file).

  • The start line of each chromosome in the bed file (keys are the roman numerals of the chromosome names and values are the start lines).

  • The end line of each chromosome in the bed file (keys are the roman numerals of the chromosome names and values are the end lines).

TODO: change line 60 and 71 to automatically recognize the mitochondrial DNA name.
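How the start and end lines per chromosome could be collected from a list of bed lines is sketched below. The helper `chromosome_line_ranges` is hypothetical. It assumes that the first whitespace-separated field of each data line is the chromosome name, that header lines start with "track", and that line indices are 0-based. Unlike the function documented here, it keys the dictionaries by the names found in the file and leaves out the mapping to roman numerals.

```python
def chromosome_line_ranges(bed_lines):
    """Return (start_line, end_line) dictionaries keyed by chromosome name.

    Hypothetical helper; indices are 0-based positions in `bed_lines`.
    """
    start_line, end_line = {}, {}
    for index, line in enumerate(bed_lines):
        fields = line.split()
        # Skip empty lines and the track header.
        if not fields or fields[0] == "track":
            continue
        name = fields[0]
        start_line.setdefault(name, index)  # first line seen for this chromosome
        end_line[name] = index              # overwritten until the last line
    return start_line, end_line


lines = [
    "track name=example",
    "chrI 10 11 . 120",
    "chrI 55 56 . 140",
    "chrII 7 8 . 100",
]
starts, ends = chromosome_line_ranges(lines)
print(starts)  # {'chrI': 1, 'chrII': 3}
print(ends)    # {'chrI': 2, 'chrII': 3}
```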

transposonmapper.processing.cleanfiles(filepath=None, custom_header=None, split_chromosomes=False)[source]

This script removes transposon insertions in .bed and .wig files that were mapped outside the chromosomes, creates consistent naming for chromosomes and changes the header of files with custom headers.

The code reads a .bed or .wig file and removes any insertions that were mapped outside a chromosome. Mapping of a read outside a chromosome can happen during the alignment and transposon mapping steps, and means that the position of the insertion site of a read is larger than the length of the chromosome it is mapped to.

The function creates a new file with the same name as the input file and the extension _clean.bed or _clean.wig, saved at the same location as the input file. In this _clean file the redundant insertions that were mapped outside the chromosome are removed. The lengths of the chromosomes are determined by the Python function ‘chromosome_position’, which is part of the Python module ‘chromosome_and_gene_positions.py’. This module gets the lengths of the chromosomes from a .gff file downloaded from SGD (https://www.yeastgenome.org/).

Besides removing the reads outside the chromosomes, the function also changes the names of the chromosomes to roman numerals, and a custom header can be provided (optional). Finally, the bed and wig files can be split into separate files for each chromosome. These are placed in a _chromosomesplit folder at the location of the bed or wig file.

@author: gregoryvanbeek. Created on Fri Mar 5 15:39:53 2021.

Parameters
  • filepath (str) – File path of the wig or bed file to analyze

  • custom_header (str) – String header to be included in the output file

  • split_chromosomes (bool) – If True, the output is split into a separate file for each chromosome, placed in a _chromosomesplit folder; otherwise a single file contains the information for all chromosomes.

Returns

A file with the same base name as filepath, in the same location, with the extension _clean.wig or _clean.bed
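The core filtering rule (drop insertions whose position is larger than the chromosome length) can be sketched on already-parsed records. The helper `drop_out_of_range`, the record layout `(chromosome, position, reads)`, the 1-based positions and the toy chromosome lengths are assumptions for illustration; the file reading, renaming and header handling described above are not covered.

```python
def drop_out_of_range(insertions, chr_lengths):
    """Keep only records whose position lies inside their chromosome.

    Hypothetical helper; records on unknown chromosomes are dropped as well.
    """
    kept = []
    for chrom, position, reads in insertions:
        length = chr_lengths.get(chrom)
        if length is not None and 1 <= position <= length:
            kept.append((chrom, position, reads))
    return kept


chr_lengths = {"I": 1000, "II": 2000}  # toy lengths, not real chromosome sizes
records = [("I", 10, 5), ("I", 1500, 2), ("II", 1500, 7), ("II", 2001, 1)]
print(drop_out_of_range(records, chr_lengths))
# [('I', 10, 5), ('II', 1500, 7)]
```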

transposonmapper.processing.counts_genome(variable, bed_file, gff_file)[source]

Counts of reads or transposons per chromosome

Parameters
  • variable (str) – “transposons” or “reads”

  • bed_file (str) – absolute path of the location of the bedfile

  • gff_file (str) – absolute path of the location of the gff file

Returns

An array of the length of the genome with the counts of each variable per location in the genome.

Return type

numpy.ndarray
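A genome-length array of this kind can be built by shifting each chromosome's positions by the summed length of the preceding chromosomes. The sketch below assumes already-parsed records of the form `(chromosome, position, value)` with 1-based positions, where `value` is 1 per insertion for “transposons” or the read count for “reads”. The helper `genome_counts` is hypothetical and does not read bed or gff files.

```python
import numpy as np


def genome_counts(records, chr_lengths):
    """Place per-chromosome values on one array spanning the whole genome.

    Hypothetical helper; `chr_lengths` must list chromosomes in genome order.
    """
    offsets, total = {}, 0
    for chrom, length in chr_lengths.items():
        offsets[chrom] = total  # bp that precede this chromosome
        total += length
    counts = np.zeros(total, dtype=int)
    for chrom, position, value in records:
        counts[offsets[chrom] + position - 1] += value
    return counts


chr_lengths = {"I": 5, "II": 4}  # toy lengths
records = [("I", 2, 10), ("II", 1, 3), ("II", 4, 7)]
print(genome_counts(records, chr_lengths))  # [ 0 10  0  0  0  3  0  0  7]
```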

transposonmapper.processing.dna_features(region, wig_file, pergene_insertions_file, variable='reads', plotting=True, savefigure=False, verbose=True)[source]

This function takes a user-defined genomic region (i.e. a chromosome number, a region or a gene) and creates a dataframe with information about all genomic features in the chromosome (i.e. genes, nc-DNA etc.). This can be used to determine the number of reads outside the genes, for example to normalize the number of reads in the genes.

The output is a dataframe with the major information about all genomic features and, optionally, a bar plot indicating the number of transposons per genomic region. A genomic region is defined here as a gene (annotated as essential or not essential), telomere, centromere, ARS etc. This can be used to identify neutral regions (i.e. genomic regions that, if inhibited, do not influence the fitness of the cells), which in turn can be used to normalize the transposon insertions per gene.

Parameters
  • region (str, int or list) – Region of interest: a chromosome number (either a normal number between 1 and 16 or a roman numeral between I and XVI), a list like [‘V’, 0, 14790] (which creates a bar plot between base pairs 0 and 14790) or a gene name.

  • wig_file (str) – absolute path for the wig file location

  • pergene_insertions_file (str) – absolute path for the _pergene_insertions.txt file location

  • variable (str, optional) – Either “transposons” or “reads”, by default “reads”. Used for the plot if plotting is True

  • plotting (bool, optional) – Whether to produce a bar plot with the reads/insertions per genomic location in the region, by default True

  • savefigure (bool, optional) – Whether to save the plot in the same folder as the data files, by default False

  • verbose (bool, optional) – Determines how much textual feedback is given. When set to False, only warnings will be shown. By default True

Returns

Dataframe containing information about the selected chromosome.

Return type

dataframe

transposonmapper.processing.feature_position(feature_dict, chrom, start_chr, dna_dict, feature_type=None)[source]

Get features for every gene in the chromosome of interest

Parameters
  • feature_dict (dict) – output of sgd_features(sgd_features_file)[i]

  • chrom (str) – Name of the chromosome, as a roman numeral, from which to extract the information.

  • start_chr (int) – Genomic location where the chromosome of interest starts (2nd output of the gene_location function)

  • dna_dict (dict) – first output of the gene_location function

  • feature_type ([type], optional) – [description], by default None

Return type

dict

transposonmapper.processing.gene_location(chrom, gene_position_dict, verbose)[source]

Returns structured information about the genes inside the chromosome of interest

Parameters
  • chrom (str) – Name of the chromosome, as a roman numeral, from which to extract the information.

  • gene_position_dict (dict) – Dictionary with information on the genes inside the chromosome. It is the output of the function read_pergene_file

  • verbose (bool) – Same as in the main function dna_features. If True, warning messages are allowed.

Returns

  • dna_dict (dict) – Dictionary with info about genes encoded in the sgd features file

  • start_chr (int) – Integer indicating the genomic location where the chromosome of interest starts

  • end_chr (int) – Integer indicating the genomic location where the chromosome of interest ends

  • len_chr (int) – Length of the chromosome of interest

  • feature_orf_dict (dict) – Dictionary with info about genes in the chromosome of interest

transposonmapper.processing.input_region(region, verbose)[source]

Defines the region of interest for further processing

Parameters
  • region (str, int or list) – Enter the chromosome as a number between 1 and 16 (or a roman numeral between I and XVI), a list in the form [chromosome number, start_position, end_position] or a valid gene name.

  • verbose (bool) – To allow warning messages.

Returns

  • roi_start (NoneType, int) – Start of the genomic location if region is a gene name, otherwise None

  • roi_end (NoneType, int) – End of the genomic location if region is a gene name, otherwise None

  • region_type (str) – Either “Gene” or “Chromosome”, depending on the region provided

  • chrom (str) – Name of the chromosome containing the gene of interest if a gene name is provided as the region, otherwise the roman numeral of the chromosome of interest.
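Accepting a chromosome either as a number or as a roman numeral can be handled with a small normalization step, sketched below. The helper `to_roman_chromosome` is hypothetical and covers only the chromosome case; resolving gene names or list-style regions, as the documented function does, is left out.

```python
ROMAN_NUMERALS = [
    "I", "II", "III", "IV", "V", "VI", "VII", "VIII",
    "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI",
]


def to_roman_chromosome(region):
    """Return the roman numeral (I to XVI) for a chromosome given as int or str.

    Hypothetical helper; raises ValueError for anything else, e.g. gene names.
    """
    if isinstance(region, int):
        number = region
    else:
        text = str(region).strip().upper()
        if text in ROMAN_NUMERALS:
            return text
        if not text.isdigit():
            raise ValueError(f"not a chromosome number: {region!r}")
        number = int(text)
    if not 1 <= number <= 16:
        raise ValueError(f"chromosome number out of range: {number}")
    return ROMAN_NUMERALS[number - 1]


print(to_roman_chromosome(4))      # IV
print(to_roman_chromosome("xvi"))  # XVI
print(to_roman_chromosome("12"))   # XII
```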

transposonmapper.processing.intergenic_regions(chrom, start_chr, dna_dict)[source]

Getting intergenic regions from chromosome of interest

Parameters
  • chrom (str) – Name of the chromosome, as a roman numeral, from which to extract the information.

  • start_chr (int) – 2nd output of the gene_location function

  • dna_dict (dict) – 1st output of the gene_location function

Returns

  • dna_dict_new (dict)

  • genomicregions_list (list)
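The general idea of finding the regions between annotated features can be sketched as a gap search over sorted intervals. This is only an illustration of the concept: the helper `gaps_between`, the 1-based inclusive `(start, end)` tuples and the handling of overlapping features are assumptions, and the documented function works on dna_dict rather than on a plain list.

```python
def gaps_between(features, chrom_length):
    """Return the (start, end) intervals not covered by any feature.

    Hypothetical helper; coordinates are 1-based and inclusive.
    """
    gaps = []
    cursor = 1  # first position not yet covered by a feature
    for start, end in sorted(features):
        if start > cursor:
            gaps.append((cursor, start - 1))
        cursor = max(cursor, end + 1)  # overlapping features are merged
    if cursor <= chrom_length:
        gaps.append((cursor, chrom_length))
    return gaps


print(gaps_between([(10, 20), (15, 30), (50, 60)], chrom_length=100))
# [(1, 9), (31, 49), (61, 100)]
```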

transposonmapper.processing.length_genome(chr_length_dict)[source]

Output the length of the genome in bp

Parameters

chr_length_dict (dict) – A dictionary describing the length of each chromosome.

Returns

The length of the genome

Return type

int
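A minimal sketch of this calculation, together with the bar_width=l_genome/1000 convention mentioned for the histogram functions, is given below. The helper `genome_length` is hypothetical and the chromosome lengths are toy values.

```python
def genome_length(chr_length_dict):
    """Total genome length in bp: the sum of all chromosome lengths."""
    return sum(chr_length_dict.values())


lengths = {"I": 230_000, "II": 813_000, "III": 317_000}  # toy values
l_genome = genome_length(lengths)
bar_width = l_genome / 1000  # bin width as a function of the genome length
print(l_genome, bar_width)  # 1360000 1360.0
```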

transposonmapper.processing.list_known_essentials(input_files=None, headerlines=3, verbose=True)[source]

Get all known essential genes from two different files and combine them in one list.

Input is a list of paths where files with the known essential genes can be found. A default list is implemented using two files present in the same folder as this file. The files are expected to contain genes in a single column and nothing else. The number of header lines can be set and is 3 by default. The output is a list containing all the genes present in all files given in the input.

Note when using the default files: the length of the output list exceeds the number of known essential genes, as the list sometimes contains both the standard name and the systematic name of a gene.

Parameters
  • input_files (list of str, optional) – File paths of the files with known essential genes, by default None (the default files are used)

  • headerlines (int, optional) – Number of header lines to skip in each file, by default 3

  • verbose (bool, optional) – To show explanations if True, by default True
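Reading several single-column files and combining them into one list can be sketched as follows. The helper `combine_gene_lists`, the file names and the file contents are hypothetical and serve only as illustration. As in the description above, duplicates are kept.

```python
import tempfile
from pathlib import Path


def combine_gene_lists(paths, headerlines=3):
    """Read single-column gene files and return all entries in one list.

    Hypothetical helper; the first `headerlines` lines of each file are skipped.
    """
    genes = []
    for path in paths:
        lines = Path(path).read_text().splitlines()
        for line in lines[headerlines:]:
            name = line.strip()
            if name:  # ignore empty lines
                genes.append(name)
    return genes


with tempfile.TemporaryDirectory() as folder:
    first = Path(folder) / "essentials_a.txt"   # hypothetical file names
    second = Path(folder) / "essentials_b.txt"
    first.write_text("header\nheader\nheader\nYAL001C\nYAL003W\n")
    second.write_text("header\nheader\nheader\nTFC3\nYAL003W\n")
    combined = combine_gene_lists([first, second])

print(combined)  # ['YAL001C', 'YAL003W', 'TFC3', 'YAL003W']
```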

transposonmapper.processing.middle_chrom_pos(chr_length_dict)[source]

Defines the middle point of each chromosome

Parameters

chr_length_dict (dict) – A dictionary describing the length of each chromosome.

Returns

A list describing for each chromosome the middle point.

Return type

list
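One possible reading of "middle point" is the genome-wide coordinate of the centre of each chromosome, which is convenient for placing chromosome labels on a genome-wide plot. The sketch below follows that assumption; the helper `middle_positions` is hypothetical and the function documented here may define the middle point differently.

```python
def middle_positions(chr_length_dict):
    """Genome-wide coordinate of the centre of each chromosome.

    Hypothetical helper; the dictionary must list chromosomes in genome order.
    """
    middles = []
    offset = 0  # summed length of all preceding chromosomes
    for length in chr_length_dict.values():
        middles.append(offset + length // 2)
        offset += length
    return middles


print(middle_positions({"I": 100, "II": 50, "III": 200}))  # [50, 125, 250]
```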

transposonmapper.processing.profile_genome(bed_file=None, variable='transposons', bar_width=None, savefig=False, showfig=False)[source]

This function creates a bar plot along the entire genome. The height of each bar represents the number of transposons or reads at the genomic position indicated on the x-axis.

The bar_width determines how many base pairs are put in one bin. Few base pairs per bin may make the function slow, whereas too many base pairs per bin may obscure areas with few transposons.

@author: gregoryvanbeek. Created on Thu Mar 18 13:05:39 2021.

Parameters
  • bed_file (str, optional) – The file path to the location of the bed file in your filesystem, by default None

  • variable (str, optional) – The variable to plot throughout the genome, by default “transposons”

  • bar_width (int, optional) – The width of the histogram bins, by default None, in which case the length of the genome divided by 1000 is used internally

  • savefig (bool, optional) – Save the figure if True, by default False

  • showfig (bool, optional) – Show the figure if True, by default False

Returns

  • list – All insertion sites

  • list – Binned insertion sites according to the bar width

transposonmapper.processing.read_pergene_file(pergene_insertions_file, chrom)[source]

Reads the per-gene file, which holds the information on where each gene starts and ends in the genome.

Parameters
  • pergene_insertions_file (str) – absolute path of the per gene file location

  • chrom (str) – Name of the chromosome, as a roman numeral, for which to extract the information

Returns

gene_position_dict – A dictionary describing the chromosome, start, and end location of every gene in the chromosome of interest.

Return type

dict

transposonmapper.processing.read_wig_file(wig_file, chrom)[source]

Extracts the information in the wig file related to the chromosome of interest

Parameters
  • wig_file (str) – absolute path of the wig file location

  • chrom (str) – Name of the chromosome, as a roman numeral, for which to extract the information from the wig file

Returns

  • insrt_in_chrom_list (list) – Genomic locations of transposon insertions in the given chromosome.

  • reads_in_chrom_list (list) – How many reads are in each of the genomic locations of the insertions.
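Extracting one chromosome from wig data can be sketched on a list of lines in the variableStep wiggle layout, where a `variableStep chrom=...` line is followed by `position value` lines. The helper `parse_wig_lines` is hypothetical, and the chromosome names in the example (chrI, chrII) are assumptions about how the file labels chromosomes.

```python
def parse_wig_lines(wig_lines, chrom_name):
    """Return (positions, reads) for one chromosome from variableStep wig lines.

    Hypothetical helper; `chrom_name` must match the name used in the file.
    """
    positions, reads = [], []
    in_chrom = False
    for line in wig_lines:
        line = line.strip()
        if not line or line.startswith("track"):
            continue
        if line.startswith("variableStep"):
            # Declaration line, e.g. "variableStep chrom=chrI"
            fields = dict(f.split("=", 1) for f in line.split()[1:] if "=" in f)
            in_chrom = fields.get("chrom") == chrom_name
            continue
        if in_chrom:
            position, value = line.split()[:2]
            positions.append(int(position))
            reads.append(int(float(value)))
    return positions, reads


wig = [
    "track type=wiggle_0 name=example",
    "variableStep chrom=chrI",
    "12 3",
    "40 1",
    "variableStep chrom=chrII",
    "5 9",
]
print(parse_wig_lines(wig, "chrI"))  # ([12, 40], [3, 1])
```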

transposonmapper.processing.sgd_features(filepath=None)[source]

This function reads the file SGD_features.tab and creates a dictionary with useful information for processing

Parameters

filepath (str, optional) – file path of the SGD_features.tab file, by default None

Returns

A dictionary with the following info:

key:

  1. Feature name

value:
  1. feature type (l[1])

  2. feature qualifier (Verified or Dubious) (l[2])

  3. Standard name (l[4])

  4. Aliases (separated by ‘|’) (l[5])

  5. Parent feature name (typically ‘chromosome …’) (l[6])

  6. Chromosome (l[8])

  7. start coordinate (starting at 0 for each chromosome) (l[9])

  8. end coordinate (starting at 0 for each chromosome) (l[10])

This function reads the SGD_features.txt file found at http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/

Return type

dict
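The column selection listed above can be sketched on tab-separated lines. The indices l[1], l[2], l[4], l[5], l[6], l[8], l[9] and l[10] are taken from the list above. Using l[3] as the feature name (the dictionary key) is an assumption, the helper `parse_sgd_features` is hypothetical, and the example row is made up.

```python
def parse_sgd_features(lines):
    """Build {feature name: [selected columns]} from tab-separated lines.

    Hypothetical helper; values are kept as the strings found in the file.
    """
    features = {}
    for line in lines:
        l = line.rstrip("\n").split("\t")
        # Skip short rows and rows without a feature name (assumed column 3).
        if len(l) < 11 or not l[3]:
            continue
        features[l[3]] = [l[1], l[2], l[4], l[5], l[6], l[8], l[9], l[10]]
    return features


# Made-up row for illustration only.
row = "\t".join([
    "ID0001", "ORF", "Verified", "YFG001W", "YFG1", "ALIAS1|ALIAS2",
    "chromosome 1", "", "1", "100", "400",
])
print(parse_sgd_features([row]))
# {'YFG001W': ['ORF', 'Verified', 'YFG1', 'ALIAS1|ALIAS2', 'chromosome 1', '1', '100', '400']}
```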

transposonmapper.processing.summed_chr(chr_length_dict)[source]

Create a dictionary where each value is the cumulative sum of all bp in the chromosomes

Parameters

chr_length_dict (dict) – A dictionary describing the length of each chromosome.

Returns

A dictionary where each value corresponds to the cumulative sum of the lengths of the previous chromosomes.

Return type

dict
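A minimal sketch of such a cumulative dictionary is shown below. It assumes that the value for a chromosome is the summed length of all chromosomes before it, so the first chromosome maps to 0. The helper `cumulative_offsets` is hypothetical and the lengths are toy values.

```python
def cumulative_offsets(chr_length_dict):
    """Map each chromosome to the summed length of the chromosomes before it.

    Hypothetical helper; the dictionary must list chromosomes in genome order.
    """
    offsets = {}
    total = 0
    for chrom, length in chr_length_dict.items():
        offsets[chrom] = total
        total += length
    return offsets


print(cumulative_offsets({"I": 100, "II": 50, "III": 200}))
# {'I': 0, 'II': 100, 'III': 150}
```

Adding such an offset to a position within a chromosome gives its position on a genome-wide axis.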