transposonmapper.processing

transposonmapper.processing.binned_list(allcounts_list, bar_width)

A binned list for a histogram of the counts

Parameters

allcounts_list (numpy.ndarray) – Output of the counts_genome function
bar_width (float) – It could be a function of the length of the genome e.g. bar_width=l_genome/1000

Returns

Binned list

Return type

list
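
As a rough illustration of the binning step, the sketch below sums per-position counts into consecutive bins of bar_width positions. The helper name bin_counts and the choice to sum (rather than average) the counts in a bin are assumptions made for this example, not the confirmed behaviour of binned_list.

```python
def bin_counts(counts, bar_width):
    """Sum per-position counts into consecutive bins (hypothetical helper)."""
    # bar_width may be a float such as l_genome / 1000, so round it down to
    # a whole number of positions and never let it drop below 1.
    width = max(1, int(bar_width))
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


# Toy per-position counts; the last bin is shorter than the others.
counts = [0, 2, 0, 1, 5, 0, 0, 3, 1, 0]
binned = bin_counts(counts, bar_width=4)
print(binned)
```

The final bin simply holds whatever positions remain, so its width can be smaller than bar_width.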

transposonmapper.processing.build_dataframe(dna_dict, start_chr, end_chr, insrt_in_chrom_list, reads_in_chrom_list, genomicregions_list, chrom)

Main function that builds the dataframe containing the characteristics of all genes

Parameters

dna_dict (dict) – 1st output of the function intergenic_regions
start_chr (int) – 2nd output of the function gene_location
end_chr (int) – 3rd output of the function gene_location
insrt_in_chrom_list (list) – 1st output of the function read_wig_file
reads_in_chrom_list (list) – 2nd output of the function read_wig_file
genomicregions_list (list) – All the annotated genomic regions, 2nd output of the intergenic_regions function
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.

Returns

dna_df2 (dataframe) – Dataframe containing information about the selected chromosome. This includes the following columns:
- Feature name
- Standard name of the feature
- Aliases of feature name (if any)
- Feature type (e.g. gene, telomere, centromere, etc. If None, this region is not defined)
- Chromosome
- Position of feature type in terms of bp relative to chromosome.
- Length of region in terms of basepairs
- Number of insertions in region
- Number of insertions in truncated region where truncated region is the region without the first and last 100bp.
- Number of reads in region
- Number of reads in truncated region.
- Number of reads per insertion (defined by Nreads/Ninsertions)
- Number of reads per insertion in truncated region (defined by Nreads_truncatedgene/Ninsertions_truncatedgene)
NOTE: truncated regions are only determined for genes. For the other regions the truncated region values are the same as the non-truncated region values.
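
The truncated-region columns can be pictured with the sketch below, which counts insertions and reads in a region and again after dropping the first and last 100 bp. The helper region_stats, the inclusive coordinates, and the value 0.0 used when a region has no insertions are assumptions for illustration only.

```python
def region_stats(insertions, reads, start, end, trim=100):
    """Return (Ninsertions, Nreads, Nreads/Ninsertions) for the full and
    the truncated region (hypothetical helper, inclusive coordinates)."""
    full = [(p, r) for p, r in zip(insertions, reads) if start <= p <= end]
    trunc = [(p, r) for p, r in full if start + trim <= p <= end - trim]

    def summarize(pairs):
        n_ins = len(pairs)
        n_reads = sum(r for _, r in pairs)
        # Reads per insertion is undefined without insertions;
        # 0.0 is an assumed placeholder, not the library's choice.
        return n_ins, n_reads, (n_reads / n_ins if n_ins else 0.0)

    return summarize(full), summarize(trunc)


# Toy data: insertion positions and the read count at each position.
insertions = [1010, 1150, 1500, 1990]
reads = [4, 10, 6, 2]
full, truncated = region_stats(insertions, reads, start=1000, end=2000)
print(full, truncated)
```

Here the insertions at 1010 and 1990 fall in the trimmed ends, so they count toward the full region but not the truncated one.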

transposonmapper.processing.checking_features(feature_orf_dict, chrom, gene_position_dict, verbose)

Checking input values

Parameters

feature_orf_dict (dict) – last output of the gene_location function
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
gene_position_dict (dict) – output of the read_pergene_file function
verbose (bool) – If True it allows for warning messages

transposonmapper.processing.chromosome_name_bedfile(bed_file)

This function returns some properties of the chromosomes in a bed file. Input can be either of two options:

The full path to a bed file, after which the program opens the bed file, or a list of the lines in the bed file. The latter requires reading the bed file before calling this function and passing all lines in the bed file as a list. The function then does not open the bed file again.

Returns three dictionaries (in this order):
- The names of the chromosomes as used in the bed file (keys are the roman numerals I to XVI and the values are the names used in the bed file).
- The start line in the bed file of each chromosome (keys are the roman numerals of the chromosome names and the values are the start lines in the bed file of the chromosome).
- The end line in the bed file of each chromosome (keys are the roman numerals of the chromosome names and the values are the end lines in the bed file of the chromosome).

TODO: change lines 60 and 71 to automatically recognize the mitochondrial DNA name.
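
The start-line and end-line dictionaries can be sketched as follows for the case where the bed file is passed as a list of lines. This is a simplified illustration: the helper chromosome_line_ranges is hypothetical, it keys the dictionaries by the chromosome name found in the file rather than by roman numeral, and it assumes the name is the first whitespace-separated column and that header lines start with "track".

```python
def chromosome_line_ranges(bed_lines):
    """Map each chromosome name to its first and last line index
    (hypothetical helper, zero-based line indices)."""
    start_line, end_line = {}, {}
    for idx, line in enumerate(bed_lines):
        if not line.strip() or line.startswith("track"):
            continue  # skip blank lines and the assumed header line
        name = line.split()[0]
        start_line.setdefault(name, idx)  # keep the first occurrence only
        end_line[name] = idx              # overwritten until the last one
    return start_line, end_line


# Placeholder bed content, not taken from a real dataset.
bed_lines = [
    "track name=example",
    "chrI 10 11 . 120",
    "chrI 55 56 . 300",
    "chrII 7 8 . 100",
]
starts, ends = chromosome_line_ranges(bed_lines)
print(starts, ends)
```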

transposonmapper.processing.cleanfiles(filepath=None, custom_header=None, split_chromosomes=False)

This function removes transposon insertions that were mapped outside the chromosomes from .bed and .wig files, creates consistent naming for chromosomes, and replaces the file header with a custom header if one is given.

A read can be mapped outside a chromosome during the alignment and transposon mapping steps, meaning that the position of its insertion site is larger than the length of the chromosome it is mapped to. This function reads a .bed or .wig file and writes a new file with the same name as the input file and the extension _clean.bed or _clean.wig, saved at the same location as the input file. In this _clean file the insertions that were mapped outside the chromosome are removed.

The lengths of the chromosomes are determined by the Python function ‘chromosome_position’, which is part of the Python module ‘chromosome_and_gene_positions.py’. This module gets the lengths of the chromosomes from a .gff file downloaded from SGD (https://www.yeastgenome.org/).

Besides removing the reads outside the chromosomes, the function also changes the names of the chromosomes to roman numerals and optionally inserts a custom header. Finally, the bed and wig files can be split into separate files for each chromosome. These are placed in a _chromosomesplit folder at the location of the bed or wig file.

@author: gregoryvanbeek
Created on Fri Mar 5 15:39:53 2021

Parameters

filepath (str) – File path of the wig or bed file to analyze
custom_header (str) – String header to be included in the output file
split_chromosomes (bool) – If True, the output is additionally split into separate files for each chromosome (placed in the _chromosomesplit folder); otherwise a single file contains the information for all chromosomes.

Returns

A file with the same basename as filepath, saved in the same location, with the extension _clean.wig or _clean.bed.
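
The core filtering rule, dropping insertions whose position exceeds the chromosome length, can be sketched as below. The helper remove_out_of_range, the tuple layout, the 1-based positions, and the decision to drop insertions on unknown chromosomes are all assumptions for this example; the chromosome lengths are toy values, not taken from the SGD .gff file.

```python
def remove_out_of_range(insertions, chr_lengths):
    """Keep only insertions that lie within their chromosome
    (hypothetical helper; positions assumed to be 1-based)."""
    kept = []
    for chrom, position, reads in insertions:
        length = chr_lengths.get(chrom)
        # Assumption: insertions on chromosomes without a known length
        # are dropped as well.
        if length is not None and 1 <= position <= length:
            kept.append((chrom, position, reads))
    return kept


chr_lengths = {"I": 1000, "II": 2000}  # toy lengths
insertions = [("I", 50, 3), ("I", 1001, 7), ("II", 2000, 1)]
cleaned = remove_out_of_range(insertions, chr_lengths)
print(cleaned)
```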

transposonmapper.processing.counts_genome(variable, bed_file, gff_file)

Counts of reads or transposons per location in the genome

Parameters

variable (str) – “transposons” or “reads”
bed_file (str) – absolute path of the location of the bedfile
gff_file (str) – absolute path of the location of the gff file

Returns

An array of the length of the genome with the counts of each variable per location in the genome.

Return type

numpy.ndarray
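
One way to picture the returned array is as the chromosomes laid end to end, with each count written at its genome-wide position. The sketch below uses a plain Python list instead of a numpy.ndarray and a hypothetical helper genome_counts; the 1-based positions and the input layout are assumptions. For variable="transposons" the value per insertion would presumably be 1, and for "reads" the read count.

```python
def genome_counts(per_chrom, chr_lengths, order):
    """Build a genome-length list of counts (hypothetical helper).

    per_chrom maps a chromosome name to (position, value) pairs with
    1-based positions relative to that chromosome.
    """
    counts = [0] * sum(chr_lengths[c] for c in order)
    offset = 0  # genome-wide index where the current chromosome starts
    for chrom in order:
        for position, value in per_chrom.get(chrom, []):
            counts[offset + position - 1] += value
        offset += chr_lengths[chrom]
    return counts


chr_lengths = {"I": 5, "II": 4}  # toy lengths
per_chrom = {"I": [(2, 3)], "II": [(1, 7), (4, 1)]}
allcounts = genome_counts(per_chrom, chr_lengths, order=["I", "II"])
print(allcounts)
```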

transposonmapper.processing.dna_features(region, wig_file, pergene_insertions_file, variable='reads', plotting=True, savefigure=False, verbose=True)

This function takes a user-defined genomic region (i.e. chromosome number, region or gene) and creates a dataframe with information about all genomic features in the chromosome (i.e. genes, nc-DNA etc.). The output is a dataframe with the major information about all genomic features and, optionally, a barplot indicating the number of transposons per genomic region. A genomic region is here defined as a gene (separated into annotated essential and non-essential), telomere, centromere, ars etc.

The dataframe can be used to determine the number of reads outside the genes, and so to identify neutral regions (i.e. genomic regions that, if inhibited, do not influence the fitness of the cells). The neutral regions can then be used to normalize the transposon insertions per gene.

Parameters

region (str, int or list) – Region of interest: a chromosome number (either an integer between 1 and 16 or a roman numeral between I and XVI), a list like [‘V’, 0, 14790] (which creates a barplot between basepairs 0 and 14790), or a gene name.
wig_file (str) – absolute path of the wig file location
pergene_insertions_file (str) – absolute path of the _pergene_insertions.txt file location
variable (str, optional) – Either “transposons” or “reads”, by default “reads”. Used for the plot if plotting is True
plotting (bool, optional) – Whether to produce a bar plot with the reads/insertions per genomic location in the region, by default True
savefigure (bool, optional) – Whether to save the plot in the same folder as the data files, by default False
verbose (bool, optional) – Determines how much textual feedback is given. When set to False, only warnings will be shown. By default True

Returns

Dataframe containing information about the selected chromosome.

Return type

dataframe

transposonmapper.processing.feature_position(feature_dict, chrom, start_chr, dna_dict, feature_type=None)

Get features for every gene in the chromosome of interest

Parameters

feature_dict (dict) – output of sgd_features(sgd_features_file)[i]
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
start_chr (int) – 2nd output of the gene_location function
dna_dict (dict) – first output of the gene_location function
feature_type ([type], optional) – [description], by default None

Return type

dict

transposonmapper.processing.gene_location(chrom, gene_position_dict, verbose)

It gives structured information from the genes inside the chromosome of interest

Parameters

chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
gene_position_dict (dict) – Dictionary with info about the genes inside the chromosome. It is the output of the function read_pergene_file
verbose (bool) – Same as in the main function dna_features. If True, warning messages are allowed.

Returns

dna_dict (dict) – Dictionary with info about genes encoded in the sgd features file
start_chr (int) – Integer indicating the genomic location of where the chromosome of interest starts
end_chr (int) – Integer indicating the genomic location of where the chromosome of interest ends
len_chr (int) – Length of the chromosome of interest
feature_orf_dict (dict) – Dictionary with info about genes in the chromosome of interest

transposonmapper.processing.input_region(region, verbose)

Defines the region of interest for further processing

Parameters

region (str, int or list) – Enter the chromosome as a number between 1 and 16 (or a roman numeral between I and XVI), a list in the form [‘chromosome number’, start_position, end_position], or a valid gene name.
verbose (bool) – To allow warning messages.

Returns

roi_start (NoneType, int) – Start of the genomic location if region is a gene name, otherwise None
roi_end (NoneType, int) – End of the genomic location if region is a gene name, otherwise None
region_type (str) – Either “Gene” or “Chromosome”, depending on the region provided
chrom (str) – Name of the chromosome containing the gene of interest if a gene name is provided as the region; otherwise the roman numeral of the chromosome of interest.
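
The chromosome-style inputs can be normalized along the lines of the sketch below, which converts numbers and roman numerals to a roman numeral and unpacks the list form. The helper normalize_region is hypothetical and covers only these cases; resolving a gene name needs a lookup of gene positions, which is left out here.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]


def normalize_region(region):
    """Return (chrom, start, end) for chromosome-style inputs
    (hypothetical helper; gene names are not handled)."""
    if isinstance(region, list):
        chrom, start, end = region
        return normalize_region(chrom)[0], start, end
    text = str(region).upper()
    if text.isdigit() and 1 <= int(text) <= 16:
        return ROMAN[int(text) - 1], None, None
    if text in ROMAN:
        return text, None, None
    raise ValueError(f"not a chromosome: {region!r} (gene names need a lookup)")


print(normalize_region(5))
print(normalize_region("xvi"))
print(normalize_region(["V", 0, 14790]))
```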

transposonmapper.processing.intergenic_regions(chrom, start_chr, dna_dict)

Getting intergenic regions from chromosome of interest

Parameters

chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
start_chr (int) – 2nd output of the gene_location function
dna_dict (dict) – 1st output of the gene_location function

Returns

dna_dict_new (dict)
genomicregions_list (list)

transposonmapper.processing.length_genome(chr_length_dict)

Output the length of the genome in bp

Parameters: chr_length_dict (dict) – A dictionary describing the length of each chromosome.
Returns: The length of the genome
Return type: int

transposonmapper.processing.list_known_essentials(input_files=None, headerlines=3, verbose=True)

Get all known essential genes from two different files and combine them in one list.

Input is a list of paths to files containing the known essential genes. A default list is implemented using two files present in the same folder as this file. The files are expected to contain genes in a single column and nothing else. The number of header lines can be set with an option, which defaults to 3. The output is a list containing all the genes present in all files given in the input.

Note when using the default files: the length of the output list exceeds the number of known essential genes, as the list sometimes contains both the standard name and the systematic name of a gene.

Parameters

input_files (str, optional) – File path for the essential file in your file system, by default None
headerlines (int, optional) – by default 3
verbose (bool, optional) – To show explanations if True, by default True
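
The combining step described above can be sketched as follows: skip the header lines of each file, then collect the remaining single-column entries into one list. The helper combine_gene_lists is hypothetical, and the gene names and temporary files are placeholders created only for the demonstration.

```python
import os
import tempfile


def combine_gene_lists(paths, headerlines=3):
    """Read single-column gene files and merge them into one list
    (hypothetical helper; duplicates are kept)."""
    genes = []
    for path in paths:
        with open(path) as handle:
            lines = handle.read().splitlines()
        # Skip the header lines and ignore empty lines.
        genes.extend(line.strip() for line in lines[headerlines:] if line.strip())
    return genes


# Demonstration with two placeholder files, each with three header lines.
with tempfile.TemporaryDirectory() as folder:
    paths = []
    for name, body in [("a.txt", "h1\nh2\nh3\nGENE_A\nGENE_B\n"),
                       ("b.txt", "h1\nh2\nh3\nGENE_C\n")]:
        path = os.path.join(folder, name)
        with open(path, "w") as handle:
            handle.write(body)
        paths.append(path)
    essentials = combine_gene_lists(paths)

print(essentials)
```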

transposonmapper.processing.middle_chrom_pos(chr_length_dict)

Defines the middle point of each chromosome

Parameters: chr_length_dict (dict) – A dictionary describing the length of each chromosome.
Returns: A list describing for each chromosome the middle point.
Return type: list
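
A possible reading of "middle point" is the genome-wide coordinate halfway along each chromosome, which is convenient for placing axis labels on a genome-wide plot. The sketch below follows that reading; the helper middle_positions and the genome-wide interpretation are assumptions, and the lengths are toy values.

```python
def middle_positions(chr_length_dict):
    """Return the genome-wide midpoint of each chromosome, in dictionary
    order (hypothetical helper)."""
    middles, offset = [], 0
    for length in chr_length_dict.values():
        middles.append(offset + length / 2)
        offset += length  # the next chromosome starts where this one ends
    return middles


chr_length_dict = {"I": 100, "II": 200, "III": 50}  # toy lengths
middles = middle_positions(chr_length_dict)
print(middles)
```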

transposonmapper.processing.profile_genome(bed_file=None, variable='transposons', bar_width=None, savefig=False, showfig=False)

Created on Thu Mar 18 13:05:39 2021
@author: gregoryvanbeek

This function creates a bar plot along the entire genome. The height of each bar represents the number of transposons or reads at the genomic position indicated on the x-axis.

The bar_width determines how many basepairs are put in one bin. Very few basepairs per bin may make the plot slow to generate, while too many basepairs per bin may obscure areas with few transposons.

Parameters

bed_file (str, optional) – The file path to the location of the bed file in your filesystem, by default None
variable (str, optional) – The variable to plot throughout the genome, by default “transposons”
bar_width (int, optional) – The bin width for the histogram of the plot, by default None, which means the length of the genome divided by 1000 is used internally
savefig (bool, optional) – Save the figure if True, by default False
showfig (bool, optional) – Show the figure if True, by default False

Returns

list – All insertion sites
list – Insertion sites binned according to the bar width

transposonmapper.processing.read_pergene_file(pergene_insertions_file, chrom)

Reads the per-gene file, which holds the information on where each gene starts and ends in the genome.

Parameters

pergene_insertions_file (str) – absolute path of the per gene file location
chrom (str) – Name of the chromosome, in roman numerals, for which to extract the information

Returns

gene_position_dict – A dictionary describing the chromosome, start, and end location of every gene in the chromosome of interest.

Return type

dict

transposonmapper.processing.read_wig_file(wig_file, chrom)

Extracts the information in the wig file related to the chromosome of interest

Parameters

wig_file (str) – absolute path of the wigfile location
chrom (str) – Name of the chromosome, in roman numerals, for which to extract the information from the wig file

Returns

insrt_in_chrom_list (list) – Genomic locations of transposon insertions in the given chromosome.
reads_in_chrom_list (list) – How many reads are in each of the genomic locations of the insertions.
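
The extraction can be sketched for a variableStep wig file, where each chromosome block starts with a "variableStep chrom=..." line followed by "position value" lines. The helper read_wig_lines is hypothetical, works on a list of lines rather than a file path, takes the chromosome name as written in the file (not a roman numeral), and assumes integer read counts.

```python
def read_wig_lines(lines, chrom_name):
    """Collect insertion positions and read counts for one chromosome
    (hypothetical helper for variableStep wig content)."""
    positions, reads = [], []
    in_chrom = False
    for line in lines:
        line = line.strip()
        if line.startswith("variableStep"):
            # A new chromosome block starts; check whether it is ours.
            in_chrom = f"chrom={chrom_name}" in line.split()
            continue
        if not line or line.startswith("track") or not in_chrom:
            continue
        position, value = line.split()[:2]
        positions.append(int(position))
        reads.append(int(value))
    return positions, reads


# Placeholder wig content, not taken from a real dataset.
wig_lines = [
    "track type=wiggle_0",
    "variableStep chrom=chrI",
    "10 3",
    "25 1",
    "variableStep chrom=chrII",
    "7 9",
]
insrt, nreads = read_wig_lines(wig_lines, "chrI")
print(insrt, nreads)
```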

transposonmapper.processing.sgd_features(filepath=None)

This function reads the file SGD_features.tab and creates a dictionary with useful info for processing

Parameters

filepath (str, optional) – file path of the SGD_features.tab file, by default None

Returns

A dictionary with the following info:

key: Feature name

value:

feature type (l[1])
feature qualifier (Verified or Dubious) (l[2])
Standard name (l[4])
Aliases (separated by ‘|’) (l[5])
Parent feature name (typically ‘chromosome …’) (l[6])
Chromosome (l[8])
start coordinate (starting at 0 for each chromosome) (l[9])
end coordinate (starting at 0 for each chromosome) (l[10])

This function reads the SGD_features.txt file found at http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/

Return type

dict
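
The column indices listed above translate into a parser along the lines of the sketch below, which handles a single tab-separated line. The helper parse_sgd_line is hypothetical, taking the feature name from column 3 is an assumption (the text does not state the key column), and the example line contains placeholder values rather than a real SGD record.

```python
def parse_sgd_line(line):
    """Split one tab-separated line into (feature name, info list)
    (hypothetical helper)."""
    cols = line.rstrip("\n").split("\t")
    feature_name = cols[3]  # assumed key column
    # Same order as the list above: l[1], l[2], l[4], l[5], l[6], l[8], l[9], l[10]
    info = [cols[1], cols[2], cols[4], cols[5], cols[6], cols[8], cols[9], cols[10]]
    return feature_name, info


# Placeholder record with eleven columns.
fields = ["ID0", "ORF", "Verified", "YFG001W", "YFG1", "ALIAS1|ALIAS2",
          "chromosome 1", "", "1", "100", "400"]
name, info = parse_sgd_line("\t".join(fields) + "\n")
print(name, info)
```

Collecting the result of each line into a dictionary keyed by name gives the structure described in the Returns section.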

transposonmapper.processing.summed_chr(chr_length_dict)

Create a dictionary where each value is the cumulative sum of all bp in the preceding chromosomes

Parameters: chr_length_dict (dict) – A dictionary describing the length of each chromosome.
Returns: A dictionary where each value corresponds to the cumulative sum of the previous chromosomes lengths.
Return type: dict
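
Read literally, the returned value for a chromosome is the total length of all chromosomes before it, so the first chromosome maps to 0. The sketch below follows that reading; the helper cumulative_offsets is hypothetical and the lengths are toy values.

```python
from itertools import accumulate


def cumulative_offsets(chr_length_dict):
    """Map each chromosome to the summed length of the chromosomes
    before it, in dictionary order (hypothetical helper)."""
    lengths = list(chr_length_dict.values())
    # Running totals shifted by one place, so the first chromosome gets 0.
    offsets = [0] + list(accumulate(lengths))[:-1]
    return dict(zip(chr_length_dict, offsets))


chr_length_dict = {"I": 100, "II": 200, "III": 50}  # toy lengths
offsets = cumulative_offsets(chr_length_dict)
print(offsets)
```

Adding such an offset to a chromosome-relative position gives the genome-wide position.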
