transposonmapper.processing
- transposonmapper.processing.binned_list(allcounts_list, bar_width)
Returns a binned list of the counts, for use in a histogram.
- Parameters
allcounts_list (numpy.ndarray) – Output of the counts_genome function
bar_width (float) – Width of each bin in basepairs. It can be chosen as a function of the genome length, e.g. bar_width=l_genome/1000
- Returns
Binned list
- Return type
list
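As a rough illustration of the binning step, the sketch below sums consecutive counts in bins of bar_width positions. The helper name bin_counts and the choice to sum within each bin are assumptions made for this example; they are not taken from the library's implementation.

```python
def bin_counts(counts, bar_width):
    """Sum consecutive counts in bins of `bar_width` positions (hypothetical helper)."""
    width = int(bar_width)
    if width < 1:
        raise ValueError("bar_width must be at least 1")
    # The last bin may be shorter when len(counts) is not a multiple of width.
    return [sum(counts[i:i + width]) for i in range(0, len(counts), width)]


counts = [0, 2, 1, 0, 0, 5, 3]  # illustrative per-position counts
binned = bin_counts(counts, bar_width=3)
print(binned)  # [3, 5, 3]
```

With a genome of millions of basepairs, a larger bar_width gives fewer bars to draw, which is the trade-off the bar_width parameter controls.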
- transposonmapper.processing.build_dataframe(dna_dict, start_chr, end_chr, insrt_in_chrom_list, reads_in_chrom_list, genomicregions_list, chrom)
Main function that builds the dataframe containing the characteristics of all genes.
- Parameters
dna_dict (dict) – 1st output of the function intergenic_regions
start_chr (int) – 2nd output of the function gene_location
end_chr (int) – 3rd output of the function gene_location
insrt_in_chrom_list (list) – 1st output of the function read_wig_file
reads_in_chrom_list (list) – 2nd output of the function read_wig_file
genomicregions_list (list) – All the annotated genomic regions, 2nd output of the intergenic_regions function
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
- Returns
dna_df2 (dataframe) – Dataframe containing information about the selected chromosome. It includes the following columns:
Feature name
Standard name of the feature
Aliases of feature name (if any)
Feature type (e.g. gene, telomere, centromere, etc. If None, this region is not defined)
Chromosome
Position of feature type in terms of bp relative to chromosome.
Length of region in terms of basepairs
Number of insertions in region
Number of insertions in truncated region where truncated region is the region without the first and last 100bp.
Number of reads in region
Number of reads in truncated region.
Number of reads per insertion (defined by Nreads/Ninsertions)
Number of reads per insertion in truncated region (defined by Nreads_truncatedgene/Ninsertions_truncatedgene)
NOTE: truncated regions are only determined for genes. For the other regions the truncated region values are the same as the non-truncated region values.
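The truncated-region columns can be illustrated with a small sketch. It counts insertions and reads in a region and in the same region without its first and last 100 bp, then derives the reads per insertion. The helper name region_stats, the inclusive boundaries and the value 0.0 for regions without insertions are assumptions made for this example.

```python
def region_stats(insertions, reads, start, end, trim=100):
    """Return (n_insertions, n_reads, reads_per_insertion) for the full region
    and for the region without its first and last `trim` bp (hypothetical helper)."""
    def count(lo, hi):
        hits = [r for pos, r in zip(insertions, reads) if lo <= pos <= hi]
        n_ins, n_reads = len(hits), sum(hits)
        # Reads per insertion is undefined without insertions; 0.0 is used here (assumption).
        per_ins = n_reads / n_ins if n_ins else 0.0
        return n_ins, n_reads, per_ins

    return count(start, end), count(start + trim, end - trim)


insertions = [1050, 1180, 1400, 1950]  # illustrative insertion positions
reads = [4, 10, 6, 2]                  # illustrative read counts per insertion
full, truncated = region_stats(insertions, reads, start=1000, end=2000)
print(full)       # (4, 22, 5.5)
print(truncated)  # (2, 16, 8.0)
```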
- transposonmapper.processing.checking_features(feature_orf_dict, chrom, gene_position_dict, verbose)
Checks the input values.
- Parameters
feature_orf_dict (dict) – Last output of the gene_location function
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
gene_position_dict (dict) – Output of the read_pergene_file function
verbose (bool) – If True, warning messages are shown
- transposonmapper.processing.chromosome_name_bedfile(bed_file)
Returns some properties of the chromosomes in a bed file. The input can be either of two options:
The full path to a bed file, in which case the function opens the bed file itself.
A list of the lines in the bed file. This requires reading the bed file before calling this function and passing all its lines as a list; the function then does not open the bed file again.
- Returns three dictionaries (in this order):
The names of the chromosomes as used in the bed file (keys are the roman numerals I to XVI and values are the names used in the bed file).
The start line of each chromosome in the bed file (keys are the roman numerals of the chromosome names and values are the start lines).
The end line of each chromosome in the bed file (keys are the roman numerals of the chromosome names and values are the end lines).
TODO (from the source): change lines 60 and 71 to automatically recognize the mitochondrial DNA name.
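The start and end lines per chromosome can be found in a single pass over the lines of the bed file. The sketch below is only an illustration: the helper name chromosome_line_ranges is hypothetical, it assumes a header line starting with "track" and data lines whose first field is the chromosome name, and it keys the dictionaries by the name found in the file instead of by roman numeral.

```python
def chromosome_line_ranges(bed_lines):
    """Return (start_line, end_line) dicts keyed by chromosome name (hypothetical helper)."""
    start_line, end_line = {}, {}
    for index, line in enumerate(bed_lines):
        fields = line.split()
        if not fields or fields[0] == "track":
            continue  # skip blank lines and the header
        chrom = fields[0]
        start_line.setdefault(chrom, index)  # first line seen for this chromosome
        end_line[chrom] = index              # overwritten until the last line of the chromosome
    return start_line, end_line


bed_lines = [  # illustrative content, not real data
    "track name=example useScore=1",
    "chrI 43 44 . 120",
    "chrI 80 81 . 200",
    "chrII 15 16 . 100",
]
starts, ends = chromosome_line_ranges(bed_lines)
print(starts)  # {'chrI': 1, 'chrII': 3}
print(ends)    # {'chrI': 2, 'chrII': 3}
```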
- transposonmapper.processing.cleanfiles(filepath=None, custom_header=None, split_chromosomes=False)
Removes transposon insertions in .bed and .wig files that were mapped outside the chromosomes, creates consistent naming for chromosomes and changes the header of files with custom headers.
The function reads a .bed or .wig file and removes any insertions that were mapped outside a chromosome. A read can be mapped outside a chromosome during the alignment and transposon mapping steps; it means that the position of the insertion site of the read is larger than the length of the chromosome it is mapped to.
A new file is created with the same name as the input file and the extension _clean.bed or _clean.wig, saved at the same location as the input file. In this _clean file the redundant insertions that were mapped outside the chromosome are removed. The lengths of the chromosomes are determined by the python function ‘chromosome_position’, which is part of the python module ‘chromosome_and_gene_positions.py’. This module gets the lengths of the chromosomes from a .gff file downloaded from SGD (https://www.yeastgenome.org/).
Besides removing the reads outside the chromosomes, the function also changes the names of the chromosomes to roman numerals, and a custom header can be given (optional). Finally, the bed and wig files can be split into separate files for each chromosome. These are placed in a _chromosomesplit folder at the location of the bed or wig file.
@author: gregoryvanbeek. Created on Fri Mar 5 15:39:53 2021
- Parameters
filepath (str) – File path of the wig or bed file to analyze
custom_header (str) – String header to be included in the output file
split_chromosomes (bool) – If True, a separate file is created for each chromosome; otherwise a single file contains the information for all chromosomes.
- Returns
A file with the same basename as the input file, in the same location, with the extension _clean.wig or _clean.bed.
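The core filtering idea, dropping insertions whose position exceeds the chromosome length, can be sketched as follows. The helper name remove_outside_insertions and the chromosome lengths are made up for this example, and lines that do not start with a known chromosome name (such as the header) are simply kept.

```python
def remove_outside_insertions(bed_lines, chr_lengths):
    """Drop bed lines whose start position exceeds the chromosome length (hypothetical helper)."""
    kept = []
    for line in bed_lines:
        fields = line.split()
        if len(fields) < 2 or fields[0] not in chr_lengths:
            kept.append(line)  # header or unrecognized line: left untouched
            continue
        position = int(fields[1])
        if position <= chr_lengths[fields[0]]:
            kept.append(line)
    return kept


chr_lengths = {"chrI": 1000, "chrII": 500}  # illustrative lengths, not real chromosome sizes
bed_lines = [
    "track name=example useScore=1",
    "chrI 43 44 . 120",
    "chrI 1500 1501 . 20",   # outside chrI, removed
    "chrII 15 16 . 100",
]
clean = remove_outside_insertions(bed_lines, chr_lengths)
print(len(clean))  # 3
```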
- transposonmapper.processing.counts_genome(variable, bed_file, gff_file)
Counts the reads or the transposons per chromosome.
- Parameters
variable (str) – “transposons” or “reads”
bed_file (str) – Absolute path to the bed file
gff_file (str) – Absolute path to the gff file
- Returns
An array of the length of the genome with the counts of each variable per location in the genome.
- Return type
numpy.ndarray
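To make the shape of the returned array concrete, the sketch below fills a genome-length list of counts from a few insertions. It is a simplified stand-in that uses a plain list instead of a numpy array; the helper name genome_count_array, the 1-based positions and the input format (chromosome, position, reads) are assumptions made for this example.

```python
def genome_count_array(genome_length, offsets, insertions, variable="transposons"):
    """Build a per-position count list for the whole genome (hypothetical helper).

    offsets: summed length of all preceding chromosomes, per chromosome.
    insertions: iterable of (chromosome, position, reads) with 1-based positions.
    """
    if variable not in ("transposons", "reads"):
        raise ValueError("variable must be 'transposons' or 'reads'")
    counts = [0] * genome_length
    for chrom, position, reads in insertions:
        index = offsets[chrom] + position - 1  # 1-based position to 0-based index
        counts[index] += 1 if variable == "transposons" else reads
    return counts


offsets = {"I": 0, "II": 5}  # illustrative: chromosome I is 5 bp long
insertions = [("I", 2, 7), ("II", 1, 3), ("II", 1, 2)]
print(genome_count_array(9, offsets, insertions))           # [0, 1, 0, 0, 0, 2, 0, 0, 0]
print(genome_count_array(9, offsets, insertions, "reads"))  # [0, 7, 0, 0, 0, 5, 0, 0, 0]
```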
- transposonmapper.processing.dna_features(region, wig_file, pergene_insertions_file, variable='reads', plotting=True, savefigure=False, verbose=True)
Takes a user-defined genomic region (i.e. chromosome number, region or gene) and creates a dataframe with information about all genomic features in the chromosome (i.e. genes, nc-DNA etc.). This can be used to determine the number of reads outside the genes, which can in turn be used to normalize the number of reads in the genes.
The output is a dataframe with the main information about all genomic features and, optionally, a barplot indicating the number of transposons per genomic region. A genomic region is here defined as a gene (separated into annotated essential and not essential), telomere, centromere, ars etc. This can be used for identifying neutral regions (i.e. genomic regions that, if inhibited, do not influence the fitness of the cells), which can then be used for normalizing the transposon insertions per gene.
- Parameters
region (str) – Chromosome number (either a normal number between 1 and 16 or a roman numeral between I and XVI), a list like [‘V’, 0, 14790] (which creates a barplot between basepair 0 and 14790) or a gene name.
wig_file (str) – Absolute path to the wig file
pergene_insertions_file (str) – Absolute path to the _pergene_insertions.txt file
variable (str, optional) – Either “transposons” or “reads”, by default “reads”. Used for the plot if plotting is True
plotting (bool, optional) – Whether to produce a bar plot with the reads/insertions per genomic location in the region, by default True
savefigure (bool, optional) – Whether to save the plot in the same folder as the data files, by default False
verbose (bool, optional) – Determines how much textual feedback is given. When set to False, only warnings will be shown. By default True
- Returns
Dataframe containing information about the selected chromosome.
- Return type
dataframe
- transposonmapper.processing.feature_position(feature_dict, chrom, start_chr, dna_dict, feature_type=None)
Get features for every gene in the chromosome of interest
- Parameters
feature_dict (dict) – Output of sgd_features(sgd_features_file)[i]
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
start_chr (int) – Genomic location where the chromosome of interest starts (2nd output of the gene_location function)
dna_dict (dict) – first output of the gene_location function
feature_type ([type], optional) – [description], by default None
- Return type
dict
- transposonmapper.processing.gene_location(chrom, gene_position_dict, verbose)
Returns structured information about the genes in the chromosome of interest.
- Parameters
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
gene_position_dict (dict) – Dictionary with information about the genes in the chromosome. It is the output of the function read_pergene_file
verbose (bool) – Same as in the main function dna_features. If True, warning messages are shown.
- Returns
dna_dict (dict) – Dictionary with info about genes encoded in the sgd features file
start_chr (int) – Integer indicating the genomic location of where the chromosome of interest starts
end_chr (int) – Integer indicating the genomic location of where the chromosome of interest ends
len_chr (int) – Length of the chromosome of interest
feature_orf_dict (dict) – Dictionary with info about genes in the chromosome of interest
- transposonmapper.processing.input_region(region, verbose)
Defines the region of interest for further processing
- Parameters
region (str, int or list) – A chromosome as a number between 1 and 16 (or a roman numeral between I and XVI), a list of the form [chromosome number, start_position, end_position], or a valid gene name.
verbose (bool) – If True, warning messages are shown.
- Returns
roi_start (NoneType, int) – Start of the genomic location if region is a gene name, otherwise None
roi_end (NoneType, int) – End of the genomic location if region is a gene name, otherwise None
region_type (str) – Either “Gene” or “Chromosome”, depending on the region provided
chrom (str) – Name of the chromosome containing the gene of interest if a gene name is provided as the region, otherwise the roman numeral of the chromosome of interest.
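One part of this input handling, accepting a chromosome either as a number or as a roman numeral, can be sketched as below. The helper name to_roman_chromosome is hypothetical and the sketch covers only the chromosome case; gene names and region lists are not handled here.

```python
ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII",
         "IX", "X", "XI", "XII", "XIII", "XIV", "XV", "XVI"]


def to_roman_chromosome(chromosome):
    """Return the roman-numeral name for a chromosome given as 1-16 or I-XVI (hypothetical helper)."""
    text = str(chromosome).strip().upper()
    if text in ROMAN:
        return text
    if text.isdigit() and 1 <= int(text) <= 16:
        return ROMAN[int(text) - 1]
    raise ValueError(f"not a chromosome between 1 and 16: {chromosome!r}")


print(to_roman_chromosome(5))      # V
print(to_roman_chromosome("xvi"))  # XVI
```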
- transposonmapper.processing.intergenic_regions(chrom, start_chr, dna_dict)
Gets the intergenic regions of the chromosome of interest.
- Parameters
chrom (str) – Name of the chromosome, in roman numerals, from which to extract the information.
start_chr (int) – 2nd output of the gene_location function
dna_dict (dict) – 1st output of the gene_location function
- Returns
dna_dict_new (dict)
genomicregions_list (list)
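Conceptually, the intergenic regions are the stretches of a chromosome not covered by any annotated gene. The sketch below shows one way to compute such gaps from gene start and end positions; the helper name gaps_between_genes and the 1-based inclusive coordinates are assumptions, and the library's own data structures (dna_dict, genomicregions_list) are not reproduced.

```python
def gaps_between_genes(gene_positions, chromosome_length):
    """Return (start, end) intervals not covered by any gene, 1-based inclusive (hypothetical helper)."""
    gaps = []
    next_free = 1  # first position not yet covered by a gene
    for start, end in sorted(gene_positions):
        if start > next_free:
            gaps.append((next_free, start - 1))
        next_free = max(next_free, end + 1)  # max() handles overlapping genes
    if next_free <= chromosome_length:
        gaps.append((next_free, chromosome_length))
    return gaps


genes = [(10, 20), (15, 30), (50, 60)]  # illustrative gene coordinates
print(gaps_between_genes(genes, chromosome_length=100))
# [(1, 9), (31, 49), (61, 100)]
```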
- transposonmapper.processing.length_genome(chr_length_dict)
Outputs the length of the genome in bp.
- Parameters
chr_length_dict (dict) – A dictionary describing the length of each chromosome.
- Returns
The length of the genome
- Return type
int
- transposonmapper.processing.list_known_essentials(input_files=None, headerlines=3, verbose=True)
Get all known essential genes from two different files and combine them in one list.
The input is a list of paths to files with the known essential genes. A default list is implemented using two files present in the same folder as this file. The files are expected to contain genes in a single column and nothing else. The number of header lines can be set; by default it is 3. The output is a list containing all the genes present in all files given in the input.
Note when using the default files: the length of the output list exceeds the number of known essential genes, as the list sometimes contains both the standard name and the systematic name of a gene.
- Parameters
input_files (str, optional) – File path for the essential file in your file system, by default None
headerlines (int, optional) – Number of header lines in each file, by default 3
verbose (bool, optional) – To show explanations if True, by default True
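The file format described above (a few header lines followed by one gene per line) can be read with a few lines of code. This is a minimal sketch with a hypothetical helper name, read_gene_lists, and a temporary file with made-up header text; it does not use the default files shipped with the package.

```python
import os
import tempfile


def read_gene_lists(input_files, headerlines=3):
    """Combine single-column gene files into one list, skipping header lines (hypothetical helper)."""
    genes = []
    for path in input_files:
        with open(path) as handle:
            lines = handle.read().splitlines()
        # Skip the header and ignore empty lines.
        genes.extend(line.strip() for line in lines[headerlines:] if line.strip())
    return genes


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "essentials.txt")
    with open(path, "w") as handle:
        handle.write("header 1\nheader 2\nheader 3\nYAL001C\nTFC3\n")
    print(read_gene_lists([path]))  # ['YAL001C', 'TFC3']
```

The example also shows why the combined list can be longer than the number of genes: YAL001C and TFC3 are the systematic and standard name of the same gene.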
- transposonmapper.processing.middle_chrom_pos(chr_length_dict)
Defines the middle point of each chromosome
- Parameters
chr_length_dict (dict) – A dictionary describing the length of each chromosome.
- Returns
A list describing for each chromosome the middle point.
- Return type
list
- transposonmapper.processing.profile_genome(bed_file=None, variable='transposons', bar_width=None, savefig=False, showfig=False)
Creates a bar plot along the entire genome. The height of each bar represents the number of transposons or reads at the genomic position indicated on the x-axis.
The bar_width determines how many basepairs are put in one bin. Few basepairs per bin may make the function slow; too many basepairs per bin may obscure areas with few transposons.
@author: gregoryvanbeek. Created on Thu Mar 18 13:05:39 2021
- Parameters
bed_file (str, optional) – File path of the bed file in your filesystem, by default None
variable (str, optional) – The variable to plot throughout the genome, by default “transposons”
bar_width (int, optional) – The bin width for the histogram, by default None, in which case the length of the genome divided by 1000 is used
savefig (bool, optional) – Save the figure if True, by default False
showfig (bool, optional) – Show the figure if True, by default False
- Returns
list – All insertion sites
list – Binned insertion sites according to the bar width
- transposonmapper.processing.read_pergene_file(pergene_insertions_file, chrom)
Reads the pergene file, which contains for each gene where it starts and ends in the genome.
- Parameters
pergene_insertions_file (str) – Absolute path to the pergene file
chrom (str) – Name of the chromosome, in roman numerals, for which to extract the information
- Returns
gene_position_dict – A dictionary describing the chromosome, start, and end location of every gene in the chromosome of interest.
- Return type
dict
- transposonmapper.processing.read_wig_file(wig_file, chrom)
Extracts the information in the wig file related to the chromosome of interest
- Parameters
wig_file (str) – Absolute path to the wig file
chrom (str) – Name of the chromosome, in roman numerals, for which to extract the information from the wig file
- Returns
insrt_in_chrom_list (list) – Genomic locations of transposon insertions in the given chromosome.
reads_in_chrom_list (list) – How many reads are in each of the genomic locations of the insertions.
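The two returned lists can be pictured with a small parser for variableStep wig content. The sketch below is an illustration only: the helper name wig_insertions is hypothetical, and it assumes blocks declared as "variableStep chrom=chr" followed by the roman numeral, with one "position reads" pair per line.

```python
def wig_insertions(wig_lines, chrom):
    """Return insertion positions and read counts for one chromosome (hypothetical helper)."""
    positions, reads = [], []
    in_chrom = False
    for line in wig_lines:
        fields = line.split()
        if not fields or fields[0].lower() == "track":
            continue  # skip blank lines and the track header
        if fields[0].lower() == "variablestep":
            # A new block starts; check whether it belongs to the requested chromosome.
            in_chrom = f"chrom=chr{chrom}" in fields[1:]
        elif in_chrom:
            positions.append(int(fields[0]))
            reads.append(int(fields[1]))
    return positions, reads


wig_lines = [  # illustrative content, not real data
    "track type=wiggle_0 name=example",
    "variableStep chrom=chrI",
    "43 120",
    "80 200",
    "variableStep chrom=chrII",
    "15 100",
]
print(wig_insertions(wig_lines, "I"))   # ([43, 80], [120, 200])
print(wig_insertions(wig_lines, "II"))  # ([15], [100])
```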
- transposonmapper.processing.sgd_features(filepath=None)
Reads the file SGD_features.tab and creates a dictionary with useful information for processing.
- Parameters
filepath (str, optional) – File path of the SGD_features.tab file, by default None
- Returns
A dictionary with the following info:
key: Feature name
value:
feature type (l[1])
feature qualifier (Verified or Dubious) (l[2])
Standard name (l[4])
Aliases (separated by ‘|’) (l[5])
Parent feature name (typically ‘chromosome …’) (l[6])
Chromosome (l[8])
start coordinate (starting at 0 for each chromosome) (l[9])
end coordinate (starting at 0 for each chromosome) (l[10])
The SGD_features.tab file can be found at http://sgd-archive.yeastgenome.org/curation/chromosomal_feature/
- Return type
dict
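The column indices listed for the dictionary values suggest a simple tab-split of each line. The sketch below shows that idea with a hypothetical helper, parse_sgd_features. Using column 3 as the feature name (the dictionary key) is an assumption that the documentation above does not state, and the example line is made up.

```python
def parse_sgd_features(lines):
    """Map feature name to selected columns of tab-separated feature lines (hypothetical helper)."""
    features = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 11 or not cols[3]:
            continue  # skip short lines and features without a name
        # Column 3 as the key is an assumption; the value indices follow the documented list.
        features[cols[3]] = [cols[1], cols[2], cols[4], cols[5],
                             cols[6], cols[8], cols[9], cols[10]]
    return features


# Made-up example line with 11 tab-separated columns.
example = "\t".join(["ID0001", "ORF", "Verified", "YFG1", "GEN1", "ALIAS1|ALIAS2",
                     "chromosome 1", "", "1", "100", "400"])
print(parse_sgd_features([example]))
```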
- transposonmapper.processing.summed_chr(chr_length_dict)
Creates a dictionary where each value is the cumulative sum of the bp in the chromosomes
- Parameters
chr_length_dict (dict) – A dictionary describing the length of each chromosome.
- Returns
A dictionary where each value corresponds to the cumulative sum of the previous chromosomes lengths.
- Return type
dict
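The relation between the chromosome lengths, the cumulative sums and the genome length can be illustrated as follows. The helper name cumulative_offsets and the chromosome lengths are made up for this example, and the sketch assumes the dictionary is ordered from the first to the last chromosome.

```python
def cumulative_offsets(chr_length_dict):
    """Map each chromosome to the summed length of all chromosomes before it (hypothetical helper)."""
    offsets = {}
    total = 0
    for chrom, length in chr_length_dict.items():  # relies on dict insertion order (Python 3.7+)
        offsets[chrom] = total
        total += length
    return offsets


lengths = {"I": 100, "II": 250, "III": 50}  # illustrative lengths, not real chromosome sizes
offsets = cumulative_offsets(lengths)
genome_length = sum(lengths.values())
print(offsets)        # {'I': 0, 'II': 100, 'III': 350}
print(genome_length)  # 400
```

Adding the offset of a chromosome to a position within that chromosome gives the position in whole-genome coordinates.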