Searching for genomic features of transposition events

This scripts takes a user defined genomic region (i.e. chromosome number, region or gene) and creates a dataframe including information about all genomic features in the chromosome (i.e. genes, nc-DNA etc.). This can be used to determine the number of reads outside the genes to use this for normalization of the number of reads in the genes. To run this script, the following files are required and should be placed in the same directory as this file: - chromosome_and_gene_positions.py - chromosome_names_in_files.py - gene_names.py - read_sgdfeatures.py - normalize_reads.py

For the ipython Notebook version of this script with more extensive explanation, see here

This script consists of two functions, dna_features and feature_position. The function dna_features is the main function that takes user inputs and this function calls the function feature_position which is not intended to be used directly. The function dna_features takes the following arguments:

region=[int]||[string]||[list] (required)
wig_file=[path] (required)
pergene_insertions_file=[path] (required)
variable="reads"||"insertions"
plotting=True||False
savefigure=True||False
verbose=True||False

The script takes a wig file and a pergene_insertions_file and a genomic region. This genomic region can be:

  • chromosome number (either roman numeral or integer between 1 and 16)

  • a list with three arguments: first a number defining the chromosome (again either roman numeral or integer between 1 and 16) second an integer defining the start basepair and third an integer defining an end basepair (e.g. [“I”, 10000, 20000] to get the region between basepair 10000 and 20000 on chromosome 1)

  • a gene name (e.g. “CDC42”) which will automatically get the corresponding chromosome and basepair position

The plotting argument (True or False) defines whether to create a barplot. The variable argument determines what to plot, either reads or insertions and the savefigure whether to automatically save the figure at the same location as where this script is stored. Finally the verbose determines if any printing output should be given (this is mostly useful for when calling this script from other python scripts).

This scripts does not only look at genes, but also at other genomic regions like telomere, centromeres, rna genes etc. All these features are stored in one dataframe called dna_df2 that includes naming and positional information about the features and the insertion and read counts (see output). The dataframe will always be created for one entire chromosome (regardless if a basepair region or gene name was entered in the region argument). When the plotting is set to the True, it will also create a barplot for the same chromosome within the region that is defined in the region variable. The plot distinguishes between nonessential genes, essential genes, other genomic features (e.g. telomeres, centromeres etc.) and noncoding dna. The width of the bars is determined by the length of the genomic feature and the height represents either the number of reads or insertions (depending what is defined in variable).

This function can be useful for other python functions as well when information is required about the positions, insertions and read counts of various genomic features. The full list of genomic features that is regarded in the dataframe is mentioned in read_sgdfeatures.py.

  • Output

The main output is the dna_df2 dataframe in which each row represents a genomic feature and has the following columns:

  • Feature name

  • Feature standard name

  • Feature aliases (different names for the same feature, mainly for genes)

  • Essentialilty (only for genes)

  • Chromosome where feature is located

  • Position in terms in basepairs

  • Length of feature in terms of basepairs

  • Number of insertions in feature

  • Number of insertions in truncated feature (only for genes, the insertions in the first and last 100bp are ignored, see below)

  • Sum of reads in feature

  • Sum of reads in truncated feature (only for genes, where the reads in the first and last 100bp are ignored)

  • Number of reads per insertion

  • Number of reads per insertion (only for genes, where the insertions and reads in the first and last 100bp are ignored)

The truncated feature columns ignores basepairs at the beginning and end of a gene. This can be useful as it is mentioned that insertions located at the beginning or end of a gene results in a protein that is still functional (although truncated) (e.g. see Michel et.al. 2017) (see Notes for a further discussion about how this is defined).

When plotting is set to True, a barplot is created where the width of the bars correspond to the width of the feature the bar is representing. This can be automatically saved at the location where the python script is stored. The plot is created for an entire chromosome, or it can be created for a specific region, for example when a list is provided in the region argument or when a gene name is given.

Note

  • The definition for a truncated gene is currently that the first and last 100bp are ignored. This is not completely fair as this is much more stringent for short genes compared to long genes. Alternatively this can be changed to a percentage, for example ignoring the first and last 5 percent of a gene, but this can create large regions in long genes. There is no option to set this directly in the script, but if this needs to be changed, search the script for the following line: #TRUNCATED GENE DEFINITION. This should give the N10percent variable that controls the definition of a truncated gene.

  • The barplot currently only takes reads or insertions, but it might be useful to include reads per insertions as well.

  • Sometimes it can happen that two genomic regions can overlap. When this happens, the dataframe shows a feature, in the next row another feature and then the next row from that it continues with the first feature (e.g. row1; geneA, row2; overlapping feature, row3; geneA). This issue is not automatically solved yet, but best is to, whenever you find a feature, to search if that feature also exists a couple of rows further on. If yes, sum all rows between the first and last occurance of your gene (e.g. in the example above for geneA, sum the values from row1, row2 and row3).

How to use the code in the same script

region = 1 #e.g. 1, "I", ["I", 0, 10000"], gene name (e.g. "CDC42")
wig_file = r""
pergene_insertions_file = r""
plotting=True
variable="reads" #"reads" or "insertions"
savefigure=False
verbose=True
if __name__ == '__main__':
    dna_df2 = dna_features(region=region,
                 wig_file=wig_file,
                 pergene_insertions_file=pergene_insertions_file,
                 variable=variable,
                 plotting=plotting,
                 savefigure=savefigure,
                 verbose=verbose)