Welcome to RGT’s documentation!

GenomicRegion

GenomicRegion describes a genomic region.

class GenomicRegion.GenomicRegion(chrom, initial, final, name=None, orientation=None, data=None, proximity=None)

Keyword arguments:

  • chrom – Chromosome.
  • initial – Start position
  • final – End position
  • name – Name of the region
  • orientation – Orientation of the region, “+” or “-”
  • data – Extra information
  • proximity – Close genes
distance(y)

Return the distance between two GenomicRegions. If overlapping, return 0; if on different chromosomes, return None.

Keyword arguments:

  • y – Given GenomicRegion to compare.
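
The rule above can be sketched on plain tuples. This is a minimal standalone illustration, not RGT code: the helper name `region_distance`, the `(chrom, initial, final)` tuple layout and the half-open coordinate convention (final is exclusive) are all assumptions made for the example.

```python
def region_distance(a, b):
    """Distance between two (chrom, initial, final) tuples.

    Illustrative stand-in for the behaviour described above; assumes
    half-open coordinates, which the original text does not state.
    """
    chrom_a, start_a, end_a = a
    chrom_b, start_b, end_b = b
    if chrom_a != chrom_b:
        return None  # different chromosomes: no distance is defined
    if start_a < end_b and start_b < end_a:
        return 0  # the regions overlap
    # Gap between the end of the left region and the start of the right one.
    return max(start_a, start_b) - min(end_a, end_b)
```

With these assumptions, two regions on the same chromosome separated by a gap return the gap size, regardless of the order in which they are passed.
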
extend(left, right)

Extend GenomicRegion on both sides.

Keyword arguments:

  • left – Define the length to extend on left.
  • right – Define the length to extend on right.
extract_blocks()

Extract the block information in self.data into a GenomicRegionSet.

get_data(as_list=False)

Return data as string (with special separating character (_$_)) or as list.

Keyword arguments:

  • as_list – Return a list instead of a string.
overlap(region)

Return True, if GenomicRegion overlaps with region, else False.

Keyword arguments:

  • region – Given GenomicRegion to compare.
toString(space=False)

Return a string of GenomicRegion by its position.

Keyword arguments:

  • space – insert spaces between the values.

GenomicRegionSet

GenomicRegionSet represents a list of GenomicRegions.

class GenomicRegionSet.GenomicRegionSet(name)

Keyword arguments:

  • name – Name of the GenomicRegionSet
add(region)

Add GenomicRegion.

Keyword arguments:
  • region – The GenomicRegion to be added.
any_chrom(chrom, len_min=False, len_max=False)

Return a list of the regions that belong to the given chromosome.

Keyword arguments:

  • chrom – Define chromosome
  • len_min – minimum length
  • len_max – maximum length

Return:

  • A list of the regions that belong to the given chromosome.
by_names(names)

Subset the GenomicRegionSet by the given list of names.

Keyword arguments:

  • names – A list of names as targets.

Return:

  • A GenomicRegionSet containing the regions with the target names.
change_name_by_dict(convert_dict)

Change the names of each region by the given dictionary.

Keyword arguments:

  • convert_dict – A dictionary having original names as its keys and new names as its values.
closest(y, distance=False)

Return a new GenomicRegionSet including the region(s) of y which are closest to any region of self. If there is an intersection, return False.

Keyword arguments:

  • y – the GenomicRegionSet to compare with
  • distance – show the distance in parentheses

Return:

  • A GenomicRegionSet which contains the regions nearest to self
cluster(max_distance)

Cluster the regions that lie within a given distance of each other and return the result as a new GenomicRegionSet.

Keyword arguments:

  • max_distance – the maximum distance between regions within the same cluster

Return:

  • A GenomicRegionSet including clusters
self           ----           ----            ----
                  ----             ----                 ----
Result(d=1)    -------        ---------       ----      ----
Result(d=10)   ---------------------------------------------        
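
The diagram above can be approximated with a short standalone sketch on `(start, end)` tuples from a single chromosome. The function name `cluster_intervals` is hypothetical, and treating a gap of exactly `max_distance` as "close enough" is an assumption; the library may use a strict comparison.

```python
def cluster_intervals(intervals, max_distance):
    """Cluster (start, end) intervals lying on one chromosome.

    Neighbouring intervals join the same cluster when the gap between
    them is at most max_distance (assumed, not confirmed by the source).
    """
    clusters = []
    for start, end in sorted(intervals):
        if clusters and start - clusters[-1][1] <= max_distance:
            # Close enough to the current cluster: extend it.
            clusters[-1][1] = max(clusters[-1][1], end)
        else:
            # Too far away: start a new cluster.
            clusters.append([start, end])
    return [tuple(cluster) for cluster in clusters]
```

A small `max_distance` only joins near neighbours, while a large one collapses everything on the chromosome into a single cluster, as in the last row of the diagram.
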
combine(region_set, change_name=True, output=False)

Add another GenomicRegionSet without merging the overlapping regions.

Keyword arguments:

  • region_set – the GenomicRegionSet to combine with
  • change_name – Combine the names as a new name for the combined regions
  • output – If True, it returns a GenomicRegionSet; if False, it combines the regions in place.
complement(organism, chrom_X=True, chrom_Y=False, chrom_M=False)

Return the complement GenomicRegionSet for the given organism.

Keyword arguments:

  • organism – Define organism’s genome to use. (hg19, mm9)
  • chrom_X – The result covers chromosome X or not. (True/False)
  • chrom_Y – The result covers chromosome Y or not. (True/False)
  • chrom_M – The result covers mitochondrial chromosome or not. (True/False)

Return:

  • z – A GenomicRegionSet which contains the complement regions
count_by_region(region)

Return the number of regions that intersect with the given GenomicRegion.

Keyword arguments:

  • region – A GenomicRegion defining the interval for counting.
count_by_regionset(regionset)

Return the number of regions that intersect with the given GenomicRegionSet.

Keyword arguments:

  • regionset – A GenomicRegionSet defining the interval for counting.
counts_per_region(regionset)

Return a list of counts of the given GenomicRegionSet based on self.

Keyword arguments:

  • regionset – A GenomicRegionSet defining the interval for counting.

Note

The result list has the same length as the self GenomicRegionSet.

coverage_per_region(regionset)

Return a list of coverage values of the given GenomicRegionSet based on the self GenomicRegionSet.

Keyword arguments:

  • regionset – A GenomicRegionSet used as the signal for calculating the coverage.

Note

The result list has the same length as the self GenomicRegionSet.

covered_by_aregion(region)

Return a GenomicRegionSet which includes all the regions covered by a given region.

Keyword arguments:

  • region – A GenomicRegion defining the interval.

Return:

  • A GenomicRegionSet containing the regions within the defined interval.
extend(left, right, percentage=False)

Extend every region in the GenomicRegionSet.

Keyword arguments:

  • percentage – the input values of left and right can be any positive value or a negative value larger than -50 %
extract_blocks()

Extract the exon information from self.data and add them into the self GenomicRegionSet.

filter_by_gene_association(gene_set=None, organism='hg19', promoterLength=1000, threshDist=50000)

Updates self in order to keep only the coordinates associated to genes which are in gene_set.

It also returns information regarding the mapped genes and proximity information (if mapped in proximal (PROX) or distal (DIST) regions). If a region was associated with two genes, one in the gene_set and the other not in gene_set, then the updated self still keeps information about both genes. However, the latter won’t be reported as a mapped gene.

Keyword arguments:

  • gene_set – List of gene names as a GeneSet object. If None, then consider all genes to be enriched. (default None)
  • organism – Organism in order to fetch genomic data. (default hg19)
  • promoterLength – Length of the promoter region. (default 1000)
  • threshDist – Threshold maximum distance for a coordinate to be considered associated with a gene. (default 50000)

Return:

  • None – Updates self in order to keep only the coordinates associated to genes which are in gene_set
  • all_genes – GeneSet that contains all genes associated with the coordinates
  • mapped_genes – GeneSet that contains the genes associated with the coordinates which are in gene_set
  • all_proxs – List that contains all the proximity information of genes associated with the coordinates
  • mapped_proxs – List that contains all the proximity information of genes associated with the coordinates which are in gene_set
flank(size)

Return two flanking intervals with given size from both ends of each region.

Keyword arguments:

  • size – the length of flanking intervals (default = SAME length as the region)

Return:

  • z – A GenomicRegionSet including all flanking intervals
self        -----           --            ---
Result -----     -----    --  --       ---   ---
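
As a rough standalone sketch of the diagram above, the following works on `(start, end)` tuples. The name `flank_intervals` is hypothetical, and clamping the left flank at position 0 is an assumption that the source does not mention.

```python
def flank_intervals(intervals, size=None):
    """Return the left and right flanks of each (start, end) interval.

    When size is None, each flank is as long as its region, mirroring
    the documented default. Left flanks are clamped at 0 (assumption).
    """
    flanks = []
    for start, end in intervals:
        length = size if size is not None else end - start
        left_start = max(0, start - length)
        if left_start < start:
            flanks.append((left_start, start))  # flank before the region
        flanks.append((end, end + length))      # flank after the region
    return flanks
```
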
gene_association(gene_set=None, organism='hg19', promoterLength=1000, threshDist=50000, show_dis=False)

Associates coordinates to genes given the following rules:

  1. If the peak is inside a gene (promoter + coding), then the peak is associated with that gene.
  2. If a peak is inside overlapping genes, then the peak is annotated with both genes.
  3. If a peak is between two genes (overlapping neither), then it is annotated with both genes.
  4. If the distance between a peak and a gene is greater than the threshold distance, then the peak is not annotated with that gene.

Keyword arguments:

  • gene_set – List of gene names as a GeneSet object. If None, then consider all genes to be enriched. (default None)
  • organism – Organism in order to fetch genomic data. (default hg19)
  • promoterLength – Length of the promoter region. (default 1000)
  • threshDist – Threshold maximum distance for a coordinate to be considered associated with a gene. (default 50000)
  • show_dis – Show distance to the closest genes in parentheses.

Return:

  • result_grs – GenomicRegionSet exactly as self, but with the following additional information:

    1. name: String of genes associated with that coordinate separated by ‘:’
    2. data: String of proximity information (if the coordinate matched to the corresponding gene in the previous list in a proximal position (PROX) or distal position (DIST)) separated by ‘:’

    The gene will contain a ‘.’ in the beginning of its name if it is not in the gene_set given.

get_chrom()

Return all chromosomes.

get_genome_data(organism, chrom_X=True, chrom_Y=False, chrom_M=False)

Add genome data from database into the GenomicRegionSet.

Keyword arguments:

  • organism – Define the organism
  • chrom_X – Include chromosome X
  • chrom_Y – Include chromosome Y
  • chrom_M – Include mitochondrial chromosome
include(region)

Check whether the given region intersects with the original GenomicRegionSet.

Keyword arguments:

  • region – A GenomicRegion to be checked.
intersect(y, mode=0, rm_duplicates=False)

Return the overlapping regions with three different modes.

Keyword arguments:

  • y – the GenomicRegionSet to compare with.
  • mode – OverlapType.OVERLAP, OverlapType.ORIGINAL or OverlapType.COMP_INCL.
  • rm_duplicates – remove duplicates within the output GenomicRegionSet

Return:

  • A GenomicRegionSet according to the given overlapping mode.

mode = OverlapType.OVERLAP

Return new GenomicRegionSet including only the overlapping regions with y.

Note

This mode merges the resulting regions.

self       ----------              ------
y                 ----------                    ----
Result            ---

mode = OverlapType.ORIGINAL

Return the regions of original GenomicRegionSet which have any intersections with y.

self       ----------              ------
y              ----------                    ----
Result     ----------

mode = OverlapType.COMP_INCL

Return the region(s) of the GenomicRegionSet which are completely included in y.

self        -------------             ------
y              ----------      ---------------              ----
Result                                ------
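
The three modes can be compared side by side with a small standalone sketch on `(start, end)` tuples from one chromosome. It does not call RGT: the function name `intersect_regions` and the string mode names are hypothetical stand-ins for the `OverlapType` constants, coordinates are assumed half-open, and the merging step noted for the OVERLAP mode is left out.

```python
def overlaps(a, b):
    """True if two half-open (start, end) intervals share any position."""
    return a[0] < b[1] and b[0] < a[1]


def intersect_regions(self_regions, y_regions, mode):
    """Illustrate the OVERLAP, ORIGINAL and COMP_INCL modes."""
    result = []
    for a in self_regions:
        hits = [b for b in y_regions if overlaps(a, b)]
        if mode == "OVERLAP":
            # Keep only the shared part of each overlapping pair.
            result.extend((max(a[0], b[0]), min(a[1], b[1])) for b in hits)
        elif mode == "ORIGINAL":
            # Keep the whole self region if it touches anything in y.
            if hits:
                result.append(a)
        elif mode == "COMP_INCL":
            # Keep the self region only if one y region contains it fully.
            if any(b[0] <= a[0] and a[1] <= b[1] for b in hits):
                result.append(a)
        else:
            raise ValueError("unknown mode: %s" % mode)
    return result
```
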
intersect_count(regionset, mode_count='count', threshold=False)

Return the number of regions of the two GenomicRegionSets A and B in the following order: (A-B, B-A, intersection)

Keyword arguments:

  • regionset – the GenomicRegionSet to compare with.
  • mode_count – count the number of regions or to measure the length of intersection.
  • threshold – Define the cutoff of the proportion of the intersecting region (0~50%)

Return:

  • A tuple of numbers: (A-B, B-A, intersection)
jaccard(query)

Return the Jaccard index, a measure of the similarity of the two GenomicRegionSets.

Keyword arguments:

  • query – the GenomicRegionSet to compare with.

Return:

  • similarity – (Total length of overlapping regions)/(Total length of the union of both region sets)
self              --8--      ---10---    -4-
query        ---10---             ---10---
intersect         -5-             -4-    2
similarity: (5+4+2)/[(8+10+4)+(10+10)-(5+4+2)] = 11/31
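
The worked example above can be reproduced with a short standalone sketch. The function names and the concrete coordinates are made up to match the lengths in the diagram (8, 10, 4 for self; 10, 10 for query); half-open `(start, end)` tuples on one chromosome are assumed, and each input set is assumed to have no internal overlaps.

```python
from fractions import Fraction


def total_length(intervals):
    """Sum of the lengths of half-open (start, end) intervals."""
    return sum(end - start for start, end in intervals)


def jaccard(self_regions, query_regions):
    """Intersection length divided by union length, as an exact fraction."""
    intersection = 0
    for a_start, a_end in self_regions:
        for b_start, b_end in query_regions:
            intersection += max(0, min(a_end, b_end) - max(a_start, b_start))
    union = total_length(self_regions) + total_length(query_regions) - intersection
    if union == 0:
        return Fraction(0)
    return Fraction(intersection, union)


# Coordinates chosen so that the overlaps are 5, 4 and 2 as in the diagram.
self_regions = [(5, 13), (20, 30), (34, 38)]
query_regions = [(0, 10), (26, 36)]
similarity = jaccard(self_regions, query_regions)
```

Using `Fraction` keeps the result exact, so it can be compared directly with the 11/31 from the example.
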
maximum_length()

Return the length of the longest region in the GenomicRegionSet.

merge(w_return=False, namedistinct=False)

Merge the regions within the GenomicRegionSet.

Keyword arguments:

  • w_return – If True, it returns a GenomicRegionSet; if False, it merges the regions in place.
  • namedistinct – Merge the regions which have the same names only.
projection_test(query, organism, extra=None, background=None)

Return the p-value of the binomial test.

Keyword arguments:

  • query – A GenomicRegionSet as query
  • organism – Define the organism
  • extra – Return the extra statistics
  • background – Use a GenomicRegionSet as the background

Return:

  • if extra=True, returns (possibility, ratio, p-value)
  • if extra=False, returns p-value
random_regions(organism, total_size=None, multiply_factor=1, overlap_result=True, overlap_input=True, chrom_X=False, chrom_M=False, filter_path=None)

Return a GenomicRegionSet which contains random regions on the given organism’s genome, generated from the given entries and the requested number of regions.

Keyword arguments:

  • organism – Define organism’s genome to use. (hg19, mm9)
  • total_size – The number of random regions to return.
  • multiply_factor – The number of entries multiplied by this factor gives the number of random regions to return. ** total_size has higher priority than multiply_factor. **
  • overlap_result – Whether the resulting regions may overlap each other. (True/False)
  • overlap_input – Whether the resulting regions may overlap the input entries. (True/False)
  • chrom_X – The result covers chromosome X or not. (True/False)
  • chrom_M – The result covers the mitochondrial chromosome or not. (True/False)
  • filter_path – Path to a filter BED file

Return:

  • z – A GenomicRegionSet which contains the random regions
random_split(size)

Return two exclusive GenomicRegionSets from self randomly.

Keyword arguments:

  • size – define the number of splitting regions.
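
One way to picture an exclusive random split is the standalone sketch below. It is not RGT's implementation: the function name `random_split`, the `seed` parameter, and the reading of `size` as the number of regions placed in the first set are assumptions.

```python
import random


def random_split(regions, size, seed=None):
    """Split a list of regions into two disjoint lists at random.

    The first list receives `size` regions, the second the remainder
    (assumed interpretation of the `size` argument).
    """
    rng = random.Random(seed)
    chosen = set(rng.sample(range(len(regions)), size))
    first = [region for i, region in enumerate(regions) if i in chosen]
    second = [region for i, region in enumerate(regions) if i not in chosen]
    return first, second
```

Sampling indices rather than regions keeps the two outputs exclusive even when the input contains duplicate regions.
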
random_subregions(size)

Return a subsampling of the genomic region set with a specific number of regions.

Keyword arguments:

  • size – define the number of regions in the subsample.
read_bed(filename)

Read BED file and add every row as a GenomicRegion.

Keyword arguments:

  • filename – define the path to the BED file.

Note

Chrom (1), start (2), end (3), name (4) and orientation (6) are used for the GenomicRegion. All other columns (5, 7, 8, ...) are put into the data attribute of the GenomicRegion. The numbers in parentheses are the column numbers of the BED format.
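
The column mapping in the note can be sketched for a single tab-separated BED line. This standalone helper, `parse_bed_line`, is hypothetical and only illustrates which columns go where; returning the extra columns as a list of strings is an assumption about how the data attribute is filled.

```python
def parse_bed_line(line):
    """Split one BED line into the fields named in the note above."""
    fields = line.rstrip("\n").split("\t")
    chrom = fields[0]                                   # column 1
    start = int(fields[1])                              # column 2
    end = int(fields[2])                                # column 3
    name = fields[3] if len(fields) > 3 else None       # column 4
    orientation = fields[5] if len(fields) > 5 else None  # column 6
    # Column 5 and columns 7 onwards are kept as extra data.
    data = fields[4:5] + fields[6:]
    return chrom, start, end, name, orientation, data
```
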

read_bedgraph(filename)

Read BEDGRAPH file and add every row as a GenomicRegion.

Keyword arguments:

  • filename – define the path to the BEDGRAPH file.
relocate_regions(center='midpoint', left_length=2000, right_length=2000)

Return a new GenomicRegionSet in which the regions are relocated according to the given center and extension lengths.

Keyword arguments:

  • center – Define the reference point of each region

    1. midpoint – locate the new region’s center as original region’s midpoint
    2. leftend – locate the new region’s center as original region’s 5’ end (if no orientation information, default is left end)
    3. rightend – locate the new region’s center as original region’s 3’ end (if no orientation information, default is right end)
    4. bothends – locate the new region’s center as original region’s both ends
    5. downstream – rightend in positive strand and leftend in negative strand
    6. upstream – leftend in positive strand and rightend in negative strand
  • left_length – Define the length to extend on the left side

  • right_length – Define the length to extend on the right side
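
A minimal sketch of three of the center modes, for regions without orientation information, might look as follows. The function name `relocate` is hypothetical, it works on bare `start`/`end` integers rather than GenomicRegion objects, and clamping the new start at 0 is an assumption.

```python
def relocate(start, end, center="midpoint", left_length=2000, right_length=2000):
    """Build a new interval around a reference point of the old one.

    Only the midpoint, leftend and rightend modes are covered, and
    strand is ignored, so leftend/rightend mean the plain coordinates.
    """
    if center == "midpoint":
        anchor = (start + end) // 2
    elif center == "leftend":
        anchor = start
    elif center == "rightend":
        anchor = end
    else:
        raise ValueError("mode not covered by this sketch: %s" % center)
    # Clamp at 0 so the new region never starts before the chromosome.
    return max(0, anchor - left_length), anchor + right_length
```
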

remove_duplicates()

Remove the duplicate regions and keep the unique regions. (No return)

replace_region_name(regions, combine=False)

Replace the region names by the given GenomicRegionSet.

Keyword arguments:

  • regions – A GenomicRegionSet as the source for the names.
  • combine – Combine the names from the old and new regions.
sort(key=None, reverse=False)

Sort Elements by criteria defined by a GenomicRegion.

Keyword arguments:

  • key – given the key for comparison.
  • reverse – reverse the sorting result.
sort_score()

Sort the regions by their scores.

subtract(y, whole_region=False)

Return a GenomicRegionSet excluding the regions that overlap with y.

Keyword arguments:

  • y – the GenomicRegionSet to subtract
  • whole_region – subtract the whole region, not partially

Return:

  • A GenomicRegionSet which contains the remaining regions of self after subtraction
self     ----------              ------
y               ----------                    ----
Result   -------                 ------
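
The partial subtraction shown in the diagram can be sketched on half-open `(start, end)` tuples from one chromosome. The function name `subtract_intervals` is hypothetical, and the `whole_region` option is not modelled.

```python
def subtract_intervals(self_regions, y_regions):
    """Remove from each self interval every part covered by y."""
    result = []
    for start, end in self_regions:
        pieces = [(start, end)]
        for y_start, y_end in y_regions:
            remaining = []
            for p_start, p_end in pieces:
                if y_end <= p_start or p_end <= y_start:
                    # No overlap: the piece survives unchanged.
                    remaining.append((p_start, p_end))
                    continue
                if p_start < y_start:
                    remaining.append((p_start, y_start))  # part left of y
                if y_end < p_end:
                    remaining.append((y_end, p_end))      # part right of y
            pieces = remaining
        result.extend(pieces)
    return result
```

A region of y that falls in the middle of a self region splits it into two pieces, which is why each self region is tracked as a list of surviving pieces.
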
subtract_aregion(y)

Return a GenomicRegionSet excluding the regions that overlap with y.

Keyword arguments:

  • y – the GenomicRegion to subtract

Return:

  • the remaining regions of self after subtraction
self     ----------              ------
y               ----------
Result   -------                 ------
total_coverage()

Return the sum of all lengths of regions.

window(y, adding_length=1000)

Return the overlapping regions of self and y after adding a specified number (1000, by default) of base pairs upstream and downstream of each region in self. In effect, this allows regions in y that are near regions in self to be detected.

Keyword arguments:

  • y – the GenomicRegionSet to compare with
  • adding_length – the length of base pairs added to upstream and downstream of self (default 1000)

Return:

  • A GenomicRegionSet including the regions where the extended self and the original y overlap.
within_overlap()

Check whether any regions within the GenomicRegionSet overlap each other.

write_bed(filename)

Write GenomicRegions to BED file.

Keyword arguments:

  • filename – define the path to the BED file.
write_bed_blocks(filename)

Write BED file with information of blocks e.g. exons.

Keyword arguments:

  • filename – Define the filename of the new BED file.

CoverageSet

CoverageSet represents the coverage data of a GenomicRegionSet.

class CoverageSet.CoverageSet(name, GenomicRegionSet)

Keyword arguments:

  • name – name.
  • genomicRegions – instance of GenomicRegionSet
add(cs)

Add CoverageSet <cs>.

Keyword arguments:

  • cs – instance of CoverageSet to be added
count_unique_reads(bamFile)

Count the number of unique reads for the class variable <genomicRegions>.

Keyword arguments:

  • bamFile – path to bam file

Output:

number of unique reads

coverage_from_bam(bam_file, read_size=200, binsize=100, stepsize=50, rmdup=True, mask_file=None, get_strand_info=False)

Compute coverage based on GenomicRegionSet.

Iterate over each GenomicRegion in the class variable genomicRegions (GenomicRegionSet). The GenomicRegion is divided into consecutive bins with length <binsize>. A sliding-window approach with a stepsize of <stepsize> generates the coverage signal.

Keyword arguments:

  • bam_file – path to bam file
  • read_size – used read size
  • binsize – size of bins
  • stepsize – stepsize for the window-based approach to generate the signal
  • rmdup – remove duplicated reads (reads with same starting coordinate)
  • mask_file – ignore region described in <mask_file> (tab-separated: chrom, start, end)
  • get_strand_info – compute strand information for each bin

Output:

  • Class variable <coverage>: a list of lists: the elements correspond to a GenomicRegion. This list gives the coverage of each bin.
  • Class variable <overall_cov>: a list: concatenation of class variable <coverage>.
  • If option <get_strand_info> is set, a numpy array class variable <cov_strand_all> of tuples. The tuples give the number of forward and backward reads for each bin.
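
To make the interplay of <binsize> and <stepsize> concrete, here is a standalone sketch of a sliding-window count over one region. It is only an illustration under stated assumptions: the function name `window_counts` is hypothetical, reads are reduced to single positions instead of being extended to <read_size>, and the exact windowing rule used by the library is not taken from the source.

```python
def window_counts(positions, region_start, region_end, binsize=100, stepsize=50):
    """Count positions in windows of width binsize, moved by stepsize.

    One value is produced per step, so a region of length L yields
    roughly L / stepsize values.
    """
    counts = []
    for window_start in range(region_start, region_end, stepsize):
        window_end = window_start + binsize
        counts.append(sum(1 for p in positions if window_start <= p < window_end))
    return counts
```

Because windows are wider than the step, neighbouring windows overlap and a single read can contribute to more than one value.
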

Example:

First, we compute a GenomicRegionSet that covers the entire mouse genome mm9. We use the annotation of RGT to compute the variable <regionset>:

>>> from rgt.Util import GenomeData
>>> from helper import get_chrom_sizes_as_genomicregionset

>>> g = GenomeData('mm9')
>>> regionset = get_chrom_sizes_as_genomicregionset(g.get_chromosome_sizes())

Next, we load the CoverageSet class from RGT and initialize it with the variable <regionset>. Finally, we compute the coverage based on <bamfile>:

>>> from rgt.CoverageSet import CoverageSet
>>> cov = CoverageSet('IP coverage', regionset)
>>> cov.coverage_from_bam(bam_file=bamfile, read_size=200)

We can now access <cov>:

>>> from __future__ import print_function
>>> from numpy import sum
>>> print(cov.overall_cov[cov.overall_cov>0][:10])
[1 1 1 1 1 2 2 2 2 1]

>>> print(len(cov.overall_cov))
54515813

Note

The length of <overall_cov> equals 54515813 because we take the entire genome into account but use the default stepsize of 50 for segmentation.

coverage_from_bigwig(bigwig_file, stepsize=100)

Return a list of arrays describing the coverage of each GenomicRegion in genomicRegions, taken from <bigwig_file>.

Keyword arguments:

  • bigwig_file – path to bigwig file
  • stepsize – used stepsize

Output:

Class variable <coverage>: a list where the elements correspond to the GenomicRegion. The list elements give the number of reads falling into the GenomicRegion.

Warning

Function not tested. Please do not use it!

coverage_from_genomicset(bamFile, readSize=200)

Compute coverage based on the class variable <genomicRegions>.

Iterate over each GenomicRegion in class variable genomicRegions (GenomicRegionSet) and set coverage to the number of reads falling into the GenomicRegion.

Keyword arguments:

  • bamFile – path to bam file
  • readSize – used read size

Output:

Class variable <coverage>: a list where the elements correspond to the GenomicRegion. The list elements give the number of reads falling into the GenomicRegion.

Warning

Function not tested. Please do not use it!

index2coordinates(index, regions)

Convert index of class variable <overall_cov> to genomic coordinates.

Keyword arguments:

  • index – index of <overall_cov> that is to be converted
  • regions – instance of GenomicRegionSet the conversion is based on

Note

In most of the cases, the parameter <regions> equals the GenomicRegionSet used for the initialization of the CoverageSet.

Output:

Triple which gives the chromosome, the start- and the end-coordinate of the bin associated to <index>.

Example:

Here, we give out the genomic regions of bins that exhibit a value higher than 10:

>>> from rgt.CoverageSet import CoverageSet
>>> cov = CoverageSet('IP coverage', regionset)
>>> cov.coverage_from_bam(bam_file=bamfile, read_size=200)
>>> for i, el in enumerate(cov.overall_cov):
...     if el > 10:
...         chrom, s, e = cov.index2coordinates(i, regionset)
...         print(chrom, s, e)
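
The conversion itself can be pictured with a standalone sketch that walks through the regions until the bin holding the index is found. This is not the library's code: the function name `index_to_coordinates`, the explicit `stepsize` argument, the `(chrom, start, end)` tuple layout and the assumption that each region contributes ceil(length / stepsize) bins are all illustrative.

```python
def index_to_coordinates(index, regions, stepsize=50):
    """Map a flat bin index to (chrom, bin_start, bin_end).

    regions is a list of (chrom, start, end) tuples in the same order
    that was used to build the concatenated signal.
    """
    remaining = index
    for chrom, start, end in regions:
        n_bins = -(-(end - start) // stepsize)  # ceiling division
        if remaining < n_bins:
            bin_start = start + remaining * stepsize
            # The last bin of a region may be shorter than stepsize.
            return chrom, bin_start, min(bin_start + stepsize, end)
        remaining -= n_bins
    raise IndexError("index lies beyond the last bin")
```
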
normRPM()

Normalize to reads per million (RPM).
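
Reads-per-million scaling in general multiplies each value by one million divided by the total number of reads. The sketch below shows that arithmetic on a plain list; the function name `reads_per_million` and the explicit `total_reads` argument are assumptions, since the source does not say which read total the method uses.

```python
def reads_per_million(coverage, total_reads):
    """Scale raw counts so they are expressed per million reads."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    factor = 1000000.0 / total_reads
    return [value * factor for value in coverage]
```
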

phastCons46way_score(stepsize=100)

Load the phastCons46way bigwig files to fetch the scores as coverage.

Keyword arguments:

  • stepsize – used stepsize
scale(factor)

Scale coverage with <factor>.

Keyword arguments:

  • factor – float
subtract(cs)

Subtract CoverageSet <cs>.

Keyword arguments:

  • cs – instance of CoverageSet

Note

negative values are set to 0.
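
The clipping behaviour described in the note can be illustrated on two plain lists of equal length. The function name `subtract_coverage` is hypothetical, and the real class operates on its own coverage arrays rather than on Python lists.

```python
def subtract_coverage(signal, control):
    """Subtract control from signal bin by bin, never going below 0."""
    return [max(0, s - c) for s, c in zip(signal, control)]
```
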

write_bed(filename, zero=False)

Output coverage in BED format.

Keyword arguments:

  • filename – filepath
  • zero – boolean

Note

If zero=True, coverage of zero is output as well. This may cause large output files!

write_bigwig(filename, chrom_file, end=True, save_wig=False)

Output coverage in bigwig format.

The path to the chromosome size file <chrom_file> is required. This file is tab-separated and assigns a chromosome to its size.

Keyword arguments:

  • filename – filepath
  • chrom_file – chromosome size file
  • end – boolean
  • save_wig – boolean, if set, wig file is also saved.

Warning

Parameter <end> is deprecated! Please do not use it.

Note

The <save_wig> option may cause large output files

write_wig(filename, end)

Output coverage in wig format.

Keyword arguments:

  • filename – filepath
  • end – boolean

Warning

Parameter end is deprecated! Please do not use it.

GenomicVariant

GenomicVariant is a specialized GenomicRegion class and describes a SNP or InDel.

class GenomicVariant.GenomicVariant(chrom, pos, ref, alt, qual, filter=None, id=None, info=None, format=None, genotype=None, samples=None)

Keyword arguments:

  • chrom – chromosome
  • pos – position
  • ref – reference nucleotide
  • alt – alternative nucleotide
  • qual – quality
  • filter – filter
  • id – id
  • info – further information
  • format – SNP format
  • genotype – genotype
  • samples – sample

Note

All necessary information is contained in a VCF file.

GenomicVariantSet

GenomicVariantSet represents a list of GenomicVariants.

class GenomicVariantSet.GenomicVariantSet(vcf_path=None, name='GenomicVariantSet')

Keyword arguments:

  • vcf_path – VCF file
  • name – name
filter(at, op, t)

Filter for attributes.

Keyword arguments:

  • at – attribute (tag) in the VCF file to filter by
  • op – operation to perform
  • t – threshold
Example:

We load a VCF file:

>>> from rgt.GenomicVariantSet import GenomicVariantSet
>>> snps_sample1 = GenomicVariantSet('snps.vcf', name='sample1')

And we filter by the mapping quality:

>>> snps_sample1.filter(at='MQ', op='>', t=30)

The mapping quality is tagged as MQ in the VCF file. We only want to keep SNPs that have a mapping quality higher than 30.

Note

The operation <op> and the threshold <t> depend on the filtering tag <at>.
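
The combination of tag, operation and threshold can be pictured with a standalone sketch that maps the operator string to a function from Python's `operator` module. The variant representation (a dict with an `info` dict) and the function name `filter_variants` are hypothetical and chosen only for illustration; they are not the library's data model.

```python
import operator

# Comparison strings mapped to the matching functions (assumed set).
OPERATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
    "==": operator.eq,
    "!=": operator.ne,
}


def filter_variants(variants, at, op, t):
    """Keep the variants whose tag `at` satisfies `value op t`."""
    compare = OPERATORS[op]
    return [
        variant for variant in variants
        if at in variant["info"] and compare(variant["info"][at], t)
    ]
```

In this sketch, variants that lack the tag are dropped; whether the library does the same is not stated in the source.
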

filter_dbSNP()

Filter for dbSNP.

Note

The VCF file must already contain the dbSNP annotation.

intersect(x)

Intersect GenomicVariantSet.

Keyword arguments:

  • x – instance of GenomicVariantSet
read_vcf(vcf_path)

Read SNPs and InDels from a VCF file.

Keyword arguments:

  • vcf_path – VCF file

Note

vcf_path can also be defined in the initialization.

Example:

We load a VCF file:

>>> from rgt.GenomicVariantSet import GenomicVariantSet
>>> snps_sample1 = GenomicVariantSet('snps.vcf', name='sample1')
sort()

Sort elements by criteria defined by GenomicVariant.

Note

By default, the genomic position is used as sorting criteria.

subtract(x)

Subtract GenomicVariantSet.

Keyword arguments:

  • x – instance of GenomicVariantSet which is subtracted
write_vcf(vcf_path)

Write VCF file.

Keyword arguments:

  • vcf_path – VCF file

AnnotationSet

AnnotationSet represents genomic annotation from genes.

class AnnotationSet.AnnotationSet(gene_source, tf_source=None, alias_source=None, filter_havana=True, protein_coding=False, known_only=True)

This class represents genomic annotation from genes.

Keyword arguments:

  • gene_source – Gene source annotation. It will be used to create the gene_list element. It can be:
    • A matrix (list of lists): An AnnotationSet will be created based on such matrix.
    • A string representing a gtf file: An AnnotationSet will be created based on such gtf file.
    • A string representing an organism: An AnnotationSet will be created based on the gtf file for that organism in data.config file.
  • tf_source – TF source annotation. After initialization, this object is mapped with gene_list. It can be:
    • A matrix (list of lists): Represents a final tf_list element.
    • A list of mtf files: The tf_list will be created based on all mtf files.
    • A list of repositories: The tf_list will be created based on the mtf files associated with such repositories in data.config.
  • alias_source – Alias dictionary source annotation. It can be:
    • A dictionary: An alias dictionary will be created based on such dictionary.
    • A string representing an alias (txt) file: An alias dictionary will be created based on such txt file.
    • A string representing an organism: An alias dictionary will be created based on the txt file for that organism in data.config file.
class DataType

Data type constants.

Constants:

  • GENE_LIST.
  • TF_LIST.
class AnnotationSet.GeneField

Gtf fields constants.

Constants:

  • GENOMIC_REGION.
  • ANNOTATION_SOURCE.
  • FEATURE_TYPE.
  • GENOMIC_PHASE.
  • GENE_ID.
  • TRANSCRIPT_ID.
  • GENE_TYPE.
  • GENE_STATUS.
  • GENE_NAMES.
  • TRANSCRIPT_TYPE.
  • TRANSCRIPT_STATUS.
  • TRANSCRIPT_NAME.
  • LEVEL.
  • EXACT_GENE_MATCHES.
  • INEXACT_GENE_MATCHES.
class AnnotationSet.ReturnType

Return type constants.

Constants:

  • ANNOTATION_SET.
  • LIST.
class AnnotationSet.TfField

Mtf fields constants.

Constants:

  • MATRIX_ID.
  • SOURCE.
  • VERSION.
  • GENE_NAMES.
  • GROUP.
  • EXACT_GENE_MATCHES.
  • INEXACT_GENE_MATCHES.
AnnotationSet.exact_mapping()

Maps (O(n log n)) exact entries of self.gene_list’s gene names with self.tf_list’s gene names.

AnnotationSet.fix_gene_names(gene_set, output_dict=False, mute_warn=False)

Checks if all gene names in gene_set are ensembl IDs. If a gene is not in ensembl format, it will be converted using alias_dict. If the gene name cannot be found, then it is reported in a separate gene_set.

Keyword arguments:

  • gene_set – A GeneSet object.
  • output_dict – Also output the mapping dictionary (default = False).
  • mute_warn – Do not print warnings regarding genes that mapped to multiple entries (default = False).

Return:

  • mapped_gene_list – A list of ensembl IDs
  • unmapped_gene_list – A list of unmapped gene symbols/IDs
AnnotationSet.get(query=None, list_type=0, return_type=0)

Gets subsets of either of the lists held by self (gene list or TF list) and returns them as different types.

Keyword arguments:

  • query – A parameter that allows for subsets of self to be fetched. It can be:
    • None: All fields/values are going to be returned.

    • A dictionary: Subsets the desired list according to this structure. Each key must be a field (please refer to AnnotationSet.GeneField or AnnotationSet.TfField) that must point to a single value or a list of values.

  • list_type – Indicates which list should be subsetted/returned. Please refer to AnnotationSet.DataType.

  • return_type – Indicates what should be returned. Please refer to AnnotationSet.ReturnType.

Return:

  • result_list – A <return_type> containing the requested <list_type> subsetted according to <query>.
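
The query dictionary described above (each key a field, each value a single value or a list of values) can be illustrated with a standalone sketch. The records here are plain dicts with made-up string keys, and `subset_records` is a hypothetical helper; the real class uses the GeneField and TfField constants and its own list structures.

```python
def subset_records(records, query=None):
    """Return the records that match every field in the query.

    A query value may be a single value or a list of accepted values.
    With query=None, all records are returned.
    """
    if query is None:
        return list(records)
    selected = []
    for record in records:
        for field, wanted in query.items():
            allowed = wanted if isinstance(wanted, (list, tuple, set)) else [wanted]
            if record.get(field) not in allowed:
                break  # one field failed: skip this record
        else:
            selected.append(record)  # every field matched
    return selected
```
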
AnnotationSet.get_exons(start_site=False, end_site=False, gene_set=None)

Gets exons of genes. It returns a GenomicRegionSet with such exons. The id of each gene will be put in the NAME field of each GenomicRegion.

Keyword arguments:

  • start_site – Whether to relocate the start sites.
  • end_site – Whether to relocate the end sites.
  • gene_set – A set of genes to narrow the search.

Return:

  • result_grs – A GenomicRegionSet containing the exons.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
AnnotationSet.get_genes(gene_set=None)

Gets regions of genes. It returns a GenomicRegionSet with such genes. The id of each gene will be put in the NAME field of each GenomicRegion.

Keyword arguments:

  • gene_set – A set of genes to narrow the search.

Return:

  • result_grs – A GenomicRegionSet containing the genes.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
AnnotationSet.get_introns(start_site=False, end_site=False, gene_set=None)

Gets introns of genes. It returns a GenomicRegionSet with such introns. The id of each gene will be put in the NAME field of each GenomicRegion.

Keyword arguments:

  • start_site – Whether to relocate the start sites.
  • end_site – Whether to relocate the end sites.
  • gene_set – A set of genes to narrow the search.

Return:

  • result_grs – A GenomicRegionSet containing the introns.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
AnnotationSet.get_official_symbol(gene_name_source)

Returns the official symbol(s) from gene_name_source.

Keyword arguments:

  • gene_name_source – It can be a string (single gene name) or a GeneSet (multiple genes).

Return:

  • if gene_name_source is a string, returns the converted gene name, or None if the gene name could not be converted.
  • if gene_name_source is a list, returns two lists containing, respectively, the converted and the not-converted gene names.
AnnotationSet.get_promoters(promoterLength=1000, gene_set=None, unmaplist=False)

Gets promoters of genes given a specific promoter length. It returns a GenomicRegionSet with such promoters. The ID of each gene will be put in the NAME field of each GenomicRegion. Each promoter also includes the coordinate of the 5’ base pair; therefore the actual length of each promoter is promoterLength+1.

Keyword arguments:

  • promoterLength – The length of the promoter region.
  • gene_set – A set of genes to narrow the search.
  • unmaplist – If True, then also return the list of unmappable genes (default = False).

Return:

  • result_grs – A GenomicRegionSet containing the promoters.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
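
The promoterLength+1 arithmetic can be checked with a small sketch. It assumes 0-based, half-open coordinates and that the promoter extends upstream of the 5’ base pair according to the strand; both are assumptions made for illustration, and promoter_interval is a hypothetical helper, not part of RGT.

```python
def promoter_interval(gene_start, gene_end, strand, promoter_length=1000):
    """Return (start, end) of a promoter in 0-based, half-open coordinates.

    Assumption: the gene occupies [gene_start, gene_end) and the promoter
    covers promoter_length upstream bases plus the 5' base pair itself.
    """
    if strand == "+":
        five_prime = gene_start  # first base of the gene
        return max(0, five_prime - promoter_length), five_prime + 1
    five_prime = gene_end - 1  # on "-", the last base is the 5' end
    return five_prime, five_prime + promoter_length + 1


plus = promoter_interval(5000, 9000, "+")
minus = promoter_interval(5000, 9000, "-")
```

In both orientations the interval spans promoter_length + 1 bases, unless a forward-strand promoter is truncated at the start of the chromosome.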
AnnotationSet.get_tss(gene_set=None)

Gets the TSS (transcription start site) of genes. It returns a GenomicRegionSet with such TSS. The ID of each gene will be put in the NAME field of each GenomicRegion.

Keyword arguments:

  • gene_set – A set of genes to narrow the search.

Return:

  • result_grs – A GenomicRegionSet containing TSS.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
AnnotationSet.get_tts(gene_set=None)

Gets the TTS (transcription termination site) of genes. It returns a GenomicRegionSet with such TTS. The ID of each gene will be put in the NAME field of each GenomicRegion.

Keyword arguments:

  • gene_set – A set of genes to narrow the search.

Return:

  • result_grs – A GenomicRegionSet containing TTS.
  • unmapped_gene_list – A list of genes that could not be mapped to an ENSEMBL ID.
AnnotationSet.inexact_mapping()

Coming soon!

AnnotationSet.load_alias_dict(file_name)

Reads an alias.txt file and creates a dictionary to translate gene symbols/alternative IDs to Ensembl gene IDs.

Keyword arguments:

  • file_name – Alias file name.
AnnotationSet.load_gene_list(file_name, filter_havana=True, protein_coding=False, known_only=False)

Reads gene annotation in gtf (gencode) format. It populates self.gene_list with such entries.

Keyword arguments:

  • file_name – The gencode .gtf file name.
AnnotationSet.load_tf_list(file_name_list)

Reads TF annotation in mtf (internal – check manual) format. It populates self.tf_list with such entries. Every time a TF annotation is loaded, a mapping with the gene list is performed.

Keyword arguments:

  • file_name_list – A list with .mtf files.
AnnotationSet.map_lists()

Maps self.gene_list with self.tf_list in various ways.

ExperimentalMatrix

ExperimentalMatrix describes an experiment.

class ExperimentalMatrix.ExperimentalMatrix

Describes an experimental matrix.

Variables:

  • names – The unique name of experiment (filename).
  • types – The type of data.
  • files – The path of the related file with its filename as keys.
  • fields – List of the types of information, including names, types, files and others.
  • fieldsDict – Its keys are the entries of self.fields, and its values are the extra information.
  • objectsDict – Keys are the names; values are GenomicRegionSet or GeneSet objects.
get_genesets()

Returns the GeneSets.

get_readsfiles()

Returns the ‘read’ type files.

get_readsnames()

Returns the ‘read’ type names.

get_regionsets()

Returns the RegionSets.

get_regionsnames()

Returns the region names.

get_type(name, field)

Return the type according to the given name and field.

Keyword arguments:

  • name – Name to return.
  • field – Field to return.
get_types(name)

Fetch all extra information as a list according to the given name.

Keyword arguments:

  • name – Name to return.
load_objects(is_bedgraph, verbose=False)

Load files and initialize object.

Keyword arguments:

  • is_bedgraph – Whether regions are in bedgraph format (default = False).
  • verbose – Verbose output (default = False).
match_ms_tags(field)

Add more entries to match the missing tags of the given field. For example, suppose the reads carry cell tags such as ‘cell_A’ and ‘cell_B’, but the regions carry no such tags. The regions are then repeated once for each tag found in the reads, so that they match all reads.

Keyword arguments:

  • field – Field to add extra entries.
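
The repetition described above can be sketched on plain dictionaries. This is not RGT's implementation: match_missing_tags is a hypothetical helper, entries are modelled as dicts, and the name + "_" + tag naming of the repeated entries is an assumption made only to keep names unique.

```python
def match_missing_tags(entries, field):
    """Repeat entries lacking `field` once per tag seen in other entries."""
    tags = []
    for entry in entries:
        tag = entry.get(field)
        if tag is not None and tag not in tags:
            tags.append(tag)

    result = []
    for entry in entries:
        if entry.get(field) is not None or not tags:
            result.append(dict(entry))  # already tagged, keep as-is
            continue
        for tag in tags:
            clone = dict(entry)
            clone[field] = tag
            # Hypothetical naming scheme for the repeated entries.
            clone["name"] = entry["name"] + "_" + tag
            result.append(clone)
    return result


entries = [
    {"name": "reads_A", "type": "reads", "cell": "cell_A"},
    {"name": "reads_B", "type": "reads", "cell": "cell_B"},
    {"name": "peaks", "type": "regions"},
]
expanded = match_missing_tags(entries, "cell")
```

Here the single untagged region entry becomes two entries, one per cell tag found in the reads.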
read(file_path, is_bedgraph=False, verbose=False)

Read Experimental matrix file.

Keyword arguments:

  • file_path – Experimental matrix file path + name.
  • is_bedgraph – Whether regions are in bedgraph format (default = False).
  • verbose – Verbose output (default = False).

Example of experimental matrix file:

name     type     file       further1
MPP_PU1  regions  file1.bed  additional_info1
CDP_PU1  regions  file2.bed  additional_info2
[ ... ]
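
A file of this shape can be parsed with a few lines of plain Python. The sketch below is not RGT's reader: it assumes whitespace-separated columns with name, type and file in the first three positions, and parse_matrix and the returned dictionaries are hypothetical names chosen to loosely mirror the variables listed for the class.

```python
import io

EXAMPLE = """name type file further1
MPP_PU1 regions file1.bed additional_info1
CDP_PU1 regions file2.bed additional_info2
"""


def parse_matrix(handle):
    """Parse a whitespace-separated experimental matrix into plain dicts."""
    fields = handle.readline().split()
    names, types, files = [], {}, {}
    # One dictionary per extra column: value -> names carrying that value.
    fields_dict = {field: {} for field in fields[3:]}
    for line in handle:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        name, kind, path = parts[0], parts[1], parts[2]
        names.append(name)
        types[name] = kind
        files[name] = path
        for field, value in zip(fields[3:], parts[3:]):
            fields_dict[field].setdefault(value, []).append(name)
    return fields, names, types, files, fields_dict


fields, names, types, files, fields_dict = parse_matrix(io.StringIO(EXAMPLE))
```

The header line determines how many extra columns exist, so additional annotation columns can be added without changing the parser.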
remove_name(name)

Removes experiments by name.

Keyword arguments:

  • name – Name to remove.

GeneSet

GeneSet describes genes and their expression.

class GeneSet.GeneSet(name)

Keyword arguments:

  • name – Name of the GeneSet
get_all_genes(organism)

Get all gene names for a given organism.

Keyword arguments:

  • organism – Define the organism.
read(geneListFile)

Read genes from the file.

Keyword arguments:

  • geneListFile – Path to the file which contains a list of genes.
read_expression(geneListFile, header=False, valuestr=False)

Read gene expression data.

Keyword arguments:

  • geneListFile – Path to the file which contains genes and expression value.
  • header – Read first line as header.
  • valuestr – Keep the value as a string, otherwise convert to float number.
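
The effect of header and valuestr can be sketched as follows. The file layout is an assumption (one gene per line, followed by its value, separated by whitespace), header is interpreted here as skipping the first line, and this standalone read_expression function is an illustrative stand-in rather than the GeneSet method.

```python
import io


def read_expression(handle, header=False, valuestr=False):
    """Return {gene: value} from lines of the form '<gene> <value>'."""
    values = {}
    for index, line in enumerate(handle):
        if header and index == 0:
            continue  # first line holds column titles, not data
        parts = line.split()
        if len(parts) < 2:
            continue  # ignore blank or incomplete lines
        gene, value = parts[0], parts[1]
        values[gene] = value if valuestr else float(value)
    return values


DATA = "gene value\nSPI1 2.5\nTP53 0\n"
as_float = read_expression(io.StringIO(DATA), header=True)
as_text = read_expression(io.StringIO(DATA), header=True, valuestr=True)
```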
subtract(gene_set)

Subtract another GeneSet.

Keyword arguments:

  • gene_set – Another GeneSet for subtracting with.

Util

The Util classes contain many utilities needed by other classes, such as the paths to input files.

class Util.AuxiliaryFunctions

Class of auxiliary static functions.

static correct_standard_bed_score(score)

Standardize scores between 0 and 1000.

Keyword arguments:

  • score – Score.
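
The BED format expects scores in the range 0 to 1000. One plausible reading of "standardize" is clamping out-of-range values to that range; the sketch below makes that assumption, and clamp_bed_score is a hypothetical name.

```python
def clamp_bed_score(score):
    """Clamp a score into the BED range [0, 1000] (assumed behaviour)."""
    return min(max(score, 0), 1000)


low = clamp_bed_score(-5)
mid = clamp_bed_score(250)
high = clamp_bed_score(1500)
```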
static overlap(t1, t2)

Checks if one interval contains any overlap with another interval.

Keyword arguments:

  • t1 – First tuple.
  • t2 – Second tuple.

Return:

  • -1 – if t1 is before t2.
  • 1 – if t1 is after t2.
  • 0 – if there is any overlap.
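
The three-way result can be sketched for intervals given as (start, end) tuples. The sketch assumes half-open intervals, so two intervals that merely touch do not overlap; that convention is an assumption, not something stated above.

```python
def overlap(t1, t2):
    """Compare two half-open (start, end) intervals.

    Returns -1 if t1 lies entirely before t2, 1 if t1 lies entirely
    after t2, and 0 if the intervals share at least one position.
    """
    if t1[1] <= t2[0]:
        return -1
    if t2[1] <= t1[0]:
        return 1
    return 0


before = overlap((0, 10), (10, 20))
after = overlap((15, 30), (0, 10))
shared = overlap((0, 10), (5, 20))
```

A result of this form is convenient for merge-style scans over two sorted interval lists, where -1 and 1 indicate which list to advance.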
static revcomp(s)

Returns the reverse complement of a string.

Keyword arguments:

  • s – String.
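
A reverse complement can be computed with a translation table, as in this minimal sketch. It handles the DNA letters A, C, G, T and N in both cases and leaves any other character unchanged; how RGT treats other symbols is not specified here.

```python
# Translation table mapping each base to its complement.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def revcomp(s):
    """Return the reverse complement of a DNA string."""
    return s.translate(_COMPLEMENT)[::-1]


result = revcomp("AACG")
```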
static string_is_float(s)

Verifies if a string is a numeric float.

Keyword arguments:

  • s – String to verify.
static string_is_int(s)

Verifies if a string is a numeric integer.

Keyword arguments:

  • s – String to verify.
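
Both checks can be sketched by attempting the conversion and catching the failure. This is a common Python idiom and an assumption about the approach, not a statement of how RGT implements these functions.

```python
def string_is_int(s):
    """Return True if s can be parsed as an integer."""
    try:
        int(s)
        return True
    except ValueError:
        return False


def string_is_float(s):
    """Return True if s can be parsed as a float."""
    try:
        float(s)
        return True
    except ValueError:
        return False


checks = [string_is_int("42"), string_is_int("4.2"),
          string_is_float("4.2"), string_is_float("abc")]
```

Note that with this approach every integer string also counts as a float, so the integer check should be tried first when the distinction matters.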
class Util.ConfigurationFile

Represent the data path configuration file (data.config). It serves as a superclass to classes that will contain default variables (such as paths, parameters to tools, etc.) for a certain purpose (genomic data, motif data, etc.).

Variables:

  • self.config – Represents the configuration file.
  • self.data_dir – Represents the root path to data files.
class Util.ErrorHandler

Handles errors in a standardized way.

Error Dictionary Standard:

Each entry consists of a key+list in the form X:[Y,Z,W] where:

  • X – The key representing the internal error name.
  • Y – Error number.
  • Z – Exit status.
  • W – Error message to be printed.

Warning Dictionary Standard:

Each entry consists of a key+list in the form X:[Y,Z] where:

  • X – The key representing the internal warning name.
  • Y – Warning number.
  • Z – Warning message to be printed.
throw_error(error_type, add_msg='')

Throws the specified error type. If the error type does not exist, throws a default error message and exits.

Keyword arguments:

  • error_type – Error type.
  • add_msg – Message to add to the error.
throw_warning(warning_type, add_msg='')

Throws the specified warning type. If the warning type does not exist, throws a default warning message and exits.

Keyword arguments:

  • warning_type – Warning type.
  • add_msg – Message to add to the warning.
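
The two dictionary layouts and the fallback to a default entry can be sketched as follows. All keys, numbers and messages below are invented for illustration, and format_error, format_warning and this throw_error function are hypothetical stand-ins for the class methods.

```python
import sys

# Error entries follow X: [Y, Z, W] = name: [number, exit status, message].
ERRORS = {
    "DEFAULT_ERROR": [0, 1, "Undefined error."],
    "FILE_NOT_FOUND": [1, 1, "Input file could not be read."],
}
# Warning entries follow X: [Y, Z] = name: [number, message].
WARNINGS = {
    "DEFAULT_WARNING": [0, "Undefined warning."],
    "EMPTY_INPUT": [1, "Input contains no regions."],
}


def format_error(error_type, add_msg=""):
    """Return (text, exit status), falling back to the default entry."""
    number, status, message = ERRORS.get(error_type, ERRORS["DEFAULT_ERROR"])
    text = "ERROR %d: %s %s" % (number, message, add_msg)
    return text.strip(), status


def format_warning(warning_type, add_msg=""):
    """Return the warning text, falling back to the default entry."""
    number, message = WARNINGS.get(warning_type, WARNINGS["DEFAULT_WARNING"])
    return ("WARNING %d: %s %s" % (number, message, add_msg)).strip()


def throw_error(error_type, add_msg=""):
    """Print the error to stderr and exit with its exit status."""
    text, status = format_error(error_type, add_msg)
    print(text, file=sys.stderr)
    sys.exit(status)
```

Keeping the number, exit status and message together in one table means callers only need to know the internal error name.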
class Util.GenomeData(organism)

Represent genomic data. Inherits ConfigurationFile.

get_annotation_dump_dir()

Returns the current path to the gencode annotation gtf file.

get_association_file()

Returns the current path to the gene association text file.

get_chromosome_sizes()

Returns the current path to the chromosome sizes text file.

get_gencode_annotation()

Returns the current path to the gencode annotation gtf file.

get_gene_alias()

Returns the current path to the gene alias txt file.

get_genome()

Returns the current path to the genome fasta file.

get_organism()

Returns the current organism.

class Util.HelpfulOptionParser(usage=None, option_list=None, option_class=optparse.Option, version=None, conflict_handler='error', description=None, formatter=None, add_help_option=True, prog=None, epilog=None)

An OptionParser that prints full help on errors. Inherits OptionParser.

class Util.HmmData

Represent HMM data. Inherits ConfigurationFile.

get_default_bias_table_F()

Returns the current default bias table for the forward strand.

get_default_bias_table_R()

Returns the current default bias table for the reverse strand.

get_default_hmm_dnase()

Returns the current default DNase only hmm.

get_default_hmm_dnase_bc()

Returns the current default DNase only hmm.

get_default_hmm_dnase_histone()

Returns the current default DNase+histone hmm.

get_default_hmm_dnase_histone_bc()

Returns the current default DNase+histone hmm.

get_default_hmm_histone()

Returns the current default Histone only hmm.

class Util.Html(name, links_dict, fig_dir=None, fig_rpath='../fig', cluster_path_fix='', RGT_header=True, other_logo=None, homepage=None)

Represent an HTML file.

Keyword arguments:

  • name – Name of the HTML document.
  • links_dict – Dictionary with the upper links.
  • fig_dir – Figure directory (default = None).
  • fig_rpath – Relative figure path (default = ‘../fig’).
  • cluster_path_fix – deprecated.
  • RGT_header – Whether to print RGT header (default = True).
  • other_logo – Other tool logos (default = None).
  • homepage – Homepage link (default = None).

Warning

cluster_path_fix is going to be deprecated soon. Do not use it.

add_figure(figure_path, notes=None, align=50, color='black', face='Arial', size=3, bold=False, width='800', more_images=None)

Add a figure with notes underneath.

Keyword arguments:

  • figure_path – The path to the figure.
  • notes – A list of strings for further explanation
  • align – Alignment of the heading. Can be either an integer (interpreted as left margin) or string (interpreted as HTML positional argument) (default = 50).
  • color – Color (default = ‘black’).
  • face – Font (default = ‘Arial’).
  • size – Size (default = 3).
  • bold – Whether it is bold (default = False).
  • width – Width (default = 800).
  • more_images – Add more images (default = None).
add_fixed_rank_sortable()

Add jQuery code for fixing the first column of the sortable table.

add_free_content(content_list)

Adds free HTML to the document.

Keyword arguments:

  • content_list – List of strings. Each string is interpreted as a line in the HTML document.
add_heading(heading, align=50, color='black', face='Arial', size=5, bold=True, idtag=None)

Creates a heading.

Keyword arguments:

  • heading – The heading title.
  • align – Alignment of the heading. Can be either an integer (interpreted as left margin) or string (interpreted as HTML positional argument) (default = 50).
  • color – Color of the heading (default = “black”).
  • face – Font of the heading (default = “Arial”).
  • size – Size of the heading (HTML units [1,7]) (default = 5).
  • bold – Whether the heading is bold (default = True).
  • idtag – Add ID tag in the heading element (default = None).

Adds all the links.

add_list(list_of_items, ordered=False)

Adds a list to the document.

Keyword arguments:

  • list_of_items – List of items to add.
  • ordered – Whether the list is ordered (default = False).
add_zebra_table(header_list, col_size_list, type_list, data_table, align=50, cell_align='center', auto_width=False, colorcode=None, header_titles=None, border_list=None, sortable=False)

Creates a zebra table.

Keyword arguments:

  • header_list – A list with the table headers in correct order.

  • col_size_list – A list with the column sizes (integers).

  • type_list – A string in which each character represents the type of each row.
    • s = string (regular word or number)
    • i = image
    • l = link
  • data_table – A table containing the data to be input, according to each data type defined.
    • s = string
    • i = tuple containing (“file name”, width), where width is an integer
    • l = tuple containing (“Name”, “Link”)
  • align – Alignment of the heading. Can be either an integer (interpreted as left margin) or string (interpreted as HTML positional argument) (default = 50).

  • cell_align – Alignment of each cell in the table (default = center).

  • auto_width – Adjust the column width by the content automatically regardless of defined col size (default = False).

  • colorcode – Color code (default = None)

  • header_titles – Given a list corresponding to the header_list, which defines all the explanation in hint windows (default = None).

  • border_list – Table borders (default = None).

  • sortable – Whether it is a sortable table (default = False).
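
How type_list and data_table fit together can be sketched with a small renderer. This is not the Html class: render_cell and render_rows are hypothetical helpers, the sketch assumes that each character of type_list describes one column of every data row, and the "even"/"odd" class names and the sample data are invented for illustration.

```python
from html import escape


def render_cell(kind, value):
    """Render one table cell according to its type_list character."""
    if kind == "s":
        return escape(str(value))
    if kind == "i":
        file_name, width = value
        return '<img src="%s" width="%d">' % (escape(file_name), width)
    if kind == "l":
        name, link = value
        return '<a href="%s">%s</a>' % (escape(link), escape(name))
    raise ValueError("unknown cell type: %r" % kind)


def render_rows(type_list, data_table):
    """Render data rows, alternating row classes for the zebra effect."""
    rows = []
    for index, row in enumerate(data_table):
        css = "even" if index % 2 == 0 else "odd"  # hypothetical class names
        cells = "".join("<td>%s</td>" % render_cell(kind, value)
                        for kind, value in zip(type_list, row))
        rows.append('<tr class="%s">%s</tr>' % (css, cells))
    return rows


type_list = "sil"
data_table = [["motif_1", ("logo/motif_1.png", 120), ("SPI1", "spi1.html")]]
rows = render_rows(type_list, data_table)
```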

copy_relevent_files(target_dir)

Copies relevant files to relative paths.

Keyword arguments:

  • target_dir – Target directory to copy files.

Adds footer.

create_header(relative_dir=None, RGT_name=True, other_logo=None)

Creates default document header.

Keyword arguments:

  • relative_dir – Define the directory to store CSS file and RGT logo so that the html code can read from it (default = None).
  • RGT_name – Whether to print RGT name (default = True).
  • other_logo – Other tool logos (default = None)
write(file_name)

Write HTML document to file name.

Keyword arguments:

  • file_name – Complete file name to write this HTML document.
class Util.ImageData

Represent image data. Inherits ConfigurationFile.

get_css_file()

Returns the css file location.

Returns the default motif logo file location.

get_jquery()

Returns the jquery code location.

get_jquery_metadata()

Returns the jquery metadata location.

Returns the rgt logo image file location.

get_sorttable_file()

Returns the default sorttable code location.

get_tablesorter()

Returns the table sorter code location.

Returns the default TDF logo.

Returns the default RGT viz logo.

class Util.MotifData

Represent motif (PWM) data. Inherits ConfigurationFile.

get_fpr_list()

Returns the list of current paths to the fpr files.

get_logo_file(current_repository)

Returns the path to a specific logo repository.

Keyword arguments:

  • current_repository – Motif repository.
get_logo_list()

Returns the list of current paths to the logo images of PWMs in the given repositories.

get_mtf_list()

Returns the list of current paths to the mtf files.

get_mtf_path(current_repository)

Returns the path to a specific mtf file.

Keyword arguments:

  • current_repository – Motif repository.
get_pwm_list()

Returns the list of current paths to the PWM repositories.

get_pwm_path(current_repository)

Returns the path to a specific motif repository.

get_repositories_list()

Returns the current repository list.

class Util.OverlapType

Class of overlap type constants.

Constants:

  • OVERLAP – Return new GenomicRegionSet including only the overlapping regions.
  • ORIGINAL – Return the regions of original GenomicRegionSet which have any intersections.
  • COMP_INCL – Return region(s) of the GenomicRegionSet which are ‘completely’ included.
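
The difference between the three modes can be sketched on plain intervals. The constants and the intersect function below are illustrative stand-ins, not RGT code; intervals are assumed to be half-open (start, end) tuples on a single chromosome, and "completely included" is read as "fully contained in a region of the other set".

```python
OVERLAP, ORIGINAL, COMP_INCL = 0, 1, 2  # illustrative values, not RGT's


def intersect(regions, others, mode=OVERLAP):
    """Intersect two lists of half-open (start, end) intervals."""
    result = []
    for start, end in regions:
        hits = [(s, e) for s, e in others if start < e and s < end]
        if not hits:
            continue  # no intersection at all: dropped in every mode
        if mode == OVERLAP:
            # Keep only the shared part of each overlapping pair.
            result.extend((max(start, s), min(end, e)) for s, e in hits)
        elif mode == ORIGINAL:
            # Keep the whole original region once.
            result.append((start, end))
        elif mode == COMP_INCL:
            # Keep the region only if another region fully contains it.
            if any(s <= start and end <= e for s, e in hits):
                result.append((start, end))
    return result


a = [(0, 10), (20, 30), (40, 50), (70, 80)]
b = [(5, 25), (38, 60)]
only_overlap = intersect(a, b, OVERLAP)
originals = intersect(a, b, ORIGINAL)
included = intersect(a, b, COMP_INCL)
```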
class Util.PassThroughOptionParser(usage=None, option_list=None, option_class=optparse.Option, version=None, conflict_handler='error', description=None, formatter=None, add_help_option=True, prog=None, epilog=None)

When unknown arguments are encountered, bundle with largs and try again, until rargs is depleted. sys.exit(status) will still be called if a known argument is passed incorrectly (e.g. missing arguments or bad argument types, etc.). Inherits HelpfulOptionParser.
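
The retry loop described above can be sketched with the standard optparse module. This is a widely used recipe rather than a copy of RGT's class: PassThroughParser is a hypothetical name, it derives directly from OptionParser instead of HelpfulOptionParser, and it overrides the private method _process_args, which is an implementation detail of optparse.

```python
from optparse import BadOptionError, OptionParser


class PassThroughParser(OptionParser):
    """Collect unknown options as leftover arguments instead of failing."""

    def _process_args(self, largs, rargs, values):
        while rargs:
            try:
                OptionParser._process_args(self, largs, rargs, values)
            except BadOptionError as err:
                # Unknown option: keep it as a leftover argument and retry
                # with whatever remains in rargs.
                largs.append(err.opt_str)


parser = PassThroughParser()
parser.add_option("-o", "--organism", dest="organism")
options, args = parser.parse_args(
    ["--organism", "hg19", "--unknown", "x", "in.bed"])
```

The leftover arguments can then be handed to a second parser or an external tool, which is the usual reason for tolerating unknown options.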

class Util.SequenceType

Class of sequence type constants.

Constants:

  • DNA
  • RNA

MotifSet

Represents a transcription factor motif and the standardization of motif annotation.

class MotifSet.Motif(tf_id, name, database, tf_class, genes, genes_suffix)

Represents transcription factor motifs.

Keyword arguments:

  • tf_id – Transcription factor ID.
  • name – Transcription factor name (symbol).
  • database – Database/repository from which this motif was obtained.
  • tf_class – Class of transcription factor motif.
  • genes – Genes to which the transcription factor binds.
  • genes_suffix – Gene suffixes.
class MotifSet.MotifSet

Represents a set of motifs.

add(new_motif)

Adds a new motif to this set.

Keyword arguments:

  • new_motif – New motif to be added.
filter_by_genes(genes, search='exact')

This method returns motifs associated with genes. The search has three modes:

  1. ‘exact’ - exact match only.
  2. ‘inexact’ - genes with no exact match are searched for an inexact match.
  3. ‘all’ - an inexact match is applied to all genes.

Keyword arguments:

  • genes – Gene set to perform the filtering.
  • search – Search mode (default = ‘exact’).

Return:

  • motif_sets – Filtered motif sets.
  • genes_motifs – Dictionary of genes to motifs.
  • motifs_genes – Dictionary of motifs to genes.
filter_by_motifs(motifs)

Filter this motif set by defined motifs.

Keyword arguments:

  • motifs – Motifs by which to filter this set.

Return:

  • motif_sets – Filtered motif sets.
match_suffix(gene_name)

Match with gene suffix

Keyword arguments:

  • gene_name – Gene name to perform the match.

Return:

  • res – ID of mapped genes.
read_file(file_name_list)

Reads TF annotation in mtf format (internal format; check manual).

Keyword arguments:

  • file_name_list – A list with .mtf files.
read_motif_targets_enrichment(enrichment_files, pvalue_threshold)

Reads current output of motif enrichment analysis to get gene targets.

Keyword arguments:

  • enrichment_files – Enrichment files to read.
  • pvalue_threshold – P-value threshold for motif acceptance.
write_cytoscape_network(genes, gene_mapping_search, out_path, targets, threshold)

Writes files to be used as input for Cytoscape. It receives a list of genes to map to, a mapping search strategy and a path for the output files.

Keyword arguments:

  • genes – Gene set.
  • gene_mapping_search – Gene mapping.
  • out_path – Output path.
  • targets – Gene targets.
  • threshold – Threshold for motif acceptance.
write_enrichment_table(threshold, out_file, motifs_map)

Writes enrichment table for network generation.

Keyword arguments:

  • threshold – P-value threshold for motif acceptance.
  • out_file – Output file name.
  • motifs_map – Mapping of motifs.
