Motif discovery

We searched for motifs in 50nt-long exon-free intronic regions proximal to the splice sites that form the junctions differentially regulated under the Wingless and Insulin pathways, and assessed their significance against a large collection of motifs discovered from control regions. In this web supplementary information we describe in detail how the sequences proximal to the differentially regulated junctions, and within the control regions, were retrieved, how motif discovery was performed and how the significance of the motifs was assessed.

Sequences proximal to differentially regulated junctions

Starting from the two sets of 64 and 192 differentially regulated junctions in the Wingless and Insulin pathways, respectively, we consider the corresponding two sets of introns. For each of these two sets we project their genomic coordinates, and the coordinates of the exons ocurring within them, in order to discard duplicated endpoints of the introns (due to overlapping introns associated to, .e.g., junctions involved in the same exon skipping event) and mask out alternatively spliced exonic sequences. This splits the introns into non-redundant exon-free regions leaving intact the exon-free non-overlapping ones. From these regions we discarded those shorter than 56nt and those without any of the two endpoints of the initial intron. Finally, we obtained 48 and 50 regions proximal to 5'ss and 3'ss, respectively, from junctions differentially regulated in the Wingless pathway, while from junctions differentially regulated in the Insulin pathway we obtained 156 and 143 regions proximal to 5'ss and 3'ss, respectively. From each of the 5'ss-proximal regions, we selected 50nt downstream starting on the 7th. nucleotide of the intron (i.e., skipping over the U1 snRNA binding site), and from each of the 3'ss-proximal regions, we selected 50nt upstream starting on the 2nd to last nucleotide of the intron. For each stretch of 50nt we retrieved the corresponding orthologous sequences in the other 11 drosophila species by using genome sequence alignments generated by MAVID (Bray and Pachter, 2004). In some cases, the genome sequence alignments may contain extremely large insertions or have very few or no nucleotides aligned. While these may arise from evolutionary events of different kinds, they may also be the result of a misalignment or a lack of coverage in the corresponding genome sequence. In either case this can easily lead to a misalignment of the sequences of the rest of the drosophila species even though they may contain the corresponding orthologous nucleotides. In order to avoid as much as possible this kind of artifact, for each 50nt-long exon-free region, we have discarded those orthologous sequences with less than 30nt and taken only the first 50nt on those longer than that length. Then, we re-aligned the filtered set of orthologous sequences using ClustalW and do the rest of the analysis on those sequences. More concretely, we obtained in total 473 and 491 sequences proximal to 5'ss and 3'ss, respectively, of junctions differentially regulated in the Wingless pathway, and 1395 and 1363 sequences proximal to 5'ss and 3'ss, respectively, of junctions differentially regulated in the Insulin pathway. We shall refer to these four sequence sets as Wingless 5'ss-proximal, Wingless 3'ss-proximal, Insulin 5'ss-proximal and Insulin 3'ss-proximal.

Motif search in the sequences proximal to differentially regulated junctions

We searched for motifs in the four sequence sets resulting from the previous data-preparation procedure: Wingless 5'ss-proximal, Wingless 3'ss-proximal, Insulin 5'ss-proximal and Insulin 3'ss-proximal. We performed the motif search using two different publicly available pieces of software for that purpose, MEME (Bailey and Elkan, 1994) and PHYLOGIBBS (Siddharthan et al., 2005). Both computer programs find motifs by searching for similar sites across the input sequences, however, while MEME assumes that all sequences are independent, PHYLOGIBBS exploits the evolutionary relationships among the sequences to try to distinguish conservation due to function from conservation due to evolutionary proximity. More concretely, we have provided as input to PHYLOGIBBS the phylogeny of the twelve drosophila species inferred by Heger and Ponting (2007) based on the rates of substitutions at synonymous sites (dS) estimated among ortholog genes. PHYLOGIBBS expects a phylogenetic tree where branch lengths correspond to probabilities q of no-mutation taking place along the branch and these are calculated from synonymous substitutions as q = exp(-dS) (E. van Nimwegen, personal communication). Thus the input tree given to PHYLOGIBBS in Newick format was the following:
((((DroMoj:0.613,DroVir:0.705):0.905,DroGri:0.631):0.657,DroWil:0.313):0.861,
(((((DroSim:0.980,DroSec:0.970):0.970,DroMel:0.932):0.932,
(DroEre:0.905,DroYak:0.896):0.961):0.560,DroAna:0.440):0.705,
(DroPer:0.980,DroPse:0.990):0.507):0.869)
  
The dependence among the ortholog sequences leads MEME to assign statistically significant scores to motifs formed by conserved sequences of different species proximal to the same splice site (i.e., each of such motifs could be associated to a single differentially regulated junction and thus formed by no more than 12 sites albeit being called statistically significant by MEME). While some of those motifs can include real splice factor binding sites we expect that some splice factors acting upon the downstream regulation of the pathway would bind across regions proximal to a substantial number of all of the differentially regulated junctions. For this reason, and in order to have a criterion of motif statistical significance common to MEME and PHYLOGIBBS, we have asked the programs to provide a fixed number of motifs for each dataset expecting that this number is sufficiently large to find interesting motifs and assess independently their statistical significance as described in the next sections. More concretely, we have asked each program to retrieve 5 motifs for each of the four datasets thus obtaining in total a panel of 40 putative motifs which are displayed in Figure 1 of this web supplement. We asked MEME to find motifs of any length between 4 and 9 nucleotides and as we may see in Figure 1 all motifs found by MEME had 9 positions except for two of them that had 8. PHYLOGIBBS requires a fixed motif length and after making the previous observation we decided for PHYLOGIBBS to provide motifs of length 9.

Figure 1. Ranking of the motifs found in regions proximal to 5'ss and 3'ss of junctions being differentially regulated under the Wingless and Insulin pathways. Next to the graphical representation of each motif the following information is specified: the associated pathway and proximal splice site; the computer program used to find the motif; the motif-index among the motifs found by that program on the corresponding sequence set; the number of sites and the empirical p-values on the statistical significance of that number for control regions proximal to non-regulated constitutive and non-constitutive junctions, respetively; the number of different junctions and the empirical p-values on the statistical significance of that number for control regions proximal to non-regulated constitutive and non-constitutive junctions, respectively.

Assessment of the statistical significance of the discovered motifs

We assessed the statistical significance of each of the 40 motifs discovered in regions proximal to the differentially regulated junctions through a series of simulations sampling control sequence sets. A control sequence set is formed by sequences proximal to a number of non-regulated junctions in Drosophila melanogaster and their orthologous counterparts in the other 11 drosophila species such that the number of different non-regulated junctions in the control sequence set coincides with the number of different regulated junctions in the controlled set (e.g., in the set of sequences proximal to 5'ss of junctions differentially regulated in the Wingless pathway) and the sequences therein are conserved in the same number of species than those in the controlled set. Such a control sequence set is sampled from a larger set of 6,445 junctions that did not show differential regulation under both, the Wingless and Insulin pathways. The sampling is made separately from constitutive (3,745) and non-constitutive (2,700) junctions and independently for each of the four controlled sets (Wingless 5'ss-proximal, Wingless 3'ss-proximal, Insulin 5'ss-proximal and Insulin 3'ss-proximal) since the sequences in each of these four sets display a different phylogenetic coverage. Thus, in total, we consider eight different classes of control regions where for each class we sample with replacement 100 control sequence sets. The initial set of sequences from where we are sampling undergoes the same data pre-processing steps (discarding duplicated intron endpoints, masking out exonic sequences, etc.) as those associated to differentially regulated junctions. These steps are described in the first section of this supplement. For each control sequence set we perform the motif discovery procedure described above using MEME and PHYLOGIBBS although this time retaining only the most significant motif found by each of the two programs. Thus, in total, we obtain a set of 100 motifs discovered by each of the two programs using eight different classes of control regions sampling from each of them 100 control sequence sets. In Figure 2 we can see the distribution of the number of sites forming each motif and in Figure 3 we can see the distribution of the number of different junctions included in each motif. Figure 2 also displays for each bin the mean of the per cent of information content of the motifs occurring on that bin. The information content (Schneider and Stephens, 1990) is a measure of nucleotide conservation in bits calculated at each motif position ranging from 0 (no conservation, each nucleotide occurs with the same frequency) to 2 (full conservation, only one single nucleotide occurs on that position) so that for a given motif with w positions, its maximum information content is 2 x w. The per cent of information content is the fraction of the maximum information content that the motif has, and the mean displayed in the figure corresponds to this fraction averaged through all the motifs occurring on the bin of the histogram. We have included bars for two standard errors on those bins with more than one motif. The per cent of information content (as surrogate for the conservation of the motif) decreases as the number of sites forming the motif increases. Note that because PHYLOGIBBS motifs tend to include many more sites, their information content is lower than in the MEME motifs.

Figure 2. Distribution of the number of sites in control regions proximal to non-regulated junctions.

Figure 3. Distribution of the number of different junctions in control regions proximal to non-regulated junctions.

Using these distributions, for each of the 40 motifs discovered in regions proximal to differentially regulated junctions, we estimate empirical p-values of observing a motif in the corresponding control region displaying an equal or larger number of sites and different junctions. In Figure 1, next to the graphical representation of each motif, we find the number of sites (st) and the two p-values of observing that number of sites at the corresponding control regions from constitutive and non-constitutive junctions, respectively. We also find the number of different junctions involved in the motif (jn) and the two p-values of observing that number at the control regions again from constitutive and non-constitutive junctions, respectively. Below the per cent of information content of that motif is indicated and the rest of the information indicates the pathway and splice site to which the motif is associated, the computer program (MEME or PHYLOGIBBS) that found the motif and the motif-index among the five motifs retrieved by that program from that sequence set. Thus, each motif has 4 p-values associated that allowed us to rank the motifs in the way they are displayed in Figure 1 by increasing p-value. We considered as significant a p-value equal or smaller than 0.05 indicating that circumstance in the figure with an asterisk next to the p-value. We finally deemed a motif as statistically significant if all its 4 p-values were significant at that 0.05 level. As we may see from Figure 1, only the two first motifs from the ranking can be considered significant under that criterion and, therefore, they are the ones we believe that contain splice factor binding sites likely to be involved in the events underlying the junction differential splicing regulation observed downstream of the corresponding pathway. In Figures 4 and 5 we show two particular examples of putative binding sites belonging to these two motifs. In the case of the Wingless motif (Figure 4) the binding site is downstream of a 5'ss of an alternatively skipped exon and in the case of the Insulin motif (Figure 5) the binding site is downstream of an alternative 5'ss.

Figure 4. UCSC genome browser snapshot of the putative binding site (AAGAACAAC) of the Wingless motif. In the top panel we find the event view where we see the entire junction being differentially regulated in the first track and the match to the binding site in the second track. In the bottom panel we find a zoomed view of the 5'ss and the intronic region downstream where we find the binding site.

(event view)
(zoomed view)

Figure 5. UCSC genome browser snapshot of the putative binding site (ATATTTGTT) of the Insulin motif. In the top panel we find the event view where we see two junctions differentially regulated in the first track and the match to the binding site in the second track. In the bottom panel we find a zoomed view of the 5'ss and the intronic region downstream where we find the binding site.

(event view)
(zoomed view)

References