We present dependency logos as a new way of visualizing dependency structures in aligned sequences. In contrast to traditional sequence logos (Schneider and Stephens 1990), dependency logos make dependencies between sequence positions visually perceptible. In contrast to previous approaches, dependency logos are model-free and only require a set of aligned sequences, e.g. predicted binding sites, and, optionally, associated weights as input.
The DepLogo R-package extends the original dependency logos (Keilwagen and Grau 2015) in several aspects including
The source code of the DepLogo R-package is available from https://github.com/Jstacs/DepLogo.
The general, conceptual idea of dependency logos is to show dependencies in (aligned) sequences by position-wise partitioning of input sequences. Each of the resulting partitions is then visualized separately. In case of inter-dependent positions, the symbols occurring at one position should be related to those symbols occurring at mutually dependent positions. Hence, a partitioning by symbols at the first position should also lead to a (partial) separation of symbols at those positions depending on the first position. Partitioning and subsequent visualization will make such dependencies perceptible. Each resulting partition may be sub-divided into further sub-groups recursively.
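The partitioning idea can be illustrated with a minimal base-R sketch. It does not use the DepLogo API; the toy data, the variable names (`pos1`, `pos2`, `pos3`) and the dependency strength of 0.9 are assumptions made purely for illustration.

```r
set.seed(1)
n <- 1000
alphabet <- c("A", "C", "G", "T")

# Hypothetical toy data (not from the vignette):
# pos1 is A or T; pos2 copies pos1 in 90% of the sequences and is drawn
# uniformly otherwise; pos3 is uniform and independent of both.
pos1 <- sample(c("A", "T"), n, replace = TRUE)
pos2 <- ifelse(runif(n) < 0.9, pos1, sample(alphabet, n, replace = TRUE))
pos3 <- sample(alphabet, n, replace = TRUE)
seqs <- data.frame(pos1, pos2, pos3, stringsAsFactors = FALSE)

# Partition the sequences by the symbol observed at pos1
parts <- split(seqs, seqs$pos1)

# Relative symbol frequencies at the remaining positions, per partition
freqs <- lapply(parts, function(p) {
  sapply(p[, c("pos2", "pos3")], function(col) {
    prop.table(table(factor(col, levels = alphabet)))
  })
})

print(lapply(freqs, round, 2))
```

In this sketch, partitioning by `pos1` also separates the symbols at the dependent `pos2`, while the independent `pos3` stays close to uniform in both partitions. Each partition could be split again by another position to mimic the recursive sub-division.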
Above, we show a dependency logo with annotations explaining the different parts. At the top of the plot, we find an axis with arcs connecting dependent positions. In this case, positions 12, 13, and 14 show a similar level of pairwise dependencies, whereas the other positions show no dependencies. In general, the levels of dependencies might be more gradual, from no dependencies to strong dependencies.
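One way to quantify pairwise dependencies such as those drawn as arcs is mutual information. The base-R sketch below uses it as an example measure; this section does not state which measure DepLogo uses, so the function `mutual_information` and the toy positions `p1`, `p2`, `p3` are assumptions for illustration only.

```r
# Mutual information (in bits) between two vectors of symbols
mutual_information <- function(x, y) {
  joint <- prop.table(table(x, y))          # joint relative frequencies
  expected <- outer(rowSums(joint), colSums(joint))  # product of marginals
  nz <- joint > 0                           # skip empty cells (0 * log 0 = 0)
  sum(joint[nz] * log2(joint[nz] / expected[nz]))
}

set.seed(2)
n <- 2000
alphabet <- c("A", "C", "G", "T")

# Hypothetical toy alignment: p2 copies p1 in 80% of the sequences,
# p3 is independent of both.
p1 <- sample(alphabet, n, replace = TRUE)
p2 <- ifelse(runif(n) < 0.8, p1, sample(alphabet, n, replace = TRUE))
p3 <- sample(alphabet, n, replace = TRUE)
seqs <- cbind(p1 = p1, p2 = p2, p3 = p3)

# Symmetric matrix of pairwise mutual information values
L <- ncol(seqs)
mi <- matrix(0, L, L, dimnames = list(colnames(seqs), colnames(seqs)))
for (i in seq_len(L - 1)) {
  for (j in (i + 1):L) {
    mi[i, j] <- mi[j, i] <- mutual_information(seqs[, i], seqs[, j])
  }
}

print(round(mi, 3))
```

The dependent pair (`p1`, `p2`) yields a clearly larger value than the pairs involving the independent `p3`, which corresponds to the gradual levels of dependency that arcs can express.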
In the central part of the plot, we see colored boxes representing different partitions of the data. The central consensus sequences of the individual partitions are annotated on the right. Symbols are represented by colors in a similar manner as in traditional sequence logos, and the sequence logo at the lower part of the plot helps to recognize the mapping between colors and symbols (in this case DNA nucleotides).
Although the dependency logo spans 20 positions, only the central positions 7 to 14 are clearly colored. The reason is that colors assigned to the partitions at different positions are first mixed, depending on the symbols occurring in that partition at that position (e.g., position 7 is a mix of red (T) and green (A)), and then opacity is assigned based on “information content”. This means that, in the same manner as in traditional sequence logos where positions which are close to a uniform distribution are scaled down, color intensity is scaled down at such positions (and partitions) in dependency logos. Based on the dependency logo, we may state the following dependencies for these data:
In addition, although positions 7, 12 and 13 are all either “A” or “T”, only positions 12 and 13 show dependencies to other positions, whereas the occurrence of “A” or “T” at position 7 is independent of all other positions. Otherwise, i) we would find an arc between position 7 and other positions at the top of the plot and ii) position 7 would be de-mixed by the partitioning at one of the other positions. We may make a similar statement for position 14, which shows dependencies to positions 12 and 13, although its mono-nucleotide distribution over all sequences is uniform. By contrast, the remaining positions (1 to 6 and 15 to 20) are also uniformly distributed but independent of all other positions. Notably, we cannot distinguish position 14 from the remaining positions (1 to 6 and 15 to 20) by means of the traditional sequence logo.
At the left of the central part, we further find the number of sequences that the dependency logo is based on.
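The color mixing and information-content-based opacity described above can be sketched in base R as follows. The exact formulas used by DepLogo are not given in this section, so the linear mixing, the linear opacity scaling, and the colors chosen for C and G are assumptions; only red for T and green for A are taken from the text.

```r
# Assumed color mapping: A (green) and T (red) follow the text,
# C and G are hypothetical choices for this sketch.
base_cols <- c(A = "green", C = "blue", G = "orange", T = "red")

# Information content (in bits) of a symbol distribution p
info_content <- function(p) {
  nz <- p > 0
  log2(length(p)) + sum(p[nz] * log2(p[nz]))
}

# Mix the base colors weighted by symbol frequencies, then set the
# opacity proportional to the information content (assumed scaling).
mix_color <- function(p) {
  rgb_mat <- col2rgb(base_cols[names(p)])   # 3 x 4 matrix of RGB values
  mixed <- rgb_mat %*% p                    # frequency-weighted average
  alpha <- info_content(p) / log2(length(p))
  rgb(mixed[1], mixed[2], mixed[3], alpha = 255 * alpha, maxColorValue = 255)
}

only_a  <- mix_color(c(A = 1,    C = 0,    G = 0,    T = 0))
a_or_t  <- mix_color(c(A = 0.5,  C = 0,    G = 0,    T = 0.5))
uniform <- mix_color(c(A = 0.25, C = 0.25, G = 0.25, T = 0.25))

print(c(only_a = only_a, a_or_t = a_or_t, uniform = uniform))
```

Under these assumptions, a position that is always “A” is fully opaque green, an even mix of “A” and “T” is a half-transparent blend of green and red, and a uniform position becomes fully transparent, which is why such positions appear uncolored.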
Below, we show a series of (rather simple) dependency logos illustrating different levels and types of dependencies. Notably, the sequence logos of the first two data sets look identical, as do those of the third and fourth data sets, although their dependency structures differ fundamentally. In the first data set, there are no dependencies between any positions, whereas in the second data set, positions 12 and 13 are mutually dependent. In the third data set, we find strong dependencies between positions 12, 13, and 14, whereas in the fourth data set, positions 12 and 13 are independent but positions 7, 13, and 14 are mutually dependent.
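That identical sequence logos can hide different dependency structures follows from the fact that a sequence logo only reflects per-position symbol frequencies. The base-R sketch below constructs two hypothetical toy data sets (not the ones plotted in this vignette) with identical marginal frequencies at two positions, here named `p12` and `p13`, and contrasts them with a chi-squared test of independence.

```r
# Data set 1: all four A/T combinations occur equally often,
# so p12 and p13 are exactly independent.
ds1 <- expand.grid(p12 = c("A", "T"), p13 = c("A", "T"))[rep(1:4, each = 250), ]

# Data set 2: p13 always equals p12, i.e. the positions are fully dependent.
ds2 <- data.frame(p12 = rep(c("A", "T"), each = 500),
                  p13 = rep(c("A", "T"), each = 500))

# Per-position symbol counts (what a sequence logo would be based on)
marginals <- function(d) sapply(d, function(col) as.vector(table(col)))
print(marginals(ds1))
print(marginals(ds2))

# Chi-squared test of independence between the two positions
p_independent <- chisq.test(table(ds1$p12, ds1$p13))$p.value
p_dependent   <- chisq.test(table(ds2$p12, ds2$p13))$p.value
print(c(independent = p_independent, dependent = p_dependent))
```

Both data sets contain 500 “A” and 500 “T” at each position, so their sequence logos would coincide, yet the test separates them clearly. The chi-squared test is used here only as a familiar stand-in for a dependency measure.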