  • Similarly, the distribution of D values for the comparison of the proteomic thrA and thrB sequences is also shown in Figure 4, alongside the null model (Eq. 11) for its dimensionality, n = log2(uu) = 4.32 for uu = 20 possible amino acids. This null model is graphically nearly indistinguishable from that for the comparison between the stanzas, where n = log2(uu) = 4.25 for uu = 19 possible letters (dotted gray line, drawn for the rounded value n = 4.3).

  • For over a decade the idea of representing biological sequences in a continuous coordinate space has remained appealing but has not been fully realized [1, 2, 3]. The basic idea is that sequences of symbols, such as nucleotides in genomes, amino acids in proteomes, repeated sequences in MLST [Multi Locus Sequence Typing, 4], words in languages or letters in words, would define trajectories in this continuous space that conserve the statistical properties of the original sequences [3, 5, 6, 7, 8, 9]. Accordingly, the coordinate position of each unit would uniquely encode both its identity and its context, i.e. the identity of its neighbors [10]. Ideally, the position should be scale-independent, so that the encompassing sequence can be extracted at any resolution, yielding an oligomer of arbitrary length.
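
To make the idea of a position that encodes both identity and context concrete, the sketch below uses the classic chaos game representation for a four-letter nucleotide alphabet. It is only an illustration of this kind of iterated map under assumed conventions, not necessarily the construction developed in this paper. The corner assignment in `CORNERS` and the function names `cgr_positions` and `decode_suffix` are hypothetical choices made for the example.

```python
# Corners of the unit square, one per nucleotide (an assumed, conventional assignment).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}


def cgr_positions(seq, start=(0.5, 0.5)):
    """Map each symbol to a point by moving halfway from the previous point to the symbol's corner."""
    x, y = start
    points = []
    for symbol in seq:
        cx, cy = CORNERS[symbol]
        x, y = (x + cx) / 2, (y + cy) / 2
        points.append((x, y))
    return points


def decode_suffix(point, k):
    """Recover the k most recent symbols encoded in `point` (k must not exceed the number of symbols mapped so far)."""
    by_corner = {corner: symbol for symbol, corner in CORNERS.items()}
    x, y = point
    out = []
    for _ in range(k):
        # The leading binary digit of each coordinate identifies the corner last visited.
        bx, by = int(x >= 0.5), int(y >= 0.5)
        out.append(by_corner[(bx, by)])
        # Undo the halving step to expose the previous position.
        x, y = 2 * x - bx, 2 * y - by
    return "".join(reversed(out))


points = cgr_positions("GATTACA")
print(decode_suffix(points[-1], 3))  # ACA
```

Reading more digits from the same coordinate recovers a longer oligomer, which is the sense in which the resolution of the extraction can be chosen freely. In floating-point arithmetic this only holds for the most recent few dozen symbols, since each step consumes one bit of precision per axis.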

