Multiple Alignment with DiAlign


DiAlign is an alignment program for DNA or protein sequences that compares whole segments of the sequences rather than single nucleotides or amino acids.

DiAlign constructs alignments from gap-free pairs of similar segments of the sequences. Such segment pairs are referred to as diagonals.

Every possible diagonal is assigned a so-called weight reflecting the degree of similarity between the two segments involved. The overall score of an alignment is then defined as the sum of the weights of the diagonals it consists of, and the program searches for an alignment with a maximum score. In other words, the program tries to find a consistent collection of diagonals with a maximum sum of weights.
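
For two sequences, the scoring idea can be illustrated with a short Python sketch that picks a consistent chain of diagonals with maximal total weight. The `Diagonal` type, the example weights, and the simple quadratic dynamic programming are assumptions made for illustration only; they are not DiAlign's actual weight function or optimization procedure.

```python
from typing import NamedTuple

class Diagonal(NamedTuple):
    start1: int    # start position in sequence 1 (0-based)
    start2: int    # start position in sequence 2 (0-based)
    length: int    # number of residues in the gap-free segment pair (>= 1)
    weight: float  # similarity weight (hypothetical values)

def precedes(a: Diagonal, b: Diagonal) -> bool:
    """True if a ends before b starts in both sequences."""
    return a.start1 + a.length <= b.start1 and a.start2 + a.length <= b.start2

def best_chain(diagonals: list[Diagonal]) -> tuple[float, list[Diagonal]]:
    """Return the maximum sum of weights and one consistent chain achieving it."""
    ds = sorted(diagonals, key=lambda d: (d.start1, d.start2))
    if not ds:
        return 0.0, []
    best = [d.weight for d in ds]  # best[i]: best score of a chain ending in ds[i]
    prev = [-1] * len(ds)          # predecessor index for backtracking
    for i, d in enumerate(ds):
        for j in range(i):
            if precedes(ds[j], d) and best[j] + d.weight > best[i]:
                best[i] = best[j] + d.weight
                prev[i] = j
    i = max(range(len(ds)), key=best.__getitem__)
    score, chain = best[i], []
    while i != -1:
        chain.append(ds[i])
        i = prev[i]
    return score, chain[::-1]
```

In this pairwise setting, "consistent" simply means that the chosen diagonals can be placed one after another without crossing or overlapping; for more than two sequences the consistency condition is more involved.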

DiAlign does not use any gap penalty, thus avoiding this critical parameter. Consequently the program is especially suited to detect local similarities in otherwise completely unrelated sequences.

In the example below, the sequence segments corresponding to diagonals are underlined in each sequence. The color corresponds to the segment of the other sequence involved in the same diagonal. Lower case letters indicate amino acids that are not included in any diagonal and remain unaligned. The first diagonal shown in the alignment consists of the TPLPSH segment of HTLV_II and the APLPIH segment of HBV. The rows of * signs below the alignment indicate the degree to which diagonals overlap at each position.

diagonal alignment

Mathematical details of the algorithm are described in Morgenstern et al., 1996 (Proc. Natl. Acad. Sci. USA) and a more general description including application examples is given in Morgenstern et al., 1998 (Bioinformatics).


General: Sequence Formats
Accepted DNA sequence formats
The following formats for DNA sequences are accepted:
Sequences should contain only IUPAC characters; any other characters will be skipped!
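
The skipping of non-IUPAC characters can be pictured with the following Python sketch. It assumes the standard one-letter IUPAC nucleotide codes and case-insensitive matching; the exact character set accepted by the server is not stated here, and `keep_iupac` is a hypothetical helper name.

```python
# Standard IUPAC one-letter nucleotide codes (assumed set, including U and N)
IUPAC_DNA = set("ACGTURYSWKMBDHVN")

def keep_iupac(raw: str) -> str:
    """Drop every character that is not an IUPAC nucleotide code."""
    return "".join(c for c in raw if c.upper() in IUPAC_DNA)
```

For example, `keep_iupac("acg t-123\nNNx")` keeps only the nucleotide letters and discards whitespace, digits, and other symbols.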
Sequence Input
Choose from your previously uploaded sequences
Select a sequence file from the list of your personal sequence files that were saved in the result management during earlier analyses (via "add sequences", see below).
Quick Upload
Paste your sequence(s) into the form field in one of the accepted formats (see above). Note that sequences pasted into the "quick upload" field are not saved for future use.
Add sequences

Sequences or sequence files uploaded here are automatically saved in the result management for later use:

Enter the formatted DNA sequence(s)
Enter your correctly formatted sequence(s) directly into the form, e.g. by copy and paste (see above for accepted formats).
or upload a file containing sequence(s) (max. 100 MB)
If your browser supports this option, a sequence file can be uploaded. The file should contain the sequence(s) in one of the formats listed above.
Please note that the size of uploaded files is limited to 100 MB. If you want to analyze larger sequences, please contact us. For whole chromosomes you can use the accession number option below (e.g. 'NC_000001' for human chromosome 1).
Accession number(s)
If you are interested in one or several specific sequences from a database section, you can supply a list of accession numbers. To enter more than one accession number, separate the accession numbers by commas or spaces.

On the Genomatix server accession numbers from the following databases can be entered:

  • GenBank (sections Bacteria, Invertebrates, Other Mammalian, Other Vertebrates, Plants, Primates, Rodents, Viruses, ESTs) (e.g. 'M65229')
  • Eukaryotic Promoter Database (EPD) (e.g. 'EP30014')
  • NCBI Reference Sequences (mRNA sequences) (e.g. 'NM_000402')
  • Genomatix Promoter Database (e.g. 'GXP_107276')
  • dbSNP (e.g. 'rs1234')
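
As an illustration of the separator rule above (commas or spaces), a list of accession numbers could be split as follows. This is a minimal sketch; `parse_accessions` is a hypothetical helper, not part of the server, and no validation against the databases is attempted.

```python
import re

def parse_accessions(text: str) -> list[str]:
    """Split user input on commas and/or whitespace, ignoring empty tokens."""
    return [token for token in re.split(r"[,\s]+", text.strip()) if token]
```

With the example identifiers from the list above, `parse_accessions("M65229, NM_000402  GXP_107276,rs1234")` yields four separate accession numbers.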
Sequence Types

Please also check if your sequence is supposed to be read as

  • DNA sequence
  • PROTEIN sequence

DiAlign Parameters

Alignment Parameters
Type of sequence
DiAlign uses information on the loaded sequences for the alignment.

For protein sequences there is no further choice, but DNA sequences can be treated as

  • nucleic acid sequences or
  • nucleic acid sequences containing reading frames.

    With this second option, DiAlign translates the compared nucleic acid segments into peptide segments according to the genetic code. Since the reading frames are not necessarily known, all three of them are checked for significant similarity.
    In this case, the similarity between segments is assessed at the peptide level rather than at the nucleic acid level.
    We strongly recommend this option if nucleic acid sequences are expected to contain protein coding regions, as it will significantly increase the sensitivity of the alignment procedure in these cases.
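
To make the translation step concrete, the sketch below translates a nucleic acid segment in its three forward reading frames using the standard genetic code. It is an illustration under stated assumptions, not DiAlign's internal code: the function name `translate_frames` is hypothetical, incomplete trailing codons are dropped, and codons containing ambiguity symbols are rendered as 'X'.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ('*' marks a stop codon)
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_frames(dna: str) -> list[str]:
    """Translate a segment in the three forward reading frames."""
    seq = dna.upper().replace("U", "T")
    frames = []
    for offset in range(3):
        codons = (seq[i:i + 3] for i in range(offset, len(seq) - 2, 3))
        frames.append("".join(CODON_TABLE.get(codon, "X") for codon in codons))
    return frames
```

Comparing the resulting peptide strings rather than the raw nucleotides is what allows similarity to be assessed at the peptide level.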

Threshold T
As described above, DiAlign uses diagonals to construct an alignment. The threshold T influences the set of diagonals used: with T > 0, a diagonal is considered for alignment only if its weight exceeds this threshold, so regions of lower similarity are not aligned.

DiAlign usually produces reasonable alignments without a threshold, i.e. with T = 0.
Increasing the threshold reduces the computing time of DiAlign but also affects the alignment quality. If T is too large, even significant similarities are ignored. We recommend a threshold value between 0 and 1 (the maximum allowed value for T is 5).
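
The effect of the threshold on the candidate diagonals can be sketched as a simple filter. The function name, the dictionary of named weights, and the example values are hypothetical; only the rule itself (weight must exceed T, with T between 0 and 5) is taken from the description above.

```python
def select_diagonals(weights: dict[str, float], t: float = 0.0) -> dict[str, float]:
    """Keep the diagonals that are considered for alignment under threshold t."""
    if not 0 <= t <= 5:
        raise ValueError("T must lie between 0 and 5")
    if t == 0:
        return dict(weights)  # no threshold: every diagonal is a candidate
    return {name: w for name, w in weights.items() if w > t}
```

With hypothetical weights `{"d1": 0.4, "d2": 1.2, "d3": 3.0}` and `t=1.0`, only `d2` and `d3` remain; fewer candidates mean less computation but also fewer aligned regions.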

Output Parameters
Display of alignment
'*' signs below alignment

'*' characters are used in the DiAlign output to create a pseudo-graphical representation indicating

  • the relative degree of local similarity among the input sequences (diagonal similarity),
  • nucleic/amino acid similarity at each position of the alignment,
  • positions where all nucleic/amino acids are identical, or
  • variable positions.

In the first two cases, the user can specify the maximum number of '*' characters per column in the program output, thus changing the resolution of the graphics. In the other two cases, a single '*' sign denotes an identical or a variable position, respectively.

The latter two options are especially suited for very similar sequences where one is interested only in the mismatches within an alignment.

These parameters are hidden by default; clicking the corresponding control reveals them.
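
The "identical positions" and "variable positions" displays described above can be sketched in a few lines of Python. The function name `mark_columns`, the case-insensitive comparison, and the decision to count columns containing gaps as variable are assumptions for illustration, not DiAlign's actual output routine.

```python
def mark_columns(rows: list[str], mode: str = "identical") -> str:
    """Return a line with '*' under identical (or variable) alignment columns."""
    if mode not in ("identical", "variable"):
        raise ValueError("mode must be 'identical' or 'variable'")
    marks = []
    for column in zip(*rows):
        residues = {c.upper() for c in column}
        identical = len(residues) == 1 and "-" not in residues
        marks.append("*" if identical == (mode == "identical") else " ")
    return "".join(marks)
```

For very similar sequences, the "variable" mode leaves most of the line blank and makes the few mismatches easy to spot.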
Color coding within alignment

By default, the nucleic/amino acids in the DiAlign output that were actually aligned (diagonals) are color coded.

  • In case of a DNA alignment, the four nucleic acids each appear in a different color; other IUPAC symbols are shown in black. There are three options for the color code:

    • the default colors of DiAlign,
    • the color code that is used by the ABI sequencer, or
    • you can define your own colors for each nucleotide.
  • In case of a protein alignment, five groups of amino acids are colored as follows (DiAlign default colors):

    • basic amino acids (H,K,R)
    • nonpolar amino acids (A,C,G,I,L,M,P,V)
    • uncharged polar amino acids (N,Q,S,T)
    • acidic amino acids (D,E)
    • aromatic amino acids (Y,W,F)

    Alternatively, you can define your own colors for each of the amino acid groups.

The color coding can be switched off to get a black-and-white result.

These parameters are hidden by default; clicking the corresponding control reveals them.
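
The five amino acid groups listed above can be written as a small lookup table, for instance when post-processing an alignment outside the server. The names `AMINO_ACID_GROUPS` and `group_of` are illustrative; the colors themselves are not reproduced here.

```python
from typing import Optional

# Groups exactly as listed above; every standard amino acid appears once
AMINO_ACID_GROUPS = {
    "basic": "HKR",
    "nonpolar": "ACGILMPV",
    "uncharged polar": "NQST",
    "acidic": "DE",
    "aromatic": "YWF",
}
GROUP_OF = {aa: group for group, members in AMINO_ACID_GROUPS.items() for aa in members}

def group_of(residue: str) -> Optional[str]:
    """Return the group of a one-letter amino acid code, or None if it has none."""
    return GROUP_OF.get(residue.upper())
```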
Do not show non-aligned blocks

This option is set by default. Non-aligned blocks are removed from the DiAlign alignment. One or more omitted non-aligned blocks are indicated by three dots.

This option is especially suited to reduce the size of the alignment when a long sequence is aligned with a very short sequence (e.g. genomic sequence with corresponding mRNA).

Switch off this option in case you want to see the complete alignment.

Number of nucleic/amino acids per line

The default number of nucleic/amino acids per line in the alignment output is 50. It can be set to 0 (= unlimited) so that the complete alignment is shown in one line.

These parameters are hidden by default; clicking the corresponding control reveals them.
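
The line-length setting described above (default 50, with 0 meaning unlimited) behaves like the following sketch. `wrap_row` is a hypothetical helper that only illustrates the convention; it does not reproduce the grouping into blocks of ten seen in the DiAlign output.

```python
def wrap_row(row: str, per_line: int = 50) -> list[str]:
    """Split one alignment row into lines; per_line = 0 means a single line."""
    if per_line < 0:
        raise ValueError("per_line must be 0 (unlimited) or positive")
    if per_line == 0:
        return [row]
    return [row[i:i + per_line] for i in range(0, len(row), per_line)]
```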
Additional output
Additional output of pairwise sequence similarities

With this option the similarity (relative to the maximum similarity) and the number of aligned nucleic/amino acids are shown for each pairwise alignment.

  • The similarity value 1.000 only marks the two most similar sequences; it does not necessarily mean that these sequences are identical.
  • The number of aligned nucleic/amino acids is not an absolute value but is given as a percentage of the length of the shorter sequence.

This option is suited to identify pairs of sequences that are very similar.

These parameters are hidden by default; clicking the corresponding control reveals them.
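
The two values reported for each pair can be illustrated as follows: the raw similarity is divided by the largest similarity of any pair, and the number of aligned residues is expressed as a percentage of the shorter sequence. The function name, the data layout, and all numbers in the usage example are hypothetical.

```python
def pairwise_summary(
    scores: dict[tuple[str, str], float],
    aligned: dict[tuple[str, str], int],
    lengths: dict[str, int],
) -> dict[tuple[str, str], tuple[float, int]]:
    """Return (relative similarity, % of shorter sequence aligned) per pair."""
    top = max(scores.values())
    if top <= 0:
        raise ValueError("at least one pair must have a positive similarity")
    summary = {}
    for (a, b), score in scores.items():
        shorter = min(lengths[a], lengths[b])
        percent = round(100 * aligned[(a, b)] / shorter)
        summary[(a, b)] = (round(score / top, 3), percent)
    return summary
```

Because of the normalization, the best pair always receives 1.000 regardless of how similar the two sequences really are, which is why that value must not be read as "identical".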
Additional output of alignment in FASTA format

With this option the alignment is additionally displayed in FASTA format (useful e.g. if the alignment is to be used as input for other programs).

By default, the program displays the output only in DiAlign format for easy interpretation.

These parameters are hidden by default; clicking the corresponding control reveals them.
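
For readers unfamiliar with aligned FASTA, the sketch below writes gapped alignment rows as FASTA records (a '>' header line followed by the sequence). The helper name `to_fasta` and the line width of 60 are assumptions; the exact layout produced by DiAlign may differ.

```python
def to_fasta(alignment: dict[str, str], width: int = 60) -> str:
    """Format aligned (gapped) rows as FASTA records, wrapped at `width` columns."""
    lines = []
    for name, row in alignment.items():
        lines.append(f">{name}")
        lines.extend(row[i:i + width] for i in range(0, len(row), width))
    return "\n".join(lines) + "\n"
```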
Additional output of sequence tree

With this option a sequence tree in PHYLIP format can be displayed in the output.

This tree is constructed by applying the UPGMA clustering method to the DiAlign similarity scores. It roughly reflects the different degrees of similarity among the sequences. For detailed phylogenetic analysis, we recommend the usual methods for phylogenetic reconstruction.

These parameters are hidden by default; clicking the corresponding control reveals them.
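
The UPGMA clustering mentioned above can be sketched as follows. The code takes a symmetric distance matrix and returns a tree in Newick notation, which is the tree format read by PHYLIP. How DiAlign derives distances from its similarity scores is not described here, so the sketch assumes that distances are supplied directly; `upgma_newick` is a hypothetical name.

```python
def upgma_newick(names: list[str], dist: list[list[float]]) -> str:
    """Cluster with UPGMA and return the tree as a Newick string."""
    if not names:
        raise ValueError("at least one sequence is required")
    # Active clusters: id -> (Newick label, number of leaves, height of the node)
    clusters = {i: (name, 1, 0.0) for i, name in enumerate(names)}
    d = {(i, j): dist[i][j] for i in range(len(names)) for j in range(i + 1, len(names))}
    next_id = len(names)
    while len(clusters) > 1:
        (a, b), d_ab = min(d.items(), key=lambda item: item[1])  # closest pair
        (label_a, n_a, h_a), (label_b, n_b, h_b) = clusters.pop(a), clusters.pop(b)
        height = d_ab / 2
        label = f"({label_a}:{height - h_a:.3f},{label_b}:{height - h_b:.3f})"
        # Size-weighted average distance from the merged cluster to all others
        merged = {}
        for c in clusters:
            d_ac = d[(min(a, c), max(a, c))]
            d_bc = d[(min(b, c), max(b, c))]
            merged[(c, next_id)] = (n_a * d_ac + n_b * d_bc) / (n_a + n_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        d.update(merged)
        clusters[next_id] = (label, n_a + n_b, height)
        next_id += 1
    (label, _, _), = clusters.values()
    return label + ";"
```

For three sequences with distances A-B = 2 and A-C = B-C = 6, A and B are joined first and C is attached to that pair afterwards.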
Email address
Here you can choose between two methods for receiving the results:
  • Show result directly in browser window
    With this option the URL of the result is shown directly in your browser window.

    Warning: Please use this option only for analyses which can be performed in a short time.
    If the analysis takes longer than the timeout of the webserver, the connection will be terminated and you will receive an error message (e.g. "The document contained no data."). In this case, the results will not be available. Please restart the analysis using the option below, "Send the URL of the result to".

  • Send the URL of the result via email
    With this option an email containing the URL of the results will be sent to the email address you provide as soon as the analysis is finished.

The results will be available for a limited time on our server. For details of how long your results will be kept please see the result-email. After that period they will be deleted unless protected in the project management!

Output Examples

Example of DiAlign alignment format:

HTLV2           1   ldtapcLFSD GS------PQ KAAYVLWDQT IL---QQDIT PLPSHethSA
MMLV            1   pdadhtwYTD GSSLLQEGQR KAGAAVTTET eviwaKALDA G---T---SA
HEPB            1   rpglcQVFAD AT------PT GWGLVMGHQR MR---GTFSA PLPIHt---- 
                         ***** ********** ********** **   ***** *****   ** 
                          **** **      ** ********** **   ***** *****   ** 
                           *** **      ** ********** **   *****            
                                       ** ******                           
HEPB           38   --AELLAACF Arsrsgan-- -IIGTDN--- ---------- ---------- 
                    ********** ********** ********** ********** ********** 
                    ********** ********** ********** ********** ********** 
                       ******* ******     ********** *****                 
                       ******* ******     ********** *****                 



Example of pairwise sequence similarities:

For each pairwise alignment, the similarity (relative to the maximum similarity) and the number of aligned amino acids (in % of the shorter sequence) are given. Maximum values are underlined.

              (157 bp)    (141 bp)    (155 bp)
(135 bp)        65 %        25 %        35 %
(157 bp)                    10 %        63 %
(141 bp)                                55 %

Please note that the similarity value 1.000 only marks the two most similar sequences; it does not necessarily mean that these sequences are identical.

Example of FASTA alignment format:


Example of PHYLIP tree format:

Trees can be visualized e.g. by the drawtree program contained in the PHYLIP software package.



If you are interested in more details, the method is described in

The main changes of DiAlign2 compared to the first version of the program are described in