# Background on MatInspector Algorithm

## MatInd: Creation of a Matrix

The program MatInd constructs a description for a consensus (e.g. of a transcription factor binding site) which consists of

• a nucleotide distribution matrix,
• the conservation of each position within the matrix represented by an array of values termed consensus index vector (Ci-vector).

MatInd employs an alignment algorithm based on the method described by Frech et al. and creates the nucleotide distribution matrix by counting the bases at each position of the alignment.
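The counting step can be sketched in a few lines of Python. This is only an illustration, not MatInd's code: it assumes the sites are already aligned to equal length, that `-` marks a gap, and the name `count_matrix` and the example sites are hypothetical.

```python
from collections import Counter

ALPHABET = "ACGT-"  # '-' is assumed here as the gap symbol


def count_matrix(aligned_sites):
    """Return one dict of symbol counts per column of the alignment."""
    length = len(aligned_sites[0])
    if any(len(site) != length for site in aligned_sites):
        raise ValueError("all aligned sites must have the same length")
    matrix = []
    for column in zip(*aligned_sites):  # iterate over alignment columns
        counts = Counter(column)
        matrix.append({symbol: counts.get(symbol, 0) for symbol in ALPHABET})
    return matrix


# Hypothetical aligned binding sites, one of them containing a gap.
sites = ["TGACGT", "TGACGC", "TGA-GT", "AGACGT"]
matrix = count_matrix(sites)
print(matrix[0])  # counts for the first alignment column
```

Dividing each count by the number of aligned sites gives the relative frequencies P(i,b) used in the Ci formula.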

The Ci-vector is constructed by calculating the Ci-value for each position i of the matrix:

 Ci(i) = (100 / ln 5) * ( sum over b of (P(i,b) * ln P(i,b)) + ln 5 )

where

• b is A,C,G,T,gap
• P(i,b) is the relative frequency of nucleotide b at position i
• 0 <= Ci <= 100

The Ci-vector expresses the conservation of each nucleotide position in the matrix as a numerical value and is used by MatInspector:

• Ci = 100: a position with total conservation of one nucleotide
• Ci = 0: a position with equal distribution of all four nucleotides and gaps
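As a check on the formula, the following Python sketch computes the Ci-value of a single position from its counts. The function name `ci_value` is hypothetical, and the sketch assumes the usual convention that a term with P(i,b) = 0 contributes nothing to the sum.

```python
import math


def ci_value(counts):
    """Consensus index of one position from its counts for A, C, G, T and gap."""
    total = sum(counts.values())
    weighted_log_sum = 0.0
    for count in counts.values():
        if count > 0:  # assumption: 0 * ln(0) is treated as 0
            p = count / total
            weighted_log_sum += p * math.log(p)
    return (100.0 / math.log(5)) * (weighted_log_sum + math.log(5))


conserved = {"A": 12, "C": 0, "G": 0, "T": 0, "-": 0}
uniform = {"A": 4, "C": 4, "G": 4, "T": 4, "-": 4}
mixed = {"A": 9, "C": 1, "G": 1, "T": 1, "-": 0}

print(round(ci_value(conserved), 3))  # close to 100
print(abs(round(ci_value(uniform), 3)))  # close to 0
print(round(ci_value(mixed), 3))  # somewhere in between
```

The two extreme cases reproduce the bounds given above, up to floating-point rounding.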

MatInd also defines a core region within the matrix: the four consecutive nucleotide positions with the highest sum of Ci-values. MatInspector uses this core region to preselect potential matches.
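Selecting the core amounts to a sliding-window maximum over the Ci-vector. A minimal sketch, with the hypothetical name `find_core` and made-up Ci-values; how MatInd resolves ties is not stated in the text, so this version simply keeps the leftmost best window:

```python
def find_core(ci_vector, core_length=4):
    """Return (start, end) of the window with the highest Ci-sum; end is exclusive."""
    if len(ci_vector) < core_length:
        raise ValueError("matrix is shorter than the core length")
    starts = range(len(ci_vector) - core_length + 1)
    # max() returns the first maximal element, i.e. the leftmost window on ties
    best_start = max(starts, key=lambda s: sum(ci_vector[s:s + core_length]))
    return best_start, best_start + core_length


# Illustrative Ci-values for an 8-position matrix (not real data).
ci_vector = [20, 35, 100, 100, 90, 100, 40, 10]
print(find_core(ci_vector))  # (2, 6)
```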

## MatInspector Library

MatInspector's library of more than 600 transcription factor binding site matrices was created with MatInd. It has been compiled from published matrices, with emphasis on sequences with experimentally verified binding capacity.

The MatInspector library also includes information on

• an optimized matrix threshold
• the family of each matrix

## MatInspector: Search for Matrix matches

MatInspector uses

• the core,
• the nucleotide distribution matrix,
• the Ci-vector,
• the optimized threshold,
• and the family information

to scan sequences of unlimited length for matches to the consensus matrix description.

1. The search starts with an optional preselection in which only matches to the core region are considered. This reduces the total number of matches and speeds up the program.

The core similarity is calculated for each position of the sequence:

 core_sim = (sum( score(b,j))) / (sum(max_score(j)))

where

• l is the length of the core region
• j=1..l
• score(b,j): matrix value for base b at position j
• max_score(j): max {score(b,j)} with b in A,C,G,T
• 0 <= core_sim <= 1
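The core similarity can be written directly from these definitions. In the sketch below each core position is a dict mapping a base to its matrix value; the function name `core_similarity` and the example values are hypothetical.

```python
def core_similarity(window, core_matrix):
    """Sum of matrix values for the observed bases over the best attainable sum."""
    if len(window) != len(core_matrix):
        raise ValueError("window and core must have the same length")
    achieved = sum(column.get(base, 0) for base, column in zip(window, core_matrix))
    best = sum(max(column[b] for b in "ACGT") for column in core_matrix)
    return achieved / best


# Illustrative 4-position core (values are made up).
core_matrix = [
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 8, "C": 0, "G": 0, "T": 2},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 9, "T": 1},
]
print(core_similarity("GACG", core_matrix))  # 1.0, the best base at every position
print(core_similarity("GTCG", core_matrix))  # 31/37, one weaker base at position 2
```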

2. The matrix similarity is calculated only if the core similarity reaches the user-defined core similarity threshold:

 mat_sim = (sum(Ci(j)*score(b,j)))/(sum(Ci(j)*max_score(j)))

where

• n: length of consensus-matrix
• j=1..n
• Ci(j): consensus index value of position j
• score(b,j): matrix-value for base b at position j
• max_score(j): max {score(b,j)} with b in A,C,G,T
• 0 <= mat_sim <= 1

The matrix similarity reaches 1 only if the candidate sequence corresponds to the most conserved nucleotide at each position of the matrix.

Multiplying each score by the Ci-value reflects the fact that mismatches at less conserved positions are more easily tolerated than mismatches at highly conserved positions.
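The effect of the Ci weighting is easy to see on a toy matrix. The following sketch uses a hypothetical function `matrix_similarity` and made-up matrix and Ci-values; it compares a mismatch at a weakly conserved position with one at a fully conserved position.

```python
def matrix_similarity(site, matrix, ci_vector):
    """Ci-weighted sum of observed matrix values over the Ci-weighted maximum."""
    achieved = sum(
        ci * column.get(base, 0)
        for base, column, ci in zip(site, matrix, ci_vector)
    )
    best = sum(
        ci * max(column[b] for b in "ACGT")
        for column, ci in zip(matrix, ci_vector)
    )
    return achieved / best


# Toy 3-position matrix; the middle position is weakly conserved.
matrix = [
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 4, "C": 2, "G": 2, "T": 2},
    {"A": 0, "C": 0, "G": 10, "T": 0},
]
ci_vector = [100, 10, 100]  # illustrative values, not computed from the matrix

perfect = matrix_similarity("AAG", matrix, ci_vector)
weak_mismatch = matrix_similarity("ACG", matrix, ci_vector)    # mismatch at position 2
strong_mismatch = matrix_similarity("CAG", matrix, ci_vector)  # mismatch at position 1
print(perfect, weak_mismatch, strong_mismatch)
```

With these numbers the mismatch at the weakly conserved position costs about 1% of the score, while the mismatch at the fully conserved position costs roughly half of it.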

The output of MatInspector consists of those matches that reach the user-defined minimum core and matrix similarity. Optionally, the optimized matrix threshold of each matrix can be used as the cut-off criterion instead.
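Steps 1 and 2 together form a two-stage filter, which might be sketched as follows. This is an assumption-laden illustration rather than MatInspector's implementation: the names `similarity` and `scan` are hypothetical, the matrix and Ci-values are made up, and only the given strand of the sequence is scanned.

```python
def similarity(window, columns, weights):
    """Weighted matrix score of a window divided by the weighted maximum score."""
    achieved = sum(w * col.get(b, 0) for b, col, w in zip(window, columns, weights))
    best = sum(w * max(col[x] for x in "ACGT") for col, w in zip(columns, weights))
    return achieved / best


def scan(sequence, matrix, ci_vector, core_start, min_core_sim, min_mat_sim,
         core_length=4):
    """Return (position, core_sim, mat_sim) for every window passing both cut-offs."""
    n = len(matrix)
    core_end = core_start + core_length
    core_columns = matrix[core_start:core_end]
    hits = []
    for pos in range(len(sequence) - n + 1):
        window = sequence[pos:pos + n]
        # Stage 1: unweighted core similarity as a cheap preselection.
        core_sim = similarity(window[core_start:core_end], core_columns,
                              [1] * core_length)
        if core_sim < min_core_sim:
            continue
        # Stage 2: Ci-weighted similarity over the whole matrix.
        mat_sim = similarity(window, matrix, ci_vector)
        if mat_sim >= min_mat_sim:
            hits.append((pos, core_sim, mat_sim))
    return hits


# Made-up 6-position matrix whose core (positions 1..4) prefers GACG.
matrix = [
    {"A": 1, "C": 0, "G": 0, "T": 9},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 10, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 10, "G": 0, "T": 0},
    {"A": 0, "C": 0, "G": 10, "T": 0},
    {"A": 3, "C": 2, "G": 2, "T": 3},
]
ci_vector = [70, 100, 100, 100, 100, 5]  # illustrative values
sequence = "CCTGACGTAAGGACTTT"

print(scan(sequence, matrix, ci_vector, 1, 0.75, 0.8))  # [(2, 1.0, 1.0)]
```

Lowering `min_mat_sim` to 0.6 in this example also reports the partial match at position 10, which passes the core preselection with three of four core bases.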

3. In a further step, MatInspector compares the matches of matrices that belong to the same family. When several matches of one family overlap, only the best of them is listed in the output.
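One possible way to implement such a family filter is a greedy pass over the matches in order of decreasing matrix similarity. The text does not specify how overlap is defined or how "best" is ranked, so the sketch below assumes that any shared sequence position counts as overlap and that the match with the highest matrix similarity wins; the function name, field names and example records are all hypothetical.

```python
def best_per_family(matches):
    """Keep, per family, only the best of each group of overlapping matches."""
    kept = []
    # Visit the strongest matches first so that weaker overlapping ones are dropped.
    for match in sorted(matches, key=lambda m: m["mat_sim"], reverse=True):
        overlaps_kept = any(
            other["family"] == match["family"]
            and match["start"] < other["end"]
            and other["start"] < match["end"]
            for other in kept
        )
        if not overlaps_kept:
            kept.append(match)
    return sorted(kept, key=lambda m: m["start"])


# Hypothetical matches; "end" is exclusive.
matches = [
    {"family": "FAM_A", "matrix": "m1", "start": 10, "end": 20, "mat_sim": 0.95},
    {"family": "FAM_A", "matrix": "m2", "start": 12, "end": 22, "mat_sim": 0.90},
    {"family": "FAM_B", "matrix": "m3", "start": 11, "end": 21, "mat_sim": 0.85},
    {"family": "FAM_A", "matrix": "m4", "start": 40, "end": 50, "mat_sim": 0.80},
]
print([m["matrix"] for m in best_per_family(matches)])  # ['m1', 'm3', 'm4']
```

Here m2 is dropped because it overlaps the stronger m1 from the same family, while m3 survives because it belongs to a different family.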