MatDefine is a tool for fully automatic definition and evaluation of weight matrices from a set of short DNA sequences. The resulting weight matrix can be used by MatInspector to scan nucleic acid sequences for matches to the described binding site.
The quality of a matrix is estimated by a value for random expectation (RE-value), which is defined as the number of matches with high matrix similarity (>= 0.85) expected in a random sequence of 1000 bp. This RE-value is assigned to each matrix.
By default, the weight matrix is generated without any user interaction. A protocol describing the matrix definition process is provided. The following steps are performed:
All default parameters can be changed. The following additional options are available:
In case of unsuitable input sequences, no matrix will be generated.
Sequence or Matrix Input  

Sequence Input  There are several ways to supply the sequence input
data for MatDefine:
In all cases the following sequence formats are accepted:
The sequence file may contain at most 5000 sequences.
Matrix Input 
Nucleotide distribution matrices
can either be entered directly into the input form or uploaded from
your computer.
The nucleotide distribution matrix consists of four lines. The
first line has to contain the numbers of A at each position of the
binding site, the 2nd, 3rd, and 4th line the numbers of C, G, T,
respectively. The following is an example of a nucleotide distribution matrix:

 4  0  0  0  1  7  0 22  0  0  0  0  7  5  4  3  4
 5  4  3  9  8  2 22  0  0  0 22 15  5  4  7  4  4
 2  4  4  2  1  7  0  0  0  0  0  0  2  7  3  6  3
 2  7  9 10 12  6  0  0 22 22  0  7  8  6  8  5  1

In this example matrix elements are separated by blanks.
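The four-line format described above can be read with a few lines of Python. This is only a sketch for checking a matrix before upload, not MatDefine's own parser; the function names `parse_distribution_matrix` and `consensus` are hypothetical.

```python
def parse_distribution_matrix(text):
    """Parse four blank-separated rows of counts, in the order A, C, G, T."""
    rows = [line.split() for line in text.strip().splitlines() if line.strip()]
    if len(rows) != 4:
        raise ValueError(f"expected 4 rows (A, C, G, T), got {len(rows)}")
    counts = {base: [int(x) for x in row] for base, row in zip("ACGT", rows)}
    if len({len(values) for values in counts.values()}) != 1:
        raise ValueError("all four rows must have the same number of positions")
    return counts


def consensus(counts):
    """Most frequent base per position (ties resolved in A, C, G, T order)."""
    length = len(counts["A"])
    return "".join(max("ACGT", key=lambda b: counts[b][i]) for i in range(length))


# The example matrix from this page, one nucleotide per line.
example = """
4 0 0 0 1 7 0 22 0 0 0 0 7 5 4 3 4
5 4 3 9 8 2 22 0 0 0 22 15 5 4 7 4 4
2 4 4 2 1 7 0 0 0 0 0 0 2 7 3 6 3
2 7 9 10 12 6 0 0 22 22 0 7 8 6 8 5 1
"""
matrix = parse_distribution_matrix(example)
print(len(matrix["A"]), consensus(matrix))
```

Checking that all four rows have the same length catches the most common input mistake, a count dropped from one line.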
Parameters  
Note: The following parameters are only relevant for sequence input files. If your input is a nucleotide distribution matrix these parameters will be ignored.  
Strand optimization  The strand optimization option
is useful if the orientation of the binding site is unknown.
In this case both strands of the input sequences are checked and either the "+" or the "-" strand is used for matrix definition.
If this option is selected, core-anchored alignment is used automatically. 
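Checking both strands amounts to comparing each input sequence with its reverse complement. The sketch below illustrates the idea only: how MatDefine decides between the strands is not described here, so the scoring criterion is passed in by the caller, and the toy criterion `core_count` (occurrences of a known core) is an assumption made for the example.

```python
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.translate(COMPLEMENT)[::-1]


def choose_strand(seq, score):
    """Return (strand, oriented sequence) for the higher-scoring strand.

    `score` is any caller-supplied function mapping a sequence to a number;
    on a tie the "+" strand is kept.
    """
    minus = reverse_complement(seq)
    if score(minus) > score(seq):
        return "-", minus
    return "+", seq


def core_count(seq):
    """Toy criterion (assumption): number of occurrences of the core CCAT."""
    return seq.count("CCAT")


print(choose_strand("GGATGGTT", core_count))
```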
Alignment and Tuple Search 
Core-anchored alignment
With core-anchored alignment, the best conserved core tuple is selected and the alignment is anchored at the first position of the core tuple in each sequence. The tuple selection algorithm is described in the CoreSearch paper. The following parameters can be modified:
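The anchoring idea can be illustrated with a much simpler stand-in for the tuple selection. The sketch below is not the CoreSearch algorithm: it merely picks the exact tuple that occurs in the largest number of sequences and shifts the sequences so that its first occurrence lines up. The function names and the tuple length `k=4` are assumptions.

```python
from collections import Counter


def most_shared_tuple(seqs, k=4):
    """Tuple of length k that is present in the largest number of sequences."""
    presence = Counter()
    for s in seqs:
        # Use a set so each tuple is counted once per sequence.
        presence.update({s[i:i + k] for i in range(len(s) - k + 1)})
    # Break ties alphabetically so the result is deterministic.
    return min(presence, key=lambda t: (-presence[t], t))


def anchor_alignment(seqs, core):
    """Left-pad sequences so the first occurrence of `core` is in one column.

    Sequences that do not contain the core are returned separately.
    """
    hits = [(s, s.find(core)) for s in seqs]
    anchored = [(s, p) for s, p in hits if p >= 0]
    missing = [s for s, p in hits if p < 0]
    offset = max((p for _, p in anchored), default=0)
    aligned = [" " * (offset - p) + s for s, p in anchored]
    return aligned, missing


seqs = ["GTCCATATTA", "TCCATATAT", "AACCATGG"]
core = most_shared_tuple(seqs)
aligned, missing = anchor_alignment(seqs, core)
print(core)
print("\n".join(aligned))
```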

Unanchored alignment
Unanchored alignment means that the search for a common core sequence will be omitted. You should use this option if your sequences do not contain a highly conserved core sequence. 

QuickAlign
QuickAlign is an alignment without introducing gaps. It is about three
times faster than the default alignment. QuickAlign can be used for core-anchored as well as for unanchored alignment. 

Matrix Creation
Cut off matrix ends
MatDefine automatically determines the correct length of the matrix by
cutting off poorly conserved positions at both matrix ends. For example,
the matrix length has to be reduced if the input sequences differ
considerably in length or contain sequence flanking the binding site. 
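One common way to quantify how well a matrix position is conserved is its information content in bits. The sketch below trims the matrix ends on that basis. It is an illustration only: the criterion MatDefine actually applies is not stated here, and the function names and the `min_bits` cutoff are assumptions.

```python
import math


def information_content(column):
    """Information content in bits (0 to 2) of one column of A, C, G, T counts."""
    total = sum(column)
    if total == 0:
        return 0.0
    bits = 2.0
    for count in column:
        if count:
            p = count / total
            bits += p * math.log2(p)
    return bits


def trim_ends(columns, min_bits=0.5):
    """Return (start, end) so that columns[start:end] has no weak end positions.

    Only the two ends are trimmed; weak positions inside the matrix are kept.
    """
    start, end = 0, len(columns)
    while start < end and information_content(columns[start]) < min_bits:
        start += 1
    while end > start and information_content(columns[end - 1]) < min_bits:
        end -= 1
    return start, end


columns = [(5, 5, 5, 5), (20, 0, 0, 0), (0, 20, 0, 0), (5, 6, 4, 5)]
print(trim_ends(columns))
```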
Remove identical sequences
Identical sequences (i.e. one sequence is equal to, or part of, another sequence) can be removed to avoid a biased nucleotide distribution matrix. If the sequences differ in length, the shorter one is removed. Regardless of this option, MatDefine always lists identical sequences in the output file. 
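The rule above (equal or contained, shorter sequence removed) can be sketched as follows. This is a minimal illustration rather than MatDefine's implementation; the function name `remove_contained` is hypothetical, and the sketch compares sequences on the given strand only.

```python
def remove_contained(seqs):
    """Drop sequences that equal, or are part of, another sequence.

    `seqs` maps sequence names to sequences. Returns (kept, identical_to),
    where `identical_to` maps each removed name to the name it duplicates.
    """
    # Longest first, so the shorter member of a pair is the one removed.
    order = sorted(seqs, key=lambda name: -len(seqs[name]))
    kept, identical_to = {}, {}
    for name in order:
        match = next((k for k, s in kept.items() if seqs[name] in s), None)
        if match is None:
            kept[name] = seqs[name]
        else:
            identical_to[name] = match
    return kept, identical_to


seqs = {
    "a": "GTCCATATTAGG",
    "b": "CCATATTA",      # part of "a"
    "c": "GTCCATATTAGG",  # equal to "a"
    "d": "TTTT",
}
kept, identical_to = remove_contained(seqs)
print(list(kept), identical_to)
```

Returning the `identical_to` mapping alongside the kept sequences mirrors the protocol, which reports identical sequences whether or not they are removed.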

Calculate optimized threshold
The optimized threshold of a weight matrix is the matrix similarity threshold that minimizes false positive matches when the matrix is used to scan sequences with MatInspector. It is defined such that at most 3 matches are found in 10,000 bp of non-regulatory test sequences (i.e. with the optimized threshold no more than 3 false positives per 10,000 bp are found). Since the calculation of the optimized threshold requires some computing time, it can be omitted for test runs. 
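Given the matrix similarity of every match found in a set of non-regulatory test sequences, the definition above reduces to a small search. The sketch below assumes these scores have already been computed elsewhere; the function name, the 0.01 grid and the example scores are illustrative assumptions, not MatDefine's procedure.

```python
def optimized_threshold(scores, background_bp, max_matches=3, per_bp=10_000):
    """Lowest threshold (on a 0.01 grid) with at most `max_matches` matches
    per `per_bp` bp of background sequence.

    `scores` holds the matrix similarity of each match in the background.
    """
    allowed = max_matches * background_bp / per_bp
    for i in range(101):
        threshold = i / 100
        if sum(s >= threshold for s in scores) <= allowed:
            return threshold
    return None  # even a threshold of 1.00 yields too many matches


# Made-up background scores, for illustration only.
scores = [0.95, 0.91, 0.88, 0.84, 0.80, 0.79, 0.70]
print(optimized_threshold(scores, background_bp=10_000))
```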

Consistency Check
Minimum number of sequences
This is the minimum number of sequences required to define a matrix. If the input file contains fewer sequences, or fewer sequences remain after the rejection process, no matrix will be created. 
Minimum matrix similarity
MatDefine generates a weight matrix which is consistent with its training set, i.e. all training sequences have to be identified by the resulting matrix. Therefore, only sequences that reach the minimum matrix similarity are included in the matrix; all other sequences are rejected. Decreasing the minimum matrix similarity may lead to the inclusion of more sequences but can also reduce the quality of the matrix. Increasing the minimum matrix similarity may lead to the rejection of more sequences. If too few sequences are retained, no matrix will be created (see minimum number of sequences). 
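The interplay of the two consistency parameters can be sketched as a loop that rebuilds the matrix until every remaining sequence reaches the minimum similarity. The sketch rests on assumptions: `toy_similarity` is a simple stand-in and not MatInspector's matrix similarity, the repeated rebuilding is one plausible reading of the rejection process, and the default values are made up.

```python
def count_matrix(seqs):
    """Per-position base counts for aligned sequences of equal length."""
    length = len(seqs[0])
    return [{b: sum(s[i] == b for s in seqs) for b in "ACGT"}
            for i in range(length)]


def toy_similarity(seq, matrix):
    """Toy score in [0, 1]: counts of the observed bases divided by the
    counts of the most frequent bases. NOT MatInspector's matrix similarity."""
    best = sum(max(col.values()) for col in matrix)
    return sum(col[b] for col, b in zip(matrix, seq)) / best


def consistent_training_set(seqs, min_similarity=0.80, min_sequences=3):
    """Reject sequences below `min_similarity` until the set is consistent.

    Returns (matrix, kept); matrix is None if too few sequences remain.
    """
    kept = list(seqs)
    while len(kept) >= min_sequences:
        matrix = count_matrix(kept)
        passing = [s for s in kept
                   if toy_similarity(s, matrix) >= min_similarity]
        if len(passing) == len(kept):
            return matrix, kept
        kept = passing
    return None, kept


seqs = ["CCATAT", "CCATAT", "CCATAA", "CCATAT", "GGTACG"]
matrix, kept = consistent_training_set(seqs)
print(kept)
```

With these inputs the outlier `GGTACG` is rejected and the remaining four sequences define the matrix; raising `min_sequences` to 5 would leave no matrix at all.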

Library Comparison
Selection of matrix groups
Here, you can select the matrix groups from the current MatInspector library with which the newly generated matrix should be compared. By default, all matrix groups including your user-defined matrices (if available) are selected. Please note that the library comparison cannot be completely disabled: if you do not select at least one matrix group, the new matrix will be compared with all available matrix groups. These parameters are hidden by default and have to be expanded on the input form to become visible.

Check all sequences
If this option is set, all input sequences will be checked against the matrix groups selected above (using the optimized matrix similarity threshold). These parameters are hidden by default and have to be expanded on the input form to become visible.


Your email address  Here you can choose between two methods for receiving
the results:
The results will be available for a limited time on our server. For details of how long your results will be kept, please see the result email. After that period they will be deleted unless protected in the project management! 
MatDefine creates a protocol detailing each step of the matrix generation,
and the weight matrix which is used by MatInspector. The sequence logo of the matrix can be downloaded in various graphics formats, e.g. for inclusion in a scientific publication.
The resulting matrix can be saved to your personal
matrix library (user-defined library).
The protocol file contains
Sequence  Identical to 

HSFOS  MMCFOS 
XLACTIN5A  XLACTIN8A 
Core sequence:  CCAT 
Number of aligned sequences:  20 
Number of rejected sequences:  0 
Sequence Name  Position  Str.  Alignment               Matrix Similarity

MMTFEZIF2      4 - 23    (+)   CG CCAT ATAAGGAGCAGGAA  0.855
MMTFEZIF1      4 - 23    (+)   CG CCTT ATATGGAGTGGCCC  0.901
HSACTCA2       4 - 23    (+)   GA CCAA ATAAGGCAAGGTGG  0.865
XLACTCAG3      4 - 23    (+)   TA CCAA ATAAGGGCAGGCTG  0.872
EBV            4 - 23    (+)   AG CCAT ATGTGGACAGATGG  0.941
GGACAREG1      4 - 23    (+)   CG CCTT CTTTGGGCAGCGCG  0.857
GGACAREG2      3 - 22    (+)   AC CCAA ATATGGCGACGGCC  0.904
HSACTBPR       3 - 22    (+)   GT CCTT ATATGGACTCATCT  0.914
HSVLC1         4 - 23    (+)   AT CCTT TTATGGCCCTGTCC  0.866
MMCYR61G       4 - 23    (+)   AC CCAA ATATGGAAATATTG  0.897
XLACTIN8A      4 - 23    (+)   GC CCAT ATTTGGCGATCTTC  0.955
XLACTIN5A      4 - 23    (+)   GC CCAT ATTTGGCGATCTTC  0.955
XLACTCAG1      3 - 22    (+)   AT CCCT ATTTGGCCATCCCT  0.926
HSACTCA3       3 - 22    (+)   CT CCCT ATTTGGCCATCCCC  0.940
HSACTCA4       3 - 22    (+)   TT CCTT ACATGGTCTGGGGG  0.843
XLACTCAG2      3 - 22    (+)   TT CCAT ACATGGGCTAAGGG  0.854
MMCFOS         3 - 22    (+)   GT CCAT ATTAGGACATCTGC  0.945
HSFOS          3 - 22    (+)   GT CCAT ATTAGGACATCTGC  0.945
MMKROX1        3 - 22    (+)   GT CCAT ATATGGGCAGCGAC  0.978
MMTFEZIF       3 - 22    (+)   TC CCAT ATATGGCCATGTAC  0.988
Matrix U$srf  

Matrix Name:  U$srf  
Description:  not yet available  
Family:  U$NO_NAME  
References:    
Statistical Basis:  20 sequences  
Random Expectation (revalue):  0.02 matches per 1000 bp  
Promoter Matches:  not available  
Optimized Matrix Threshold:  0.77  
Length:  21 bp  
Nucleotide Distribution Matrix: 
 
Profile: 
 
Sequence Logo: 
Download this logo as: png, pdf or eps 
In case you want to save the resulting matrix to your personal library, some more information has to be entered:
Matrix Identification  

Matrix Identification  The matrix identification consists
of

Family Information  Each matrix belongs to a so-called matrix
family, where functionally similar matrices are grouped together
in order to eliminate redundant matches by MatInspector.
You can

Extra Information / References  Here you can enter further information which will be stored in the References field of the matrix. 
MatDefine is described in:
© 2022 Precigen Bioinformatics Germany GmbH  All rights reserved 