Core similarity | The "core sequence" of a matrix is defined as the (usually 4) consecutive most highly conserved positions of the matrix. The core similarity is calculated as described here and in the MatInspector paper. The maximum core similarity of 1.0 is reached only when the most highly conserved bases of the matrix are matched exactly in the sequence. The matrix similarity, which takes into account all bases over the whole matrix length, is more important than the core similarity. |
---|---|
Matrix similarity | The matrix similarity is calculated as described here and in the MatInspector paper. A perfect match to the matrix gets a score of 1.00 (each sequence position corresponds to the most highly conserved nucleotide at that position in the matrix); a "good" match to the matrix usually has a similarity of >0.80. Mismatches in highly conserved positions of the matrix decrease the matrix similarity more than mismatches in less conserved regions. In MatInspector, a green background in the matrix similarity column marks a similarity above optimized (i.e. a "good" match); a red background marks a similarity below optimized (e.g. if a search was started using "optimized - 0.02"). |
Model similarity | The model similarity is the percentage of the individual elements of a model that have to be present. A model similarity of 100% means that all elements of the model were found. |
Free energy | The free energy (given in kcal/mol) is a thermodynamic parameter for the stability of secondary structures (hairpins). The larger the absolute value of the free energy (i.e. the more negative the free energy of folding), the more stable the hairpin. |
Match rate | The match rate is the number of matching base pairs in percent of the total element length. This score applies to direct repeats, short multiple repeats, and terminal repeats. |
p-value for common TFs | The p-value is the probability of obtaining an equal or greater number of sequences with a match in a randomly drawn sample of the same size as the input sequence set. The lower this probability, the greater the significance of the observed common TFs. Note: The p-values for common TFs are based on the pre-calculated promoter matches. Therefore, the p-values are only correct if sequences with an average length of about 1100 base pairs (600 base pairs up to ElDorado 02-2016) are searched for TF sites with optimized matrix similarity. |
p-value for models | Calculating the p-value for models is optional. To determine this specificity score, a background set of 5000 human promoters (average length 1169 bp; 635 bp up to ElDorado 02-2016) is scanned with the models generated by FrameWorker. The results of this search are used to check whether the models can also be found in a set of randomly selected promoters. The p-value for models is the probability of obtaining an equal or greater number of sequences with a model match in a randomly drawn sample of the same size as the input sequence set. The lower this probability, the higher the specificity of the model. Note: The p-value for models is not an absolute quality measure. It is useful for ranking models that are determined for one set of sequences, but it cannot be used to compare the quality of models derived from different sets of training sequences, because the score depends on the number of training sequences, the number of sequences in which model matches are found, and the length of the sequences. |
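
Both p-values above are described as the probability of seeing at least as many matching sequences in a random sample of the same size as the input set. The table does not name the underlying distribution. The sketch below assumes the sample is drawn without replacement from a background promoter set, which gives a hypergeometric upper tail. The function name `tail_p_value` and all parameter names are hypothetical and are not part of any Genomatix tool.

```python
from math import comb


def tail_p_value(background_size, background_hits, sample_size, observed_hits):
    """Upper-tail probability P(X >= observed_hits).

    Assumption (not stated in the source): the random sample is drawn
    without replacement from the background set, so X is hypergeometric.

    background_size -- number of background promoters (e.g. 5000)
    background_hits -- background promoters that contain a match
    sample_size     -- number of sequences in the input set
    observed_hits   -- input sequences that contain a match
    """
    if not 0 <= background_hits <= background_size:
        raise ValueError("background_hits must lie in [0, background_size]")
    if not 0 <= sample_size <= background_size:
        raise ValueError("sample_size must lie in [0, background_size]")
    if not 0 <= observed_hits <= sample_size:
        raise ValueError("observed_hits must lie in [0, sample_size]")

    # Number of ways to draw any sample of the given size.
    all_samples = comb(background_size, sample_size)
    # A sample cannot contain more hits than exist in the background.
    max_hits = min(sample_size, background_hits)
    # Count samples that contain observed_hits or more matching sequences.
    favourable = sum(
        comb(background_hits, i)
        * comb(background_size - background_hits, sample_size - i)
        for i in range(observed_hits, max_hits + 1)
    )
    return favourable / all_samples


if __name__ == "__main__":
    # Illustrative numbers only: 20 input sequences, 8 of them with a match,
    # against a background of 5000 promoters of which 250 contain a match.
    print(tail_p_value(5000, 250, 20, 8))
```

Under this assumption, a smaller return value means that the observed number of matches is less likely to arise from a random promoter sample. As the note for models states, such values are only comparable within one input set.
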
© 2022 Precigen Bioinformatics Germany GmbH - All rights reserved |