Genomatix: Comparative MicroRNA analysis
(only available on GGA)
This task analyzes count data and performs a
comparative analysis of microRNA (or more generally noncoding RNA) expression.
Input for this task can be
If two data sets are supplied (e.g. treatment vs. control, or condition A vs. condition B,
or tissue A vs. tissue B), the task performs a differential analysis, i.e.
calculates lists of significantly up and downregulated microRNAs/noncoding RNAs.
If replicates for treatment and control data are available, the user can select from different
methods like 'DESeq' or 'edgeR' to calculate the differential expression.
If the expression values for two different conditions (here called "treatment" and "control" for simplicity)
are to be compared, the following statistical testing methods for evaluating differential expression are available:
 the procedure introduced by Audic and Claverie in
Audic S, Claverie JM (1997)
The significance of digital gene expression profiles
Genome Res. 1997;7(10):986995
 DESeq,
an R package based on
Anders S, Huber W (2010)
Differential expression analysis for sequence count data
Genome Biology 2010;11:R106
 DESeq2,
an improved version of DESeq described in
Huber W, Love M, Anders S (2014)
Moderated estimation of fold change and dispersion for RNASeq data with DESeq2
bioRxiv preprint, 2014. doi:10.1101/002832
 edgeR,
an R package based on the ideas presented in
Robinson MD, Smyth GK (2007)
Moderated statistical tests for assessing differences in tag abundance
Bioinformatics 2007;23(21):28812887
Robinson MD, Smyth GK (2008)
Smallsample estimation of negative binomial dispersion, with applications to SAGE data
Bioinformatics 2008;9(2):321332
Robinson MD, Oshlack A (2010)
A scaling normalization method for differential expression analysis of RNAseq data
Genome Biology 2010;11:R25
While the AudicClaveriemethod does not handle replicates, 'DESeq2', 'DESeq' and 'edgeR' were developed specifically for
replicate data. Moreover, edgeR cannot be used if there are no replicates available.
Audic and Claverie introduced a formula to compute a conditional probability for observing N reads (treatment)
in a class given that M reads were observed before (control).
These pvalues, in combination with the Genomatix normalized expression (NE) value
are used to evaluate differential expression.
The 'DESeq2', 'DESeq' and 'edgeR' methods model count data (here the number of reads from an RNASeq experiment mapped to a transcript) by a
negative binomial distribution. The parameters of the distribution (mean and dispersion) are estimated from the data,
i.e. from the read counts in the input files. Each method computes a measure of read abundance, i.e. expression level
(called 'base mean' or 'mean of normalized counts' in DESeq/DESeq2, and 'concentration' or 'countspermillion' in edgeR) for each transcript and apply a
hypothesis test to each transcript to evaluate differential expression. In particular, the three methods determine a (adjusted) pvalue and a log2 fold change
(in expression level) for each transcript.
One parameter can be set for DESeq: the dispersion estimates are found by fitting a curve through the pertranscript dispersion estimates. The way this
fitting is done can be specified to be either 'parametric' (the default in DESeq) or 'local'.
Default settings are used for the other parameters, in particular single pooled values are used as empirical dispersion estimates, and the maximum of the empirical
and fitted values is used the dispersion for a transcript resp. cluster. If there are no replicates, the settings are changed to the 'blind' method for computing the empirical
dispersion estimates, and the fitted dispersion values are used. Sometimes, the parametric fitting fails, and in this case, the analysis should be
rerun with the 'fitting method' set to 'local'. For details please refer to the
DESeq vignette.
For DESeq2, two parameters are settable: The testing for differential expression can either be done with a Wald test or a Likelihoodratio test.
The former is the default testing method in DESeq2, while the latter is the one in use for DESeq. The other settable parameter is  as for DESeq  the
fitting method used in dispersion estimation. See the
DESeq2 vignette for details.
edgeR normalizes the count data using the TMM (trimmed mean of Mvalues) method introduced by Robinson and Oshlack. All parameters used in the edgeR
algorithm as set to their respective default values. In particular, tagwise (i.e. pertranscript) dispersion estimation is used, with the tagwise dispersions squeezed towards the
common dispersion, as described in the
edgeR vignette.
Before the analysis, any transcripts without any mapped reads are removed from the dataset, i.e. from the input file
91.input_replicate_analysis, all transcripts are removed, which have a read count of '0' in all samples. In the output file
92.output_replicate_analysis, these transcripts are listed with the value 'NA' in each output column except the 'id' (pvalue, foldchange etc.).
For defining up and downregulated transcripts between two conditions or samples,
the following criteria are used (parameters set by the user):
Note that the first input set is regarded as "treatment", whereas the second input file is used as "control",
i.e. "upregulation" refers to a higher expression in set1 than in set2.
Also note that the direction of up and downregulation will change if the two data sets are
exchanged in the input.
Input 
ncRNA file(s) for condition 1 and 2 
MicroRNA/noncoding input data is accepted if it is in one of the following two formats:
 tabseparated file, containing an ID for microRNA/ncRNA in column 1 and count data (integer) in column 2
Lines starting with a # are treated as comments and are ignored.
Also, any additional values found after column two will be ignored.
Note: When several files are compared (sample/control or replicates),
the same number of IDs should occur in all input files.
IDs that have no values in all input files will be deleted from the analysis.
Example:
#miRNA read count
hsamiR1015p 0
hsamiR1013p 14
hsamiR103a3p 1636
hsamiR103a25p 7
 output files from the
smallRNA mapping task
Here, the input files are tabseparated files consisting of exactly 10 columns generated by a
mapping against the smallRNA library. There, the mapping result is classified into 10 classes
of noncoding RNA according to the sequence ontology project (sequenceontology.org).
These files are called <smallRNA class>_nc.tsv and contain a listing of the total read counts
for each small RNA in the library, with genomic positions for all organisms.
Example:
#org chr posfrom posto strand GXN name total hits relative hits annotation
hsa chr9 97848308 97848330 0 GXN_144 hsamiR241* 74 Infinity hsamiR241* MIMAT0000079 Homo sapiens miR241*
hsa chr9 97848344 97848365 0 GXN_8422 hsamiR30745p 176481 Infinity hsamiR30745p MIMAT0019208 Homo sapiens miR30745p
hsa chrX 73438265 73438286 0 GXN_1233 ecamiR421 2 Infinity ecamiR421 MIMAT0013216 Equus caballus miR421
hsa chrX 73438391 73438413 0 GXN_2923 hsamiR374b* 14 Infinity hsamiR374b* MIMAT0004956 Homo sapiens miR374b*
For this GGA task the microRNA_m_nc.tsv files are relevant.
These files contain the read counts for the mature (2123 bp) and biologically
active microRNAs as annotated in the miRBase database. The "microRNA_h_nc.tsv"
files, in contrast, examine the genomic hairpin structures (7080 bp) from
which the mature microRNAs are eventually generated via the Dicer enzyme process.
You can either
 upload one or several ncRNA files from your local computer
 or import the file(s) from the GGA (see more details)
Note: If multiple files are uploaded, they are treated as replicates for one condition.

Organism 
Filter 
Depending on the type of input, this parameter will have different consequences:
 For tabseparated input files with ID and count data only, this organism will influence the links
to the Genomatix GenomeBrowser in the output.

For input files from the smallRNA mapping task the following applies:
Since the smallRNA Mapping reports matches to microRNAs in all available organisms,
they need to be filtered in a first step to the pertinent species.
This species can be set here.
Hairpin microRNA sequences that align to 100% in the genome of the target organism are included
even if they were originally found in another organism. Please refer to the smallRNA mapping help page for a more
indepth explanation of these "transmapped hairpin microRNA sequences".

Significance Analysis 
Differential Analysis Parameters 
The differential analysis parameter section will only appear,
if at least one control file was uploaded in the section above.
There are four available algorithms for calculating the differential expression/enrichment values:
 DESeq (recommended for replicate data, but does work on nonreplicates, too)
It is possible to select the 'fitting type' parameter for DESeq, i.e. the way how the curve is fitted through the
dispersion estimates. For details on the meaning of this parameter please refer to the DESeq vignette.
 DESeq2 (recommended for replicate data, but does work on nonreplicates, too)
As for DESeq, it is possible to select the 'fitting type' parameter for DESeq2, i.e. the way how the curve is fitted through the
dispersion estimates.
Additionally, DESeq2 offer two alternative methods for testing for differential expression: Wald test and Likelihoodratio test
(with Wald test being the default).
For details on the meaning of the parameters please refer to the DESeq2 vignette.
 edgeR (recommended for replicate data, does not work on nonreplicates)
 the procedure introduced by Audic and Claverie (if no replicates are available)
The thresholds that define a transcript as differentially expressed (or a region as enriched/depleted) can be set here.
There are two criteria, that are combined (both must be satisfied for differential expression/enrichment):
 an adjusted pvalue threshold for the significance of observing the detected change
Note that the pvalues calculated by the different methods (DESeq/DESeq2, edgeR, AudicClaverie) can differ.
Also note, that setting the pvalue to 1 allows skipping of this criterium.
 a threshold for the log2 fold change of expression/enrichment level
A log2 ratio of 1 is a fold change of 2; a log2 ratio of 0.585 is a fold change of 1.5;
e.g. if the log2 fold change of expression/enrichment is set to ≥ 1, the expression values must go up
by at least 100% to appear in the differentially expressed transcripts/enriched regions list.
The log2 fold change thresholds can be set separately for up and downregulation (enrichment/depletion).
Note, that by setting the log2 fold change thresholds to 0, fold changes are ignored in the analysis.

Principal Component Analysis 
PCA 
If this parameter is set, a Principal Component Analysis (PCA) is automatically started on the condition / control files in order to identify subgroups or outliers.

Output 
Result 
Here, you can edit the default name of the result file. 
Email address 
Here you can choose between two methods for receiving
the results:
 Show result directly in browser window
In this option the URL of the result is directly shown in your browser
window.
Warning: Please use this option
only for analyses which can be performed in a short time.
If the analysis takes longer than the timeout of the webserver, the
connection will be terminated and you will receive an error message
(e.g. "The document contained no data."). In this case, the results will
not be available, please restart the analysis using the option
below "Send the URL of the result to".
 Send the URL of the result via email
In this option an email with the URL of the results will be sent
to the user provided email address, when the analysis is finished.
The results will be available for a limited time on our server.
For details of how long your results will be kept please see the resultemail.
After that period they will be deleted unless protected in the project management!

The output consists of the following:
 Analysis Parameters
 Differential ncRNA Analysis
 PCA Overview, if parameter PCA is set
 PCs (principal components), if parameter PCA is set
 Download of Data Files
The result sections are described in detail below.
 The ElDorado version used
 Result name
 Name of input file(s) and control file(s)
 The method and version used for differential expression analysis
 The thresholds used for differential expression analysis
For the differential expression analysis, a comparison of the expression values of the two input data sets
(treatment versus control, possibly each with replicate data) is done.
First the comparison is done on microRNA/ncRNA level by the selected method (DESeq, edgeR, or Audic/Claverie),
i.e. each microRNA is checked, whether it fulfills the userdefined thresholds regarding
 minimum log2 fold change of expression level for treatment (set1) compared to control (set2)
 the pvalue threshold for such a change
Note that the log2 fold change values cannot be calculated under certain conditions
(e.g. if no expression is detected for a transcript in the control set).
Such cases are indicated by a "Inf", "Inf" or "NA" value in the output.
Overview
The following numbers are given in an overview table
 total number of analyzed ncRNAs
 differentially expressed ncRNAs
The download links below the numbers allow accessing tabseparated data files (suffix *.tsv),
containing details like microRNA name, read numbers, pvalue, log2 fold change for each microRNA;
for details of the format see below.
Those microRNA, that were found to be significantly expressed (by user criteria) are listed in a table:
Please note, that hairpin microRNA sequences that align to 100% in the genome of the target organism are included
even if they were originally found in another organism.
Most columns of the table can also be found
in the 10.microRNA_summary.tsv (
for details see below). They are shortly explained here:
microRNA name 
name of the microRNA 
Annotation 
Annotation details for this microRNA.
A microRNA originally found and annotated for one organism may raise a signal in an experiment
involving another organism, due to sequence similarity or for biological reasons. 
pvalue 
pvalue (depends on the selected method) 
adj. pvalue 
adjusted pvalue(depends on the selected method) 
log2(fold change) 
log2(expression value of control data set / expression value of treatment data set)
Note, that this value can be Inf/+Inf if one of the conditions shows no expression 
Regulation  Regulation of treatment (set1) compared to control (set2), (values can be "up", "down", "no") 
rel.hits 
relative number of hits for this microRNA for each replicate from the treatment sets and the control sets 
Genome 
link to the Genomatix GenomeBrowser
for a graphical display of the region(s) where the corresponding hairpin structure
of this microRNA is located. This can be relevant in examining the possible regulation
of this microRNA.
If there are multiple hairpin locations in the genome, there will be a link to each
of them in the GenomeBrowser.
For a given mature microRNA, a hairpin location is not necessarily known.

The table can be interactively sorted by various criteria like pvalue or up/downregulation.
This is done by clicking on the corresponding column header.
The width of a column can be changed by dragging the divider beside the column header.
Plots
Depending on the method used for differential analysis, scatter and volcano plots are displayed.
Note:
The graphics are at first depicted as small icons, which can be enlarged by clicking on them.
For a detailed explanation of the various plots please
refer to the help page of the differential expression task.
Overview output of the Principal Component Analysis.
The top loadings for the first principal components accounting for 90% of the variance in the data.
All result files that can be downloaded separately from the result page
together with the statistics files (in text format) can be downloaded
as an archive (tarfile).
Here is an overview of results and their corresponding file names (for format details see below):
Details on  Filename 
Main results: 
all analyzed microRNAs 
10.ncRNA_summary.tsv 
differentially expressed microRNAs 
11.significant_ncRNA.tsv 
If differential analysis was selected: 
count data used for the test method 
91.input_replicate_analysis 
library size, i.e. the total read numbers 
91.input_replicate_analysis.libsize 
result from the test method 
92.output_replicate_analysis 
If PCA analysis was selected: 
Loadings plot for the first two PCs 
Loadings.png 
Information displayed in loadings tables 
Loadings.xml 
Plot of the top 40 loadings for PCs contributing to 90% of variance (maximum 10) 
Loadings_PCx.png 
Score plot for first two PCs (in png and pdf format) 
ScorePlot2D.png /.pdf 
Scree plot for all computed PCs 
ScreePlot.png 
Information displayed in overview tables 
Statistics.xml 
Loadings for each transcript/locus for each PC calculated 
loadings.tsv 
Scores for all samples for each PC calculated 
scores.tsv 
Plots: 
MA plot 
93.MA_plot.png 
Volcano plot of adjusted pvalues 
94.volcano_plot.png 
Histogram of unadjusted pvalues 
95.pvalue_histogram.png 
Dispersion plot (for DESeq/DESeq2) 
96.dispersion_plot.png 
BCV plot (for edgeR) 
96.bcv_plot.png 
Data files for all analyzed microRNAs and
those containing only significant microRNAs
1: Genomatix microRNA Id
2: microRNA name
3: Annotation details for this microRNA
4: pvalue (depends on the selected method)
5: adjusted pvalue(depends on the selected method)
6: log2(fold change), i.e. log2(expression value of control data set / expression value of treatment data set),
note, that this value can be Inf/+Inf if one of the conditions shows no expression
7: Regulation of treatment (set1) compared to control (set2), (values can be "up", "down", "no")
8: was the microRNA found to be significant (this value depends on user settings for the analysis)
the following columns depend on the number of input files:
 number of reads for each replicate from the treatment sets and the control sets
 relative number of hits for this microRNA for each replicate from the treatment sets and the control sets
Here is an example of the output format:
GX_ID microRNA name Annotation ...
GXN_6537 btamiR1814a btamiR1814a MIMAT0011874 Bos taurus miR1814a ....
GXN_6624 btamiR2404 btamiR2404 MIMAT0011961 Bos taurus miR2404 ....
... pvalue adj. pvalue log2 (foldchange) Regulation significant? ...
... 1.06554329607119e06 0.00042195514524419 3.40094966114274 down 1 ...
... 5.34623064669077e07 0.000264638417011193 5.44369293283376 down 1 ...
...#hits treat1 #hits ctrl1 rel.hits treat1 rel.hits ctrl1
... 1276 4955 0.00242099511247382 0.0255730056410283
... 1 16 1.89733159284782e06 8.25768093353083e05
Data files if one of the test methods 'edgeR', 'DESeq' or 'DESeq2' was selected
 91.input_replicate_analysis
containing the count data that was used as input for the test method
 91.input_replicate_analysis.libsize
containing the library size, i.e. the read numbers of the various input files
 92.output_replicate_analysis
contains values computed from the Rscript, depending on the chosen method for testing differential expression (DESeq, DESeq2 or edgeR).
The input transcripts are filtered such that any transcripts with a read count total of 0 (i.e. no reads in any sample) are omitted
from the analysis. These transcripts are listed in the the output file with all values (except the 'id' column) set to 'NA'.
1. id: Genomatix Transcript Id
2. pvalue: pvalue resulting from the hypothesis test of the selected test method for differential
expression (DESeq, DESeq2 or edgeR; for DESeq2, the pvalues are either from a Wald or LRT test)
3. adj. pvalue: BenjaminiHochberg adjusted pvalue (from column 2)
4. log2FoldChange: logarithmic (base 2) foldchange in read abundance/expression level in treatment
over control (> 0 is enrichment in treatment, < 0 is decrease in treatment); for DESeq and DESeq2, the
foldchange computation is based on the base mean, for edgeR on the concentration
for DESeq the remaining columns are
5. baseMean: mean expression level across all replicates, treatment and control
6. baseMean control: mean expression level within control group
7. baseMean treatment: mean expression level within treatment group
8. variance ratio control: ratio of the estimate of the base variance of the counts for the control
group and the value predicted with the base variance function; according to the package authors,
a large value may indicate a false hit;
see the Rvignette for details (within the vignette, this value is referred to as 'resVarA')
9. variance ratio treatment: as column 8, just for the treatment group; within the DESeq vignette,
this value is referred to as 'resVarB'
for DESeq2 the remaining columns are
5. baseMean: mean expression level across all replicates, treatment and control
6. standard error
7. Wald/LRT statistic
for edgeR the remaining column is
5. log countspermillion: expression level, logarithmic (base 2)
© 2019 Intrexon Bioinformatics Germany GmbH  All rights
reserved 