Principal component analysis (PCA) is a statistical procedure that can be used for exploratory data analysis. PCA uses linear combinations of the original variables (e.g. gene expression values) to define a new set of uncorrelated variables (principal components). These new variables are orthogonal to each other, which avoids redundant information.
Thus, PCA can be used to reduce the dimensions of a data set, so that data sets and their variance can be described with a reduced number of variables. Since similar data sets lie close together when projected into the space defined by the principal components, PCA can also be used to identify outliers with respect to the principal components.
It is often sufficient to look at the first two components, as these describe the largest variability.
A more detailed description of PCA is available on Wikipedia.
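As a rough illustration of the idea, the following sketch finds the first principal component of a tiny data set by power iteration on the covariance matrix. It is not the method used by this task (which relies on the R package pcaMethods); the function name and the toy data are hypothetical and chosen only for this example.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, scores) of the first PC for a list of samples."""
    n, d = len(rows), len(rows[0])
    # Center each variable (column) on its mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Scores are the projections of the centered samples onto the direction.
    scores = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, scores

# Hypothetical toy data: 4 samples x 2 variables lying close to a diagonal line.
samples = [[1.0, 1.1], [2.0, 1.9], [3.0, 3.2], [4.0, 3.8]]
direction, scores = first_principal_component(samples)
```

Here the two original variables are strongly correlated, so a single new variable (the score on the first component) already captures most of the spread between the samples.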
This task can be used to get an impression of the similarity of RNA-sequencing samples, i.e. to identify subgroups or outliers.
The variance in RNA-Seq data usually grows with the expression mean. PCA on the matrix of normalized read counts will therefore often lead to principal components that are dominated by the variance of a few highly expressed genes. The DESeq2 regularized-logarithm transformation (rlog) transforms the data matrix of read counts per gene (or transcript) to log scale, but specifically adapts to the high random noise of low-count data. This is explained in detail in "RNA-Seq workflow: gene-level exploratory analysis and differential expression". The matrix of raw counts is input to the DESeq2 rlog function, and the resulting transformed matrix is used as input for the principal component analysis (PCA, using the R package pcaMethods).
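The effect described above can be made concrete with a small sketch. It uses a plain log2(x + 1) transformation as a simple stand-in, which is explicitly not the DESeq2 rlog function; the gene names and counts are invented for illustration.

```python
import math
import statistics

# Hypothetical normalized counts for two genes across four samples.
counts = {
    "high_gene": [9000, 11000, 10500, 9500],
    "low_gene": [8, 12, 11, 9],
}

def log2_plus_one(values):
    # Simple stand-in, NOT the DESeq2 rlog transformation.
    return [math.log2(v + 1) for v in values]

# Variance of each gene on the raw scale and on the log scale.
raw_var = {g: statistics.variance(v) for g, v in counts.items()}
log_var = {g: statistics.variance(log2_plus_one(v)) for g, v in counts.items()}

# How strongly the highly expressed gene dominates in each case.
raw_ratio = raw_var["high_gene"] / raw_var["low_gene"]
log_ratio = log_var["high_gene"] / log_var["low_gene"]
```

On the raw scale the highly expressed gene carries orders of magnitude more variance than the low-count gene. After the plain log transformation the relation flips and the low-count gene appears noisier, which is the low-count noise that rlog is described as adapting to.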
Rlog transformation is the default. Although not recommended, it is possible to do PCA directly on normalized expression values. Based on the read distribution in the input files, a normalized expression value (NE) will be calculated for each locus (or transcript) for each input file. The NE value is based on the number of reads located in the exons of the locus/transcript and is normalized to the length of the locus/transcript and the density of the data set. The resulting matrix of NEs is then used as input for the principal component analysis as described.
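The exact NE formula is not spelled out here. Purely as an illustration of normalizing a read count to feature length and data set density, the sketch below assumes an RPKM-like scaling (reads per kilobase per million reads). The function name, parameters and scaling constants are assumptions and may differ from what the task actually computes.

```python
def normalized_expression(exon_reads, length_bp, total_reads):
    """RPKM-like value (assumed form, not necessarily the task's NE formula)."""
    if length_bp <= 0 or total_reads <= 0:
        raise ValueError("length_bp and total_reads must be positive")
    # Normalize to feature length (per kilobase) and data set size (per million).
    return exon_reads / (length_bp / 1_000) / (total_reads / 1_000_000)

# Hypothetical locus: 500 exonic reads, 2 kb long, in a data set of 20 million reads.
ne = normalized_expression(exon_reads=500, length_bp=2_000, total_reads=20_000_000)
```

With such a scaling, a long locus or a deeply sequenced sample does not receive a higher value merely because it collects more reads.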
Input  

Input file(s) with read positions from RNA-Seq
Input data are accepted in
BED / bigBed file format or
BAM file format containing the input regions.
For some tasks BAM support might not be available.
For those tasks that allow replicate data as input, you can use the Shift/Ctrl keys to select multiple files
from the list. All selected files will then be treated as replicates.
When adding a new file, a new window will open, asking you to either
For the new BED/BAM files, you will have to select the correct organism, as the
organism and the genome build are associated with the BED file for future use
(the default is your latest choice in the current session).
Note that files critically depend on the underlying genome build, which can be changed by selecting a different ElDorado version on the top right of the page before uploading a file. You can see the list of genomes available in ElDorado.
Note that almost all browsers have a general upload limit of 2 GB, i.e. files bigger than this size should be zipped before uploading from your local computer. This restriction does not apply when using the direct import from the GGA/GMS.
Optionally, you can specify a name for saving uploaded files on the server; otherwise the name of the uploaded file will be used. If several files are uploaded, the string given here will be used as a prefix for each file name.
If any of the regions in the input file cannot be completely assigned to the selected genome (e.g. wrong chromosome numbering or wrong positions within a chromosome), an error message will appear and the regions will be skipped. If no valid region is found in an uploaded file, the complete file will be skipped.
After one or several BED/BAM files have been uploaded successfully and the popup window has been closed, the list of available BED/BAM files will be updated automatically.
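To check a BED file locally before uploading, a minimal sketch along the following lines could flag regions that do not fit the selected genome. The function name and the chromosome sizes are hypothetical, and the checks are an assumption about what "cannot be completely assigned" means; the server-side validation may differ.

```python
def filter_bed_regions(bed_lines, chrom_sizes):
    """Split BED lines into regions that fit the genome and regions to skip."""
    valid, skipped = [], []
    for line in bed_lines:
        stripped = line.strip()
        # Ignore blank lines and BED header/comment lines.
        if not stripped or stripped.startswith(("#", "track", "browser")):
            continue
        fields = stripped.split()
        try:
            chrom, start, end = fields[0], int(fields[1]), int(fields[2])
        except (IndexError, ValueError):
            skipped.append(stripped)
            continue
        size = chrom_sizes.get(chrom)
        # Assumed check: known chromosome name and 0 <= start < end <= size
        # (BED coordinates are 0-based and half-open).
        if size is not None and 0 <= start < end <= size:
            valid.append((chrom, start, end))
        else:
            skipped.append(stripped)
    return valid, skipped

# Hypothetical genome with two short chromosomes.
chrom_sizes = {"chr1": 1000, "chr2": 500}
bed_lines = [
    "chr1\t10\t200",   # fits
    "chr2\t400\t600",  # end lies beyond the chromosome
    "1\t10\t20",       # chromosome naming does not match
]
valid, skipped = filter_bed_regions(bed_lines, chrom_sizes)
```

If `valid` comes back empty for a file, that file would correspond to the case in which the complete file is skipped.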
Uploaded BED or BAM files can be deleted from the project anytime via the project management. 
Options  rlog Transformation: Transform the count data matrix by the DESeq2 rlog function [default]. Alternatively, the normalized expression values (NEs) are used for PCA.

Parameters for PCA  Number of Groups: Here you can select the number of groups for your samples (e.g. the 3 groups control, treatment 1, and treatment 2), with a maximum of 13 groups.
Group properties: 
Transcript/Locus  The expression analysis can be based on different units of underlying data:

Output  
Result  Here, you can edit the default name of the result file. 
Email address  An email with the URL of the results will be sent
to the user-provided email address when the analysis is finished.
The results will be available for a limited time on our server. For details of how long your results will be kept, please see the result email. After that period they will be deleted unless protected in the project management!
The output has a number of sections, depending on the input (one or two data sets) and parameters:
The result sections are described in detail below.
The example shown below is based on data published by Roforth et al:
Samples  Number of samples submitted to analysis 

PCs  Number of principal components calculated (max 10) 
Variables  Number of loci or transcripts considered for analysis 
Method  svd = singular value decomposition (details) 
R2  The proportion of variance explained by each PC calculated (eigenvalue) 
R2cum  The cumulative proportion of the variance explained by the current and all preceding principal components. 
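The relation between R2 and R2cum can be sketched as follows. The example assumes that the variance of every component is available, so that their sum equals the total variance; the eigenvalues are invented for illustration.

```python
from itertools import accumulate

def explained_variance(eigenvalues):
    """Return (R2, R2cum) for eigenvalues sorted in decreasing order."""
    total = sum(eigenvalues)
    # R2: share of the total variance carried by each component.
    r2 = [ev / total for ev in eigenvalues]
    # R2cum: running sum over the current and all preceding components.
    r2cum = list(accumulate(r2))
    return r2, r2cum

# Hypothetical eigenvalues of three principal components.
r2, r2cum = explained_variance([6.0, 3.0, 1.0])
```

In this toy case the first component explains 60% of the variance and the first two together explain 90%.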
Example:
The score plot displays each sample in the data set with respect to the first two principal components and can therefore be used to interpret the relations among the samples. This information can be used to identify outliers.
In the example below, the replicate samples show a high similarity with respect to the first two principal components, a small within-group variance, and a good separation of groups.
The scree plot visualizes which principal components account for which fraction of the total variance in the data. The principal components are listed in decreasing order of contribution to the total variance. The bars show the proportion of variance represented by each component (R2) and the points show the cumulative variance (R2cum).
Example:
The loadings plot is a plot of the relationship between original variables (genes) and subspace dimensions. It allows the identification of genes that are most strongly correlated or anticorrelated with the first two principal components.
For the top principal components that are needed to account for 90% of the variance in the data (or up to a maximum of 10 PCs) the 40 transcripts/loci with the highest absolute loadings are shown.
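The selection rule described above can be sketched as two small helpers: one counts how many leading components are needed to reach 90% of the variance (capped at 10), the other picks the entries with the highest absolute loadings. The function names and example values are hypothetical; how the task handles ties or rounding is not specified here.

```python
def pcs_needed(r2, threshold=0.90, max_pcs=10):
    """Number of leading PCs needed to reach the variance threshold (capped)."""
    cumulative = 0.0
    for count, share in enumerate(r2[:max_pcs], start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    # Threshold not reached: fall back to the cap (or all available PCs).
    return min(len(r2), max_pcs)

def top_loadings(loadings, n=40):
    """Return (name, loading) pairs with the largest absolute loading."""
    return sorted(loadings.items(), key=lambda kv: abs(kv[1]), reverse=True)[:n]

# Hypothetical R2 values and loadings on one principal component.
r2 = [0.55, 0.25, 0.12, 0.05, 0.03]
n_pcs = pcs_needed(r2)
top = top_loadings({"geneA": 0.8, "geneB": -0.9, "geneC": 0.1}, n=2)
```

Sorting by absolute value keeps strongly anticorrelated genes (negative loadings) alongside strongly correlated ones.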
Example:
This score plot displays each sample in the data set with respect to the first three principal components.
Example:
File  Description 

91.input_replicate_analysis  expression data used as PCA input (the transformed count data in case of rlog transformation, or normalized expression values (NEs) if rlog was switched off)
Statistics.xml  Information displayed in overview tables 
Loadings.xml  Information displayed in loadings tables (rank, geneID, symbol and loading) 
scores.tsv  Scores for all samples for each PC calculated 
loadings.tsv  Loadings for each transcript/locus for each PC calculated 
07.expression_profile.tsv  expression values for each transcript/locus for each input file (separately) (detailed description) 
ScorePlot2D.png  Score plot for first two PCs 
ScorePlot3D.png  3D score plot for the first 3 PCs (only if more than 2 samples were submitted)
ScreePlot.png  Scree plot for all computed PCs 
Loadings.png  Loadings plot for the first two PCs 
Loadings_PCx.png  Plot of the top 40 loadings for PCs contributing to 90% of variance (maximum 10) 
© 1998-2018 Genomatix AG  All rights reserved