Quantile normalization
Quantile normalization is frequently used in microarray data analysis. It was introduced as quantile standardization and then renamed quantile normalization.
Quantile, quartile, percentile?
Quantiles are the cut points that divide data into equally sized groups. Percentiles are quantiles that divide the data into 100 equally sized groups, and quartiles are quantiles that divide it into 4.
Example:
0 quartile = 0 quantile = 0th percentile
1 quartile = 0.25 quantile = 25th percentile
2 quartile = 0.5 quantile = 50th percentile (median)
3 quartile = 0.75 quantile = 75th percentile
4 quartile = 1 quantile = 100th percentile
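The equivalences above can be checked with a minimal Python sketch using the standard library's `statistics.quantiles`. The data values are made up for illustration, and `method="inclusive"` is an assumption chosen so that the minimum and maximum act as the 0 and 1 quantiles.

```python
import statistics

# Hypothetical toy data, already sorted for readability.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]

# n=4 returns the three interior cut points: quartiles 1, 2 and 3.
quartiles = statistics.quantiles(data, n=4, method="inclusive")

# n=100 returns the 99 interior percentiles; index 24 is the 25th percentile,
# index 49 the 50th and index 74 the 75th.
percentiles = statistics.quantiles(data, n=100, method="inclusive")

print(quartiles)                                         # [3.0, 5.0, 7.0]
print(percentiles[24], percentiles[49], percentiles[74])  # 3.0 5.0 7.0
print(statistics.median(data))                           # 5
```

The 2nd quartile, the 0.5 quantile, the 50th percentile and the median all land on the same value.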
Quantile normalization
Quantile normalization transforms the statistical distributions of the samples so that they are all the same.
Assumptions
- Roughly the same distribution of values across samples
- Most genes are not differentially expressed
Assume that global differences between the distributions are induced only by technical variation!
How Q-normalization works
Rows: genes
Columns: samples/arrays
Procedure:
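- Order the values within each sample (column) from lowest to highest.
- Record the rank of each value within its sample so that the original order can be restored later.
- Average across each row of the sorted matrix and substitute every value in that row with the row average.
- Re-order the averaged values within each sample back into the original order recorded in the second step.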
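The four steps above can be sketched in plain Python. This is a minimal illustration, not a reference implementation: the function name `quantile_normalize`, the list-of-rows layout and the toy matrix are assumptions, and the data are chosen to contain no ties.

```python
def quantile_normalize(matrix):
    """Quantile-normalize a matrix given as a list of rows (genes),
    each row holding one value per sample. Ties are not treated specially."""
    n_genes = len(matrix)
    n_samples = len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Steps 1-2: for each sample, orders[j][r] is the gene index holding rank r.
    orders = [sorted(range(n_genes), key=col.__getitem__) for col in columns]

    # Step 3: average the values that share the same rank across samples.
    rank_means = [
        sum(columns[j][orders[j][r]] for j in range(n_samples)) / n_samples
        for r in range(n_genes)
    ]

    # Step 4: write each rank mean back at the gene's original position.
    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j in range(n_samples):
        for r, gene in enumerate(orders[j]):
            normalized[gene][j] = rank_means[r]
    return normalized


# Hypothetical example: 4 genes (rows) x 3 samples (columns), no ties.
expression = [
    [5, 4, 3],
    [2, 1, 4],
    [3, 5, 6],
    [4, 2, 8],
]
result = quantile_normalize(expression)
for row in result:
    print([round(v, 2) for v in row])
```

After normalization every column contains the same set of values (2.0, 3.0, 4.67, 6.0), only arranged in each sample's original gene order.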
Tied rank entries?
When several values within a sample are tied, average the row means of the ranks they occupy and substitute that average for every tied entry.
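A minimal sketch of this tie rule, assuming a list-of-rows matrix and a hypothetical function name `quantile_normalize_ties`. The toy data follow the small example on the Wikipedia page listed in the references, where the second sample contains the value 4 twice.

```python
from itertools import groupby


def quantile_normalize_ties(matrix):
    """Quantile normalization where tied values in a sample share the
    average of the rank means they span."""
    n_genes, n_samples = len(matrix), len(matrix[0])
    columns = [[row[j] for row in matrix] for j in range(n_samples)]

    # Mean of the r-th smallest value across samples, for every rank r.
    rank_means = [
        sum(vals) / n_samples for vals in zip(*(sorted(c) for c in columns))
    ]

    normalized = [[0.0] * n_samples for _ in range(n_genes)]
    for j, col in enumerate(columns):
        order = sorted(range(n_genes), key=col.__getitem__)
        rank = 0
        # groupby yields runs of genes whose values in this sample are equal.
        for _, run in groupby(order, key=col.__getitem__):
            genes = list(run)
            spanned = rank_means[rank:rank + len(genes)]
            shared = sum(spanned) / len(spanned)
            for gene in genes:
                normalized[gene][j] = shared
            rank += len(genes)
    return normalized


# 4 genes x 3 samples; genes 0 and 2 are tied (value 4) in the second sample.
expression = [
    [5, 4, 3],
    [2, 1, 4],
    [3, 4, 6],
    [4, 2, 8],
]
tied_result = quantile_normalize_ties(expression)
for row in tied_result:
    print([round(v, 2) for v in row])
```

The two tied entries would otherwise receive the rank means 4.67 and 5.67 in arbitrary order; with the tie rule both receive their average, 5.17. A side effect is that a sample containing ties no longer has exactly the same set of values as the others.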
When NOT to normalize
Consider a dilution experiment, in which the distributions are supposed to decrease from sample to sample (left plot). Here Q-normalization does exactly the wrong thing (right plot): it forces the distributions to be identical and erases the real difference. Whenever you expect a real difference in distributions, Q-normalization will create artifacts.
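To make the dilution problem concrete, here is a small Python sketch with made-up numbers: three samples where each is a twofold dilution of the previous one. The helper `quantile_normalize` is a compact, tie-unaware illustration and not any package's implementation.

```python
def quantile_normalize(matrix):
    """Compact quantile normalization for a list-of-rows matrix (no tie handling)."""
    columns = list(zip(*matrix))
    # Mean of the r-th smallest value across samples, for every rank r.
    rank_means = [sum(v) / len(v) for v in zip(*map(sorted, columns))]
    out_cols = []
    for col in columns:
        order = sorted(range(len(col)), key=col.__getitem__)
        new_col = [0.0] * len(col)
        for rank, gene in enumerate(order):
            new_col[gene] = rank_means[rank]
        out_cols.append(new_col)
    return [list(row) for row in zip(*out_cols)]


def column_means(matrix):
    return [sum(col) / len(col) for col in zip(*matrix)]


# Hypothetical dilution series: columns are undiluted, 1:2 and 1:4.
dilution = [
    [8, 4, 2],
    [16, 8, 4],
    [32, 16, 8],
    [64, 32, 16],
]

before = column_means(dilution)
after = column_means(quantile_normalize(dilution))
print(before)                        # [30.0, 15.0, 7.5]
print([round(m, 2) for m in after])  # [17.5, 17.5, 17.5]
```

The real twofold steps between samples are gone after normalization, which is exactly the signal a dilution experiment is designed to measure.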
Smooth quantile normalization
Assumptions
The statistical distribution of each sample should be the same (or have the same distributional shape) within biological groups or conditions, while allowing that the distributions may differ between groups.
How to
At each quantile, a weight is computed that compares the variability between groups with the total variability (between and within groups).
Let gene(g) denote the $g^{th}$ row after sorting each column in the data. For each row, gene(g), we compute the weight $w_{(g)}$ ∈ [0,1], where a weight of 0 implies that quantile normalization within groups is applied and a weight of 1 indicates that ordinary quantile normalization across all samples is applied. The weight at each row depends on the between-group sum of squares $SSB_{(g)}$ and the total sum of squares $SST_{(g)}$, as follows:
$$ w_{(g)} = \operatorname{median} \bigg\lbrace 1- \frac{SSB_{(i)}}{SST_{(i)}} \bigg\rbrace \text{for } i = g -k, \cdots, g, \cdots, g+k $$
where $k = \lfloor 0.05 \times \text{total number of genes} \rfloor$. The number 0.05 is a flexible parameter that can be altered to change the window of the number of genes considered.
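The weight formula can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the published implementation: the function name `smooth_quantile_weights` is hypothetical, the window is simply truncated at the first and last rows, and a row with zero total variability is given the ratio 1.

```python
import math
import statistics


def smooth_quantile_weights(sorted_rows, groups, window=0.05):
    """sorted_rows: matrix (list of rows) whose columns have each been sorted.
    groups: one group label per sample (column).
    Returns one weight per row: the rolling median of 1 - SSB/SST."""
    ratios = []
    for row in sorted_rows:
        grand_mean = statistics.fmean(row)
        sst = sum((x - grand_mean) ** 2 for x in row)
        ssb = 0.0
        for label in sorted(set(groups)):
            members = [x for x, g in zip(row, groups) if g == label]
            ssb += len(members) * (statistics.fmean(members) - grand_mean) ** 2
        # Assumption: with no variability at all, treat the row as weight 1.
        ratios.append(1.0 if sst == 0 else 1.0 - ssb / sst)

    n_rows = len(sorted_rows)
    k = math.floor(n_rows * window)
    # Assumption: near the edges the window is truncated rather than padded.
    return [
        statistics.median(ratios[max(0, g - k):g + k + 1]) for g in range(n_rows)
    ]


# Hypothetical sorted data: 4 rows x 4 samples, two samples per group.
groups = ["a", "a", "b", "b"]
sorted_rows = [
    [1, 2, 1, 2],   # groups agree at the low quantiles
    [2, 3, 2, 3],
    [3, 4, 7, 8],   # groups diverge at the high quantiles
    [4, 5, 9, 10],
]
weights = smooth_quantile_weights(sorted_rows, groups, window=0.25)
print([round(w, 3) for w in weights])
```

In this toy input the low quantiles, where the groups agree, get weights of 1 (ordinary quantile normalization), while the high quantiles, where the groups separate, get weights close to 0 (normalization within groups).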
StatQuest: Quantile Normalization
References
- https://en.wikipedia.org/wiki/Quantile_normalization
- BIOMEDIN 245: Statistical and Machine Learning Methods for Genomics, Stanford
- Hicks SC, Okrah K, Paulson JN, Quackenbush J, Irizarry RA, Bravo HC. Smooth quantile normalization. Biostatistics. 2018;19(2):185‐198. doi:10.1093/biostatistics/kxx028