I wanted to share two posts and ask for your take on them...and also write some thoughts out for myself to put ideas together :)

I came across these two posts on Cross Validated and I thought they could be relevant for this forum too:

1) There is a minimum sample size for a ttest to be valid

2) Ttest for non-normal with n>50

Here are my thoughts: there are lots of R packages that can be used to find differentially expressed genes (DEGs). One of my favorites is limma, which fits a linear model per gene and computes moderated t-statistics, "borrowing" information from all the genes to estimate the variance of each single gene (correct me if I'm wrong). However, in my microarray analysis classes I was taught to use a plain t-test for DEGs when we had "enough" samples.
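To make the "borrowing" idea concrete, here is a minimal pure-Python sketch of how I understand the moderated t-statistic: the per-gene variance is pulled toward a prior variance before dividing. This is not the limma API. limma estimates the prior variance and prior degrees of freedom from all genes by empirical Bayes, whereas here `prior_var` is just the mean of the per-gene variances and `prior_df = 4.0` is a made-up value, both assumptions for illustration. The gene names and expression values are toy data.

```python
import math
from statistics import mean, variance


def pooled_variance(a, b):
    """Pooled within-group variance and its residual degrees of freedom."""
    df = len(a) + len(b) - 2
    s2 = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    return s2, df


def t_statistics(a, b, prior_var, prior_df):
    """Return (ordinary t, moderated t) for one gene, group b minus group a."""
    s2, df = pooled_variance(a, b)
    diff = mean(b) - mean(a)
    scale = math.sqrt(1 / len(a) + 1 / len(b))
    # Shrink the gene-wise variance toward the prior variance.
    shrunk = (prior_df * prior_var + df * s2) / (prior_df + df)
    return diff / (math.sqrt(s2) * scale), diff / (math.sqrt(shrunk) * scale)


# Toy log2 expression values: gene -> (normal samples, tumor samples)
genes = {
    "geneA": ([7.0, 7.1, 6.9], [8.0, 8.1, 7.9]),  # tiny variance, maybe by chance
    "geneB": ([5.0, 6.0, 7.0], [6.5, 7.5, 8.5]),  # large variance
    "geneC": ([9.0, 9.5, 8.5], [9.2, 9.8, 8.6]),
}

# Crude stand-in for the prior (assumption): average the per-gene variances.
prior_var = mean([pooled_variance(a, b)[0] for a, b in genes.values()])
prior_df = 4.0  # hypothetical value, not estimated from the data

results = {g: t_statistics(a, b, prior_var, prior_df) for g, (a, b) in genes.items()}
for gene, (t_ord, t_mod) in results.items():
    print(f"{gene}: ordinary t = {t_ord:.2f}, moderated t = {t_mod:.2f}")
```

With only 3 samples per group, geneA's tiny variance inflates its ordinary t-statistic; shrinking the variance tones that down, while geneB's noisy variance estimate is pulled the other way. As the number of samples grows, the gene's own variance gets more weight and the two statistics converge, which fits with the two approaches agreeing more at larger N.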

The word "enough" always confused me: 3, 5, 10, 100, 10^89?! I didn't know that the t-test was originally developed for very small samples (as few as 4), and although more elegant ways of evaluating real differences between the means of two groups have been developed, the t-test is still widely used. So...what are your thoughts on using the t-test for DEGs with N > 20-30? Would you completely discard it? Would you still use it for big sample sizes?

Now let's say we have more than two conditions. Here the classic example is ANOVA. ANOVA compares group means by splitting the total variance into a between-group and a within-group part. A good and detailed description is in this file from the Jackson Laboratories. Now, if we have just a "few" (same as "enough", very irritating word) samples, then estimating the variance can be tough, and here's where tools like limma come in handy: it uses values from other genes and it "shrinks" the variance. But then, let's say we have N > 30, would you still consider using ANOVA?
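For reference, the one-way ANOVA F-statistic for a single gene can be sketched in a few lines of plain Python. The function name `one_way_f` and the three toy groups are made up for illustration, and the sketch stops at the F-statistic (getting a p-value needs the F distribution, which the standard library does not provide).

```python
from statistics import mean


def one_way_f(groups):
    """One-way ANOVA F-statistic: between-group over within-group mean square."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares: how far each group mean is from the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares: spread of the samples around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n_total - k))


# Toy expression values for one gene in three conditions
groups = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0], [5.0, 6.0, 7.0]]
f_stat = one_way_f(groups)
print(f"F = {f_stat:.2f}")
```

The denominator is the within-group variance estimate, which is exactly the quantity that becomes unstable with few samples and that variance shrinkage tries to stabilize.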

Lastly, normality. I used to underestimate the importance of normally distributed data, then I started reading about the central limit theorem and parametric vs non-parametric testing. Quick recap: parametric testing = t-test/ANOVA; non-parametric testing = Wilcoxon rank-sum test / Mann-Whitney U test (two names for the same test), and I found it nicely mapped in a table here. So, my question is: when you analyze your data, how much weight do you put on their distribution? i.e. do you run a Shapiro-Wilk test to check the distribution, or do you just go with what the distribution looks like?
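A small illustration of why the distribution can matter: below is a minimal sketch comparing a Welch t-statistic with the Mann-Whitney U statistic on toy data, with and without a single outlier. Everything here is an assumption for illustration: the data are invented, the function names are mine, and the p-value uses a plain normal approximation with no tie or continuity correction (a real analysis would use an established implementation).

```python
import math
from statistics import NormalDist, mean, variance


def mann_whitney_u(x, y):
    """U statistic for x vs y: number of pairs with x > y, ties counting one half."""
    return sum(1.0 if xi > yi else 0.5 if xi == yi else 0.0 for xi in x for yi in y)


def mann_whitney_p(x, y):
    """Two-sided p-value by normal approximation (no tie or continuity correction)."""
    n1, n2 = len(x), len(y)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (mann_whitney_u(x, y) - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))


def welch_t(x, y):
    """Welch (unequal variance) t-statistic for mean(x) - mean(y)."""
    return (mean(x) - mean(y)) / math.sqrt(variance(x) / len(x) + variance(y) / len(y))


# Toy data: every treated value is higher than every control value.
control = [5.1, 5.3, 4.9, 5.0, 5.2, 5.4, 4.8, 5.15]
treated = [6.0, 6.2, 5.9, 6.1, 6.3, 5.8, 6.4, 6.05]
treated_outlier = treated[:-1] + [60.5]  # one wildly off measurement

for label, t_group in [("clean", treated), ("outlier", treated_outlier)]:
    print(label,
          "t =", round(welch_t(t_group, control), 2),
          "U =", mann_whitney_u(t_group, control),
          "rank p ~", round(mann_whitney_p(t_group, control), 4))
```

The outlier blows up the variance and shrinks the t-statistic a lot, while the rank-based U does not change at all, because only the ordering of the values matters to it.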

Lots of this goes back to power, but let's assume that we are given a set of data to analyze and we are not designing the experiment from the get-go...or if we are, we have limited $$$...oh wait...I forgot that in research money is not a problem (..add sarcastic grin... lol) :)

As a small test, I ran limma and a t-test on a set of 28 normal and 32 tumor samples from some CEL files we had in the lab. The list of DEGs with the same thresholds (p-value < 0.05 and log2FC > 1.5) is exactly the same, but the p-values, as expected, are lower in limma.
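The comparison itself boils down to applying the same cutoffs to both result tables and comparing the gene sets, roughly like the sketch below. The gene names and numbers are invented toy values, not the results from our CEL files, and taking the absolute value of log2FC (so that down-regulated genes are also kept) is my assumption here.

```python
def call_degs(results, p_cutoff=0.05, lfc_cutoff=1.5):
    """Return the set of genes passing both the p-value and the |log2FC| cutoff."""
    return {
        gene
        for gene, (p_value, log2fc) in results.items()
        if p_value < p_cutoff and abs(log2fc) > lfc_cutoff
    }


# Toy results: gene -> (p-value, log2 fold change). Values are made up.
limma_like = {"g1": (1e-6, 2.3), "g2": (0.003, -1.9), "g3": (0.20, 1.8), "g4": (0.01, 0.4)}
ttest_like = {"g1": (1e-4, 2.3), "g2": (0.02, -1.9), "g3": (0.31, 1.8), "g4": (0.04, 0.4)}

degs_limma = call_degs(limma_like)
degs_ttest = call_degs(ttest_like)

print("same DEG list:", degs_limma == degs_ttest)
print("only in limma:", sorted(degs_limma - degs_ttest))
print("only in t-test:", sorted(degs_ttest - degs_limma))
```

Note that identical lists at one cutoff do not mean the rankings are identical, so it can also be worth comparing the order of the top genes.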

Alright, I think that's it for now...thanks for reading. I've been thinking about these things for a while, so I'm curious to see what others think.

I haven't mentioned analyses like SAM or resampling because otherwise the post would get too long, but feel free to share your ideas about them too.

thanks :)