Key terms in microbiome analyses
When evaluating the composition of the microbiome in a sample based on sequencing data, higher-level measures are often used that do not provide information on changes in the abundance of specific taxa.
Sizeable shifts in the ratio of commensal bacteria in the gut microbiome are referred to as dysbiosis. Common forms of dysbiosis in human gut samples are increased levels of Proteobacteria or a reduction in diversity.
Different measures exist to estimate the diversity within a sample, jointly called alpha diversity. These measures reflect the richness (number of taxa) or the evenness (distribution of abundances among taxa) of a microbial sample, or aim to reflect a combination of both properties.
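As an illustration of the richness and evenness properties described above, the following minimal sketch computes three commonly used quantities from a vector of per-taxon read counts: observed richness, the Shannon index (which combines richness and evenness), and Pielou's evenness. The choice of these three indices, the function names, and the toy count vectors are assumptions made for this example; they are not prescribed by the text.

```python
import math

def richness(counts):
    """Observed richness: number of taxa with a non-zero count."""
    return sum(1 for c in counts if c > 0)

def shannon(counts):
    """Shannon index H' = -sum(p_i * ln(p_i)) over taxa with non-zero counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def pielou_evenness(counts):
    """Pielou's evenness J' = H' / ln(S); returned as 0.0 here when S < 2."""
    s = richness(counts)
    if s < 2:
        return 0.0
    return shannon(counts) / math.log(s)

# Hypothetical toy samples: same richness, very different evenness.
even_sample = [25, 25, 25, 25]
skewed_sample = [97, 1, 1, 1]

for name, sample in [("even", even_sample), ("skewed", skewed_sample)]:
    print(name, richness(sample), shannon(sample), pielou_evenness(sample))
```

Both toy samples contain four taxa, so their richness is identical, while the Shannon index and evenness are lower for the sample dominated by a single taxon.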
Rarefaction curves are often used when calculating alpha diversity indices, because increasing numbers of sequenced reads allow increasingly accurate estimates of total population diversity.
Rarefaction curves can therefore be used to estimate the full sample richness, as compared to the observed sample richness.
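A rarefaction curve can be sketched as the expected number of taxa observed at increasing subsampling depths. The example below uses the standard analytical expectation for sampling without replacement (rather than repeated random subsampling), which is one of several possible ways to compute such a curve. The function names, the step size, and the toy counts are assumptions made for illustration.

```python
from math import comb

def expected_richness(counts, n):
    """Expected number of taxa seen in n reads drawn without replacement.

    For each taxon with c reads out of N in total, the probability that it is
    missed entirely is C(N - c, n) / C(N, n).
    """
    total = sum(counts)
    if not 0 <= n <= total:
        raise ValueError("n must be between 0 and the total read count")
    denom = comb(total, n)
    return sum(1 - comb(total - c, n) / denom for c in counts if c > 0)

def rarefaction_curve(counts, step):
    """Return (depth, expected richness) pairs from 0 up to the full depth."""
    total = sum(counts)
    depths = list(range(0, total, step)) + [total]
    return [(n, expected_richness(counts, n)) for n in depths]

# Hypothetical toy sample with one dominant and several rare taxa.
toy_counts = [50, 30, 15, 4, 1]
for depth, s in rarefaction_curve(toy_counts, 25):
    print(depth, round(s, 2))
```

A curve that flattens before the full depth is reached suggests that the observed richness is close to the richness of the sample, whereas a curve that is still rising suggests that further sequencing would reveal additional taxa.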
While alpha diversity is a measure of microbiome diversity applicable to a single sample, beta diversity is a measure of the similarity or dissimilarity of two communities. As with alpha diversity, many indices exist, each reflecting different aspects of community heterogeneity. Key differences relate to how the indices weight variation in rare species, whether they consider presence/absence only or incorporate abundance, and how they interpret shared absence. Bray-Curtis dissimilarity is a popular measure that considers both the size (overall abundance per sample) and the shape (abundance of each taxon) of the communities (Bray, 1957). Beta diversity is an essential measure for many popular statistical methods in ecology, such as ordination-based methods, and is widely used for studying the association between environmental variables and microbial composition.
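To make the Bray-Curtis measure concrete, the following minimal sketch computes it for two samples given as per-taxon abundance vectors in the same taxon order, using the usual definition: the sum of absolute differences divided by the sum of all abundances. The function name and the toy vectors are assumptions made for this example.

```python
def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two abundance vectors.

    BC = sum(|a_i - b_i|) / sum(a_i + b_i), ranging from 0 (identical)
    to 1 (no shared taxa).
    """
    if len(a) != len(b):
        raise ValueError("both samples must cover the same taxa")
    total = sum(a) + sum(b)
    if total == 0:
        raise ValueError("dissimilarity is undefined for two empty samples")
    return sum(abs(x - y) for x, y in zip(a, b)) / total

print(bray_curtis([6, 4], [3, 7]))        # 0.3
print(bray_curtis([10, 0], [0, 10]))      # 1.0
# A taxon absent from both samples does not change the result.
print(bray_curtis([6, 4, 0], [3, 7, 0]))  # 0.3
```

The last call illustrates one of the properties mentioned above: Bray-Curtis ignores shared absences, since a taxon with zero abundance in both samples contributes nothing to either sum.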
In summary, alpha diversity measures can be seen as summary statistics of a single population (within-sample diversity), while beta diversity measures are estimates of similarity or dissimilarity between populations (between samples).
Normalization across samples of sequencing data is performed to account for differences in sequencing depths.
Rarefaction to even read count
This is often performed by subsampling the quality-controlled (QC’ed) set of reads without replacement to a smaller, predetermined, fixed total. “Without replacement” means that each read that is selected and assigned to the normalized sample is not returned to the original pool and thus cannot be selected again. An advantage of this approach is that the data are retained as counts, which allows further analyses with statistical tools that require count data.
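The subsampling step described above can be sketched as follows, starting from a vector of per-taxon read counts rather than from the reads themselves. The function name, the `seed` parameter, and the toy counts are assumptions made for illustration; expanding the counts into one label per read is simple but memory-hungry for deep samples, so this is a sketch rather than a production implementation.

```python
import random

def rarefy(counts, depth, seed=None):
    """Subsample `depth` reads without replacement from per-taxon counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError("sample has fewer reads than the requested depth")
    rng = random.Random(seed)
    # One entry per read, labelled with the index of its taxon.
    reads = [taxon for taxon, c in enumerate(counts) for _ in range(c)]
    # random.sample draws without replacement: no read is picked twice.
    picked = rng.sample(reads, depth)
    rarefied = [0] * len(counts)
    for taxon in picked:
        rarefied[taxon] += 1
    return rarefied

# Hypothetical samples sequenced to different depths, rarefied to 100 reads.
sample_a = [500, 300, 200, 0]
sample_b = [60, 50, 30, 10]
print(rarefy(sample_a, 100, seed=1))
print(rarefy(sample_b, 100, seed=1))
```

Both outputs are integer count vectors that sum to the same total, so they remain usable with tools that require count data. Samples with fewer reads than the chosen depth cannot be rarefied to it and raise an error here.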
Normalization by sample sum
As an alternative to normalization by rarefaction, where an equal-sized subset of reads is selected from each sample, read counts can be converted to relative frequencies by dividing by the sample sum. Here, we use the full sample data and normalize to relative abundances. The resulting values are fractions and therefore no longer counts.
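A minimal sketch of normalization by sample sum is given below. The function name and the toy counts are assumptions made for this example.

```python
def to_relative_abundance(counts):
    """Divide each per-taxon count by the sample sum."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot normalize a sample with zero reads")
    return [c / total for c in counts]

# Hypothetical samples with the same composition but a tenfold depth difference.
shallow = [5, 3, 2]
deep = [50, 30, 20]
print(to_relative_abundance(shallow))  # [0.5, 0.3, 0.2]
print(to_relative_abundance(deep))     # [0.5, 0.3, 0.2]
```

After normalization the two samples are directly comparable despite their different sequencing depths, but the values are fractions that sum to 1 and can no longer be passed to methods expecting integer counts.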
The core microbiome
The precise definition of the core microbiota varies between studies, but all aim to identify the more reliably detected taxa for further analyses. Measures of mean abundance across samples and the fraction of samples with zero abundance are often used to filter the taxa for further analysis. Often, low-abundance taxa are removed from further single-taxon analyses, or are analyzed using different statistical approaches that better handle their distributional properties. Thresholds and statistical models must be selected based on the individual study design, depending on the type of microbiome and the goal of the analyses.
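The filtering described above can be sketched as a simple two-criterion filter on prevalence (the fraction of samples in which a taxon has non-zero abundance) and mean relative abundance. The function name, the taxon labels, the toy table, and in particular the threshold values are hypothetical and chosen only for illustration; as noted, appropriate thresholds depend on the study.

```python
def core_taxa(table, min_prevalence=0.5, min_mean_abundance=0.001):
    """Return taxa passing both a prevalence and a mean-abundance threshold.

    `table` maps each taxon name to its relative abundances, one per sample.
    """
    core = []
    for taxon, values in table.items():
        prevalence = sum(1 for v in values if v > 0) / len(values)
        mean_abundance = sum(values) / len(values)
        if prevalence >= min_prevalence and mean_abundance >= min_mean_abundance:
            core.append(taxon)
    return core

# Hypothetical relative abundances for four taxa across four samples.
toy_table = {
    "taxon_A": [0.40, 0.35, 0.50, 0.45],          # abundant, always present
    "taxon_B": [0.10, 0.00, 0.08, 0.12],          # present in 3 of 4 samples
    "taxon_C": [0.0, 0.0, 0.002, 0.0],            # detected in one sample only
    "taxon_D": [0.0002, 0.0001, 0.0003, 0.0002],  # always present but very rare
}
print(core_taxa(toy_table))  # ['taxon_A', 'taxon_B']
```

In this toy example, taxon_C fails the prevalence criterion and taxon_D fails the mean-abundance criterion; in practice such taxa might be dropped or handled with a model better suited to sparse data.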
While the definition of a set of core taxa on a study-by-study basis is practical for statistical and interpretational reasons, many studies have aimed to identify a population-scale core, often referred to as the core measurable microbiome (CMM), defined as the taxa found across all or a defined set of human communities. While this is an interesting biological question, it is calculated with a different aim than the filtering discussed above, which is performed for robustness and statistical purposes.