Exploring the potential of hierarchical generalized linear models in animal breeding and genetics

Rönnegård, Lars; Lee, Y.

doi:10.1111/jbg.12059

Abstract

Complex problems do not always require complex tools, and searching for these simple solutions is what makes science really exciting. Hypothesis testing and prediction in animal breeding and genetics can be rather complex because they require modelling of correlation structures between related individuals, which researchers in animal breeding are notably good at. Animal breeders are also exceptionally good at collecting large amounts of high-quality data through the close mutual collaboration between farmers and animal breeding organizations. One of the great challenges, though, is to apply the complex models to large data sets in practice, and a multitude of linear mixed models have been applied using clever sparse matrix techniques. But where do we go from here? In 1981, the eminent animal breeder C. R. Henderson visited Iowa State University and sometimes came to the office of Youngjo Lee, then a student of Oscar Kempthorne, to explain his mixed model equation. Even though this student was not convinced at that time of the usefulness of Henderson's ideas, 15 years later, he and John Nelder (Lee & Nelder 1996 J. R. Statist. Soc. B. 58:619-678) extended the BLUP approach to a broad class of statistical models with random effects: hierarchical generalized linear models (HGLMs). HGLMs can be fitted using their hierarchical (h-) likelihood, an extension of the so-called joint likelihood used by Henderson that consists of a joint density for the observations and random effects. The estimates of fixed and random effects are derived by maximizing the h-likelihood and produce direct extensions of Henderson's mixed model equations which are easy to recognize and interpret for those previously acquainted with the animal model, whereas the variance components are estimated by maximizing an adjusted profile of the h-likelihood, a direct extension of REML. 
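As a reminder of the starting point that the h-likelihood extends, Henderson's mixed model equations for the animal model y = Xb + Za + e can be sketched in a few lines. This is a minimal illustration only: the toy pedigree, the records and the variance components below are invented for the example, the variances are treated as known, and a dense solver is used where real applications rely on sparse matrix techniques.

```python
import numpy as np

def solve_mme(X, Z, y, A_inv, sigma2_e, sigma2_a):
    """Solve Henderson's mixed model equations for y = Xb + Za + e.

    Assumes a ~ N(0, A * sigma2_a) and e ~ N(0, I * sigma2_e),
    with both variance components treated as known.
    """
    lam = sigma2_e / sigma2_a  # variance ratio that shrinks the breeding values
    lhs = np.block([[X.T @ X, X.T @ Z],
                    [Z.T @ X, Z.T @ Z + lam * A_inv]])
    rhs = np.concatenate([X.T @ y, Z.T @ y])
    sol = np.linalg.solve(lhs, rhs)
    p = X.shape[1]
    return sol[:p], sol[p:]  # fixed effects, predicted breeding values

# Hypothetical toy data: two unrelated parents and their two full-sib offspring.
A = np.array([[1.0, 0.0, 0.5, 0.5],
              [0.0, 1.0, 0.5, 0.5],
              [0.5, 0.5, 1.0, 0.5],
              [0.5, 0.5, 0.5, 1.0]])
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 0.0], [1.0, 1.0]])  # intercept, sex
Z = np.eye(4)                                                   # one record per animal
y = np.array([4.5, 2.9, 3.9, 3.5])
sigma2_e, sigma2_a = 2.0, 1.0                                   # assumed known

b_hat, a_hat = solve_mme(X, Z, y, np.linalg.inv(A), sigma2_e, sigma2_a)
print("fixed effects:", b_hat)
print("breeding values:", a_hat)
```

The solutions coincide with the generalized least squares estimate of b and the BLUP of a obtained from the marginal covariance matrix V = ZAZ'σ²a + Iσ²e, which is the equivalence that makes the joint-likelihood view attractive.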
So, what Lee and Nelder did was to extend familiar theory applied in animal breeding research to a much wider class of models. Modelling of HGLMs is relatively straightforward because of the hierarchical nature of the h-likelihood where models for variance components and dispersion parameters can be added one by one. A wide range of distributions can also be used to model both the response variable(s) and the random effects, which further increases the modelling flexibility. Some examples of very useful and standard HGLMs are a Poisson response with gamma random effects, frailty models for survival analysis, dealing with heterogeneity by including random effects in a model for the residual variance and models for smoothing data using random effects. These models are all found in the book by Lee, Nelder and Pawitan from 2006 (Chapman & Hall/CRC) together with applications on data collected in various fields of research. The code for fitting all these examples is available in GenStat together with the data. The h-likelihood can not only be used for model fitting but is also a statistical framework for deriving model selection tools. The standard Fisher likelihood is a marginal likelihood where all random effects have been integrated out, and the focus is on statistical testing of fixed effects, whereas the h-likelihood allows inference of both fixed and random effects so that model selection can be based on the random effects as well. The conditional AIC from the h-likelihood is actually equivalent to the deviance information criterion (DIC) applied in Bayesian statistics (see Lee & Noh 2012 Stat. Mod. 12: 487-502). Here, we highlight some aspects of HGLMs and their extensions, as applied to questions specific to animal breeding, together with possible future applications using spatial modelling and variable selection. 
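The Poisson response with gamma random effects mentioned above gives a simple view of how the h-likelihood yields random-effect estimates. The sketch below assumes a log link, a gamma random effect u with mean 1 and variance lam, and the h-likelihood defined on the scale v = log u; the fixed-effect means and lam are taken as given, and all numbers are hypothetical. Under these assumptions the contribution of cluster i is h_i = sum_j [y_ij (log mu_ij + v_i) - mu_ij exp(v_i)] + (v_i - exp(v_i)) / lam, up to constants, and setting its derivative to zero gives a closed-form estimate.

```python
import numpy as np

def h_score_v(v, y_sum, mu_sum, lam):
    """Derivative of the cluster h-likelihood with respect to v = log(u)."""
    u = np.exp(v)
    return y_sum - mu_sum * u + (1.0 - u) / lam

def u_hat(y_sum, mu_sum, lam):
    """Closed-form maximiser of h on the v scale, expressed as u = exp(v)."""
    return (y_sum + 1.0 / lam) / (mu_sum + 1.0 / lam)

# Hypothetical cluster totals: observed counts and fixed-effect expected counts.
y_sum = np.array([3.0, 0.0, 12.0])
mu_sum = np.array([4.0, 4.0, 4.0])
lam = 0.5  # assumed known variance of the gamma random effect

u = u_hat(y_sum, mu_sum, lam)
score = h_score_v(np.log(u), y_sum, mu_sum, lam)
print("raw ratios y/mu:", y_sum / mu_sum)
print("predicted u:    ", u)
print("score at u_hat: ", score)
```

Each predicted u lies between the raw ratio y/mu and the prior mean of 1, which is the same shrinkage behaviour that BLUP shows in the normal case.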
The animal model traditionally assumes a constant residual variance for all observations, but there is a concern, especially for a trait like milk yield, that the residual variance sometimes seems to increase with selection. To investigate this possibility, models including a genetically structured residual variance have been proposed where a model for the residual variance includes both fixed and random effects on a logarithmic scale. This model belongs to the class of double hierarchical generalized linear models (DHGLMs) introduced by Lee and Nelder (2006 J. R. Statist. Soc. C. 55: 139-185) and can be fitted using two interconnected HGLMs. Using standard software including sparse matrix techniques developed for animal breeding purposes (e.g. DMU or ASReml), this model can be fitted within a reasonable amount of time on large data sets (Felleki et al. 2012 Genet. Res. 94:307-317, Rönnegård et al. 2013 J. Dairy Sci. 96:2627-2636). There is also an hglm package in R, which can be applied to animal models and is computationally efficient for moderately sized data with pedigrees including around 1000 animals or fewer. This package is a fast implementation using interconnected GLMs as described in the book by Lee, Nelder and Pawitan (2006) and uses the standard glm function in R together with a QR decomposition. The bigRR package uses the machinery of the hglm package to compute shrinkage estimates for models having a small number of observations (<1000) and a much larger number of parameters. The package is an elegant addition to the ever-increasing toolbox for genomic prediction (Shen et al. 2013 Genetics 193:1255-1268). Statisticians have a responsibility to develop user-friendly and reliable software for applied users. The development of HGLM software has not been focused on animal breeding applications, but only minor adjustments are required for such applications.
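The idea of fitting a mean model and a log-linear model for the residual variance as two interconnected GLMs can be illustrated with a deliberately stripped-down sketch. It is not the DHGLM algorithm of the cited papers: it has fixed effects only in both parts, omits the leverage adjustment used in the HGLM literature, and runs on simulated data with invented parameter values. The mean is updated by weighted least squares, and the variance parameters by one Fisher scoring step of a gamma GLM with log link on the squared residuals.

```python
import numpy as np

def fit_mean_and_log_variance(X, Zv, y, n_iter=50):
    """Alternate between a weighted mean fit and a gamma GLM for the variance.

    Model (simplified, fixed effects only):
        y_i ~ N(x_i' beta, phi_i),  log(phi_i) = z_i' gamma
    """
    gamma = np.zeros(Zv.shape[1])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        phi = np.exp(Zv @ gamma)
        # Mean part: weighted least squares with weights 1 / phi.
        XtW = X.T / phi
        beta = np.linalg.solve(XtW @ X, XtW @ y)
        # Variance part: squared residuals as a gamma response with log link.
        d = (y - X @ beta) ** 2
        working = Zv @ gamma + (d - phi) / phi  # working response
        # For the gamma family with log link the working weights are constant,
        # so the Fisher scoring step is an ordinary least squares fit.
        gamma = np.linalg.lstsq(Zv, working, rcond=None)[0]
    return beta, gamma

# Simulated data with hypothetical true values.
rng = np.random.default_rng(1)
n = 2000
x = rng.normal(size=n)
z = rng.uniform(-1.0, 1.0, size=n)
X = np.column_stack([np.ones(n), x])
Zv = np.column_stack([np.ones(n), z])
beta_true = np.array([1.0, 2.0])
gamma_true = np.array([0.5, 1.0])
y = X @ beta_true + np.sqrt(np.exp(Zv @ gamma_true)) * rng.normal(size=n)

beta_hat, gamma_hat = fit_mean_and_log_variance(X, Zv, y)
print("beta_hat: ", beta_hat)
print("gamma_hat:", gamma_hat)
```

In a genetically structured residual variance model, both linear predictors would additionally contain random genetic effects, and each of the two fits would itself be an HGLM.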
There are packages in R for fitting pure HGLMs (HGLMMM), survival models with random effects (frailtyHL) and DHGLMs (dhglm). These fit the h-likelihood directly using numerical optimization, so any potential bias of the estimated variance components, especially for binary data with small cluster sizes, can thereby be eliminated. What kinds of developments and applications can we expect in the future for animal breeding and genetics? Just as the residual variance can be modelled using DHGLMs, the genetic variance can be modelled as well. Rönnegård and Lee (2010 Conference paper WCGALP, Leipzig) showed that smoothing the genetic variance over adjacent SNPs in genomic prediction is possible using this approach. Lee and Noh (2012 Stat. Mod. 12: 487-502) presented model selection tools for modelling the variance of random effects. An interesting application in genomic selection would be to model the variances of SNP effects with genomic information as fixed effects, for instance using an indicator variable of whether the SNP is located in an exon or not, or using the minor allele frequency as a covariate. This approach should also be useful for extensions of the animal model when there is a need to model the genetic variance. Such a model for the genetic variance could, for instance, include fixed effects of sex and age, as well as random effects to account for geographical differences in heritability. There is great interest in methods to fit spatial models in ecology, environmetrics and economics, where random effects are estimated for different regions on a map and information is borrowed from neighbouring regions by fitting a spatial correlation matrix. These models are not common in animal breeding applications although data are often collected from herds that are geographically dispersed. The h-likelihood has been extensively used for spatial modelling and could be a useful tool for such purposes in animal breeding applications as well.
Many Bayesian methods to perform variable selection in genomic prediction and QTL detection have been developed, but this is also possible using the HGLM approach with a major computational improvement. In the HGLM approach, a scaled gamma mixture model for the distribution of effects is used and provides a whole family of penalized likelihoods, including LASSO among many others. This family of models can be fitted using a common iterative weighted least squares algorithm (Lee & Oh 2009 Technical report 2009-4, Dep. of Statistics, Stanford University). Lee and Bjørnstad (2013 J. R. Statist. Soc. B. 75:553-575 with available R code) showed that multiple tests can be viewed as a multiple prediction problem of whether a null hypothesis is true or not and that these predictions can be made by fitting random effects. Thus, random-effect estimation under various distributional assumptions is of great interest in several fields of study including genetics, and the use of HGLMs could be a way to solve complex problems in animal breeding with familiar simple tools. Exploring these possibilities is a potentially fascinating route for those adventurous enough to tread on recently broken statistical ground.
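To give a feel for how a LASSO-type penalty can be handled by iterative weighted least squares, the sketch below uses a generic reweighted ridge scheme (a local quadratic approximation of the absolute-value penalty). It is a stand-in chosen for illustration and should not be read as the algorithm of Lee and Oh; the function name, the penalty value and the simulated design are all hypothetical. Each iteration solves a ridge-type system whose weights are recomputed from the current estimates, so that small effects receive ever larger shrinkage.

```python
import numpy as np

def lasso_irls(X, y, penalty, n_iter=200, eps=1e-8):
    """Approximately minimise 0.5 * ||y - X b||^2 + penalty * sum_j |b_j|.

    Uses the local quadratic approximation |b_j| ~ b_j^2 / (2 |b_j_old|) + const,
    which turns every iteration into a weighted ridge regression.
    """
    XtX = X.T @ X
    Xty = X.T @ y
    # Start from a lightly ridged least squares fit.
    beta = np.linalg.solve(XtX + 1e-6 * np.eye(X.shape[1]), Xty)
    for _ in range(n_iter):
        w = penalty / (np.abs(beta) + eps)  # large weight for small effects
        beta = np.linalg.solve(XtX + np.diag(w), Xty)
    return beta

# Hypothetical example with orthonormal columns, where the LASSO solution is
# known in closed form (soft thresholding of X'y).
rng = np.random.default_rng(0)
X, _ = np.linalg.qr(rng.normal(size=(20, 4)))
beta_true = np.array([3.0, -2.0, 0.5, 0.0])
y = X @ beta_true  # noise-free, so that X'y equals beta_true

beta_hat = lasso_irls(X, y, penalty=1.0)
print(np.round(beta_hat, 4))
```

With an orthonormal design, the two large effects are shrunk by the penalty and the two small ones are driven to essentially zero, which is the variable selection behaviour referred to above. The small constant eps keeps the weights finite, so effects are shrunk very close to zero rather than set exactly to zero.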

Published in

Journal of Animal Breeding and Genetics
2013, volume: 130, number: 6, pages: 415-416

SLU Authors

Rönnegård, Lars
- Department of Animal Biosciences, Swedish University of Agricultural Sciences

UKÄ Subject classification

Genetics and Breeding in Agricultural Sciences

Publication identifier

DOI: https://doi.org/10.1111/jbg.12059

Permanent link to this page (URI)

https://res.slu.se/id/publ/51241