A frequentist and a Bayesian approach to estimating HIV prevalence accounting for non-response using population-based survey data.
Enhanced and novel frequentist and Bayesian approaches to estimating disease measures such as HIV prevalence are explored and applied to population-based complex survey data, making use of recent advances in statistical computing software. In particular, design-consistent estimates of HIV prevalence are computed and logistic regression models for HIV prevalence are fitted under each approach. Practical survey data are rarely obtained by simple random sampling; instead, complex sampling designs that reflect the complex underlying population structure are employed. These designs usually involve stratification, multistage sampling and unequal selection probabilities of sampling units, giving rise to data that are hierarchical (multilevel), clustered and hence correlated. This is particularly true of large-scale population-based surveys. The resulting within-cluster correlation and multiple sources of variability render standard statistical methods based on the assumption of independent units inappropriate. Survey logistic regression models built from a generalized linear modelling framework were used to explain the variation in HIV prevalence while accounting for the non-independence of the units. In addition, a hierarchical logistic regression model built from a generalized linear mixed modelling framework was used to capture the variability and correlation of units within clusters and to determine how the different levels of the hierarchy interact and affect the response variable. In particular, logistic regression models for HIV prevalence on demographic, behavioural and socio-economic variables were developed from both a frequentist and a Bayesian perspective. Statistical methods that incorporate prior information about unknown parameters are vital in most scientific and biological research, especially in studies where replicated experimental investigations are not possible.
The Bayesian statistical paradigm offers a framework in which the prior distribution of a parameter is combined with the likelihood of the observed data to obtain a posterior distribution that explains the stochastic variation in a response variable. Computer-intensive, simulation-based algorithms such as Markov chain Monte Carlo (MCMC) methods were used to draw samples from the posterior distribution for inference. A Bayesian logistic regression model for HIV prevalence on demographic and socio-economic variables was fitted within a generalized linear modelling framework using MCMC algorithms. Furthermore, practical complex survey data are often characterized by missing observations due to non-response, and this is true of the data used in the current research. Analyses of such data often take a complete-case approach, that is, list-wise deletion of all cases with missing observations, under the assumption that the values are missing completely at random (MCAR). In the current research, we instead generate multiple values for each missing observation using a multiple imputation method that accounts for the structure of the data. This produces rectangular complete data sets, and the uncertainty induced by the imputation process itself is accounted for. The study uses complex survey data (multi-layered and clustered, with missing values) from the 2010-11 Zimbabwe Demographic and Health Surveys (2010-11 ZDHS). The results show that HIV prevalence varies considerably across subgroups of the population. All analyses were done using R statistical software packages.
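As a minimal sketch of how the uncertainty induced by imputation can be carried into the final estimate, the Python snippet below pools results from several imputed data sets using Rubin's rules, a standard pooling method for multiple imputation. It is an illustration only: the analyses in this study were done in R, the abstract does not state which pooling rule was applied, and the names `pool_rubin` and `PooledEstimate` as well as the input values are hypothetical.

```python
from statistics import mean, variance
from typing import NamedTuple, Sequence


class PooledEstimate(NamedTuple):
    estimate: float   # pooled point estimate (mean over imputations)
    within: float     # average within-imputation variance
    between: float    # between-imputation variance
    total: float      # total variance of the pooled estimate


def pool_rubin(estimates: Sequence[float],
               variances: Sequence[float]) -> PooledEstimate:
    """Pool m completed-data estimates and their variances (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least two estimates with matching variances")
    q_bar = mean(estimates)          # pooled point estimate
    within = mean(variances)         # mean of the completed-data variances
    between = variance(estimates)    # sample variance, divisor m - 1
    # The (1 + 1/m) factor inflates the between-imputation variance to
    # reflect that only a finite number of imputations was drawn.
    total = within + (1 + 1 / m) * between
    return PooledEstimate(q_bar, within, between, total)


# Illustrative values only; these are NOT results from the 2010-11 ZDHS.
pooled = pool_rubin([0.10, 0.12, 0.14], [0.0004, 0.0004, 0.0004])
print(pooled.estimate, pooled.total ** 0.5)
```

The total variance is never smaller than the average within-imputation variance, which is how the extra uncertainty from imputing the missing values, rather than observing them, enters the standard error.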