Doctoral Degrees (Statistics)
Permanent URI for this collection: https://hdl.handle.net/10413/7126
Browsing Doctoral Degrees (Statistics) by Date Accessioned
Item D-optimal designs for drug synergy. (2009) Kabera, Muregancuro Gaëtan.; Ndlovu, Principal.; Haines, Linda Margaret.
This thesis is focused on the construction of optimal designs for detecting drug interaction using the two-variable binary logistic model. Two specific models are considered: (1) the binary two-variable logistic model without interaction, and (2) the binary two-variable logistic model with interaction. The two explanatory variables are assumed to be doses of two drugs that may or may not interact when jointly administered to subjects. The main objective of the thesis is to algebraically construct the optimal designs. However, numerical computations are used for constructing optimal designs in cumbersome cases. The problem of constructing optimal designs is to allocate weights to specific points of the design space in such a way that information associated with model parameters is maximized and the variances of the mean responses are minimized. Specifically, the D-optimality criterion discussed in this thesis minimizes the determinant of the asymptotic variance-covariance matrix of the estimates of the model parameters. The number of support points of the D-optimal designs for the two-variable binary logistic model without interaction varies from 3 to 6. Support points are equally weighted only in the case of the 3-point designs and in some special cases of the 4-point designs. The number of support points of the D-optimal designs for the two-variable binary logistic model with interaction varies from 4 to 8. Support points are equally weighted only in the case of the 4-point designs and in some special cases of the 8-point designs.
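The D-optimality criterion described above can be illustrated with a short numerical sketch. This is not the thesis's algebraic construction: the function names, the parameter vector `beta` and the design points are hypothetical, chosen only to show how the criterion is evaluated. For a weighted design, the per-observation information matrix of the logistic model is the weighted sum of p(1 - p) f(x) f(x)^T over the support points, and minimizing the determinant of the asymptotic variance-covariance matrix is equivalent to maximizing the determinant of this information matrix.

```python
import numpy as np

def information_matrix(points, weights, beta, interaction=False):
    """Information matrix of a weighted design for a two-variable
    binary logistic model (hypothetical helper, for illustration)."""
    points = np.asarray(points, dtype=float)
    weights = np.asarray(weights, dtype=float)
    beta = np.asarray(beta, dtype=float)
    # Model vector f(x) = (1, x1, x2), plus x1*x2 if interaction is included.
    cols = [np.ones(len(points)), points[:, 0], points[:, 1]]
    if interaction:
        cols.append(points[:, 0] * points[:, 1])
    F = np.column_stack(cols)
    p = 1.0 / (1.0 + np.exp(-(F @ beta)))   # success probabilities
    v = weights * p * (1.0 - p)             # design weight times binomial variance
    return F.T @ (v[:, None] * F)

def log_det_information(points, weights, beta, interaction=False):
    """Log-determinant of the information matrix (larger is better)."""
    M = information_matrix(points, weights, beta, interaction)
    sign, logdet = np.linalg.slogdet(M)
    return logdet if sign > 0 else -np.inf

# Assumed values, for illustration only.
beta = np.array([-2.0, 1.0, 1.0])
support = [(0.0, 0.0), (4.0, 0.0), (0.0, 4.0)]
equal = log_det_information(support, [1 / 3, 1 / 3, 1 / 3], beta)
unequal = log_det_information(support, [0.5, 0.3, 0.2], beta)
print(equal > unequal)
```

For a design with as many support points as parameters, the determinant factorizes into a term depending on the points and the product of the weights, so equal weights are best for that fixed support. This agrees with the equal weighting reported for the 3-point designs.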
Numerous examples are given to illustrate theoretical results.
Item Statistical methods for longitudinal binary data structure with applications to antiretroviral medication adherence. (2010) Maqutu, Dikokole.; Zewotir, Temesgen Tenaw.
Longitudinal data tend to be correlated, which poses a challenge in the analysis because the correlation has to be accounted for to obtain valid inference. We study various statistical methods for such correlated longitudinal binary responses. These models can be grouped into five model families, namely, marginal, subject-specific, transition, joint and semi-parametric models. Each one of the models has its own strengths and weaknesses. Application of these models is carried out by analyzing data on patients’ adherence status to highly active antiretroviral therapy (HAART). One other complicating issue with the HAART adherence data is missingness. Although some of the models are flexible in handling missing data, they make certain assumptions about missing data mechanisms, the most restrictive being missing completely at random (MCAR). The test for MCAR revealed that dropout did not depend on the previous outcome. A logistic regression model was used to identify predictors for the patients’ first month’s adherence status. A marginal model was then fitted using generalized estimating equations (GEE) to identify predictors of long-term HAART adherence. This provided marginal population-based estimates, which are important from a public health perspective. We further explored the subject-specific effects that are unique to a particular individual by fitting a generalized linear mixed model (GLMM). The GLMM was also used to assess the association structure of the data. To assess whether the current optimal adherence status of a patient depended on the previous adherence measurements (history) in addition to the explanatory variables, a transition model was fitted.
Moreover, a joint modeling approach was used to investigate the joint effect of the predictor variables on both HAART adherence status of patients and duration between successive visits. Assessing the association between the two outcomes was also of interest. Furthermore, longitudinal trajectories of observed data may be very complex, especially when dealing with practical applications, and as such, parametric statistical models may not be flexible enough to capture the main features of the longitudinal profiles, and so a semiparametric approach was adopted. Specifically, generalized additive mixed models were used to model the effect of time as well as interactions associated with time non-parametrically.
Item Comparative approaches to handling missing data, with particular focus on multiple imputation for both cross-sectional and longitudinal models. (2012) Hassan, Ali Satty Ali.; Mwambi, Henry G.
Much data-based research is characterized by the unavoidable problem of incompleteness as a result of missing or erroneous values. This thesis discusses some of the various strategies and basic issues in statistical data analysis to address the missing data problem, and deals with both the problem of missing covariates and missing outcomes. We restrict our attention to methodologies which address a specific missing data pattern, namely monotone missingness. The thesis is divided into two parts. The first part places particular emphasis on the so-called missing at random (MAR) assumption, but focuses the bulk of attention on multiple imputation techniques. The main aim of this part is to investigate various modelling techniques using application studies, and to specify the most appropriate techniques as well as gain insight into the appropriateness of these techniques for handling incomplete data analysis. This thesis first deals with the problem of missing covariate values to estimate regression parameters under a monotone missing covariate pattern.
The study is devoted to a comparison of different imputation techniques, namely Markov chain Monte Carlo (MCMC), regression, propensity score (PS) and last observation carried forward (LOCF). The results from the application study revealed that we have universally best methods to deal with missing covariates when the missing data pattern is monotone. Of the methods explored, the MCMC and regression methods of imputation to estimate regression parameters with monotone missingness were preferable to the PS and LOCF methods. This study is also concerned with comparative analysis of the techniques applied to incomplete Gaussian longitudinal outcome or response data due to random dropout. Three different methods are assessed and investigated, namely multiple imputation (MI), inverse probability weighting (IPW) and direct likelihood analysis. The findings in general favoured MI over IPW in the case of continuous outcomes, even when the MAR mechanism holds. The findings further suggest that the use of MI and direct likelihood techniques leads to accurate and equivalent results, as both techniques arrive at the same substantive conclusions. The study also compares and contrasts several statistical methods for analyzing incomplete non-Gaussian longitudinal outcomes when the underlying study is subject to ignorable dropout. The methods considered include weighted generalized estimating equations (WGEE), multiple imputation after generalized estimating equations (MI-GEE) and the generalized linear mixed model (GLMM). The current study found that the MI-GEE method was considerably robust, doing better than all the other methods for both small and large sample sizes, regardless of the dropout rates. The primary interest of the second part of the thesis falls under the non-ignorable dropout (MNAR) modelling frameworks that rely on sensitivity analysis in modelling incomplete Gaussian longitudinal data.
The aim of this part is to deal with non-random dropout by explicitly modelling the assumptions that caused the dropout and incorporating this additional sub-model into the model for the measurement data, and to assess the sensitivity of the modelling assumptions. The study pays attention to the analysis of repeated Gaussian measures subject to potentially non-random dropout in order to study the influence on inference that might be caused in the data by the dropout process. We consider the construction of a particular type of selection model, namely the Diggle-Kenward model, as a tool for assessing the sensitivity of a selection model in terms of the modelling assumptions. The major conclusions drawn were that there was evidence in favour of the MAR process rather than an MCAR process in the context of the assumed model. In addition, there was the need to obtain further insight into the data by comparing various sensitivity analysis frameworks. Lastly, two families of models were also compared and contrasted to investigate the potential influence on inference that dropout might have or exert on the dependent measurement data considered, and to deal with incomplete sequences. The models were based on selection and pattern mixture frameworks used for sensitivity analysis to jointly model the distribution of the dropout process and longitudinal measurement process. The results of the sensitivity analysis were in agreement and hence led to similar parameter estimates. Additional confidence in the findings was gained as both models led to similar results for significant effects such as marginal treatment effects.
Item Likelihood based statistical methods for estimating HIV incidence rate. (2013) Gabaitiri, Lesego.; Mwambi, Henry G.
Estimation of current levels of human immunodeficiency virus (HIV) incidence is essential for monitoring the impact of an epidemic, determining public health priorities, assessing the impact of interventions and for planning purposes.
However, there is often insufficient data on incidence as compared to prevalence. A direct approach is to estimate incidence from longitudinal cohort studies. Although this approach can provide a direct and unbiased measure of incidence for settings where the study is conducted, it is often too expensive and time consuming. An alternative approach is to estimate incidence from cross-sectional surveys using biomarkers that distinguish between recent and non-recent/longstanding infections. The original biomarker-based approach proposes the detection of HIV-1 p24 antigen in the pre-seroconversion period to identify persons with acute infection for estimating HIV incidence. However, this approach requires large sample sizes in order to obtain reliable estimates of HIV incidence because the duration of antigenemia before antibody detection is short, about 22.5 days. Subsequently, another method that involves a dual antibody testing system was developed. In stage one, a sensitive test is used to diagnose HIV infection, and a less sensitive test is used in the second stage to distinguish between longstanding infections and recent infections among those who tested positive for HIV in stage one. The question is: how do we combine this data with other relevant information, such as the period an individual takes from being undetectable by a less sensitive test to being detectable, to estimate incidence? The main objective of this thesis is therefore to develop a likelihood-based method that can be used to estimate HIV incidence when data is derived from cross-sectional surveys and the disease classification is achieved by combining two biomarker or assay tests. The thesis builds on the dual antibody testing approach and extends the statistical framework that uses the multinomial distribution to derive the maximum likelihood estimators of HIV incidence for different settings.
In order to improve incidence estimation, we develop a model for estimating HIV incidence that incorporates information on the previous or past prevalence and derive maximum likelihood estimators of incidence assuming incidence density is constant over a specified period. Later, we extend the method to settings where a proportion of subjects remain non-reactive to a less sensitive test long after seroconversion. Diagnostic tests used to determine recent infections are prone to errors. To address this problem, we considered a method that simultaneously makes adjustment for sensitivity and specificity. In addition, we also showed that sensitivity is similar to the proportion of subjects who eventually transit the “recent infection” state. We also relax the assumption of constant incidence density by proposing linear incidence density to accommodate settings where incidence might be declining or increasing. We extend the standard adjusted model for estimating incidence to settings where some subjects who tested positive for HIV antibodies were not tested by a less sensitive test, resulting in missing outcome data. Models for the risk factors (covariates) of HIV incidence are considered in the last but one chapter. We used data from the Botswana AIDS Impact Survey (BAIS) III of 2008 to illustrate the proposed methods. The general conclusion and recommendations for future work are provided in the final chapter.
Item Statistical modelling of availability of major food cereals in Lesotho: application of regression models and diagnostics. (2012) Khoeli, Makhala Bernice.; Mwambi, Henry G.
Oftentimes, application of regression models to analyse cereals data is limited to estimating and predicting crop production or yield. The general approach has been to fit the model without much consideration of the problems that accompany application of regression models to real life data, such as collinearity, models not fitting the data correctly and violation of assumptions.
These problems may interfere with applicability and usefulness of the models, and compromise validity of results if they are not corrected when fitting the model. We applied regression models and diagnostics on national and household data to model availability of main cereals in Lesotho, namely, maize, sorghum and wheat. The application includes the linear regression model, regression and collinearity diagnostics, Box-Cox transformation, ridge regression, quantile regression, logistic regression and its extensions with multiple nominal and ordinal responses. The linear model with a first-order autoregressive process, AR(1), was used to determine factors that affected availability of cereals at the national level. Case deletion diagnostics were used to identify extreme observations with influence on different quantities of the fitted regression model, such as estimated parameters, predicted values, and covariance matrix of the estimates. Collinearity diagnostics detected the presence of more than one collinear relationship coexisting in the data set. They also determined variables involved in each relationship, and assessed potential negative impact of collinearity on estimated parameters. Ridge regression remedied collinearity problems by controlling inflation and instability of estimates. The Box-Cox transformation corrected non-constant variance and the longer and heavier tails of the distribution of data. These increased applicability and usefulness of the linear models in modeling availability of cereals. Quantile regression, as a robust regression, was applied to the household data as an alternative to classical regression. Classical regression estimates from the ordinary least squares method are sensitive to distributions with longer and heavier tails than the normal distribution, as well as to outliers. Quantile regression estimates appear to be more efficient than least squares estimates for a wide range of error term distributions.
We studied availability of cereals further by categorizing households according to availability of different cereals, and applied the logistic regression model and its extensions. Logistic regression was applied to model availability and non-availability of cereals. Multinomial logistic regression was applied to model availability with nominal multiple categories. Ordinal logistic regression was applied to model availability with ordinal categories and this made full use of available information. The three variants of the logistic regression model gave results that are in agreement, which are also in agreement with the results from the linear regression model and quantile regression model.
Item Use of statistical modelling and analyses of malaria rapid diagnostic test outcome in Ethiopia. (2013) Ayele, Dawit Getnet.; Zewotir, Temesgen Tenaw.; Mwambi, Henry G.
The transmission of malaria is among the leading public health problems in Ethiopia. From the total area of Ethiopia, more than 75% is malarious. Identifying the infectiousness of malaria by socio-economic, demographic and geographic risk factors based on the malaria rapid diagnosis test (RDT) survey results has several advantages for planning, monitoring and controlling, and eventual malaria eradication effort. Such a study requires thorough understanding of the disease process and associated factors. However, such studies are limited. Therefore, the aim of this study was to use different statistical tools suitable to identify socioeconomic, demographic and geographic risk factors of malaria based on the malaria rapid diagnosis test (RDT) survey results in Ethiopia. A total of 224 clusters of about 25 households were selected from the Amhara, Oromiya and Southern Nation Nationalities and People (SNNP) regions of Ethiopia. Accordingly, a number of binary response statistical analysis models were used.
Multiple correspondence analysis was carried out to identify the association among socioeconomic, demographic and geographic factors. Moreover, a number of binary response models such as survey logistic, GLMM, GLMM with spatial correlation, joint models and semi-parametric models were applied. To test and investigate how well the observed malaria RDT result, use of mosquito nets and use of indoor residual spray data fit the expectations of the model, a Rasch model was used. The fitted models have their own strengths and weaknesses. Application of these models was carried out by analysing data on malaria RDT result. The data used in this study, which was conducted from December 2006 to January 2007 by The Carter Center, is from the baseline malaria indicator survey in Amhara, Oromiya and Southern Nation Nationalities and People (SNNP) regions of Ethiopia. The correspondence analysis and survey logistic regression model were used to identify predictors which affect malaria RDT results. The effects of identified socioeconomic, demographic and geographic factors were subsequently explored by fitting a generalized linear mixed model (GLMM), i.e., to assess the covariance structures of the random components (to assess the association structure of the data). To examine whether the data displayed any spatial autocorrelation, i.e., whether surveys that are near in space have malaria prevalence or incidence that is similar to the surveys that are far apart, spatial statistics analysis was performed. This was done by introducing spatial autocorrelation structure in GLMM. Moreover, the customary two variables joint modelling approach was extended to three variables joint effect by exploring the joint effect of malaria RDT result, use of mosquito nets and indoor residual spray in the last twelve months. Assessing the association between these outcomes was also of interest.
Furthermore, the relationships between the response and some confounding covariates may have unknown functional form. This led to proposing the use of semiparametric additive models which are less restrictive in their specification. Therefore, generalized additive mixed models were used to model the effect of age, family size, number of rooms per person, number of nets per person, altitude and number of months the room was sprayed nonparametrically. The result from the study suggests that the correct use of mosquito nets, indoor residual spraying and other preventative measures, coupled with factors such as the number of rooms in a house, is associated with a decrease in the incidence of malaria as determined by the RDT. However, the study also suggests that the poor are less likely to use these preventative measures to effectively counteract the spread of malaria. In order to determine whether or not the limited number of respondents had undue influence on the malaria RDT result, a Rasch model was used. The result shows that none of the responses had such influences. Therefore, application of the Rasch model has supported the viability of the total sixteen (socio-economic, demographic and geographic) items for measuring malaria RDT result, use of indoor residual spray and use of mosquito nets. From the analysis it can be seen that the scale shows high reliability. Hence, the result from the Rasch model supports the analysis carried out in previous models.
Item Statistical methods for analysing complex survey data: an application to HIV/AIDS in Ethiopia. (2013) Mohammed, Mohammed Omar Musa.; Zewotir, Temesgen Tenaw.; Achia, Thomas Noel Ochieng.
The HIV/AIDS pandemic is currently the most challenging public health matter that faces third world countries, especially those in Sub-Saharan Africa. Ethiopia, in East Africa, with a generalised and highly heterogeneous epidemic, is no exception, with HIV/AIDS affecting most sectors of the economy.
The first case of HIV in Ethiopia was reported in 1984. Since then, HIV/AIDS has become a major public health concern, leading the Government of Ethiopia to declare a public health emergency in 2002. In 2011, the adult HIV/AIDS prevalence in Ethiopia was estimated at 1.5%. Approximately 1.2 million Ethiopians were living with HIV/AIDS in 2010. Surveys are an important and popular tool for collecting data. Analytical use of survey data, especially health survey data, has become very common, with a focus on the association of particular outcome variables with explanatory variables at the population level. In this study we used the data from the 2005 Ethiopian Demographic and Health Survey (EDHS 2005), and identified key demographic, socioeconomic, sociocultural, behavioral and proximate determinants of HIV/AIDS risk factor. Usually most survey analysts ignore the complex survey design issues like clustering, stratification and unequal probability of selection (weights). This study deals with complex survey design and takes the design aspect into account, because failure to do so leads to biased parameter estimates and standard errors, wide confidence intervals and incorrect statistical tests. In this study, three statistical approaches were used to analyse the complex survey data. The first approach was a survey logistic regression used to model the binary outcome (HIV serostatus) and set of explanatory variables (the dependence of the HIV risk factors). The difference between survey logistic regression and the ordinary logistic regression is that the survey logistic regression approach takes the study design into account during analysis. The second approach was a multilevel logistic regression model that assumed that the data structure in the population was hierarchical, and that individuals within households were selected from clusters that were randomly selected from a national sampling frame. We considered a three-level model for our analysis.
This second approach considered the results from frequentist and Bayesian multilevel models. Bayesian methods can provide accurate estimates of the parameters and the uncertainty associated with them. The third approach used was a spatial modelling approach where model parameters were estimated under the Integrated Nested Laplace Approximation (INLA) paradigm.
Item Covariates and latents in growth modelling. (2014) Melesse, Sileshi Fanta.; Zewotir, Temesgen Tenaw.
The growth curve models are the natural models for the increment processes taking place gradually over time. When individuals are observed over time it is often apparent that they grow at different rates, even though they are clones and no differences in treatment or environment are present. Nevertheless, the classical growth curve model only deals with the average growth and does not account for individual differences, nor does it have room to accommodate covariates. Accordingly we strive to construct and investigate tractable models which incorporate both individual effects and covariates. The study was motivated by plantations of fast growing tree species, and the climatic and genetic factors that influence stem radial growth of juvenile Eucalyptus hybrids grown on the east coast of South Africa. Measurement of stem radius was conducted using dendrometers on eighteen sampled trees of two Eucalyptus hybrid clones (E. grandis × E. urophylla, GU and E. grandis × E. camaldulensis, GC). Information on climatic data (temperature, rainfall, solar radiation, relative humidity and wind speed) was simultaneously collected from the study site. We explored various functional statistical models which are able to handle the growth, individual traits, and covariates. These models include partial least squares approaches, principal component regression, path models, fractional polynomial models, nonlinear mixed models and additive mixed models. Each one of these models has strengths and weaknesses.
Application of these models is carried out by analysing the stem radial growth data. The partial least squares and principal component regression methods were used to identify the most important predictor for stem radial growth. A path model approach was then applied mainly to find some indirect effects of climatic factors. We further explored the tree-specific effects that are unique to a particular tree under study by fitting a fractional polynomial model in the context of the linear mixed effects model. The fitted fractional polynomial model showed that the relationship between stem radius and tree age is nonlinear. The performance of fractional polynomial models was compared with that of nonlinear mixed effects models. Using nonlinear mixed effects models some growth parameters like inflection points were estimated. Moreover, the fractional polynomial model fit was almost as good as the nonlinear growth curves. Consequently, the fractional polynomial model fit was extended to include the effects of all climatic variables. Furthermore, the parametric methods do not allow the data to decide the most suitable form of the functions. In order to capture the main features of the longitudinal profiles in a more flexible way, a semiparametric approach was adopted. Specifically, the additive mixed models were used to model the effect of tree age as well as the effect of each climatic factor.
Item Combining dynamic factor models and artificial neural networks in time series forecasting with applications. (2014) Babikir, Ali Basher Abd Allah.; Mwambi, Henry Godwell.
This study investigates and examines the advantages and forecasting performance of combining the dynamic factor model (DFM) and artificial neural networks (ANNs), leading to novel models that have capabilities to produce more accurate forecasts, with application to the South African financial sector data.
The overall aim of the study is to provide forecasting models that accommodate all relevant variables and the presence of any nonlinearity in the data to produce more adequate forecasts and serve as an alternative to traditional and current forecasting models, particularly in the presence of a changing and interacting environment. The thesis consists of four independent papers corresponding to four chapters. The first chapter brings together two important developments in forecasting literature: the artificial neural networks (ANNs) and factor models. The chapter introduces the Factor Augmented Artificial Neural Network (FAANN) hybrid model in order to produce more accurate forecasts. The model is applied to forecasting three time series variables, namely, Deposit rate, Gold mining share prices and Long term interest rate. The out-of-sample root mean square error (RMSE) and Diebold-Mariano test results show that the FAANN model yields substantial improvements over the autoregressive (AR) benchmark model and the standard dynamic factor model (DFM). The superiority of the FAANN model is due to the ANNs' flexibility to account for potentially complex nonlinear relationships that are not easily captured by linear models. In the second chapter we introduce a new model that exploits the artificial neural networks model as a data smoother to alleviate the effect of major financial crises and nonlinearity due to high fluctuations such as those associated with the 2008 crisis. The chapter introduces the ANN-DFM model, where in the first stage the best fitted ANNs for each single series of the data set, which contains 228 monthly series, are used to obtain the in-sample forecasts of each series. In the second stage, the factor model is used to extract the factors from the smoothed data set, and then these factors are used as explanatory variables in forecasting.
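The second stage described above (extract factors from a large panel, then use them as explanatory variables in a forecast) can be sketched as follows. This is a simplified illustration rather than the thesis's model: static principal components stand in for the dynamic factor model, the ANN smoothing stage is omitted, the data are simulated, and all function names are hypothetical.

```python
import numpy as np

def extract_factors(X, n_factors):
    """Principal-component factors of a (T x N) panel of series."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0)       # standardize each series
    U, s, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :n_factors] * s[:n_factors]        # (T x n_factors) scores

def factor_forecast(y, factors, horizon):
    """Regress y[t + horizon] on factors[t] by OLS, then forecast
    from the most recent factor values."""
    T = len(y)
    A = np.column_stack([np.ones(T - horizon), factors[:T - horizon]])
    coef, *_ = np.linalg.lstsq(A, y[horizon:], rcond=None)
    latest = np.concatenate([[1.0], factors[-1]])
    return float(latest @ coef)

def rmse(actual, predicted):
    """Root mean square error between two equal-length sequences."""
    a = np.asarray(actual, dtype=float)
    p = np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean((a - p) ** 2)))

# Simulated panel driven by one persistent common factor (assumed data).
rng = np.random.default_rng(0)
T, N = 120, 20
f = np.zeros(T)
for t in range(1, T):
    f[t] = 0.8 * f[t - 1] + rng.standard_normal()
X = np.outer(f, rng.standard_normal(N)) + rng.standard_normal((T, N))
y = 0.5 * f + 0.3 * rng.standard_normal(T)

factors = extract_factors(X, n_factors=2)
print(factor_forecast(y, factors, horizon=3))
```

Out-of-sample comparisons of the kind reported in the abstract would apply `rmse` to forecasts produced over a rolling or expanding window for each competing model.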
The model is applied to forecast three South African variables, namely, Rate on 3-month trade financing, Lending rate and Short term interest rate in the period 1992:01 to 2011:12. The results, based on the root mean square errors of three, six and twelve months ahead out-of-sample forecasts over the period 2007:01 to 2011:12, indicate that, in all of the cases, the ANN-DFM and the DFM statistically outperform the autoregressive (AR) models. In the majority of the cases the ANN-DFM outperforms the DFM. The results indicate the usefulness of smoothing and factor extraction in forecasting performance. The forecast results are confirmed by the test of the equality of forecast accuracy proposed by Diebold and Mariano (1995). The third chapter evaluates the role of the DFM model (linear in nature) and the ANN model (with capacity to handle nonlinearity) as competing forecasting estimation methods. The chapter uses artificial neural networks (ANNs) as a nonlinear method based on the fact that the relationships between input and output variables in ANNs do not need to be specified in advance. In this chapter, the same extracted factors are used as input and independent variables for ANNs and the Dynamic Factor Model. This was necessary in order to investigate the forecasting performance of the linear and the nonlinear methods under the same conditions. We refer to the new model as Factor Artificial Neural Network (FANN). The empirical results of the Root Mean Square Error (RMSE) for the out-of-sample forecasts from 2007:01 to 2011:12 indicate that the proposed FANN model is an effective way to improve forecasting accuracy over the Dynamic Factor Model (DFM), the ANN and the AR benchmark model. The results confirm the usefulness of the factors that were extracted from a large set of related variables when we compared the FANN model and the standard univariate ANN model.
Finally, combining forecasts is often considered as a successful alternative to using just an individual forecasting method. Different forecasting methods are considered, especially when the forecasts are generated from the linear and the nonlinear methods. Thus, chapter four investigates the forecasting performance of combining independent forecasts of the Dynamic Factor Model and the Artificial Neural Networks models using linear and nonlinear combining procedures for the same variables of interest. The analysis was based on three financial variables, namely the JSE return index, government bond return index and the Rand/Dollar exchange rate in South Africa. The out-of-sample results of three, six and twelve month horizons from 2006:01 to 2011:12 for the DFM and ANNs provided more adequate forecasts compared to benchmark auto-regressive (AR) models, with reduction in the RMSE of around 2 to 12 percent for all variables and over all forecasting horizons. The ANN as a nonlinear combining method outperforms all linear combining methods and is the best individual model for all variables and over all forecasting horizons. The results suggest that the ANN combining method can be used as an alternative to linear combining methods to achieve greater forecasting accuracy. We attribute the superiority of the ANN combining method to its ability to capture any existing nonlinear relationship between the individual forecasts and the actual forecasting values.
Item Bayesian spatial models with application to HIV, TB and STI modeling in Kenya. (2014) Owino, Ngesa Oscar.; Mwambi, Henry Godwell.; Achia, Thomas Noel Ochieng.
This dissertation is concerned with developing and extending statistical models in the area of spatial modeling with particular interest towards application to HIV, TB and HSV-2 data. Hierarchical spatial modeling is a common and useful approach for modeling complex spatially correlated data in many settings in epidemiological, public health and ecological studies.
Chapter 1 of this thesis gives a chronological development of disease mapping models, from non-spatial to spatial and from single disease models to multiple disease models. In Chapter 2, a new model that relaxes the over-restrictive normal distribution assumption on the spatially unstructured random effect by using the generalised Gaussian distribution is introduced and investigated. The third chapter provides a framework for including sampling weights in the Bayesian hierarchical disease mapping model. In this model, the design effect is used to re-scale the sample sizes. A new model for overdispersed spatially correlated binary data is developed in chapter 4 of this thesis; in this model, the overdispersion parameter is modeled by a beta random effect which is also allowed to vary spatially. In chapter 5, the common multiple spatial disease mapping models are reviewed and adapted for the binary data at hand, since the original models were developed for Poisson count data. The methodologies developed in this dissertation widen the toolbox for spatial analysis and disease mapping in applications in epidemiology and public health studies.
Item Nonlinear mixed-effects models for multi-variate longitudinal data with application to HIV disease dynamics. (2014) Luwanda, Artz George.; Mwambi, Henry Godwell. The motivation for the study of nonlinear mixed-effects models is the growing interest in the estimation of parameters in HIV disease dynamical models using real multivariate longitudinal data with varying degrees of informativeness. Special analytical and approximation techniques are needed to deal with such data because the repeated observations on any experimental unit are likely to be correlated over time, while multiple outcomes within the unit will also be correlated. Furthermore, observations may be made irregularly within and between individuals, making direct use of standard methods practically impossible. 
In this thesis, we consider a nonlinear mixed-effects model for a multivariate response variable that takes into account left-censored observations. Then we study a case where data are unbalanced among subjects and also within a subject because, for some reason, only a subset of the multiple outcomes of the response variable is observed at any one occasion. Dropout models that take into consideration the partially observed outcomes are proposed. We further derive a joint likelihood function which takes into account the multivariate responses and the unbalancedness in such data as a result of censoring and dropout. We then show how the methodology can be used in the estimation of the parameters that characterise the HIV dynamical system in the presence of several covariates. We have also used multiple imputation to compare covariate coefficients in the complete data and the partially observed data. Through a simulation study, we have also seen that a small limit of quantification provides better parameter estimates in the sense of standard errors and confidence limits of the parameters. Since there are usually no analytic solutions for such complex models, the stochastic approximation Expectation-Maximisation (SAEM) algorithm is used as an approximation method. The methodology is illustrated using a routine observational dataset from two HIV clinics in Malawi.
Item A perspective on incomplete data in longitudinal multi-arm clinical trials, with emphasis on pattern-mixture-model based methodology. (2014) Grobler, Anna Christina.; Matthews, Glenda Beverley.; Molenberghs, Geert. Missing data are common in longitudinal clinical trials. Rubin described three different missing data mechanisms based on the level of dependence between the missing data process and the measurement process. These are missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Data are MCAR when the probability of dropout is independent of both observed and unobserved data. 
Data are MAR when the probability of data being missing does not depend on the unobserved data, conditional on the observed data. When neither MCAR nor MAR is valid, data are MNAR. The aim of this thesis is to discuss statistical methodology required for analysing missing outcome data and provide valid statistical methods for the MAR, MCAR and MNAR scenarios. This thesis does not focus on data analysis where covariate data are missing. Under MCAR, complete and available case analyses are valid. When data are MAR, multiple imputation, likelihood-based models, inverse probability weighting and Bayesian models are valid. When data are MNAR, pattern-mixture, selection and shared-parameter models are valid. These methods are illustrated by an in-depth analysis of two data sets with missing data. The first data set is the SAPiT trial, an open-label, randomised controlled trial in HIV-tuberculosis co-infected patients. Patients were randomised to three arms, each initiating antiretroviral therapy at a different time. CD4+ count, an indication of HIV progression, was measured at baseline and every 6 months for 24 months. The primary question was whether CD4+ count trajectory over time differed for the three treatment arms. The assumption that missing data are MCAR was not supported by the observed data. We performed a range of sensitivity analyses under both MAR and MNAR assumptions. The second data set is a placebo-controlled, randomised clinical trial conducted for 8 weeks to determine the effectiveness of hypericum or sertraline in reducing depression, measured by the Hamilton depression scale. The trial randomised 340 participants, with 28% lost to follow-up before Week 8. We performed a sensitivity analysis under different assumptions about the missing data process. The missing data mechanism was not MCAR. 
Under MAR assumptions, some of the sensitivity analyses found no difference between either of the treatment arms and placebo, while some found a significant difference between sertraline and placebo, but not between hypericum and placebo. This re-analysis contributed to the literature around the effectiveness of St John’s Wort because it changed the conclusions of the original analysis.
Item Statistical distributions and modelling of GPS-Telemetry elephant movement data including the effect of covariates. Mutwiri, Robert Mathenge.; Mwambi, Henry Godwell. In this thesis, I investigate the application of various statistical methods to the analysis of GPS tracking data collected using GPS collars placed on large mammals in Kruger National Park, South Africa. Animal movement tracking is a rapidly advancing area of ecological research, and large amounts of data are being collected, with short sampling intervals between successive locations. Appropriate statistical methods that capture most properties of the data are lacking, despite the obvious importance of such information to understanding animal movement. The aim of this study was to investigate appropriate alternative models, compare them with the existing approaches in the literature for analysing GPS tracking data, and establish appropriate statistical approaches for interpreting large-scale mega-herbivore movement patterns. The focus was on which methods are the most appropriate for the linear metrics (step length and movement speed) and circular metrics (turn angles) for these animals, and on the comparison of movement patterns across herds with covariates. A four-parameter family of stable distributions was found to better describe the animal movement linear metrics, as it captured both the skewness and heavy-tail properties of the data. The stable model performed better than the normal, Student's t and skewed Student's t models in an ARMA-GARCH modelling set-up. 
The flexibility of the stable distribution was further demonstrated in a regression model and compared with the heavy-tailed t regression model. We also explore the application of a circular-linear regression model in analysing animal turn angle data with covariates. A regression model assuming von Mises distributed turn angles was shown to fit the data well, and further areas of model development are highlighted. A couple of methods for testing the uniformity hypothesis of turn angles are presented. Finally, models assuming stable-distributed error terms for the linear metrics and von Mises distributed error terms for the turn angles are recommended for analysing animal movement data with covariates.
Item Numerical study of convective fluid flow in porous and non-porous media. (2015) Makanda, Gilbert.; Sibanda, Precious. Abstract available in PDF file.
Item Application of mixed model and spatial analysis methods in multi-environmental and agricultural field trials. (2015) Negash, Asnake Worku.; Mwambi, Henry Godwell.; Zewotir, Temesgen Tenaw. Agricultural experimentation involves selection of experimental materials, selection of experimental units, planning of experiments, collection of relevant information, and analysis and interpretation of the results. The overall work of this thesis concerns the importance, improvement and efficiency of variety contrasts obtained using linear mixed models with spatial variance-covariance structures, compared with the usual ANOVA methods of analysis. It also considers the recent widespread use of biplot analysis of genotype plus genotype by environment interaction (GGE) in the analysis of multi-environmental crop trials, applies a parametric bootstrap method for testing and selecting multiplicative terms in GGE and AMMI models, and presents statistical methods for handling missing data using multiple imputation, principal components and other deterministic approaches. 
Multi-environment agricultural experiments are unbalanced because several genotypes are not tested in some environments or because measurements from some plots are missing during the experimental stage. Imputation of the missing values is sometimes necessary. Multiple imputation of missing data using the cross-validation by eigenvector method and PCA methods is applied. These methods have the advantages of easy computational implementation, no need for distributional or structural assumptions, and no restrictions regarding the pattern or mechanism of missing data in experiments. Genotype by environment (G×E) interaction is associated with the differential performance of genotypes tested at different locations and in different years, and influences selection and recommendation of cultivars. Wheat genotypes were evaluated in six environments to determine the G×E interactions and stability of the genotypes. Additive main effects and multiplicative interactions (AMMI) analysis was conducted for grain yield in both years, and it showed that grain yield variation due to environments, genotypes and G×E was highly significant. Stability for grain yield was determined using genotype plus genotype by environment interaction (GGE) biplot analysis. The first two principal components (PC1 and PC2) were used to create a 2-dimensional GGE biplot. The which-won-where pattern was based on six locations in the first year and five locations in the second year for all twenty genotypes. The resulting pattern is one realization among many possible outcomes; it was not repeated in the second year, and its repeatability in a future year is quite unknown. Repeatability of the which-won-where pattern over years is the necessary and sufficient condition for mega-environment delineation and genotype recommendation. 
The advantages of mixed models with spatial variance-covariance structures, and the direct implications of model choice for inference on varietal performance, ranking and testing, are examined using two multi-environmental data sets from realistic national trials. A model comparison with a χ²-test for the trials in the two data sets (wheat and barley data) suggested that selected spatial variance-covariance structures fitted the data significantly better than the ANOVA model. The forms of the optimally fitted spatial variance-covariance structure, the ranking and the consistency ratio test were not the same from one trial (location) to the other. Linear mixed models with single-stage analysis, including a spatial variance-covariance structure with a group factor of location in the random model, also improved the estimation of the real genotype effects and their ranking. The model also improved varietal performance estimation because of its capacity to handle additional sources of variation, namely location and genotype by location (environment) interaction variation, and to accommodate local stationary trend. Knowledge and understanding of statistical methods for the analysis of multi-environmental data is particularly important for plant breeders and those working on the improvement of plant varieties, for proper selection and decision making at the next level of improvement for national agricultural development.
Item Statistical methods to evaluate disease outcome diagnostic accuracy of multiple biomarkers with application to HIV and TB research. (2015) Mohammed, Muna Balla Elshareef.; Mwambi, Henry Godwell. One challenge in clinical medicine is that of the correct diagnosis of disease. Medical researchers invest considerable time and effort in improving the accuracy of disease diagnosis, and it follows that diagnostic tests are important components of modern medical practice. 
The receiver operating characteristic (ROC) is a statistical tool commonly used for describing the discriminatory accuracy and performance of a diagnostic test. A popular summary index of discriminatory accuracy is the area under the ROC curve (AUC). In medical research, scientists simultaneously evaluate hundreds of biomarkers. A critical challenge is the combination of biomarkers into models that give insight into disease. In infectious disease, biomarkers are often evaluated as well as in the microorganism or virus causing infection, adding more complexity to the analysis. In addition to providing an improved understanding of factors associated with infection and disease development, combinations of relevant markers are important to the diagnosis and treatment of disease. Taken together, this extends the role of the statistical analyst and presents many novel and major challenges. This thesis discusses some of the various strategies and issues in using statistical data analysis to address the diagnosis problem of selecting and combining multiple markers to estimate the predictive accuracy of test results. We also consider different methodologies to address missing data and to improve the predictive accuracy in the presence of incomplete data. The thesis is divided into five parts. The first part is an introduction to the theory behind the methods that we used in this work. The second part places emphasis on the so-called classic ROC analysis, which is applied to cross-sectional data. The main aim of this chapter is to address the problem of how to select and combine multiple markers and evaluate the appropriateness of certain techniques used in estimating the area under the ROC curve (AUC). Logistic regression models offer a simple method for combining markers. We applied resampling methods to adjust for over-fitting associated with model selection. 
We simulated several multivariate models to evaluate the performance of the resampling approaches in this setting. We applied these methods to data collected from a study of tuberculosis immune reconstitution inflammatory syndrome (TB-IRIS) in Cape Town, South Africa. Baseline levels of five biomarkers were evaluated and we used this dataset to evaluate whether a combination of these biomarkers could accurately discriminate between TB-IRIS and non-TB-IRIS patients, by applying AUC analysis and resampling methods. The third part is concerned with a time-dependent ROC analysis with an event-time outcome and a comparative analysis of the techniques applied to incomplete covariates. Three different methods are assessed and investigated, namely mean imputation, nearest neighbor hot deck imputation and multivariate imputation by chained equations (MICE). These methods were used together with bootstrap and cross-validation to estimate the time-dependent AUC using a non-parametric approach and a Cox model. We simulated several models to evaluate the performance of the resampling approaches and imputation methods. We applied the above methods to a real data set. The fourth part is concerned with applying more advanced variable selection methods to predict the survival of patients using time-dependent ROC analysis. The least absolute shrinkage and selection operator (LASSO) Cox model is applied to estimate the bootstrap cross-validated, .632 and .632+ bootstrap AUCs for a TBM/HIV data set from KwaZulu-Natal in South Africa. We also suggest the use of ridge-Cox regression to estimate the AUC and two-level bootstrapping to estimate the variances of the AUC, in addition to evaluating these suggested methods. 
The last part of the research is an application study using genetic HIV data from rural KwaZulu-Natal to evaluate the sequence of ambiguities as a biomarker to predict recent infection in HIV patients.
Item Measuring poverty and child malnutrition with their determinants from household survey data. (2016) Habyarimana, Faustin.; Zewotir, Temesgen Tenaw.; Ramroop, Shaun. The eradication of poverty and malnutrition is the main objective of most societies and policy makers. But in most cases, developing a perfect or accurate poverty and malnutrition assessment tool to target poor households and malnourished people is a challenge for applied policy research. The poverty of households and malnutrition of children under five years have been measured based on a money metric, and this approach has a number of problems, especially in developing countries. Hence, in this study we developed an asset index from Demographic and Health Survey data as an alternative method to measure poverty of households and malnutrition, and thereby examined different statistical methods that are suitable for identifying the associated factors. Principal component analysis was used to create an asset index for each household, which in turn served as the response variable in the case of poverty and as an explanatory variable (known as wealth quintile) in the case of malnutrition. In order to account for the complexity of the sampling design and the ordering of the outcome variable, a generalized linear mixed model approach was used to extend ordinal survey logistic regression to include random effects and therefore to account for the variability between the primary sampling units or villages. Further, a joint model was used to simultaneously measure malnutrition on three anthropometric indicators and to examine the possible correlation between underweight, stunting and wasting. To account for spatial variability between the villages, we used a spatial multivariate joint model under the generalized linear mixed model. 
A quantile regression model was used in order to consider a complete picture of the relationship between the outcome variable (poverty index and weight-for-age index) and predictor variables at the desired quantiles. We have also used a generalized additive mixed model (semiparametric) in order to relax the assumptions of normality and linearity inherent in linear regression models, where categorical covariates were modeled by a parametric model, and continuous covariates and interactions between the continuous and categorical variables by nonparametric models. A composite index from three anthropometric indices was created and used to identify the association of poverty and malnutrition as well as the factors associated with them. Each of these models has inherent strengths and weaknesses. The choice of one therefore depends on what the researcher is trying to accomplish and the type of data being used. The findings from this study revealed that the level of education of the household head, gender of the household head, age of the household head, size of the household, place of residence and the province are the key determinants of poverty of households in Rwanda. It also revealed that the determinants of malnutrition of children under five years in Rwanda are: child age, birth order of the child, gender of the child, birth weight of the child, fever, multiple birth, mother’s level of education, mother’s age at the birth, anemia, marital status of the mother, body mass index of the mother, mother’s knowledge of nutrition, wealth index of the family, source of drinking water and province. 
Further, this study revealed a positive association between poverty of households and malnutrition of children under five years.
Item A frequentist and a Bayesian approach to estimating HIV prevalence accounting for non-response using population-based survey data. (2016) Chinomona, Amos.; Mwambi, Henry Godwell. Enhanced and novel frequentist and Bayesian approaches to estimating disease measures such as HIV prevalence, utilizing the recent advances in statistical computing software, are explored and applied using population-based complex survey data. In particular, design-consistent estimates and logistic regression models for HIV prevalence are respectively computed and fitted using each of the approaches. Practical survey data are rarely obtained using simple random sampling schemes; instead, complex sampling designs, which are designed to reflect complex underlying population structures, are employed. These designs usually involve stratification, multistage sampling and unequal selection probabilities of sampling units, giving rise to data that are hierarchical (multilevel), clustered, and hence correlated. This is particularly true for large-scale population-based surveys. Consequently, this often gives rise to units that are correlated within clusters as well as multiple sources of variability, rendering standard statistical methods based on the assumption of independence of units inappropriate. Survey logistic regression models built from a generalized linear modelling framework were used to explain the variation in HIV prevalence, accounting for the non-independence of the units. In addition, a hierarchical logistic regression model built from a generalized linear mixed modelling framework was used to capture the variability and correlation of the units within clusters and further determine how different layers interact and impact on a response variable. 
In particular, the logistic regression models for HIV prevalence on demographic, behavioural and socio-economic variables were developed from a frequentist and a Bayesian perspective. Statistical methods that incorporate prior known information about unknown parameters are vital in most scientific and biological research, especially in studies where replicative experimental investigations are not possible. The Bayesian statistical paradigm offers a framework in which a prior distribution of a parameter can be combined with the likelihood of the observed data to obtain a posterior distribution for explaining the stochastic variation in a response variable. Computer-intensive simulation-based algorithms such as Markov chain Monte Carlo (MCMC) methods were used to draw samples from the posterior distribution for inference purposes. A Bayesian logistic regression model for HIV prevalence on demographic and socio-economic variables was fitted from a generalized linear modelling framework using the MCMC algorithms. Furthermore, practical complex survey data are often characterized by missing observations due to non-response, a phenomenon that is true of the data used for the current research. Often, the analyses of such data take a complete case approach, that is, list-wise deletion of all cases with missing observations, assuming that missing values are missing completely at random (MCAR). In the current research, we systematically simulate or generate multiple values for the missing observations under a multiple imputation method accounting for the structure of the data. A rectangular complete data set is produced and the variability or uncertainty induced by the very process of imputing the values for the missing observations is accounted for. The study utilizes complex (multi-layered and clustered data with missing values) survey data obtained from the 2010-11 Zimbabwe Demographic and Health Surveys (2010-11 ZDHS). 
The results show that HIV prevalence varies considerably across subgroups of the population. All the analyses are done using R statistical software packages.
Item Statistical methods for handling incomplete longitudinal data with emphasis on discrete outcomes with application. (2017) Kombo, Abdallah Yussuf.; Mwambi, Henry Godwell. In longitudinal studies, measurements are taken repeatedly over time on the same experimental unit. These measurements are thus correlated. The variances in repeated measures change with respect to time. Therefore, the variations together with the potential correlation patterns produce a complicated variance structure for the measures. Standard regression and analysis of variance techniques may result in invalid inference because they entail some mathematical assumptions that do not hold for repeated measures data. Coupled with the repeated nature of the measurements, these datasets are often imbalanced due to missing data. Methods used should be capable of handling the incomplete nature of the data, with the ability to capture the reasons for missingness in the analysis. This thesis seeks to investigate and compare analysis methods for incomplete correlated data, with primary emphasis on discrete longitudinal data. The thesis adopts the general taxonomy of longitudinal models, including marginal, random effects, and transitional models. Although the objective is to deal with discrete data, the thesis starts with one continuous data case. Chapter 2 presents a comparative analysis on how to handle longitudinal continuous outcomes with dropouts missing at random. Inverse probability weighted generalized estimating equations (GEEs) and multiple imputation (MI) are compared. In Chapter 3, the weighted GEE is compared to GEE after MI (MI-GEE) in the analysis of correlated count outcome data in a simulation study. Chapter 4 deals with MI in the handling of ordinal longitudinal data with dropouts on the outcome. 
MI strategies, namely multivariate normal imputation (MNI) and fully conditional specification (FCS), are compared both in a simulation study and a real data application. In Chapter 5, still focussing on ordinal outcomes, the thesis presents a simulation and real data application to compare complete case analysis with advanced methods: direct likelihood analysis, MNI, FCS and an ordinal imputation method. Finally, in Chapter 6, cumulative logit ordinal transition models are utilized to investigate the influence of dependency of current incomplete responses on past responses. Transitions from one response state to another over time are of interest.
Item Flexible statistical modelling in food insecurity risk assessment. (2015) Lokosang, Laila Barnaba.; Ramroop, Shaun.; Zewotir, Temesgen Tenaw. Food insecurity has remained a persistent problem in Sub-Saharan Africa. Conflict and other protracted crises have exposed a significant proportion of Africa’s populations to the risk of food insecurity, as their resilience to livelihood shocks weakens. A significant body of research in the past two decades has largely centred on describing the incidence of food insecurity and vulnerability. Limited research was done using statistical methods to determine the likelihood of food insecurity risk. The use of flexible statistical techniques for sound and purposive monitoring, evaluation, planning and decision making in food security and resilience was limited. The study aimed to extend the use of statistics into the expanding field of food security and resilience, and also to provide new direction for future research involving applications of the methods explored, such as adjustments in statistical methods, sampling and data collection. The study specifically aims at providing food security analysts with tested and statistically robust tools for use in analysing the likelihood of food insecurity risk in settings with structural food insecurity issues. 
Moreover, it aimed to inform practice, policy and analysis in monitoring and evaluation of food insecurity risk in protracted crisis, thus helping to improve risk aversion measures. Utilising secondary data, the research examines relevant statistical techniques for determining predictors of food insecurity risk, namely Principal Component Analysis, Multiple Correspondence Analysis, Classification and Regression Tree Analysis, Survey Logistic Regression, Generalized Linear Mixed Models for Ordered Categorical Data, and Joint Modelling. The study was conducted in the form of a structured analysis of different datasets collected in conflict-ridden South Sudan. Assets owned by households, as well as availability of livelihood endowments, were used as a proxy for determining the level of resilience in a particular demographic unit or geographical setting. The study highlighted the strengths and weaknesses of the techniques explored in identifying or classifying potential predictors of food insecurity outcomes. Each technique is capable of generating a unique composite index for measuring the amount of resilience and for predicting and classifying households according to food insecurity phase based on factor loadings. In general, the study determined that each method explored has particular strengths as well as limitations. However, a noteworthy implication is that asset-based statistical analysis, whether based on a composite index that can be used as a proxy for measuring the amount of resilience to food insecurity eventualities or on regression modelling approaches, does assure sufficient rigour in drawing conclusions about the wellbeing of households or populations under study and how they might withstand food insecurity and livelihood shocks. 
As food insecurity and malnutrition continue to attract substantial attention, such flexible analytical approaches exert potential usefulness in determining food insecurity risks, especially in protracted crisis settings.