Doctoral Degrees (Statistics)
Permanent URI for this collection: https://hdl.handle.net/10413/7126
Browsing Doctoral Degrees (Statistics) by Title
Item Adjusting the effect of integrating antiretroviral therapy and tuberculosis treatment on mortality for non-compliance : an instrumental variables analysis using a time-varying exposure. (2018) Yende-Zuma, Fortunate Nonhlanhla.; Mwambi, Henry Godwell.; Vansteelandt, Stijn.
In South Africa and elsewhere, research has shown that the integration of antiretroviral therapy (ART) and tuberculosis (TB) treatment saves lives. The randomised controlled trials (RCTs) which provided this compelling evidence used an intent-to-treat (ITT) strategy as part of their primary analysis. Although ITT is protected against selection bias caused by both measured and unmeasured confounders, it can draw results towards the null and underestimate the effectiveness of treatment if there is too much non-compliance. To adjust for non-compliance, "as-treated" and "per-protocol" comparisons are commonly made. These contrast study participants according to their received treatment, regardless of the treatment arm to which they were assigned, or limit the analysis to participants who followed the protocol. Such analyses are generally biased because the subgroups which they compare often lack comparability. In view of the shortcomings of the "as-treated" and "per-protocol" analyses, our objective was to account for non-compliance by using instrumental variables (IV) analysis to estimate the effect of ART initiation during TB treatment on mortality. Furthermore, to capture the full complexity of compliance behaviour outside the TB treatment duration, we developed a novel IV-methodology for a time-varying measure of compliance to ART. This is an important contribution to the IV literature since IV-methodology for the effect of a time-varying exposure on a time-to-event endpoint is currently lacking.
In RCTs, IV analysis enables us to make use of the comparability offered by randomisation and thereby adjust for both unmeasured and measured confounders; it has the further advantage of yielding results that are less sensitive to random measurement error in the exposure. In order to carry out IV analysis, one needs to identify a variable called an instrument, which needs to satisfy three important assumptions. To apply the IV methodology, we used data from the Starting Antiretroviral Therapy at Three Points in Tuberculosis (SAPiT) trial, which was conducted by the Centre for the AIDS Programme of Research in South Africa. This trial enrolled HIV and TB co-infected patients who were assigned to start ART either early or late during TB treatment or after TB treatment completion. The results from the IV analysis demonstrate that the survival benefit of fully integrating TB treatment and ART is even higher than what has been reported in the ITT analysis, since non-compliance has been accounted for.
Item Analysis of discrete time competing risks data with missing failure causes and cured subjects. (2023) Ndlovu, Bonginkosi Duncan.; Zewotir, Temesgen Tenaw.; Melesse, Sileshi Fanta.
This thesis is motivated by the limitations of the existing discrete time competing risks models vis-a-vis the treatment of data that come with missing failure causes or a sizable proportion of cured subjects. The discrete time models that have been suggested to date (Davis and Lawrance, 1989; Tutz and Schmid, 2016; Ambrogi et al., 2009; Lee et al., 2018) are cause-specific-hazard denominated. Clearly, this fact summarily disqualifies these models from consideration if data come with missing failure causes. It is also a well documented fact that naive application of the cause-specific hazards to data that have a sizable proportion of cured subjects may produce downward biased estimates for these quantities.
The existing models can be considered within the multiple imputation framework (Rubin, 1987) for handling missing failure causes, but the prospects of scaling them up for handling cured subjects are minimal, if not nil. In this thesis we address these issues concerning the treatment of missing failure causes and cured subjects in discrete time settings. Towards that end, we focus on the mixture model (Larson and Dinse, 1985) and the vertical model (Nicolaie et al., 2010) because these models possess certain properties which dovetail with the objectives of this thesis. The mixture model has been upgraded into a model that can handle cured subjects. Nicolaie et al. (2015) have demonstrated that the vertical model can also handle missing failure causes as is. Nicolaie et al. (2018) have also extended the vertical model to deal with cured subjects. Our strategy in this thesis is to exploit both the mixture model and the vertical model as a launching pad to advance discrete time models for handling data that come with missing failure causes or cured subjects.
Item Analysis of longitudinal binary data : an application to a disease process. (2008) Ramroop, Shaun.; Mwambi, Henry Godwell.
The analysis of longitudinal binary data can be undertaken using any of three families of models, namely marginal, random effects and conditional models. Each family of models has its own merits and demerits. The models are applied in the analysis of binary longitudinal childhood disease data, namely the Respiratory Syncytial Virus (RSV) data collected from a study in Kilifi, coastal Kenya. The marginal model was fitted using generalized estimating equations (GEE). The random effects models were fitted using ‘Proc GLIMMIX’ and ‘NLMIXED’ in SAS and then again in Genstat.
Because the data are of a state transition type with the Markovian property, the conditional model was used to capture the dependence of the current response on the previous response, which is known as the history. The data set has two main complicating issues. Firstly, there is the question of developing a stochastically based probability model for the disease process. In the current work we use direct likelihood and generalized linear modelling (GLM) approaches to estimate important disease parameters. The force of infection and the recovery rate are the key parameters of interest. The findings of the current work are consistent with those in White et al. (2003). The time dependence of the RSV disease is also highlighted in the thesis by fitting monthly piecewise models for both parameters. Secondly, there is the issue of incomplete data in the analysis of longitudinal data. Commonly used methods to analyze incomplete longitudinal data include the well known available case analysis (AC) and last observation carried forward (LOCF). However, these methods rely on strong assumptions such as missing completely at random (MCAR) for AC analysis and an unchanging profile after dropout for LOCF analysis. Such assumptions are too strong to hold in general. In recent years, methods of analyzing incomplete longitudinal data have become available with weaker assumptions, such as missing at random (MAR). Thus we make use of multiple imputation via chained equations, which requires the MAR assumption, and maximum likelihood methods, under which the missing data mechanism becomes ignorable as soon as it is MAR. We are therefore faced with the problem of incomplete repeated non-normal data, suggesting the use of at least the Generalized Linear Mixed Model (GLMM) to account for natural individual heterogeneity.
The comparison of the parameter estimates obtained using the different methods of handling dropout is strongly emphasized in order to evaluate the advantages of the different methods and approaches. The survival analysis approach was also utilized to model the data due to the presence of multiple events per subject and the time between these events.
Item Application of mixed model and spatial analysis methods in multi-environmental and agricultural field trials. (2015) Negash, Asnake Worku.; Mwambi, Henry Godwell.; Zewotir, Temesgen Tenaw.
Agricultural experimentation involves selection of experimental materials, selection of experimental units, planning of experiments, collection of relevant information, and analysis and interpretation of the results. The overall work of this thesis concerns the importance, improvement and efficiency of variety contrasts obtained by using linear mixed models with spatial variance-covariance structures compared to the usual ANOVA methods of analysis. It also raises some considerations on the recent widespread use of biplot analysis of genotype plus genotype by environment interaction (GGE) in the analysis of multi-environmental crop trials. It further applies a parametric bootstrap method for testing and selecting multiplicative terms in GGE and AMMI models, and shows some statistical methods for handling missing data using multiple imputation, principal components and other deterministic approaches. Multi-environment agricultural experiments are unbalanced because several genotypes are not tested in some environments or because a measurement is missing from some plot during the experimental stage. Imputation of the missing values is therefore sometimes necessary. Multiple imputation of missing data using the cross-validation by eigenvector method and PCA methods is applied.
The advantages of these methods are that they are computationally easy to implement, need no distributional or structural assumptions, and place no restrictions on the pattern or mechanism of missing data in experiments. Genotype by environment (G×E) interaction is associated with the differential performance of genotypes tested at different locations and in different years, and influences selection and recommendation of cultivars. Wheat genotypes were evaluated in six environments to determine the G×E interactions and stability of the genotypes. Additive main effects and multiplicative interaction (AMMI) analysis was conducted for grain yield in both years, and it showed that grain yield variation due to environments, genotypes and G×E was highly significant. Stability for grain yield was determined using genotype plus genotype by environment interaction (GGE) biplot analysis. The first two principal components (PC1 and PC2) were used to create a 2-dimensional GGE biplot. The which-won-where pattern was based on six locations in the first year and five locations in the second year for all twenty genotypes. The resulting pattern is one realization among many possible outcomes; it was different in the second year, and its repeatability in a future year is quite unknown. Repeatability of the which-won-where pattern over years is the necessary and sufficient condition for mega-environment delineation and genotype recommendation. The thesis also examines the advantages of mixed models with spatial variance-covariance structures, and the direct implications of model choice for inference on varietal performance, ranking and testing, based on two multi-environmental data sets from realistic national trials. A model comparison with a χ²-test for the trials in the two data sets (wheat and barley data) suggested that selected spatial variance-covariance structures fitted the data significantly better than the ANOVA model.
The forms of the optimally fitted spatial variance-covariance structure, the ranking and the consistency ratio test were not the same from one trial (location) to another. Linear mixed models with single stage analysis, including a spatial variance-covariance structure with a group factor of location in the random model, also improved the estimation of the real genotype effects and their ranking. The model also improved varietal performance estimation because of its capacity to handle additional sources of variation (location and genotype by location (environment) interaction variation) and to accommodate local stationary trend. Knowledge and understanding of statistical methods for the analysis of multi-environmental data is particularly important for plant breeders and those who are working on the improvement of plant varieties, for proper selection and decision making at the next level of improvement for national agricultural development.
Item Appraising South African residential property and measuring price developments. (2022) Bax, Dane Gregory.; Zewotir, Temesgen Tenaw.; North, Delia Elizabeth.
Housing wealth is well established as one of the most important sources of wealth for households and investors. However, owning a home is a fundamental human need, making monitoring residential property prices a social endeavour as well as an economic one, especially in times of economic uncertainty. Residential property prices also have a direct effect on the macroeconomy through wealth effects, whereby households increase consumption as gains in equity strengthen their balance sheets. Collecting correct and adequate data is vitally important in analysing property market movements and developments, particularly given globalization and the interlinked nature of financial markets.
Although measuring residential property price developments is an important economic and social activity, matching properties over time is extremely difficult because the sale of homes is typically infrequent, characteristics vary, and homes are uniquely located in space. This thesis focuses on appraising several residential property types located throughout South Africa from January 2013 to August 2017, investigating different modelling approaches with the aim of developing a residential property price index. Various methods exist to create residential property price indices; however, hedonic models have proven useful as a quality adjusted approach where pure price changes are measured and not simply changes in the composition of samples over time. Before fitting any models to appraise homes, an autoencoder was built to detect anomalous data arising from human error at the data entry stage. The autoencoder identified improbable data, resulting in a final data set of 415 200 records once duplicate records were identified and removed. This study first investigated generalised linear models as a candidate approach to appraise homes in South Africa, which showed possible alternatives to the ubiquitous log linear model. Relaxing functional form assumptions and considering the nested locational structure of homes, hierarchical generalised linear models were considered as the next candidate method. Partitioning around medoids was applied to find additional spatial groupings, which were treated as random effects along with the suburb. The findings showed that the marginal utility of structural attributes was non-linear and that smooth functions of covariates were an appropriate treatment. Furthermore, the use of random effects helped account for the spatial heterogeneity of homes through partial pooling.
Finally, machine learning algorithms were investigated because of minimal assumptions about the data generating process and the possibility of complex non-linear and interaction effects. Random forests, gradient boosted machines and neural networks were adopted to fit these appraisal functions. The gradient boosted machines had the best goodness of fit, showing non-linear relationships between the structural characteristics of homes and listing prices. Partial dependence plots were able to quantify the marginal utility over the distributions of different structural characteristics. The results show that larger sized homes do not necessarily yield a premium and a diminished return is evident, similar to the results of the hierarchical generalised additive models. The variable importance plots showed that location was the most important predictor followed by the number of bathrooms and the size of a home. The gradient boosted machines achieved the lowest out of sample error and were used to develop the residential property price index. A chained, dual imputation Fisher index was applied to the gradient boosted machines showing nominal and real price developments at a country and provincial level. The chained, dual imputation Fisher index provided less noisy estimates than a simple median mix adjusted index. Although listing prices were used and not transacted prices, the trend was similar to the ABSA Global Property Guide. In order to make this research useful to property market participants, a web application was developed to show how the proposed methodology can be democratised by property portals and real estate agencies. 
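As a rough illustration of the index arithmetic mentioned above: a Fisher index is the geometric mean of the Laspeyres and Paasche indices, and a chained index multiplies successive period-on-period links. The sketch below uses the textbook price and quantity form with made-up numbers; it does not reproduce the dual imputation step or the gradient boosted predictions described in the abstract, and all function names are hypothetical.

```python
from math import sqrt

def laspeyres(p0, p1, q0):
    """Price index weighted by base-period quantities."""
    return sum(a * b for a, b in zip(p1, q0)) / sum(a * b for a, b in zip(p0, q0))

def paasche(p0, p1, q1):
    """Price index weighted by current-period quantities."""
    return sum(a * b for a, b in zip(p1, q1)) / sum(a * b for a, b in zip(p0, q1))

def fisher(p0, p1, q0, q1):
    """Geometric mean of the Laspeyres and Paasche indices."""
    return sqrt(laspeyres(p0, p1, q0) * paasche(p0, p1, q1))

def chained_fisher(prices, quantities):
    """Chain period-on-period Fisher links, with the first period set to 1.0."""
    index = [1.0]
    for t in range(1, len(prices)):
        link = fisher(prices[t - 1], prices[t], quantities[t - 1], quantities[t])
        index.append(index[-1] * link)
    return index

# Made-up data: two property types observed over three periods.
prices = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
quantities = [[3.0, 1.0], [1.0, 3.0], [2.0, 2.0]]
index = chained_fisher(prices, quantities)
```

Because every price in this toy example moves in the same proportion, the Laspeyres and Paasche links coincide and the chained index simply tracks the common price change.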
The Listing Price Index Calculator was created to easily communicate the results through a front-end interface, showing how property portals and real estate agencies can leverage their data to aid sellers in determining listing prices to go to market with, help buyers obtain an average estimate of the home they wish to purchase and guide property market participants on price developments.
Item An assessment of modified systematic sampling designs in the presence of linear trend. (2017) Naidoo, Llewellyn Reeve.; North, Delia Elizabeth.; Zewotir, Temesgen Tenaw.; Arnab, Raghunath.
Sampling is used to estimate population parameters, as it is usually impossible to study a whole population due to time and budget restrictions. There are various sampling designs to address this issue, and this thesis is concerned with a particular probability sampling design known as systematic sampling. Systematic sampling is operationally convenient and efficient and hence is used extensively in most practical situations. The shortcomings associated with systematic sampling include: (i) it is impossible to obtain an unbiased estimate of the sampling variance when conducting systematic sampling with a single random start; (ii) if the population size is not a multiple of the sample size, then conducting conventional systematic sampling, also known as linear systematic sampling, may result in variable sample sizes. In this thesis, I would like to provide some contribution to the current body of knowledge by proposing modifications to the systematic sampling design, so as to address these shortcomings. Firstly, a discussion on the measures used to compare the various probability sampling designs is provided, before reviewing the general theory of systematic sampling. The performance of systematic sampling is dependent on the population structure. Hence, this thesis concentrates on a specific and common population structure, namely, linear trend.
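The variable sample size problem noted above can be seen in a small sketch of conventional (linear) systematic sampling. It assumes the usual convention of a sampling interval k = floor(N / n) and a random start between 1 and k; the function name and the population of 10 units are illustrative only and are not taken from the thesis.

```python
def linear_systematic_sample(population_size, sample_size, start):
    """Return the 1-based unit labels selected by linear systematic sampling.

    The sampling interval is k = population_size // sample_size, and
    `start` is the random start, which must lie in 1..k.
    """
    k = population_size // sample_size
    if not 1 <= start <= k:
        raise ValueError("start must lie in 1..k")
    # Take every k-th unit from the start until the population is exhausted.
    return list(range(start, population_size + 1, k))

# N = 10 is not a multiple of n = 3, so k = 3 and the realised
# sample size depends on which random start is drawn.
samples = {r: linear_systematic_sample(10, 3, r) for r in (1, 2, 3)}
sizes = {r: len(units) for r, units in samples.items()}
# sizes == {1: 4, 2: 3, 3: 3}
```

With a start of 1 the sample contains units 1, 4, 7 and 10, one more than the intended three, which is the kind of irregularity the modified designs in the thesis aim to remove.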
A discussion on the performance of linear systematic sampling and all related modifications, including a new proposed modification, is then presented under the assumption of linear trend among the population units. For each of the above-mentioned problems, a brief review of all the associated sampling designs from existing literature, along with my proposed modified design, will then be explored. Thereafter, I will introduce a modified sampling design that addresses the above-mentioned problems in tandem, before providing a comprehensive report on the thesis. The aim of this thesis is to provide solutions to the above-mentioned disadvantages, by proposing modified systematic sampling designs and/or estimators that are favourable over their existing literature counterparts.
Keywords: systematic sampling; super-population model; Horvitz-Thompson estimator; Yates' end corrections method; balanced modified systematic sampling; multiple-start balanced modified systematic sampling; remainder modified systematic sampling; balanced centered random sampling.
Item Bayesian generalized linear mixed modeling of breast cancer data in Nigeria. (2017) Ogunsakin, Ropo Ebenezer.; Logue, Siaka.
Breast cancer is the world’s most prevalent type of cancer among women. Statistics indicate that breast cancer alone accounted for 37% of all the cases of cancer diagnosed in Nigeria in 2012. Data used in this study were extracted from patient records, commonly called hospital-based records, and were used to identify key socio-demographic and biological risk factors of breast cancer. Researchers sometimes ignore the hierarchical structure of the data and the disease when analyzing data. Doing so may lead to biased parameter estimates and larger standard errors. That is why the analyses undertaken in this study included the multilevel structure of cancer diagnosis, types, and medication through a Generalized Linear Mixed Model (GLMM), which considers both fixed and random effects (level 1 and 2).
In addition to the classical statistics approach, this study incorporates the Bayesian GLMM approach as well as some bootstrapping techniques. All the analyses are done using R or SAS for the classical statistics approaches, and WinBUGS for the Bayesian approach. The Bayesian analyses were strengthened by advanced analyses of convergence and autocorrelation checks, and other Markov Chain assumptions, using the CODA and BOA packages. The findings reveal that Bayesian techniques provide more comprehensive results, given that Bayesian analysis is a statistically stronger technique. The Bayesian methods appeared more robust than the classical and bootstrapping techniques in analyzing breast cancer data in Western Nigeria. The results identified age at diagnosis, educational status, tumor grade, and breast cancer type as prognostic factors of breast cancer.
Item Bayesian modelling of non–gaussian time series of serve acute respiratory illness. (2019) Musyoka, Raymond Nyoka.; Mwambi, Henry.; Achia, Thomas Noel Ochieng.; Gichangi, Anthony Simon Runo.
Respiratory syncytial virus (RSV), human metapneumovirus (HMPV) and influenza are some of the major causes of acute lower respiratory tract infections (ALRTI) in children. Children younger than 1 year are the most susceptible to these infections. RSV and influenza infections occur seasonally in temperate climate regions. In chapter 2, we developed statistical models that were assessed and compared to predict the relationship between weather and RSV incidence. Human metapneumovirus (HMPV) causes symptoms similar to those caused by respiratory syncytial virus (RSV). Currently, only a few models satisfactorily capture the dynamics of time series data of these two viruses. In chapter 3, we used a negative binomial model to investigate the relationship between RSV and HMPV while adjusting for climatic factors.
In chapter 4, we considered multiple viruses, incorporating the time varying effects of these components. The occurrence of different diseases in time contributes to multivariate time series data. In this chapter, we describe an approach to analyze multivariate time series of disease counts and model the contemporaneous relationship between pathogens, namely RSV, HMPV and Flu. The use of the models described in this study could help public health officials predict increases in each pathogen infection incidence among children and help them prepare and respond more swiftly to increasing incidence in low-resource regions or communities. We conclude that preventing and controlling RSV infection subsequently reduces the incidence of HMPV. Respiratory syncytial virus (RSV) is one of the major causes of acute lower respiratory tract infections (ALRTI) in children. Children younger than 1 year are the most susceptible to RSV infection. RSV infections occur seasonally in temperate climate regions. Based on RSV surveillance and climatic data, we developed statistical models that were assessed and compared to predict the relationship between weather and RSV incidence among refugee children younger than 5 years in Dadaab refugee camp in Kenya. Most time-series analyses rely on the assumption of Gaussian-distributed data. However, surveillance data often do not have a Gaussian distribution. We used a generalised linear model (GLM) with a sinusoidal component over time to account for seasonal variation and extended it to a generalised additive model (GAM) with smoothing cubic splines. Climatic factors were included as covariates in the models before and after timescale decompositions, and the results were compared. Models with decomposed covariates fit RSV incidence data better than those without. The Poisson GAM with decomposed covariates of climatic factors fit the data well and had a higher explanatory and predictive power than the GLM.
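A GLM with a sinusoidal component is commonly written with a log link as log(mu_t) = b0 + b1 sin(2πt/T) + b2 cos(2πt/T). The sketch below only evaluates such a mean function; it does not fit a model, the 12-month period is an assumption suggested by the monthly aggregation of the surveillance data, and the coefficients are hypothetical values chosen for illustration rather than estimates from the thesis.

```python
import math

def seasonal_poisson_mean(t, beta0, beta_sin, beta_cos, period=12):
    """Expected count at time t under a log-linear model with one harmonic."""
    angle = 2 * math.pi * t / period
    return math.exp(beta0 + beta_sin * math.sin(angle) + beta_cos * math.cos(angle))

# Hypothetical coefficients; two years of monthly expected counts.
means = [
    seasonal_poisson_mean(t, beta0=2.0, beta_sin=0.5, beta_cos=-0.3)
    for t in range(24)
]
```

The expected counts repeat every 12 months, which is the seasonal pattern that the sine and cosine terms are meant to capture; a GAM replaces or supplements these fixed harmonics with smooth spline terms.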
The best model predicted the relationship between atmospheric conditions and RSV infection incidence among children younger than 5 years. Human metapneumovirus (HMPV) causes symptoms similar to those caused by respiratory syncytial virus (RSV). The modes of transmission and dynamics of these epidemics remain poorly understood. Climatic factors have long been suspected of influencing the number of cases in these epidemics. Currently, only a few models satisfactorily capture the dynamics of time series data of these two viruses. In this study, we used a negative binomial model to investigate the relationship between RSV and HMPV while adjusting for climatic factors. We specifically aimed at establishing the heterogeneity in the autoregressive effect to account for the influence between these viruses. Our findings showed that RSV contributed to the severity of HMPV. This was achieved through comparison of 12 models of various structures, including those with and without interaction between climatic cofactors. Most models neither consider multiple viruses nor incorporate the time varying effects of these components. Common ARI etiologies identified in developing countries include respiratory syncytial virus (RSV), human metapneumovirus (HMPV), influenza viruses (Flu), parainfluenza viruses (PIV) and rhinoviruses, with mixed co-infections in the respiratory tract which make the etiology of Acute Respiratory Illness (ARI) complex. The occurrence of different diseases in time contributes to multivariate time series data. In this work, the surveillance data are aggregated by month and are not available at an individual level. This may lead to over-dispersion; hence the use of the negative binomial distribution. In this paper, we describe an approach to analyze multivariate time series of disease counts. A model previously used in the literature to address dependence between two different disease pathogens is extended.
We model the contemporaneous relationship between pathogens, namely RSV, HMPV and Flu, from surveillance data in a refugee camp (Dadaab) for children under 5 years to investigate serial correlation. The models evaluate the presence of heterogeneity in the autoregressive effect for the different pathogens and whether, after adjusting for seasonality, an epidemic component could be isolated within or between the pathogens. The model helps in distinguishing between an endemic and an epidemic component of the time series, which allows the separation of the regular pattern from irregularities and outbreaks. The use of the models described in this study can help public health officials predict increases in each pathogen infection incidence among children and help them prepare and respond more swiftly to increasing incidence in low-resource regions or communities. This knowledge helps public health officials to prepare for, and respond more effectively to, increasing RSV incidence in low-resource regions or communities. The study has improved our understanding of the dynamics of RSV and HMPV in relation to climatic cofactors, thereby setting a platform to devise better intervention measures to combat the epidemics. We conclude that preventing and controlling RSV infection subsequently reduces the incidence of HMPV.
Item Bayesian spatial joint and spatial-temporal disease modeling with application to HIV, HSV-2 and Malaria using case studies from Kenya and Angola respectively. (2017) Okango, Elphas Luchemo.; Mwambi, Henry Godwell.; Owino, Ngesa Oscar.
In this thesis we develop and extend existing statistical models for spatial disease modeling and apply them to HIV, HSV-2 and malaria data. The availability of geo-referenced data and free software has seen many disease mapping models developed and applied in epidemiology, public health, agriculture and ecology, among other areas. In chapter 1 we provide a background to and developments in the field of disease mapping.
We present in brief some limiting assumptions and how recent developments have tried to relax them. Chapter 2 introduces a semi-parametric joint model for HIV and HSV-2. The semi-parametric joint model performed better than the single models in terms of DIC. The limiting linearity assumption was relaxed by using penalized regression splines for the continuous covariate age. The main focus of chapter 3 was to develop a model that relaxes the stationarity assumption. This was achieved by allowing the effects of the covariates to vary spatially by using the conditional autoregressive model. This new model performed better than the stationary models. In chapter 4 we introduce a spatial temporal spatially varying covariate model. In this model, the covariates were allowed to vary both spatially and temporally. We fit this model to the Angolan malaria data. The fifth chapter presents a review of various assumptions in spatial disease modeling and improvements for some limiting assumptions, such as the normality assumption on random effects and the linearity assumption on the covariates. We use the non-parametric spatial model approach to relax the limiting normality assumption. The last part of chapter 5 involves developing a joint spatially varying model (an extension of the spatially varying coefficient model in chapter 3) and fitting it to the HIV and HSV-2 data. Chapter six of the study provides an overview of the thesis and the conclusion, and presents areas for further study.
Item Bayesian spatial modeling of malnutrition and mortality among under-five children in sub-Saharan Africa. (2019) Adeyemi, Rasheed Alani.; Zewotir, Temesgen Tenaw.; Ramroop, Shaun.
The aim of this thesis is to develop and extend Bayesian statistical models in the area of spatial modeling and apply them to child health outcomes, with particular focus on childhood malnutrition and mortality among under-five children.
The easy availability of geo-referenced databases has stimulated a paradigm shift in methodological approaches to spatial analysis. This study reviewed the spatial methods and disease mapping models developed for areal (lattice) data analysis. Observational data collected from complex design surveys and geographical locations often violate the independence assumption of classical regression models. By relaxing the restrictive linearity and normality assumptions of classical regression models, this study first developed a flexible semi-parametric spatial model that accommodates the usual fixed effects, nonlinear effects and a geographical component in a unified model. The approach was explored in the analysis of spatial patterns of child birth outcomes in Nigeria. The study also addressed the issue of disease clustering, which is of interest to epidemiologists and public health officials. The study then proposed a Bayesian hierarchical analysis approach for Poisson count data and formulated a Poisson version of generalized linear mixed models (GLMMs) for analyzing childhood mortality. The model simultaneously addressed the problems of overdispersion and spatial dependence by the inclusion of the risk factors and random effects in a single model. The proposed approach identified regions with elevated relative risk or clustering of high mortality and evaluated the small scale geographical disparities in sub-populations across the regions. The study identified further challenges in spatial data analysis, namely spatial autocorrelation and model misspecification. The study then fitted geoadditive mixed (GAM) models to analyze childhood anaemia data belonging to a family of exponential distributions (Gaussian, binary and multinomial). The GAM models are an extension of generalized linear mixed models that allows the inclusion of splines for continuous covariate (or time) trends alongside the parametric function.
Lastly, the shared component model originally developed for multiple disease mapping was reviewed and modified to suit the binary data at hand. A multivariate conditional autoregressive (MCAR) model was developed and applied to jointly analyze three child malnutrition indicators. The approach facilitated the estimation of the conditional correlation between the diseases and the assessment of the spatial association within the regions and the geographical variation of individual disease prevalence. The spatial analysis presented in this thesis is useful to inform health-care policy and resource allocation. This thesis contributes to methodological applications in life sciences, environmental sciences, public health and agriculture. The present study expands the existing methods and tools for health impact assessment in public health studies. KEYWORDS: Conditional Autoregressive (CAR) model, Disease Mapping Models, Multiple Disease mapping, Health Geography, Ecology Models, Spatial Epidemiology, Childhood Health outcomes.
Item Bayesian spatial models with application to HIV, TB and STI modeling in Kenya.(2014) Owino, Ngesa Oscar.; Mwambi, Henry Godwell.; Achia, Thomas Noel Ochieng.This dissertation is concerned with developing and extending statistical models in the area of spatial modeling, with particular interest in application to HIV, TB and HSV-2 data. Hierarchical spatial modeling is a common and useful approach for modeling complex spatially correlated data in many settings in epidemiological, public health and ecological studies. Chapter 1 of this thesis gives a chronological development of disease mapping models, from non-spatial to spatial and from single disease models to multiple disease models. In Chapter 2, a new model that relaxes the over-restrictive normal distribution assumption on the spatially unstructured random effect by using the generalised Gaussian distribution is introduced and investigated.
The third chapter provides a framework for including sampling weights in the Bayesian hierarchical disease mapping model. In this model, the design effect is used to re-scale the sample sizes. A new model for overdispersed spatially correlated binary data is developed in chapter 4 of this thesis; in this model, the overdispersion parameter is modeled by a beta random effect which is also allowed to vary spatially. In chapter 5, the common multiple spatial disease mapping models are reviewed and adapted for the binary data at hand, since the original models were developed based on Poisson count data. The methodologies developed in this dissertation widen the toolbox for spatial analysis and disease mapping in applications in epidemiology and public health studies.
Item Bayesian spatio-temporal and joint modelling of malaria and anaemia among Nigerian children aged under five years, including estimation of the effects of risk factors.(2023) Ibeji, Jecinta Ugochukwu.; Mwambi, Henry Godwell.; Iddrisu, Abdul-Karim.Childhood mortality and morbidity in Nigeria have been linked to malaria and anaemia. This thesis focused on exploring the risk factors and the complexity of the relationship between malaria and anaemia in Nigerian children under five. Data from the 2010 and 2015 Nigeria Malaria Indicator Survey conducted by the Demographic Health Survey were used. In 2010, the prevalence of malaria and anaemia was 48% and 72%, respectively, while in 2015 the respective prevalences were 27% and 68%. Machine learning-based exploratory classification methods were used to explain the relationship and patterns between the independent variables and the two dependent variables, namely malaria and anaemia. Decisions made by the public health body are centered on the administrative units (i.e., states) within the country. Therefore, the development of disease mapping was explained, along with a brief overview of limiting assumptions and ways of tackling them.
Consequently, malaria and anaemia spatial variation for 2010 and 2015 was analyzed with the inclusion of their respective risk factors. A separate multivariate hierarchical Bayesian logistic model for each disease was adopted to investigate the spatial pattern of malaria and anaemia and adjust for the risk factors associated with each disease. Furthermore, a multilevel model analysis was applied to independently investigate the spatio-temporal distribution of malaria and anaemia. A joint model was further adopted to check for the relationship between malaria and anaemia and their common risk factors and to relax the linearity assumption. In the 2010 data, type of place of residence, mother’s highest educational level, source of drinking water, type of toilet facility, child’s sex, main floor material, and households that have electricity, radio, television, and water were significantly associated with malaria and anaemia. In the 2015 data, the type of place of residence, source of drinking water, type of toilet facility, households with radio, main roof material, wealth index, child’s sex, and mother’s highest educational level had a significant relationship with malaria and anaemia. The results from this study can guide policymakers to tailor-make effective interventions to reduce or prevent malaria and anaemia. This will help adequately distribute limited state health system resources, such as personnel, funds and facilities, within the country.
Item Co-morbidity of childhood anaemia and malaria with a district-level spatial effect.(2021) Roberts, Danielle Jade.; Zewotir, Temesgen Tenaw.Anaemia and malaria are the leading causes of sub-Saharan African childhood morbidity and mortality. This thesis aimed to explore the risk factors as well as the complex relationship between anaemia and malaria in young children across the districts or counties of four contiguous sub-Saharan African countries, namely Kenya, Malawi, Tanzania and Uganda.
Nationally representative data from the Demographic and Health Surveys conducted in all four countries was used. The observed prevalence of anaemia and malaria was 52.5% and 19.7%, respectively, with a 15.1% prevalence of co-infection. Machine learning based exploratory classification methods were used to gain insight into the relationships and patterns among the explanatory variables and the two responses. The administrative districts are the level at which public health decisions are made within each of the countries. Accordingly, the best linear unbiased predictor (BLUP) ranking and selection approach was adopted to investigate the district-level spatial effects, while controlling for child-level, household-level and environmental factors. Further to the geoadditive model, a generalised additive mixed model with a spatial effect based on the geographical coordinates of the sampled clusters within the districts was applied. The relationship between the two diseases was further explored using joint modelling approaches: a bivariate copula geoadditive model and shared component model. The child’s age, mother’s education level, household wealth index and cluster altitude were found to be significantly associated with both the anaemia and malaria status of the child. The results of this study can help policy makers target the correct set of interventions or prevent the use of incorrect interventions for anaemia and malaria control and prevention. 
This aids in the targeted allocation of limited district health system resources within each of these countries.
Item Combining dynamic factor models and artificial neural networks in time series forecasting with applications.(2014) Babikir, Ali Basher Abd Allah.; Mwambi, Henry Godwell.This study investigates the advantages and forecasting performance of combining the dynamic factor model (DFM) and artificial neural networks (ANNs), leading to novel models that are capable of producing more accurate forecasts, with application to South African financial sector data. The overall aim of the study is to provide forecasting models that accommodate all relevant variables and the presence of any nonlinearity in the data to produce more adequate forecasts and serve as an alternative to traditional and current forecasting models, particularly in the presence of a changing and interacting environment. The thesis consists of four independent papers corresponding to four chapters. The first chapter brings together two important developments in the forecasting literature: artificial neural networks (ANNs) and factor models. The chapter introduces the Factor Augmented Artificial Neural Network (FAANN) hybrid model in order to produce more accurate forecasts. The model is applied to forecasting three time series variables, namely, Deposit rate, Gold mining share prices and Long term interest rate. The out-of-sample root mean square error (RMSE) and Diebold-Mariano test results show that the FAANN model yields substantial improvements over the autoregressive (AR) benchmark model and the standard dynamic factor model (DFM). The superiority of the FAANN model is due to the ANN's flexibility to account for potentially complex nonlinear relationships that are not easily captured by linear models.
In the second chapter we introduce a new model that exploits the artificial neural network model as a data smoother to alleviate the effect of major financial crises and nonlinearity due to high fluctuations such as those associated with the 2008 crisis. The chapter introduces the ANN-DFM model, where in the first stage the best-fitted ANN for each single series of the data set, which contains 228 monthly series, is used to obtain the in-sample forecasts of each series. In the second stage, the factor model is used to extract the factors from the smoothed data set, and then these factors are used as explanatory variables in forecasting. The model is applied to forecast three South African variables, namely, Rate on 3-month trade financing, Lending rate and Short term interest rate in the period 1992:01 to 2011:12. The results, based on the root mean square errors of three, six and twelve months ahead out-of-sample forecasts over the period 2007:01 to 2011:12, indicate that, in all of the cases, the ANN-DFM and the DFM statistically outperform the autoregressive (AR) models. In the majority of the cases the ANN-DFM outperforms the DFM. The results indicate the usefulness of smoothing and factor extraction in forecasting performance. The forecast results are confirmed by the test of the equality of forecast accuracy proposed by Diebold-Mariano (1995). The third chapter evaluates the role of the DFM model (linear in nature) and the ANN model (with capacity to handle nonlinearity) as competing forecasting estimation methods. The chapter uses artificial neural networks (ANNs) as a nonlinear method based on the fact that the relationships between input and output variables in ANNs do not need to be specified in advance. In this chapter, the same extracted factors are used as input and independent variables for ANNs and the Dynamic Factor Model.
This was necessary in order to investigate the forecasting performance of the linear and the nonlinear methods under the same conditions. We refer to the new model as the Factor Artificial Neural Network (FANN). The empirical results of the Root Mean Square Error (RMSE) for the out-of-sample forecasts from 2007:01 to 2011:12 indicate that the proposed FANN model is an effective way to improve forecasting accuracy over the Dynamic Factor Model (DFM), the ANN and the AR benchmark model. The results confirm the usefulness of the factors that were extracted from a large set of related variables when we compared the FANN model and the standard univariate ANN model. Finally, combining forecasts is often considered a successful alternative to using just an individual forecasting method. Different forecasting methods are considered, especially when the forecasts are generated from the linear and the nonlinear methods. Thus, chapter four investigates the forecasting performance of combining independent forecasts of the Dynamic Factor Model and the Artificial Neural Network models using linear and nonlinear combining procedures for the same variables of interest. The analysis was based on three financial variables, namely the JSE return index, government bond return index and the Rand/Dollar exchange rate in South Africa. The out-of-sample results of three, six and twelve month horizons from 2006:01 to 2011:12 for the DFM and ANNs provided more adequate forecasts compared to benchmark autoregressive (AR) models, with a reduction in the RMSE of around 2 to 12 percent for all variables and over all forecasting horizons. The ANN as a nonlinear combining method outperforms all linear combining methods and is the best individual model for all variables and over all forecasting horizons. The results suggest that the ANN combining method can be used as an alternative to linear combining methods to achieve greater forecasting accuracy.
We attribute the superiority of the ANN combining method to its ability to capture any existing nonlinear relationship between the individual forecasts and the actual forecasting values.
Item Comparative approaches to handling missing data, with particular focus on multiple imputation for both cross-sectional and longitudinal models.(2012) Hassan, Ali Satty Ali.; Mwambi, Henry G.Much data-based research is characterized by the unavoidable problem of incompleteness as a result of missing or erroneous values. This thesis discusses some of the various strategies and basic issues in statistical data analysis to address the missing data problem, and deals with both the problem of missing covariates and missing outcomes. We restrict our attention to methodologies which address a specific missing data pattern, namely monotone missingness. The thesis is divided into two parts. The first part places particular emphasis on the so-called missing at random (MAR) assumption, but focuses the bulk of attention on multiple imputation techniques. The main aim of this part is to investigate various modelling techniques using application studies, and to specify the most appropriate techniques as well as gain insight into the appropriateness of these techniques for handling incomplete data analysis. This thesis first deals with the problem of missing covariate values to estimate regression parameters under a monotone missing covariate pattern. The study is devoted to a comparison of different imputation techniques, namely Markov chain Monte Carlo (MCMC), regression, propensity score (PS) and last observation carried forward (LOCF). The results from the application study revealed that we have universally best methods to deal with missing covariates when the missing data pattern is monotone. Of the methods explored, the MCMC and regression methods of imputation to estimate regression parameters with monotone missingness were preferable to the PS and LOCF methods.
This study is also concerned with comparative analysis of the techniques applied to incomplete Gaussian longitudinal outcome or response data due to random dropout. Three different methods are assessed and investigated, namely multiple imputation (MI), inverse probability weighting (IPW) and direct likelihood analysis. The findings in general favoured MI over IPW in the case of continuous outcomes, even when the MAR mechanism holds. The findings further suggest that the use of MI and direct likelihood techniques leads to accurate and equivalent results, as both techniques arrive at the same substantive conclusions. The study also compares and contrasts several statistical methods for analyzing incomplete non-Gaussian longitudinal outcomes when the underlying study is subject to ignorable dropout. The methods considered include weighted generalized estimating equations (WGEE), multiple imputation after generalized estimating equations (MI-GEE) and the generalized linear mixed model (GLMM). The current study found that the MI-GEE method was considerably robust, doing better than all the other methods for both small and large sample sizes, regardless of the dropout rates. The primary interest of the second part of the thesis falls under the non-ignorable dropout (MNAR) modelling frameworks that rely on sensitivity analysis in modelling incomplete Gaussian longitudinal data. The aim of this part is to deal with non-random dropout by explicitly modelling the assumptions that caused the dropout and incorporating this additional sub-model into the model for the measurement data, and to assess the sensitivity of the modelling assumptions. The study pays attention to the analysis of repeated Gaussian measures subject to potentially non-random dropout in order to study the influence on inference that might be caused in the data by the dropout process.
We consider the construction of a particular type of selection model, namely the Diggle-Kenward model, as a tool for assessing the sensitivity of a selection model in terms of the modelling assumptions. The major conclusions drawn were that there was evidence in favour of the MAR process rather than an MCAR process in the context of the assumed model. In addition, there was the need to obtain further insight into the data by comparing various sensitivity analysis frameworks. Lastly, two families of models were also compared and contrasted to investigate the potential influence on inference that dropout might exert on the dependent measurement data considered, and to deal with incomplete sequences. The models were based on selection and pattern mixture frameworks used for sensitivity analysis to jointly model the distribution of the dropout process and longitudinal measurement process. The results of the sensitivity analysis were in agreement and hence led to similar parameter estimates. Additional confidence in the findings was gained as both models led to similar results for significant effects such as marginal treatment effects.
Item Covariates and latents in growth modelling.(2014) Melesse, Sileshi Fanta.; Zewotir, Temesgen Tenaw.Growth curve models are the natural models for increment processes taking place gradually over time. When individuals are observed over time it is often apparent that they grow at different rates, even though they are clones and no differences in treatment or environment are present. Nevertheless, the classical growth curve model only deals with the average growth and does not account for individual differences, nor does it have room to accommodate covariates. Accordingly, we strive to construct and investigate tractable models which incorporate both individual effects and covariates.
The study was motivated by plantations of fast growing tree species, and the climatic and genetic factors that influence stem radial growth of juvenile Eucalyptus hybrids grown on the east coast of South Africa. Measurement of stem radius was conducted using dendrometers on eighteen sampled trees of two Eucalyptus hybrid clones (E. grandis × E. urophylla, GU and E. grandis × E. camaldulensis, GC). Information on climatic data (temperature, rainfall, solar radiation, relative humidity and wind speed) was simultaneously collected from the study site. We explored various functional statistical models which are able to handle the growth, individual traits, and covariates. These models include partial least squares approaches, principal component regression, path models, fractional polynomial models, nonlinear mixed models and additive mixed models. Each one of these models has strengths and weaknesses. Application of these models is carried out by analysing the stem radial growth data. The partial least squares and principal component regression methods were used to identify the most important predictor for stem radial growth. The path model approach was then applied, mainly to find some indirect effects of climatic factors. We further explored the tree-specific effects that are unique to a particular tree under study by fitting a fractional polynomial model in the context of the linear mixed effects model. The fitted fractional polynomial model showed that the relationship between stem radius and tree age is nonlinear. The performance of fractional polynomial models was compared with that of nonlinear mixed effects models. Using nonlinear mixed effects models, some growth parameters such as inflection points were estimated. Moreover, the fractional polynomial model fit was almost as good as the nonlinear growth curves. Consequently, the fractional polynomial model fit was extended to include the effects of all climatic variables.
Furthermore, the parametric methods do not allow the data to decide the most suitable form of the functions. In order to capture the main features of the longitudinal profiles in a more flexible way, a semiparametric approach was adopted. Specifically, the additive mixed models were used to model the effect of tree age as well as the effect of each climatic factor.
Item D-optimal designs for drug synergy.(2009) Kabera, Muregancuro Gaëtan.; Ndlovu, Principal.; Haines, Linda Margaret.This thesis is focused on the construction of optimal designs for detecting drug interaction using the two-variable binary logistic model. Two specific models are considered: (1) the binary two-variable logistic model without interaction, and (2) the binary two-variable logistic model with interaction. The two explanatory variables are assumed to be doses of two drugs that may or may not interact when jointly administered to subjects. The main objective of the thesis is to algebraically construct the optimal designs. However, numerical computations are used for constructing optimal designs in cumbersome cases. The problem of constructing optimal designs is to allocate weights to specific points of the design space in such a way that information associated with model parameters is maximized and the variances of the mean responses are minimized. Specifically, the D-optimality criterion discussed in this thesis minimizes the determinant of the asymptotic variance-covariance matrix of the estimates of the model parameters. The number of support points of the D-optimal designs for the two-variable binary logistic model without interaction varies from 3 to 6. Support points are equally weighted only in case of the 3-point designs and in some special cases of the 4-point designs. The number of support points of the D-optimal designs for the two-variable binary logistic model with interaction varies from 4 to 8.
Support points are equally weighted only in case of the 4-point designs and in some special cases of 8-point designs. Numerous examples are given to illustrate theoretical results.
Item Discrete time-to-event construction for multiple recurrent state transitions.(2023) Batidzirai, Jesca Mercy.; Manda, Samuel.; Mwambi, Henry Godwell.Recent developments in multi-state models have considered discrete time rather than continuous time in the modeling of transition intensities, whose major drawback lies in the possibility of resulting in biased parameter estimates that arise from issues of handling ties. Discrete-time models have included univariate multilevel models to account for possible dependence among specific pairwise recurrent transitions within the same subject. However, in most cases, there would be several specific pairwise transitions of interest. In such cases, there is a need to model the transitions with the aim of identifying those transitions that are correlated. This provides insight into how the transitions are related to each other. In order to investigate the interdependencies between transitions, the unique contribution of this thesis is to propose a multivariate discrete-time multi-state model with multiple state transitions. In this model, each specific recurrent transition is associated with a random effect to capture possible dependence in the transitions of the same type or different types. The random effects themselves were then modeled by a multivariate normal distribution, and model parameters were estimated using maximum likelihood methods with Gaussian quadrature numerical integration. A simulation study was done to evaluate the performance of the proposed model. The model yielded satisfactory results for most fixed effects and random effects estimates. This is evidenced by near-zero biases and mean square errors of the average estimates as well as high coverage probabilities of the 95% confidence intervals from 1000 replications.
The proposed methodology was applied to marriage formation and dissolution data from KwaZulu-Natal province, South Africa. Five transitions were considered, namely: Never Married to Married, Married to Separated, Married to Widowed, Separated to Married and Widowed to Married. Very small unobserved subject-to-subject heterogeneity for each transition and a weak positive correlation between transitions were found. Statistically, the model produced smaller standard errors compared to those from univariate models, hence its estimates are more precise. The multivariate modeling of discrete time-to-event models provides a better understanding of the evolution of all transitions simultaneously, thus, in addition to covariate effects, giving an assessment of how one transition is associated with the other. Empirical results confirmed well-known important socio-demographic predictors of entering and exiting a marriage. Age at sexual debut played a positive critical role in most of the transitions. More educated subjects were associated with a lower likelihood of entering a first marriage, experiencing a marital dissolution as well as remarrying after widowhood. Subjects who had a sexual debut at younger ages were more likely to experience a marital dissolution than those who started late. Age at first marriage had a negative association with marital dissolution. We may, therefore, postulate that existing programs that encourage delay in onset of sexual activity, for example for HIV risk reduction, may also have a positive impact on lowering rates of marital dissolution, thus ultimately improving psychological and physical health.
Item Financial modelling of cryptocurrency: a case study of Bitcoin, Ethereum, and Dogecoin in comparison with JSE stock returns.(2022) Kaseke, Forbes.; Ramroop, Shaun.; Mwambi, Henry Godwell.The emergence of cryptocurrency has caused a shift in the financial markets.
Although it was created as a currency for exchange, cryptocurrency has been shown to be an asset, with investors seeking to profit from it rather than using it as a medium of exchange. As a financial asset, cryptocurrency has distinct stylised facts, like any other asset. Studying these stylised facts allows the creation of better-suited models to assist investors in making better data-driven decisions. The data used in this thesis were for three leading cryptocurrencies (Bitcoin, Ethereum, and Dogecoin), with Johannesburg Stock Exchange (JSE) data as a guide for comparison. The sample period was from 18 September 2017 to 27 May 2021. The goal was to research the stylised facts of cryptocurrencies and then create models that capture these stylised facts. The study developed risk-quantifying models for cryptocurrencies. The main findings were that cryptocurrency exhibits stylised facts that are well-known in financial data. However, the magnitude and frequency of these stylised facts tend to differ. For example, cryptocurrency is more volatile than stock returns. The volatility also tends to be more persistent than in stocks. The study also finds that cryptocurrency has a reverse leverage effect as opposed to the normal one, where past negative returns increase volatility more than past positive returns. The study also developed a hybrid GARCH model using extreme value theory for quantifying cryptocurrency risk. The results showed that the GJR-GARCH with GPD innovations could be used as an alternative model to calculate the VaR. The volatile nature of cryptocurrency was also compared with that of the JSE, both with and without accounting for structural breaks. The results showed that the cryptocurrencies’ volatility patterns are similar but differ from those of the JSE. The cryptocurrency market was also found to be inefficient. This finding means that some investors can take advantage of this inefficiency.
The study also revealed that structural breaks affect volatility persistence. However, this persistence measure differs depending on the model used. Markov switching GARCH models were used to strengthen the structural break findings. The results showed that two-regime models outperform single-regime models. The VAR and DCC-GARCH models were also used to test the spillovers amongst the assets used. The results showed short-run spillovers from Bitcoin to Ethereum and long-run spillovers based on the DCC-GARCH. Lastly, factors affecting cryptocurrency adoption were discussed. The main factors affecting mass adoption are the complexity that comes with the use of cryptocurrency and its high volatility. This study was critical as it gives investors an understanding of the nature and behaviour of cryptocurrency so that they know when and how to invest. It also helps policymakers and financial institutions decide how to treat or use cryptocurrency within the economy.
Item Flexible Bayesian hierarchical spatial modeling in disease mapping.(2022) Ayalew, Kassahun Abere.; Manda, Samuel.The Gaussian Intrinsic Conditional Autoregressive (ICAR) spatial model, which usually has two components, namely an ICAR for spatial smoothing and standard random effects for non-spatial heterogeneity, is used to estimate spatial distributions of disease risks. The normality assumption in this model may not always be correct, and misspecification of the distribution of random effects could result in biased estimation of the spatial distribution of disease risk, which could lead to misleading conclusions and policy recommendations. Few studies have compared the estimation of the spatial distributions of diseases under the ICAR-normal model with that obtained from fitting an ICAR-nonnormal model. The results from these studies indicated that the ICAR-nonnormal models performed better than the ICAR-normal model in terms of accuracy, efficiency and predictive capacity.
However, these efforts have not fully addressed the effect on the estimation of spatial distributions under flexible specification of ICAR models in disease mapping. The overall aim of this PhD thesis was to develop approaches that relax the normality assumption that is often used in modeling and fitting of ICAR models in the estimation of spatial patterns of diseases. In particular, the thesis considered the skew-normal and skew-Laplace distributions under the univariate specification, and the skew-normal under the multivariate specification, to estimate the spatial distributions of either univariable or multivariable areal data. The thesis also considered non-parametric specification of the multivariate spatial effects in the ICAR model, which is a novel extension of an earlier work. The estimation of the models was done using Bayesian statistical approaches. The performances of our suggested alternatives to the ICAR-normal model were evaluated through simulation studies as well as with practical application to the estimation of the district-level distribution of HIV prevalence and treatment coverage using health survey data in South Africa. Results from the simulation studies and analysis of real data demonstrated that our approaches performed better in the prediction of spatial distributions for univariable and multivariable areal data in disease mapping. This PhD thesis has shown the limitations of relying on the ICAR-normal model for the estimation of spatial distributions in all spatial analyses, even when the data could be asymmetric and non-normal. In such scenarios, skewed-ICAR and nonparametric ICAR approaches could provide better and unbiased estimation of the spatial pattern of diseases.