Masters Degrees (Statistics)
Permanent URI for this collection: https://hdl.handle.net/10413/7127
Recent Submissions
Item Estimation of the value at risk using a long-memory GARCH application to JSE Indices. (2020) Khumalo, Moses Bhekinhlahla.; Chinhamu, Knowledge.; Chifurira, Retius.
Financial data are characterized by stylized facts, which makes it difficult to model financial assets if these stylized facts are not taken into account. If they are ignored, the implementation of accurate risk management tools such as value at risk (VaR), which is crucial in the management of market risk, becomes a futile exercise. This study aims to compare the performance of long-memory GARCH-type models with heavy-tailed innovations in estimating the value at risk of the All Share Index, the Mining Index, and the Banking Index. This was achieved by investigating the empirical properties of the JSE Indices and fitting the FIGARCH, HYGARCH, and FIAPARCH models with the Student’s t-distribution (STD), skewed Student’s t-distribution (SSTD), and generalized error distribution (GED). The study further estimates VaR for the short and long trading positions at the 95th, 99th, and 99.7th quantiles, and backtests the results. The main findings indicate that the JSE All Share Index returns are best captured by the FIGARCH-SSTD model, whereas the most robust model for the JSE Mining Index returns is the FIAPARCH-STD model. For the JSE Banking Index returns, the FIAPARCH-STD model is the most appropriate at most of the VaR levels. The findings of the study help both risk practitioners and asset managers to better understand the behaviour of the financial indices’ returns. Finally, this can assist role players in carefully managing risks and asset returns.

Item Evaluation of single and multiple missing data imputation techniques: a comparative application on BMI data. (2017) Ndwandwe, Lethani Mbongeni.; Lougue, Siaka.; Hazra, Annapurna.
Missing data are a common occurrence in various fields of data science and statistics.
Research into missing data is one of the most important topics in applied statistics, especially in academic, government and industry-run clinical trials. However, this data loss can result in an inadequate basis for study inferences. Dealing with missing data involves neglecting or imputing unobserved values. However, the methods used to deal with the missingness in a data set may bias the results and lead to conclusions that do not reflect a true picture of the reality under investigation. This thesis discusses the various missing data mechanisms and how missing values can be inferred. The main objective of this thesis is to evaluate the performance of several single and multiple imputation methods for a continuous dataset in order to find the best imputation techniques. Based on complete survey data (the 2014 Lesotho Demographic Household Survey), missingness was created in the response variable (BMI) using three missing data mechanisms: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR). Missing values were then imputed using three single imputation methods (mean substitution, hot-deck and regression) and two multiple imputation methods (multiple linear regression and predictive mean matching (PMM)). The analysis indicated that the PMM imputation method is more precise and produces a lower estimated standard error than the other methods.

Item Concepts for the construction of confidence intervals for measuring stability after hallux vulgus surgery: theoretical development and application. (2021) Ogutu, Sarah Atieno.; Mwambi, Henry Godwell.; Ziegler, Andreas.
The absolute change in the corrected angle measured immediately after surgery and after bone healing is a clinically relevant endpoint to judge an osteotomy's stability. The primary objective of this research is to illustrate the non-inferiority of a novel screw used for fixation of the osteotomy compared with a standard screw. If the difference in the angles after surgery and after bone healing can be assumed to be normally distributed, the absolute change follows the folded normal distribution.
The most natural approach to presenting the clinical study results is to use a confidence interval to compare two folded normal distributions. We construct a confidence interval to compare two independent folded normal distributions using the ratio of two chi-square random variables, the difference of two chi-square distributions, and the bootstrap method. We illustrate the approaches using a study on hallux valgus osteotomy. The proposed confidence intervals permit an investigation of non-inferiority for the two treatment groups in clinical trials with endpoints following a folded normal distribution. The application to real data indicates that the confidence intervals based on the ratio of two chi-square random variables and on the bootstrap are straightforward and easy to calculate. Bootstrapping was asymptotically more accurate than the standard interval obtained from samples that assume normality. It was also an appropriate way to ascertain the stability of the results. Judging by δ of the bootstrap method, we establish non-inferiority for the new surgical method. In conclusion, the approaches are promising, and we recommend them for comparing other practical data that require the use of the folded normal distribution.

Item Flexible statistical modelling of the determinants of childhood anaemia in Tanzania and Angola. (2020) Ndlangamandla, Qondeni.; Ramroop, Shaun.; Mwambi, Henry Godwell.
Anaemia is one of the major causes of morbidity and mortality in children aged five or less in Africa, affecting 25% of the world’s population. In developing countries, it accounts for more than 89% of the disease burden. Although anaemia affects all population groups, children under five years of age and women of reproductive age (15–49 years) are more vulnerable than any other age group. According to the World Health Organization’s 2008 report, 50% of anaemia cases in Africa were associated with insufficient consumption of iron (iron deficiency anaemia).
This study aims to determine the factors associated with childhood anaemia in Tanzania and Angola. To this end, several statistical models that could robustly model the binary response variable, anaemia, were fitted to the Tanzania Demographic and Health Survey (TDHS) and the Angola Demographic and Health Survey (ADHS) data sets. Survey Logistic Regression (SLR), which belongs to the class of Generalized Linear Models (GLM), is suitable because of its robustness, not only in modelling dichotomous responses, but also in its ability to deal with data from complex survey designs. The SLR model was extended by a Generalized additive mixed model (GAMM), which was fitted to relax the assumption of normality and to fit other terms non-parametrically. Furthermore, to cater for spatial effects and spatial variability, a Spatial Generalized linear mixed model (SGLMM) was fitted to the two data sets to help in the investigation of factors that are spatially related to childhood anaemia. The SLR and SGLMM models were fitted using the SAS software (PROC SURVEYLOGISTIC and PROC GLIMMIX, respectively), while the GAMM model was fitted using the R statistical software. Moreover, smooth maps were produced for the outcome variable using ArcGIS software for the purpose of identifying the hot spots of childhood anaemia in each country. The aim of this study was successfully achieved. The three models fitted to the two data sets revealed that the factors highly associated with childhood anaemia in both countries are: the highest level of education of caretakers (mothers), child gender, age of the child and stunting status.
The models also revealed that the standard of living in Tanzania has a significant effect on childhood anaemia.

Item Modelling tuberculosis risk factors among adult men in South Africa. (2021) Mlondo, Muziwandile Nhlakanipho.; Melesse, Sileshi Fanta.; Mwambi, Henry Godwell.
Tuberculosis is among the major public health problems not only in South Africa but worldwide. Tuberculosis is an underlying cause of more than 1.5 million deaths each year worldwide, making it the world's top infectious killer. There are more cases among men than among women. Such a heavy burden requires an understanding of the tuberculosis status of the people, especially among men, and of the associated risk factors. Therefore, this study uses statistical methods that are suitable for estimating the effect of the risk factors associated with tuberculosis among adult men. The study used the 2016 South African Demographic and Health Survey data. Generalized Linear Models were applied to the data: first the binary logistic regression model, which assumes simple random sampling, and then survey logistic regression, which incorporates the complex design by means of robust standard errors of estimates. The findings revealed that models that account for the complex design are more suitable than those that do not. To account for variability between the primary sampling units, a generalized linear mixed model (GLMM) was then used. GLMMs account for correlation within clusters by means of random effects, which also account for cluster-to-cluster heterogeneity. Further, a generalized additive mixed-effect model was used to fit nonlinear and non-normal data; the categorical variables were modelled parametrically and the continuous variables non-parametrically. The thesis also discussed limitations of each of these models.
The findings from this study revealed that the risk factors of tuberculosis are: any chronic disease, current age, region, race, number of times away from home, marital status, weight, the interaction effect of chronic disease and age, and the interaction effect of smoking status and number of household members.

Item Flexible statistical modeling of childhood malnutrition in Malawi. (2019) Magagula, Mzwakhe Elmon.; Ramroop, Shaun.
Childhood malnutrition is one of the most significant health problems affecting public health departments, mainly in developing countries. The development of proper assessment of malnutrition is one of the challenges faced by policy makers in many countries across the globe. Therefore, the current study was undertaken with the primary objective of assessing and determining all possible determinants of malnutrition in Malawi, using the 2015/16 Demographic and Health Survey (DHS) data. Different types of statistical models were adopted to allow variety in methodology and to find the most accurate results among the models used. As a point of departure, the study utilized Generalized Linear Models (GLM) to account for the ordering of the outcome variable (severe, moderate and nourished). Furthermore, we found it worthwhile to extend the ordinal logistic regression to include random effects and thereby account for the variability between the primary sampling units or villages. In addition, we adopted a class of models that allows flexible functional dependence of an outcome variable on covariates by using nonparametric regression, namely the generalized additive mixed model (GAMM), which relaxes the assumptions of normality and linearity inherent in linear regression.
Analyses of childhood stunting have mainly used mean regression, yet quantile regression is more appropriate: it provides the flexibility to study the impact of predictors on different desired quantiles of the response distribution, whereas mean regression allows only the impact of predictors on the mean of the response variable to be studied. Therefore, quantile regression models were adopted to provide a complete picture of the relationship between the outcome variable (stunting) and the predictor variables at different desired quantiles of the response distribution. This study fitted a Bayesian additive quantile regression model with structural spatial effects for childhood stunting in Malawi, using 2015/16 DHS data. Inference was fully Bayesian, using the integrated nested Laplace approximation (INLA) because of its much faster computation compared to Markov chain Monte Carlo (MCMC). Furthermore, different types of quantile regression models were fitted and compared according to their Deviance Information Criterion (DIC) values to determine the best model among them. Each of these models has inherent strengths and weaknesses. The choice of one depends on what the research is trying to accomplish and the type of data one has. In this study, we combined the results from different models, mainly from our quantile regression models. The significant determinants of childhood stunting in Malawi were found to be the age of the child, the education level of the parents (mother and father), the family’s place of residence, gender of the child, incidence of recent fever, incidence of recent diarrhoea, multiple births, mother’s age at birth, body mass index of the mother, wealth index of the family, source of drinking water and district. Furthermore, from the spatial quantile regression model, a map was generated showing the distribution of malnutrition at the district level in Malawi.
This map gave us an overview of how stunting is distributed in Malawi, and from the map we were able to visualize and assess the affected districts.

Item Modelling poverty in Zimbabwe based on the demographic health survey dataset using GLMs and GAMMs. (2020) Mtshali, Precious.; Ramroop, Shaun.; Mwambi, Henry Godwell.
Zimbabwe has been in a state of political, economic, and social crisis for the past 15 years. In 2004, 80% of Zimbabweans were living below the national poverty line. By January 2009, only 6% of the population held jobs in the formal sector. Living in poverty may lead to stressful conditions that are linked to mental health problems in adults and developmental issues in children. This study investigates the risk factors that affect poverty status in Zimbabwe and makes recommendations for current policy on poverty, using statistical models such as generalized linear models (GLMs) and generalized additive mixed models (GAMMs). This study makes use of the Zimbabwe 2015 Demographic and Health Survey (DHS) dataset. A wealth index was created from 29 variable questions using principal component analysis. The first component was taken and the factor score was used. The median was used as the cutoff, so the dichotomous response variable was socioeconomic status (SES) (1=Poor, 2=Not poor). The DHS data has explanatory variables such as the level of education, sex and age of the household head, size of the household, and place of residence. The results from both models (GLMs and GAMMs) reveal that these demographic factors are key determinants of household poverty in Zimbabwe.
This study demonstrates that the government of Zimbabwe needs to pay attention and intervene by looking into the demographic factors that affect poverty status.

Item Factors affecting child mortality in Lesotho using 2009 and 2014 LDHS data. (2021) Mkhize, Nonduduzo Noxolo.; Melesse, Sileshi Fanta.; Mwambi, Henry Godwell.; Ramroop, Shaun.
The child mortality rate is known to be an important indicator of social development, quality of life, welfare and the overall health of a society. In most countries, especially developing countries, the death of a child is usually caused by transferable, preventable diseases and poor health. Progress in reducing under-five mortality has been made globally since 1990: under-five deaths declined from 12.7 million in 1990 to approximately 6 million in 2015. All regions except the developing countries in Sub-Saharan Africa, Central Asia, Southern Asia and Oceania had reduced the rate by 52% or more in 2013. Lesotho is a developing country with one of the highest rates of infant and child mortality. The study uncovers the factors influencing child mortality in Lesotho based on the Lesotho Demographic and Health Surveys for 2009 and 2014. Survey logistic regression (SLR), a model under the generalized linear model framework, was used to find the factors related to under-five child mortality while accounting for the complexity of the sampling design. The SLR model is not able to account for variability arising from correlation between subjects from the same clusters and households. The generalized linear mixed model was therefore applied. Lastly, to relax the normality and linearity assumptions of the parametric models, the semi-parametric generalized additive model was fitted to the data.
Identifying the determining factors of child mortality will benefit the planning of intervention programs and help policy makers to formulate policies that reduce child mortality and accomplish the Millennium Development Goals (MDGs). This study intends to improve the existing knowledge on child mortality in Lesotho by studying the determining factors in detail. Based on previous studies, this paper will recommend intervention designs and policy formulation. Overall, the findings of this research showed that birth order number, weight of the child at birth, age of the child, breastfeeding, wealth index, education attainment, mother’s age, type of place of residence and number of children living were the key determining factors of under-five mortality in Lesotho. The study shows that policy makers should strengthen child health interventions in order to decrease under-five mortality. The results achieved can help with policy formulation to control and reduce child mortality. The government should continually assess current programs in order to review them and develop programs that are more applicable.

Item Risk factors associated with and factors that influence intimate partner violence. A case study of sub-Saharan regions. (2021) Mhelembe, Talani Mabrow.; Ramroop, Shaun.; Habyarimana, Faustin.
The reduction of intimate partner violence is critical to most societies' well-being and posterity, and for policymakers. However, in most cases, coming up with an accurate intimate partner violence evaluation tool that focuses on vulnerable women is a challenge for applied policy research. Intimate partner violence against women of reproductive age (15-49 years) has been measured using the number of cases reported, and this approach has several underlying problems.
Therefore, in this work, we came up with a rating scale from Demographic and Health Survey data as an alternative method to measure intimate partner violence (Chapman & Gillespie, 2018), and examined different statistical methods suitable for identifying the associated factors. A generalized linear mixed model was used to extend survey logistic regression to incorporate random effects and account for variability amongst the primary sampling units. This was done to account for the complexity of the sampling design and the ordering of the outcome variable. We also used the generalized additive mixed model to relax the assumptions of normality and linearity intrinsic in linear regression models; in this model, categorical predictors were modelled parametrically, while continuous covariates and interactions between continuous and categorical variables were modelled non-parametrically. Each of these models has inherent flaws and strengths. The choice of a statistical model depends on the objectives to be achieved. The findings revealed that the following determinants are the key factors influencing intimate partner violence: age of the woman's partner, marital status, region where the woman lives, age of the woman, media exposure, size of the family, polygamy, sex of the household head, wealth index, pregnancy termination status, body mass index, cohabitation duration, partner's desire for children, partner's education level, woman's working status, and woman's earnings compared to partner's earnings.

Item Estimating the size of the underground economy in South Africa using the Multiple Indicators Multiple Cause Model (MIMIC) and the Currency Demand Approach (CDA). (2021) Koloane, Cathrine Thato.; Bodhlyera, Oliver.
The underground economy is a major challenge across the world, affecting both developed and developing economies.
South Africa is no exception to this phenomenon and has lost billions of rands due to the underground economy. Tax revenue loss due to illicit trade was estimated to be approximately R36.5 billion in 2019, with illicit cigarettes and tobacco and undervalued clothing and textiles perceived to be the main contributors to this economy. The objective of this research is to estimate the size of the underground economy in South Africa using the Currency Demand Approach (CDA) and the Multiple Indicators Multiple Cause (MIMIC) models. To accomplish this, secondary economic data were obtained from Statistics South Africa (STATSSA), the World Bank, the South African Reserve Bank (SARB) and the International Monetary Fund (IMF) for the period 2000 to 2020. The results from the MIMIC model showed that the underground economy in South Africa was growing, with estimates ranging from 25.4% to 32.3% of GDP for 2003 to 2020. The model further indicated that mining employment rate, tax burden and government expenditure are the causes of the underground economy, and that Nominal Gross Domestic Product (NGDP) and labour force participation rate are its indicators. Similarly, the CDA model showed a steadily increasing underground economy, estimated at 28.8% of GDP on average for 2003 to 2020. Furthermore, the CDA model showed that NGDP, tax burden, interest rate, unemployment rate, self-employment rate and social benefits granted by the government are determinants of the underground economy. This study makes a significant contribution to the body of knowledge in this research area and will provide much-needed insights into the relative magnitude of the underground economy, its drivers and the extent of tax evasion in South Africa, ultimately contributing towards an improved tax base and compliance.
It will further serve as a basis for future research on this topic by academia, the private sector, government, multilateral bodies and all other interest groups.

Item Modelling South African official gold reserves position, and foreign exchange reserves position using time series models. (2020) Gumede, Sibusiso.; Bodhlyera, Oliver.; Mwambi, Henry Godwell.
Every country's central bank should hold enough reserves, such as foreign exchange, gold, or other forms of reserves, to be able to help the country in times of difficulty or financial crisis. This involves ensuring that adequate official public sector foreign assets are readily available to meet any defined range of objectives of the country. Reserves can also play a pivotal role in supporting and maintaining confidence in the policies for monetary and exchange rate management, including the ability to intervene in the foreign exchange market to influence the value of the local currency. They can also be used to provide proof to the market that a country can meet its current and future external obligations, limit external exposure by maintaining foreign currency liquidity to absorb shocks during times of crisis, show that the domestic currency is backed by external assets, assist the government in meeting its foreign exchange needs and external debt obligations, and maintain sufficient reserves for national disasters or emergencies. None of this can be done without an understanding of all the factors that affect a country's reserves; hence careful analysis of reserves plays a crucial role in how the central bank should manage them. This includes a wide range of social, economic, and statistical analyses. However, this study focuses on the statistical analysis, that is, building models to predict or forecast the future trajectory of reserves positions.
These models should be able to consider all the factors that influence the reserves, such as trend, seasonality and random variability. Seasonal ARIMA (SARIMA) models were used as initial models to forecast the future reserves positions. Seasonal ARIMA Generalized Autoregressive Conditional Heteroskedasticity models with Skewed Student-t Distribution (SARIMA-GARCH-SSTD) were also used to forecast volatility in the foreign exchange reserves data after statistical tests were carried out and the data were found to have ARCH effects. The volatility model found to produce the best forecasts for the foreign exchange reserves data was the SARIMA(0,1,0)(2,1,0)₁₂-GARCH(1,1)-SSTD model. The SARIMA model developed earlier for the gold reserves data was then benchmarked against the Holt-Winters' Seasonal method. The results from the analysis showed that the SARIMA model outperformed the Holt-Winters' Seasonal method in forecasting gold reserves positions. We found that future gold reserves positions can be better predicted using the SARIMA(1,1,0)(0,1,2)₁₂ model. The best model was selected from many other models using model diagnostics such as comparisons of the AIC, RMSE, number of significant parameters and the evaluation of residuals to identify their flexibility. Using the forecasting methods developed in this study, the central bank can better understand what to expect in the future and decide on what measures to implement for national economic stability.

Item Modelling and forecasting the costs of attending to electricity faults using univariate and multivariate time series forecasting models. (2018) Buthelezi, Nkosiyapha Mthunzi.; Bodhlyera, Oliver.
Electricity price forecasting has become an essential element of both public and private decision making. Both the shortage of electricity supply and the cost of electricity remain among the country’s biggest problems and need to be addressed decisively.
Apart from the demand and supply side of electricity, electricity cost is an important part of electricity delivery. Therefore, the accurate estimation of electricity cost and its maintenance is an important part of the country’s electricity supply strategy. The main aim of this study is to forecast the cost of rectifying or attending to electricity faults. The study demonstrates that the AutoRegressive Integrated Moving Average (ARIMA), AutoRegressive Integrated Moving Average with exogenous variables (ARIMAX), Vector AutoRegressive (VAR) and Random Forest methods are capable of producing accurate forecasts of costs associated with attending to reported faults. In this study, we analyse the costs of attending to electrical faults in the Bethlehem and Bloemfontein areas of the Free State region of South Africa, from 4 January 2012 to 3 June 2017, using univariate and multivariate ARIMA, ARIMAX, VAR and Random Forest models. ARCH and GARCH models are also used to model the volatility found in the daily costs data. The model developed from these data can be used to forecast future fault costs and can help policy makers with planning decisions.

Item Statistical models to determine factors affecting under-five child mortality in South Africa. (2020) Bovu, Andisiwe.; Melesse, Sileshi Fanta.
The level of under-five child mortality is an important indicator of the economic, social and health development of a nation. In the last two decades, substantial progress has been made in reducing under-five child mortality globally, with deaths among children under the age of five years dropping from approximately 12 million in 1990 to about 6.3 million in 2015. However, significant strides to address the key risk factors are still needed in the Sub-Saharan Africa region if it is to achieve the 2030 Sustainable Development Goals. The key objective of the study is to identify key factors associated with mortality of children under the age of five years in South Africa.
In order to identify these factors, the study used different statistical models that accommodate a binary response variable. The models used include Logistic Regression, Survey Logistic Regression, Generalized Linear Mixed Models and Generalized Additive Models. Although logistic regression is useful in modelling data with a dichotomous outcome, it is not suitable for modelling data obtained through a complex survey that incorporates weights, stratification and clustering. Survey logistic regression is used to model the relationship between a binary dependent variable and a set of explanatory variables by making use of the sampling design information. Including random effects in the model results in generalized linear mixed models (GLMM). These models are an extension of linear mixed models that allow response variables from different distributions, such as binary responses. One can think of GLMM as an extension of generalized linear models (e.g. logistic regression) that combines both fixed and random effects. These statistical models assume a linear parametric form for the explanatory variables. However, this assumption of linear dependence of the response on covariates may not hold. Hence, we introduce generalized additive models (GAM). The GAM models showed a non-linear relationship between the response variable and some covariates. The results showed that the size of the child at birth, breastfeeding, birth order number, ethnicity, number of children aged five and under, total children ever born, source of drinking water and province were significantly associated with under-five child mortality. The study concludes that prolonged breastfeeding, improved health services and source of water are among the main factors that could further reduce under-five child mortality.
Therefore, the study suggests that there is a need to strengthen child health interventions in South Africa to reduce the under-five mortality rate even more in order to achieve the 2030 Sustainable Development Goals (SDGs).

Item Classification of banking clients according to their loan default status using machine learning algorithms. (2022) Reddy, Suveshnee.; Chifurira, Retius.; Zewotir, Temesgen Tenaw.
Loan lending has become crucial for both individuals and companies. For lending institutions, although profitable, it can be very risky due to clients defaulting on their loan agreements. Credit risk assessment is a critical process carried out by most lending institutions; it reduces the possibility of lending to clients who will default on their loan repayment, but it does not eliminate the problem. Thus, a collections process which aims to retrieve unpaid debt is also necessary. With South Africa facing another recession, which was only worsened by the lockdown during the COVID-19 pandemic, lending institutions can expect an increase in the number of loan defaulters. To counter this increase, changes will have to be made to their policies and processes. Changes can be made to either the loan application procedures (e.g. credit risk assessment, affordability assessment et cetera) or the post-disbursal procedures (e.g. collections processes). The aim of this study is to predict whether a client will default on his/her loan, using machine learning algorithms, in order to enhance the collection process of the financial institution under study, where default is defined as missing at least three payments in the first 12 months of the loan being granted. The logistic regression model, decision tree, random forest, support vector machine, Naïve Bayes classifier, k-nearest neighbours algorithm and artificial neural network were fitted to the balanced dataset.
In the researcher’s analysis, loan data from a South African financial institution were used for the period August 2019 to December 2019. Variables related to a client’s demographics, income, expenses and debt, as well as loan information, were included in the dataset. Exploratory data analysis (EDA) was utilised in order to analyse the dataset and summarise its main characteristics. To reduce the dimensionality of the dataset, two techniques were used, namely principal component analysis (PCA), which is also used to correct the data for multicollinearity, and feature selection (i.e., recursive feature elimination). Each model was fitted to the dataset using these two techniques, and the confusion matrix and metrics such as balanced accuracy, true positive ratio, true negative ratio, AUC score and the Gini coefficient were used to evaluate the different models in order to determine which model performed the best and was most suited to this application. The results show that when using the PCA approach, the random forest model, which obtained a balanced accuracy score, true positive ratio and AUC score of 0.69, 0.74 and 0.74, respectively, performed the best. The random forest model also performed the best when using the feature selection technique, obtaining a balanced accuracy score, true positive ratio and AUC score of 0.69, 0.74 and 0.75, respectively. When comparing the random forest model using PCA to the random forest model using feature selection, the results showed only a marginal difference in each performance metric analysed. The random forest model using PCA utilised 48 variables, whereas the random forest model using feature selection utilised only 18 variables and thus seemed to be more suitable for the classification problem under study. The results of this study are expected to benefit analysts and data scientists in financial institutions who would like to identify robust machine learning algorithms for classifying defaulting clients.
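As a rough illustration of the evaluation metrics named above (not the study's own code), the sketch below computes the true positive ratio, true negative ratio and balanced accuracy from confusion-matrix counts, and converts an AUC score to a Gini coefficient using the common relation Gini = 2 * AUC - 1. The function names and the counts are hypothetical, chosen only so that the results reproduce the rounded 0.74 and 0.69 quoted for the random forest model; the study's actual confusion matrix is not given in the abstract.

```python
def classification_metrics(tp, fn, tn, fp):
    """Return basic metrics from binary confusion-matrix counts.

    tp/fn: defaulters predicted correctly / missed
    tn/fp: non-defaulters predicted correctly / wrongly flagged
    """
    true_positive_ratio = tp / (tp + fn)   # share of defaulters detected
    true_negative_ratio = tn / (tn + fp)   # share of non-defaulters detected
    balanced_accuracy = (true_positive_ratio + true_negative_ratio) / 2
    return {
        "true_positive_ratio": true_positive_ratio,
        "true_negative_ratio": true_negative_ratio,
        "balanced_accuracy": balanced_accuracy,
    }


def gini_from_auc(auc):
    """Gini coefficient under the usual convention Gini = 2 * AUC - 1."""
    return 2 * auc - 1


# Hypothetical counts (assumption, for illustration only).
metrics = classification_metrics(tp=74, fn=26, tn=64, fp=36)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")

# AUC scores quoted in the abstract for the two random forest variants.
for auc in (0.74, 0.75):
    print(f"AUC {auc:.2f} -> Gini {gini_from_auc(auc):.2f}")
```

Under this convention, the quoted AUC scores of 0.74 and 0.75 would correspond to Gini coefficients of roughly 0.48 and 0.50.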
This study is also of significance to policy makers who want to identify the risk factors associated with loan defaulting clients.
Item Statistical models to analyse a baseline survey on rural KwaZulu-Natal adults’ HIV prevalence and associated risk factors. (2020) Moodley, Kameshan.; Zewotir, Temesgen Tenaw.; Roberts, Danielle Jade.
South Africa is at the global epicentre of the HIV-AIDS pandemic. Though there has been an increase in prevention and control measures that has led to a significant reduction in HIV-AIDS mortality rates globally, South Africa has experienced a high share of the HIV burden. HIV-AIDS imposes a substantial economic burden on both individuals and governments. It has had a considerable effect on poverty by affecting potentially economically active citizens who would otherwise have entered the workforce and contributed to the local and national economy. This has hindered economic growth and development in South Africa. The 2016 UNAIDS Gap Report estimates that in 2015 there were seven million people living with HIV in South Africa and that this resulted in 180,000 AIDS-related deaths in the same year. The same year saw an unprecedented 380,000 new reported infections. The prevalence of HIV-AIDS in South Africa remains high at 19.2% among the general population. This study was an investigation into the determinants of HIV in adults in the age group 15-49 years. The study used the HIV Incidence Provincial Surveillance System (HIPSS) to collect data between June 2014 and June 2015. The final data set comprised 9,804 observations and consisted of explanatory variables pertaining to individuals’ socio-economic, socio-demographic and behavioural circumstances. The response variable was binary, indicating whether a participant tested positive or negative for HIV. Incorporating survey weights into the data owing to the complex sample design necessitated the use of multilevel regression procedures.
To this end, survey logistic regression and the generalised linear mixed models were employed. The results emanating from these models revealed that factors encompassing socio-economic, demographic and selected behavioural characteristics were significantly associated with HIV prevalence in the study location. In some instances, households in close proximity may exhibit similarities, resulting in spatial autocorrelation; this requires techniques, such as geographically weighted regression, that are able to account for it. The application of a spatial multilevel model showed that the influence between households in close proximity is greater than between those further away, a phenomenon that would be ignored in conventional multilevel models.
Item Risk factors and classification of diabetes in South Africa. (2019) Grundlingh, Nina.; Zewotir, Temesgen Tenaw.; Roberts, Danielle Jade.
Diabetes prevalence has been increasing in recent years, globally and in South Africa. The number of people with diabetes globally has risen from 108 million in 1980 to 442 million in 2014. It was estimated that, of the 1.8 million people between 20 and 79 years old with diabetes in South Africa in 2017, 84.8% were undiagnosed. Diabetes was the 2nd leading underlying cause of death in South Africa in 2016. Identifying risk factors for diabetes will assist in raising public awareness and assist public authorities to develop prevention programs. This study aimed to investigate the prevalence and risk factors associated with diabetes in the South African population aged 15 years and older, as well as explore various statistical methods of classifying a person’s diabetic status. This study made use of the South African Demographic Health Survey 2016 data, which involved a two-stage sampling design. The study participants included 6442 individuals aged 15 years and older.
Of the individuals sampled, 11%, 67% and 22% were found to be non-diabetic, pre-diabetic and diabetic, respectively. Classification methods, namely, a decision tree, random forest and Bayesian neural network, were used to assess classification of diabetic status based on the risk factors. Of the classification methods, the Bayesian neural network gave the highest accuracy (75.9%). These methods, however, failed to account for the complex survey design and sampling weights. In addition, these methods are not able to provide the estimated effect that a risk factor has on the diabetic status. Regression models were employed to identify the significant risk factors. Due to the ordinal nature of diabetic status, the proportional odds model was fitted initially. However, the proportional odds assumption was found to be violated. A multinomial generalized linear mixed model was fitted to account for the complexity of the design. However, the model’s residuals were found to be spatially autocorrelated. Accordingly, a spatial generalized additive mixed model, which accounts for the complexity of the survey structure as well as incorporates nonlinear spatial effects, was adopted. The highest accuracy from the regression models considered was obtained from this adjusted surface correlation model (accuracy = 70.8%). Individuals of the Black/African race were more likely to be diabetic (OR = 1.429; 95% CI: 1.032-1.978) than other races. Individuals taking high blood pressure medication were 1.444 times more likely to be diabetic than pre-diabetic (95% CI: 1.167-1.786) compared to those not taking high blood pressure medication.
Item Modelling depression in South Africa. (2020) Ghoor, Tahzeeb.; Roberts, Danielle Jade.; Lougue, Siaka.
Depression is considered to be the leading cause of disability worldwide, with approximately 350 million individuals, of all ages, affected. The mental disorder is predominant in females and poverty is associated with an increased prevalence.
The 12-month prevalence in South Africa is approximately 16.5%, with a lifetime prevalence of common mental disorders among adults of 38% (World Health Organization (WHO), 2017). In order to assist individuals in dealing with depression, it is important for such individuals to be identified at an early stage so that they can be given the necessary support before their depression becomes unmanageable and puts them at risk of self-inflicted harm. The objective of this study was to investigate the prevalence and risk determinants of depression among South African individuals aged 15 to 49 years and to determine which factors contribute the most to this mental illness. This study made use of data from the 2016 South African General Household Survey, which was carried out using a multistage cluster sampling technique. The sample was not spread geographically in proportion to the population, but rather equally across the enumeration areas. The response variable of interest was binary, indicating whether an individual considered himself/herself depressed or not. Three statistical approaches were applied. The first was the survey logistic regression model, which is a design-based approach. In this approach, parameter estimates and inferences were based on the sampling weights, and only inferences concerning the effects of certain covariates on the response variable were of interest. The second was a generalized linear mixed model, which is a model-based approach. In this approach, interest was also in estimating and accounting for the proportion of variation in the response variable that was attributable to each of the multiple levels of sampling. This approach also accounted for possible correlations in the data, where individuals in the same household or cluster tend to be more alike than those from other households or clusters. Lastly, a Bayesian network was applied to model the conditional dependence among the variables.
This approach is a type of probabilistic graphical model that uses Bayesian inference for calculations of the probabilities. The results indicated that substance abuse, the person’s perceived health status and gender were significantly associated with depression. Each of the three techniques was then used to classify the depression status of the individuals, and their performances in this classification were compared. The purpose of being able to classify an individual’s depression status, based on their individual and household factors, is to be able to identify a depressed individual in order to target them for intervention. The generalized linear mixed model proved to be the better performing technique in terms of classification. Thus, we recommend that when using data based on a complex survey design, this technique be considered in classifying the occurrence of an event of interest.
Item Joint modelling of child poverty and malnutrition in children aged 6 to 59 months in Malawi. (2019) Dube, Lindani.; Roberts, Danielle Jade.
The objective of this study was to identify risk factors associated with poverty and malnutrition of children aged 6 to 59 months in Malawi, making use of the joint model. By joint modelling, we refer to simultaneously analysing two or more response variables emanating from the same individual. Using the 2015/2016 Malawi Demographic and Health Survey, we jointly examine the relationship that exists between poverty and malnutrition of children aged 6 to 59 months in Malawi. Jointly modelling these two outcome variables is appropriate since it is expected that people who live in poverty would have poor nutrition, and if a child is malnourished, the likelihood that they come from a poor family is greatly increased.
Jointly modelling correlated outcomes can improve the efficiency of parameter estimates compared to fitting separate models for each outcome, as joint models have better control over type I error rates in multiple tests. A generalized linear mixed model (GLMM) was adopted and a Bayesian approach was used for parameter estimation. The potential risk factors considered in this study comprised the child’s age in months, gender of the child, birth weight, birth order, mother’s education level, sex of the household head, language, household smoking habit, anaemia level, type of residence (urban or rural), region, toilet facility, source of drinking water, and multiple births. Each response was modelled separately as well as jointly and the results compared. The R package MCMCglmm was used in the analyses. The joint model revealed a positive association between malnutrition of children and poverty in the household.
Item Statistical analysis of the school attendance rate among under 20 South African learners. (2020) Chabalala, Thabang Goodman.; Roberts, Danielle Jade.; Zewotir, Temesgen Tenaw.
School attendance is crucial for the growth and development of the mindset of a child. The development of the mindset and the provision of training to learners is an investment in a better future for the country. The government has even made school attendance compulsory because of the benefits it brings in the future. In the past, however, many studies have reflected a problem with school attendance, with financial constraints appearing as the main hindrance. For this reason, the government has taken the initiative to make school attendance free for those who cannot afford to pay for it. This has reduced the number of individuals who wished to attend school but had no funds to pay for it, and has provided an opportunity for those who need it. Still, the country has individuals who are of school-going age but are not attending school.
Some of these individuals are enrolled for school but choose not to attend. This raises many questions about the factors affecting school attendance of learners. The aim of this study is therefore to identify factors affecting school attendance of learners at the basic education level. To identify these factors, the study made use of different statistical models which accommodate a binary response. The models used in the study include Correspondence Analysis (CA), Survey Logistic Regression (SLR), Generalized Linear Mixed Model (GLMM) and Generalized Additive Mixed Model (GAMM). The results suggest that the likelihood of school non-attendance is associated with the Northern Cape and Western Cape, which are mostly dominated by Coloured/Indian/Asian race groups, and with learners who share an “Other” relationship to the household head and have no parents present. Moreover, female learners whose mothers are not alive and who come from families with salaries and pension/grant as the source of income are less likely to attend school. In contrast, learners who come from all provinces other than the two specified above, are African/Black by race, share a child/grandchild relationship to the household head, have both parents alive, come from households with a high wealth index z-score and have a total income above R25000 are more likely to attend school. This is a clear indication that the initiatives applied by the government and the results of past studies have assisted in improving school attendance, but more initiatives are still needed to cover the areas which are still reflecting poor school attendance in order to meet the aims of the Millennium Development Goals.