on May 17, 2011 by if (function_exists('coauthors')) { coauthors(); } else { the_author(); } ?> in Public Health, Comments (0)
Bayesian Network Modelling of Factors Influencing the Development of Obesity
Obesity – A Public Health Problem
Obesity is a major public health problem. The prevalence has risen continuously over the past few decades and it is beginning to place an intense strain on our health resources. Although Obesity is a complicated disease both ; it is ultimately preventable.
Sources of Data
In order to complete our Bayesian analysis the Health Survey for England (HSE) data from 2003 until 2006 was used. These surveys focussed on Coronary Heart Disases and provided detailed information on diet and physical activity levels. The HSE is freely available for download from the UKDA. A detailed look into the structure of the HSE is also available.
The variables that were extracted and used in the model were divided into three categories:
- Socio-demographic: Age, Sex, Social Class, Occupational Status, Marital Status, Dependent Children, access to Transport and Leisure, Health, Ethnicity, Education level
- Energy expenditure: Recreational PA, Occupational PA, Incidental PA (walking behaviour).
- Energy Intake: Cake, snack and fried food intake, fruit and vegetable intake.
What is a Bayesian Network?
A Bayesian Network is a pictorially displaying the probabilistic relationships between different variables. Typically, we will have a collection of correlated random variables, i.e. a set of variables whose values can vary between different data points. The fact that the variables are correlated mean that the values observed for one variable may be influenced by the value of another variable. So for our obesity problem we have many socio-demographic variables, e.g. age, gender, economic status, social class etc., and diet and activity related variables, e.g. amount of recreational physical activity, amount of occupational physical activity, amount of fruit and vegetable intake etc. In a more traditional statistical analysis we might choose one of these variables as our response variable of interest, e.g. recreational physical activity, and use the remaining socio-demographic variables as predictors. Due to the highly correlated nature of the socio-demographic variables and the other energy expenditure and energy intake variables this traditional statistical approach can lead to bias – see for example Greenland (2003) for a discussion of this issue. Instead, a Bayesian Network attempts to model the joint distribution of all the variables, i.e. the probability of see the observed values of all the variables, so that no one variable is picked out for special status. The pictorial representation of a Bayesian Network tells us which other variables each variable is conditionally dependent on. A simple Bayesian Network for our obesity problem is shown below,
The picture, or graph as it is called, clearly shows two variables, Age and Gender, whose values influence the value of the HoursOfSport variable. The arrows indicate that HoursOfSport is conditionally dependent upon the Age and Gender variables. The direction of the arrows should not be interpreted as implying a direction of causality. The graph merely tells us that the probability of a particular value for the HoursOfSport variable is dependent upon the Age and Gender values for that person. We write this, P(HoursOfSport | Age, Gender). The conditional dependency could arise because of statistical correlations between variable resulting from causal mechanisms operating in different directions to the arrows of the Bayesian Network. From this you should realise that more than one graph can describe the same joint distribution. However, the arrows do help us to identify relevant variable and sub-relationships that we might be interested in if we are trying to understand or predict the HoursOfSport variable. For real problems a Bayesian Network will be more complicated than the simple network shown above. The network below
shows a graph for a joint distribution between five different variables. The connected nature of the graph illustrates the highly correlated nature of the variables in this case. For all Bayesian Networks the graph must not contain any cycles (this would lead to irresolvable cycles of dependencies) and so the graph is often referred to as a Directed Acyclic Graph or DAG. The directed adjective refers to the fact that we have arrows linking between the nodes.
The graphical depiction of the joint probability distribution of a collection of variables is a succinct and intuitive way of displaying the relationships between the variables. I can efficiently and easily communicate to my colleagues what I think is going on between the variables – I simply draw the graph. For our obesity data I can simply draw a graph and use the resulting probabilistic model to make predictions. However, how do I know I have the correct graph? What I drew was influenced by my particular biases and prejudices about what is important in causing obesity. Luckily, because we have a probabilistic model we can use Bayesian inference techniques to work out from the data what the most appropriate graph is – we let the data tell us what relationships are present, free from any prejudice. The graph is simply a probabilistic way of displaying a mathematical equation. The graph tells us the probability of getting a particular observation or data point given that particular DAG, i.e. it tell us P(Data | DAG). We then use the incredibly simple but powerful Bayes’ rule to turn this into P(DAG |Data),
This gives us the probability of a particular DAG given the observed data. From P(DAG|Data) we could work out, say the most likely graph (DAG) for that data set. The precise details of how we use P(DAG|Data) is explained in the video presentation.
References
Greenland S. (2003) Quantifying biases in causal models: classical confounding vs collider-stratification bias. Epidemiology 14, 300–306
Tags: bayesian networks, modelling, obesity
No Comments
Leave a comment