In my previous post on data quality, we looked at input data and how recency, frequency and the data collection process can affect the quality and types of analysis that can be done. In this post, we turn to the role of methodology in creating quality data and the factors that go into selecting the right methodology. The discussion will get a bit technical, but it’s worth getting a handle on these important concepts to appreciate what goes into creating quality data.
Applying the Right Methodology
For EA and for the purposes of this examination, “methodology” refers to the techniques we use to build our data products, such as DemoStats and DaytimePop, as well as those used to execute custom projects. These techniques range from simple rule-based algorithms to machine learning methods. In large part, the type, amount, reliability and recency of the data available dictate the method we use.
We build unique methods for nearly every theme and geography captured by our datasets. What does that mean? For DemoStats alone we have created and maintain over 150 unique algorithms to produce 754 variables across 42 themes for 5 time periods and 11 levels of geography.
It is helpful to think about methodology as a spectrum, with model accuracy at one end and model generalization at the other. There is not a direct trade-off between accuracy and generalization. The best models have high levels of both. Yet modelling techniques tend to start at one end of the spectrum and, through model training, calibration and testing, work toward the other end. The following graphic illustrates where along the accuracy-generalization continuum various modelling techniques start.
Figure 1. The methodology spectrum and where common modelling techniques fall
When deciding which techniques to use to build a standard dataset like WealthScapes or execute a custom project, we compare the advantages and disadvantages of techniques focused on accuracy versus those focused on generalization, as shown in Table 1.
Table 1. Advantages and disadvantages of accuracy and generalization
This table uses a couple of technical terms that are important for anyone working with data, methodology and models to understand. Let’s start with correlation versus causation. Correlation is just a statistical metric: a mathematical formula that measures how closely two variables move together. Correlation says nothing about the existence of a real-world relationship between two variables or the nature of that relationship. Causation, on the other hand, explicitly looks at how attributes or phenomena interact.
For example, if we are trying to predict how many jelly beans a worker in an office consumes in a day, we might find that specific variables correlate highly with jelly bean consumption, such as the amount of soda consumed in a day, the distance from the worker’s desk to the bowl of jelly beans and the number of hours the worker spends in the office. In this scenario, it would be easy to conclude that high soda consumption causes high jelly bean consumption.
But that would be an inappropriate conclusion. Jelly bean consumption and soda consumption are linked, but only indirectly. In this case the causal factor driving jelly bean consumption is more likely the worker’s attitude about nutrition; soda consumption is acting as a proxy. If soda were removed from the office environment, it is very possible that jelly bean consumption would increase rather than decrease. In fact, with further tests, we would likely be able to determine that distance to the jelly bean bowl (access) and hours spent in the office (exposure) have a significant and direct causal relationship to jelly bean consumption. The greater a worker’s access and exposure to jelly beans, the more jelly beans the worker will eat. Always remember: Correlation is not causation.
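To make the proxy effect concrete, here is a minimal simulation sketch. Everything in it is an assumption made for illustration: the workers, the coefficients and the hidden “attitude” score are invented, and it models only the soda proxy, not access or exposure. Soda never appears in the formula for jelly beans, yet the two still correlate strongly because both depend on attitude. Comparing only workers with similar attitudes makes most of that correlation disappear.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

def simulate_office(n_workers=2000, seed=0):
    """Return (attitude, soda, beans) lists for a made-up office."""
    rng = random.Random(seed)
    attitude, soda, beans = [], [], []
    for _ in range(n_workers):
        a = rng.random()  # hypothetical score: 0 = very nutrition-conscious, 1 = not at all
        attitude.append(a)
        soda.append(3 * a + rng.gauss(0, 0.5))   # driven by attitude only
        beans.append(20 * a + rng.gauss(0, 3))   # driven by attitude only; soda is not an input
    return attitude, soda, beans

attitude, soda, beans = simulate_office()
r_all = pearson(soda, beans)

# Hold the hidden driver roughly constant: keep only workers with similar attitudes.
similar = [(s, b) for a, s, b in zip(attitude, soda, beans) if 0.45 <= a <= 0.55]
r_similar = pearson([s for s, _ in similar], [b for _, b in similar])

print(f"all workers:            r = {r_all:.2f}")
print(f"similar attitudes only: r = {r_similar:.2f}")
```

In a real project the hidden driver is rarely observed directly, which is why getting at causation takes more than computing correlations.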
The other terms we need to understand when assessing modelling techniques and methodologies are “out of sample,” “out of time” and “over-fitting,” three terms that are interrelated. When we describe a model as over-fit, we are saying that the model is not well generalized. Over-fit models treat random noise as systematic. They perform exceptionally well, meaning there is a small amount of error, when tested against the data used to create the model. But when the model is tested against data that are independent of the data used to build it, referred to as “out of sample” data, the error is significantly bigger. Data that are both out of sample and from a different time period are called “out of time.” By testing a model out of sample and out of time, we protect against over-fitting and get an understanding of how well the model generalizes.
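The gap between in-sample and out-of-sample error can be seen in a small simulation. The sketch below is illustrative only. The data are randomly generated, and the two models are stand-ins chosen for brevity, not techniques described in this post: a straight-line fit, and a “memorizer” that returns the value of the nearest training point. The memorizer treats every bit of noise in the training data as signal, so it looks perfect in sample and worse than the simple line out of sample.

```python
import random

def make_data(n, rng):
    """Hypothetical data: y = 2x plus random noise."""
    xs = [rng.uniform(0, 10) for _ in range(n)]
    ys = [2 * x + rng.gauss(0, 2) for x in xs]
    return xs, ys

def fit_line(xs, ys):
    """Ordinary least squares with one predictor; returns a predict function."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return lambda x: intercept + slope * x

def fit_memorizer(xs, ys):
    """1-nearest-neighbour: predicts the y of the closest training x."""
    pairs = list(zip(xs, ys))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def mse(model, xs, ys):
    """Mean squared error of a model's predictions."""
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(42)
train_x, train_y = make_data(200, rng)
test_x, test_y = make_data(200, rng)  # out of sample: never used for fitting

line = fit_line(train_x, train_y)
memorizer = fit_memorizer(train_x, train_y)

line_in, line_out = mse(line, train_x, train_y), mse(line, test_x, test_y)
memo_in, memo_out = mse(memorizer, train_x, train_y), mse(memorizer, test_x, test_y)

print(f"line:      in-sample {line_in:.2f}, out-of-sample {line_out:.2f}")
print(f"memorizer: in-sample {memo_in:.2f}, out-of-sample {memo_out:.2f}")
```

An out-of-time test follows the same pattern, with the held-out data drawn from a different time period than the training data.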
For example, Figure 2 shows the error obtained when model predictions are compared to the data used to train the model and when they are compared to out-of-sample data. This chart tells us two things: 1) the model is less accurate when applied out of sample and 2) after training step 12, the out-of-sample performance of the model becomes progressively worse; beyond training step 12 the model is clearly over-fit. So the model resulting from training step 12 should be the model used for further analysis or to generate predictions going forward because it has the best out-of-sample performance.
Figure 2. Model over-fitting in comparison of training data error and out-of-sample data error
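The selection rule described above, keeping the model from the training step with the lowest out-of-sample error, can be sketched in a few lines. The error values below are invented for illustration and are not read from Figure 2. They are simply shaped to mirror its description, with training error falling at every step while out-of-sample error bottoms out at step 12. The function name is likewise hypothetical.

```python
def best_training_step(out_of_sample_errors):
    """Return the 1-based training step with the lowest out-of-sample error."""
    best_index = min(range(len(out_of_sample_errors)),
                     key=lambda i: out_of_sample_errors[i])
    return best_index + 1

# Invented error curves for 20 training steps (not the values behind Figure 2).
train_error = [0.90, 0.78, 0.68, 0.60, 0.53, 0.47, 0.42, 0.38, 0.34, 0.31,
               0.28, 0.26, 0.24, 0.22, 0.20, 0.19, 0.18, 0.17, 0.16, 0.15]
oos_error = [0.95, 0.84, 0.75, 0.68, 0.62, 0.57, 0.53, 0.50, 0.48, 0.46,
             0.45, 0.44, 0.45, 0.47, 0.49, 0.52, 0.55, 0.59, 0.63, 0.68]

step = best_training_step(oos_error)
print(f"keep the model from training step {step}")

# Choosing by training error alone would pick the last, most over-fit step.
print(f"lowest training error is at step {best_training_step(train_error)}")
```

The second print shows why the training error cannot be used to make this choice: it keeps improving long after the model has stopped generalizing.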
At EA, we work hard to balance prediction accuracy and model generalization when building our datasets. We test different modelling frameworks for nearly every level of geography and collection of variables we produce. Where the data are available at a high frequency and are reliable, we judiciously apply techniques that are more focused on prediction accuracy. Where data are less frequent and less reliable or we have to predict far into the future, we focus on building models that generalize well and do our best to really get at causation rather than correlation.
When it comes to creating analytical models and datasets, one size does not fit all. Modelling techniques are designed to solve particular types of problems with specific types of data based on a set of assumptions. It is possible to adapt most modelling techniques to a wide range of applications and input data types. But this can only be done within certain limits. When choosing the right methodology, it is very important to be aware of the limitations of different modelling techniques, as well as the limitations imposed by the input data.
In the first two parts of this series, we have seen how input data and methodology have a tremendous impact on data quality, requiring analysis before any analytics project can begin. In the upcoming, final part, we will examine the role that quality control and assurance play in creating quality data.
Sean Howard is Vice President, Demographic Data at Environics Analytics