Data mining or data dredging?
Posted on July 28, 2015 by Katherine Stagg
Data mining is the process by which large-scale data sets are examined in order to find previously unknown links between different variables. It has developed as a research tool in response to the increasing power and capabilities of computers and technology over the last few decades, which allow far larger amounts of data to be simultaneously handled than was previously possible. Instead of formulating a hypothesis from observations and then collecting data to see if the hypothesis is true or not, as in conventional research, data mining uses data that has already been collected and analyses it to see if links between different variables can be found. Hypotheses about why these associations exist can then be formed.
Data mining initially appears to be quite similar to normal statistics. Fundamentally it’s just examining data to see if there’s a correlation between two different variables. How it differs from conventional statistics is in its scale and in the type of data it can handle. Ideally, the data examined in data mining should encompass a whole population, rather than just a representative sub-sample. This data may be numerical but does not have to be. For example, it could be images, such as CTs or MRIs, with data being examined to find links between different radiological features and diseases.
For an example of how data mining techniques can be used to study the health of populations, see the paper by Aljumah and colleagues on diabetes in Saudi Arabia (reference 2 below). By using pre-collected data about non-communicable diseases, the researchers were able to analyse the effect of different diabetes treatments in different age groups and create models that allowed predictions about a treatment's efficacy to be made.
Data mining is a brilliant tool for research but, like most things, it can be misused. Data dredging is the abuse of data mining, in which the same data set is examined again and again until something turns up. If enough different variables are looked at, some will show correlations that occur solely by chance rather than representing a true relationship. For instance, at the conventional 5% significance level, roughly one in twenty tests of unrelated variables will appear significant by chance alone. The more times one data set is examined, the more likely it is that a false positive result will be produced. Another way to manipulate data mining is to alter the data set being examined, for example by selecting a sub-population. If selection bias is introduced when choosing the sub-population, data that previously showed no correlation can be made to suggest a positive result.
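To make the false positive problem concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a method taken from any of the papers referenced below. It generates an "outcome" and many variables that are pure random noise, then counts how many of them appear to correlate with the outcome. The function names, the sample size of 50, the 200 variables and the cut-off of 0.279 (approximately the correlation needed for p < 0.05 with 50 observations) are all choices made for this example.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def count_chance_findings(n_people=50, n_variables=200, critical_r=0.279, seed=0):
    """Count noise variables that look 'significantly' correlated with the outcome.

    Assumption for this sketch: |r| > 0.279 is roughly the two-sided
    p < 0.05 cut-off for 50 observations. Every variable is random noise,
    so every 'finding' counted here is a false positive.
    """
    rng = random.Random(seed)
    outcome = [rng.gauss(0, 1) for _ in range(n_people)]
    hits = 0
    for _ in range(n_variables):
        unrelated = [rng.gauss(0, 1) for _ in range(n_people)]
        if abs(pearson_r(outcome, unrelated)) > critical_r:
            hits += 1
    return hits


def family_wise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across n independent tests."""
    return 1 - (1 - alpha) ** n_tests


if __name__ == "__main__":
    print("Chance 'findings' among 200 noise variables:", count_chance_findings())
    for n_tests in (1, 20, 100):
        rate = family_wise_error_rate(n_tests)
        print(f"{n_tests} tests: {rate:.0%} chance of at least one false positive")
```

Under the simplifying assumption that the tests are independent, 20 tests at the 5% level already carry about a 64% chance of at least one false positive. This is why a data set that is examined repeatedly needs either a correction for multiple comparisons or confirmation in fresh data.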
Data mining allows us to examine data on a bigger scale than is possible with conventional statistics and can reveal relationships between different pieces of data that would otherwise go unrecognised. Most data dredging is unintentional and stems from misunderstandings about how data mining should be applied, rather than from a malicious hunt for evidence that doesn’t really exist. With proper caution, data mining can generate new, exciting results that might not otherwise have been produced.
1. Data Mining in Healthcare and Biomedicine: A Survey of the Literature, Illhoi Y, J Med Syst (2012) 36:2431–2448
2. Application of data mining: Diabetes health care in young and old patients, A.A. Aljumah, M. Gulam Ahamad, M. Khubeb Siddiqui, Journal of King Saud University – Computer and Information Sciences, Volume 25, Issue 2, July 2013, Pages 127–136
3. Data dredging, bias, or confounding, GD Smith and S Ebrahim, BMJ. 2002 Dec 21; 325(7378): 1437–1438
4. Siddharth Kalla (Oct 16, 2010). Data Dredging. Retrieved Jul 27, 2015 from Explorable.com: https://explorable.com/data-dredging
Data mining or data dredging? by Katherine Stagg is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License. All images used within the blog are not available for reuse or republication as they are purchased for Students 4 Best Evidence from shutterstock.com.