Monday 31 March 2014

Big Data - in the process of growing up

A very good article was published in the FT online a few days ago.  Its title is 'Big Data: are we making a big mistake?', and it's a commendably well thought out discussion of some of the challenges of Big Data.  And perhaps a bit of a warning to not get too carried away :-)

You should go and read the full article, because it's got loads of great stuff in it.  A few of the concepts that leapt out at me in particular were the following.

One of the interesting things about modeling data in order to make predictions (as opposed to explaining the data) is that features that are correlated with (but not causal for) the outcome of interest are still useful.  But the FT article makes the really good point that even in this case, correlated features can be more fragile than genuinely causal ones.  This is because while a cause is likely to remain a cause, correlations can more easily change over time (covariate drift).  This doesn't mean we can't use correlation as a way to inform predictions, but it does mean that we need to be much more careful, and keep checking whether the correlations we rely on still hold.
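To make the fragility point concrete, here is a small simulation sketch of my own (it is not from the FT article).  Everything in it is an assumption chosen for illustration: the outcome depends on a hidden cause with a fixed strength, while a hypothetical proxy feature tracks that cause strongly in the first period and only weakly in the second.  The function names and all the numbers are made up.

```python
import random
import statistics


def make_period(n, proxy_strength, rng):
    """Generate (cause, proxy, outcome) rows for one time period."""
    rows = []
    for _ in range(n):
        cause = rng.gauss(0.0, 1.0)
        # The proxy follows the cause only as strongly as proxy_strength allows.
        proxy = proxy_strength * cause + rng.gauss(0.0, 0.5)
        # The causal link to the outcome is the same in every period.
        outcome = 2.0 * cause + rng.gauss(0.0, 0.5)
        rows.append((cause, proxy, outcome))
    return rows


def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for a single feature."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, my - slope * mx


def mse(xs, ys, slope, intercept):
    """Mean squared prediction error of the fitted line."""
    return statistics.fmean(
        (y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)
    )


def drift_experiment(seed=0, n=5000):
    """Fit on the 'before' period, then score on both periods.

    Returns {feature_name: (mse_before, mse_after)}.
    """
    rng = random.Random(seed)
    before = make_period(n, proxy_strength=1.0, rng=rng)
    after = make_period(n, proxy_strength=0.2, rng=rng)  # the correlation weakens
    results = {}
    for name, col in (("cause", 0), ("proxy", 1)):
        slope, intercept = fit_line(
            [r[col] for r in before], [r[2] for r in before]
        )
        results[name] = tuple(
            mse([r[col] for r in rows], [r[2] for r in rows], slope, intercept)
            for rows in (before, after)
        )
    return results


if __name__ == "__main__":
    for name, (err_before, err_after) in drift_experiment().items():
        print(f"{name}: error before = {err_before:.2f}, after = {err_after:.2f}")
```

In this toy setup the model built on the causal feature predicts about equally well in both periods, while the model built on the proxy gets much worse once the correlation weakens, even though nothing about the underlying cause has changed.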

The article also discusses sample variance and sample bias.  This is in many ways the crux of the matter for Big Data.  In principle, very large data sets offer us the chance to drive sample variance towards zero.  But they really have much less to offer in terms of sample bias, and indeed (as the article points out) many of the largest data sets, because of the way they're generated, are actually very vulnerable to high levels of sample bias.  This is not to say that one can't address both (the large particle physics and astrophysics data sets are great examples where both sample variance and sample bias are taken very seriously), but it is a warning that just because your data set is huge, it doesn't mean that it is free from sample bias.  Far from it.
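Here is a quick sketch of that difference, again my own illustration rather than anything from the article.  It assumes a population that is standard normal (so the true mean is 0) and a hypothetical selection mechanism that only ever records values above -0.5.  The function name, the cutoff and the sample sizes are all arbitrary choices.

```python
import random
import statistics


def simulate(n, n_trials=100, cutoff=-0.5, seed=0):
    """Repeat a size-n survey n_trials times.

    Returns (spread of the unbiased sample means,
             average of the biased sample means).
    The true population mean is 0.
    """
    rng = random.Random(seed)
    unbiased_means, biased_means = [], []
    for _ in range(n_trials):
        draws = [rng.gauss(0.0, 1.0) for _ in range(n)]
        unbiased_means.append(statistics.fmean(draws))
        # Biased collection: values at or below the cutoff never get recorded.
        kept = [x for x in draws if x > cutoff]
        biased_means.append(statistics.fmean(kept))
    return statistics.pstdev(unbiased_means), statistics.fmean(biased_means)


if __name__ == "__main__":
    for n in (100, 2500):
        spread, biased_mean = simulate(n)
        print(f"n = {n}: spread of unbiased means = {spread:.3f}, "
              f"average biased mean = {biased_mean:.3f}")
```

Making the sample 25 times larger shrinks the scatter of the unbiased estimate by roughly a factor of five (the usual 1/sqrt(n) behaviour), but the biased estimate stays stuck at about 0.5 instead of 0 however much data is collected.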

I've felt for a while that 'Big Data' (and data science) are currently going through a rapid phase of growing up, which I suppose is pretty inevitable because they're in many ways new disciplines.  They've gotten up to speed on the algorithmic/computer science side of things very rapidly, but are still very much in the process of learning many of the lessons that are well-known to the statistics community.  I think this is just a transition phase (these things take a bit of time), but it seems clear that a big part of the maturation of Big Data/data science lies in getting fully up to speed on the knowledge and skills of statistics.
