Which kind of big counts in big data?

Microsoft Research’s Cambridge group wrote a paper with the mocking title “Nobody ever got fired for using Hadoop on a cluster.” In it they found that in 2011, half of the internal analytics jobs on Microsoft’s company cluster used less than 14 GB of input data. (Other studies from Yahoo and Facebook had similar “big data” numbers.) So I guess that means we can start doing big data MapReduce Hadoop jobs on our phones, because they have a lot more than 14 GB of SSD storage. 😉

Number of records or GB of input data are not always a relevant measure of bigness.

Even with modest sample sizes, the size of your computations can grow huge when doing nth-order polynomial regression, doing other typical feature engineering that creates hundreds or thousands of variables to train on, or training just about any deep learning model.
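To make that concrete, here is a minimal back-of-the-envelope sketch (plain Python, not from any of the studies above) of how fast the feature count explodes: the number of polynomial terms of degree at most d over n input variables, including the bias term, is C(n + d, d).

```python
from math import comb

def n_poly_features(n_vars: int, degree: int) -> int:
    # Count of monomials of total degree <= `degree` in `n_vars` variables,
    # including the constant term: C(n_vars + degree, degree).
    return comb(n_vars + degree, degree)

for n in (10, 50, 100):
    for d in (2, 3, 5):
        print(f"{n} vars, degree {d}: {n_poly_features(n, d):,} features")
```

At 100 input variables and degree 5 that is already about 96.6 million features per row, so a “small” 14 GB input can still mean a very big computation.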

Other big computations can come from processing event logs from corporate systems. I’m skeptical that even dedicated time-series databases can make short work of the audit trails in corporate databases when the job is to discover useful process patterns and make predictions. Regardless of the computation time, from what I’ve seen, there is untapped big value in those event logs/audit trails. So if you want to impress your boss, don’t focus on TB or PB. Show them management insights from process mining the data they never knew their systems were already gathering.
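For readers who want to poke at this themselves, here is a toy sketch of the most basic process-mining step: counting which activity directly follows which within each case’s audit trail. The event log and its field names here are hypothetical, invented purely for illustration.

```python
from collections import Counter
from itertools import pairwise  # Python 3.10+

# Hypothetical audit-trail rows: (case_id, timestamp, activity),
# assumed already sorted by timestamp within each case.
event_log = [
    ("PO-1", "2021-03-01", "create_order"),
    ("PO-1", "2021-03-02", "approve"),
    ("PO-1", "2021-03-05", "ship"),
    ("PO-2", "2021-03-01", "create_order"),
    ("PO-2", "2021-03-03", "reject"),
    ("PO-2", "2021-03-04", "create_order"),
    ("PO-2", "2021-03-06", "approve"),
]

# Group activities into one trace per case.
traces: dict[str, list[str]] = {}
for case_id, _ts, activity in event_log:
    traces.setdefault(case_id, []).append(activity)

# Count "directly-follows" pairs -- the building block of most
# process-mining algorithms.
follows = Counter(
    pair for trace in traces.values() for pair in pairwise(trace)
)

for (a, b), count in follows.most_common():
    print(f"{a} -> {b}: {count}")
```

On a real system you would run this over millions of audit rows, and the surprises (loops, reworks, skipped approvals) are exactly the management insights I mean.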
