
“It’s a thin line between love and hate…”
– The Persuaders
Happy Valentine’s Day!
Let me use this occasion to confess that I am conflicted: I both love and hate the term ‘Big Data’.
I detest ‘Big Data’ for the simple reason that it has become so pervasive and so overused as to be rendered meaningless (perhaps even more so than ‘cloud computing’).
Some companies believe that if they simply sprinkle some ‘Big Data pixie dust’ on their marketing material, the skies will open, and they will be showered with new customers and new revenue.
Instead, market confusion reigns and overall vendor credibility gets a little more diluted.
So why do I love ‘Big Data’?
Like it or not, there is no better term to describe the complexities associated with managing and leveraging today’s data. So even though I wince a little every time I use the term, without it I struggle to clearly and succinctly communicate the set of challenges and opportunities that new companies and technologies are seeking to address.
So what is ‘Big Data’?
While definitions vary, I prefer Wikipedia’s. It is not that previous technologies and solutions are incapable of handling today’s data; the problem is that managing and exploiting data with them has become costly and awkward (requiring what I deem ‘unnatural acts’ and herculean effort on the part of IT and operations).
Regardless of your definition, the sooner we accept ‘Big Data’ as fact, the more we can focus on identifying solutions and opportunities and less on the term itself.
How do you feel about ‘Big Data’?
One piece of my definition of Big Data is “too big to move.” For data profiling, some kinds of data cleansing, and much of the aggregation done for analytics, the idea of actually moving the data to a different system, processing it en masse, and then somehow moving the result back is simply unworkable. That’s the “herculean effort” that caught my attention when I was trying to articulate the advantages of “ETL vs. ELT” (you can guess which side Oracle was on in that debate). I think the consensus is that the ETL engine is dying off in the age of Big Data: partly because if you have to move the data and crunch it in batch, you’re probably doing it in Hadoop; and partly because if you can get away with not moving it (say, it’s in an MPP columnar relational database and you want to do some aggregation), you can probably crunch it mostly in place and, at worst, move the much smaller results.
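To make “crunch it mostly in place” concrete, here is a minimal sketch contrasting the two approaches. It uses SQLite and a made-up sales table purely for illustration; at toy scale the difference is invisible, but only the second query keeps the data movement proportional to the result rather than to the data.

```python
import sqlite3

# A toy "warehouse": the table name and columns are made up for
# illustration; the point is where the aggregation happens.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 100.0), ("east", 250.0), ("west", 75.0), ("west", 300.0)],
)

# ETL-style: ship every row to the application tier, then aggregate
# there. The volume moved grows with the size of the data.
rows = conn.execute("SELECT region, amount FROM sales").fetchall()
totals = {}
for region, amount in rows:
    totals[region] = totals.get(region, 0.0) + amount

# ELT-style: push the aggregation down to the engine and move only
# the much smaller result set (one row per region).
pushed_down = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"
).fetchall()

assert totals == dict(pushed_down)
print(pushed_down)  # e.g. [('east', 350.0), ('west', 375.0)]
```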
Thanks for the comment. Data inertia is definitely an issue. I see two potential scenarios here: 1) ELT, as you mentioned; and/or 2) discrete ETL based on the specific task at hand. For the latter, universal indexing and search capabilities will be key to understanding what data is available and useful to extract for a given job.
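As a rough illustration of that second scenario, a task-specific job might first search a metadata index and extract only from the sources that match; everything here (catalog structure, tags, locations) is hypothetical rather than any particular product’s API.

```python
# Hypothetical metadata catalog; names, tags, and locations are
# invented for illustration.
CATALOG = [
    {"name": "orders_2013",  "tags": ["sales", "transactions"], "location": "hdfs://warehouse/orders"},
    {"name": "web_clicks",   "tags": ["clickstream", "marketing"], "location": "hdfs://logs/clicks"},
    {"name": "crm_accounts", "tags": ["sales", "customers"], "location": "jdbc://crm/accounts"},
]

def find_sources(keyword):
    """Return locations of datasets whose tags match the task at hand."""
    return [d["location"] for d in CATALOG if keyword in d["tags"]]

# A sales-focused job extracts from just two sources and leaves
# the clickstream data untouched.
print(find_sources("sales"))  # ['hdfs://warehouse/orders', 'jdbc://crm/accounts']
```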