So what about BigData?
Listening to experts, we learn about the importance of the V's (*); talking with business end users, we will likely just mention data size: a multitude of records, sources and database volumes. There are an awful lot of interpretations of what BigData brings to the table, and why we should care. Regardless of your definition of BigData, once you have come up with your BigData case, you will need to figure out how to capture and handle results.
In this blog, I would like to offer some cautions. Some pitfalls can be quite easily avoided. Some are harder, but none are new to us, really. The very mistakes we frequently see customers make in traditional data handling, reporting, warehousing and Data Warehouse Automation also lurk here. So maybe Data Warehouse Automation principles, and lessons learned in the traditional data disciplines, could also be applied to BigData. That is what I propose to investigate.
To make it a little more tangible, I needed to select a BigData illustration case. TimeXtender is not a bank or an insurance company, and we are not a big brand name, so topics such as fraud detection for banking, or social analytics and sentiment analysis for major brands, would be far less pertinent (we wish we were, but we are not there yet).
However, as we are currently investigating Marketing Automation, which has BigData potential, I propose to use it as a guiding example. Some of the things we have learned:
Do and Don't - part 1
When a visitor arrives on our website (not unlike yourself), before they even click a single button, there are many things we can learn. Not to worry: it would be hard for me to figure out exactly who you are, but I can, to some degree, know where you came from, what your interests may be based on the pages you browsed before arriving here, and what you have noticed so far, based on the path you took before reading this post.
In fact, your presence on our site may be characterized by over one hundred BigData attributes: landing IP address, trail, timing, time of day, path, ... to name but a few. Capturing all of these BigData attributes would cost very little effort, so which ones should I try to capture today?
"Well, why don't we capture everything then - at least, for now? Who knows what we might be able to do with it later down the line?"
The capture-all reflex is the very one we battle when customers ask us which data sets to upload into a Data Warehouse. As soon as you have made a connection to the data, you could easily copy all attributes and records, because, who knows, some hidden treasure may reside therein. Classic mistake: the fact that copying everything (at least for now, right?) is easy to do does not mean we should. We know from some thirty years of Data Warehouse domain experience that this will cause trouble. "Hey, why is this process running slow again?" Well, because back when the data set was small and interdependencies far less complex, we decided to capture everything.
Capturing everything you could ever know today about a given BigData record will likely yield similar issues. Ok, so how do we Do this?
Just like in a modern Data Warehouse Automation set-up, maybe we need to think in increments. Start small, think big.
What do we really want to know about the website visitor? When would it make sense to interact? When should we pop up a question? And so on ...
Yes, that will mean that we will likely miss some initial attributes, but consider the potential downside:
do we really care which browser and browser version brought you to our site? (**)
I think I would rather know whether you are just looking or genuinely interested. I would rather know whether you already have a SQL Server Data Warehouse, and whether you may be ready to take the next step. I would indeed like to know where you are located, but only so we can figure out which contact person is best suited to help (not to categorize unique page hits or suchlike). I would rather know whether a particular customer testimonial interests you, to try to figure out which topic may be relevant. And I would like to know whether the help you may be looking for is ours to offer.
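The increment-first idea above can be sketched in a few lines. This is a hypothetical illustration, not real tracking code: the attribute names and the raw event shape are invented for the example. The point is that the capture step keeps only the attributes that answer a question we are asking today, and deliberately drops the rest.

```python
# Hypothetical sketch of increment-first capture: keep only the attributes
# we currently have a business question for. Attribute names are invented.
WANTED_ATTRIBUTES = {"referrer", "country", "pages_visited", "testimonial_viewed"}

def capture_visit(raw_event: dict) -> dict:
    """Keep only the attributes that answer a question we ask today.

    Everything else is deliberately dropped. If a new question arises
    next month, we add one name to WANTED_ATTRIBUTES, not a new pipeline.
    """
    return {k: v for k, v in raw_event.items() if k in WANTED_ATTRIBUTES}

raw = {
    "ip": "203.0.113.7",          # captured by the web server, but not wanted
    "browser": "Mozilla/5.0",     # see (**): rarely worth storing
    "referrer": "search",
    "country": "BE",
    "pages_visited": 4,
    "testimonial_viewed": True,
}
visit = capture_visit(raw)
# 'ip' and 'browser' never enter the warehouse
```

Adding an attribute later is a one-line change, which is exactly the flexibility the capture-all approach trades away for volume.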
Some of these things will not be tied to BigData at all, so we will need to interact. I would be interested in using Big Metadata to pop up my next question only when you are showing behaviour that leads me to believe you are likely to respond well. "Wanna buy some Automation?" will not be the best way forward. BigData in this context may supplement other things I know, but in order to interact with you, I may need to fall back on other, proven approaches.
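A behaviour-triggered interaction could look something like the sketch below. The signals, weights and threshold are all invented for illustration; a real marketing automation tool would tune these against actual response data.

```python
# Hypothetical sketch: only interrupt a visitor once their behaviour
# suggests genuine interest. Weights and threshold are invented, not
# a proven scoring model.
def should_ask_question(visit: dict, threshold: int = 3) -> bool:
    score = 0
    if visit.get("pages_visited", 0) >= 3:   # browsing beyond the landing page
        score += 1
    if visit.get("testimonial_viewed"):      # strong interest signal
        score += 2
    if visit.get("seconds_on_page", 0) > 60: # actually reading, not bouncing
        score += 1
    return score >= threshold

# A quick bounce does not trigger a pop-up; a longer, focused visit does.
print(should_ask_question({"pages_visited": 1}))
print(should_ask_question({"pages_visited": 4,
                           "testimonial_viewed": True,
                           "seconds_on_page": 90}))
```

The design choice here is to keep the trigger rule explicit and reviewable, so marketing (not just IT) can reason about when a visitor gets interrupted.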
(just in case: if you are looking to automate your SQL Server based Warehouse, drop me a line at firstname.lastname@example.org and we could, maybe, even TALK about it).
Do and Don't - part 2
Once we have captured our BigData set, and identified interesting attributes, how are we going to match records with other data? The way some of the few BigData cases are being described, a lot of the data is an uncontrollable, volatile (another V*) stream of bits and bytes. How can we structure these potentially massive flows and zillions of attributes into a reusable format?
How will I find the right BigData sources? How will I manage them? What about infrastructure? And once we have succeeded in setting it all up, how will we be able to consistently match findings with interested users? As we learn more and progress further, how will we share our work, to allow others to expand on it?
Ok, so today I happen to be spending some time and research on the topic, and I may have come up with a good starter kit, but what if I need to come back next week, next month or next year to refine what we have done to date? How will this get tracked, authored and documented? How will we keep track of history and change? What if we need to connect more users to our data sets and findings?
Today, as the field is so new to everyone, wouldn't we be better off just doing some trial and error, and only worrying about structure further down the road? Well, of course! Absolutely. But don't jump yet.
In fact, to answer this question from the BigData (Marketing Automation) perspective, we could refer to a similar question from the traditional analytics domain: do we need to wait until we have a full-blown Data Warehouse setup, or even invest in Data Warehouse Automation, before we start reporting? Of course not.
What you would likely Do is set up whatever reporting system you see fit, maybe using Data Discovery methods or just hard-core Excel savvy, to get results right now. Getting results fast can be more important than getting them the right way. We do know that, over time, structure will be required. From ad-hoc discovery and reporting, most of us take the step to the Data Warehouse, for obvious reasons (***). From there on, Automation may be a logical next step. The opposite is equally true: TimeXtender customers are people who have invested in Data Warehouse Automation. But they already had a Data Warehouse when they did, and most of them had invested in ad-hoc reporting before that.
Why not apply the same method for handling the BigData case?
Waiting to build structure for a purpose we do not yet know would be a waste of time. Purchasing a predefined structure capable of handling all potential BigData ins and outs would be madness. At least, I do not see how that could work; this is also why I frown when people try to deploy standard Data Marts that attempt to predict, capture and produce all possible KPIs in a given context. The reason a standard system, or a set of standard reports, is doomed to fail is that there is no such thing as a "standard business question", in my experience.
Start today, and worry about structure later. When implementing structure, go easy: structure should not kill time-to-result, and it should allow for flexibility as we expand our (BigData) learning.
Draw benefit from your initial work, and build structure around it, not the other way round.
Do and Don't - part 3
Back to our Marketing Automation case. I now understand that, in order to really work with BigData in the marketing automation context, you need to weigh a lot of options. After getting started, I will likely need to implement structure and maybe think about automation. I need to integrate technical components, figure out how to attract a crowd, produce interesting content, maybe invest in tooling, think about which BigData attributes I need to work with, and consult with data specialists, marketing firms and content syndicators.
Sounds like we may need to apply for a Business loan to fund the entire activity.
Maybe we'd be better off waiting?
Maybe so. You don't want to jump ahead of things, but on the other hand, you do not want to be late to this game.
I think it all comes down to the environment you operate in. Some environments have loads of money to spend on things like real estate and huge ICT systems; most of us have to be a little more creative. But all of us, with whatever past investments we made, found a way to get started.
The fact that you have been reading this far leads me to believe you may have a sincere BigData interest. How do you usually approach new domains of interest? How did you get started with your other analytics projects and cases? Did you start out on your own, or form a committee?
Getting started is the thing to Do.
Start by working with what you already have. Ask the marketing department which stats they already track before investing in mass-volume tracking. Talk to peers and colleagues, visit some more websites, and search for blogs on the topic. Look at what competitors may be doing.
Rest assured, with all the BigData hype going on today, an event is likely happening in the coming days or weeks, somewhere nearby.
Maybe you want to check that one out first?
Footnotes and credits
(*) Velocity, Variety, Volume, Virtualization and some more V's - see this blog post by TimeXtender partner element61, by courtesy of its author Kristof Gobbens
(**) There may be technical reasons why our website does not work really well on your browser or platform - in that case, please email me at email@example.com
(***) According to the Kimball Institute, a Data Warehouse is "a copy of transaction data specifically structured for query and analysis", which allows you (among other things) to:
- maintain a single common data model
- integrate data coming from different sources
- maintain data history
- improve data quality
- manage an independent security model