The Evolution of Questions
Today it’s Big Data; yesterday it was Data Warehouse; the day before it was VLDB (Very Large Database). Making sense of the changes in the ways technology enables us to do business is easier if we have a label for them. Today it’s Big Data. The questions that we are trying to answer, though, are the same:
• Can I access the data I need in real time?
• How much of it can I access?
• Can I aggregate and correlate the data?
• Can I load the data into the database in enough time to make it useful?
• Can I back up and restore the data?
• Will this data flow into the next generation of system?
• If I move to a new technology, what happens to my investment in code in the old technology?
• What are the business drivers for moving to the new technology, and is it worth the investment?
• What business practices are going to be affected?
I’m not going to try to answer all of these questions here.
Twenty years ago, the question was, “Can I afford to pay for that much storage?” I remember justifying the cost of developing in a client-server technology because the storage was $2,000/gig for SCSI (Solaris) drives and $40,000/gig for DASD (IBM mainframe). At 65 gig of storage to start and 120 gig of storage before it was done, the pricing was significant. Today, the question is more along the lines of, “Can I afford not to keep the data?” Then it becomes, “How am I going to store it?” And then it becomes, “Can I use all the data that’s here?” Just because you can buy a couple of terabytes of storage for a couple of hundred bucks at Best Buy doesn’t mean that you can find the piece of information you need in a reasonable period of time. Of course, a couple of terabytes only starts the ball rolling. We have customers with fewer than 10 employees that have 10+ terabytes of data. Records for data warehouses are being broken regularly, and the words go from a few terabytes to hundreds of petabytes and upwards to exabytes.
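To put numbers on why that pricing was significant, here is a minimal sketch of the arithmetic. It uses only the per-gigabyte prices and capacities quoted above; the function and variable names are illustrative, and the result covers raw storage purchase only, not controllers, maintenance, or anything else that would have been on the real invoice.

```python
# Illustrative arithmetic only, using the per-gigabyte prices quoted in the text.
PRICE_PER_GB = {"SCSI": 2_000, "DASD": 40_000}  # US dollars per gigabyte


def storage_cost(gigabytes, price_per_gb):
    """Return the raw purchase cost in dollars for a given capacity."""
    return gigabytes * price_per_gb


for label, gb in (("initial", 65), ("final", 120)):
    for kind, price in PRICE_PER_GB.items():
        print(f"{label:>7} {gb:>3} GB on {kind}: ${storage_cost(gb, price):,}")
```

At those prices the initial 65 gig comes to $130,000 on SCSI against $2,600,000 on DASD, which is the gap that justified the client-server project.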
Then, pundits begin asking, “What could we really do if we could gather all available data in real time?”
The questions become more interesting. How about this one: “If we track all the cars on the road, can we optimize rush hour travel time? How about fuel economy and tire wear? Are safety and national fuel economy sufficient reason to take on this project, or can we identify that an individual car is not performing the way it has all along, and from there interpret a need for a new tire or an oil change? Can we then transmit that to the owner of the car? Or perhaps, sell a maintenance plan that has the repair team at the owner’s house overnight replacing the worn tire while the owner sleeps?” Once our vision begins to encompass big questions, well, then we have the Manhattan Project or the Moon Mission. It’s not necessarily a government project; today, the market supplies the products that answer commercial questions.
The questions move from “vision” questions to “practicality” questions. How would we store information which doesn’t meet a classical “relational” model? How do we get terabytes of data per hour, or per minute, or per second into the database? Then how do we, in real time, access, analyze, and act on petabytes of data? Do we store this on disk or flash? In the case of flash, do we need that much memory? Oh, wait, those are easy questions. How do we aggregate that much information? How do we report it? How many processors can we use simultaneously? How do we reach the drivers in order to reroute them? How do we identify the accidents, road work, and traffic snarl-ups that need to be circumvented?
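A quick back-of-the-envelope sketch shows what “terabytes per hour, or per minute, or per second” means as a sustained load rate. It assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a perfectly even arrival rate; both are simplifications, and the names are illustrative.

```python
# Decimal units are an assumption: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
TB = 10**12
MB = 10**6
SECONDS = {"hour": 3600, "minute": 60, "second": 1}


def sustained_mb_per_second(terabytes, per):
    """Average rate in MB/s needed to ingest `terabytes` every `per`."""
    return terabytes * TB / SECONDS[per] / MB


for per in SECONDS:
    print(f"1 TB per {per}: {sustained_mb_per_second(1, per):,.0f} MB/s")
```

Even the slowest case, one terabyte per hour, works out to roughly 278 MB/s around the clock, before any indexing, replication, or peak-load headroom is considered.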
We’ve offered a page and a half of questions; now it’s time to offer some suggestions and advice. Let’s start with an anecdote. When my son graduated with a degree in biotechnical engineering, he started his interview process at a company that did the work he’d wanted since he was old enough to read. The only opening (jobs were scarce that year) was in post-sales product support (which would be the stepping stone to new product design). The key question he was asked was, “If you were asked to write a test plan for a piece of equipment that seemed to have problems, how would you go about it?” He answered the question with a question: “Can I see the spec?” He was offered the job on the spot; he was the first candidate to ask for a product specification.
If you have never had to do a project that stretches your mind, one that makes you ask not “When can we do this?” or “How much will it cost?” but “Is this even possible?”, then you have not experienced true fun in the IT arena. When this happens, the first thing you do is what you’ve been doing since the beginning of your career: follow standard project management practices. Start with a feasibility study and gather requirements. Then identify whether your existing environment is up to it.
The rest of this magazine will be filled with the new key words: Hadoop, graph databases, array databases, and NoSQL. Forget the terminology. Look at the specs. What is your vision? What do you need to accomplish? What tools are available? Has anybody else done anything similar with these technologies? Remember to budget for training and staffing. Be sure to offer existing staff the ability to cross-train in the new technology (failure to do this causes turnover). Once you identify what looks like a working data set, the new keywords are the same as the old ones: benchmark, test, prove, and stress test. Use the tools you’ve been using, or get new ones, and design new tests. I remember, back in the ’90s, when I was doing a video interview with a vendor and was asked about scale: Can this volume of data be managed? (These were the days when 20 gig was considered a “VLDB” and unmanageable; we had just put up an initial 65 gig on its way to 125 gig.) My response: It’s not “Can you do this?” anymore; the question has become, “How do we do it?”
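As one way to start on the “benchmark, test, prove, and stress test” advice, here is a minimal timing harness. It is a sketch under stated assumptions, not a recommended tool: `benchmark` and `sample_workload` are hypothetical names, and the workload is an in-memory stand-in that you would replace with a real load or query against your candidate platform.

```python
import statistics
import time


def benchmark(workload, sizes, repeats=3):
    """Time `workload(n)` for each n in `sizes`; return {n: median seconds}."""
    results = {}
    for n in sizes:
        timings = []
        for _ in range(repeats):
            start = time.perf_counter()
            workload(n)
            timings.append(time.perf_counter() - start)
        results[n] = statistics.median(timings)
    return results


def sample_workload(n):
    # Hypothetical stand-in: aggregate n synthetic records in memory.
    return sum(i % 7 for i in range(n))


if __name__ == "__main__":
    sizes = [10_000, 100_000, 1_000_000]
    for n, seconds in benchmark(sample_workload, sizes).items():
        print(f"{n:>9,} records: {seconds:.4f} s")
```

Running the same workload at growing sizes is the point: how the timings scale matters more than any single number when the question is whether a technology will hold up at your data volume.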