Do you remember back in 1999 when everyone was talking about the Y2K systems crash that was going to take place when the clocks struck midnight on December 31? It seems funny now, but some of us were busy until the very last minute ensuring that every system, database, and application was ready to handle two more digits on every single date stored, processed, and transmitted. What lack of vision led software developers and architects to do something as silly as not storing dates with the full year?
Back in the days when computer memory and disks were expensive, software developers kept data to a minimum, including trimming unnecessary digits from dates. Every byte was expensive to store and networks were slow to send and receive data, so economies had to be made.
Today, things are much different. We can all get free accounts on Google Drive or Dropbox with gigabytes of storage. We carry around USB sticks and portable disks with thousands of files, photos, and videos. Our digital cameras let us capture every single moment, only for it to be forgotten in some folder, never to be seen again, because nobody has time to flip through thousands of photos and videos of our kids from age 0 to 15.
It is a different world: storage is cheap, cloud subscriptions let us store gigabytes of data, and just about anything else, anywhere, and we exploit it to the maximum with little regard for the consequences. A lazy yet somewhat cautious attitude has taken over, where we decide to store everything just in case and sort it out later. The pendulum has swung from spending lots of time deciding what is really worth keeping to spending lots of time looking for stuff.
From Data Warehouse to a Data Lake
The data warehouse approach was, and is, an understanding that there are key metrics a company needs in order to manage its business, to analyze its actions, campaigns, investments, and overall activities, and to understand their results. Building one is typically a painstaking process: going around to all stakeholders asking for the metrics and indicators they currently use or need to manage their business unit, deriving the processes and data flows, and ultimately implementing them in a highly orchestrated technology environment.
Data Lakes are the counterpart of the above: we throw all of our data into... well, a “lake” and then go “fishing” for it. Sometimes they bite, other times we get nothing. If we get nothing, it could be because we didn’t really know how or what to look for, or potentially a technical issue. If we get something, it could be exactly what we needed, or it could be a side effect, an alternate fact, or an incorrectly parameterized view. Think of it as combing through your folders of hundreds or thousands of family photos taken over the past five years, looking for the specific one where your Aunt Muriel celebrated her 80th birthday. You had better be a highly organized person whose folder naming, photo naming, and categorization are a strength; otherwise you will be doing a lot of previewing before you find it.
There must be a better way, a more common-sense way of dealing with data than simply hoarding it. Hoarding makes for poor performance, costly systems, and a degree of complexity that only a small percentage of companies and institutions really need.
Where Big Data is Key
Big Data, Data Lakes, and NoSQL have a place in the technology stack and in certain sectors or departments: scientific research, high volumes of small transactions such as telecommunications and gaming metrics, and web and log activity. The emerging field of data science has sprung from the need to extrapolate from the unknown, to detect patterns in the masses, and from the understanding that we need to know what we don’t yet know.
Systems are developed with an initial purpose, but that purpose can drift when users take unexpected actions or enter data that was not anticipated. These patterns need to be recognized and fixed, taken advantage of, or at least understood and measured. Take, for example, a simple game like Clash Royale, a mobile game with millions of users playing it every second. The game patterns are not always repetitive; the level of each character and card is planned beforehand, but the number of combinations is impressive. Analyzing game usage is critical not only for improving the game itself but also for monetization. Air travel is another perfect example, where millions of travelers go through airports each and every day and manage to do so within a more or less reasonable amount of time. The logistics behind it are fascinating.
This is where Big Data shines, and putting everything into one large bucket to sort out later makes sense, especially if you have tools like R and SAS to perform statistical analysis, programming languages such as Python to write custom routines to find things, and Hadoop platforms to process the data in bulk as quickly as possible.
But for the average business, the remaining 95% of businesses, do you really need a Hadoop cluster, a NoSQL database such as MongoDB, or a team of Python developers and data scientists on staff? Or perhaps you just need to do some thinking about what is really important to measure, focus your efforts on a set of data selected intelligently to manage your business, and then select a tool that can easily manage that data and expand or contract as needed.
How Big is Big Data?
The truth is that a large portion of companies, big and small, have very little data, and more importantly, their data is highly structured. We are talking about invoices, purchase orders, support tickets, sales calls, customer records, and inventory lists. It is not rocket science! Most data sets are not very large, averaging around 5 GB. As an example, an email account with over 10 years of emails and attachments, most of it spam, is still under 20 GB.
By being smarter about how you do things, you can truly increase efficiency while at the same time not restricting yourself in the data you select to bring to your analysis playground, rather than bringing everything “just in case”. Smart Data means that you take your data through a series of steps, each step reducing the amount of data and increasing its quality.
It starts with transactional data, which is typically voluminous and without context. This is where most Big Data shops focus and what they aim to tackle directly. But by applying some grouping and basic, efficient queries, you can reduce the volume considerably without losing much of the information it contains. Think of it as a compressed file or picture: the unwanted bytes are removed while keeping everything needed to restore the file to its original form.
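As a minimal sketch of that reduction step, assume the transactional data has already been extracted into a CSV of order lines with hypothetical columns order_id, customer_id, product, order_date, and amount. A simple grouping in Python (one of the languages mentioned above) collapses millions of detail rows into a small monthly summary:

```python
import pandas as pd

# Hypothetical raw transactional extract: one row per order line.
transactions = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Collapse the detail rows into monthly totals per product. The row count
# drops dramatically, but the figures most analysis needs (revenue, order
# volume, customers reached) are preserved.
monthly = (
    transactions
    .assign(month=transactions["order_date"].dt.to_period("M").astype(str))
    .groupby(["month", "product"], as_index=False)
    .agg(
        revenue=("amount", "sum"),
        orders=("order_id", "nunique"),
        customers=("customer_id", "nunique"),
    )
)
print(monthly.head())
```

The aggregated table is orders of magnitude smaller than the raw extract, yet it still carries the revenue, volume, and customer-reach figures that most day-to-day analysis needs.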
Analytical data must have all the basic elements to support “what if” analysis and regression analysis; it serves as the basis for things such as drill-downs, pivot tables, and scatter plots: data that enables the discovery of information that eventually leads us to make decisions. Decisive data are subsets, the metrics that are clear performance indicators of an activity, an investment, or a process.
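Continuing the hypothetical sketch above (it reuses the monthly summary built there), the analytical layer can feed a pivot-style view for exploration, and the decisive layer boils down to a single indicator a manager can act on:

```python
# Analytical layer: a pivot view for drill-downs and "what if" comparisons,
# with months as rows and products as columns.
pivot = monthly.pivot_table(
    index="month", columns="product", values="revenue", aggfunc="sum"
)

# Decisive layer: one clear performance indicator, for example
# month-over-month revenue growth across all products.
total_by_month = monthly.groupby("month")["revenue"].sum().sort_index()
mom_growth = total_by_month.pct_change().iloc[-1]

print(pivot.tail())
print(f"Revenue growth vs. previous month: {mom_growth:.1%}")
```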
The ability to drill all the way back to the transactional data is important, and more important still is the ability to change the underlying layers quickly when the data structure changes.
But all of the above needs to be done intelligently, not just thrown into a big bucket in the hope of sorting things out later. It does not have to mean less data or missing metrics. It just means that it may take some more time to think about what you are trying to get out of your data that will truly help you guide your business. But once you have done that, and if it is done right, the flexibility to adapt to market or process changes will still be there.
Before you think about Big Data, think about what data you really need to make decisions. Think Smart Data.