How much data do you really need?

With the advent of Big Data and the advancement of technology to store it, are we justified in holding large quantities of it? What are the drawbacks and benefits of doing so, and how can organizations find the right balance?


This was more than a decade ago. A large telecom organization was negotiating with its strategic outsourcing partner. The topic: how much data should its data warehouse hold – 3, 4 or 6 months? The challenge was that holding more data meant more expenditure. Even though it also meant more business value, the strategic outsourcing partner won that battle, and the “Usage & Retention” team at the telecom operator had to settle for 3 months! How would this question be settled today?

We have all read forecasts telling us that data volume is growing at an unprecedented pace. We have all also heard about the 3 Vs of data (if not, here they are – variety, velocity, volume) fueling this growth. Digitization continues to be one of the biggest drivers. IoT is another. The list goes on. A variety of application areas continues to fuel the consumption of this data – customer management, product development, marketing spend, risk management, etc. And data stewards continue to argue that giving users unrestrained access to all the data is the best means of gaining value from analysis.

How should organisations face this data deluge? Doesn’t storing more data mean more cost? While hardware costs have been falling, they have not fallen as fast as data has grown. Unless you are using open-source software, your software costs are not going down either. Plus there are people costs and other overheads. Then again, what if you store all the data but don’t make use of it? Conversely, if you don’t store the data, you miss out on those million-dollar insights. So how does one solve this quandary?

Let’s start with the easiest part – what data needs to be stored. The data you keep is directly driven by the use case: if you want to understand customer behaviour, you need to store transactions, profiles, previous purchases, etc. But if you want to do profitability analytics, you will need finance, revenue, and cost data. How much you store is also driven, to a certain extent, by the use case and the industry. Data for statutory reporting (such as Basel II) and fraud analysis will typically require 5-7 years of data, whereas customer cross-sell will require 1-3 years. For most customer analytics, Financial Services will typically store 3 years, Retail will store 2-3 years, whereas Telecom will look at less than a year’s data.
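To make the "use case drives retention" idea concrete, here is a minimal sketch in Python. It assumes that when one data set serves several use cases, the longest requirement wins. The dictionary keys and the function name `required_retention_years` are hypothetical, chosen only for illustration; the year ranges are the ones quoted above.

```python
# Retention ranges in years, taken from the figures quoted in the text.
# The use-case names are illustrative labels, not a standard taxonomy.
RETENTION_YEARS = {
    "statutory_reporting": (5, 7),
    "fraud_analysis": (5, 7),
    "customer_cross_sell": (1, 3),
}


def required_retention_years(use_cases):
    """Return the (low, high) years of history needed to serve every use case.

    Assumption: a shared data set must satisfy its most demanding consumer,
    so we take the maximum of the lower bounds and of the upper bounds.
    """
    ranges = [RETENTION_YEARS[name] for name in use_cases]
    low = max(r[0] for r in ranges)
    high = max(r[1] for r in ranges)
    return low, high


if __name__ == "__main__":
    print(required_retention_years(["customer_cross_sell", "fraud_analysis"]))
```

In this sketch, a data set shared by cross-sell and fraud analysis ends up with the 5-7 year requirement, which is why it pays to list the use cases before sizing the storage.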

The next task is to determine whether you need to store unstructured data, e.g. web logs, chats, text, social media, voice, etc. If your industry is going digital or you are focussing more on online channels, expect the answer to be yes.

Finally, bear in mind that not all data is created equal. Some data is accessed more routinely than the rest. What portion of your data is expected to be accessed very frequently, what portion irregularly, and what portion only once in a while? In the Telecom industry, CDRs (call data records) are the lifeline of the business. However, their usage can be starkly different. While the Usage & Retention teams will ask for the most recent 90 days of CDR data at high velocity for multiple types of customer analysis, the statutory requirements team will need to store 5+ years of data but access it infrequently, whereas the revenue management team may want 1-3 years of data and keep churning it moderately. Each industry will have its own versions of this example.
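The CDR example can be sketched as a simple age-based tiering rule. This is only an illustration under stated assumptions: the tier names "hot", "warm" and "cold" and the function `storage_tier` are hypothetical, and the thresholds simply reuse the 90-day and 3-year figures from the example, with anything older treated as rarely accessed statutory data.

```python
from datetime import date

# Thresholds borrowed from the telecom example in the text (assumed cut-offs).
HOT_DAYS = 90         # Usage & Retention: most recent 90 days, heavy access
WARM_DAYS = 3 * 365   # revenue management: up to about 3 years, moderate access


def storage_tier(record_date, today):
    """Classify a record as 'hot', 'warm' or 'cold' based on its age in days."""
    age_days = (today - record_date).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    # Older data is kept mainly for statutory needs and accessed infrequently.
    return "cold"


if __name__ == "__main__":
    today = date(2024, 2, 1)
    for d in (date(2024, 1, 1), date(2022, 1, 1), date(2018, 1, 1)):
        print(d, storage_tier(d, today))
```

A real environment would likely tier on observed access patterns rather than age alone; age is used here only because it is the dimension the example describes.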

Based on the answers to the above three questions, you can choose among the following options to host your data environment: general-purpose transactional DBMSs (Oracle, MS SQL) along with specialized analytics DBMSs (Teradata, IBM Netezza), and/or open-source platforms (NoSQL DBMSs, the Hadoop file system). If yours is an organization moving towards digital transformation, you will more likely than not end up with a loosely integrated mix of all of these as your data environment, called the data lake. Welcome to the new data foundation to support the Digital Organization.

