Business users more often than not judge their tools by their looks. Sure, a nice, appealing, user-friendly interface is great to have, but when it comes to analytics, the user interface plays only a minor part in the complete solution. Analytics and decision making are based on data, which is usually stored in one or more databases. The magic happens below the surface, and users only see the tip of the iceberg (the user interface). Hidden in the dark, most of the “dirty work” takes place: data preparation, data coordination, storage, and architecture, all of which lay the groundwork for analytical solutions to perform at their best. As a rule of thumb, I would say seven out of eight things that need to happen before raw data yields useful insights are never seen by business users, nor do they need to be.
Today I want to shine a light on something we rarely talk about, yet which is a world of its own and an integral part of any analytical solution: the database. The basic requirements an analytical database must satisfy are integrity, accuracy, availability, and performance. I will mainly address another one: capacity.
Mankind has forever been trying to properly store and use its data. The first system on record is the 30,000-year-old Ishango bone, most probably used to count days, domestic animals, or warriors. Data was carved into the bone with a sharp rock, in the form of tally marks. This prehistoric system is evidently so good that it is still used today: in pubs and breweries, at card games, and everywhere else.
Of course, modern methods of storing data are entirely different. Our data is digitized, and the very principles of storing and accessing it are unknown to most users, who are not even interested in them. Users only care about the experience: the data they need should be accurate and available immediately, all the time.
And since the experience of business users working with analytical solutions is so often lacking, they are becoming more interested in what hides behind the brilliant graphics, presentations, and visualizations. Their first stop with questions about data-related challenges is usually the IT department. Complaints about data integrity and inflexible systems are very common when business users seek the ability to work independently and at higher speed. If the IT department does not meet those needs, shadow IT happens: business users increasingly choose additional tools on their own, because the work simply has to be done (on time).
Have you ever wondered what database solutions companies use today? The systems that came out in the 1990s, when IT departments built their data warehouses on classic relational databases (like Oracle, Microsoft SQL Server, IBM DB2), are still very much present. These last-century solutions were built for the needs of IT departments that wanted to facilitate their own work, and they are less suitable for direct use in the modern business environment.
In this century, databases have already experienced two (r)evolutions. Specialized analytical appliances appeared, followed by analytical databases. In recent years, analytical systems based on distributed data processing, most often rooted in HDFS, have emerged.
Currently there is an abundance of database solutions on the market, which makes it difficult for companies to decide which is best suited to their needs and requirements. The latter are also constantly evolving, which creates demand for even faster development. This pace of technological change has made a big mess, and the result is that most companies run very heterogeneous, siloed, and often incompatible solutions side by side. Complex IT environments also carry higher maintenance costs and are less competitive; implementing change becomes a nightmare. Ask anyone in business and you will learn that when it comes to ROI, time is no longer measured in decades but in months. Database and analytical solutions should follow suit.
Let me quickly review the kinds of database systems one can find in enterprise data warehouses today:
The first generation of analytical solutions used classical relational databases such as Oracle, DB2 (IBM), and SQL Server (Microsoft). These databases were mainly intended for use with transactional systems and stored the data generated by the business: customers, accounts, production. Since specialized analytical systems were very expensive and rare 10 or more years ago, relational databases were also used by analysts for analytical purposes. Sure, they were not the ideal choice, but data warehouse teams tried to overcome their limitations with various optimization techniques. It soon became clear that classical relational databases are simply not the right solution for analytics, and specialized, dedicated analytical solutions took over.
Right after Y2K, Netezza introduced a revolutionary analytical appliance design, completely shaking up the market and the established database vendors. Netezza’s solution was a rack of low-cost MPP servers with a fast interconnect, running a database derived from PostgreSQL on Linux. Voilà! The fastest, preconfigured, and extremely cheap beast was born: it was more than 100x faster than relational databases, and on top of everything it was superbly easy to use. In its time, it was something of a technological miracle. I still remember one of our customers’ DBAs saying: “It is against physics and impossible!”
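The core MPP trick is simple to sketch: partition the data across many cheap workers, let each compute a partial result over its own slice, and have a coordinator merge the partials. The toy below illustrates the idea with Python’s `multiprocessing` (the function and partitioning scheme are invented for the example; real appliances do this with dedicated hardware and SQL, not Python).

```python
# Toy sketch of the MPP scatter/gather idea behind appliances like Netezza:
# rows are distributed across "nodes", each node scans only its own slice,
# and a coordinator merges the partial results. Illustrative only.
from multiprocessing import Pool

def partial_aggregate(partition):
    # Each "node" computes a partial (sum, count) over its own slice.
    return (sum(partition), len(partition))

def mpp_average(data, nodes=4):
    # Round-robin distribution of rows across the nodes.
    partitions = [data[i::nodes] for i in range(nodes)]
    with Pool(nodes) as pool:
        partials = pool.map(partial_aggregate, partitions)
    # Gather: merge the partials into the final answer.
    total = sum(s for s, _ in partials)
    count = sum(c for _, c in partials)
    return total / count

if __name__ == "__main__":
    print(mpp_average(list(range(1, 101))))  # average of 1..100 -> 50.5
```

The speedup comes from every node scanning only its fraction of the data in parallel, which is why adding nodes scaled these systems almost linearly.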
Soon other vendors tried to mimic Netezza’s approach, with more or less successful copies: Oracle with Exadata, HP bought Vertica, EMC bought Greenplum, and Microsoft acquired DATAllegro. In 2010, Netezza ended up under IBM’s umbrella, as the industry giant caught the moving train at the last minute and thus protected its mainframe market. In the end, SAP joined the game with its HANA solution.
The importance of big boxes soon diminished as hardware prices steeply declined and capacity continued to grow rapidly. These conditions allowed pure software solutions to appear: analytical databases that ran on commodity hardware entered the market. This had several positive effects: database solutions became cheaper, speed increased, and hardware standardization within the company was simplified. Some of these solutions are HP Vertica, EMC Greenplum, and EXASOL.
Despite the rapid development of data management solutions, companies’ needs to store and analyze data grew even faster. Research and development has recently focused on the effective and efficient collection and analysis of the massive amounts of data generated by devices connected to the Internet. Companies that want to leverage data from the Internet of Things, mobile devices, personal fitness trackers, home automation devices, company websites, social networks, and e-commerce platforms need so-called big data solutions. The most effective solutions for managing big data have their roots in HDFS-based cluster systems. New HDFS-based solutions emerge monthly, together with new vendors and new concepts, which is also a challenge for companies and business users who struggle to decide which solution is right for their needs.
Some popular choices are: Cloudera, Hadoop/HBase, MapR…
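What these HDFS-based systems share is the map/shuffle/reduce processing model: mappers work on local data blocks, a shuffle groups intermediate results by key across the network, and reducers aggregate each group. A minimal single-machine sketch of that model, with invented sample data:

```python
# Minimal sketch of the map/shuffle/reduce model that HDFS-based systems
# (classic Hadoop MapReduce and its descendants) run across a cluster.
# Everything here happens on one machine; it only mirrors the phases.
from collections import defaultdict

def map_phase(lines):
    # Map: each mapper emits (key, 1) pairs from its own data block.
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def shuffle(pairs):
    # Shuffle: group values by key. In a real cluster this is the
    # network-heavy redistribution step between mappers and reducers.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: aggregate each key's values independently, in parallel.
    return {key: sum(values) for key, values in groups.items()}

lines = ["big data big clusters", "data lakes and data warehouses"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["data"])  # 3
```

The appeal for big data workloads is that the map and reduce phases parallelize trivially, so the same program scales from one machine to thousands.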
The rapid development of new platforms for storing large quantities of data also brings many challenges. Data transfer to analytical systems slows down, the data takes up a lot of duplicate space, and it arrives in data warehouses with a delay. Storing and analyzing huge amounts of data also carries high costs. New systems have been developed to remedy these challenges; these solutions connect the various data sources only at the logical level. Data virtualization platforms hide the complexity of the company’s data management landscape from end users. Business users do not even know that the data they analyze comes from multiple, physically disconnected databases.
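The principle can be shown in miniature: the user writes one query, while the engine joins data that physically lives in two separate databases. Below, plain `sqlite3` and its `ATTACH` statement stand in for a full virtualization platform, purely as a sketch; the file names, tables, and figures are invented for the example.

```python
# Sketch of data virtualization: one query, two physically separate
# database files (standing in for a CRM system and a billing system).
import os
import sqlite3
import tempfile

crm_path = os.path.join(tempfile.mkdtemp(), "crm.db")
billing_path = os.path.join(tempfile.mkdtemp(), "billing.db")

with sqlite3.connect(crm_path) as crm:
    crm.execute("CREATE TABLE customers (id INTEGER, name TEXT)")
    crm.execute("INSERT INTO customers VALUES (1, 'Acme'), (2, 'Globex')")

with sqlite3.connect(billing_path) as billing:
    billing.execute("CREATE TABLE invoices (customer_id INTEGER, amount REAL)")
    billing.execute("INSERT INTO invoices VALUES (1, 100.0), (1, 50.0), (2, 75.0)")

# The "virtual" layer: the user sees one connection and one query,
# even though the joined tables live in different physical databases.
conn = sqlite3.connect(crm_path)
conn.execute("ATTACH DATABASE ? AS billing", (billing_path,))
rows = conn.execute("""
    SELECT c.name, SUM(i.amount)
    FROM customers c
    JOIN billing.invoices i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
conn.close()
print(rows)  # [('Acme', 150.0), ('Globex', 75.0)]
```

Real virtualization platforms do the same thing across heterogeneous engines and push the heavy work down to each source, but the user-facing contract is identical: one logical schema over many physical stores.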
Taming the chaos of data management systems is no longer a one-time job; it has become a continuous process of improving the performance and functionality of existing data systems and adding new components when needed. To find the optimal solutions and implement them successfully, you will need several broadly educated internal IT people who do not rely solely on information from vendors or “independent” analysts.
Even experienced analysts at established research and advisory companies struggle with analytics today. It is such a fast-developing area that their findings and forecasts often contradict each other. Gartner, Ventana Research, Bloor, Howard Dresner, BARC, BI Scorecard, ZDNet, WIRED, and others involved in market research usually have their own “versions of the truth”, and their forecasts of future development and likely winners resemble a guessing game.
To limit the noise coming from new, innovative, and sometimes revolutionary companies, market researchers keep raising the bar that determines which vendors and products are included in their research. Consequently, many new and truly innovative companies and products are left off these charts. Organizations seeking a real competitive advantage from the latest and greatest solutions therefore have to invest in their own research. On the other hand, enterprises that base their analytical technology purchases solely on market research and the opinions of advisory firms will never use the latest and most up-to-date solutions, and will therefore be less competitive.
Remember: purchasing the second-best product available is already a competitive disadvantage.
IT departments face many tough but justified questions, dilemmas, and concerns. One of the most important and common ones sounds like this: “Should we retire our first-generation data warehouse database and replace it with a new one?”
The answer depends on several factors. If most of the analytical reports in your business environment are static, users do not perform extensive ad-hoc or advanced analytics, and the data in the data warehouse is limited to internal sources, then the existing classic database is most likely still good enough and does not need replacing.
Otherwise, particularly if you deal with large amounts of data and have already run into performance issues, you should consider different solutions. Some businesses cannot afford to fall behind: today, telecommunications providers, banks, insurance companies, internet retailers, utilities, and retail chains need databases that work hand in hand with their analytical solutions.
Industry standards make it possible to connect devices with each other quite easily. The same applies to all levels of analytics, be it data integration, the metadata level, or the analytical tools themselves. Modern tools are able to read data from virtually any electronic source, link it together, and present a consolidated result.
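Even without a dedicated tool, that consolidation step is a few lines of code once the sources speak standard formats. The sketch below links a CSV export with a JSON feed into one result; the field names and figures are invented for the example.

```python
# Sketch of consolidating two sources in standard formats: a CSV export
# (device -> region) and a JSON feed (device readings), linked on device_id.
import csv
import io
import json

csv_source = io.StringIO("device_id,region\n1,EU\n2,US\n")
json_source = '[{"device_id": 1, "reading": 21.5}, {"device_id": 2, "reading": 19.0}]'

# Build a lookup table from the CSV source.
regions = {int(row["device_id"]): row["region"] for row in csv.DictReader(csv_source)}

# Link each JSON measurement to its region: a consolidated result
# drawn from two different formats.
consolidated = [
    {"region": regions[m["device_id"]], "reading": m["reading"]}
    for m in json.loads(json_source)
]
print(consolidated)
# [{'region': 'EU', 'reading': 21.5}, {'region': 'US', 'reading': 19.0}]
```

This is exactly what integration and virtualization tools automate at scale: standard formats make the sources readable, and a shared key makes them linkable.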
Best-of-breed companies constantly seek new and competitive solutions, often test cutting-edge technologies, and introduce them into the business environment. Most likely these companies use two or more generations of analytical platforms simultaneously, while the more conservative companies are usually just replacing one older technology with another.
Once you step out of your comfort zone and start looking for the best analytical solutions available, you can soon find one that fits your business. But it is of the utmost importance that you extend the research beyond the traditional software and hardware vendors and stick to your own criteria when choosing a solution. Sure, this requires a lot of additional effort from employees to gather the required knowledge. But once they become truly capable of scouting modern solutions on the market, taking into account the current and future needs of the company, the decision on which to pick will be an easy one, supported by the creation of additional business value.
Still, it is best to take a month or two, or even half a year, before you make the final decision and purchase. Be sure to throw everything at the proof of concept (POC) and make sure the solution can handle it. It is well worth the effort, and it beats living with a wrong decision for several years or even a decade.
In any case, I advise you to educate yourself and your colleagues before any decision is made about investing in data and analytical infrastructure. Widen your search for potential solutions, seek information from several different sources, and do not simply (or blindly) rely on your current technology suppliers. All solutions work well for the vendors; you are after the one that works best for you and your data!
I hope you have read this all the way through. Despite my best efforts, this article still turned out quite technical. Above all, I will be very pleased if it affects your thinking about the possibilities that exist out there and the opportunities they offer.