MapR Converged Data Platform – What can it offer and how can we use it?

by Matej Ugrin | Sep 28, 2018 | Blog

Software-defined storage has changed how companies perceive and use the data at their disposal. Distributed file systems have enabled businesses to better manage their data – locally and in the cloud. Data environments have become much easier to provision; business users can now attach rules and policies to the data that dictate if and when it can be moved to a different storage tier, directly inside the distributed file system and without the need for external data movement applications.

Such environments are known to be scalable, highly performant, readily available and reliable under demanding circumstances, without sacrificing security or simplicity. They become even more attractive if they run on so-called white box infrastructure, meaning they do not depend on specialized appliances or hardware providers. And in the case of MapR there is no cloud or hardware lock-in. So what is MapR all about? “All” in this case actually means just about everything one can envision under the data umbrella: file storage, streaming, NoSQL technologies and storing data for containerized applications.

Efficient data management will help you build applications and data services faster. What you probably don’t know is that it can be done under one roof. MapR Converged Data Platform is highly integrated enterprise storage software that takes software-defined storage, big data and data science environments to a whole new level.

A platform for all the data and applications

The beauty of MapR lies in its carefully planned storage architecture. With MapR, users have a single platform (on a single codebase!) that delivers data-wide convergence. It is the only platform that has a distributed file system which supports storage and analytics of data streams, files and NoSQL tables in the same converged platform. This architecture also features a multi-model NoSQL database, an event streaming engine, several querying tools with ANSI SQL support, and a broad set of open-source data management and analytics frameworks. There is support for technologies such as Hive, HBase, Spark, Drill and others, which enable the development of robust and reliable data-oriented applications in a common, converged environment. The platform supports the creation of both batch and real-time applications using large data loads or small incremental events – everything under one roof. Managing tons of data across different systems, applications and silos is generally complex and a true pain – but significantly less so when MapR is used.

The MapR architecture is the key to providing speed, scale, and reliability, driving both operational and analytical workloads in a single platform. The fact that it runs with great results on commodity hardware is another benefit. It just works with whatever you throw at it – even exabytes of data. And there is everything else that one would expect from an enterprise-level data platform – it checks the right boxes when it comes to high availability (HA), disaster recovery (DR), data recovery and uptime guarantees. Furthermore, a unified security solution watches over all platform components at all times.

Unleash the power of big data

Big data environments have been associated with the Hadoop ecosystem in the past, but MapR’s implementation goes beyond that. Spark, HDFS, Kafka and a host of other open-source technologies are tightly integrated and managed within a single platform, which proves that breaking down data silos is possible and shows how it should be done. The platform’s unique architecture is reflected in the distributed file system MapR-XD, which eliminates the limitations imposed by typical big data environments that require several separate silos to address different kinds of workloads. On top of that, MapR is POSIX compliant, supports random read/write operations and simplifies data movement with NFS support and data movement policies.
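
To make the random read/write point concrete, here is a minimal sketch in Python of what POSIX-style access means for an application: ordinary file calls can seek into the middle of an existing file and overwrite bytes in place, rather than only appending or rewriting the whole file. The helper names (overwrite_at, read_at, demo) are hypothetical, and a temporary directory stands in for wherever the cluster file system would be mounted; nothing here is a MapR-specific API.

```python
import os
import tempfile


def overwrite_at(path: str, offset: int, payload: bytes) -> None:
    """Overwrite bytes in the middle of an existing file, in place."""
    # "r+b" opens for reading and writing without truncating the file.
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)


def read_at(path: str, offset: int, length: int) -> bytes:
    """Read `length` bytes starting at `offset`."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def demo() -> bytes:
    # Assumption: the temporary directory plays the role of a mounted
    # distributed file system (for example an NFS mount point).
    with tempfile.TemporaryDirectory() as mount:
        path = os.path.join(mount, "events.log")
        with open(path, "wb") as f:
            f.write(b"AAAA-BBBB-CCCC")
        # Random write: replace only the middle four bytes.
        overwrite_at(path, 5, b"XXXX")
        # Random read: fetch the whole 14-byte record back.
        return read_at(path, 0, 14)
```

Calling demo() returns b"AAAA-XXXX-CCCC": only the targeted bytes changed, which is the behavior existing tools and scripts expect from a regular file system.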

Spoil your data scientists

Once the company has all its data in order and at its disposal, it is time to put the data to work. To do so, the company has to create a data science environment and employ people who know how to extract value from data – data scientists. And since these experts like to spend their time on data-related business challenges and not on tinkering with tools and data silos that don’t produce results, they should be “spoiled” – technology-wise. MapR takes a holistic approach to self-service data science. A preconfigured Docker environment in the shape of the Data Science Refinery is equipped with the latest advanced analytics tools such as Drill, Spark, Hive and Zeppelin. These can be paired with several solutions such as TensorFlow, Caffe, MXNet, H2O and others.

Today, data scientists on average spend 80% of their time on cleaning and processing data. If those tasks can be made quicker, their job becomes a lot easier and more productive – and the scientists can focus on pressing data-related business challenges.

Even simplified access to data (e.g. via NFS) can go a long way. There are several other benefits of building a data science environment with MapR. Placing data and job processing on different types of hardware (e.g. SSD vs HDD, CPU vs GPU) is just one of them.

Apps, containers and microservices

The architecture of the MapR Converged Data Platform is not just about big data. Its data-centric vision also takes into account the new generation of applications – and even those that we are yet to build. Modern applications rely on microservices to get the right data in the right way to perform business tasks. Microservices themselves can be viewed as single-purpose applications or functions that work in unison via lightweight communications, such as data streams, and deliver a simplified way to build and integrate modern applications in ways that have traditionally been impossible with monolithic applications. MapR’s platform supports building event-driven microservices that leverage event streaming technologies (like MapR-ES) as the communications vehicle. By converging file, database, and streaming services one can develop agile applications for various types of workloads and business needs – for example: real-time, batch or advanced analytics applications.
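
The event-driven pattern described above can be sketched in a few lines of Python. This is an illustration only: InMemoryStream is a hypothetical in-process stand-in for an event streaming service, not the MapR-ES API, and the two "services" are plain functions. The point it shows is that the producer and the consumer never call each other; they share only a topic, and each consumer keeps its own read offset.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class InMemoryStream:
    """Hypothetical stand-in for an event stream: append-only topic logs."""

    def __init__(self) -> None:
        self._topics: Dict[str, List[dict]] = defaultdict(list)
        # Each (topic, consumer) pair remembers how far it has read.
        self._offsets: Dict[Tuple[str, str], int] = defaultdict(int)

    def publish(self, topic: str, event: dict) -> int:
        """Append an event and return its offset in the topic."""
        self._topics[topic].append(event)
        return len(self._topics[topic]) - 1

    def poll(self, topic: str, consumer: str) -> List[dict]:
        """Return events this consumer has not seen yet and advance its offset."""
        key = (topic, consumer)
        start = self._offsets[key]
        events = self._topics[topic][start:]
        self._offsets[key] = start + len(events)
        return events


def order_service(stream: InMemoryStream, order_id: int, amount: float) -> None:
    # Producer microservice: emits an event and knows nothing about consumers.
    stream.publish("orders", {"order_id": order_id, "amount": amount})


def billing_service(stream: InMemoryStream) -> List[str]:
    # Consumer microservice: reacts to whatever new events have arrived.
    return [
        f"invoice for order {event['order_id']}: {event['amount']:.2f}"
        for event in stream.poll("orders", consumer="billing")
    ]
```

Because the offset is tracked per consumer, a second service (say, a hypothetical "analytics" consumer) could read the same topic from the beginning without affecting billing, which is what makes streams a convenient backbone for loosely coupled services.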

But storing data for microservices, especially those running in containers, has proven to be quite challenging for traditional storage technologies due to a lack of connectivity to the container ecosystem and an inability to meet performance and scaling requirements. Despite the stateless and ephemeral nature of containers in most cases, their data still has to live somewhere. This can be a distributed file system, a NoSQL database, a streaming solution or another dedicated storage technology. MapR provides an ideal platform for such cases and can offer all of its storage services as a long-lived persistent data store.

Cutting backup, archiving, and retention costs

Backups, compliance archives and other secondary data are traditionally associated with high storage costs and tedious retention processes. MapR’s platform takes data tiering into account and can automatically transfer older data to dense storage and other cost-optimized locations (e.g. Amazon S3). A cluster topology devoted to data archiving offers high capacity and density while remaining modest in its use of system resources. In any case, the cost-effective data storage is aimed at ensuring compliance with regulatory and other legal requirements.
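
As a rough illustration of what an age-based tiering rule does, the Python sketch below moves files that have not been modified for a given number of days from a "hot" directory to a "cold" one. It is an assumption-laden toy: tier_cold_files and demo are hypothetical names, local directories stand in for storage tiers, and the platform applies its own policies internally rather than through a script like this.

```python
import os
import shutil
import tempfile
import time
from pathlib import Path
from typing import List, Optional

SECONDS_PER_DAY = 86400


def tier_cold_files(hot_dir: Path, cold_dir: Path, max_age_days: float,
                    now: Optional[float] = None) -> List[str]:
    """Move files older than `max_age_days` (by mtime) from hot_dir to cold_dir."""
    now = time.time() if now is None else now
    cutoff = now - max_age_days * SECONDS_PER_DAY
    cold_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for entry in sorted(hot_dir.iterdir()):
        # Only regular files whose last modification predates the cutoff qualify.
        if entry.is_file() and entry.stat().st_mtime < cutoff:
            shutil.move(str(entry), str(cold_dir / entry.name))
            moved.append(entry.name)
    return moved


def demo() -> List[str]:
    with tempfile.TemporaryDirectory() as root:
        hot, cold = Path(root) / "hot", Path(root) / "cold"
        hot.mkdir()
        (hot / "old_backup.tar").write_bytes(b"old")
        (hot / "fresh_backup.tar").write_bytes(b"new")
        # Backdate one file by 90 days so that it falls under the policy.
        ninety_days_ago = time.time() - 90 * SECONDS_PER_DAY
        os.utime(hot / "old_backup.tar", (ninety_days_ago, ninety_days_ago))
        return tier_cold_files(hot, cold, max_age_days=30)
```

With a 30-day threshold, demo() moves only old_backup.tar. A real policy would typically also consider access patterns, file size or data classification, not just modification time.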

Wrapping up

MapR Converged Data Platform was not the first on the market, but it is the first to deliver unified and centralized management of data in a common platform by combining different types and forms of data without the use of separate specialized technologies. Its distributed file system handles files, documents, data streams, and tables under one roof, facilitating data storage and maintenance and easing the administration burden. It does so in a cost-effective way. And it’s not picky – it simply handles all the data on-premises, in the cloud or at the edge of the network, and can scale out to practically any workload.