What Is Big-Data?
Big-Data is one of the emerging concepts of this era. The term “Big-Data” describes large volumes of data that can be structured, semi-structured or unstructured.
In other words, it refers to collections of data sets that are so large and complex that they are difficult to process using traditional applications and tools.
The features of Big-Data can be explained using the three Vs: Volume, Variety and Velocity.
Volume: Companies collect data from a variety of sources, including social media, business transactions, sensors and machine-to-machine data.
Variety: Data comes in a variety of formats, such as text, numeric, audio, video, email and pictures.
Velocity: The flow of data is very fast and must be dealt with in a timely manner.
There are also two additional dimensions:
Variability: The flow of data is highly inconsistent, with periodic peaks. These event-triggered peak loads can be very challenging to handle, especially when the data is unstructured.
Complexity: Data comes from multiple sources, which makes it very difficult to link, match, cleanse and transform.
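To make the link, match, cleanse and transform steps a little more concrete, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the two sources (crm_rows and web_rows), their field names, the sample values and the helper functions are made up, and real pipelines need far more careful matching rules than a lower-cased email address.

```python
# Toy illustration only: all rows, field names and helpers are hypothetical.

def normalise_email(raw):
    """Cleanse step: trim whitespace and lower-case the address."""
    return raw.strip().lower()

# Two made-up sources that describe the same people in different ways.
crm_rows = [
    {"email": " Asha@Example.com ", "name": "Asha"},
    {"email": "ben@example.com", "name": "Ben"},
]
web_rows = [
    {"email": "asha@example.com", "last_login": "2016-01-05"},
    {"email": "BEN@EXAMPLE.COM ", "last_login": "2016-01-07"},
    {"email": "cara@example.com", "last_login": "2016-01-09"},
]

def link_sources(left, right, key):
    """Match rows from both sources on the cleansed key and merge them."""
    # Index the right-hand source by its cleansed key for quick lookups.
    index = {normalise_email(row[key]): row for row in right}
    linked = []
    for row in left:
        cleaned = normalise_email(row[key])
        match = index.get(cleaned)
        if match is not None:
            # Transform step: one merged record that carries the cleansed key.
            linked.append({**match, **row, key: cleaned})
    return linked

linked = link_sources(crm_rows, web_rows, "email")
for record in linked:
    print(record)
```

Only the two people present in both sources are linked. The third web row has no counterpart in the other source and is left out, which is exactly the kind of gap that makes multi-source data hard to reconcile.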
Tools Used to Analyse Big-Data
Storing and analysing Big-Data can be a challenging job because of its complexity. Given below are some of the top tools used to store and analyse it.
Apache Hadoop is one of the most widely used tools for Big-Data. It is a free, Java-based software framework that can effectively store large amounts of data in a cluster. HDFS, i.e. the Hadoop Distributed File System, is the storage system of Hadoop; it splits data into blocks and distributes them across many nodes in a cluster. It also replicates the data and thus ensures that the data remains available.
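The idea of splitting data and replicating it across nodes can be sketched in a few lines of Python. This is a toy model under stated assumptions, not the HDFS API and not Hadoop's real placement policy: the function names, node names, the 4-byte block size and the simple round-robin placement are all made up for illustration.

```python
# Toy model of block splitting and replication. NOT the HDFS API;
# every name and number here is a hypothetical choice for the example.

def split_into_blocks(data, block_size):
    """Cut the data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def place_blocks(num_blocks, nodes, replication):
    """Assign each block to `replication` distinct nodes, round-robin."""
    if replication > len(nodes):
        raise ValueError("replication factor cannot exceed the number of nodes")
    placement = {}
    for block_id in range(num_blocks):
        placement[block_id] = [
            nodes[(block_id + r) % len(nodes)] for r in range(replication)
        ]
    return placement

def readable_after_failure(placement, failed_node):
    """True if every block still has at least one copy on a live node."""
    return all(
        any(node != failed_node for node in holders)
        for holders in placement.values()
    )

data = b"0123456789"                       # 10 bytes of sample data
blocks = split_into_blocks(data, 4)        # blocks of 4, 4 and 2 bytes
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(len(blocks), nodes, replication=3)

for block_id, holders in placement.items():
    print(block_id, blocks[block_id], holders)
print(readable_after_failure(placement, "node2"))
```

Because every block lives on three different nodes in this sketch, losing any single node still leaves at least two copies of each block, which is the availability property described above.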
We all know that SQL is very effective when it comes to handling structured data. But what about unstructured data? For that, we can use NoSQL, i.e. Not Only SQL. A NoSQL database can store unstructured data without a fixed schema, so each row can have its own set of columns.
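The "no particular schema" idea can be illustrated with plain Python dictionaries. This is only a sketch of the concept and not the API of any real NoSQL product: the records, field names and the find and fields_in_use helpers are hypothetical.

```python
# Concept sketch only: plain dictionaries standing in for schemaless rows.
# The records and helper names are made up for illustration.

records = [
    {"id": 1, "name": "Asha", "email": "asha@example.com"},
    {"id": 2, "name": "Ben", "tags": ["sensor", "iot"]},
    {"id": 3, "device": "thermostat", "reading": 21.5},
]

def find(rows, **criteria):
    """Return rows whose fields equal every given criterion.

    A row that lacks a field simply does not match; nothing breaks.
    """
    return [
        row for row in rows
        if all(row.get(field) == value for field, value in criteria.items())
    ]

def fields_in_use(rows):
    """Collect every field name that appears in at least one row."""
    return sorted({field for row in rows for field in row})

print(find(records, name="Ben"))
print(fields_in_use(records))
```

Each record carries only the fields it needs, and a query on a field that some rows lack skips those rows instead of failing. A relational table would require every row to fit the same set of columns.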
Microsoft HDInsight is a Microsoft solution for Big-Data, powered by Apache Hadoop. It is available as a service in the cloud and uses Windows Azure Blob storage as its default file system. It also offers high availability at low cost.
Hive is a distributed data warehouse system for Hadoop. It supports a SQL-like query language, HiveQL (HQL), for accessing the data, and it can be used for data mining operations. Hive runs on top of Hadoop.
PolyBase works on top of SQL Server 2012 Parallel Data Warehouse (PDW) and is used to access the data stored in PDW. PDW is a data warehousing appliance built for processing large volumes of relational data, and PolyBase adds integration with Hadoop, allowing us to access non-relational data as well.
Excel 2013, one of Microsoft's most popular tools, can be used to connect to data stored in Hadoop. Its Power View feature can then be used to summarise the data easily.