Big data characteristics are often described with the help of the
Five Vs (Volume, Velocity, Variety, Veracity, and Value). They are
as follows.
Volume – How big the data is
o The volume of big data is growing at an exponential rate and is
expected to reach around 44 ZB (1 ZB = 10^21 bytes) by 2020.
Velocity – How fast the data is processed
o The speed at which new data is generated and the speed at which data
moves around.
o The latency of processing big data and making decisions is very
important, and this is where big data differs most from a conventional RDBMS.
Variety – The various types of data
o A conventional RDBMS supports only structured data, but big data
supports three kinds of data.
§ Structured – Highly structured and usually stored in an RDBMS.
Approximately 20% of the world’s data is structured. Examples – Numbers,
dates, and groups or tables of words and numbers (for example, a customer
table with columns such as name, age, and address).
§ Semi-Structured – Does not necessarily conform to a fixed schema
(structure) but may be self-describing and may have simple key/value pairs.
It does not fit neatly into the rows and tables of a typical database.
Examples – JSON, XML, logs, tweets.
§ Unstructured – Lacks structure, or parts of it lack structure.
Approximately 80% of the world’s data is unstructured. Example formats –
Free-form text, emails, images, videos, voice recordings, social media
conversations, sensor data, etc.
Veracity – How accurate, meaningful, and trustworthy the results are for
the given problem space.
Value – The useful business value extracted from big data.
Big data analytics companies take all five of these Vs into consideration
before they decide to build programs for data analysis.
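To make the three kinds of data listed under Variety more concrete, here is a minimal Python sketch. The customer rows, tweet-like JSON records, and email text are made-up sample values (assumptions for illustration only); the point is simply how much structure each kind exposes to a program.

```python
import csv
import io
import json

# Structured: fixed columns, every row has the same fields
# (hypothetical customer table).
structured_csv = "name,age,city\nAsha,31,Pune\nBen,45,Leeds\n"
rows = list(csv.DictReader(io.StringIO(structured_csv)))

# Semi-structured: self-describing key/value pairs; the two records
# below do not share a schema (hypothetical tweet-like records).
semi_structured = [
    '{"user": "asha", "tags": ["big-data"], "geo": {"lat": 18.5, "lon": 73.9}}',
    '{"user": "ben", "retweets": 3}',
]
records = [json.loads(line) for line in semi_structured]

# Unstructured: free-form text with no field boundaries at all.
unstructured = "Hi team, the sensor on line 4 looked odd again this morning."


def describe(rows, records, text):
    """Summarize how much structure each kind of data exposes."""
    return {
        "structured_columns": sorted(rows[0].keys()),
        "semi_structured_keys": [sorted(r.keys()) for r in records],
        "unstructured_word_count": len(text.split()),
    }


summary = describe(rows, records, unstructured)
print(summary)
```

The structured rows all share one set of columns, the JSON records each carry their own keys, and the free text offers nothing beyond a sequence of words until some further analysis is applied.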
Big Data Management
Many large enterprises (Google, LinkedIn, Facebook, IBM, Oracle) have
already adopted a big data management life cycle, which runs from data
collection through to decision making, in the phases listed below.
§ Data Collection
§ Data Storage & Organization
§ Summarizing
§ Analysis
§ Synthesizing
§ Decision Making
Below is the high-level architecture of a big data software company's
analysis model. Data is collected from various sources such as web servers
and social media, stored in a Hadoop cluster, passed through an analytics
platform and a big data warehouse, and made available to business
intelligence users.
Big data analytics software companies like IBM, Facebook, LinkedIn, Google,
and Twitter are already moving to technology that allows data to be analyzed
while it is being generated (sometimes referred to as real-time in-memory
analytics), without first putting it into a database. So any enterprise that
does not recognize the importance of big data analytics risks falling behind
future market trends.
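The idea of analyzing data while it is being generated can be sketched in a few lines of Python. This is only an illustration, not how any of the companies named above implement it: `event_stream` is a hypothetical source of response times, and the running mean stands in for whatever metric a real system would track.

```python
from typing import Iterable, Iterator, Tuple


def event_stream() -> Iterator[float]:
    """Hypothetical source: yields response times (ms) as they are generated."""
    for value in [120.0, 80.0, 100.0, 140.0]:
        yield value


def running_mean(stream: Iterable[float]) -> Iterator[Tuple[int, float]]:
    """Yield (count, mean) after each event, keeping only two numbers in memory."""
    count = 0
    mean = 0.0
    for x in stream:
        count += 1
        mean += (x - mean) / count  # incremental update; the event itself is not stored
        yield count, mean


latest = None
for latest in running_mean(event_stream()):
    print(latest)
```

Each event updates the result and is then discarded, so an up-to-date answer is available at every moment without the raw data ever being written to a database.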
Big Data Challenges
Below are the current challenges of big data management and decision
making faced by big data analytics companies.
§ High volume of data and scalability.
§ High velocity of data generation.
§ Complex and varied data types, especially semi-structured and
unstructured data.
§ Disk storage and transmission capacities. As of 2013, a single disk can
store up to 4 TB of data, and its maximum data transfer speed is only about
128 MB/s. With these storage and transfer limits, reading an entire disk
takes roughly 9 hours.
§ Data management issues of access, utilization, updating, governance,
and reference.
§ Privacy and security are another major challenge in big data. For
example, information about people is collected and used to add value to
the organization's business. This is done by deriving insights into their
lives that they themselves are unaware of.
§ Data sharing between big data companies about their clients and
operations threatens the culture of secrecy and competitiveness.
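The disk capacity and transfer figures quoted in the list above (4 TB, 128 MB/s) can be turned into a back-of-the-envelope read-time calculation. The sketch below assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes) and a sustained sequential read at the maximum rate; the `disks` parameter is a hypothetical extension that shows what happens if the same data is spread evenly over several disks read at the same time.

```python
def full_scan_hours(capacity_bytes: float, bytes_per_second: float, disks: int = 1) -> float:
    """Hours needed to read all the data when it is split evenly across
    `disks` drives that are read in parallel at the same rate."""
    per_disk_bytes = capacity_bytes / disks
    return per_disk_bytes / bytes_per_second / 3600


TB = 10**12  # decimal terabyte (assumption)
MB = 10**6   # decimal megabyte (assumption)

one_disk = full_scan_hours(4 * TB, 128 * MB)
hundred_disks = full_scan_hours(4 * TB, 128 * MB, disks=100)

print(f"1 disk:    {one_disk:.1f} hours")
print(f"100 disks: {hundred_disks * 60:.1f} minutes")
```

The same amount of data that takes hours to scan from one disk takes minutes when it is divided across many disks, which is why storage and transfer limits push big data systems toward distributed designs.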
Big Data Analytic Challenges
§ Ability to determine what data to collect and how to analyse it to find
patterns and correlations, given how large the data is.
§ Ability to understand big data business intelligence objectives and
information needs, and to come up with appropriate computer algorithms.
§ Strong knowledge of mathematics and statistics to model the
relationships between data.
§ Ability to present findings (both verbally and in writing) so that the
insights are understood and acted upon.
Big Data Solutions
Below are the solutions to the big data challenges discussed above.
§ Distribute storage across multiple disks.
§ Implement parallel processing.
§ Bring the code to the data for processing instead of bringing the data
to the code.
One technology designed to meet all of the above expectations is Hadoop,
an open source framework for storing distributed data and processing it
in parallel across multiple nodes.
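The three solution ideas (distributed storage, parallel processing, and moving the code to the data) can be illustrated with a small single-machine Python sketch. This is not Hadoop's API: the `blocks` list stands in for data blocks held on different nodes, threads stand in for those nodes, and the log lines are made-up sample data.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce

# Hypothetical data blocks, each imagined to live on a different node or disk.
blocks = [
    ["error disk full", "info job started"],
    ["error network down", "info job finished"],
    ["warn slow read", "error disk full"],
]


def count_levels(block):
    """'Map' step: runs against one block and returns only a small summary
    (a count of log levels), not the block's raw lines."""
    return Counter(line.split()[0] for line in block)


def merge(a, b):
    """'Reduce' step: combine two per-block summaries into one."""
    return a + b


# One worker per block, so all blocks are processed in parallel.
with ThreadPoolExecutor(max_workers=len(blocks)) as pool:
    partials = list(pool.map(count_levels, blocks))

totals = reduce(merge, partials, Counter())
print(dict(totals))
```

Only the small per-block counts travel to the merge step; the raw lines stay where they are stored, which is the essence of bringing the code to the data.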