Big Data refers to volumes of data so large that they cannot be processed using traditional databases. When we have reasonable amounts of data, we typically use traditional relational databases like Oracle, MySQL or SQL Server to store and work with it. However, when the volume of data is very large, traditional databases are not able to handle it.
Traditional databases are good at working with structured data that can be stored in rows and columns. However, if we have unstructured data that does not follow a fixed structure, then a relational database is not the right choice.
In the case of big data, we have large amounts of data that can be in any format, like images, flat files, audio, etc., and whose structure and format may not be the same for every record.
The sheer size of big data, and the volume of new data created over time, can be significantly larger than what traditional databases are built for, which makes it difficult to handle with a traditional database.
Big data is characterized by the 3 V’s – Volume, Velocity and Variety.
- Volume : The volume of data collected in organizations is large and comes from different sources like sensors, meter readings, business transactions, etc.
- Velocity : Data is created at high speed and has to be handled and processed quickly. Instruments like IoT devices, RFID tags and smart meters lead to automated generation of data at unprecedented speed.
- Variety : Data comes in all formats. It can be audio, video, numeric data, text, email, satellite images, atmospheric sensor readings, etc.
Examples And Usage Of Big Data
Storing data without analyzing it to gain meaningful insights would be a waste of resources. Before we look at testing of big data, it is useful to understand how it is being used in the real world.
E-commerce
Amazon, Flipkart and other e-commerce sites have millions of visitors each day and hundreds of thousands of products. Amazon uses big data to store information about products, customers and purchases. Apart from this, data is also gathered on product searches, views, products added to the cart, cart abandonment, products that are bought together, etc.
All of this data is stored and processed in order to suggest products that the customer is most likely to buy. If you open a product page, you can see this in action under the “Frequently bought together”, “Customers who bought this item also bought” and “Customers who viewed this item also viewed” sections. This information is also used to recommend deals / discounts and to rank products in the search results. All of this data has to be processed very quickly, which is not feasible with traditional databases.
Social Media
Social media sites generate huge amounts of data in the form of pictures, videos, likes, posts, comments, etc. Not only is this data stored in big data platforms, it is also processed and analyzed to offer recommendations on content that you might like.
Twitter
- There are 310 million monthly active users on Twitter
- A total of 1.3 billion accounts have been created on Twitter
- Each day 500 million tweets are sent by users which is about 6000 tweets per second
- Over 618,725 tweets were sent in a single minute during the 2014 FIFA World Cup final
Facebook
- There are 1.9 billion monthly active users on Facebook
- Over 1.28 billion users log on to Facebook every day
- 350 million photos are uploaded every day
- 510,000 comments are posted and 293,000 statuses are updated every minute
- 4 new petabytes of data are generated every day
- Videos generate 8 billion views every day
Instagram
- 700 million people use Instagram every month
- 40 billion photos have been shared on Instagram
- Users like 4.2 billion pictures every day
- 95 million photos are uploaded every day
Not only is this data stored in big data platforms, it is also processed and analyzed to offer recommendations on things you may be interested in.
For example, if you search for a washing machine on Amazon and go to Facebook, Facebook will show you ads for the same.
This is a big data use case because there are millions of websites that advertise on Facebook and there are billions of users.
Storing and processing this information to display the right advertisement to the right users cannot be accomplished by traditional databases in the same amount of time.
Targeting the right customer with the right ad is important because a person searching for washing machines is more likely to click on an ad for a washing machine than an ad for a television.
Healthcare
- The FDA and CDC created the GenomeTrakr program, which processes 17 terabytes of data and is used to identify and investigate foodborne outbreaks. This helped the FDA identify a nut-butter production centre as the source of a multi-state Salmonella outbreak. The FDA halted production at the factory, which stopped the outbreak.
- Aetna, an insurance provider, processed 600,000 lab results and 18 million claims in a year to assess the risk factors of patients and focus treatment on the one or two factors that most significantly impact and improve the health of the individual.
Data formats in Big Data
One common question that people ask is – why can we not use a traditional relational database for big data? To answer this, we first need to understand the different data formats in big data.
Data formats in big data can be classified into three categories. They are:
- Structured Data
- Semi Structured Data
- Unstructured Data
Structured Data
- This refers to data that is highly organized.
- It can be easily stored in any relational database.
- This also means that it can be easily retrieved / searched using simple queries.
Examples of Structured Data
The image below depicts the data model for an application. Here you can see the tables and the columns associated with each table. In this example, the user table t_user stores details like the user's name, password, email, phone numbers, etc. The lengths of the fields and their data types are predefined and have a fixed structure.
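As a minimal sketch of how such structured data maps onto a relational schema (the column names, lengths and sample values here are assumptions for illustration, not the actual schema shown in the image), a table like t_user could be created and queried as follows:

    import sqlite3

    # Hypothetical structured table similar to the t_user table described above.
    # Column names, types and lengths are assumptions for illustration only.
    conn = sqlite3.connect(":memory:")
    conn.execute("""
        CREATE TABLE t_user (
            user_id      INTEGER PRIMARY KEY,
            user_name    VARCHAR(50) NOT NULL,
            password     VARCHAR(64) NOT NULL,
            email        VARCHAR(100),
            phone_number VARCHAR(15)
        )
    """)

    # Because the structure is fixed, records can be added and searched with simple queries.
    conn.execute(
        "INSERT INTO t_user (user_name, password, email, phone_number) VALUES (?, ?, ?, ?)",
        ("adam", "hashed-password", "adam@example.com", "845-156-5555"),
    )
    for row in conn.execute("SELECT user_name, email FROM t_user WHERE user_name = ?", ("adam",)):
        print(row)

Because every record follows the same predefined structure, simple SQL queries are enough to retrieve or search the data.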
Semi-Structured Data
- Semi-structured data is not rigidly organized in a format that allows it to be easily accessed and searched.
- Semi-structured data is not usually stored in a relational database.
- However, it can be stored in a relational database after some processing to convert it into a structured format.
- Semi-structured data lies between structured and unstructured data.
- It can contain tags and other metadata to implement a hierarchy and order.
- In semi-structured data, the same type of entities may have different attributes in a different order.
Examples of Semi-Structured Data
CSV, XML and JavaScript Object Notation (JSON) are examples of semi-structured data which are used in almost all applications.
A sample XML file is given below. We can see that the XML file describes a catalog and the books which are part of the catalog. This data can be stored in a relational database with some processing.

    <catalog>
       <book>
          <author>Gambardella, Matthew</author>
          <title>XML Developer's Guide</title>
          <genre>Computer</genre>
          <price>44.95</price>
          <publish_date>2000-10-01</publish_date>
          <description>An in-depth look at creating applications with XML.</description>
       </book>
       <book>
          <author>Ralls, Kim</author>
          <title>Midnight Rain</title>
          <genre>Fantasy</genre>
          <price>5.95</price>
          <publish_date>2000-12-16</publish_date>
          <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
       </book>
    </catalog>

Sample JSON content is given below. In this example, we have the address and phone numbers of a user along with some other details. This information can also be stored in a relational database after processing.

    {
        "firstName": "Adam",
        "lastName": "Levine",
        "age": 22,
        "address": {
            "streetAddress": "18 Elm Street",
            "city": "San Jose",
            "state": "CA",
            "postalCode": "94088"
        },
        "phoneNumber": [
            { "type": "home", "number": "845-156-5555" },
            { "type": "fax", "number": "789-658-9874" }
        ]
    }
Unstructured Data
- Unstructured data does not have any predefined format.
- It does not follow a structured data model.
- It is not organized into a predefined structure.
- Images, videos, Word documents and mp3 files can be considered unstructured data even though they have an internal structure
- This lack of structure makes it difficult to store and retrieve such data from relational databases
- Up to 80% of the data produced in an organization is unstructured data
Examples of Unstructured Data
Images, videos, word documents, presentations, mp3 files etc
Why traditional relational databases cannot be used to support big data
- Traditional relational databases like Oracle, MySQL and SQL Server cannot be used for big data since most of the data we have will be in an unstructured format.
- Variety of data – Data can be in the form of images, video, pictures, text, audio etc. This could be military records, surveillance videos, biological records, genomic data, research data etc. This data cannot be stored in the row and column format of an RDBMS.
- The volume of data stored in big data systems is huge. This data needs to be processed fast, which requires parallel processing of the data. Parallel processing of RDBMS data would be extremely expensive and inefficient.
- Traditional databases are not built to store and process data in large volumes / size. Example: Satellite imagery for the USA, road maps for the world, all the images on Facebook.
- Data creation velocity – Traditional databases cannot handle the velocity at which large volumes of data are created. Example: 6,000 tweets are created every second and 510,000 comments are created every minute. Traditional databases cannot handle this velocity of data being stored or retrieved.
Test Strategy And Steps For Testing Big Data Applications
There are several areas in the process / workflow of a big data project where testing will be required. Testing in big data projects is typically related to database testing, infrastructure and performance testing and functional testing. Having a clear test strategy contributes to the success of the project.
Database Testing Of Big Data Applications
A significant part of the testing effort will be spent on data validation compared to testing of the software components.
Before we go further let us understand the flow of data in a big data application. This workflow is shown in the image below.
- Data can flow into big data systems from various sources like sensors, IoT devices, scanners, CSV files, census information, logs, social media, RDBMS systems, etc.
- The big data application will work with these data sets. This data may have to be cleaned and validated to ensure that correct data is used going forward.
- As this data will be huge, we will have to bring it into Hadoop (or a similar framework) where we can work with the data.
- Once the data is in Hadoop we will have to verify whether the data has been properly imported into Hadoop.
- We will have to test correctness and completeness of the data.
- In order to work with Hadoop you should be aware of the commands used in Hadoop.
- In order to validate the source data, you should have knowledge of SQL since the source of data could be an RDBMS system
- The big data application will work on the data in Hadoop and process it as per the required logic
- Though our big data application processes the data in Hadoop, we will also want to validate that it was processed correctly as per the customer requirements.
- In order to test the application we use test data. The data that is available in Hadoop is huge and we cannot use all the data for testing. We select a subset of the data for testing, which we call test data.
- We will also have to run the same processing logic on the test data, as per the customer requirements.
- We will then compare the result with the output of the big data application to confirm that the application is processing the data correctly (a minimal sketch of such a comparison is shown after this list).
- In order to process the test data you will require some knowledge of Hive, Pig Scripting, Python and Java. You will develop scripts to extract and process the data for testing.
- You can think of the big data application as an application that the developer has written to process large volumes of data. For example, consider that you are working for Facebook and the developers have built a big data application in which any comment that contains the phrase “Free Credit Card Offer” is marked as spam. This is an overly simplified example; usually the applications are more complex and involve identifying patterns in the data and making predictions using data science to differentiate spam comments from legitimate comments.
- The processed data is then stored in a data warehouse.
- After the data is stored in the data warehouse, it may be validated again to ensure that it aligns with the data that was generated by the big data application after processing.
- The data from the data warehouse is usually analyzed and depicted in a visual format so that Business Intelligence (BI) can be gained from it. Some organizations use business intelligence tools from vendors like SAP or Oracle, while others use languages like R (with Shiny) to visualize the data.
- Once the data has been represented visually, it will have to be validated
- Web services may be used in order to transfer the data from the data warehouse to the BI system. In such cases, the web services will also have to be tested and the tester should have knowledge of testing web services.
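As a minimal sketch of the comparison mentioned above (the spam rule, the file names and the column names are hypothetical and simply mirror the simplified Facebook example), the tester's script computes the expected result independently and compares it with the application's output:

    import csv

    SPAM_PHRASE = "free credit card offer"

    def expected_is_spam(comment_text):
        # Expected result, computed independently by the test script
        # from the (simplified) business rule described above.
        return SPAM_PHRASE in comment_text.lower()

    def validate(test_data_file, application_output_file):
        # test_data_file: the subset of comments selected as test data (comment_id, comment_text).
        # application_output_file: the big data application's output (comment_id, is_spam).
        with open(application_output_file, newline="") as f:
            actual = {row["comment_id"]: row["is_spam"] == "true" for row in csv.DictReader(f)}

        mismatches = []
        with open(test_data_file, newline="") as f:
            for row in csv.DictReader(f):
                expected = expected_is_spam(row["comment_text"])
                if actual.get(row["comment_id"]) != expected:
                    mismatches.append(row["comment_id"])
        return mismatches

    # Example usage (file names are hypothetical):
    # print(validate("test_comments.csv", "application_output.csv"))

Any mismatch reported by such a script points either to a defect in the application's processing logic or to a problem with the data itself, and has to be investigated.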
As seen in the above steps, database testing will be a key component of testing in big data applications. In summary, the above steps can be classified into 3 major groups:
- Data Staging Validation: Here we validate the data taken from various sources like sensors, scanners, logs etc. We also validate the data that is pushed into Hadoop (or similar frameworks).
- Process Validation: In this step the tester validates that the data obtained after processing by the big data application is accurate. This also involves testing the accuracy of the data generated by MapReduce or similar processes.
- Output Validation: In this step the tester validates that the output from the big data application is correctly stored in the data warehouse. They also verify that the data is accurately represented in the business intelligence system or any other target UI.
Performance Testing of Big Data Applications
Big data projects involve processing of large amounts of data in a short period of time.
This operation requires heavy computing resources and smooth data flow in the network.
Architectural issues in the system can lead to performance bottlenecks in the process, which can impact the availability of the application. This can in turn impact the success of the project.
Performance testing of the system is required to avoid the above issues. Here we measure metrics like throughput, memory utilization, CPU utilization, time taken to complete a task etc.
It is also recommended to run failover tests to validate the fault tolerance of the system and ensure that if some nodes fail, other nodes will take up the processing.
Big Data / Hadoop Performance Testing
Performance testing of the big data application focuses on the following areas.
- Data Loading And Throughput: In this test the tester observes the rate at which data is consumed from different sources like sensors, logs, etc. into the system. The tester also checks the rate at which the data is created in the data store. In the case of message queues, we test the time taken to process a certain number of messages (a simplified sketch of such a measurement is shown after this list).
- Data Processing Speed: In this test we measure the speed with which the data is processed using MapReduce jobs.
- Sub-System Performance: In this test we measure the performance of the various individual components which are part of the overall application. It may be beneficial to test components in isolation to identify bottlenecks. This can include testing of the MapReduce process, the performance of queries, etc.
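A simplified sketch of a data loading / throughput measurement is given below. The load here is simulated with an in-memory list; in a real test the load function would push the records into HDFS, a message queue or the data store under test.

    import time

    def measure_throughput(records, load_fn):
        # Measures how many records per second the load function can consume.
        start = time.perf_counter()
        for record in records:
            load_fn(record)
        elapsed = time.perf_counter() - start
        return len(records) / elapsed if elapsed > 0 else float("inf")

    # Simulated load target; a real test would write to the system under test instead.
    sink = []
    records = [{"sensor_id": i, "reading": i * 0.5} for i in range(100_000)]
    rate = measure_throughput(records, sink.append)
    print(f"Loaded {len(records)} records at {rate:,.0f} records/second")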
Performance Testing Approach
The performance testing approach for big data applications is shown in the image below.
- In order to begin performance testing, the big data cluster which is to be tested has to be set up.
- The load and jobs that will be executed as part of the performance test have to be identified and planned.
- Any custom scripts / clients that may be required have to be written.
- Run the performance test and study the results.
- If the results are not satisfactory and the system does not meet performance standards, then the components have to be optimized and the tests have to be run again.
- The previous step has to be repeated till performance requirements are met.
Functional Testing of Big Data Applications
Functional testing of big data applications is performed by testing the front-end application based on the user requirements. The front end can be a web-based application which interfaces with Hadoop (or a similar framework) on the back end.
Results produced by the front end application will have to be compared with the expected results in order to validate the application.
Functional testing of the applications is quite similar in nature to testing of normal software applications.
Roles and Responsibilities Of A Tester In Big Data Applications
- The tester should be able to work with unstructured data and semi-structured data. They should also be able to work with structured data in the data warehouse or the source RDBMS.
- Since the schema may change as the application evolves, the software tester should be able to work with a changing schema.
- Since the data can come from a variety of data sources and differ in structure, they should be able to develop the structure themselves based on their knowledge of the source.
- This may require them to work with the development teams and also with the business users to understand the data.
- In typical applications, testers can use a sampling strategy when testing manually, or an exhaustive verification strategy when using an automation tool. However, in the case of big data applications, since the data set is huge, even extracting a sample which represents the data set accurately may be a challenge.
- Testers may have to work with the business and development team and may have to research the problem domain before coming up with a strategy
- Testers will have to be innovative in order to come up with techniques and utilities that provide adequate test coverage while maintaining high test productivity.
- Testers should know how to work with systems like Hadoop, HDFS. In some organizations, they may also be required to have or gain basic knowledge of setting up the systems.
- Testers may be required to have knowledge of Hive QL and Pig Latin. They may also be called upon to write MapReduce programs in order to ensure complete testing of the application.
- Testing of big data applications requires significant technical skills and there is a huge demand for testers who possess these skills.
Advantages Of Using Big Data / Hadoop
- Scalable : Big data applications can be used to handle large volumes of data, which can run into petabytes or more. Hadoop can easily scale from one node to thousands of nodes based on the processing requirements and the data.
- Reliable : Big data systems are designed to be fault tolerant and automatically handle hardware failures. Hadoop automatically transfers tasks from machines that have failed to other machines.
- Economical : Use of commodity hardware along with the fault tolerance provided by Hadoop, makes it a very economical option for handling problems involving large datasets.
- Flexible : Big data applications can handle different types of heterogeneous data, like structured, semi-structured and unstructured data. They can process data extremely quickly due to parallel processing.
Disadvantages Of Using Big Data / Hadoop
- Technical Complexity – The technical complexity involved in big data projects is significantly higher compared to normal projects. Each component of the system belongs to a different technology. The overheads and support involved in ensuring that the hardware and software for these projects run smoothly are equally high.
- Logistical Changes – Organizations that want to use big data may have to modify how data flows into their systems. They will have to adapt their systems to a constant flow of data rather than data arriving in batches. This could translate into significant changes to their existing IT systems.
- Skilled Resources – Testers and developers who work on big data projects need to be highly technical and skilled at picking up new technology on their own. Finding and retaining highly skilled people can be a challenge.
- Expensive – While big data promises use of low cost machinery to solve computing challenges, the human resources required in such projects are expensive. Data mining experts, data scientists, developers and testers required for such projects cost more than normal developers and testers.
- Accuracy of Results – Extracting the right data and accurate results from the data is a challenge. Example: Gmail can sometimes mark a legitimate email as spam. If many users mark emails from a sender as spam, Gmail will start marking all the emails from that sender as spam.
Hadoop Architecture
Hadoop is one of the most widely used frameworks in big data projects. Though testers may be interested in big data mainly from a testing perspective, it is beneficial to have a high-level understanding of the Hadoop architecture.
The above diagram shows the high level architecture of Hadoop. Each node (Client, Master Node, Slave Node) represents a machine.
Hadoop is installed on client machines and they control the work being done by loading cluster data, submitting MapReduce jobs and configuring the processing of data. They are also used to view results. All of these machines together form a cluster. There can be many clusters in the network.
Master nodes have two key responsibilities. First, they handle distributed storage of data using the NameNode. Second, they coordinate the parallel processing of data (MapReduce) through the JobTracker. The Secondary NameNode acts as a backup to the NameNode.
Slave nodes form the bulk of the servers. They store and process the data. Each slave node has a DataNode and a TaskTracker.
The DataNode is a slave to, and receives instructions from, the NameNode, and carries out the storage of data as shown below.
The TaskTracker is a slave to, and receives instructions from, the JobTracker. It processes the data using MapReduce, which is a two-step process. The workflow of the Map process is shown below.
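To make the two-step process concrete, here is a minimal, in-memory word count written as a Map step followed by a Reduce step. This is only an illustration of the programming model; an actual Hadoop job would distribute these steps across the TaskTrackers (for example through Hadoop Streaming or a Java MapReduce program).

    from collections import defaultdict

    def map_phase(documents):
        # Map step: emit (key, value) pairs - here (word, 1) for every word.
        for doc in documents:
            for word in doc.lower().split():
                yield word, 1

    def reduce_phase(mapped_pairs):
        # Shuffle/sort: group the values by key, then Reduce step: aggregate each group.
        grouped = defaultdict(list)
        for key, value in mapped_pairs:
            grouped[key].append(value)
        return {key: sum(values) for key, values in grouped.items()}

    documents = ["big data needs big clusters", "hadoop processes big data"]
    print(reduce_phase(map_phase(documents)))
    # {'big': 3, 'data': 2, 'needs': 1, 'clusters': 1, 'hadoop': 1, 'processes': 1}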
Big Data Tools / Common Terminologies
- Hadoop – Hadoop is an open source framework. It is used for distributed processing and storage of large datasets using clusters of machines. It can scale from one server to thousands of servers. It provides high availability using cheap machines by identifying hardware failures and handling them at application level.
- Hadoop Distributed File System (HDFS) – HDFS is a distributed file system which is used to store data across multiple low-cost machines.
- MapReduce – MapReduce is a programming model for parallel processing of large data sets.
- Hive – Apache Hive is data warehouse software that is used for working with large datasets stored in distributed file systems
- HiveQL – HiveQL is similar to SQL and is used to query the data stored in Hive. HiveQL is suitable for flat data structures only and cannot handle complex nested data structures.
- Pig Latin – Pig Latin is a high level language which is used with the Apache Pig platform. Pig Latin can be used to handle complex nested data structures. Pig Latin is statement based and does not require complex coding.
- Commodity Servers – When working with big data, you will come across terms like Commodity Servers. This refers to cheap hardware used for parallel processing of data. This processing can be done using cheap hardware since the process is fault tolerant. If a commodity server fails while processing an instruction, this is detected and handled by Hadoop. Hadoop will assign the task to another server. This fault tolerance allows us to use cheap hardware.
- Node – Node refers to each machine where the data is stored and processed. Big data frameworks like Hadoop allow us to work with many nodes. Nodes may have different names like DataNode, NameNode etc.
- DataNodes – These are the machines which are used to store and process the data.
- NameNodes – NameNode is the central directory of all the nodes. When a client wants to locate a file, it can communicate with the NameNode which will return the list of DataNodes servers where the file / data can be located.
- Master Nodes – Master nodes oversee the storage of data and the parallel processing of that data using MapReduce. They use the NameNode for coordinating data storage and the JobTracker for managing the parallel processing of data.
- JobTracker – It accepts jobs, assigns tasks and identifies failed machines
- Worker Nodes – They make up the bulk of the machines and are used for storing and processing data. Each worker node runs a DataNode and a TaskTracker, which are used for messaging with the master nodes.
- Client Nodes – Hadoop is installed on client nodes. They are neither master nor worker nodes and are used to load data into the cluster, submit MapReduce jobs and view the results.
- Clusters – A cluster is a collection of nodes working together. These nodes can be master, worker or client nodes.
Big Data Automation Testing Tools
Testing big data applications is significantly more complex than testing regular applications. Big data automation testing tools help in automating the repetitive tasks involved in testing.
Any tool used for automation testing of big data applications must fulfill the following needs:
- It should allow automation of the complete software testing process
- Since database testing is a large part of big data testing, it should support tracking the data as it gets transformed from the source data to the target data after being processed through the MapReduce algorithm and other ETL transformations (a small reconciliation sketch is shown after this list)
- It should be scalable but, at the same time, flexible enough to incorporate changes as the application complexity increases
- It should integrate with disparate systems and platforms like Hadoop, Teradata, MongoDB, AWS and other NoSQL products
- It should integrate with DevOps solutions to support continuous delivery
- It should have good reporting features that help you identify bad data and defects in the system
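For example, here is a very small sketch of the kind of source-to-target reconciliation such a tool automates (the file names and the key column are hypothetical; a real tool would also compare column values and data types):

    import csv
    from collections import Counter

    def reconcile(source_file, target_file, key_column):
        # Compares a source extract with the target extract produced after transformation:
        # checks the record counts and reports keys that are missing or unexpected.
        with open(source_file, newline="") as f:
            source_keys = Counter(row[key_column] for row in csv.DictReader(f))
        with open(target_file, newline="") as f:
            target_keys = Counter(row[key_column] for row in csv.DictReader(f))

        return {
            "source_count": sum(source_keys.values()),
            "target_count": sum(target_keys.values()),
            "missing_in_target": sorted(source_keys.keys() - target_keys.keys()),
            "unexpected_in_target": sorted(target_keys.keys() - source_keys.keys()),
        }

    # Example usage (file names are hypothetical):
    # print(reconcile("source_extract.csv", "target_extract.csv", key_column="record_id"))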
Big Data Testing Services
Finding skilled resources for testing big data projects, retaining them, managing higher salary costs and growing the team while meeting project needs is a challenge. This is the problem that big data testing service providers address.
Organizations that provide big data testing services have team members who are highly technical. They are able to learn new technologies quickly and troubleshoot issues independently. They have experience across a large number of technologies, platforms and frameworks, which is crucial when testing big data applications.
Big data testing service providers have a pool of skilled resources who are experienced in big data testing, and they are able to deploy these resources to projects quickly.
If an organization were to grow a big data team internally, it would take a significant amount of time and cost to hire skilled resources. This may impact projects that have a specific timeline. Apart from this, the organization also has to retain team members and manage the career aspirations of those who want to grow in the domain.
In traditional organizations, a big data project may be a side project or one among many. However, big data projects are the focus area for organizations that provide big data testing services. This allows experts in the domain to grow their technical skills and domain knowledge as they grow in the organization.
As a result, big data testing service providers are a good solution for organizations who want to get the expertise but do not have the time or resources to develop an in-house team.
If you found this tutorial useful, please share it with your friends / colleagues.
How are you using big data in your organization? Please share your inputs in the comments.