- Volume : The volume of data collected in organizations is large and comes from different sources like sensors, meter readings, business transactions, etc.
- Velocity : Data is created at high speed and has to be handled and processed quickly. Instruments like IoT devices, RFID tags, smart meters and others lead to automated generation of data at unprecedented speed.
- Variety : Data comes in all formats. It can be audio, video, numeric data, text, email, satellite images, data from atmospheric sensors, etc.
Examples And Usage Of Big Data
Storing data without analyzing it to gain meaningful insights would be a waste of resources. Before we look at testing of big data, it is useful to understand how it is being used in the real world.
E-commerce
Amazon, Flipkart and other e-commerce sites have millions of visitors each day and hundreds of thousands of products. Amazon uses big data to store information regarding products, customers and purchases. Apart from this, data is also gathered on product searches, views, products added to the cart, cart abandonment, products bought together, etc. All of this data is stored and processed in order to suggest products that the customer is most likely to buy. If you open a product page, you can see this in action in the “Frequently bought together”, “Customers who bought this item also bought” and “Customers who viewed this item also viewed” sections. This information is also used to recommend deals / discounts and to rank products in search results. All of this data has to be processed very quickly, which is not feasible with traditional databases.
Social Media
Social media sites generate huge amounts of data in the form of pictures, videos, likes, posts, comments, etc. Not only is this data stored in big data platforms, it is also processed and analyzed to offer recommendations on content that you might like.
Twitter
- There are 310 million monthly active users on Twitter
- A total of 1.3 billion accounts have been created on Twitter
- Each day 500 million tweets are sent by users which is about 6000 tweets per second
- Over 618,725 tweets were sent in a minute during the 2014 FIFA World Cup final
Facebook
- There are 1.9 billion monthly active users on Facebook
- Over 1.28 billion users log on to Facebook every day
- 350 million photos are uploaded every day
- 510,000 comments are posted and 293,000 statuses are updated every minute
- 4 new petabytes of data are generated every day
- Facebook videos generate 8 billion views every day
Instagram
- 700 million people use Instagram every month
- 40 billion photos have been shared on Instagram
- Users like 4.2 billion pictures every day
- 95 million photos are uploaded every day
Healthcare
- The FDA and CDC created the GenomeTrakr program, which processes 17 terabytes of data and is used to identify and investigate foodborne outbreaks. This helped the FDA identify a nut-butter production center as the source of a multistate Salmonella outbreak. The FDA halted production at the factory, which stopped the outbreak.
- Aetna, an insurance provider, processed 600,000 lab results and 18 million claims in a year to assess the risk factors of patients and focus treatment on the one or two factors that most significantly impact and improve an individual's health.
Data Formats In Big Data
- Structured Data
- Semi-Structured Data
- Unstructured Data
Structured Data
- This refers to data that is highly organized.
- It can be easily stored in any relational database.
- This also means that it can be easily retrieved / searched using simple queries.
Semi-Structured Data
- Semi-structured data is not rigidly organized in a format that allows it to be easily accessed and searched.
- Semi-structured data is not usually stored in a relational database.
- However, it can be stored in a relational database after some processing to convert it to a structured format.
- Semi-structured data lies between structured and unstructured data.
- It can contain tags and other metadata to implement a hierarchy and order.
- In semi-structured data, entities of the same type may have different attributes in a different order.
An XML document is a typical example of semi-structured data:
```xml
<catalog>
  <book>
    <author>Gambardella, Matthew</author>
    <title>XML Developer's Guide</title>
    <genre>Computer</genre>
    <price>44.95</price>
    <publish_date>2000-10-01</publish_date>
    <description>An in-depth look at creating applications with XML.</description>
  </book>
  <book>
    <author>Ralls, Kim</author>
    <title>Midnight Rain</title>
    <genre>Fantasy</genre>
    <price>5.95</price>
    <publish_date>2000-12-16</publish_date>
    <description>A former architect battles corporate zombies, an evil sorceress, and her own childhood to become queen of the world.</description>
  </book>
</catalog>
```
{ "firstName": "Adam", "lastName": "Levine", "age": 22, "address": { "streetAddress": "18 Elm Street", "city": "San Jose", "state": "CA", "postalCode": "94088" }, "phoneNumber": [ { "type": "home", "number": "845-156-5555" }, { "type": "fax", "number": "789-658-9874" } ] }
Unstructured Data
- Unstructured data does not have any predefined format.
- It does not follow a structured data model.
- It is not organized into a predefined structure.
- Images, videos, Word documents and MP3 files can be considered unstructured data, even though they have an internal structure.
- This lack of structure makes it difficult to store and retrieve such data from relational databases.
- Up to 80% of the data produced in an organization is unstructured data.
Why Traditional Relational Databases Cannot Be Used To Support Big Data
- Traditional relational databases like Oracle, MySQL and SQL Server cannot be used for big data since most of the data will be in unstructured formats.
- Variety of data – Data can be in the form of images, video, pictures, text, audio etc. This could be military records, surveillance videos, biological records, genomic data, research data etc. This data cannot be stored in the row and column format of the RDBMS.
- The volume of data stored in big data systems is huge. This data needs to be processed fast, which requires parallel processing. Parallel processing of RDBMS data would be extremely expensive and inefficient.
- Traditional databases are not built to store and process data in large volumes / size. Example: Satellite imagery for USA, Roadmaps for the world, all the images on Facebook.
- Data creation velocity – Traditional databases cannot handle the velocity at which large volumes of data are created. Example: 6,000 tweets are created every second; 510,000 comments are created every minute. Traditional databases cannot store or retrieve data at this velocity.
Test Strategy And Steps For Testing Big Data Applications
Database Testing Of Big Data Applications
- Data can flow into big data systems from various sources like sensors, IoT devices, scanners, CSV files, census information, logs, social media, RDBMSs, etc.
- The big data application will work with these data sets. This data may have to be cleaned and validated to ensure that correct data is used going forward.
- As this data will be huge, we will have to bring it into Hadoop (or a similar framework) where we can work with the data.
- Once the data is in Hadoop we will have to verify whether the data has been properly imported into Hadoop.
- We will have to test the correctness and completeness of this data (a minimal sketch of such a check follows this list).
- In order to work with Hadoop, you should be aware of the commands used in Hadoop.
- In order to validate the source data, you should have knowledge of SQL, since the source of the data could be an RDBMS.
- The big data application will work on the data in Hadoop and process it as per the required logic.
- Though our big data application processes the data in Hadoop, we will also want to validate that it was processed correctly as per the customer requirements.
- In order to test the application we use test data. The data that is available in Hadoop is huge and we cannot use all the data for testing. We select a subset of the data for testing, which we call test data.
- We will also have to run the same process on test data, as per customer requirements.
- Then we will compare these results with the results produced by the big data application to confirm that the application is processing the data correctly.
- In order to process the test data you will require some knowledge of Hive, Pig Scripting, Python and Java. You will develop scripts to extract and process the data for testing.
- You can think of the big data application as an application that the developers have written to process large volumes of data. For example: consider that you are working for Facebook and the developers have built a big data application where any comment that contains the phrase “Free Credit Card Offer” is marked as spam. This is an overly simplified example; real applications are more complex and involve identifying patterns in the data and making predictions using data science to differentiate spam comments from legitimate ones.
- The processed data is then stored in a data warehouse.
- After the data is stored in the data warehouse, it may be validated again to ensure that it aligns with the data that was generated by the big data application after processing.
- The data from the data warehouse is usually analyzed and depicted in a visual format so that Business Intelligence (BI) can be gained from it. Some organizations use BI tools from vendors like SAP and Oracle, while others use languages like R (with Shiny) to visualize the data.
- Once the data has been represented visually, it will have to be validated as well.
- Web services may be used in order to transfer the data from the data warehouse to the BI system. In such cases, the web services will also have to be tested and the tester should have knowledge of testing web services.
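As an illustration of the staging checks described above, here is a minimal Python sketch that compares a source extract against the copy ingested into Hadoop. It assumes the ingested data has been pulled back out of HDFS as a CSV file (for example with `hdfs dfs -get`); the file names and the `transaction_id` key column are hypothetical.

```python
# Minimal data staging validation sketch: compare record counts and keys
# between a source CSV extract and the copy ingested into HDFS.
# File names and the key column are hypothetical placeholders.
import csv

def load_keys(path, key_column):
    """Return the set of key values and the row count for a CSV file."""
    keys = set()
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            keys.add(row[key_column])
            count += 1
    return keys, count

source_keys, source_count = load_keys("source_extract.csv", "transaction_id")
hdfs_keys, hdfs_count = load_keys("hdfs_copy.csv", "transaction_id")

assert source_count == hdfs_count, "Row count mismatch after ingestion"
missing = source_keys - hdfs_keys
assert not missing, f"{len(missing)} records lost during ingestion"
print(f"Staging validation passed: {source_count} records verified")
```

In practice such a comparison would run against samples or aggregates (counts, checksums) rather than full extracts, since the full data set is too large to pull out of the cluster.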
Big data testing can thus be broadly divided into three phases:
- Data Staging Validation: Here we validate the data taken from various sources like sensors, scanners, logs, etc. We also validate the data that is pushed into Hadoop (or similar frameworks).
- Process Validation: In this step the tester validates that the data obtained after processing by the big data application is accurate. This also involves testing the accuracy of the data generated by MapReduce or similar processes (see the sketch after this list).
- Output Validation: In this step the tester validates that the output from the big data application is correctly stored in the data warehouse. They also verify that the data is accurately represented in the business intelligence system or any other target UI.
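To make process validation concrete, here is a minimal sketch that reuses the simplified spam rule from the Facebook example above: the tester independently re-applies the rule to a sample of test data and compares the labels against what the big data application produced. The file names and JSON record layout are hypothetical.

```python
# Process validation sketch: re-apply the business rule independently and
# compare against the application's output. File names and the JSON record
# layout ("id", "text", "label") are hypothetical placeholders.
import json

SPAM_PHRASE = "free credit card offer"

def expected_label(comment):
    # The tester's independent implementation of the spam rule.
    return "spam" if SPAM_PHRASE in comment["text"].lower() else "ok"

with open("test_sample.json") as f:
    sample = json.load(f)  # subset of comments selected as test data

with open("application_output.json") as f:
    actual = {c["id"]: c["label"] for c in json.load(f)}

mismatches = [c["id"] for c in sample if expected_label(c) != actual.get(c["id"])]
print(f"{len(sample) - len(mismatches)} of {len(sample)} records match")
assert not mismatches, f"Mismatched record ids: {mismatches[:10]}"
```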
Performance Testing of Big Data Applications
Big Data / Hadoop Performance Testing
- Data Loading And Throughput: In this test the tester observes the rate at which data is consumed from different sources like sensors and logs into the system, and the rate at which the data is created in the data store. In case of message queues, we test the time taken to process a certain number of messages (a timing sketch follows this list).
- Data Processing Speed: In this test we measure the speed with which the data is processed using MapReduce jobs.
- Sub-System Performance: In this test we measure the performance of various individual components which are part of the overall application. It may be beneficial to test components in isolation to identify bottlenecks in the application. This can include testing of MapReduce process, performance of queries etc.
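A rough sketch of how a data loading and throughput measurement might look. `load_batch` is a hypothetical stand-in for whatever ingestion call the system under test exposes; the timing and reporting logic is the part being illustrated.

```python
# Throughput measurement sketch: time a bulk load and report records/second.
# `load_batch` is a hypothetical placeholder for the real ingestion call.
import time

def measure_throughput(records, load_batch, batch_size=10_000):
    start = time.perf_counter()
    for i in range(0, len(records), batch_size):
        load_batch(records[i:i + batch_size])
    elapsed = time.perf_counter() - start
    return len(records) / elapsed  # records per second

# Example run against a do-nothing stub loader:
dummy_records = [{"id": i} for i in range(100_000)]
rate = measure_throughput(dummy_records, load_batch=lambda batch: None)
print(f"Loaded at {rate:,.0f} records/sec")
```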
Performance Testing Approach
- In order to begin performance testing, the big data cluster that is to be tested has to be set up.
- The load and jobs that will be executed as part of the performance test have to be identified and planned.
- Any custom scripts / clients that may be required have to be written.
- Run the performance test and study the results.
- If the results are not satisfactory and the system does not meet performance standards, then the components have to be optimized and the tests have to be run again.
- The previous step has to be repeated till the performance requirements are met (see the loop sketch below).
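The run-measure-optimize loop above can be expressed as a simple harness. This is only a sketch: `run_performance_test` and `apply_optimizations` are hypothetical hooks for the real test jobs and tuning steps, and the measurements here are simulated.

```python
# Sketch of the iterative tuning loop: run the test, compare against the
# requirement, optimize, and repeat until the target is met.
TARGET_RATE = 50_000.0  # required records/second (illustrative value)

simulated_rates = iter([42_000.0, 47_500.0, 51_200.0])  # stand-in measurements

def run_performance_test():
    # Placeholder: would execute the planned load and jobs on the cluster.
    return next(simulated_rates, 51_200.0)

def apply_optimizations(round_number):
    # Placeholder: would tune configuration (memory, parallelism, etc.).
    print(f"Tuning round {round_number}: adjusting cluster configuration")

for attempt in range(1, 6):  # cap the retries so the harness always ends
    rate = run_performance_test()
    print(f"Attempt {attempt}: {rate:,.0f} records/sec (target {TARGET_RATE:,.0f})")
    if rate >= TARGET_RATE:
        print("Performance requirement met")
        break
    apply_optimizations(attempt)
```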
Functional Testing of Big Data Applications
Roles and Responsibilities Of A Tester In Big Data Applications
- The tester should be able to work with unstructured data and semi-structured data. They should also be able to work with structured data in the data warehouse or the source RDBMS.
- Since the schema may change as the application evolves, the software tester should be able to work with a changing schema.
- Since the data can come from a variety of data sources and differ in structure, they should be able to develop the structure themselves based on their knowledge of the source.
- This may require them to work with the development teams and also with the business users to understand the data.
- For typical applications, testers can use a sampling strategy when testing manually, or an exhaustive verification strategy when using an automation tool. However, in the case of big data applications, since the data set is huge, even extracting a sample that represents the data set accurately may be a challenge.
- Testers may have to work with the business and development teams and may have to research the problem domain before coming up with a strategy.
- Testers will have to be innovative in order to come up with techniques and utilities that provide adequate test coverage while maintaining high test productivity.
- Testers should know how to work with systems like Hadoop, HDFS. In some organizations, they may also be required to have or gain basic knowledge of setting up the systems.
- Testers may be required to have knowledge of Hive QL and Pig Latin. They may also be called upon to write MapReduce programs in order to ensure complete testing of the application.
- Testing of big data applications requires significant technical skills, and there is a huge demand for testers who possess these skills.
Advantages Of Using Big Data / Hadoop
- Scalable : Big data applications can be used to handle large volumes of data. This data can be petabytes or more. Hadoop can easily scale from one node to thousands of nodes based on the processing requirements and data.
- Reliable : Big data systems are designed to be fault tolerant and automatically handle hardware failures. Hadoop automatically transfers tasks from machines that have failed to other machines.
- Economical : Use of commodity hardware along with the fault tolerance provided by Hadoop, makes it a very economical option for handling problems involving large datasets.
- Flexible : Big data applications can handle different types of heterogeneous data like structured data, semi-structured data and unstructured data. They can process data extremely quickly due to parallel processing.
Disadvantages Of Using Big Data / Hadoop
- Technical Complexity – The technical complexity involved in big data projects is significantly higher compared to normal projects. Each component of the system belongs to a different technology. The overheads and support involved in ensuring that the hardware and software for these projects run smoothly are equally high.
- Logistical Changes – Organizations that want to use big data may have to modify how data flows into their systems. They will have to adapt their systems to a constant flow of data rather than discrete batches. This could translate to significant changes to their existing IT systems.
- Skilled Resources – Testers and developers who work on big data projects need to be highly technical and skilled at picking up new technology on their own. Finding and retaining highly skilled people can be a challenge.
- Expensive – While big data promises use of low cost machinery to solve computing challenges, the human resources required in such projects are expensive. Data mining experts, data scientists, developers and testers required for such projects cost more than normal developers and testers.
- Accuracy of Results – Extracting the right data and accurate results from the data is a challenge. Example: Gmail can sometimes mark a legitimate email as spam. If many users mark emails from someone as spam, Gmail will start marking all emails from that sender as spam.
Hadoop Architecture
Big Data Tools / Common Terminologies
- Hadoop – Hadoop is an open source framework used for distributed processing and storage of large datasets using clusters of machines. It can scale from one server to thousands of servers. It provides high availability using cheap machines by identifying hardware failures and handling them at the application level.
- Hadoop Distributed File System (HDFS) – HDFS is a distributed file system which is used to store data across multiple low cost machines.
- MapReduce – MapReduce is a programming model for parallel processing of large data sets (see the word-count sketch after this list).
- Hive – Apache Hive is data warehouse software that is used for working with large datasets stored in distributed file systems.
- HiveQL – HiveQL is similar to SQL and is used to query the data stored in Hive. HiveQL is suitable for flat data structures only and cannot handle complex nested data structures.
- Pig Latin – Pig Latin is a high level language which is used with the Apache Pig platform. Pig Latin can be used to handle complex nested data structures. Pig Latin is statement based and does not require complex coding.
- Commodity Servers – When working with big data, you will come across terms like Commodity Servers. This refers to cheap hardware used for parallel processing of data. This processing can be done using cheap hardware since the process is fault tolerant. If a commodity server fails while processing an instruction, this is detected and handled by Hadoop. Hadoop will assign the task to another server. This fault tolerance allows us to use cheap hardware.
- Node – Node refers to each machine where the data is stored and processed. Big data frameworks like Hadoop allow us to work with many nodes. Nodes may have different names like DataNode, NameNode etc.
- DataNodes – These are the machines which are used to store and process the data.
- NameNodes – NameNode is the central directory of all the nodes. When a client wants to locate a file, it can communicate with the NameNode which will return the list of DataNodes servers where the file / data can be located.
- Master Nodes – Master nodes oversee storage of data and parallel processing of the data using MapReduce. They use the NameNode for data storage and the JobTracker for managing the parallel processing of data.
- JobTracker – It accepts jobs, assigns tasks and identifies failed machines.
- Worker Nodes – They make up the bulk of the virtual machines and are used for storing and processing data. Each worker node runs a DataNode and a TaskTracker, which is used for messaging with the master nodes.
- Client Nodes – Hadoop is installed on client nodes. They are neither master nor worker nodes, and are used to set up the cluster data, submit MapReduce jobs and view the results.
- Clusters – A cluster is a collection of nodes working together. These nodes can be master, worker or client nodes.
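To make the MapReduce entry in the list above concrete, here is the classic word count expressed as plain Python map and reduce functions, with the framework's shuffle step simulated by grouping intermediate pairs by key. Real Hadoop jobs are usually written in Java or run through Hadoop Streaming, so this is only an illustration of the programming model, not Hadoop code.

```python
# Word count in the MapReduce style: map emits (word, 1) pairs, the shuffle
# groups pairs by key, and reduce sums the counts for each word.
from collections import defaultdict

def map_phase(line):
    # Emit a (word, 1) pair for every word in the input line.
    for word in line.split():
        yield word.lower(), 1

def reduce_phase(word, counts):
    # Sum all counts emitted for a single word.
    return word, sum(counts)

lines = ["Big data needs big clusters", "Big data needs testing"]

# Shuffle step: group intermediate pairs by key, as the framework would.
grouped = defaultdict(list)
for line in lines:
    for word, one in map_phase(line):
        grouped[word].append(one)

results = dict(reduce_phase(w, c) for w, c in grouped.items())
print(results)  # {'big': 3, 'data': 2, 'needs': 2, 'clusters': 1, 'testing': 1}
```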
Big Data Automation Testing Tools
Automation testing tools for big data applications should:
- Allow automation of the complete software testing process
- Support database testing: since database testing is a large part of big data testing, the tool should track the data as it gets transformed from the source data to the target data after being processed through MapReduce jobs and other ETL transformations
- Be scalable, but at the same time flexible enough to incorporate changes as the application complexity increases
- Integrate with disparate systems and platforms like Hadoop, Teradata, MongoDB, AWS and other NoSQL products
- Integrate with DevOps solutions to support continuous delivery
- Provide good reporting features that help you identify bad data and defects in the system
How are you using big data in your organization? Please share your inputs in the comments.