Approach to Big Data Testing – Part 1

The Internet is filled with information on what big data is, on its characteristics such as Volume, Variety, Velocity and Veracity, and on the tools used to capture, manage and process big data sets. However, there is limited content available on devising a test strategy for big data applications, or on how big data should be approached from a testing point of view.

The traditional tester wrote simple read/write queries against the database to store and retrieve data. As data volumes grew with business needs, newer technologies such as data warehousing demanded specialized skills, which created a whole set of designations within the tester community: “Database Testers”, “ETL Testers” and “Data Warehouse Testers”. Now, with the advent of big data, things are only going from complex to worse for the tester. What should the tester expect from big data? What kind of challenges does it put forth? Does the tester need new skills?
As mentioned at the beginning, the Internet offers plenty of definitions and explanations of what big data is, but very little information about testing it. In this article, let’s try to understand how traditional data processing differs from processing large data sets, and look at how testers can approach them. Traditional data processing also dealt with large data sets, but not as huge as what we have now.
As Wikipedia puts it, “The world’s technological per-capita capacity to store information has roughly doubled every 40 months since the 1980s; as of 2012, every day 2.5 exabytes of data were created.” That kind of size is not easy for an RDBMS to handle, and that is where frameworks like Apache Hadoop help. Mike Olson, the CEO of Cloudera, says, “The Hadoop platform was designed to solve problems where you have a lot of data – perhaps a mixture of complex and structured data – and it doesn’t fit nicely into tables. It’s for situations where you want to run analytics that are deep and computationally extensive, like clustering and targeting”. As we all know, reading from disk is much slower than reading from RAM (Random Access Memory), and traditional data processing relies on reading from disk.
This is not suitable when processing huge data sets. Hadoop helps here with HDFS (Hadoop Distributed File System), which lets you store large amounts of data across a cluster of machines. On top of HDFS, Hadoop provides an API for processing the stored data: Map-Reduce. The idea is that, since the data is stored in a distributed manner across nodes, it can also be processed in a distributed manner: each node processes the data stored on it, which avoids the performance degradation caused by moving the data over the network. The last step in the process is to extract the output of the Map-Reduce step and load it into downstream systems, which can be data warehouses or other systems that use the data for further processing.
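To make the Map-Reduce idea concrete, here is a minimal sketch in plain Python. It does not use the Hadoop API; it only imitates the flow on one machine. The sample data, the `NODE_BLOCKS` name and the function names are assumptions made up for illustration: each inner list stands for the block of lines stored on one node, the map phase runs on each block separately, and only the small intermediate pairs are brought together before the reduce phase.

```python
from collections import defaultdict

# Hypothetical sample data: each inner list stands for the lines
# stored on one node of the cluster.
NODE_BLOCKS = [
    ["big data testing", "testing big data"],
    ["data warehouse testing"],
]


def map_phase(lines):
    """Emit a (word, 1) pair for every word in the lines held by one node."""
    return [(word, 1) for line in lines for word in line.split()]


def shuffle(mapped_per_node):
    """Group the emitted values by key across all nodes."""
    grouped = defaultdict(list)
    for pairs in mapped_per_node:
        for key, value in pairs:
            grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the grouped values to get one total per word."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(node_blocks):
    # Each block is mapped on its own, mimicking per-node processing.
    mapped = [map_phase(block) for block in node_blocks]
    return reduce_phase(shuffle(mapped))


if __name__ == "__main__":
    print(word_count(NODE_BLOCKS))
```

In a real cluster the map calls would run in parallel on the machines that hold the data, and the framework would handle the grouping step; the sketch only shows why the raw data itself does not need to travel over the network.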
Now that we understand the processing approach behind large data sets, let’s see what the testing approach needs to be.

