This article was first published on Towards Data Science - Medium.
Assume you must test 100TB of unstructured, unindexed data with no cache attached. Can you feel the panic rising at all the things that could go wrong with such a project? Are bottlenecks and slow processing the first things that come to mind? Add unclean data, unknown errors, transmission faults, and the need to ensure that operations are applied to the entire volume, and you are still not even close to what Big Data testing means.
The V’s of Big Data
Working with Big Data reveals that testing is different compared to regular software. The challenges arise from the very attributes of data. Known as the three Vs, these are volume, velocity, and variety, often complemented with variability and value. Each of them poses specific challenges, and they can also create more problems through their synergies.
The definition of big is not restricted to a specific number. Some projects measure Big Data in gigabytes, while others deal with petabytes or even zettabytes, depending entirely on their scope. In practice, any amount that doesn't fit on a single machine for analytic purposes is considered big. To overcome this problem, the data is distributed across multiple machines by frameworks such as Hadoop. Without parallel processing, it would be impossible to go through the entire set. Just imagine having to filter the entire database of Facebook posts for a keyword in a matter of seconds.
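The divide-and-conquer idea behind MapReduce can be sketched in a few lines. The toy example below is not Hadoop itself, just an illustration of the map and reduce steps, with a thread pool standing in for a cluster and invented sample data:

```python
from concurrent.futures import ThreadPoolExecutor

def count_keyword(chunk, keyword="outage"):
    """Map step: count keyword hits within one partition of the data."""
    return sum(keyword in line.lower() for line in chunk)

def parallel_keyword_count(lines, workers=4, chunk_size=1000):
    """Split the data into partitions, map over them concurrently, then reduce."""
    chunks = [lines[i:i + chunk_size] for i in range(0, len(lines), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_counts = pool.map(count_keyword, chunks)
    return sum(partial_counts)  # reduce step: combine the partial results

posts = ["Outage reported in region eu-1", "All systems nominal"] * 5000
print(parallel_keyword_count(posts))  # 5000
```

A real cluster applies the same pattern at scale, shipping the map function out to the machines that hold each partition of the data instead of moving the data itself.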
It’s not only a matter of volume but also of speed. For example, every minute almost half a million tweets and as many Facebook posts are published. The same applies to RFID readings and data coming from IoT devices. Designing a test algorithm that can handle these updates in real time is one of the toughest challenges of Big Data testing. This pain point can also be addressed through distributed computing and parallelization, and the methods used should focus on performance.
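As a toy illustration of testing against velocity, the sketch below (plain Python, with an invented class name and simulated timestamps) keeps a sliding window of event timestamps so a test can assert that ingestion keeps up with the expected rate:

```python
from collections import deque

class RateMonitor:
    """Track event timestamps in a sliding window to check real-time throughput."""
    def __init__(self, window_seconds=60):
        self.window = window_seconds
        self.events = deque()

    def record(self, timestamp):
        self.events.append(timestamp)
        # Evict events that have fallen out of the window.
        while self.events and self.events[0] <= timestamp - self.window:
            self.events.popleft()

    def rate(self):
        """Number of events currently inside the window."""
        return len(self.events)

monitor = RateMonitor(window_seconds=60)
for t in range(120):        # one simulated event per second for two minutes
    monitor.record(t)
print(monitor.rate())       # 60 – only the last minute of events remains
```

A velocity test would feed such a monitor from the live ingestion path and fail whenever the observed rate drops below the expected influx.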
Variety & Variability
Big Data can’t be put in a data frame. Its lack of homogeneity and standardization requires new ways of retrieving, querying, and testing it. Formats as varied as text, images, sound, and XML each demand different validation methods to prevent errors from propagating from one step to the next.
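One common way to keep format-specific checks apart, sketched below for just two of the formats mentioned (the function and table names are illustrative), is a dispatch table that routes each record to the right validator before it moves to the next stage:

```python
import json
import xml.etree.ElementTree as ET

def validate_json(payload):
    """A record passes only if it parses as well-formed JSON."""
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False

def validate_xml(payload):
    """A record passes only if it parses as well-formed XML."""
    try:
        ET.fromstring(payload)
        return True
    except ET.ParseError:
        return False

# Dispatch table: each format gets its own validation method.
VALIDATORS = {"json": validate_json, "xml": validate_xml}

def validate(record_format, payload):
    checker = VALIDATORS.get(record_format)
    if checker is None:
        raise ValueError(f"No validator registered for {record_format!r}")
    return checker(payload)

print(validate("json", '{"user": 1}'))    # True
print(validate("xml", "<post>hi</post"))  # False – malformed closing tag
```

New formats then only require registering another entry in the table, rather than touching the pipeline itself.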
Another V related to variety is the variability of data. Not all of it is created equally, or at fixed intervals, so a data set can contain missing values. This renders the classic analytical tools ineffective and requires different approaches.
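A first line of defense against variability, sketched here in plain Python with made-up field names, is simply profiling how often required fields are absent before any classic tooling is applied:

```python
def missing_value_report(records, required_fields):
    """Count how often each required field is absent or None across a data set."""
    missing = {field: 0 for field in required_fields}
    for record in records:
        for field in required_fields:
            if record.get(field) is None:
                missing[field] += 1
    return missing

sensor_readings = [
    {"device": "a1", "temp": 21.5, "ts": 1000},
    {"device": "a2", "temp": None, "ts": 1060},   # sensor dropped a reading
    {"device": "a3", "ts": 1120},                 # field missing entirely
]
print(missing_value_report(sensor_readings, ["device", "temp", "ts"]))
# {'device': 0, 'temp': 2, 'ts': 0}
```

Such a report tells the tester which fields need imputation or special handling before records are fed to tools that assume complete rows.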
Evaluating the potential applications of data is necessary to plan out investments in it. Since data is considered a new currency, each data set should be evaluated. The same data can be used differently and combined with other sets to get new patterns and insights. Also, adequately validated records from one company could become a new revenue stream by selling them to other organizations.
Testing Types and Prioritizing the Vs
Each type of data requires a different testing method, adapted to the V that matters most for it:
- Data ingestion testing — applicable to databases, files, and near real-time records. High priority needs to be given to variety in the case of file-based data, and to velocity when dealing with a large influx of records.
- Data migration testing — absent in real-time processing; here, the priority is given to volume.
- Data integration testing — focuses on identifying inconsistencies. The highlight is on variability, as data coming from different sources needs to be fed into a single repository.
- Data homogenization testing — since the vast majority of big data is unstructured or semi-structured, variety imposes the need to create transformation rules.
- Data standardization testing — volume is the most important feature here, ensuring that all data follows requirements and complies with regulations.
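For data migration testing in particular, where volume dominates, a reconciliation check might compare record counts plus an order-independent fingerprint of a key field on both sides. The sketch below is one way to do this in plain Python; the field names are assumptions:

```python
import hashlib

def fingerprint(records, key_field):
    """Order-independent fingerprint: record count plus XOR of per-key hashes."""
    acc = 0
    for record in records:
        digest = hashlib.sha256(str(record[key_field]).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
    return len(records), acc

def migration_ok(source, target, key_field="id"):
    """Migration passes only if both volume and content fingerprints match."""
    return fingerprint(source, key_field) == fingerprint(target, key_field)

source = [{"id": i} for i in range(1000)]
target = list(reversed(source))            # same rows, different order
print(migration_ok(source, target))        # True
print(migration_ok(source, target[:-1]))   # False – one record lost in transit
```

Because XOR is commutative, the check tolerates reordering during migration while still catching any dropped or duplicated record.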
Technical Expertise Requirements
A simple Excel sheet is no longer a proper environment for Big Data testing. Not even emulators are enough. Only a dedicated environment such as Hadoop, which implements MapReduce and supports ETL process validation, can capture all the complex aspects of this procedure. In the ETL process, data is extracted from its original location (data marts and data warehouses) as it comes, transformed into the file type or format needed for the analysis, and loaded into the new database. Testing focuses on the quality of each step to ensure no information is lost along the way.
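The per-step quality checks can be sketched as follows. This is an illustrative toy pipeline, not Hadoop's actual API, showing how a test can assert that every record entering a step is accounted for when it exits, with bad rows routed to a rejects pile instead of silently dropped:

```python
def extract(raw_lines):
    """Extract: parse CSV-like lines into records."""
    return [dict(zip(["id", "value"], line.split(","))) for line in raw_lines]

def transform(records):
    """Transform: cast types; route bad rows to rejects instead of dropping them."""
    clean, rejects = [], []
    for record in records:
        try:
            clean.append({"id": int(record["id"]), "value": float(record["value"])})
        except (ValueError, KeyError):
            rejects.append(record)
    return clean, rejects

raw = ["1,3.5", "2,oops", "3,7.0"]
records = extract(raw)
clean, rejects = transform(records)

# Reconciliation checks: nothing may silently disappear between steps.
assert len(records) == len(raw)
assert len(clean) + len(rejects) == len(records)
print(len(clean), len(rejects))  # 2 1
```

The same count-in-equals-count-out assertion would be repeated at the load step, so any loss can be pinned to a specific stage of the pipeline.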
Considered a nice-to-have or a shortcut in traditional software testing, automation becomes a mandatory requirement for Big Data testing. As described by A1QA, it offers a shorter time-to-market and improves the quality of the product. Automation ensures that the testing process covers the entire range of possibilities, not just a sample.
This change doesn’t come without additional trouble, such as the need to hire experts and to manage more software. While manual testers require little programming background and can be trained in a matter of weeks, test automation engineers need years of experience behind them.
Although virtualization offers a suitable alternative to a real-world release, it comes with specific challenges related to the latency of the virtual machine. When too many virtual images are created, performance suffers. Failure to manage the virtualization process can lead to increased costs and even security risks. However, advantages such as scalability, elasticity, cost-effectiveness, and the ability to create a sandbox environment for applications recommend virtualization as another fundamental V of Big Data.
When compliance requirements are added to the mix, it could mean that testing should also include a careful monitoring of the virtual machine’s logs.
Costs & Infrastructure
Since the necessary expertise of Big Data testers considerably exceeds that of manual testers, staffing costs will drive up the budget. On the bright side, if done right, test automation should reduce the number of man-hours needed considerably, cutting costs in the long run.
Also, the necessary infrastructure can have a significant impact on the budget if it is not implemented through cloud solutions.
Big Data testing differs from regular software evaluation in that it focuses less on functionality and more on the quality of the data as it flows through the process. The most significant contribution of Big Data testing to software development will probably be new ways to make sense of large data volumes. Another area that will gain visibility is optimization, as data needs to be processed accurately in real time and current architectures are not fit to handle the volumes predicted for the years to come. Test automation will most likely benefit greatly from these advancements, since Big Data practices could then be replicated in regular software testing, bringing it increased speed and accuracy.
By: Sophia Brooke