“Big Data” is a term that has been around since the early 1990s. A commonly used definition comes from Gartner, which frames Big Data in terms of the “three Vs”: “Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimization.”1  Currently, the volumes of data used in Big Data applications are in the terabyte to petabyte range. In terms of velocity, the data are often available in real time. Varieties of data include text, images, audio, and video. Drivers for the emergence of Big Data include increased computational power and speed, gains in data storage such as cloud-based infrastructure and distributed file systems, and advances in data analytics and visualization technologies.

The set of applications using Big Data continues to grow. The information management industry is currently worth about $100 billion and is growing at roughly 10% annually, double the rate of the software industry as a whole2.  Government, healthcare, media, science, and even sports have uses for Big Data. For example, Big Data analytics is becoming a prevalent tool in winning elections; President Barack Obama can attribute some of his 2012 election success to it3.  By collecting data on voters’ interests, including the television shows they watch and the cost of advertising during those shows, campaigns can tailor their messaging to voters’ favourite issues.

Within Big Data, there are a number of subtopics, a few of which we will briefly review. These include Distributed Storage and Processing, Data Cubes, Data Warehouses, and Business Intelligence. A subsequent article will look more thoroughly at testing Data Warehouses.

A vast amount of data necessitates different approaches to storage and processing. For instance, one strategy is to distribute the data over a connected set of databases4.  The data are partitioned into smaller units, which may number in the hundreds of thousands or more, and these partitions are indexed to optimize searching and processing. Sitting on top of the partitions is a database of metadata that is used to locate the relevant partition efficiently.
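To make the lookup idea concrete, here is a minimal sketch in Python. The “metadata database” is reduced to an in-memory list of key ranges, and the partition identifiers are invented; a real system would hold this index in its own database sitting above hundreds of thousands of partitions.

```python
from bisect import bisect_right

# Each entry: (lowest key covered by the partition, partition identifier),
# kept sorted by key so the index can be binary-searched.
PARTITION_INDEX = [
    (0,       "partition_0000"),
    (100_000, "partition_0001"),
    (200_000, "partition_0002"),
    # ...hundreds of thousands more in a real deployment
]
_STARTS = [start for start, _ in PARTITION_INDEX]

def find_partition(key: int) -> str:
    """Return the identifier of the partition that should hold the given key."""
    idx = bisect_right(_STARTS, key) - 1
    if idx < 0:
        raise KeyError(f"no partition covers key {key}")
    return PARTITION_INDEX[idx][1]

print(find_partition(150_000))   # -> partition_0001
```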

Processing of the data then takes place at massive scale, in parallel, as many small, independent tasks distributed across the partitions.
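As a rough illustration of this map-and-combine pattern, the sketch below uses Python’s multiprocessing module to summarize a set of made-up partitions independently and then fold the partial results together. The partition names and the summarization step are placeholders.

```python
from multiprocessing import Pool

def summarize_partition(partition_id: str) -> int:
    # In a real job this would read the partition and compute a partial result;
    # here we simply pretend each partition contributes a count.
    return len(partition_id)  # placeholder computation

if __name__ == "__main__":
    partitions = [f"partition_{i:04d}" for i in range(1000)]

    # "Map": each worker handles one partition, independently of the others.
    with Pool() as pool:
        partial_results = pool.map(summarize_partition, partitions)

    # "Reduce": combine the partial results into a single answer.
    total = sum(partial_results)
    print(f"Combined result across {len(partitions)} partitions: {total}")
```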

Data cubes have applications in science and mathematics, but for our purposes, let us consider them in the Business Intelligence context. Data cubes are multidimensional arrays of data, designed to accommodate tracking and reporting of business measures of interest. With the data arranged in this way, business intelligence tools are used to “slice and dice” the data to give useful summary views and reports.
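Here is a small slice-and-dice example, assuming pandas is available; the dimensions (region, quarter, product), the sales measure, and the numbers are all invented for illustration.

```python
import pandas as pd

# A tiny cube over three dimensions (region, quarter, product) with one measure (sales).
sales = pd.DataFrame({
    "region":  ["East", "East", "West", "West", "East", "West"],
    "quarter": ["Q1",   "Q2",   "Q1",   "Q2",   "Q1",   "Q1"],
    "product": ["A",    "A",    "A",    "B",    "B",    "B"],
    "sales":   [100,    120,    90,     60,     40,     75],
})

# "Dice": a region-by-quarter summary view, aggregating over products.
cube = sales.pivot_table(index="region", columns="quarter",
                         values="sales", aggfunc="sum")
print(cube)

# "Slice": fix one dimension (quarter == Q1) and report sales by product.
q1_by_product = sales[sales["quarter"] == "Q1"].groupby("product")["sales"].sum()
print(q1_by_product)
```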

A broader collection of data is characteristic of a Data Warehouse. Data Warehouses draw data from several different sources, perhaps in different formats (databases, spreadsheets, flat files), and bring them together in such a way that reporting and analysis are possible. The idea is that the data in the warehouse are static, whereas the source systems are carrying out transactions continuously. Because the warehoused data are stable, a consistent snapshot can be taken, enabling the use of business intelligence tools for reporting and visualization.
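As a rough sketch of this consolidation, the Python snippet below pulls rows from a CSV flat file and from an operational SQLite database into a single snapshot table. The file names, table names, and columns are hypothetical.

```python
import csv
import sqlite3

# The warehouse: a single table holding a stable snapshot drawn from multiple sources.
warehouse = sqlite3.connect("warehouse.db")
warehouse.execute(
    "CREATE TABLE IF NOT EXISTS sales_snapshot (source TEXT, customer TEXT, amount REAL)"
)

# Source 1: a flat file exported by an upstream system.
with open("sales_export.csv", newline="") as f:
    for row in csv.DictReader(f):
        warehouse.execute(
            "INSERT INTO sales_snapshot VALUES ('csv', ?, ?)",
            (row["customer"], float(row["amount"])),
        )

# Source 2: an operational database that is still taking transactions.
source_db = sqlite3.connect("orders.db")
for customer, amount in source_db.execute("SELECT customer, amount FROM orders"):
    warehouse.execute(
        "INSERT INTO sales_snapshot VALUES ('orders_db', ?, ?)", (customer, amount)
    )

warehouse.commit()   # the snapshot is now stable and ready for reporting
```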

From a tester’s standpoint, as Big Data becomes central to more applications, we may need to augment our toolkit. A good understanding of databases, as well as the queries used to extract information from them, is essential.

Testers should also acquaint themselves with methods for assessing data quality in terms of its completeness, validity, accuracy and consistency. Testers may also need to familiarize themselves with database models, such as NoSQL, in addition to the more familiar relational databases.
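By way of illustration, the sketch below runs a few such checks with plain SQL through Python’s sqlite3 module, against the hypothetical snapshot table from the warehouse example above: a completeness check (missing values), a validity check (out-of-range amounts), and a consistency check (duplicated rows).

```python
import sqlite3

db = sqlite3.connect("warehouse.db")

checks = {
    # Completeness: rows with no customer value at all.
    "missing_customer": "SELECT COUNT(*) FROM sales_snapshot WHERE customer IS NULL",
    # Validity: sale amounts should never be negative.
    "negative_amounts": "SELECT COUNT(*) FROM sales_snapshot WHERE amount < 0",
    # Consistency: the same customer/amount pair loaded more than once.
    "duplicate_rows": """
        SELECT COUNT(*) FROM (
            SELECT customer, amount FROM sales_snapshot
            GROUP BY customer, amount HAVING COUNT(*) > 1
        ) AS dupes
    """,
}

for name, query in checks.items():
    (count,) = db.execute(query).fetchone()
    print(f"{name}: {count} suspect rows")
```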

With all its exciting possibilities, Big Data also presents serious testing challenges5.  For example, how do we set up a testing environment? Is it possible to create a viable, representative subset of the vast amount of data involved in our application so that it can be tested? Other challenging areas around Big Data are non-functional concerns such as security and performance. Testers may need to learn specialized tooling and undertake training to be able to tackle these challenges.
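On the question of building a viable test subset, one common option is to sample the data rather than copy it wholesale. The sketch below uses reservoir sampling so a huge extract never has to fit in memory; the file name and sample size are placeholders.

```python
import random

def sample_lines(path: str, k: int, seed: int = 42) -> list[str]:
    """Return k lines chosen uniformly at random from a (possibly huge) file."""
    random.seed(seed)            # a fixed seed keeps the test subset reproducible
    reservoir: list[str] = []
    with open(path) as f:
        for i, line in enumerate(f):
            if i < k:
                reservoir.append(line)        # fill the reservoir first
            else:
                j = random.randint(0, i)      # replace existing entries with
                if j < k:                     # decreasing probability
                    reservoir[j] = line
    return reservoir

subset = sample_lines("production_extract.csv", k=10_000)
print(f"Sampled {len(subset)} records for the test environment")
```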

References:

1 De Mauro, Andrea; Greco, Marco; Grimaldi, Michele (2016). “A Formal definition of Big Data based on its essential Features”. Library Review. 65: 122–135. doi:10.1108/LR-06-2015-0061.
2 “Data, data everywhere”. The Economist. 25 February 2010. Retrieved 9 December 2012.
3 Lampitt, Andrew. “The real story of how big data analytics helped Obama win”. Infoworld. Retrieved 31 May 2014.
4 “Testing Big Data in an Agile Environment”. Ministry of Testing. https://www.ministryoftesting.com/2013/06/testing-big-data-in-an-agile-environment/
5 “The Future of Testing”. PQA white paper. http://pqa.wpengine.com/white-papers/the-future-of-testing/

Jim Peers is currently Test Manager for the Integrated Renewal Program at the University of British Columbia and is an alumnus QA Practitioner from PLATO Testing.  Jim has more than sixteen years of experience in software development and testing in multiple industry verticals. After working in the scientific research realm for a number of years, Jim moved into software development, holding roles as tester, test architect, developer, team lead, project manager, and product manager. As a trusted technical advisor to clients, Jim has created test strategies, approaches, and plans for the most complicated of systems, and loves to mentor and assist testers on multiple projects.

https://www.linkedin.com/in/jim-peers-70977a6/, @jrdpeers