“Data isn't information, any more than fifty tons of cement is a skyscraper” - Clifford Stoll
What is Test Data Management (TDM)?
TDM consists of managing the provisioning of required test data efficiently and effectively, while at the same time ensuring compliance with regulatory and organizational standards. Below are some building blocks of TDM:
· Data Subset – a process of slicing a portion of the production database and loading it into the test DB
· Data Masking – a process of masking the sensitive fields in the complete data set
· Data Archive – a process of storing a data snapshot so it can be restored later for a given build / release / cycle
· Test Data Refresh – a process of loading / refreshing the test data with the latest data from production
· Test Data Ageing – a process required for time-based testing: depending on the scenario under test, the relevant dates are shifted backward (backdated) or forward
· Gold Copy – the baseline version of data that can be used for future releases
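As a concrete illustration of test data ageing, the sketch below shifts the date fields of a record backward or forward in time. The `age_record` helper and the order fields are illustrative, not part of any particular TDM tool.

```python
from datetime import date, timedelta

def age_record(record, date_fields, days):
    """Return a copy of `record` with the given date fields shifted by
    `days`: negative to backdate, positive to front-date."""
    aged = dict(record)  # shallow copy so the original stays untouched
    for field in date_fields:
        aged[field] = record[field] + timedelta(days=days)
    return aged

# Backdate an order by 90 days to test an "overdue invoice" scenario.
order = {"id": 1, "order_date": date(2024, 6, 1), "due_date": date(2024, 7, 1)}
aged = age_record(order, ["order_date", "due_date"], -90)
```

The same helper front-dates records by passing a positive number of days, which covers both directions mentioned above.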
Why do we need TDM?
Research shows that projects cancelled due to poor data quality are 15 percent more costly than successful projects of the same size and type. Over 10% of the defects raised in production are due to data issues that could easily have been caught during the various testing phases.
· To create “right-sized” test databases that accurately reflect E2E business processes
· To enable developers to correct defects early in the life cycle
· To enable execution of comprehensive non-functional tests
· To create realistic and manageable test databases by applying data sub-setting techniques
· To safeguard customer privacy/security by applying data privatization techniques
· To quickly and easily refresh data in test environments
· To empower test teams to select and book test data sets
· To keep the data used for any reported bug available so that the bug can be reproduced
What are some of the indicators that your project needs TDM?
· Testing deadlines slipping due to data-related outages and/or data synchronization issues
· Testers wasting more time preparing test data than on actual testing
· Testers depending heavily on BAs to provide meaningful test data
· High risk and penalties associated with not adhering to compliance and/or data privacy laws
· Lots of false defects due to data related issues
· Testers complaining about complexity in creating test data for consumption
· Test data as voluminous as production, hindering performance
· Test data not being reused and instead created from scratch every time (using the same process)
· Big delays in providing test data while waiting for another system to be ready
· Teams complaining about managing test data as projects grow
· Outsourced and/or off-shored testing services having access to the customer’s PII data
What are some of the major activities of TDM?
· Acquiring an initial understanding of the test data landscape like a list of test regions, applications, types of data stores, frequency of data requests for each application etc.
· Carrying out a data profiling exercise for each individual data store across the enterprise, covering:
o Data types
o Data dependencies
o Data sources and providers
o Tools for data extraction, masking, creating, loading and so on
o Who needs the test data: a tester, a developer, or a vendor?
o When to refresh the test data and when to clean it up
o In which phase of the test cycle the data will be used: unit, integration, system, or UAT?
· Assigning a version number to existing data
· Identifying test region(s) where data needs to be loaded or refreshed
· Restoring “used” data to its original “unused” state
· Carrying out masking
· Test data preparation
o Cloning production databases
o Generating synthetic data
o Sub setting production data
· Distributing unused data from other projects
· Loading the data dump (masked or unmasked) into the target region
· Taking a backup of the new data (both databases & files) once the data is set up
· Assigning a version number to the backup and cataloging it with a proper description
· Refreshing with data dumps (production slice or other regions)
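The backup versioning and cataloging steps above can be sketched as a simple in-memory catalog that assigns the next version number and records a checksum for later verification. The `catalog_snapshot` helper and its fields are hypothetical, not from any specific tool.

```python
import hashlib
from datetime import datetime, timezone

def catalog_snapshot(catalog, dump_bytes, release, cycle, description):
    """Register a data dump in the catalog: assign the next version number
    and store a SHA-256 checksum so the backup can be verified on restore."""
    entry = {
        "version": len(catalog) + 1,
        "release": release,
        "cycle": cycle,
        "description": description,
        "sha256": hashlib.sha256(dump_bytes).hexdigest(),
        "created": datetime.now(timezone.utc).isoformat(),
    }
    catalog.append(entry)
    return entry

catalog = []
entry = catalog_snapshot(catalog, b"masked dump bytes", "R2.1", "SIT-3",
                         "Gold copy after masking")
```

A real catalog would live in a database or version-control system, but the idea is the same: every dump gets a version, a description, and an integrity check.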
Test Data Management Challenges
· Data Requirements
o How to synchronize and share test data among multiple applications and teams?
o How to resolve contention of environments?
o How to analyze existing data if it has not been profiled properly?
o How to handle sudden and immediate requests for test data during test execution?
o How to ensure proper data distribution so as to prevent redundant or unused data?
o How to ensure data reuse?
· Data Validity & Consistency
o How can it be ensured that the data has not ‘aged’ and has not become obsolete?
o How are you planning to refresh test data on a regular basis to avoid poor data quality and data integrity?
o How to manage complex and heterogeneous systems, coupled with different file formats and multiple touch points?
o What is your strategy for proper versioning of data?
o How to enable traceability across the end-to-end business process?
o How to maintain traceability between test data, test cases, and business requirements?
· Data Privacy
o How to mask sensitive personal information before migrating it to test environment(s)?
o Are you aware about different government mandates and regulations in place that stipulate the data must be masked, de-identified or encrypted?
o How to enable auditing of data?
· Data Selection & Subsetting
o How to plan a smaller subset of data in a scaled-down, non-production environment without risking test data coverage?
o How to plan subset of data in different format for different teams (DW, Performance, Functional, System etc.) without resulting in long test cycles?
· Data Storage & Safety
o Is your company ready for high storage, license and maintenance cost when copies of full production data are required in a test environment?
o How many test environments require copies of full production data?
o What is the policy for version control, access-security and backup mechanisms?
· Data Refresh
o How to manage impact of data refresh on ongoing projects?
o DBA-like skills are required for the team managing TDM
o Is there a separate team for data engineering, data provisioning, data mocking, etc.?
o Managing & maintaining referential integrity & data quality during data generation
o What is the time taken in copying huge volume of production data to different environments?
o How to strategize test data identification, extraction and conditioning?
o Coordination with multiple stakeholders
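One common answer to the data privacy questions above is deterministic (consistent) masking: the same input always produces the same masked value, so related records stay joinable across tables after masking. A minimal sketch with hypothetical field names follows; production tools use stronger, often format-preserving, techniques.

```python
import hashlib

def mask_value(value, secret):
    """Deterministically pseudonymize a value: the same input plus the same
    secret always yields the same masked output."""
    return hashlib.sha256((secret + value).encode()).hexdigest()[:12]

def mask_record(record, sensitive_fields, secret):
    """Mask only the sensitive fields, leaving everything else intact."""
    return {k: (mask_value(v, secret) if k in sensitive_fields else v)
            for k, v in record.items()}

customer = {"id": "C100", "name": "Alice Smith", "city": "Pune"}
order = {"order_id": "O1", "customer_name": "Alice Smith", "amount": 250}

m_cust = mask_record(customer, {"name"}, secret="s3cret")
m_order = mask_record(order, {"customer_name"}, secret="s3cret")
```

Because the masking is deterministic, the masked customer name in the customer table still matches the masked name in the order table, which keeps complete business objects consistent.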
Test Data Management strategy
Quality data is a must for testing business functionality in the test environment. However, managing the quality of data is often challenging due to complex relationships, limited infrastructure, sensitivity of data, and the lack of data conforming to business rules. A good test data management strategy not only ensures greater development and testing efficiency but also helps organizations identify and correct defects early in the development process, when they are cheapest and easiest to fix. Any test data management strategy must efficiently deliver a steady supply of relevant test data to support ever-tightening development cycles while avoiding testing bottlenecks.
· Gathering & Analyzing test data
o Does relevant production data exist that can be used as test data?
o Test cases not covered by production data must be covered by newly created test data
· Data Generation
o Have you outlined a set of criteria to automatically generate the quality of data required?
o Is the generated data reusable, or does it need to be generated every time?
o Is the data generated from scratch or copied as a subset from production?
· Data de-identification
o Mask corporate, client, employee, etc. information
o Supports compliance with government and industry regulations
o Mask complete business objects consistently (e.g. Customer Order)
o Who will have access to this data? All internal team members or vendors doing testing?
o Does the data need to be encrypted?
· Data Planning
o Capture E2E business process and the associated data for the testing
o How to select a subset of data? How do you ensure the selected data is relevant?
o Do we need 5x data for stress environment?
o If cloning or migration of production data to test environments is required, should we clone the full database or a portion (e.g. 60%)? What should the periodicity of migration / cloning be?
o What is the rate of change in the production database and the volume of application changes?
· Subset production data from multiple data sources
o Subsetting creates realistic test databases small enough to support rapid test runs, but large enough to reflect the variety of production data
o Create test data to force error and boundary conditions
· Data Reuse
o Have you labeled test data to correlate them to specific test cases?
o Are test data labeled for release / build / cycle?
o Can we categorize test data according to different testing stages like functional, stress?
· Data Maintenance
o What should the schedule and frequency of refreshing the test data be?
o What is your plan for storing the data?
o How often is it migrated to the test environment?
· Data Refresh
o Accommodate changing test requirements
o Is it possible to automate data refresh?
· Data Auditing
o Can you trace the workflow from end to end?
o Can you analyze the data from audit logs, and is it fit for purpose?
· Cleaning up test environment post testing completion
o How and when should test data be cleaned up after testing completes?
o Are there any instances where altered test data cannot be cleaned up?
· Automate test data result comparison
o Automate identification of data anomalies and inconsistencies
· Use of central repository with version control
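The automated test data result comparison mentioned above can be illustrated with a small key-based diff that flags missing rows, unexpected rows, and field-level mismatches. The `compare_datasets` helper is hypothetical, not a tool API.

```python
def compare_datasets(expected, actual, key):
    """Compare two record sets by primary key and report anomalies:
    missing rows, unexpected rows, and field-level mismatches."""
    exp = {r[key]: r for r in expected}
    act = {r[key]: r for r in actual}
    report = {
        "missing": sorted(exp.keys() - act.keys()),      # in expected only
        "unexpected": sorted(act.keys() - exp.keys()),   # in actual only
        "mismatched": {},
    }
    for k in exp.keys() & act.keys():
        diffs = {f: (exp[k][f], act[k][f])
                 for f in exp[k] if exp[k][f] != act[k].get(f)}
        if diffs:
            report["mismatched"][k] = diffs
    return report

before = [{"id": 1, "status": "OPEN"}, {"id": 2, "status": "CLOSED"}]
after = [{"id": 1, "status": "PAID"}, {"id": 3, "status": "OPEN"}]
report = compare_datasets(before, after, key="id")
```

Run automatically after each test cycle, a report like this surfaces data anomalies and inconsistencies without manual eyeballing.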
What are some pros and cons of cloning production databases?
· Pros: It is relatively simple to implement
· Cons:
o Expensive in terms of hardware, license and support cost
o Time consuming – Increases the time required to run test cases due to large data volumes
o Not agile: Developers, testers and QA staff can’t refresh the test data
o Inefficient: Developers and testers can’t create targeted test data sets for specific test cases or validate data after test runs
o Not collaborative between DBA and testing teams
o Not scalable across multiple data sources or applications
o Laborious: Production systems are typically large
o Risky: Nonproduction environments might be compromised or misused (developers, testers and QA staff need realistic data to do their jobs—but they do not have a valid business reason to access sensitive data such as corporate secrets, revenue projections or customer information)
What are some challenges of using Production Data in a Test Environment (Production Cloning)?
· Data security is one of the most crucial challenges, as production data can contain a lot of sensitive information like real customer details, vendor names, etc. It can be overcome by data masking.
· The data volume that needs to be dealt with is huge. Think about 100K customers doing 5 transactions per hour each: that is 500K transactions per hour, or roughly 12 million transactional records added in one day. Just imagine the scale of data that needs to be loaded into the test environment. It can be overcome by data subsetting.
· Data can come from various sources like flat files, different relational databases, excel, etc. and can be in various formats. Maintaining data relationships and data integrity is another challenge.
· Production cloning might force production-like infrastructure, which means higher costs
· The additional cost of storing production data (e.g. 50TB) in different test environments
· Increased load time from production to the test environment leaves less time for real testing
What are some pros and cons of generating synthetic data?
· Pros: Safe, as no real production data is used
· Cons:
o Resource-intensive: Requires a huge commitment from highly skilled DBAs with deep knowledge of the underlying database schema, as well as knowledge of implicit relationships that might not be formally detailed in the schema
o Tedious: DBAs must intentionally include errors and set boundary conditions within the synthetic data set to ensure a robust testing process, which adds time to the test data creation process
o Challenging: Despite the time and effort put forth by the DBA to generate synthetic test data, testers find it challenging to work with because synthetic test data doesn’t always reflect the integrity of the original data set or retain the proper context
o Time-consuming: Process is slower and can be error-prone
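The point about intentionally including errors and boundary conditions can be shown with a toy generator that seeds such cases into an otherwise random synthetic data set. All names, field sizes, and limits here are made up for illustration.

```python
import random

def generate_customers(n, seed=42):
    """Generate n synthetic customer rows, deliberately mixing in boundary
    and error cases (empty names, extreme values) for robust testing."""
    rng = random.Random(seed)  # fixed seed makes the data set reproducible
    rows = [
        {"id": 0, "name": "", "credit_limit": 0},                 # empty / zero
        {"id": 1, "name": "X" * 255, "credit_limit": 2**31 - 1},  # max length / max int
    ]
    for i in range(2, n):
        rows.append({
            "id": i,
            "name": f"Customer-{i}",
            "credit_limit": rng.randint(100, 50_000),
        })
    return rows

customers = generate_customers(100)
```

Seeding the generator keeps the data reproducible across runs, which makes failures easier to reproduce, though it does not solve the deeper problem of matching the integrity and context of real production data.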
What are some pros and cons of Subsetting production databases?
· Pros: Less expensive compared to cloning or generating synthetic test data
· Cons: Skill-intensive: without an automated solution, it requires highly skilled resources to ensure referential integrity and protect sensitive data
What are some challenges in Data Subsetting?
· Maintaining referential integrity is the biggest challenge. Just imagine fetching only 100 customer order records out of 1 million without losing any context.
· Maintaining data integrity of the subset of the data. Just imagine if the customer records are in an Oracle database but the customer order records are in SQL Server.
· Maintaining data relationships across multiple sources. For example, a vendor might provide a data feed in flat file format for all customers’ orders.
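The referential integrity challenge above can be sketched for a three-table customer / order / order-item model: pick a set of customers, then pull every row that references them. The tables and the "first n customers" selection rule are illustrative only.

```python
def subset_with_integrity(customers, orders, order_items, n_customers):
    """Keep the first n customers and every row that references them,
    so the subset preserves referential integrity across all three tables."""
    kept_customers = customers[:n_customers]
    cust_ids = {c["id"] for c in kept_customers}
    kept_orders = [o for o in orders if o["customer_id"] in cust_ids]
    order_ids = {o["id"] for o in kept_orders}
    kept_items = [i for i in order_items if i["order_id"] in order_ids]
    return kept_customers, kept_orders, kept_items

# Tiny synthetic tables: 5 customers, 10 orders, 20 order items.
customers = [{"id": c} for c in range(1, 6)]
orders = [{"id": o, "customer_id": (o % 5) + 1} for o in range(1, 11)]
items = [{"id": i, "order_id": (i % 10) + 1} for i in range(1, 21)]

sub_c, sub_o, sub_i = subset_with_integrity(customers, orders, items, 2)
```

Real subsetting tools do the same traversal over foreign keys declared in the schema (and over implicit relationships a DBA must supply), including across heterogeneous sources.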
Key features of TDM tool
TDM is about automating the provisioning of masked and synthetically generated data to meet the needs of test, development & QA teams. TDM is needed to minimize the risk of a data breach, and it helps in using production data safely in test or development environments. TDM can be deployed on premises, in the cloud, or via hybrid cloud configurations. Some of the tools in the TDM space are Datamaker, Optim, HP TDM etc. Key features of a TDM tool should be:
· Automatic discovery of sensitive data (locations) across databases
· Ability to create synthetic data where production data can’t be used or doesn’t exist
· Ability to connect to distributed databases
· Functionality that conformance and compliance teams can verify
· Capability of masking data in place or while copying to test, support or outsourced environments
· Provisioning for smaller data-set requirements
· Support for packaged applications
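Automatic discovery of sensitive data can be approximated by scanning sample rows with regular expressions. The patterns below (email address, US SSN format) are simple examples; a real TDM tool detects far more data classes and far more robustly.

```python
import re

# Illustrative patterns only: email addresses and US SSN-style identifiers.
PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def discover_sensitive_columns(rows):
    """Scan sample rows and report which columns look sensitive, as a crude
    stand-in for a TDM tool's automatic discovery feature."""
    hits = {}
    for row in rows:
        for col, value in row.items():
            for label, pattern in PATTERNS.items():
                if isinstance(value, str) and pattern.search(value):
                    hits.setdefault(col, set()).add(label)
    return hits

sample = [
    {"id": "1", "contact": "alice@example.com", "note": "ok"},
    {"id": "2", "contact": "bob@example.org", "note": "SSN 123-45-6789"},
]
found = discover_sensitive_columns(sample)
```

The output maps column names to the kinds of sensitive data spotted in them, which is exactly the input a masking step needs.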