Towards Quantifying Data Quality Costs

Won Kim, Cyber Database Solutions, Austin, Texas
Byoungju Choi, Department of Computer Science and Engineering, Ewha Women’s University, Seoul, Korea



Abstract

Today most organizations run their daily operations using the data at their disposal. However, the vast majority of organizations do not have adequate processes and tools to maintain high quality operational data at all times. One of the key reasons for this is a lack of appreciation of the damages that low quality data can inflict on an organization, and of the cost of ensuring high quality data. This article provides a basis for quantifying, in monetary terms, both the cost of low quality data and the cost of ensuring high quality data. A comparison of the two costs can be a simple and compelling basis for an organization to determine the extent of the effort it must expend to ensure the quality of its operational data.


1 TYPES OF OPERATIONAL DATA

To provide a basis for understanding the costs of both low quality data and ensuring high quality data, we need to understand the types of data that organizations use in running their daily operations. There are at least five types:

  • “Front-office” data
    Front-office data is data that is used in running primary daily operations of an organization; for example, customer account data in a bank to support daily debit-credit transactions.
  • “Back-office” data
    Back-office data is data that is used for analyses in support of strategic and tactical decision-making by an organization; for example, monthly or quarterly customer transaction data to support profit-loss, inventory, and merchandising analyses by a nationwide retail chain. Often, back-office data consists of data extracted and transformed from front-office data. As such, low quality front-office data propagates to back-office data; and when low quality data is detected in front-office data, the part of the back-office data that was derived from it must be repaired. Back-office data also includes web log data that captures website visitors’ navigation patterns.
  • Archival data
    Archival data is data that has been transferred from hard disk to tertiary storage devices such as CDs, tapes, floppy disks, etc., for backup purposes and to support mobile users. Again, low quality data propagates from secondary storage to tertiary storage; and when low quality data is detected in secondary storage, the part of the archival data that originated from it must be repaired.
  • Metadata
    Metadata is data about data, and includes such data as the system catalogs in relational databases. It also includes the data dictionary, which contains such information as the versions of software that work together.
  • Control data
    Control data is data that data-processing software uses for proper operation and control; for example, the access control list of ordinary users and their privileges, system access audit trail, etc. Control data is usually managed by privileged users such as system administrators and database administrators.

Different types of operational data have different uses for an organization and, when of low quality, cause different types of damage. Some data, such as web log data and the system access audit trail, is automatically generated and maintained, and therefore is normally not susceptible to errors, while operational data entered by people is subject to input errors of various types. Further, errors in data, such as missing data, wrong data, etc., can propagate, thereby complicating the tasks of detecting and repairing errors.


2 COST OF LOW QUALITY DATA

There have been limited efforts to systematically understand the effects of low quality data. These efforts have been directed at investigating the effects of data errors on computer-based models such as neural networks, linear regression models, rule-based systems, etc. [Ballou et al 1987] [Kauffman et al 1993]. In practice, low quality data can bring monetary damages to an organization in a variety of ways. As observed earlier in [Kim 2002], the types of damage it can cause depend on the nature of the data, the nature of the uses of the data, the types of responses (by customers or citizens) to the damages, etc. For businesses the damages all boil down to financial losses, and for the government they boil down to loss of trust and waste of taxpayers’ money. The types of damages include at least the following.

  1. Loss of revenue
    Organizations may actually lose money if they under-charge customers or citizens based on incorrect account information, such as the amount of charges patients incurred during hospital stays, the license fees and taxes citizens owe the government, etc. The shortfall in revenue can be determined manually on a periodic basis, and the historical amount of shortfall should be used as a guide in determining the level of effort to expend on preventing errors in data. In the case of one major hospital in the US, as much as 30 percent of the customer account information contained errors.
  2. Waste of money
    Organizations that mail marketing materials to incorrect postal addresses clearly waste money. This problem arises when organizations use misspelled, incomplete, or outdated addresses to send materials via postal mail. The problem does not arise if organizations use such incorrect addresses for other purposes, such as customer segmentation based on cities within a state or sections within a city. The amount of wasted money can be computed if undeliverable mail is returned; if mail is neither delivered nor returned, the rate of undelivered mail should be estimated by other means (see the sketch following this list).
    Organizations that make mistakes often have to take corrective actions. The people and expenses required to correct the mistakes represent a waste of money. For example, when organizations over-charge customers based on incorrect account information, they usually have to process refund requests at a later date. When organizations distribute reports, marketing materials, product user manuals, etc. with important errors in them, they often have to re-print and re-distribute such materials.
  3. Lost opportunity
    Organizations that make inaccurate strategic and tactical decisions based on inaccurate analyses of customer segmentation, customers’ purchase behaviors, etc. suffer lost opportunity. For example, suppose that a manufacturer incorrectly decides, based on customer data that contains a disproportionate amount of low quality data, to target customers “in the 30s and 40s age groups and $60K to 70K income groups” to market a new product, rather than “40s and 50s age groups and $70K to 80K income groups, and with no children”. Then the manufacturer’s marketing campaign will be partially misdirected, and if the effective time window for the marketing campaign closes before the mistake is discovered, the customer group that the manufacturer did not pursue represents a lost opportunity and a loss of potential additional revenue. The wrong customer group the manufacturer pursued represents a waste of money (or at least a less-than-optimal use of money).
    Lost opportunity often comes in conjunction with waste of money. When organizations take corrective actions upon discovering mistakes, they end up doing the same work twice or more. The people and the time spent correcting the mistakes could have been used for productive work, and thus represent a lost opportunity. For example, customers of software products, when given user manuals that contain errors, will call the help desk (customer technical support line). Customer support engineers must then spend time answering such calls, rather than helping customers who call with other types of requests for help.
  4. Tarnished image (or loss of goodwill)
    Repeated mistakes due to persistent low quality data, like repeated mistakes from other causes, can lead to the desertion of customers, since after a while disappointed and frustrated customers will switch to competitors for goods and services. This is somewhat difficult to assess quantitatively, since customers switch vendors and brands for a variety of reasons. However, the customer churn rate, and the consequent decline in revenue and increase in the cost of acquiring new customers, can be determined fairly easily, and a certain fraction of these figures may be attributed to persistent use of low quality data.
  5. Invasion of privacy and civil liberties
    Invasion of privacy and civil liberties occurs largely due to unauthorized dissemination of private data about individual customers and citizens. However, it can also occur as a result of the proper use and dissemination of low quality data maintained by organizations. Incorrect, as well as correct, information about the diagnosis and treatment of patients in doctors’ offices and hospitals, memberships in certain organizations, subscriptions to certain publications, purchase records of certain types of products, etc. can all potentially place individuals in bad situations. Such information can lead to discrimination in employment, a police investigation, detainment at airports and customs when traveling, etc.
    Incorrect or outdated control data can also lead to invasion of privacy. If system administrators or database administrators do not properly manage the access control list, for example by not updating it after certain employees have been terminated, unauthorized employees may access the computer systems or databases, damage them, and/or disseminate private information about individuals found in the databases.
    It is difficult to quantify the monetary damages caused by invasion of privacy and violation of civil liberties, unless they result in lawsuits. But the costs of certain corrective actions, such as a bank’s invalidating and re-issuing credit cards to millions of customers after a security breach in its computer systems, can certainly be estimated or even measured.
  6. Personal injury and death of people
    It is rare but still possible for the use of incorrect data, or the unauthorized dissemination of data, to lead to the destruction of property, personal injury, and death. For example, invasion of privacy may lead to mental distress and subsequent medical treatment or even suicide. Wrong instructions, due to wrong or outdated data, for operating certain types of machines (e.g., construction equipment and other hazardous equipment) or controlling certain environments (e.g., chemical plants, nuclear power plants, and oil refineries) can cause accidents and even disasters. The damage done to property and people translates into replacement costs, hospital charges, legal costs, and lost workdays for the people involved (waste of money and lost opportunity).
  7. Lawsuits
    Damages attributable to low quality data can, under certain circumstances, lead to lawsuits by customers against businesses, and by citizens against the government. Invasion of privacy, personal injuries and deaths, and significant revenue losses, etc. are likely causes for lawsuits. The cost of a lawsuit includes legal expenses, human resources (other than the lawyers) and time spent on the legal action, judgments (i.e., damages awarded to the plaintiffs, fines levied on the organization and officers of the organization), and tarnished image due to bad press.
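
To make these damages concrete in monetary terms, the following sketch shows how two of them, the wasted mailing cost of item 2 and the churn-attributed losses of item 4, might be expressed as simple computations. It is written in Python; the function names and all figures are hypothetical placeholders rather than data from any real organization.

def mailing_waste(pieces_sent: int, returned: int,
                  estimated_unreturned_undelivered: int,
                  cost_per_piece: float) -> float:
    """Money wasted on marketing mail that never reached its recipients.

    'returned' counts undeliverable pieces that came back; undelivered pieces
    that were not returned must be estimated by other means, as noted above.
    """
    undelivered = returned + estimated_unreturned_undelivered
    assert undelivered <= pieces_sent
    return undelivered * cost_per_piece


def churn_loss(revenue_decline: float, new_customer_acquisition_cost: float,
               fraction_attributed_to_data_quality: float) -> float:
    """Portion of churn-related losses attributed to persistent low quality data."""
    return ((revenue_decline + new_customer_acquisition_cost)
            * fraction_attributed_to_data_quality)


# Hypothetical example: 100,000 pieces mailed at $0.55 each, 4,200 returned,
# and an estimated 1,800 more undelivered but not returned.
print(mailing_waste(100_000, 4_200, 1_800, 0.55))   # 3300.0
# Hypothetical example: 10% of churn-related losses attributed to data quality.
print(churn_loss(250_000.0, 80_000.0, 0.10))        # 33000.0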

Although it is often difficult to estimate the cost of low quality data in monetary terms, it must be done. As they say, “money talks”: only when the cost is expressed in explicit monetary terms can the decision makers of an organization relate to the consequences of low quality data for their organization. The cost of low quality data should be estimated for all fields in all tables in a database, and for all conceived uses of the data in those fields. However, since not all damages are certain to occur (e.g., a lawsuit or a death), a probability should be assigned to each type of damage when estimating the total damages.
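
A minimal sketch of such a probability-weighted estimate follows, in Python. The damage types are the seven listed above; the cost and probability figures are hypothetical placeholders that an organization would replace with its own estimates.

# Each damage type is mapped to (estimated cost if it occurs, probability of
# occurring over the planning period). All figures are hypothetical.
estimated_damages = {
    "loss of revenue":                     (120_000.0, 1.00),
    "waste of money":                      ( 45_000.0, 1.00),
    "lost opportunity":                    ( 60_000.0, 0.50),
    "tarnished image":                     ( 90_000.0, 0.30),
    "invasion of privacy/civil liberties": (200_000.0, 0.05),
    "personal injury or death":            (500_000.0, 0.01),
    "lawsuits":                            (350_000.0, 0.02),
}

expected_total_cost = sum(cost * probability
                          for cost, probability in estimated_damages.values())
print(f"Expected cost of low quality data: ${expected_total_cost:,.0f}")  # $244,000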


3 COST OF ENSURING HIGH QUALITY DATA

The cost of ensuring high quality data is the cost of preventing, detecting, and repairing low quality data. This does not include the cost of taking corrective actions for damages done due to low quality data. [Kim et al 2003] provides a rather comprehensive taxonomy of low quality data (also called dirty data) for the purpose of understanding the types of low quality data and the techniques appropriate for addressing them. Let us examine the three primary elements in maintaining high quality data.

  1. Preventing low quality data
    There are two methods to prevent low quality data from entering databases and files: automatic and manual. The automatic method is embedded in software, such as database management systems, and end-user applications (e.g., hospital patient records systems, order entry systems). Software systems usually reject certain wrong data types (e.g., numeric data for the person_name field, character data for the person_age field). Transaction management systems prevent bad updates through concurrency control and crash recovery mechanisms.
    The automatic method, however, is rather limited in its coverage; for example, it cannot prevent misspelled words or many other types of wrong data from entering the databases or files. The manual method is usually necessary to augment the automatic method, possibly assisted by software tools. Before storing data in a database or a file, people should check the correctness of the data: for example, check the spelling of character data, aided by a spell checker, and check the correctness of both numerical and character data by examining the original sources of the data (e.g., faxed purchase orders, employees’ resumes, filled-in surveys) and/or by cross-referencing other related and redundant sources of data. The manual method is typically costly, since it requires people and possibly the licensing of software tools.
  2. Detecting low quality data
    In general, it is not possible to prevent most types of low quality data. As such, it is necessary to detect low quality data after it has entered the databases or files. Detection may be done periodically (e.g., once a week, a month, a quarter) or before certain events (e.g., the release of financial data of a business to shareholders, a government tax audit). There are also two methods to detect low quality data: automatic and manual.
    The automatic method involves the use of software applications that examine the database in many different ways. For example, the names of employees in a department of a corporation (in one table) may be compared against the names of employees reporting to each department manager (in another table); a value range test may be run against the ages of all employees to make sure nobody’s age lies outside the range [18..65], if such a test was not done when the employees’ age data was entered; or, if the organization’s rule is that “the customer technical support staff be at least 20% of the engineering development and quality assurance staff”, the counts of the employees in the two groups should be checked (see the sketch following this list).
    The manual method for detecting low quality data is similar to that for preventing low quality data. For detection, however, people need to examine the data stored in the databases and/or files, rather than the data being entered or updated. The manual method is also usually required to identify all data propagated from the low quality data. Low quality data may propagate in a few different ways. Low quality data in one field (e.g., the Salary field) of a table may be copied to another field (e.g., the Salary+Commission field) in the same table or the same database. It may be copied to a separate remote database as a backup against disk crashes, to improve performance, or to allow autonomous administration of databases in a distributed environment. It may also have propagated into summary tables or materialized views created in the same database or in remote databases.
  3. Repairing low quality data
    Once low quality data has been detected, it has to be repaired. Again, there are two methods to repair low quality data: automatic and manual. The automatic method typically involves running a series of update statements against the database to replace wrong or outdated data with correct data. The manual method augments the automatic method: people manually examine the data and repair it. Repairing low quality data usually requires a combination of the two, as the sketch below illustrates.
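
The following sketch illustrates the automatic detection and repair steps of items 2 and 3, using Python's built-in sqlite3 module. The database file, table, and column names are hypothetical, and in practice the corrections would come from manual review of the original data sources.

import sqlite3

conn = sqlite3.connect("hr.db")          # hypothetical operational database
cur = conn.cursor()

# Detection: value range test -- flag employees whose recorded age lies
# outside the plausible range [18..65].
cur.execute("SELECT id, name, age FROM employee WHERE age < 18 OR age > 65")
suspect_ages = cur.fetchall()

# Detection: organizational rule -- customer technical support staff should
# be at least 20% of the engineering development and quality assurance staff.
cur.execute("SELECT COUNT(*) FROM employee WHERE dept = 'support'")
(support_count,) = cur.fetchone()
cur.execute("SELECT COUNT(*) FROM employee WHERE dept IN ('engineering', 'qa')")
(eng_qa_count,) = cur.fetchone()
rule_violated = support_count < 0.2 * eng_qa_count

# Repair: once correct values are known (typically from manual review of the
# original sources), replace the wrong data with a series of update statements.
corrections = [(34, 1017), (52, 2203)]   # hypothetical (correct_age, employee_id) pairs
cur.executemany("UPDATE employee SET age = ? WHERE id = ?", corrections)
conn.commit()
conn.close()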


4 BALANCING THE COSTS

A data quality strategy for an organization includes several elements:

  • Understanding the semantics of all the operational data.
  • Determining all the uses of the data and consequences of low quality data with respect to such uses.
  • Determining the costs of low quality data.
  • Determining the cost-effective methods of ensuring quality of data appropriate for the conceived uses of the data.
  • Establishing a process to ensure quality of data as determined, including the installation of necessary software tools and deployment of an appropriate level of human resources.

To determine cost-effective methods of ensuring the appropriate quality of data, an organization needs to compare the cost of ensuring high quality data against the total estimated cost of low quality data, and select a combination of methods for ensuring high quality data whose collective cost is significantly lower than the cost of the low quality data. This tradeoff decision is akin to the one proposed between the accuracy and timeliness dimensions of data quality [Ballou and Pazer 1995].
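
One simplified way to carry out this comparison is sketched below in Python. It takes the expected annual cost of low quality data (for example, the probability-weighted estimate of Section 2) and a menu of candidate quality-assurance measures, and keeps only those measures whose cost is clearly below the damages they are expected to avert. The measures, their costs, and the averted fractions are hypothetical, and in a real setting the averted fractions would not be independent or simply additive.

# All figures are hypothetical placeholders.
expected_cost_of_low_quality_data = 244_000.0   # e.g., the Section 2 estimate

# measure name: (annual cost, fraction of the expected damage it should avert)
candidate_measures = {
    "entry-time validation rules":  (20_000.0, 0.30),
    "quarterly detection audits":   (35_000.0, 0.25),
    "manual review of key tables":  (60_000.0, 0.20),
}

selected, total_cost, total_averted = [], 0.0, 0.0
for name, (cost, averted_fraction) in candidate_measures.items():
    averted = averted_fraction * expected_cost_of_low_quality_data
    if averted > cost:                      # keep only measures that pay for themselves
        selected.append(name)
        total_cost += cost
        total_averted += averted

print(f"Selected measures: {selected}")
print(f"Collective cost ${total_cost:,.0f} vs. damages averted ${total_averted:,.0f}")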


REFERENCES

[Ballou et al 1987] Ballou, D.P., Pazer, H.L., Belardo, S., and Klein, B., “Implications of Data Quality for Spreadsheet Analysis,” Data Base, vol. 18, no. 3, 1987, pp. 13-19.

[Ballou and Pazer 1995] Ballou, D.P. and Pazer, H.L., “Designing Information Systems to Optimize the Accuracy-Timeliness Tradeoff,” Information Systems Research, vol. 6, no. 1, 1995, pp. 51-72.

[Kauffman et al 1993] Bansal, A., Kauffman, R.J., and Weitz, R.R., “Comparing the Modeling Performance of Regression and Neural Networks as Data Quality Varies: A Business Value Approach,” Journal of Management Information Systems, vol. 10, no. 1, 1993, pp. 11-32.

[Kim 2002] Kim, W., “On Three Major Holes in Data Warehousing Today,” Journal of Object Technology, vol. 1, no. 4, September-October 2002, pp. 39-47. http://www.jot.fm/issues/issue_2002_09/column3

[Kim et al 2003] Kim, W., Choi, B.J., Hong, E.K., Lee, D.H., and Kim, S.K., “A Taxonomy of Dirty Data,” Data Mining and Knowledge Discovery, vol. 7, no. 1, 2003, pp. 81-99.

 

 

About the authors


Won Kim is President and CEO of Cyber Database Solutions (http://www.cyberdb.com) and MaxScan (www.maxscan.com) in Austin, Texas, USA. He is also Dean of the Ewha Institute of Science and Technology, Ewha Women's University, Seoul, Korea. He is Editor-in-Chief of ACM Transactions on Internet Technology (http://www.acm.org/toit), and Chair of the ACM Special Interest Group on Knowledge Discovery and Data Mining (http://www.acm.org/sigkdd). He is the recipient of the ACM 2001 Distinguished Service Award.

Byoungju Choi is an associate professor of computer science and engineering at Ewha Women's University, Seoul, Korea, with research interests in component-based software engineering, software testing, and data and software quality. Choi holds a Ph.D. in computer science from Purdue University and can be reached at bjchoi@ewha.ac.kr.



Cite this column as follows: Won Kim, Byoungju Choi: “Towards Quantifying Data Quality Costs”, in Journal of Object Technology, vol. 2, no. 4, July-August 2003, pp. 69-76. http://www.jot.fm/issues/issue_2003_07/column6

