What is Data Integrity?
Data integrity explained
Data integrity is the assurance that data remains accurate and consistent over the course of the data life cycle, from when the data is recorded until it is destroyed. In simple terms, data integrity means that you have recorded the data as intended and that it wasn’t unintentionally changed over the course of its life cycle. The concept is simple, but the practice is not. Data integrity is a critical component of designing any software system that stores or moves data.
Benefits of data integrity
Data integrity is important because just about every critical business decision is based on a company’s data. With good data integrity, you can analyze your company’s data to answer questions like:
- What were your business achievements?
- What were your business expenses?
- How are your sales in different regions?
- Are there areas of your business where expenses are growing faster than income?
- What is the productivity of different divisions of your workforce?
- Are you meeting your benchmark goals?
- Can you forecast your expenses for the upcoming fiscal year?
If you don’t have good data, you can’t answer any of these questions accurately.
Challenges to data integrity
Challenges to data integrity fall into two broad categories:
Physical integrity challenges
The physical structures that house your data are subject to all manner of destruction that applies to any physical infrastructure: fire, flood, explosions, extreme temperatures, radiation, corrosion, and any natural disaster that can damage a physical structure can also compromise the physical integrity of your data.
Logical integrity challenges
Data loses logical integrity if it loses rationality or correctness. For example, suppose the value for a field is supposed to be represented as a percentage, but your software lets users enter that value as a dollar amount. The data no longer has logical integrity because it does not make sense and can’t be used or understood as intended. Human error is the most common cause, but logical integrity can be compromised in many ways:
- The design of the software fails to include appropriate constraints (see the validation sketch after this list).
- A software bug allows incorrect data to be introduced into the system, or allows data to be unintentionally deleted.
- A transfer error occurs when moving data from one place to another, leaving the data unintentionally altered or compromised on arrival.
- Data is modified deliberately and maliciously, as in the case of hacking, viruses, or malware.
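To make the constraint example concrete, here is a minimal input-validation sketch in Python. The field name discount_pct is hypothetical; the point is that software should reject values that cannot possibly be valid for a field, such as a dollar amount entered where a percentage belongs.

```python
# Minimal sketch of an input-side domain constraint for a hypothetical
# field named "discount_pct". The check rejects values outside the
# 0-100 range a percentage requires, catching entries like a dollar
# amount before they ever reach storage.

def validate_discount_pct(raw: str) -> float:
    """Parse and validate a percentage field; raise ValueError on bad input."""
    if raw.startswith("$"):
        # A dollar amount has no meaning in a percentage field.
        raise ValueError(f"expected a percentage, got a dollar amount: {raw!r}")
    value = float(raw)  # raises ValueError if the input is not numeric
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"percentage out of range [0, 100]: {value}")
    return value

print(validate_discount_pct("42.5"))   # 42.5
# validate_discount_pct("$42.50")      # raises ValueError
```

Rejecting bad values at the boundary like this is far cheaper than repairing them after they have spread into reports and downstream systems.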
Maintaining data integrity
Physical data integrity
Maintaining physical data integrity involves building resilient systems. Set up redundant systems so that if one system fails, a backup can take over. You can also implement solutions like an uninterruptible power supply, clustered file systems, and radiation-hardened chips. All these techniques help support the physical integrity of your data.
Logical data integrity
Maintaining the logical integrity of your data can involve multiple systems and steps. Plan your systems with integrity in mind, and perform regular checks on input systems, storage systems, and any system that moves data.
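One common form such a check can take is a checksum sweep: record a cryptographic digest of each file when it is written, then recompute and compare on a schedule. Here is a minimal sketch, assuming a simple manifest that maps file paths to previously recorded SHA-256 digests:

```python
# A minimal sketch of a periodic integrity check. The manifest format
# (a dict of file path -> previously recorded SHA-256 digest) is an
# assumption for illustration; any file whose current digest differs
# has changed since the manifest was written.

import hashlib
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Compute the SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()

def check_manifest(manifest: dict[str, str]) -> list[str]:
    """Return the paths whose current digest no longer matches the manifest."""
    return [
        path for path, recorded in manifest.items()
        if sha256_of(Path(path)) != recorded
    ]
```

Any path the check returns has been altered since its digest was recorded and should be investigated or restored from backup.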
Since most company data is stored in a data warehouse or database, you’ll need to configure these systems to maintain data integrity using a variety of constraints. These constraints include entity integrity (each table has a primary key that uniquely identifies its rows), referential integrity (each foreign key points to an existing row in the referenced table), and domain integrity (each column only accepts values from its defined domain).
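Here is a sketch of all three constraint types, using SQLite as a stand-in database (the table and column names are illustrative, not from any particular schema). Note that SQLite only enforces foreign keys when the foreign_keys pragma is enabled:

```python
# Declaring entity, referential, and domain integrity constraints
# in SQLite. The database itself then rejects rows that violate them.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this per connection

conn.execute("""
    CREATE TABLE regions (
        region_id INTEGER PRIMARY KEY              -- entity integrity
    )
""")
conn.execute("""
    CREATE TABLE sales (
        sale_id      INTEGER PRIMARY KEY,          -- entity integrity
        region_id    INTEGER NOT NULL
                     REFERENCES regions(region_id),-- referential integrity
        discount_pct REAL
                     CHECK (discount_pct BETWEEN 0 AND 100)  -- domain integrity
    )
""")

conn.execute("INSERT INTO regions VALUES (1)")
conn.execute("INSERT INTO sales VALUES (1, 1, 42.5)")  # valid row

# Each of these would raise sqlite3.IntegrityError:
# conn.execute("INSERT INTO sales VALUES (2, 999, 10.0)")  # no such region
# conn.execute("INSERT INTO sales VALUES (3, 1, 250.0)")   # not a percentage
```

With the constraints declared in the schema, the database rejects invalid rows no matter which application wrote them, rather than relying on every client to validate correctly.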
The more sources of truth in your data, the more likely you are to introduce errors. For better data integrity, you can move your data into a data warehouse to create a single, robust source of truth.
And since most data is moved to a data warehouse via an ETL (Extract, Transform, Load) system, you’ll need to use an ETL tool that can move the data safely, without introducing duplicates or corrupting it.
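One property worth looking for in such a tool, or building into your own pipeline, is idempotent loading: re-running the same batch after a failure should not create duplicate rows. A minimal sketch, again with illustrative names and SQLite standing in for the warehouse:

```python
# One way a load step can avoid introducing duplicates: key each row on
# a stable identifier from the source system, so replaying a batch
# (after a crash, say) inserts nothing new. Names are illustrative.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE warehouse_sales (
        sale_id INTEGER PRIMARY KEY,  -- stable key from the source system
        amount  REAL NOT NULL
    )
""")

def load(rows):
    # INSERT OR IGNORE skips rows whose key already exists, so a
    # replayed batch cannot duplicate data.
    conn.executemany(
        "INSERT OR IGNORE INTO warehouse_sales (sale_id, amount) VALUES (?, ?)",
        rows,
    )
    conn.commit()

batch = [(1, 19.99), (2, 5.00), (3, 42.00)]
load(batch)
load(batch)  # replaying the same batch changes nothing

print(conn.execute("SELECT COUNT(*) FROM warehouse_sales").fetchone()[0])  # 3
```

The stable key is what makes the replay safe; without one, a crashed-and-restarted load is a classic source of duplicated warehouse data.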