Data Deduplication: What is Dedup?
Data deduplication, or ‘dedup’, is the term applied to the removal of what is technically known as redundant data. As part of the deduplication process, duplicate or repeated data is deleted and only one copy of the relevant data remains stored. Crucially, dedup always preserves an index of all data so that any of it can be retrieved should it be required.
Dedup reduces the required storage capacity since only unique data is stored, which improves the run times and performance of software, storage systems and computers. Essentially, with data dedup only one instance of any given piece of data is retained, leaving the user more free space to work with. Deduplication typically reduces the space needed for backup by more than 90%, and in favourable cases by ratios of over 100:1, which is the difference between needing 100MB and needing just 1MB.
As such, dedup is particularly important for the large-scale storage software and hardware employed by facilities such as government offices, hospitals, IPTV service providers and even legal services. Data deduplication is also more cost effective and improves data protection: with fewer copies of each file in circulation, access is easier to restrict and the remaining data easier to secure properly.
There are many benefits of dedup, including reduced start-up infrastructure costs, power consumption, physical space and even cooling requirements.
When it comes to putting the theory of dedup into practice, there are typically two different methods of deduplication to choose from: ‘Source Dedup’ and ‘Target Dedup’.
Source dedup deduplicates data at the source, before it is transmitted, whereas target dedup removes redundant data after it has been stored in a secondary or tertiary location. With source dedup, the entire deduplication process is transparent to the user and, because duplicates are removed before transmission, the backup files normally end up smaller than the source data. Either dedup technique can be employed in one of three ways.
Data can undergo the dedup process before it is sent to any outside storage unit; deduplication like this is known as Client Backup Dedup. Post-process dedup is the term applied to the technique whereby new data is first stored on a given storage device and dedup occurs at a later time. In-line dedup is when both processes (storage and removal of redundant data) happen together on the storage device in real time, as the data arrives and before it is written to disk.
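The in-line approach can be sketched as a toy store that hashes each incoming block before writing it and never stores a duplicate twice. The class and method names here are invented for illustration, not drawn from any real product:

```python
import hashlib

class InlineDedupStore:
    """Toy in-line deduplicating store: each incoming block is hashed
    before it is written, and duplicate blocks are stored only once.
    (Illustrative sketch; names are hypothetical.)"""

    def __init__(self):
        self.blocks = {}  # hash -> block data (the actual storage)
        self.files = {}   # filename -> list of block hashes (the index)

    def write(self, name: str, data: bytes, block_size: int = 4096):
        refs = []
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            digest = hashlib.sha256(block).hexdigest()
            if digest not in self.blocks:   # in-line: dedup before writing
                self.blocks[digest] = block
            refs.append(digest)
        self.files[name] = refs

    def read(self, name: str) -> bytes:
        # The index lets every file be reassembled from shared blocks.
        return b"".join(self.blocks[h] for h in self.files[name])

store = InlineDedupStore()
store.write("a.bin", b"A" * 8192)
store.write("b.bin", b"A" * 8192)  # identical content: no new blocks stored
print(len(store.blocks))           # -> 1
```

A post-process design would instead write every block to disk immediately and run the hashing and duplicate removal as a later background pass, trading extra temporary capacity for lower write latency.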
Deduplication solutions are certainly not foolproof, nor are they without inherent difficulties. Indeed, by leaving a machine in charge of which data to keep and which to discard, data loss is possible, although the best programs guard against this by running detailed hash calculations to work out which data can safely be removed. This matters to vendors, since sending people to retrieve lost files is not cost effective.
Another major problem with data dedup is the processing power required to make it work properly. The hash calculations take time and CPU: every byte of data must be read and fed into the hash function, and each resulting hash then looked up to see whether it matches any existing hashes.
So, deduplication is an important process for maintaining storage space, file integrity, security and cost effectiveness. However, the complications involved in letting software and specialist hardware decide which files to remove mean the process can go awry, and it requires constant monitoring.