Tradeoffs: Data Compression vs Data Deduplication Flashcards

(3 cards)

1
Q

Data Compression

A

Definition:

Data compression is the process of reducing the size of data by removing redundancy or encoding it more efficiently, making it smaller for storage or transmission.

Types of Data Compression

a) Lossless Compression
No data loss: Original data can be reconstructed exactly after decompression.

Techniques:

Dictionary-based: Replace repeating patterns with shorter references (e.g., ZIP, GZIP).

Entropy coding: Assign shorter codes to more frequent symbols (e.g., Huffman coding).

Examples:

Text files, source code, database backups, PNG images.

b) Lossy Compression

Some data is lost permanently, usually non-critical data.

Goal: Achieve much higher compression ratios.

Examples:

JPEG (image compression), MP3 (audio), MP4 (video).
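As a toy illustration of the lossy principle (not a real codec like JPEG or MP3), the sketch below quantizes 8-bit sample values down to 4 bits of precision: the representation shrinks, and the reconstruction is close to, but not identical to, the original.

```python
# Toy lossy "compression": quantize samples from 8-bit to 4-bit precision.
# This illustrates the principle only; real codecs are far more sophisticated.

def quantize(samples, bits=4):
    """Keep only the top `bits` of each 8-bit sample (information is lost)."""
    shift = 8 - bits
    return [s >> shift for s in samples]

def dequantize(codes, bits=4):
    """Approximate reconstruction: scale codes back to the 8-bit range."""
    shift = 8 - bits
    return [c << shift for c in codes]

original = [200, 201, 199, 55, 56, 54]
codes = quantize(original)       # smaller representation (4 bits per sample)
restored = dequantize(codes)     # close to the original, but not exact
```

Note how nearby values collapse to the same code, which is exactly the permanent, "non-critical" loss described above.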

How it works (High-level)
Input data: Raw file or data stream.

Compression algorithm: Finds patterns, redundancies, or less significant data to remove or encode shorter.

Output: Smaller compressed data file or stream.

Decompression: Restores data (exactly for lossless, approximately for lossy).

Benefits

✅ Reduced storage cost: Saves disk space.

✅ Faster data transfer: Smaller size = less time to send over a network.

✅ Lower bandwidth usage: Useful in distributed systems and APIs.

✅ Improved caching efficiency: Compressed data fits better in memory/cache.

Drawbacks

⚠️ CPU overhead: Compression and decompression require extra processing.

⚠️ Latency impact: Real-time systems may experience delays if compression is heavy.

⚠️ Loss of quality (lossy): May degrade images, audio, or video if overly compressed.

Example: When you compress a text document into a ZIP archive, the algorithm finds and eliminates redundancies, reducing the file size. The original document is reconstructed exactly when the archive is decompressed.
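A minimal sketch of this lossless round trip using Python's standard zlib module, which implements DEFLATE, the same dictionary-based algorithm used by ZIP and GZIP:

```python
import zlib

# Highly redundant text compresses well with dictionary-based (DEFLATE) compression.
text = b"the quick brown fox " * 100

compressed = zlib.compress(text)
restored = zlib.decompress(compressed)

# Lossless: the original is reconstructed exactly, at a fraction of the size.
assert restored == text
assert len(compressed) < len(text)
```

The compressed stream here is a small fraction of the 2,000-byte input, because the repeated phrase is replaced by short back-references.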

2
Q

Data Deduplication

A

1️⃣ Definition:
Data Deduplication (often called dedupe) is a storage optimization technique that eliminates duplicate copies of repeating data, storing only one unique instance and referencing it wherever needed.

2️⃣ How it works
Data is broken into chunks (fixed-size or variable-size blocks).

Each chunk is analyzed using:

Hashing (e.g., SHA-256; older systems used SHA-1 or MD5) → Generates a unique fingerprint for the chunk.

If the fingerprint already exists in storage, the system:

Does not store the chunk again.

Creates a reference or pointer to the existing data block.

Only new, unique data chunks are stored physically.

Detail:

When deduplication detects duplicate data:

a) It does not store the duplicate block again.

b) Instead, it stores a pointer (a reference) in the file’s metadata that points to the original, already-stored block.

c) Both files or data chunks then share the same physical block on disk.

Flow:
File Deduplication:

User A uploads File1.txt → Stored as:
Block A | Block B | Block C

User B uploads File2.txt, which is identical to File1.txt:

System checks hashes of blocks: A, B, C.

Finds all blocks already exist in storage.

Does not store new copies.

Creates metadata references:
File2.txt → pointers → [Block A, Block B, Block C]

Both files now point to the same physical blocks, saving space.
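The flow above can be sketched as a tiny in-memory block store (a simplification: real systems use variable-size chunking, persistent indexes, and reference counting). SHA-256 serves as the fingerprint here:

```python
import hashlib

CHUNK_SIZE = 4  # tiny for demonstration; real systems use KB-sized chunks

class DedupeStore:
    def __init__(self):
        self.blocks = {}   # fingerprint -> unique chunk (stored once)
        self.files = {}    # filename -> list of fingerprints (pointers)

    def put(self, name, data):
        pointers = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in self.blocks:     # store only new, unique chunks
                self.blocks[fp] = chunk
            pointers.append(fp)           # duplicates become references
        self.files[name] = pointers

    def get(self, name):
        """Rebuild a file by following its pointers to the shared blocks."""
        return b"".join(self.blocks[fp] for fp in self.files[name])

store = DedupeStore()
store.put("File1.txt", b"ABCDEFGHIJKL")
store.put("File2.txt", b"ABCDEFGHIJKL")  # identical: no new blocks stored
```

After both uploads, only three physical chunks exist; both files' metadata point to the same blocks.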

3️⃣ Examples:
a) Cloud storage (Google Drive, Dropbox):

When multiple users upload the same file, it’s stored once, with shared references.

b) Backup systems (Veeam, NetApp):

Deduplicate repeated OS files across multiple VM backups.

c) Email servers:

Attachments sent to multiple recipients are stored only once.

4️⃣ Benefits
✅ Storage efficiency: Saves disk space by avoiding duplicates.

✅ Cost savings: Less storage hardware required.

✅ Network efficiency: Less data sent during backup or replication (especially in cloud storage).

✅ Backup optimization: Ideal for incremental backups where many files share similar content.

5️⃣ Drawbacks

⚠️ Compute overhead: Requires CPU and memory for chunking, hashing, and index lookup.

⚠️ Not useful for compressed/encrypted data: Harder to find duplicate patterns.

⚠️ Limited to identical data: Only reduces data that is exactly the same.

3
Q

Key Differences

A

Method of Reduction

Data Compression:

Reduces file size by eliminating redundancies within a single file or data stream.
Example: Repeated text patterns (AAAA BBBB) are stored more efficiently (A4 B4).

Data Deduplication:

Reduces overall storage usage by eliminating duplicate files or data blocks across the entire storage system or dataset.

Example: If multiple users upload the same file, it’s stored once, and all duplicates point to that original copy.

Scope: Compression works on a single file or data stream, while deduplication works across a larger dataset or storage system.
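The run-length example above (AAAA BBBB → A4 B4) can be sketched with a toy encoder/decoder (assuming the input contains no digits, so char+count pairs stay unambiguous):

```python
from itertools import groupby

def rle_encode(text):
    """Toy run-length encoder: collapse each run of a character to char+count."""
    return "".join(f"{ch}{len(list(run))}" for ch, run in groupby(text))

def rle_decode(encoded):
    """Inverse: expand each char+count pair back to the original run."""
    out, i = [], 0
    while i < len(encoded):
        ch = encoded[i]
        i += 1
        j = i
        while j < len(encoded) and encoded[j].isdigit():
            j += 1
        out.append(ch * int(encoded[i:j]))
        i = j
    return "".join(out)

# rle_encode("AAAABBBB") → "A4B4"
```

Real dictionary-based compressors (DEFLATE and friends) generalize this idea from runs of one character to arbitrary repeated substrings.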

Data Loss

Data Compression: Can be lossless (perfect recovery) or lossy (quality loss for media).

Data Deduplication: Always lossless; no data is altered, only referenced differently.

Granularity

Data Compression: Byte-level or bit-level encoding.
Data Deduplication: File-level or block-level (fixed or variable chunks).

Important: In many systems (e.g., backup solutions, cloud storage), both are used together:

Deduplication first → remove duplicates.

Compression second → shrink unique data.
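A minimal sketch of that combined pipeline, reusing SHA-256 fingerprints and zlib (fixed-size chunking is assumed for simplicity): duplicates are eliminated first, and only the unique chunks are compressed and stored.

```python
import hashlib
import zlib

def dedupe_then_compress(files, chunk_size=8):
    """Return (compressed unique blocks, per-file pointer lists)."""
    blocks, pointers = {}, {}
    for name, data in files.items():
        fps = []
        for i in range(0, len(data), chunk_size):
            chunk = data[i:i + chunk_size]
            fp = hashlib.sha256(chunk).hexdigest()
            if fp not in blocks:                   # deduplication first
                blocks[fp] = zlib.compress(chunk)  # compression second
            fps.append(fp)
        pointers[name] = fps
    return blocks, pointers

def restore(name, blocks, pointers):
    """Rebuild a file: follow pointers, decompress each unique block."""
    return b"".join(zlib.decompress(blocks[fp]) for fp in pointers[name])

files = {
    "File1.txt": b"same payload " * 4,
    "File2.txt": b"same payload " * 4,  # identical: adds no new blocks
}
blocks, ptrs = dedupe_then_compress(files)
```

Because the second file is identical, it contributes only pointers; the physical store holds one compressed copy of each unique chunk.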

Conclusion:

Data compression is useful for reducing the size of individual files for storage and transmission efficiency. In contrast, data deduplication is ideal for large-scale storage systems where the same data is stored or backed up multiple times. Both techniques can significantly improve storage efficiency, and in practice they are often combined.
