Erasure Coding to Tape: A Cost-Effective and Reliable Solution for HPC Data Archiving

Erasure coding is a method of data protection that breaks data into smaller chunks and adds redundancy to those chunks to protect against data loss. It is particularly relevant for HPC data archiving on tape, where data volumes are so massive that dual copies or replication are simply not an option, yet the data still needs to be protected securely.

In this blog post, we will take a closer look at how erasure coding works and why it is an ideal solution for HPC data archiving on tape.

Erasure Coding, how it works, and why it is of interest to HPC

When data is processed by erasure coding software, it is divided into k equal-sized chunks, known as data blocks. Alongside these data chunks, the encoder generates m additional chunks, known as parity blocks, which provide the redundancy. The combination of data and parity chunks is known as the erasure coding schema, written (k+m).

In the data storage phase, the (k+m) chunks are written across multiple tapes, with each tape holding one chunk. For data recovery, any k of the (k+m) chunks are enough to retrieve the data from tape storage and decode it with the software. This means that even if some tapes are damaged or lost, the original data can still be reconstructed as long as k chunks remain. Put differently, a (k+m) schema tolerates the loss of up to m media without losing any data.
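To make the "any k of (k+m)" property concrete, here is a minimal sketch of an erasure code in Python. It is an illustration only, not the encoder used by any particular product: it assumes a Reed-Solomon-style construction over the prime field GF(257), and the names encode, decode and _interpolate_at are made up for this example. Production encoders typically work in GF(256) with heavily optimized arithmetic.

```python
P = 257  # assumption: smallest prime above 255, so every byte value is a field element


def _interpolate_at(points, x):
    """Evaluate at x the unique polynomial of degree < len(points) through points (mod P)."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse (Python 3.8+)
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data: bytes, k: int, m: int):
    """Split data into k data chunks and compute m parity chunks (requires k + m < P)."""
    chunk_len = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_len * k, b"\0")
    chunks = [list(padded[i * chunk_len:(i + 1) * chunk_len]) for i in range(k)]
    # Chunk number i is treated as the value of a polynomial at x = i + 1.
    for x in range(k + 1, k + m + 1):
        parity = []
        for pos in range(chunk_len):
            points = [(i + 1, chunks[i][pos]) for i in range(k)]
            parity.append(_interpolate_at(points, x))
        chunks.append(parity)
    return chunks


def decode(surviving: dict, k: int, length: int) -> bytes:
    """Rebuild the data from any k surviving chunks, given as {chunk_index: chunk}."""
    if len(surviving) < k:
        raise ValueError("fewer than k chunks survive: data cannot be rebuilt")
    chosen = sorted(surviving.items())[:k]
    chunk_len = len(chosen[0][1])
    out = bytearray()
    for i in range(k):  # recompute each original data chunk in order
        for pos in range(chunk_len):
            points = [(idx + 1, chunk[pos]) for idx, chunk in chosen]
            out.append(_interpolate_at(points, i + 1))
    return bytes(out[:length])


# Usage: a 6+4 schema where four "tapes" (chunks 0, 2, 5 and 7) are lost.
payload = b"HPC simulation output destined for the tape archive"
stored = encode(payload, k=6, m=4)
remaining = {i: c for i, c in enumerate(stored) if i not in (0, 2, 5, 7)}
restored = decode(remaining, k=6, length=len(payload))
print(restored == payload)
```

In this sketch each list in stored stands for the content of one tape. Any six of the ten lists are sufficient for decode, whichever four are missing.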

The erasure coding schema (k+m) is selected based on the number of chunks (i.e., media) that can be lost and on the specific use case. For example, with a single tape library you may choose a schema that protects against media loss, whereas with three tape libraries in different locations you may choose a schema that protects against both media loss and site loss. The chosen configuration determines the level of protection (the number of tape media losses tolerated and the disaster resilience) and the “storage footprint”, i.e., the actual storage space used on tape.

Let’s talk numbers and schemas for a minute, with some examples that make it more explicit how erasure coding significantly reduces the storage footprint in addition to providing protection against tape media loss:

Protection method	Media Loss Supported (up to)	Storage Footprint (vs initial data volume)
Erasure Coding schema 6+4	4	about 167%
Two copies	1	200%
Three copies	2	300%

A schema of 6+4 divides data into 6 chunks and adds 4 redundant chunks, for a total of 10 chunks stored on tape. This schema can withstand the loss of up to 4 media while still ensuring data availability. Its footprint is (6+4)/6, or about 167% of the original data volume.
In comparison, keeping two or three copies of the data requires 200% or 300% of the original volume, respectively, and tolerates the loss of only one or two copies.
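The footprint figures follow from simple arithmetic: a (k+m) schema stores (k+m)/k times the original volume and tolerates m lost media, while n full copies store n times the volume and tolerate n-1 lost copies. The short Python sketch below reproduces that arithmetic; the function names are illustrative and not part of any product.

```python
def ec_profile(k: int, m: int) -> dict:
    """Loss tolerance and footprint of a (k+m) erasure coding schema."""
    return {"losses_tolerated": m, "footprint_pct": 100 * (k + m) / k}


def replication_profile(copies: int) -> dict:
    """Loss tolerance and footprint of keeping full copies of the data."""
    return {"losses_tolerated": copies - 1, "footprint_pct": 100 * copies}


options = [
    ("EC 6+4", ec_profile(6, 4)),
    ("Two copies", replication_profile(2)),
    ("Three copies", replication_profile(3)),
]
for label, profile in options:
    print(f"{label:<13} tolerates {profile['losses_tolerated']} lost media, "
          f"footprint {profile['footprint_pct']:.0f}%")
```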

If one or more tapes are missing or damaged, the software uses the remaining data and parity chunks to reconstruct the missing data, as long as no more than m chunks are lost. This is known as data reconstruction, and it is one of the key benefits of using erasure coding for data protection. The number of losses that can be tolerated depends on the specific erasure coding schema being used: the more parity chunks are created, the more lost chunks can be recovered.

Top 5 reasons why erasure coding is an ideal solution to protect HPC data

Reason 1: Cost savings

One of the primary benefits of using erasure coding for HPC data archiving on tape at multiple tens of petabytes is cost savings. By introducing redundancy at the fragment level instead of replicating entire datasets, erasure coding reduces the capacity required and lowers infrastructure costs. For example, a 10+2 schema requires only 1.2 times the capacity of the original data and tolerates the loss of two media, the same number of losses that three copies absorb at 3 times the capacity. This translates to significant cost savings for HPC data centers that need to store tens of petabytes of data.
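As a rough illustration of the capacity trade-off, the sketch below searches for the smallest number of data chunks k that keeps a schema within a footprint budget for a required number of tolerated losses m, then converts the result into raw tape capacity. The 50 PB archive size, the 120% budget and the function names are assumptions chosen for the example, not figures from a real deployment.

```python
def smallest_k_within_budget(m: int, budget_pct: int, max_k: int = 64):
    """Smallest k such that a (k+m) schema stays within budget_pct of the data volume."""
    for k in range(1, max_k + 1):
        # Integer form of (k + m) / k <= budget_pct / 100, avoiding float rounding.
        if 100 * (k + m) <= budget_pct * k:
            return k
    return None  # no schema with k <= max_k fits the budget


def raw_capacity_pb(data_pb: float, k: int, m: int) -> float:
    """Raw tape capacity needed to store data_pb petabytes under a (k+m) schema."""
    return data_pb * (k + m) / k


archive_pb = 50                      # hypothetical archive size
k = smallest_k_within_budget(m=2, budget_pct=120)
print(f"schema {k}+2 needs {raw_capacity_pb(archive_pb, k, 2):.0f} PB of tape")
print(f"three copies need {archive_pb * 3} PB of tape")
```

Note that a wider schema (larger k) spreads each object over more tapes, so the footprint gain has to be weighed against the number of drives and media involved in each read.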

Reason 2: Fault tolerance

Erasure coding's redundancy also provides superior fault tolerance compared to simple replication. With erasure coding, data can be reconstructed even if multiple fragments are lost or corrupted. This is because the parity fragments carry redundant information computed from the data fragments, so any k surviving fragments are enough to rebuild the missing pieces. In contrast, keeping two copies only protects against the loss of a single copy of the data.

In addition, for HPC data centers requiring disaster protection measures, deploying erasure coding to distribute data across three distinct locations offers resilient fault tolerance. This approach allows the system to endure the complete loss of one site without compromising data integrity, contingent upon the selected erasure coding scheme.
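Whether a given schema survives the loss of a whole site can be checked with a few lines of Python. The sketch below assumes the (k+m) chunks are spread as evenly as possible across the sites, which is one plausible placement policy rather than a description of any specific product; the function names are hypothetical.

```python
def chunks_per_site(k: int, m: int, sites: int) -> list:
    """Spread the k+m chunks as evenly as possible across the sites."""
    base, extra = divmod(k + m, sites)
    return [base + (1 if s < extra else 0) for s in range(sites)]


def survives_site_loss(k: int, m: int, sites: int) -> bool:
    """True if losing the most heavily loaded site still leaves at least k chunks."""
    return max(chunks_per_site(k, m, sites)) <= m


def spare_losses_after_site_loss(k: int, m: int, sites: int) -> int:
    """Additional media losses tolerated once the largest site is gone (negative: data lost)."""
    return m - max(chunks_per_site(k, m, sites))


for k, m in [(6, 4), (6, 3), (10, 2)]:
    layout = chunks_per_site(k, m, 3)
    print(f"{k}+{m} over 3 sites: layout {layout}, "
          f"survives site loss: {survives_site_loss(k, m, 3)}")
```

Under this assumption a 6+4 schema over three sites places 4, 3 and 3 chunks, so it survives the loss of any one site, whereas a 10+2 schema places 4 chunks per site and does not.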

For HPC data centers that cannot afford to lose critical data, erasure coding offers a more robust solution.

Reason 3: Restoration performance

Another advantage of erasure coding is its ability to enable fast data reconstruction and short restoration times. Because the data is spread across fragments stored on different tapes, multiple fragments can be read and reconstructed in parallel, reducing the time it takes to restore data. In addition, because erasure coding requires less capacity than replication, there is less data to restore in the event of a failure. This can be especially important for HPC data centers that need to restore large datasets quickly to meet tight deadlines.

Reason 4: Security

Combining erasure coding with immutability and encryption provides an extra safeguard against unauthorized access. Immutability, in the context of tape storage, ensures that once data is written to the tape, it cannot be altered or tampered with, providing an additional layer of data integrity. On the encryption side, using client certificates adds another level of security: only authenticated users with valid certificates can decrypt and access the data, which significantly reduces the risk of unauthorized access. This is particularly crucial for HPC data centers managing sensitive information such as financial or healthcare data, where the combination helps meet compliance standards and uphold data privacy.

Reason 5: Scalability

Finally, erasure coding is highly scalable, making it suitable for HPC data archives of all sizes. As data volumes grow, erasure coding can be easily expanded to add more capacity and redundancy. This is because erasure coding schemes can be designed to work with any number of fragments and any level of redundancy. For HPC data centers that need to store and protect massive volumes of data, erasure coding offers a flexible and scalable solution.

Conclusion

The relevance of erasure coding to HPC data archiving on tape cannot be overstated. HPC environments generate massive volumes of data, and this data needs to be archived and protected cost-effectively and reliably. Traditional replication-based data protection methods can be prohibitively expensive when it comes to storing and protecting such large volumes of data. Erasure coding offers a more cost-effective and reliable solution, as it reduces the amount of storage capacity required and provides fault tolerance and error correction capabilities.

And when it comes to implementing erasure coding for HPC data archiving on tape, look no further than Miria. Miria is a powerful backup and archiving solution that leverages erasure coding to protect massive volumes of data. With Miria, you can easily create erasure coding schemes that are tailored to your specific needs and manage your data with ease. Whether you're a small HPC data center or a large enterprise, Miria has the capabilities and scalability to meet your data protection needs.