Recently I suffered a hard drive failure on my home server. Seeking a better and
more voluminous storage solution, I decided that my best option was to obtain
new hard drives through a process called shucking.
The Failure That Started It All
My home server consisted of 3x3TB Seagate Barracuda drives arranged in a RAID5
configuration. This yielded a total usable storage capacity of 6TB. Around two
weeks ago, I logged in to shut down the server so that I could move it to a
different location in my apartment. Before the shutdown, I did a routine check
of the RAID array status. That’s when I discovered that one of the disks had
failed and had been dropped from the array.
I suffered no data loss, partly because RAID5 tolerates a single drive failure,
but also because I have backups. See:
RAID is not a backup.
The event spurred my decision to get new drives for the server because of a
number of factors:
- Seagate Barracuda drives are not designed to run in a RAID configuration
- A rebuild would likely result in more drive failures, and a need for new drives anyway
- I was already low on storage and this was an excuse to build a bigger array
What Is Shucking?
I learned about shucking from the /r/DataHoarder
subreddit. There is a lot of information there if you are interested in shucking
drives yourself, so I won’t bore you with the details here. In a nutshell,
shucking means taking an external hard drive and removing the case so that the
bare drive can be used in a regular computer.
That may sound like a ridiculous way to go about obtaining hard drives, until you
realize that you can save a lot of money by doing it that way.
The target of my shuck was
WD Easystore 8TB external hard drives. Inside of these is a high-quality
drive that is made specifically to operate as part of a RAID array. They used to
contain WD Red drives, but my research told me I should expect “white label” drives.
However, white label drives are functionally identical to the WD Red drives. At
the time I purchased them, the Easystores were priced at $160, and the WD Reds
were $250. That’s a savings of $90 per drive!
Shucking was surprisingly easy. There are a ton of videos online showing how to do
it without breaking the tabs that hold the plastic case together. I found that the
easiest way to open the case without breaking it is to use two small screwdrivers.
Stick the screwdrivers into the gap between the black front panel and the gray
back panel. You want one screwdriver at the top, and one screwdriver at the bottom. Use
the screwdrivers to pry against the inside of the front panel and
lever the case open. The only things holding the case together
are four plastic tabs at the back. The top and bottom do not have tabs.
If you force the gray back panel straight back,
the four tabs will pop open, and the back panel will slide out with
the hard drive.
I got four 8TB “white label” drives, model WDC WD80EMAZ-00WJTA0. Most importantly,
they all report support for TLER,
the missing feature that made my Barracudas unsuitable for use in a RAID array:
SCT Error Recovery Control:
Read: 70 (7.0 seconds)
Write: 70 (7.0 seconds)
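If you want to check this on your own drives, the report above is the kind of text that `smartctl -l scterc /dev/sdX` prints (the device name is a placeholder). Here is a minimal Python sketch that pulls the two timers out of that text. The function name `parse_scterc` and the sample string are my own illustration, and the assumption that a disabled timer shows no number is exactly that, an assumption.

```python
import re

# Sample text copied from the report above. On a real system it would come
# from running `smartctl -l scterc /dev/sdX` (placeholder device name).
SAMPLE = """SCT Error Recovery Control:
           Read:     70 (7.0 seconds)
          Write:     70 (7.0 seconds)
"""


def parse_scterc(text):
    """Return the enabled timers in seconds, e.g. {'read': 7.0, 'write': 7.0}.

    The raw number is in tenths of a second, so 70 means 7.0 seconds.
    Lines without a numeric value (assumed to mean the timer is disabled)
    are simply left out of the result.
    """
    timers = {}
    pattern = r"^\s*(Read|Write):\s+(\d+)\b"
    for kind, raw in re.findall(pattern, text, re.MULTILINE):
        timers[kind.lower()] = int(raw) / 10
    return timers


print(parse_scterc(SAMPLE))  # {'read': 7.0, 'write': 7.0}
```

An empty result would suggest that the drive either lacks the feature or has it switched off, which is worth knowing before putting the drive in an array.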
One caveat of the “white label” drives is that sometimes they won’t spin up in
a regular computer due to non-standard usage of the 3.3V SATA pins. Apparently,
the external hard drive control board uses the 3.3V signal to tell the drive to
spin down. Standard computers don’t use the pins that way, but they do supply a constant
3.3V to the drive, and that keeps the drive from spinning up. This can be remedied by masking off the
3.3V pins on the drive with tape, or simply snipping the pins off with clippers.
Either method works.
Two of my drives required the 3.3V pin modification. The other two did not. This
is likely because the four drives were manufactured at different times.
Only the newer drives required the 3.3V fix. Having drives of different ages is
actually a reliability bonus, since a manufacturing defect tends to affect many
drives from the same batch.
I retired the old Seagate drives into the Easystore enclosures. They function perfectly
and I now have 3x3TB external hard drives. Strangely enough, all of the drives
work, despite one having been dropped from the array. This is likely due to the
lack of TLER on these drives. What likely happened is that a drive encountered a bad
sector and tried several times to reread it. Normally this is okay, and sometimes a
drive is able to read the sector after a few tries. The problem is that this confuses
the RAID controller (Linux software RAID in this case). The RAID controller
sees the drive stop responding for a long time and assumes that the drive has died.
In reality, only a single sector has failed, but the drive is frozen trying to read it.
This is the problem that TLER fixes. It sets a time limit for the drive to try
rereading bad sectors. That way, the drive is never frozen long enough to be failed
out of a RAID array.
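The timing argument can be sketched in a few lines of Python. This is a simplified model, not how the kernel actually decides anything: it just compares how long a drive keeps retrying with how long the host waits for a command. On Linux the per-device command timeout lives in `/sys/block/<device>/device/timeout` and usually defaults to 30 seconds. The function names and the 90 second figure for a drive without a time limit are assumptions for illustration.

```python
from pathlib import Path


def kernel_timeout_seconds(device, sys_root="/sys/block"):
    """Read the Linux per-device command timeout, in seconds."""
    path = Path(sys_root, device, "device", "timeout")
    return int(path.read_text().strip())


def survives_bad_sector(drive_retry_seconds, host_timeout_seconds):
    """True if the drive gives up and reports the error before the host
    gives up on the drive (simplified model)."""
    return drive_retry_seconds < host_timeout_seconds


# 7.0 s is the limit reported by the new drives; 30 s is the usual default.
print(survives_bad_sector(7.0, 30))   # True

# A drive with no time limit might retry for a minute or more (assumed).
print(survives_bad_sector(90.0, 30))  # False
```

On a real machine, `kernel_timeout_seconds("sda")` would supply the second argument, with `sda` standing in for whatever the device is called.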
The new home server is up and running. I decided to go with 4x8TB drives in a
RAID6 array. This gives me a total usable storage capacity of 16TB. RAID6 also
has double redundancy. This means that the array can withstand two simultaneous drive
failures without loss of data.
The reason I decided to use RAID6 is twofold. Firstly,
RAID5 stopped working in 2009.
The rate of URE (unrecoverable read errors) per unit of data has not improved much
over time, while hard drive size has increased dramatically. This means that UREs
are more likely with larger drive sizes. The odds of a rebuild failure are high with a RAID5
array. If a URE occurs on any of the remaining drives during a RAID5 rebuild,
all data is lost, since there is no redundancy in the degraded state. Since a URE on any of the
remaining drives can fail the entire rebuild, the risk of failure grows with the total amount of
data that has to be read from all of them. RAID6 has a second level of redundancy, so a single URE does not cause the rebuild to fail.
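To put rough numbers on that, here is a back-of-the-envelope Python sketch. It assumes a URE rate of 1 per 10^14 bits read, a figure commonly quoted on consumer drive datasheets and not something I measured, and it treats every bit as an independent trial. Real drives often do better than the datasheet figure, so read the result as a pessimistic estimate. The function name is my own.

```python
import math


def p_rebuild_hits_ure(surviving_drives, drive_tb, ure_per_bit=1e-14):
    """Chance of at least one URE while reading every surviving drive in full.

    ure_per_bit is an assumed datasheet-style rate, not a measured value.
    """
    bits_read = surviving_drives * drive_tb * 1e12 * 8
    # 1 - (1 - rate) ** bits, written with log1p/expm1 to stay accurate
    # when the rate is tiny and the exponent is huge.
    return -math.expm1(bits_read * math.log1p(-ure_per_bit))


# Old array: 3x3TB RAID5, so a rebuild reads the 2 surviving drives.
print(f"{p_rebuild_hits_ure(2, 3):.0%}")  # 38%

# Hypothetical 4x8TB RAID5: a rebuild reads the 3 surviving drives.
print(f"{p_rebuild_hits_ure(3, 8):.0%}")  # 85%
```

Under these assumptions, the same RAID level goes from a worrying risk on the old array to a likely rebuild failure on the new, larger drives.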
Secondly, if in the future I decide to add more drives, RAID6 has a better usable
storage ratio than alternatives such as RAID1+0. RAID6 always uses two drives' worth
of capacity for parity, whereas RAID1+0 has no parity at all and instead uses half the
drives in the array for mirror copies. With only four drives, the usable storage ratio is the same between the
two RAID levels. However, if more drives are added, RAID6 has the advantage of
more usable storage, since its overhead stays constant at two drives. RAID1+0 does
have some advantages with reliability. However, my server is primarily for storing media,
so ultra-high reliability is not really necessary. Nor are media files so important
that losing them would be devastating. Also, as I said before, RAID is not a backup.
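The capacity comparison is easy to check with a little arithmetic. The Python sketch below uses the textbook formulas for equal-sized drives and ignores filesystem and metadata overhead. The function name and the level labels are just for illustration.

```python
def usable_tb(level, drives, drive_tb):
    """Usable capacity in TB for an array of equal-sized drives."""
    if level == "raid5":
        return (drives - 1) * drive_tb   # one drive's worth of parity
    if level == "raid6":
        return (drives - 2) * drive_tb   # two drives' worth of parity
    if level == "raid10":
        return (drives // 2) * drive_tb  # half the drives hold mirror copies
    raise ValueError(f"unknown RAID level: {level}")


print(usable_tb("raid5", 3, 3))  # 6  (the old array)
print(usable_tb("raid6", 4, 8))  # 16 (the new array)

# RAID6 pulls ahead of RAID1+0 as soon as the array grows past four drives.
for n in (4, 6, 8):
    print(n, usable_tb("raid6", n, 8), usable_tb("raid10", n, 8))
```

With 8TB drives, the loop shows the two levels tied at 16TB for four drives, then 32TB against 24TB for six, and 48TB against 32TB for eight.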