Separate drive for live tv buffer?

A place to talk about GPUs/Motherboards/CPUs/Cases/Remotes, etc.
richard1980

Posts: 2623
Joined: Wed Jun 08, 2011 3:15 am
Location:

HTPC Specs: Show details

#21

Post by richard1980 » Mon Jul 11, 2011 7:40 pm

Average HDD lifespan is 50,000-70,000 hours (about 6-8 years) of average use. Obviously the less you use it, the longer the lifespan, and the more you use it, the shorter the lifespan. But that doesn't really tell the whole story. There's no guarantee that any drive will last that long. I've had drives fail after 6 months of light use, and some drives last through years of heavy use. What you have to look at is the probability that the drive will fail within a given period of time. There is a 100% chance the drive will fail given enough time, but there may only be a 1% chance the drive will fail within the first 1000 hours (just throwing numbers out there).

In RAID-0, if either drive fails, all data is lost. Assume there is a 20% chance of a single drive failing in the first 10,000 hours (just to use a round number). The probability of total data loss using a single drive would be 20%. But if you had 2 of those drives in RAID-1, the probability of total data loss would be 4% (probability of both drives failing = 20% * 20%). And the probability of total data loss using RAID-0 would be 36% (probability of either drive failing = (20% + 20%) - (20% * 20%)).
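If you want to play with the numbers yourself, here's a quick Python sketch of that arithmetic. The 20% figure is just the illustrative number above, and it assumes the two drives fail independently:

p = 0.20  # illustrative chance a single drive fails in the first 10,000 hours

single = p                      # one drive: data lost if it fails
raid1  = p * p                  # RAID-1: data lost only if BOTH drives fail
raid0  = 1 - (1 - p) * (1 - p)  # RAID-0: data lost if EITHER drive fails
                                # (same as p + p - p*p)

print(f"Single drive: {single:.0%}")  # 20%
print(f"RAID-1:       {raid1:.0%}")   # 4%
print(f"RAID-0:       {raid0:.0%}")   # 36%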

barnabas1969

Posts: 5738
Joined: Tue Jun 21, 2011 7:23 pm
Location: Titusville, Florida, USA

HTPC Specs: Show details

#22

Post by barnabas1969 » Mon Jul 11, 2011 8:48 pm

You make a good point, Richard. I've built many PCs over the years and I've never had one fail after the first few hours of use, but I've never built one that runs as many hours per day, and never one with as much I/O activity as a Media Center PC. I will say, however, that my last PC at work was 8 years old before they replaced it (my company is pretty cheap when it comes to hardware/software... it's how we consistently crank out 40%+ net profit)... and it was still working fine.

I'm aware of how RAID-0 works. My purpose was to improve I/O throughput and to treat both drives as one large drive.

How do you come up with 50-70K hours compared to Hitachi's 1.2 million?

I do understand the whole "probability" thing though... maybe I should invest in some extra drives for backup space.

Speaking of backups... when I first built my HTPC, I made 200GB partitions on each of the 2TB drives for backing up the system disk (SSD). My SSD is 60GB with plenty of free space. I used to be able to back up the system image into the 200GB partition with no problem. However, Windows Backup now tells me that there's not enough space... even if I delete the data from the 200GB partition prior to starting the backup. I'm thinking that it's because I've moved the user's Documents and other directories to the HDDs, along with several other directories that are redirected using junctions to prevent SSD wear due to buffering from PlayOn, vmcPlayIt, and other applications. Do you think Windows Backup is trying to back up my whole 4TB RAID-0 because there are some "system" files on that drive?

richard1980

Posts: 2623
Joined: Wed Jun 08, 2011 3:15 am
Location:

HTPC Specs: Show details

#23

Post by richard1980 » Tue Jul 12, 2011 1:54 am

MTBF is not a way to measure how long a drive should last. It's simply a guess at product line reliability based on a small test of drives in a controlled environment over a short period of time. The formula for MTBF is:

devices tested * hours tested / failures

For example, let's say Western Digital performs a 30 day test of 10,000 drives. They get 5 failures in the 30 day period. The MTBF would be:

10,000 * (24 * 30) / 5 = 1,440,000 hours

Now what if Seagate performs a test of 10,000 drives for 30 days, and they get 500 failures? The MTBF will be:

10,000 * (24 * 30) / 500 = 14,400 hours

The Seagate drives had 100 times the number of failures, and thus the MTBF is 1/100th of the Western Digital batch. The logical conclusion to draw here is that the Western Digital drives are more reliable than the Seagate drives, which is precisely why MTBF is published. It's a way for consumers to judge the reliability of the products.

But here's the kicker. What if the failure rate of the Western Digital drives really shot up during the 2nd 30 days of their life? What if an additional 495 drives failed during the 2nd 30 days? That's 500 dead drives in a 60 day period, so the MTBF is:

10,000 * (24 * 60) / 500 = 28,800 hours

The same number of drives died as in the Seagate test, but it took twice as long to happen, so the Western Digital MTBF works out to double the Seagate MTBF. And the Western Digital MTBF sure doesn't look as good as it did in the first 30 days. If you were Western Digital, would you rather publish an MTBF of 1.44 million hours, or 28,800 hours?

Now then, what if Seagate had also extended their test for another 30 days, but they didn't have any additional drives die? The MTBF of the Seagate drives would be:

10,000 * (24 * 60) / 500 = 28,800 hours

If this were the end of the test, and both manufacturers published their MTBFs for the 60-day test, they would both publish an MTBF of 28,800 hours. But does that mean the drives are similar? No. As you can see, the Seagate drives are 100 times more likely to die in the first 30 days than the Western Digital drives.
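For anyone who wants to plug in their own numbers, here's the formula as a small Python function, run against the hypothetical figures above (these are made-up test results, not published specs):

def mtbf(devices_tested, hours_per_device, failures):
    # MTBF = total device-hours accumulated in the test, divided by the number of failures
    return devices_tested * hours_per_device / failures

print(mtbf(10_000, 24 * 30, 5))    # hypothetical WD 30-day test:        1,440,000 hours
print(mtbf(10_000, 24 * 30, 500))  # hypothetical Seagate 30-day test:      14,400 hours
print(mtbf(10_000, 24 * 60, 500))  # WD after 495 more failures:            28,800 hours
print(mtbf(10_000, 24 * 60, 500))  # Seagate with no new failures:          28,800 hours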

Now, to go a step further: you were assuming the MTBF should be the life expectancy of the drive. Here's why you are wrong:

What if a long-term test was performed on a batch of 24,000 drives? During the test, Western Digital lost a drive every 24 hours. They conducted the test until all drives were dead. If a drive dies every 24 hours, the test will continue for (24,000 * 24) hours...576,000 hours. So what does that mean? Only one drive out of 24,000 will survive for a full 576,000 hours. But look at the MTBF at various times during the test:

After the first drive died, the MTBF would be:

24,000 * 24 / 1 = 576,000 hours

After the 2nd drive died, the MTBF would be:

24,000 * 48 / 2 = 576,000 hours

After the 3rd drive died, the MTBF would be:

24,000 * 72 / 3 = 576,000 hours

Are you noticing the pattern? The MTBF stays a constant 576,000 hours. But as I stated above, only one drive will last 576,000 hours. The other 23,999 drives died before then. The average life of the drives was not 576,000 hours. It was roughly half of that...about 288,000 hours.

Now then, go back to my previous comment. I stated the important thing to look at is the probability of a drive dying in given period of time. Looking at this case, the probability of a drive dying in the first 24 hours is 1/24,000. The probability of the drive dying in the first 48 hours is 2/24,000. The probability of the drive dying in the first 72 hours is 3/24,000. As you can see, the probability of a dead drive increases with time. Another way to look at it....there's a 50% chance any one of the 24,000 drives will die in the first 288,000 hours.
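Here's that thought experiment as a quick Python sketch; the running MTBF stays pinned at 576,000 hours even though the average drive only lasts about half that long (the numbers are the hypothetical ones from this post):

drives = 24_000
interval = 24  # hours; one drive dies every 24 hours in this hypothetical test

# Running MTBF after each failure: (devices tested * hours elapsed) / failures
for failures in (1, 2, 3):
    hours_elapsed = failures * interval
    print(drives * hours_elapsed / failures)   # 576,000.0 every time

# Actual mean lifetime: the drives die at 24 h, 48 h, ..., 576,000 h
lifetimes = [n * interval for n in range(1, drives + 1)]
print(sum(lifetimes) / drives)                 # 288,012.0 hours -- roughly half the MTBF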

No sample of drives will ever perform like this hypothetical sample. In reality, things are a little bit different. When you plot the failure time of mechanical devices, you generally end up with what's called a bathtub curve. Here's what a bathtub curve looks like:
[Attachment: ht21_1.gif - bathtub curve diagram]
Now, there are three important things to realize about the bathtub curve. First, there is always a chance of failure at any time. Second, the highest chances of failure are at the beginning and the end of the curve. With hard drives, the beginning of the curve (the "infant mortality period") covers everything younger than about 6 months to a year. After that, the failure rate stays fairly constant until about 50,000 to 70,000 hours, after which the other end of the bathtub curve is reached and failure rates start to climb due to old age.

The last important thing to realize about the bathtub curve is that it assumes the devices were all measured with the same usage patterns. Like all mechanical things, the more you use it, the greater the chance of failure. So if you plot different drives with different usage patterns, you will not end up with a bathtub curve. For example, if you had a sample with a large percentage of high-use drives, you might find the curve has more failures in the early life. At the same time, if the sample had a lot of low-use drives, the curve would have a lot more failures on the "old age" end. But if you plot different drives with similar usage patterns, you should get a bathtub curve each time.
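If it helps to see where that shape comes from, here's a purely illustrative Python sketch: a bathtub curve is just a declining infant-mortality failure rate, plus a roughly constant rate during normal life, plus a wear-out rate that climbs with age. The constants below are arbitrary and only chosen to produce the classic shape; they are not measured drive data.

import math

def bathtub_failure_rate(hours):
    # Toy hazard-rate model: infant mortality + constant baseline + wear-out.
    infant   = 0.002 * math.exp(-hours / 4_000)              # falls off over the first several months
    baseline = 0.0001                                         # flat rate during normal life
    wear_out = 0.0005 * math.exp((hours - 60_000) / 8_000)    # climbs as the drive approaches old age
    return infant + baseline + wear_out

for h in (100, 5_000, 20_000, 50_000, 70_000):
    print(f"{h:>6} h: relative failure rate {bathtub_failure_rate(h):.5f}")
    # high at the start, low in the middle, high again at the end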
Last edited by richard1980 on Thu Aug 04, 2011 4:56 pm, edited 1 time in total.

barnabas1969

Posts: 5738
Joined: Tue Jun 21, 2011 7:23 pm
Location: Titusville, Florida, USA

HTPC Specs: Show details

#24

Post by barnabas1969 » Tue Jul 12, 2011 3:38 am

Richard, I love the way you explain things. Thank you. If they truly test drives in the manner you explained, why would they call it the "mean" time before failure? The "mean" is the point at which half died before, and half died after. Now, obviously, they can't test the drives for 1.2 million hours... so they have to extrapolate the data in some manner... but if they test them in the manner you described, I don't see how they could call that the "mean". Can you cite any references to their testing methods?

I used to work for a semiconductor manufacturer. Our specialty was radiation-hardened devices that were intended for aerospace and military use (and that's all I can say about that). We did life testing as well... and we REALLY put some stress on those devices! I know that consumer-grade products don't get such stringent testing, but they must do something similar?

richard1980

Posts: 2623
Joined: Wed Jun 08, 2011 3:15 am
Location:

HTPC Specs: Show details

#25

Post by richard1980 » Tue Jul 12, 2011 6:33 am

You are defining the word "mean" incorrectly. Mean = average. The formula for averaging numbers is to sum them and divide by the total sample size. For example, take the three numbers 7, 13, and 79. To find the mean, sum the numbers and divide by the sample size: (7 + 13 + 79) / 3 = 33. So the mean of those three numbers is 33. The median is the middle number in the list, which means 50% of the remaining values are below the median and the other 50% are above it. In this case, the median is 13 because one number in the list is below 13 and the other is above it.
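In Python terms, just restating that arithmetic:

import statistics

values = [7, 13, 79]
print(statistics.mean(values))    # 33 -> the average
print(statistics.median(values))  # 13 -> the middle value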

To understand why manufacturers call it a "mean", break down the first example from my previous post:
richard1980 wrote:For example, let's say Western Digital performs a 30 day test of 10,000 drives. They get 5 failures in the 30 day period. The MTBF would be:

10,000 * (24 * 30) / 5 = 1,440,000 hours
Each hard drive was tested for (24 * 30) hours, or 720 hours. There were 10,000 drives, so the total amount of power-on time for all the drives combined was (10,000 * 720 hours), or 7.2 million hours. Now, out of the 7.2 million hours of testing, there were 5 failures. Therefore, you could say there was one failure for every 1.44 million hours of testing... or, put another way, the mean time between failures was 1.44 million hours. A lower MTBF means a higher percentage of drives failed during testing, while a higher MTBF means a lower percentage of drives failed during testing. You could therefore say that during the test, a higher MTBF is more reliable than a lower MTBF. The key words to remember are "during the test". It should not imply anything about after the test or about other drives not involved in the test, but somehow manufacturers think it does.

It reminds me of one of those toothpaste commercials on TV... "4 out of 5 dentists recommend this toothpaste"... well, they forgot to tell you they only asked 5 dentists, and 4 of them are insane. If they had asked every dentist in the world, I'm guessing 80% would not have recommended a tube of dog crap for toothpaste. Yet that's what you are led to believe in the commercial.

As for a citation, you can find this formula all over the internet....just Google "MTBF formula". It's a standard formula used in many different applications, not just for HDDs. In fact, here's a link where APC talks about it (I'm sure you've seen an APC battery backup before...): http://www.apcmedia.com/salestools/VAVR ... _R0_EN.pdf. Specifically, here is what they say:

A common misconception about MTBF is that it is equivalent to the expected number of operating hours before a system fails, or the “service life”. It is not uncommon, however, to see an MTBF number on the order of 1 million hours, and it would be unrealistic to think the system could actually operate continuously for over 100 years without a failure. The reason these numbers are often so high is because they are based on the rate of failure of the product while still in their “useful life” or “normal life”, and it is assumed that they will continue to fail at this rate indefinitely. While in this phase of the product's life, the product is experiencing its lowest (and constant) rate of failure. In reality, wear-out modes of the product would limit its life much earlier than its MTBF figure. Therefore, there should be no direct correlation made between the service life of a product and its failure rate or MTBF. It is quite feasible to have a product with extremely high reliability (MTBF) but a low expected service life. Take for example, a human being:

There are 500,000 25-year-old humans in the sample population.
Over the course of a year, data is collected on failures (deaths) for this population.
The operational life of the population is 500,000 x 1 year = 500,000 people years.
Throughout the year, 625 people failed (died).
The failure rate is 625 failures / 500,000 people years = 0.125% / year.
The MTBF is the inverse of failure rate or 1 / 0.00125 = 800 years.
So, even though 25-year-old humans have high MTBF values, their life expectancy (service life) is much shorter and does not correlate.

The reality is that human beings do not exhibit constant failure rates. As people get older, more failures occur (they wear-out). Therefore, the only true way to compute an MTBF that would equate to service life would be to wait for the entire sample population of 25-year-old humans to reach their end-of-life. Then, the average of these life spans could be computed. Most would agree that this number would be on the order of 75-80 years. So, what is the MTBF of 25-year-old humans, 80 or 800? It’s both! But, how can the same population end up with two such drastically different MTBF values? It’s all about assumptions!

If the MTBF of 80 years more accurately reflects the life of the product (humans in this case), is this the better method? Clearly, it’s more intuitive. However, there are many variables that limit the practicality of using this method with commercial products such as UPS systems. The biggest limitation is time. In order to do this, the entire sample population would have to fail, and for many products this is on the order of 10-15 years. In addition, even if it were sensible to wait this duration before calculating the MTBF, problems would be encountered in tracking products. For example, how would a manufacturer know if the products were still in service if they were taken out of service and never reported?

Lastly, even if all of the above were possible, technology is changing so fast that by the time the number was available, it would be useless. Who would want the MTBF value of a product that has been superseded by several generations of technology updates?
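If you want to check the APC arithmetic yourself, here's a quick Python sketch using their numbers:

population = 500_000         # 25-year-olds in the sample
observation_years = 1
deaths = 625

operational_life = population * observation_years   # 500,000 people-years
failure_rate = deaths / operational_life             # failures per year
mtbf_years = 1 / failure_rate

print(f"failure rate: {failure_rate:.3%} per year")  # 0.125% per year
print(f"MTBF: {mtbf_years:,.0f} years")              # 800 years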

holidayboy

Posts: 2840
Joined: Sun Jun 05, 2011 1:44 pm
Location: Northants, UK

HTPC Specs: Show details

#26

Post by holidayboy » Tue Jul 12, 2011 1:36 pm

Great thread - and great info guys.
Rob.

TGB.tv - the one stop shop for the more discerning Media Center user.

barnabas1969

Posts: 5738
Joined: Tue Jun 21, 2011 7:23 pm
Location: Titusville, Florida, USA

HTPC Specs: Show details

#27

Post by barnabas1969 » Tue Jul 12, 2011 1:55 pm

Richard,

Ah... sorry for my mixup with median vs. mean. It was late and I was tired. I wouldn't have asked the question if I had realized my mistake.

And thanks again for the wonderful way you explain things. I really appreciate the time you put into your answers.

So... I guess I need to set up something to back up the entire system periodically. I'm considering two possibilities. The first is less expensive, but backups will take longer. The second is more expensive, but will back up faster.

1) Add some large drives to my desktop PC and backup over the network from the HTPC to the desktop.
2) Buy a multi-disc enclosure and plug it into the e-SATA port on the HTPC.

I've read that there are some issues with port multipliers on some systems. I'll have to look into that. I think I would prefer the e-SATA route (if it will work), but it's tempting to go with adding drives to my desktop.

Do you have any opinions about that?

EDIT: I have a wired, gigabit network.

richard1980

Posts: 2623
Joined: Wed Jun 08, 2011 3:15 am
Location:

HTPC Specs: Show details

#28

Post by richard1980 » Tue Jul 12, 2011 5:19 pm

If you go with the gigabit network, you are either going to hit the throughput limitations of your HTPC drives or you are going to saturate the network, which is going to cause you problems trying to transmit other data across the network, such as streaming to extenders. And if you go the eSATA route, you are going to push the HTPC drives' throughput limits, but the network will be safe. So either way you are going to be pushing something to the limits so much that you will experience some downtime during the backup. I'm not a big fan of that kind of solution. I would rather have no downtime.

If you didn't want any downtime, you could always opt for a secondary network (I would recommend a 100 Mbps network instead of a gigabit network, because I think a gigabit network would let your HDDs get too close to, or exceed, their throughput limits), or you could go with a USB 2.0 or FireWire HDD attached to the HTPC. All three of these options limit the throughput of data to a point where it shouldn't impact any media playback or streaming experiences.

But it's really up to you. If you can handle a small bit of downtime every now and then, I would go with the eSATA route. Otherwise, I would probably go with FireWire.
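To put rough numbers on those options, here's a little Python sketch. The link speeds are nominal (real-world throughput is lower), and the ~100 MB/s drive figure is just my ballpark assumption for 2 TB drives of this era, not a measured spec:

# Rough usable throughput of each option, in MB/s (nominal/approximate figures)
links_mb_s = {
    "100 Mbps Ethernet": 12.5,   # 100 Mbit/s / 8
    "USB 2.0":           60.0,   # 480 Mbit/s / 8 (in practice closer to 30-35)
    "FireWire 800":      100.0,  # 800 Mbit/s / 8 (in practice closer to 80)
    "Gigabit Ethernet":  125.0,  # 1,000 Mbit/s / 8
    "eSATA (3 Gbps)":    300.0,  # after SATA's 8b/10b encoding overhead
}

hdd_mb_s = 100.0  # assumed sustained speed of a circa-2011 2 TB drive

for name, mb_s in links_mb_s.items():
    bottleneck = "drives are the bottleneck" if mb_s > hdd_mb_s else "link is the bottleneck"
    print(f"{name:<18} ~{mb_s:5.1f} MB/s -> {bottleneck}")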

barnabas1969

Posts: 5738
Joined: Tue Jun 21, 2011 7:23 pm
Location: Titusville, Florida, USA

HTPC Specs: Show details

#29

Post by barnabas1969 » Tue Jul 12, 2011 8:28 pm

Yeah, I figure with a gigabit LAN, I could push about 95 megabytes/second over the LAN (roughly 80% of the theoretical maximum after protocol overhead, etc). That would probably push the write speed of the target drives near the limit. If my 2TB drives in the HTPC were full (they're not yet), it would take the better part of 12 hours to push all the data across the network.
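As a back-of-the-envelope check in Python (using the ~95 MB/s figure above and decimal terabytes; the ~35 MB/s USB 2.0 line is my own ballpark, not a measurement):

def transfer_hours(data_tb, speed_mb_s):
    # rough time to move data_tb terabytes at speed_mb_s megabytes per second
    return data_tb * 1_000_000 / speed_mb_s / 3600

print(round(transfer_hours(2, 95), 1))   # one full 2 TB drive over gigabit:    ~5.8 hours
print(round(transfer_hours(4, 95), 1))   # full 4 TB RAID-0 array over gigabit: ~11.7 hours
print(round(transfer_hours(4, 35), 1))   # same 4 TB over practical USB 2.0:    ~31.7 hours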

I figured the eSATA route would be a little faster, but I wasn't thinking about how that would impact the HTPC. My thought was to run it late at night when everyone is asleep... but there are some recordings taking place at that time.

Your idea to limit throughput by using USB is not bad. It will take longer to back up, but will impact the system less.

I haven't used Windows Backup much. I'll have to investigate more when I have time. I need to know if it will do an incremental backup.

huston

Posts: 1
Joined: Sat Aug 24, 2013 2:03 am
Location:

HTPC Specs: Show details

#30

Post by huston » Sat Aug 24, 2013 2:06 am

I have an SSD and my TV is recorded to a second HDD, so I have no worries about the live TV buffer as it is on my D: drive. However, my pause buffer is still on my C: drive. How do I move it?

barnabas1969

Posts: 5738
Joined: Tue Jun 21, 2011 7:23 pm
Location: Titusville, Florida, USA

HTPC Specs: Show details

#31

Post by barnabas1969 » Sat Aug 24, 2013 2:09 am

Your pause buffer is built on the same drive where you tell Media Center the "Recorder Storage" is (in the Media Center "Settings"). The old buffer may be left behind on the C: drive, but it will actually buffer to the new location.

gthompson20

Posts: 29
Joined: Thu Jan 24, 2013 12:58 pm
Location: Indianapolis, IN

HTPC Specs: Show details

#32

Post by gthompson20 » Wed Aug 28, 2013 2:09 pm

If anyone is bored, Google performed a study on hard drive failures in their data centers... it's a great read!
http://static.googleusercontent.com/ext ... ilures.pdf
