Hello, and welcome back to CS615 System Administration! This is week 09, segment 1, and I think it's high time that we talk about a topic that makes every SysAdmin nervous because they don't quite know whether they're prepared -- and I mean: _really_ prepared -- for when disaster strikes: Backups. When was the last time you backed up your data? When was the last time you verified that you could _restore_ your data? I know, I scared you just now, didn't I? Go ahead, verify your backup, then come back here; I'll wait. All good? Your backups are working? Alright, good. So where were we? Ah, yes, I was going to tell you that in the next few segments, we'll discuss the topic of backups in some detail, beginning with this video, where we cover the general core concepts and talk about basic principles. In the following videos, we'll then illustrate by example a few common backup strategies and tools, including filesystem snapshots and related considerations. But don't drop off the video just yet. I know, I know, backups are boring. It's like brushing your teeth, or washing your hands, or wearing a mask. It's tedious. You have to remember to do it. You have to build up a habit, and you can't stop doing it just because, well, you know, it's been a while and you no longer feel like it. I know, I know. Nobody likes backups. Nobody gets excited about backups. But... you know what people do like? What they get really excited about? People looove to be able to restore data that they thought they had lost. People love restores. Love 'em. They go nuts about 'em. Have you ever accidentally deleted a file and then remembered that you had a backup copy of it somewhere? And isn't that just the best? So yeah, people really do love restores, just as much as they find backups boring. And so it's important to approach the topic of "backups" not as something that's an objective in and of itself, but rather as a means to an end: Nobody cares about backups, because backups are not what you want. What you want is the ability to recover lost data, to be able to restore data. And backups are just a way to enable this capability. So keeping this primary objective in mind, let's take a look at some of the core concepts, some terminology, and some broad considerations on the topic. I'm sure you've heard many of the terms we'll be using here before, but it's worth clarifying what we're talking about before we dig in deeper. For starters, there are different types of backups, starting with the most obvious, the full backup. This is pretty much what most people think of when they hear the term "backup": a complete copy of all the data. But there are also incremental backups. That is, a backup that only copies the data that has changed since the last time a backup was performed, as well as so-called differential backups, which are quite similar to incremental backups, but differ in that here we back up only data that has changed since the last full backup. We'll see the distinction between these three types in a minute. But we also have to talk about how we're backing up data: whether we're copying individual files, or whether we're performing a copy of the raw disk blocks, whether, and to what extent, we care about file metadata (of both the file and the filesystem) as distinct from the file data itself, and how we may handle, for example, open files or data that is written to while we perform a backup.
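To make that distinction a little more tangible before we move on, here's a deliberately simplified sketch -- in Python, purely for illustration; the paths and timestamps are made up, and a real backup tool would also have to track deletions, metadata, and open files -- of how a file-level tool might decide what to include. The only difference between the differential and the incremental run is which timestamp we compare against:

```python
#!/usr/bin/env python3
"""Toy illustration of incremental vs. differential file selection.

Deliberately simplified: a real backup tool would also track deletions,
handle metadata, open files, and so on."""
import os
import time

def changed_since(top, reference_time):
    """Yield paths under 'top' whose modification time is newer than
    'reference_time' (seconds since the epoch)."""
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.path.getmtime(path) > reference_time:
                    yield path
            except OSError:
                # file vanished or is unreadable; a real tool would log this
                continue

# A differential backup compares against the last *full* backup...
last_full = time.time() - 3 * 86400      # e.g., Sunday, three days ago
differential_set = list(changed_since("/home", last_full))

# ...while an incremental backup compares against the last backup of any kind.
last_backup = time.time() - 1 * 86400    # e.g., yesterday's incremental
incremental_set = list(changed_since("/home", last_backup))
```

Again, this glosses over a lot -- which is exactly why we'll look at real tools in the following videos -- but it captures the core bookkeeping difference between the two approaches.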
All of this circles back in many ways to the fundamentals of file systems that we covered earlier on in week 3, and touches upon the distinction between journaling file systems and data replication on the one hand and filesystem snapshots, for example, on the other, as both can be viewed as a form of protecting against data loss, even if that may not necessarily serve as a traditional backup strategy. We'll see the use of filesystem snapshots in a later video by example. We'll also talk about extremely businessy terms relating to Disaster Recovery and having a Business Continuity Plan with a defined Recovery Point Objective or RPO as well as a Recovery Time Objective or RTO. I know, this all sounds terribly boring, but as so often, these fancy terms translate into very simple concepts: a Recovery Point Objective really is just the fault tolerance window you are willing to accept, or, in other words, how far back you have to go to get back your data. So for example, if you perform backups every night at midnight, and you lose some data at 8 am, then obviously you can at best get back the data from the previous night's backup. Likewise, if you lose data at 11:59pm, you still can only get back the data from the previous night's backup, so your recovery point objective here would be 24 hours -- you are aiming to be able to restore data from 24 hours ago. If your RPO was, say, ten minutes, well, then you'd have to run backups every ten minutes. Your Recovery _Time_ Objective then is how long it takes you to restore the data, since that is not usually instantaneous, and the type of backup performed may play a role here, so let's go ahead and take a look at the three main strategies: First, there's the simplest and most obvious approach to backups: you perform a full copy of all the data at regular intervals. So for example, we start our backup cycle on Sunday and copy all the data we have -- say, 7 terabytes. Then the next day, you simply do the same thing, and create a second copy of all of the 7 terabytes you have here. And then you repeat this regularly, creating full copies of all the data every day, so that after one week you've backed up 49 TB of data. That's a lot of data, but it sure makes restores easy, because all you have to do is go and fetch the latest full data dump, and there you go, back in business. So the "full backup" strategy has as its drawbacks that it's obviously slow -- it takes time to copy 7 TB of data every night -- and that you then need to actually shovel all that data around and store it somewhere, but on the bright side, recovering from data loss is easy. Now let's compare this to a "differential" backup. In this model, we change what we back up each night, but of course we still have to start the cycle with a single full backup. Then, the next day (Monday), we only back up whatever data has changed since the last full backup, i.e., since Sunday. So let's assume here that this is the data that's changed, so we copy 2 TB. The next day (Tuesday), let's pretend that only the orange data blocks had changed, so we then back up _those_ blocks, but also the ones that had changed the day before, since we are only comparing datasets against the last complete backup. The next day (Wednesday), let's pretend the green data blocks changed, so we add those to what we back up. The next day (Thursday), let's say nothing new changed, but since our promise is to back up everything that changed since our last _full_ backup, we again copy the same data. And if data then changes again on Friday, we copy that, and likewise whatever changes on Saturday.
So while we didn't have to copy 49 TB of data as in the "full backups every day" model, we brought the total data down to 31 TB. Now if we want to restore data, we need to combine only two data sets: the data from the original full backup, and then overlay the data from the last differential, since that includes _all_ the changes since the last full backup. So this approach improves the backup performance and storage utilization, but makes the recovery a little bit slower than from a full backup, since we have to overlay data from the second set. On the other hand, regardless of what data was lost, we always only require at most two data sets: the last full and the last differential. So that's not too bad. Now let's compare this to the "incremental" backup strategy. This model will initially seem almost identical to the differential approach, since here, too, we don't copy all the data, but only, well, incremental changes. So after our initial full backup on Sunday, we might then back up the same blocks as in the differential model on Monday. But now on Tuesday we'll see a difference. If, again, we assume that there were changes only in the orange blocks, then in the incremental model we will only back up just those blocks on Tuesday, and similarly, with only changes in the green blocks happening on Wednesday, we'd then only back up those blocks. Now if no data changes at all on Thursday, then we won't back up any new data, and on Friday we would only back up what's changed then, just like on Saturday. So it should be obvious that in this model we are backing up significantly less data -- only the data that has changed since the previous incremental backup, so the total in our example here adds up to only 13 TB. So that's pretty neat, but what does this mean when we want to restore? In this model, when we want to restore data, we have to start with the full backup from Sunday, then overlay the data from Monday, then that from Tuesday, then that from Wednesday, Friday, and finally Saturday. So the difference from the previous two approaches, then, is that in this case we maximize the backup performance and minimize the storage needs, as we really only copy the minimal amount of data, but that comes with the penalty of a much more complex restore process. And note that this also somewhat increases the probability of a failed restore, as our complete restore now depends on the full chain of all previous backups. Once again, there is no free lunch and there is no simple "correct" solution, but it's important to understand these different approaches -- we'll put some concrete numbers on all three side by side in a moment. But not only do we need to think about _how_ we're backing up data, we also need to think about where we're backing up the data _to_, so let's talk about storage media. Now obviously this goes back in many ways to our week 02 videos, but let's quickly jot down the different storage media we might use and list properties that are relevant to us in this context. So first, we have magnetic tape, the original data medium and, you may be surprised to find, still a dominant storage medium for backups, using large, robotic tape libraries in enterprise environments. But of course, just as we can use hard disks for data access by systems using, say, storage area networks, we can use them for data backup purposes, just as we could use a storage network backed by solid state drives, and so you see how this closely tracks what we already discussed in week 2, and just as we mentioned then, you can of course also store your data in the cloud, why not.
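Just to put those promised numbers on the three strategies side by side, here's a small back-of-the-envelope calculation in Python. The per-day change sizes are ones I picked so that the weekly totals come out to the 49, 31, and 13 TB from our example, and for simplicity we assume each day's changes touch data not already changed earlier in the week:

```python
#!/usr/bin/env python3
"""Compare total backup volume and restore-chain length for the three
strategies.  Daily change sizes are illustrative; we assume each day's
changes touch data not already changed earlier in the week."""

FULL_SIZE = 7                         # TB, size of a complete copy of the data
daily_changes = [2, 1, 1, 0, 1, 1]    # TB changed Mon..Sat

# Full backups every day: a complete copy, every single day.
full_total = FULL_SIZE * (1 + len(daily_changes))

# Differential: one full backup, then each day everything changed since
# that full backup (i.e., the cumulative changes so far).
cumulative = 0
diff_total = FULL_SIZE
for change in daily_changes:
    cumulative += change
    diff_total += cumulative

# Incremental: one full backup, then each day only what changed since
# the previous backup of any kind.
incr_total = FULL_SIZE + sum(daily_changes)

# An incremental restore needs the full backup plus every non-empty
# incremental taken since then.
chain = 1 + sum(1 for c in daily_changes if c > 0)

print(f"full:         {full_total} TB backed up, restore needs 1 data set")
print(f"differential: {diff_total} TB backed up, restore needs 2 data sets")
print(f"incremental:  {incr_total} TB backed up, restore needs {chain} data sets")
```

Running this prints 49, 31, and 13 TB respectively: identical data, very different totals and restore chains. And whichever strategy you pick, all those terabytes still have to land on one of the storage media we just listed.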
But how do we pick the right solution here? Of course the properties of these media differ, but which ones matter, and how do they matter here compared to when we're considering storage media in general? One obvious factor is of course I/O performance -- for your backup destination, you probably want a highly performant solution, although you'll want to optimize for sequential access to large objects, since a backup operation generally does not randomly write a few bytes here and then a few bytes somewhere else. You also want to consider the reusability and longevity of the storage medium -- you probably want to be able to store your data there for a long time, and you may even consider using a medium that allows writing the data once, but then does not allow you to _overwrite_ it. DVDs or writable CDs are examples here, though of course those are pretty terrible storage devices as they degrade quickly; enterprise-scale vendors nowadays also offer "write once, read many" solutions backed by solid-state storage. In addition to this sort of protection of your data, you also may consider adding compression or encryption of the data right in the storage solution, or evaluate options to save space via data deduplication. All of those are factors to consider when looking for storage solutions for your backups, and I think you'll note that these are somewhat different from those you might consider when choosing storage media for your live filesystems. Another factor that comes into play is what the purpose of the backup is. I know, we already said the purpose of a backup is to be able to restore data, but within that larger objective, there are still at least two distinct use cases: We have long-term storage or archival and we have the rapid recovery from data loss. These two use cases are notably different and will require you to determine separate solutions. The first -- long-term storage -- is something that you likely want as part of a disaster recovery plan, or perhaps as part of certain restrictions or requirements mandated by your business or even the government. Consider, for example, operating data systems for a public office with a mandate to preserve communications and records, or a newspaper archive. Here, long-term storage really means "long term" -- on the order of not years, but decades. This, then, has a number of implications. For starters, you need a complete backup. Any incremental backups do not help you here. So you probably want to make sure to keep these separate from your regular, ongoing backups. What's more, you likely want to keep this data at a separate location from your normal operations, and accessing such data stores then may incur a performance cost, since you may have to physically travel or otherwise suffer data retrieval penalties. By doing full backups for long-term storage, you will also most likely only have coarse granularity: you can't store _all_ the data for long-term storage every minute, but probably need to take snapshots in time every... month or so. And then, if you are thinking long-term storage, really do think _long_ term storage. Which means you have to consider what medium is suitable for storage of data for decades. And... how will you get to the data? Suppose you took a long-term archival snapshot in the year 2000, and you stored the data on magnetic tape and then shipped it to a storage facility somewhere. Now, over 20 years later, you want to access the data -- do you still have a tape library that can read the tapes?
Do you have the right cables to connect this library to your 2021 laptop? I'm quite sure the tape library doesn't have a USB-C connector... Similarly, if you encrypted the data, make sure you are also able to _decrypt_ the data ten, fifteen years later. Who has access to the decryption key? How did you store _that_? So you see that this -- archival of data for the long term -- brings with it a whole bunch of considerations that do not even come directly into play when talking about regular backups, which we generally think of as protecting against sudden data loss. And even _that_ is not a uniform case, and different causes for data loss may lead you to develop different protection mechanisms: For example, if you want to protect your users from themselves, then the way to consider data restoration is by and large on an individual file level. Similarly, software bugs that lead to data being deleted, overwritten, or otherwise lost also require a granular file-level restoration, while hardware failure, such as a head crash on a hard drive or other physical failure, tends to lead to _all_ the data on the drive being lost, hence requiring recovery of the whole system, not individual files. This is similar to how a security breach generally leads to you having to reconstruct the whole system, since you can't trust the individual tools or files any longer. But you also have to prepare for even bigger events, which _will_ eventually hit you: natural disasters, for example. I'm sure many of you may remember Hurricane Sandy a few years ago, which really did a number on all the enterprises with data centers and hosting located in lower Manhattan. So in cases like this, where you lose _all_ your data -- at least all your data in one, admittedly large, geographic region -- you are going beyond the general concept of backups and towards disaster recovery with data replication across diverse environments and providers, again intersecting more with the requirements of long-term storage and archival. Now all of this is, as usual, a topic much too broad to be covered here, but the concept of a Business Continuity Plan or BCP all of a sudden doesn't seem quite so boring any longer when you prepare for the worst outcomes. In a way, you have to treat your backup strategy like insurance. Yes, it's boring, it's annoying, and it costs you a lot of money -- and the weird thing is, you really, really hope you will never need it -- but you do sleep a lot better knowing you have it. So let's at least think about recovery from a system failure for just a moment, since we'll look at more granular recovery from individual data loss in our next video: A system or hardware failure will in all likelihood lead to the loss of the entire system, which then also generally leads to at least _some_ downtime. Now obviously you should have your systems and services structured such that there is no single point of failure and any given system is replaceable, but you will still need to take down the failed system to recover. Now under some circumstances you may already have some protections in place -- RAID, as discussed in week 2, can provide some protection against individual hard drive failure, for example, and allow you to recover without downtime, but if your system has indeed failed completely, then the recovery will take at least _some_ time as you have to rebuild the whole system.
Depending on your backup schedule, this may then even require you to retrieve the last complete backup from some archival storage, and in some cases at least _some_ data loss may be unavoidable. The best rule of thumb to protect against data loss here is to follow the 3-2-1 Rule, which requires that you keep at least three copies of your data on at least two different storage media with at least one copy at an offsite location. With this approach, you are minimizing the ways in which a single failure can take out all your copies, but do make sure you are preparing for the applicable disasters. "Offsite" can mean "a different office on the other side of town", or it can mean "a different data center on the other side of the country" -- in the case of large natural disasters, the latter can help you when the former could not, although that, of course, also depends on the size of the country we're talking about. But even if you _did_ deploy your best backup strategy following all this advice, you _still_ have a few more things to worry about: the integrity and correctness of your backups as well as of the tools you used to perform the backups. Because, hey, guess what, if you want to back up all your data, the tools and processes that do this need to have _access_ to all your data. And there have been attacks on backup software and other solutions that led to data compromises. Which is a problem because if you, for example, manage to get some malware deployed on my systems, and I then perform a backup, I'm backing up your malware, and if I later restore from that backup, I may play back your malware. So it's important to keep in mind that the data you have on your backups is only as trustworthy as the data that you copied initially. Similarly, to restore data from a backup, you have to use only trusted tools: if you suffered a security breach, and you believe your system is compromised, then you can't use the tools on that system to restore data, even if you know the data on your backups is correct. So all in all, it's actually a pretty important but not often considered aspect of your data backup strategy to regularly verify the authenticity and integrity of your backups. That is, regularly try to restore a random sampling of your datasets from your backups -- this will verify that (a) you have the tools to do that; (b) know how to do it; (c) _can_ actually do it; and (d) your backups are actually worth anything at all. If you can't restore data, your backups are pointless, worthless, and a waste of your time and resources. And with these wise words, and hopefully having scared you a little bit into checking and validating your own backups, we can take a short break here. In our next video, we'll take a more practical look at some of the tools we use for individual backups on the file level. I recommend that you try out the exercise linked here to get an idea of how to use these tools and to get you thinking a bit more about performing backups. We'll talk more about the outcomes and objectives of this exercise in the next video. For now, go ahead and check your own backups and make sure you can restore data from them. You do have backups of all your systems, don't you? I thought so. See you next time - cheers!
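(As a small companion to that exercise, here's one way you might sketch such a restore-verification check in Python. To be clear, this is not a specific tool we use in class: the paths and the manifest format are placeholders, and it only checks file contents, not metadata such as ownership or permissions. The idea is simply to record checksums at backup time, restore a random sample into a scratch directory with whatever backup tool you use, and compare.)

```python
#!/usr/bin/env python3
"""Minimal sketch of a restore-verification check.

At backup time, record a checksum manifest; after a test restore into a
scratch directory, verify a random sample of files against that manifest.
Paths and manifest format are placeholders; file contents only."""
import hashlib
import json
import os
import random
import sys

def sha256(path):
    """Hex SHA-256 digest of a file's contents, read in 1 MB chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def write_manifest(top, manifest_path):
    """Record relative path -> checksum for every file under 'top'."""
    manifest = {}
    for dirpath, _dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(dirpath, name)
            manifest[os.path.relpath(path, top)] = sha256(path)
    with open(manifest_path, "w") as f:
        json.dump(manifest, f)

def verify_sample(restored_top, manifest_path, count=20):
    """Verify a random sample of restored files against the manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    failures = 0
    for relpath in random.sample(sorted(manifest), min(count, len(manifest))):
        copy = os.path.join(restored_top, relpath)
        if not os.path.exists(copy) or sha256(copy) != manifest[relpath]:
            print(f"MISMATCH: {relpath}")
            failures += 1
    return failures

if __name__ == "__main__":
    # e.g., call write_manifest("/data", "/backups/2021-04-01.manifest") at
    # backup time, then after restoring a sample into /tmp/restore-test:
    sys.exit(1 if verify_sample("/tmp/restore-test",
                                "/backups/2021-04-01.manifest") else 0)
```

If a check along these lines runs regularly -- and alerts loudly when it fails -- you'll find out that your backups have quietly broken long before you actually need them.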