Hello, and welcome back to CS615 System Administration! This is week 02, segment 1, and we're going to be talking about storage models and disks. As system administrators, we are responsible for all kinds of devices: we build systems running entirely without local storage just as we maintain the massive enterprise storage arrays that enable decentralized data replication and archival. We manage large numbers of computers with their own hard drives, using a variety of technologies to maximize throughput before the data even gets onto a network, just as we often require and guarantee access to shared data storage across the internet. We begin with this topic because without data storage, we can't even bring up an operating system. We are always dealing with ways to store data, and so as a SysAdmin, we had better have at least a rough idea of what's involved here, what can go wrong, and how to scale data access.

---

Speaking of scaling data storage... that's really a futile endeavor. No matter how much we make available, we'll soon run out of it. Think about it: just a few years ago, having access to even just one gigabyte of data was mind-boggling. Alright, for most of you, this has probably shifted, and the notion of needing a terabyte of storage for your videos, photos, and music doesn't seem so outlandish, but of course you may also have heard the famous quote attributed to Bill Gates that 640K "ought to be enough for anybody". Data expands to fill any void. No matter how much storage you allocate, your users will find a way to use it up. So we need to think about how to avoid the perpetual state of full disks.

---

For that, we will discuss the following:

- basic disk concepts
- basic filesystem concepts
- and file systems in general as well as the traditional Unix file system in particular

---

Within each of these topics, we have a few subtopics: In order to understand basic disk concepts, we need to take a look at common storage models, which is the primary topic of this video.

---

Following that, we'll talk a bit about disk interfaces -- how we connect storage devices, both physically and in terms of the protocols we use.

---

We'll talk about physical aspects of disk drives, as they are important for understanding file system concepts later on...

---

...and from there, we'll talk about partitions, which will flow logically from the discussion of the physical structure, as you will see.

---

Next, within the area of file system concepts, we'll talk about how to combine storage devices using, for example, a redundant array of independent disks, or RAID...

---

...logical volume management...

---

...and how formatting devices may influence the storage capabilities.

---

Finally, we'll talk about what you can do with storage media once you have it hooked up appropriately: what types of filesystems there are, and exactly how the traditional Unix filesystem works. This part will be covered next week, while we'll try to cover all the other topics in week 2.

---

So let's get started. We distinguish different storage models by how the device in charge of keeping the bits in place interacts with the higher layers: by where raw block device access is made available, by where a file system is created to make the disk space available as a useful unit, and by which means and protocols the operating system accesses the file system.
Somewhat simplified, we identify:

- the operating system
- the storage device itself -- i.e., that thing where the bits are actually stored
- the file system sitting on top of the storage device
- and the application software interacting with the file system

---

Ok, so the first -- and most common as well as simplest -- storage model we'll talk about here is Direct Attached Storage, or "DAS". It's just what it sounds like: the storage medium -- most commonly a physical hard drive -- is _directly_ attached to the host. If you look at your laptop, a workstation or desktop, or a physical server, you'll likely find a physical hard drive present. Here, we see one partition on the physical disk /dev/disk1 mounted under slash, as well as several other partitions mounted elsewhere in the file system.

---

Now for a workstation or a typical server, the direct attached storage device might be a regular IDE hard drive, such as the one shown here. On the right, we then see how the different components come together, as obvious as it may seem: The storage device is part of the physical server and managed by the operating system; the filesystem is created on the storage device and facilitates access to the block-level storage provided by the hard drive to the application software on a file level. Makes sense, right?

Direct Attached Storage is a very simple architecture with a number of advantages: since there is no network or other additional layer between the operating system and the hardware, the possibility of failure on that level is eliminated. Likewise, there is no performance penalty due to, for example, network latency. At the same time, there are some disadvantages. Since the storage media is, well, directly attached, it implies a certain isolation from other systems on the network. This is both an advantage and a drawback: on the one hand, each server requires certain data to be private or unique to its operating system; on the other hand, data on one machine cannot immediately be made available to other systems.

---

But we frequently want to be able to access certain data from multiple servers. When a user logs into hostA, she expects to find all her files in place just as when she logs into hostB. Recall that our shared Linux systems at Stevens are behind a load-balancer, so when I ssh to linux-lab, I may be dropped into one of several physical systems, but I still expect to -- and fortunately do -- have all my files available. The way we can accomplish this is by having the data stored on a _network attached storage_ device, which sits in a central location somewhere and from which the files are accessed via, for example, as shown here, the Network File System, or NFS.

Ok, so how does this work, then? We see that there is a central server -- kronos.srcit, in this case -- which offers the data from my home directory from a specific path, and the local host -- eva, in this case -- mounts that under /home/jschauma. So the file server must have the actual storage, and has created a filesystem on that, which it then makes available over the network, such that clients can then access it. But these clients do need to support the network file system, which in turn provides the standardized abstraction of file-level access to the application software. Now note that the file server accesses the storage device as "direct attached storage", meaning it's (as best as we can speculate, anyway) physically connected to _that_ server.
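As a side note, here is a minimal sketch of what mounting such an NFS export by hand might look like on the client; the export path is made up for illustration, and the exact syntax and output vary by operating system:

    # mount the (hypothetical) export offered by the central file server
    # onto the local mount point -- this requires NFS client support in the OS:
    $ sudo mount -t nfs kronos.srcit:/export/home/jschauma /home/jschauma

    # the NFS share now appears in the file system tree right next to the
    # directly attached disk and can be used like any local directory:
    $ df -h /home/jschauma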
What the file server's storage looks like in practice depends on the scale of the system, but seeing how the path on kronos.srcit is called "xraid0-1", it might...

---

...just be an Apple XRaid storage appliance, a mass-storage device that was offered by Apple up until 2008 and which offered storage of up to 10 terabytes or so. I remember when I worked here at Stevens back in... 2006, we did buy an XRaid and used it as our NAS. While this system is no longer in use here, the export paths were kept the same to avoid having to update all possible clients, it seems. I also have a little SysAdmin war story relating to this XRaid here at this link for your entertainment. Anyway...

---

Now Network Attached Storage allows multiple clients to access the same file system over the network, but that means it requires all clients to use this specific file system. The NAS file server manages and handles the creation of the file systems on the storage media and allows for shared access, overcoming many limitations of direct attached storage. At the same time, however, and especially as we scale up our requirements with respect to storage size, data availability, data redundancy, and performance, it becomes desirable to allow different clients to access large chunks of storage on a block level. To accomplish this, we build high-performance networks specifically dedicated to the management of data storage: Storage Area Networks.

In these dedicated networks, central storage media are accessed using high-performance interfaces and protocols such as Fibre Channel or iSCSI, making the exposed devices appear local on the clients, effectively as block devices using Direct Attached Storage, as shown here on the right. Another chunk carved out from the storage pool could then be used by a separate file server to build a new filesystem on and export via NFS, as illustrated here on the left. Storage Area Networks are also specifically that: networks. That is, they can be configured in a true, switched fabric using...

---

...hardware that looks a lot like network gear, and may use a variety of protocols to overlay storage communications on existing networks or build brand new, separate networks, often using fibre-optic connections. Now once you've internalized that the SAN may be a fully switched fabric, and can be used to make available both flexible block-level storage as well as file-level network storage, it comes as no surprise that you end up with yet another layer of abstraction and start to offer storage as a service in...

---

...what else: the cloud. Now here there's yet another set of distinctions that we may want to make:

- (1) services that provide file-level storage and access, as in the case of file hosting services such as, say, Dropbox, Google Drive, or Apple iCloud;
- (2) services that provide access on an _object_ level, hiding file system implementation details from the client and providing for easier abstraction into an API, with Amazon's Simple Storage Service, or S3, being the prime example; and
- (3) services that offer clients access on the block level, allowing them to create file systems and partitions as they see fit, such as, obviously, AWS Elastic Block Store, or EBS.

---

All of these categories have one thing in common: In order to provide the ability to access storage units in a programmatic way, they offer a well-defined API for access, oftentimes HTTP- and REST-based.
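Before we look at the provider side, here is a quick sketch of what the block-level SAN access from a moment ago can look like from a Linux client using open-iscsi; the portal address and target name are made up for illustration:

    # discover the targets offered by the (hypothetical) SAN portal:
    $ sudo iscsiadm -m discovery -t sendtargets -p 10.0.0.5:3260

    # log in to one of the discovered targets...
    $ sudo iscsiadm -m node -T iqn.2024-01.edu.stevens:storage.lun1 -p 10.0.0.5:3260 --login

    # ...after which the remote LUN shows up as just another local block
    # device (e.g., /dev/sdb) that we can partition, format, and mount,
    # exactly as if it were Direct Attached Storage:
    $ lsblk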
On the service provider's side -- shown here at the bottom of this graphic -- the storage model in use remains an opaque system to the customer. They may use a Storage Area Network to combine large numbers of distributed storage resources to present a large storage pool to the users; as users of the system, we really don't care, and the storage magically appears, as if out of the blue.

---

We have already seen -- and will continue to see -- how this works by using AWS, but let's quickly illustrate the distinction between the object level and the block level. First, let's take a look at S3: We create a new S3 bucket and then recursively copy a bunch of files into it. And that's it. Nothing else needed. It really couldn't be much simpler -- just like that, we backed up a directory full of files into cloud storage with the ability to inspect the contents via the command line and without having to worry about whether there is enough storage available, how large our files are, or whether there are any limitations on how many files we can create, etc. Pretty neat, huh? Well, let's nuke the files again. Goodbye.

Next, Elastic Block Storage. [pause] As the name suggests, this cloud storage method allows for access on the block level, meaning we get a device that looks and behaves just like a real disk. From an instance's perspective, this is really just a variation of direct attached storage, because the virtual machine doesn't know -- nor care -- where the block device comes from. [continue] And in fact, your regular AWS EC2 instance will utilize an EBS volume as local storage, as shown here: What we see here is that we have a volume that is attached as a disk under /dev/sda1. We can get more information via the 'describe-volumes' command... ...which gives us the details about the volume's size and some other properties. But with EBS, we can always make new "disks" magically appear! Here, let's create a new, four-gigabyte disk. Now as a block-level storage device, we can't directly write files to this, but would need to create a new file system, something we'll get back to in a recommended exercise in just a minute. But for now, let this brief demo suffice to illustrate what cloud storage can look like. (A rough sketch of the commands used in this demo follows after the recap below.)

---

Alright, let's take a break here and recap what we covered: We distinguished four primary storage models:

- Direct Attached Storage
- Network Attached Storage
- Storage Area Networks
- and Cloud Storage.

We noted how these can be combined and may offer different kinds of access, although we primarily distinguished between

- block-level access, where the device appears as if it were a directly attached physical disk, and
- file-level access, where systems interact with the storage medium through a file system or API.

Each of these models has a number of implications. For starters, it's important to remember that even when we're dealing with "magic" storage appearing out of nowhere in the cloud, _somewhere_, _somebody_ is managing a physical storage medium. Secondly, as we add layers of abstraction and allow for increased flexibility, we are changing the security model. A directly attached physical drive can only be compromised from that host or by physical access, but a network file server may be compromised... well, over the network, for example. We also note that as we're combining the different models, we may end up using or combining a significant number of technologies and protocols. We'll take a brief look at some of them in our next videos, when we talk about disk interfaces and protocols.
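As promised, here is roughly what that S3 and EBS demo might look like using the aws command-line client; the bucket name, local directory, and availability zone are placeholders:

    # object level: make a bucket, recursively copy files in, list them,
    # and finally remove them again:
    $ aws s3 mb s3://cs615-demo-bucket
    $ aws s3 cp --recursive ./mydata s3://cs615-demo-bucket/
    $ aws s3 ls --recursive s3://cs615-demo-bucket/
    $ aws s3 rm --recursive s3://cs615-demo-bucket/

    # block level: inspect the volumes attached to our instances, then
    # create a new four-gigabyte volume that we could attach later:
    $ aws ec2 describe-volumes
    $ aws ec2 create-volume --size 4 --availability-zone us-east-1a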
---

Alright, this brings us to the end of this video segment. Before I let you go, though, I'd like to leave you with a few recommended exercises and problems to help you deepen your understanding of what we covered today:

- Try to research the details of some public cloud storage services, such as AWS, Google Cloud, Microsoft Azure, etc. Can you find out what storage solutions they might be utilizing on the backend? What kinds of scalability considerations do they have to make?
- Next, think about the different storage models we discussed and identify specific security problems. I just said that each has specific implications, but what might those be in particular?
- And finally, here's a more specific exercise that I recommend: in it, I'm asking you to create an EBS volume, attach it to an instance, create a file system, add a file, then move the volume to another instance to help you get more comfortable with Elastic Block Storage and filesystem concepts.

Good luck, and remember to ask questions if you run into problems. Until next time - thanks for watching!
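For reference, a rough outline of the steps in that last exercise might look like the following on a Linux instance; the volume and instance IDs are placeholders, the volume must live in the same availability zone as the instances, and the device name under which it shows up (/dev/sdf, /dev/xvdf, or an NVMe device) depends on the instance type:

    # create a small volume and attach it to the first instance:
    $ aws ec2 create-volume --size 1 --availability-zone us-east-1a
    $ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
          --instance-id i-0123456789abcdef0 --device /dev/sdf

    # on that instance: create a file system, mount it, and add a file:
    $ sudo mkfs -t ext4 /dev/xvdf
    $ sudo mount /dev/xvdf /mnt && echo "hello" | sudo tee /mnt/testfile

    # unmount and detach, then attach the volume to the second instance
    # and mount it there to confirm the file survived the move:
    $ sudo umount /mnt
    $ aws ec2 detach-volume --volume-id vol-0123456789abcdef0
    $ aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
          --instance-id i-0fedcba9876543210 --device /dev/sdf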