Hello, and welcome back to CS615 System Administration. We're now in week 10, and in this first segment of the week we're starting our discussion of Configuration Management. This is one of those really interesting topics in System Administration that combines all the practices and experiences of managing hundreds or thousands of systems, their individual services, and everything in between with the broader concepts of distributed computing. It covers, directly or indirectly, almost all of the topics we've discussed so far, but takes us into a perhaps more academic area of the field. But let's first see what exactly "configuration management" is, and why we might need it...

As much as System Administrators like to joke that their users are the cause of all the problems with the system, and that things would be just so wonderful and easy if only we didn't have to deal with those pesky requests and users causing trouble, it's important to remember that the systems we manage are intended to fulfill a specific purpose, to be _useful_. That is, in order to be _useful_, they must be _used_, which in turn means that they are not _static_. Files are created, modified, or removed; users log in and run commands; services are started or terminated. In addition, the requirements of the systems, dictated at least in part by evolving business needs or emerging technologies, change all too frequently as well. This leads to new software being added, patched, or upgraded; user accounts being added or removed; jobs being scheduled or their frequency changed; interactions with other systems being enabled or prevented. In other words, our systems continuously undergo change.

On a single host, such changes are made by locally modifying system configuration files, invoking specific commands, and installing or tuning different applications. These changes then need to be replicated across large numbers of systems, nowadays frequently made even more dynamic through the use of short-lived containers and the like. But as we build these servers, even if we meticulously document every step, we need to be careful not to rely on fragile, individual systems. In our discussions of job automation and system-centric software development, we were seeking increased reliability and resilience through automation, which in turn requires us to inventory and document our needs. For example: in order to apply a security update to our web servers, we first need to know exactly which of our many hosts are running the vulnerable version, and then perform the same upgrade steps on each of them -- we'll sketch what that might look like in a moment.

Since we cannot possibly keep track of hundreds of changes across thousands of systems ourselves, we delegate the task of applying well-defined sets of changes to (possibly very large) numbers of systems to a class of software known as "Software Configuration Management" systems, simply referred to as "CMs". We have hinted at the capabilities of such solutions before: in week 4, we explained that by the time the OS is installed on a host, we already have to perform at least a minimal amount of custom configuration. We also noted that in order to be able to add software to a host, a configuration management system requires tight integration with the system's package manager.
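Speaking of that security-update example: here is a minimal shell sketch of the kind of ad-hoc inventory check a System Administrator might hack together, assuming a hypothetical hosts.txt file listing the web servers and a Debian-style package manager on each host; none of these names or version strings come from the lecture, and a proper CM system would of course keep this inventory for us.

```sh
#!/bin/sh
# Ad-hoc inventory check: which of our web servers still run the vulnerable
# package version?  Host list, package name, and version are placeholders.
VULNERABLE="2.4.99-1"

while read -r host; do
        # -n keeps ssh from consuming the rest of the host list on stdin
        version=$(ssh -n "${host}" "dpkg-query -W --showformat='\${Version}' apache2" 2>/dev/null)
        if [ "${version}" = "${VULNERABLE}" ]; then
                echo "${host}: needs upgrade (apache2 ${version})"
        fi
done < hosts.txt
```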
Another ever-changing aspect of our systems was touched upon when we talked about user management, and in just the last few videos we've repeatedly stressed the importance of automated backup systems, which obviously require a central configuration that is distributed across multiple hosts, with some changes made to account for local differences.

So how did Configuration Management systems evolve? They all developed to solve this generic problem: you have figured out how to correctly configure one server for the intended use case, and now you want to replicate this setup to others. And, yes, just about every single System Administrator I know has undergone the evolution I'm showing here -- it appears to be one of those lessons that everybody has to learn for themselves.

But anyway, how would we go about this? Well, we might start by copying all the necessary bits from one system to the other, then copy some of the config files over, and then restart the service on the newly set up system. Cool - we're done! But then you realize that copying all this data around is not very efficient. Remember what we used in our backup videos to copy data more efficiently, to only sync the delta between two sets of data? That's right, we used rsync(1). And so we go ahead and do just that.

Except, of course, this is a bit dangerous, because, hey, your machines may be _similar_, but they're not _identical_. There are some bits that are unique to each system that you can't blindly copy over. We already knew that, though, didn't we? That's right, remember this slide from week 3, when we talked about the filesystem hierarchy? Using this matrix, we get a pretty good idea of which data we can sync easily, but we also face the distinct problem of having to somehow manage the data that we marked as _non-shareable_. Meaning: we have some files here that are not shareable because they are not identical across hosts, but they are _predictable_ in their contents and can oftentimes be derived from certain system properties. So we're talking about doing what Computer Scientists do best: adding layers of indirection or abstraction.

But okay, back to our initial problem. Once we've realized that we can't blindly copy things from one server to another, we begin to think about pre-populating the correct configuration for each destination server in a central place, and we start to build a so-called "golden image", from which we can deploy all the shareable data, but where we also keep the host-specific bits, such that only those files that are relevant to a given system are copied over before we then start the remote service and call it a day -- I'll sketch this out in just a second. Which... works fine if you have a handful of hosts, as you're sequentially looping over them, but clearly we have a scalability issue here. So we then turn the whole thing around, and instead of pushing data from a single server to thousands of machines, we instead have each server pull the data, which really is just the same approach, but in reverse, more or less. And once you've deployed that, you notice that this, too, can suffer scalability issues, since now all of these machines will try to connect at the same time to this little server over here.
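Here, then, is a rough sketch of that golden-image push stage; the /srv/golden layout, the hosts.txt file, and the service name are all invented for illustration, and you can see right away why looping sequentially over thousands of hosts won't scale.

```sh
#!/bin/sh
# Push the shareable "golden image" plus the pre-generated, host-specific
# bits to each host, then restart the service.  All paths and names here
# are placeholders.
GOLDEN=/srv/golden

while read -r host; do
        # shareable data: identical on every destination
        rsync -az "${GOLDEN}/common/" "${host}:/"
        # non-shareable but predictable data, prepared per host in advance
        rsync -az "${GOLDEN}/hosts/${host}/" "${host}:/"
        # -n keeps ssh from consuming the remaining host list on stdin
        ssh -n "${host}" "service httpd restart"
done < hosts.txt
```

The pull variant is the same idea in reverse: each host runs roughly these rsync commands against the central server on its own schedule, which is what leads to the thundering-herd problem just described.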
So to try to prevent that thundering herd, you then add a random sleep over here to spread out the load a bit, or perhaps you stage out the changes across multiple servers, or you instead try to switch to more of a pub-sub model, whereby each server only springs into action when it's told that changes are needed, rather than periodically thrashing the disk and calculating the checksums for all those files. Eventually, though, you've had enough and you switch to using an actual configuration management system, such as, for example, Puppet. But of course then you have to spend a fair bit of time actually understanding how to set up the service before you install it -- though not without having cursed Java once more with all your heart -- and configuring the client side as well, and now everything is peachy and all your files are synced, until... you inevitably run into some edge condition that the CM system you picked doesn't solve, and you go back to step one, where you begin copying files around.

But okay, so regardless of which approach we're using, we've already identified a difference in how some of the data is to be handled: we have the shareable data, consisting of, say, some software packages or fixed data files, and we have system-specific but predictably configured files, such as many of those we find under the "/etc" directory.

What does that look like in practice? Suppose you have your systems distributed across multiple geographical regions or data centers, as we discussed previously in the context of a business continuity plan or to minimize distance to your end users. So let's say we have our us-west, us-east, eu-north, and apac regions or data centers, each hosting the same HTTP service. Now each of these systems will of course require its own default route, as but one example of a network configuration, but you also most likely want to ensure that each one uses its respective local DNS resolver and syslog server, which then means that you have to ensure that /etc/resolv.conf and /etc/syslog.conf get configured correctly for each of the regions.

In addition, you probably have a few things that are common to _all_ of the systems. For example, you probably have to grant access to all of the systems here to the developers in charge of the application, but wouldn't want those users to have access to the DNS and syslog servers, meaning you'd then enable different accounts based on the server's role, while furthermore you _also_ have at least some configurations that are the same for _all_ of the systems in your fleet.

In other words, the configuration of your systems can be divided into two broad categories: those aspects that vary based on the system's specific placement, such as by geographical region or perhaps by network zone, and those that are specific to the given task of the service. Amongst the unique, yet predictable properties are then the network configuration, oftentimes derived at runtime from, for example, DHCP; the various critical infrastructure services we mentioned, including but not limited to DNS, NTP, and syslog; but also the minimum version of the operating system you're using, the user management, and things like the configuration of SSH, although of course there are countless other examples. On the other hand, we also have _service-specific_ properties, which we'll get to in a moment.
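First, though, to make those unique-yet-predictable, location-dependent files a little more concrete, here's a minimal sketch of how they might be derived from a system property such as the hostname; the naming convention, resolver addresses, and loghost names are all made up, and this is exactly the kind of logic a CM system's templates would take care of for you.

```sh
#!/bin/sh
# Derive per-region configuration from the host's (fully qualified) name.
# Assumes invented hostnames of the form www03.us-west.example.com.
region=$(hostname | cut -d. -f2)

case "${region}" in
        us-west)  resolver=10.1.0.53; loghost=syslog.us-west.example.com ;;
        us-east)  resolver=10.2.0.53; loghost=syslog.us-east.example.com ;;
        eu-north) resolver=10.3.0.53; loghost=syslog.eu-north.example.com ;;
        apac)     resolver=10.4.0.53; loghost=syslog.apac.example.com ;;
        *)        echo "unknown region '${region}'" >&2; exit 1 ;;
esac

# write the region-specific resolver and log-forwarding configuration
printf 'nameserver %s\n' "${resolver}" > /etc/resolv.conf
printf '*.*\t@%s\n' "${loghost}" > /etc/syslog.conf
```

In a real CM system you would express this as a template plus per-region data rather than a shell case statement, but the principle is the same.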
Now, coming back to the service-specific properties: to provide the HTTP service, for example, you'd have to define and configure across all systems the correct software package, the correct configuration for the web server -- including, for example, the TLS configuration, which of course in turn requires you to distribute the correct TLS certificate and keys -- perhaps your database configuration, as well as whatever content your server is hosting. And of course, just like before, there are few limits to what else might fall into this category.

Now, having understood _what_ we might want to configure differently here, let's see _how_ we can go about defining this in a scalable manner. So here's our syslog server, and we might try to describe it in somewhat abstract terms like so: we require log rotation, SSH, the admin accounts we mentioned previously, and then the actual configuration of syslog itself. Different CM systems use different syntax and languages to represent such concepts, but to give you an idea, here's an example of how such a Puppet resource might be defined. As you can tell, the system uses a domain-specific language that's reasonably self-explanatory, defining, for example, that a given software package should be installed and kept up to date, that a specific service should be kept running, and what the various permissions and ownerships are to be on the relevant configuration files.

Now, to enable admin accounts on these systems, we would have to provide another definition, as this is separate from the syslog-specific tasks, and the example shown here uses a Chef "cookbook", which is written in a Ruby-based language, allowing you to define resource parameters in a dynamic fashion. Then, to round out our quick demonstration of different config management system examples, let's use CFEngine to illustrate how you might define the SSH-relevant bits. As I'm sure you'll have noticed, these different systems are rather similar in how they structure their data and definitions, making it reasonably painless to move between them once you've internalized the general concepts. Now, obviously you'd not be using three different CM systems here; you'd instead define everything using the one system you chose, but I wanted to provide you with a simple and quick look at the commonalities.

So now that we've defined our "syslog" service, we can then repeat the same thing for our DNS service as well as our HTTP service. But of course you'll have noticed that each of these does not define the 'logrotate' and 'ssh' services itself, but instead uses an "include" directive. That is, with a proper CM system, we can abstract specific components and reuse them, so that we then might have the "logrotate" functionality specified separately, just like the "ssh" service. But just like when we're writing software, we want to avoid duplication of code, right? So when we notice that we're using the same things across the board, we can of course group these two things together as a separate resource, thereby minimizing the places where we might need to make changes should we wish to add something to the "common" service here. And of course we can then iterate on this approach and realize that "hey, wait a second", over here we're again doing the same thing as over here, so why not put _that_ into its own definition and reuse it, which then lets us replace the repeated inclusion here with the shortened form shown here.
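Since the slides themselves aren't reproduced in this transcript, here's a rough sketch of what such a composed definition might look like, using Puppet's declarative language as named above; the class, package, and file names are merely illustrative guesses, not the actual examples shown in the video.

```puppet
# A hypothetical "syslog" service definition, composed from reusable pieces.
class syslog {
  include common                      # logrotate, ssh, admin accounts, etc.

  package { 'syslog-ng':
    ensure => latest,                 # install the package and keep it up to date
  }

  service { 'syslog-ng':
    ensure  => running,               # keep the service running
    enable  => true,
    require => Package['syslog-ng'],
  }

  file { '/etc/syslog-ng/syslog-ng.conf':
    ensure => file,
    owner  => 'root',
    group  => 'root',
    mode   => '0644',
    source => 'puppet:///modules/syslog/syslog-ng.conf',
    notify => Service['syslog-ng'],   # restart the service when the config changes
  }
}
```

The "include common" line is exactly the deduplication step just described: the pieces shared by every role -- log rotation, SSH, the admin accounts -- live in one place and get pulled into each service definition.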
So, as you see, we are building up a much more programming-centric, automation-focused approach here, going well beyond simply copying files around and instead really _defining_ our services, identifying reusable building blocks, and combining those to derive service specifications -- which is precisely one of the strengths of using a capable configuration management system. So... goodbye, manual rsync, you didn't scale well at all, even if we made do with you for much longer than we should have.

Ok, I think it's time to take a break. We're not done with the topic of configuration management, and our next video will go into more detail on a CM system's required capabilities, the concept of state assertion, and, yes, a bit of an allusion to distributed systems and the CAP theorem.

To keep your mind focused on this topic, I'd recommend that you review some of the discussions from earlier videos. In particular, understanding the filesystem hierarchy is important here, and it'd be a good exercise for you to classify the different parts of your systems. You should also try to follow the example we showed here and attempt to clearly define a particular service. For example, in order to configure, say, an SMTP server, what are the different components and services you would need? Do some research into the different config management systems out there and see how they differ and what they have in common. Some are easier to set up and try out than others, so maybe give one a go. And finally, read up on "Infrastructure as Code" and "Service Orchestration", two areas that overlap and intersect significantly with "Configuration Management", but that are not entirely identical to it. We'll revisit these relationships in our next videos, but I hope these example exercises will help you deepen your understanding and better internalize what we've covered here so far.

Until the next time - thanks for watching! Cheers!