Hello, and welcome back to CS615 System Administration! This is week 10, segment 2, and we're going to continue our discussion of the underlying concepts of configuration management systems. As we hinted at in our last video, where we talked about the abstraction of service definitions, we now want to go even further to better understand just how these critical infrastructure services impact the operation of our systems, and what important considerations we have to keep in mind when installing, running, and managing them. Let's look again at our service definitions from last time: --- We had done a decent job of defining, with some level of abstraction, what needs to happen on a given system that fills a specific role, and had noted the concept of including other modules to build extensible systems to... well, to do what, exactly? To configure our services, you might say. And when we perform these configurations manually, what do we do? We - edit configuration files, we - install packages, and we - start services, for example. But it's critical to understand that for a configuration management system --- the main task is _not_ to "make changes". Instead, it is the job of a CM system - to _assert state_. And that is a significant and notable difference that influences how these systems work. It may initially seem counterintuitive, but when you think about it, you really do not want specific changes to be made on a system; rather, you want to ensure that the host is configured correctly, and then apply only _those changes_ necessary to get it into that state. We know that the entropy of a closed system increases with time, so our systems deteriorate, and it is the job of the CM system to bring order to chaos, to get us back on track. --- Here, let us illustrate the states we might care about: We start with a system in an "unconfigured" state, meaning the system does not meet the requirements we have, does not have the right packages installed or the right configuration files present. This might be a brand new system with just the bare OS installed, for example. Our CM system then - performs whatever changes we have defined to bring the system into the "configured" state. After the CM system has run, all required packages are installed and configured properly, and all services on the host are running. The system is ready for production. Now throughout the lifetime of the system, we may periodically apply - certain changes, such as upgrading a package or changing the way a specific service runs and then restarting it. But with time, - things change. Entropy -- or perhaps: users -- causes our system to deviate from the desired state into a "deviant" state. Perhaps somebody made some manual changes on the system, or added some packages that now conflict with the service. Now our - CM system runs regularly and detects this deviation, reverts or overrides such changes, and brings our system back into the known good state. But entropy can be clever, and lead us to yet another state: - the "unknown" state. That is, we are now in a state that is not well defined, perhaps because the CM may have stopped running on the host or may erroneously be applying the wrong configuration. The host may have been shut down, the network disconnected, an intruder may have taken over control, or rats may have gnawed through the power cable. We simply don't know. What's worse, we may not even know that this host is in an unknown state! Not all failures are immediately obvious.
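To see why simply re-running the CM system can move a host from the "deviant" state back into the "configured" state, consider a minimal, hypothetical sketch of what an assert-state recipe boils down to; the package name, the rc.d script, and the file paths are placeholders for illustration, not the syntax of any particular CM tool:

```sh
#!/bin/sh
# Hypothetical assert-state sketch: check the current state and only make
# the changes needed to reach the desired state. All names are placeholders.

PKG="apache"                          # desired package
CONF_SRC="/var/cm/files/httpd.conf"   # known-good config shipped by the CM system
CONF_DST="/etc/httpd/httpd.conf"

# Install the package only if it is not already present.
pkg_info -e "${PKG}" >/dev/null 2>&1 || pkg_add "${PKG}"

# Replace the config file only if it differs from the known-good copy,
# and restart the service only when something actually changed.
if ! cmp -s "${CONF_SRC}" "${CONF_DST}"; then
    cp "${CONF_SRC}" "${CONF_DST}"
    /etc/rc.d/httpd restart
fi

# Make sure the service is running either way.
/etc/rc.d/httpd status >/dev/null 2>&1 || /etc/rc.d/httpd start
```

Run this on an unconfigured host and it configures it; run it again on an already configured host and it changes nothing; run it on a host that has drifted and it repairs exactly what drifted.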
In some cases, if the CM system _is_ still running, or if the host is booted up again after having been shut down - it may recover by itself, meaning a properly running CM system possesses at least some self-healing properties. But it's also possible for us to determine the status of such rogue systems in an "unknown" state via - monitoring and thereby bring those systems effectively into the "known bad" or "deviant" state. Now in some cases a system may have reached a state where the CM system cannot recover by itself, so we may - rebuild it, thereby bringing it back into the "unconfigured" state, which of course may also - happen to bring back a system from an unknown state, either intentionally or automatically via, say, periodic rebuilds. But now note that we may have two additional states beyond just "configured" and the different states of not being correctly configured. Specifically, a host may be - marked as being "in service", meaning it is ready to accept production traffic for whatever functionality it may offer, or it may be - marked as being "out of service". These states are slightly different from the other ones in that they _may_ be influenced by the CM system, but it's also possible that the CM system does not care about this aspect -- it merely ensures a system is configured appropriately. And so if you have a deviant system, your - monitoring solution may trigger a change to take the service out of rotation, although of course that can also be accomplished from the - "configured" state. This aspect is sometimes managed through a separate supervising service called "Service Orchestration", which may bring individual systems - into and out of service from the "configured" state - and flip them from one to the other based on specific, externally defined properties. We show these actions here as dotted lines to illustrate that they _may_ be performed by the configuration management system or by any other means. Now this state diagram here may give you an idea of the transitions a single system may undergo, but of course no single host operates in a vacuum, and CM systems, running on each individual host and often themselves orchestrated from a central service, are --- of course distributed systems, meaning they consist of components that may be located on different networks all across your infrastructure and which require communication between them to coordinate their actions. As such, CM systems are subject to the CAP theorem like any other distributed system, meaning we have to consider three distinct and important properties: - First, there's Consistency. That is, we want to ensure that all systems are consistent amongst each other, with each receiving the same updates. Secondly, - there's Availability: the system can receive updates and ensures that the services managed by it remain available throughout the operation. Finally, there's - Partition Tolerance, meaning the CM system doesn't break down when it can't receive updates, and is able to recover gracefully from any such partition. The dilemma here is that due to the distributed nature of the system, we can only ever guarantee - two of these three properties at a time. Welcome to the wonderful world of distributed systems. --- The other important aspect of CM systems is that in their efforts to assert state, it's critical that the operations performed by the CM are _idempotent_.
Idempotence is the property whereby a function can be applied repeatedly with its own result as input and yield the same outcome. So, mathematically speaking, this means that - f(f(x)) == f(x) One example of this is the absolute value function: - If you take the absolute value of -1, you get 1. If you then take the absolute value of 1, you get 1, which is the same as the absolute value of -1. Now in our context this means that you can perform the same operation multiple times in succession and always have the same outcome, which is notably different from _it doing the same thing_. Here, let's look at some examples: - Removing a file has the outcome that after the command completes, the file is no longer there. What if you then run the command again? Well, the command may complain, saying something like "no such file", but the outcome remains the same: the file resolv.conf is not present. - So this operation is idempotent. Likewise, - if we write some data to a config file in the manner shown here, then the outcome is the same, no matter how often we repeat the command: each time, the file will be truncated and the contents written. - So this is an idempotent change as well. Now if we change the command - to _append_ data, then obviously if we run this command twice, we get a different outcome: the file now contains two lines for this nameserver, so - this command then is _not_ idempotent. - Changing the owner of a file - is idempotent, - as is changing permissions. - But what about a command like this? - Installing a package should yield the same result if I run the command again, shouldn't it? I mean, if the package is already installed, then it would say so and the command would be a no-op, right? - Alas, this is decidedly _not_ an idempotent command: when you tell it to install a package, most package managers will go and install _the latest_ version, which may be a different version on each invocation, thus yielding a different outcome. What if we - specify the exact package version? In that case, we should be guaranteed that either the same package is installed, or nothing happens. - But in reality, we have to admit that we don't know for sure: some packages execute scripts at install time, some expand configuration files based on system parameters, and so on. So we are unable to make a blanket statement here, and instead have to investigate in more detail, which gives us an idea of the complexity of writing truly idempotent CM recipes or workflows. --- And idempotence is but one important requirement. It enables some form of self-healing, as it causes no harm to execute the same steps over and over in order to bring us into the desired state, but do note that this does not mean that it's necessarily efficient: A CM system would be considered entirely idempotent if at each invocation it deleted all packages and then reinstalled everything from scratch. But that would incur a high cost, both in time wasted as well as in service availability, so a - good CM system then needs to try to be a bit smarter. - Sure, the changes need to be idempotent, but that's really on you: the system doesn't know what's idempotent and what's not. _You_ are the one defining what steps it should take. - But the CM system _should_ be able to determine whether any of the steps actually need to be executed, and it also - needs to ensure that as it performs the changes, it will reach the desired final state, across multiple systems.
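Before we look at what asserting state means across multiple hosts, let's gather the examples from these slides in one place as plain shell commands -- a rough sketch, where the nameserver address, the ownership, and the package names are placeholder assumptions:

```sh
# Idempotent: after the first run the file is gone; a second run complains
# "No such file or directory", but the outcome is the same.
rm /etc/resolv.conf

# Idempotent: the file is truncated and rewritten each time, so the
# outcome is identical no matter how often we run it.
echo "nameserver 192.0.2.53" > /etc/resolv.conf

# NOT idempotent: every run appends another line, so the outcome differs.
echo "nameserver 192.0.2.53" >> /etc/resolv.conf

# Idempotent: ownership and permissions end up the same every time.
chown root:wheel /etc/resolv.conf
chmod 0644 /etc/resolv.conf

# NOT idempotent in general: "install the latest version" may pull in a
# different package version on each invocation.
pkg_add apache

# Closer, but still no guarantee: even with the exact version pinned,
# install-time scripts may behave differently depending on system state.
pkg_add apache-2.4.58
```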
Across multiple hosts, that desired final state translates into a requirement for _eventual consistency_: suppose you want to roll out a package update across all HTTP servers. What if one out of ten systems can't apply the update, for whatever reason? Do you then leave that host as is? Then you have nine hosts that are identical and one that's not. Or do you roll back the change on the nine hosts to ensure consistency across all systems? Or perhaps you apply the change on the nine, but mark the tenth one as being "deviant" and take it out of rotation? Different systems solve this in different ways, and - as you can tell, this requires some notable communication with and awareness of the other systems, which is why we often integrate with monitoring systems that help us manage the service status, such as Service Orchestration. --- So we need to keep in mind that configuration management systems oftentimes not only configure and manage complex systems, but - that they are themselves complex systems that may fail in complex ways. - CM systems are inherently trusted: in order to be able to make changes on a host, they need to have the privileges to do so, meaning - that of course CM systems have the potential to really ruin your day, to break things to the point where they can't recover by themselves and require manual intervention. For this reason, it's important to have - a carefully planned, staged rollout for any changes you are pushing out, proceeding slowly through incremental sets of systems, - with appropriate monitoring and checking and perhaps an automated approach to rollback if failure is detected; - this is one of the self-healing properties of many such systems: they analyze the success rate of the changes they are rolling out and automatically roll back if convergence cannot be accomplished, for example. And of course, you need to keep in mind that - whoever has access to the configuration management system has effectively full root access to all systems managed by it, so appropriate audits and checks are mandatory here. --- But what functionality does a CM system really require if we are looking to assert state? Now obviously, and as we've seen in our examples, - the system needs to be able to install software, so it probably needs to integrate with the package manager in use; - it needs to be able to ensure that a specific service is started, restarted, and kept running as needed; this requires an integration with the init, rc, or systemd components on your hosts; - it obviously needs to be able to apply file permissions and ownerships, as well as - be able to install static files, to add content to existing files, - or to generate host-specific configuration files from properties of the system or its network location, etc. But in addition to these requirements, CM systems also often need to have the ability to - execute specified commands that go beyond permissions, packages, or file management, as well as - to collect data from the host and report its status or this collected data back to a central service. And if you look at this list of functionality, you'll notice that --- your CM system overlaps with a bunch of other important systems and services you utilize. - You oftentimes require a generic remote command execution mechanism, - a data collection agent, - as well as a reporting infrastructure that allows you - to perform actions or collect data based on system properties.
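Here's what that last point might look like in its very simplest form -- a hypothetical sketch that generates a host-specific configuration file from system properties and reports on what it did; the file names, the config format, and the log destination are made up purely for illustration:

```sh
#!/bin/sh
# Hypothetical sketch: derive a host-specific config from system properties.
# File names, config format, and the report destination are placeholders.

HOSTNAME=$(hostname)
# Grab the first non-loopback IPv4 address; ifconfig output varies by OS,
# so this parsing is deliberately simplistic.
IPADDR=$(ifconfig | awk '/inet / && $2 != "127.0.0.1" { print $2; exit }')

# Rewrite the file in full on every run: same inputs, same outcome.
cat > /etc/myservice.conf <<EOF
# Generated from system properties -- do not edit by hand.
hostname = ${HOSTNAME}
listen   = ${IPADDR}
EOF

# Report what we did; a real CM agent would send this to a central
# service rather than a local log.
echo "$(date): regenerated /etc/myservice.conf for ${HOSTNAME} (${IPADDR})" >> /var/log/cm-agent.log
```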
All of this functionality directly intersects with - a lot of security-relevant tasks: - identifying the deviation from a known-good state is obviously a critical piece of information in this context; - and you often run regular integrity checks, which really are just one form of running commands and collecting information; you also - regularly have to roll out security updates and vulnerability patches, which really are just one form of applying software updates; and then again, sometimes you need to be able to move - a host into a quarantine or well-defined deviant state. --- So we see that configuration management as a system overlaps with and enables a large number of other systems and areas that are core to the System Administrator's routine: - Software deployment, both the initial installation of the OS either onto bare metal or into a VM, the building of machine images or containers, as well as the regular updates of software and installation of packages as needed for a given service. The overlap here is directly with your deployment engine or your continuous deployment pipeline. - Your monitoring solution, both for general reporting as well as for the frequent but unpredictable data collection needs you might have. For example, you often have a need to run a check across all your systems that, say, collects the contents of a specific file, or checks for the existence of a specific configuration parameter. Your CM system may already have this information, or can be used to collect it, even though its primary purpose is not providing logging and monitoring. - Revision control and auditing changes is another area you intersect with, because once you've abstracted the configuration sufficiently, any and all changes become code changes, and the revision control benefits applied to software engineering also become mandatory and welcome in this context. - Compliance enforcement becomes possible thanks to configuration management: if you have a regulatory requirement to deploy certain changes across all your systems, then the CM system is how you guarantee such a change. - and so on and so on. So you see that configuration management often provides the foundation for many additional infrastructure tasks. --- Let's visualize these co-dependencies and intersections, because hey, Venn diagrams are awesome! We've already mentioned the intersection of configuration management and - service orchestration. But in order to use either one, you first need to - know what systems you have. That may seem trivial and obvious, but an accurate asset inventory database is a core critical infrastructure component that lists -- and keeps up to date! -- your comprehensive inventory. But of course not all systems are the same, so we also need a - central place to define the role a given system may play, such as "http server" or "mail server", but also finer-grained, down to different deployment tracks such as "production", "development", "qa", or "canary", to test out certain changes. From these defined roles, you can then - define which configurations are applied and what systems are put into rotation. Likewise, from here you are - creating new instances, building new hardware, and installing or deploying software, just as - your monitoring may differ based on the role definitions of your assets or instances. The intersections of these circles here then include specialized topics that may be accomplished by one or the other system.
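To make the idea of roles and deployment tracks a bit more tangible, here is a purely hypothetical flat-file inventory and lookup; real asset databases and CM systems each use their own formats, so treat this only as an illustration of the concept:

```sh
# Hypothetical flat-file asset inventory: host, role, deployment track.
# Real inventories are databases or CM-specific files; this is illustrative.
cat > /var/cm/inventory <<'EOF'
www01.example.com   www      production
www02.example.com   www      canary
mx01.example.com    mail     production
dev03.example.com   www      development
EOF

# A CM agent could look up its own role and track to decide which
# configuration modules to apply and whether it belongs in rotation:
awk -v h="$(hostname)" '$1 == h { print "role:", $2, "track:", $3 }' /var/cm/inventory
```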
For example, among those intersection topics, - the initial software installation and network configuration is likely to be applied by your deployment engine, but then asserted as a continued state via your configuration management system, while - the CM system might collect specific metrics that are then exported to or via the monitoring agents, to give just one more example here. Now all of this is part of the topics related to configuration management, but perhaps you've noticed that we've kind of neglected one major trend in the industry: - What about containers? Do we really use configuration management in the same manner when we use static, small, well-defined containers? And the answer here is yes. And no. Or maybe "kind of". It depends. Because, you see --- containers are different, on the one hand, but on the other, they're really not: they're just the evolution of the same concepts. Just as CM systems assert the state of our systems, so, in a way, are containers a state assertion. Just like before, we need - an inventory of resources, of instances, of destinations, and even though we perhaps no longer manage individual hosts, but instead use - one more layer of indirection - to manage our containers, - using specialized software and - complex systems upon other complex systems, we _are_ back to asserting state. This is because our containers really ought to be - immutable. That is, they should not see runtime changes, thereby guaranteeing the correct state. When we determine a state deviation, instead of trying to update the configuration on the running system, we instead - integrate our code changes and - deploy an updated container image. Although it's worth noting that even though Docker and Kubernetes and all those things are quite nice, much of the heavy lifting even in the brave new container world is still done - by - all the same - tools as before. But yes, at a mature organization the configuration of runtime systems has moved away from individual, mutable systems to - generic, well-defined, immutable components that can be structured and deployed in an automated fashion: i.e., Infrastructure as Code. --- But as much as we've talked about configuration management in the context of servers, it's also important to keep in mind that we have the same need for state assertion and all the other benefits of CM on - desktops and - mobile clients, where we oftentimes use enterprise tools to manage some of the same aspects, but of course in a different context; as well as on - network equipment, like routers and switches, or - on storage devices - or load balancers or caching systems. That is, everything we learned here can and should be applied there as well, although admittedly the industry is a fair bit behind in this area and centralized configuration management for these devices is not always as easy as it should be. --- Okay, I think we're coming to the end of this topic. I know there was quite a bit we covered here, so let's make sure we recap the most important aspects for you to retain and perhaps research on your own in more detail: First, - let's keep in mind that we want to move away from individual systems towards clear definitions of the services we deploy. That is, - we no longer try to nurse and nurture fragile, individual systems whose exact configuration only a handful of people may remember, and instead want to move to completely replaceable and exchangeable systems that can be recreated and instantiated on a moment's notice.
To do that, we - focus not on making specific changes, but rather on asserting state, defining outcomes rather than how we configure a system. Given the distributed nature of configuration management systems, - we need to remain aware of their limitations and build protections against the different ways distributed systems may fail. We build state definitions such that - they can be reached by applying idempotent change sets that can be run reliably and that yield final convergence on a known state. All of this includes significant - overlap with other components, services, and systems used in system administration, and you probably want to research the topics shown here to take your understanding to the next level, as - we as an industry are moving away from specialized services towards a more descriptive approach that can be automated, with integrated tests, code review, and continuous integration and deployment like any other large-scale product: Infrastructure as Code. As you can tell, this larger topic has become one of the most important fields in large-scale system administration and has opened up room for a lot of interesting research. Make sure to check out the links I've included at the end of the slides for this video for additional resources. In our next videos, we'll try to cover a number of important aspects relevant to system security. Good thing that's an easy topic and can trivially be covered in just a few minutes' time. What's that? It can't? Well, we'll still try to do our best. Until then, thanks for watching - cheers!