Hello, and welcome back to CS615 "System Administration". This is our third video for week 01, and after we discussed the job of a System Administrator in the last video segment, I wanted to introduce you to a few core principles that will be important for us throughout the semester. More than just maintaining computers and infrastructure components, System Administrators control entire systems. With more experience and seniority, they are ultimately in charge of building these systems, of designing the infrastructure and its components, so that the job title frequently morphs into that of a Systems or Network "Architect".

---

So when everything is awesome, System Administrators step into the role of planning, designing, and overseeing the construction of complex systems, suitable both for the immediate requirements of today and for the anticipated needs of the organization down the road. We become superbuilders. But in order to accomplish this, we need to build our infrastructure on two basic principles:

- Scalability and
- Security

Neither of these can be added to a system after it has been built: trying to apply "security" after the system interfaces have been defined yields restrictions and limitations; trying to make a system with inherent limitations perform under circumstances it was not designed for yields hacks and workarounds, and the end result frequently resembles a fragile house of cards more than a solid, reliable structure. But there's a third core principle that is necessary, and it is what actually enables the other two:

- Simplicity

Simplicity is simultaneously obvious and counter-intuitive. Simplicity underlies both scalability and security, since reduced complexity implies better defined interfaces, minimized ambiguity in communications or data handling, and increased flexibility. As with the other two core aspects, simplicity cannot be added after the fact; it must be inherent in the architecture. Simplicity is the enabler of both scalability and security. Let us look more closely at each of these three core principles.

Let's begin with Scalability. Scalability is a buzzword often thrown around these days. Asking "But does it scale?" is an easy way to seem smart in meetings, but let's consider what "scalability" really means. Suppose

---

you have a website, and it becomes really popular. Say, it was slashdotted... oh, wait, you guys don't know what Slashdot is. Let's say... it was published on the HackerNews frontpage or a popular subreddit.

- Your poor little web servers were not designed for the sudden increase in traffic.
- In other words, you're getting a system overload.
- So now what?

Whenever you have a resource issue like this, there are only so many options. On the one hand, you can

---

scale vertically. That is, you get a bigger, beefier server to handle the increased load. Scaling vertically is great -- it's usually quite easy to do, as you don't have to change much. You buy the big-ass horse here, and off you go. So you're throwing money at the problem, in a way. Which is a great way of solving many issues - if you have money. Also note, of course, that this only buys you time: at some point, you may have too much of a load for even this monster to carry. So your other option is...

---

...to scale horizontally. When you scale horizontally, you get more of the same and try to distribute the load. Scaling horizontally may seem a bit better here, since it allows you to continue taking on additional load -- you just need to keep on adding horses. I mean, servers. But scaling horizontally isn't quite as easy as scaling vertically: when you add more servers, you then need to distribute the load, so you add, say, a load balancer, which you now have to healthcheck and maintain. That is, you are adding _complexity_. Sorry, no free lunch.
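To make that a bit more concrete, here's a minimal sketch of the kind of healthcheck such a load balancer has to perform, in plain shell. The hostnames (web1, web2, web3) and the /healthcheck endpoint are hypothetical placeholders, not anything specific to this class:

```sh
#!/bin/sh
# Minimal healthcheck sketch: probe each backend and decide whether it
# should continue to receive traffic. Hostnames and endpoint are made up.
for host in web1 web2 web3; do
    if curl -sf -m 2 "http://${host}/healthcheck" >/dev/null; then
        echo "${host}: healthy, keep in rotation"
    else
        echo "${host}: unhealthy, take out of rotation"
    fi
done
```

Every one of those decisions -- how often to probe, how long to wait, how many failures before removal -- is another knob you now own. A real load balancer performs these checks for you, but you still have to configure and maintain them.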
But of course you can also combine the two approaches: get more, beefier hardware and distribute the load across it. And oftentimes that's what people mean when they say "scalability". But there's one aspect that is often overlooked: once you've scaled up -- either horizontally or vertically (or both) -- you are now running with higher costs. That is, for true scalability, you also want to be able to

---

scale down. If your traffic does not remain the same, then you're wasting money. Having a fully pimped-out data center with racks and racks of redundantly load-balanced servers is rather silly if you only max out one measly CPU. As such, our definition of "scalability" will lean towards the overall _flexible_ nature of a scalable system, its ability to adapt to changing requirements at run time. One term you'll frequently hear in this context -- especially when talking about Cloud Computing -- is "elasticity", which perhaps more aptly describes this feature. So whether by vertical or horizontal means, a _scalable_ -- a flexible, an elastic -- system is one that readily adapts to changing demand. We'll try to keep this in mind throughout the semester, as we consider requirements that may spike a hundredfold or evaporate again. But for now, let's move on to...

---

Security. And yes, this, by the way, _is_ a mostly accurate depiction of what counts as "security" in many cases. All too frequently, the software or information technology industry treats system security as an afterthought, as something that can be added to the final product once it has met all the other functional requirements, after the user interface has been determined and after all code has been written. It is then not surprising that oftentimes people view

---

security and usability as being directly and inversely related. This, however, suggests that the only way to reduce risk is to take away or restrict functionality; but any time you do that, the users will work around your restrictions or stop using your product altogether. Instead, security needs to be built into the system from the design phase on. That is, rather than starting out with a solution that provides the desired functionality and then attempting to figure out how to get it into a secure state,

- we should instead begin with a secure, albeit restricted, state and then slowly add functionality without compromising safety until the desired capabilities are available.

That is, we need to view security as an enabling factor present at the design's conception when building our minimum viable product.
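As a sketch of what "starting from a secure, restricted state" can look like in practice, consider a host-based packet filter. The example below uses iptables syntax as one of several Unix packet filters (pf or npf would express the same idea), and the ports opened are illustrative, not a recommendation:

```sh
#!/bin/sh
# Begin in a secure, restricted state: by default, drop everything.
iptables -P INPUT DROP
iptables -P FORWARD DROP

# Then add functionality deliberately, one capability at a time:
iptables -A INPUT -i lo -j ACCEPT                        # local loopback traffic
iptables -A INPUT -m conntrack --ctstate ESTABLISHED,RELATED -j ACCEPT  # replies to our own connections
iptables -A INPUT -p tcp --dport 22 -j ACCEPT            # ssh, for administration
iptables -A INPUT -p tcp --dport 443 -j ACCEPT           # the https service we actually offer
```

Note the direction here: we never start out wide open and then try to bolt restrictions on afterwards; every ACCEPT rule is a conscious addition to a default-deny baseline.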
---

In this class, we'll often discuss security failures or problems in the products or systems we encounter, and quite frequently we will be able to point to the discrepancy in product design objectives that led to such a problem. As such, we'll cover many aspects of "security" that relate directly to the computer-human systems we care about -- at least peripherally throughout almost every week, as well as in a summarizing fashion towards the end of the semester. These will include the broad area of cryptography, with its ability to aid in the areas of confidentiality, integrity, and authenticity; physical security; service availability; service design; social engineering; and, often, generic "trust". All of these should help us approach each area and topic we discuss both from a user's point of view -- asking and understanding what the user actually wants when they use our systems -- and from the system- or product-centric point of view. More often than not, we will find that these two intersect with one another more than with the adversarial perspective.

Ok, now for both of these core properties -- scalability and security -- we have noted that neither can be added later, after the fact.

---

To go back to an earlier example found in our data centers, you can't take a mess like this and then slap on some scalability. Both scalability and security are things that you have to include in your system design right from the beginning. When that is done, the practical application of these principles yields a reduction of interfaces, endpoints, use cases, and overall variance. At this time, it's also worth noting the difference between "complicated", as illustrated here on the left, and

- complex.

A complex system may be well-organized and exhibit a clear, logical structure yet require subtle or intricate connections or components. _Complicated_ systems, on the other hand, are irregular, unpredictable, or difficult to follow. Scalable and secure systems are less complex and much less _complicated_. As system administrators, we like

---

KISS. KISS is... well... remarkable. Not for their ridiculous outfits, or their pyrotechnics, or their platform shoes. In fact, we're not talking about the band at all, but about a core SysAdmin principle abbreviated as KISS:

- Keep
- It
- Simple
- Stupid

That is, we are looking to reduce complexity. Complexity is the enemy. We will look to resist the temptation to build features we do not need and instead build tools that can be combined.

---

In that way, we will look to the Unix philosophy, which includes the critical directive to build simple tools that do one thing and do it well. That is, we will create small components that, like Lego blocks, can be combined. Small little tools that follow the same interfaces and that fit well together. And with such small, such simple building blocks, you can

---

build intricate, large, and - yes - _complex_ systems. The Lego Star Wars Millennium Falcon is one of their largest sets -- it consists of over 7500 pieces, and so certainly is far from _simple_. But I think this illustrates well what we're looking for: small, simple building blocks that can be combined to create complex, but well-structured systems. And one part of System Administration -- well, and of Software Engineering, Programming, and the various other related and overlapping disciplines -- consists of the design and implementation of such complex systems that, because we keep simplicity in mind in the construction and use of the building blocks, will be more fault tolerant, more performant, more flexible. And so simplicity, the third core property of exceptional system design, is really the _enabler_ of both scalability and security. And that's why we like KISS.
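Here's a tiny illustration of those building blocks in action: a classic pipeline answering the question "which clients hit my web server most often?" It assumes a hypothetical log file named access.log whose first field is the client address:

```sh
# Five small tools, each doing one thing, combined into a bigger whole:
awk '{ print $1 }' access.log |   # extract the first field (client address)
    sort |                        # group identical addresses together
    uniq -c |                     # count how often each one occurs
    sort -rn |                    # order numerically, highest count first
    head -n 10                    # keep only the top ten
```

None of these tools knows anything about web logs; they simply read and write lines of text, and that shared, simple interface is exactly what makes them composable.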
Now, this concept of Simplicity can be translated beyond just the design or the implementation of systems to our approach to problem solving.

---

One way to try to solve complex and even complicated problems is the application of "Occam's Razor". Occam's Razor has seen many different formulations, but the most common one is that for a given problem, the simplest explanation is usually the correct one. This is something that will be useful for you to recall when you're trying to troubleshoot complex systems, which tend to fail in complex ways, but often due to simple causes.

---

A similar golden rule to internalize is the second law of thermodynamics, which stipulates that the entropy of a closed system increases over time. Applied to system administration, this tells us that things will move towards greater disorder. The more users you have, the more traffic you have, the more systems you have... all of these will contribute to more software with more bugs connecting to more other systems, and so on and so on. Systems left running for a long time eventually run out of memory or disk space, as they may trigger edge conditions leading to a memory leak; hardware will inevitably and eventually fail. Keeping this in mind allows us to anticipate and prepare for the inevitable.

---

Next up in our series of SysAdmin Laws is Hanlon's Razor. Hanlon's Razor is, in a way, a variation of Occam's Razor. Hanlon's Razor states that you should never attribute to malice that which is adequately explained by stupidity. It's easy to jump to conclusions and fear that a nation-state actor has infiltrated your network and planted a backdoor, which changed the permissions on /dev/null and which you stumbled upon by accident... or, a user may simply have changed those permissions by accident. Which explanation is simpler, and thus, according to Occam's Razor, more likely? Hanlon's Razor deserves being called out here, because especially when you focus on security -- as System Administrators must do as part of their job -- it's easy to see malicious actors everywhere. But it's worth remembering that any time you prevent accidental failure, you _also_ help mitigate the risk of intentional abuse of such a failure mode, so Hanlon's Razor can help you win ground here, too.

---

Next up: Pareto's Principle. This principle states that in most cases roughly 80% of the consequences derive from 20% of the causes. It was originally applied to economics, but it turns out it applies nearly universally. I'm sure you've noticed that when you're writing a program, the general functionality, all the big parts, are easy -- you get about 80% done with the program fairly quickly and then spend the majority of the total time on the last 20%, where you're debugging tricky issues and fine-tuning the functionality. So 80% of your time is spent on 20% of the functionality, and vice versa. This rule of thumb can be surprisingly useful in estimating software development time, in estimating how customers will use available disk space, in predicting what traffic your network filters will catch or prevent, and so on and so on. The 80/20 rule can even be applied recursively: 80% of 80% of the consequences -- that is, 64% -- derive from 20% of 20% of the causes, just 4%. And so you can again estimate what resources you need to spend on the "vital few" and what on the inevitable long tail.
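You can often watch this distribution play out on your own systems. For example, a quick (and admittedly unscientific) way to see where your disk space goes:

```sh
# Summarize disk usage per directory, largest first. More often than
# not, a small handful of directories accounts for most of the space.
du -sk */ | sort -rn | head -n 5
```

Chances are the top few entries dwarf everything below them -- the "vital few" versus the long tail.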
---

Then there's carp. Wait, what? That's not even a carp, that's a sturgeon. Oh, right... Sturgeon's Law. This is an adage named after Theodore Sturgeon, a science fiction writer, who famously quipped that 90% of everything is carp. Uhm, crap. 90% of everything is crap. He said this after he had noted that 90% of science fiction literature is crud, while observing that's only because 90% of everything is crud. That is, most of everything is not outstanding. System Administrators will quickly find out that this is painfully accurate as they go through software, both open source and provided by commercial vendors. It's not actually as nihilistic as it sounds, but a useful reminder to help you set expectations.

---

Similarly, we are all rather familiar with Murphy's Law, right? Anything that can go wrong, will go wrong. Or, more precisely, anything that can happen, will. Like Sturgeon's Law, it's not quite as pessimistic as it sounds, but good to keep in mind. Especially when it comes to software -- and in that field even more so when we're talking about software systems at internet scale -- we can't excuse not preparing for something by saying "Well, what are the odds of that happening?" or "Who would ever enter _that_ into this field here?" If it can happen, it will. Hard drives _will_ fail, a user _will_ enter an invalid number into a field, a network connection _will_ drop, and your doubly redundant power supplies _will_ both fail at the same time - eventually. Our job, then, is to prepare for these eventualities, to anticipate the possible, and to build robust systems that function under those circumstances.

---

Finally, as we're talking about possible things happening -- and then, logically, about impossible things _not_ happening -- we end up in the realm of philosophy to help us troubleshoot our systems: causality. For every effect, there must be a cause. Things don't just happen. Systems don't just break. There's a reason why your software exploded, why the load balancer shifted traffic away from a healthy origin VIP, why your database is currently locked, and why your outbound traffic filter didn't kick in. Sometimes the causes for these effects are hard to find, but they're there. The various rules and laws I just mentioned will hopefully help you cut out the less likely explanations and guide you in finding the true cause. All it takes is perseverance and dedication to finding the cause of the problem.

---

Ok, so those were just some of every SysAdmin's favorite laws. You'll find that we'll reference them throughout the semester, and I bet that years from now, in your jobs, you will run into situations where you will remember them. Of course there are many more similar laws and observations, and we will without a doubt sprinkle some of them in each week. But for today, I think we've had enough.

---

So let's take a look at what happens next.

- One last topic for this week's introduction is UNIX History. As we will be using Unix systems exclusively in this class, and since system administration is in so many ways tightly coupled with Unix systems and their development and evolution, it's important to understand the history of this operating system. For this, you should watch the video segment I recorded last semester for my other class -- CS631 Advanced Programming in the UNIX Environment -- linked here. That video covers all the critical aspects of Unix history and explains some of the core features. Although it is targeted towards programmers, Unix System Administrators will be able to benefit from it as well and translate the lessons. Throughout the semester, you will see us going back to some aspects of this video, so please do not consider it an optional part, but a mandatory video in this series.
- After that, I'll explain the various homework assignments
- as well as how we set up git for use in this class.

But that pretty much concludes week 1 - congratulations!

- In the next week, we'll be talking about Storage Models and Disks, so please make sure to read through the reading material posted on the course website and to submit the course questionnaire prior to our second interactive class. The course website also includes a number of ungraded exercises for each class, and I strongly recommend that you go through those as you prepare for the next lecture videos.
- In addition, here are a few ways for you to review the topics we covered in this week:
- Try to find out a bit about how certain companies manage their infrastructure. You should find that many companies have public blogs where they post a fair bit about the infrastructure or software products they use. This can be quite interesting and give you a good idea of what scalability challenges they may face.
- Try to find out what schools exist that grant a degree in System Administration, and what their curriculum looks like. Correlate that to what we'll be covering in this class.
- Finally, as we're getting ready for some practical work in the Unix environment, maybe take a look at how you currently use the system. What tools do you use most commonly? How do they fare when analyzed for their complexity and interfaces?

Well, I think all that should keep you busy for a while. Remember, it's up to you what you get out of this class, and the more effort you put into completing these exercises, even though they are not graded, the more you will hopefully learn. With that in mind... until next time, and thanks for watching! Cheers!