Hello, and welcome back to CS615 System Administration! This is week 4, segment 4, and after singing the high praises of package managers in our last video and illustrating their various useful features, it's now time to shine a light on some of the problems we regularly encounter when dealing with package management. Now, to be fair, many of the problems noted here are not so much inherent to package managers as to the larger, conceptual issue of installing software, but it is worth keeping in mind just how fragile the systems we routinely build really are. --- Remember how we tried to classify software by type and group it into categories like "the operating system" and "add-on software"? We quickly realized that it's not straightforward, but ok, let's assume that we're using an OS that uses a package manager for all components, the OS and add-ons alike. This should let us easily manage all the software we need in a consistent manner. But unfortunately that doesn't always work out: if you're writing a lot of Python code, for example, you'll probably know that one of its strengths is that there are many open source modules --- and all sorts of cool packages that let you simply "import antigravity" without having to worry about implementing this yourself. But all of these modules cannot possibly be shipped with the OS, and as the language environment oftentimes moves significantly faster than, for example, an operating system release cycle could keep up with, you generally won't find your antigravity RPM and now have to figure out how to install such modules. Most programming languages encounter this problem sooner or later and end up building a native language package management system of sorts, complete with remote repositories and all that. So now, instead of looking for an RPM for your Python antigravity module, you might use pip, which will go and fetch your module and install it in the right paths and... --- wait, what's that? You don't _have_ pip? 
Oh, no problem, you install it using easy_install, which is a Python package manager that you can use to automatically fetch modules and install them in the right path and... Well, you see where this is going. Silly Python. You should be using nodejs anyway, right? --- Right. But the point here is not to make fun of different languages, but to point out that there is an apparent need to rapidly and easily install packages for a given language, and that these solutions do not integrate with your OS-provided package manager. --- And it really is just about every language that has this concept. For example, long -- loooong -- before NodeJS - Perl had CPAN and PHP had - pear and pecl (which had a repository compromise in 2019), - Ruby has "gems", as well as a public repository that has frequently seen individual gems compromised, - Go fetches code more or less directly from github, - and Rust uses the 'cargo' native language package manager to build packages and upload them to the public community crate registry, crates.io, using the country-code top-level domain for the British Indian Ocean Territory, because that's what you use these days when you want to show that you're hip. We'll talk more about domain names and top-level domains in a future lecture. --- But ok, so keeping in mind that we have these different language-specific package managers, let's talk a bit about dependencies, integrity, and trust. We've seen the OS package managers - and we've seen the various - native language package managers. Seems like we've licked the problem, no? Each one can install software, so - what's the issue? Well, let's have a look... --- Here's a system that has a bunch of Python-related RPMs installed. [pause] So far, so good. We know we can identify which modules belong to which package, and can express dependencies amongst them, such as, for example, that "python-urllib3" requires the "python-ipaddress" and "python-six" packages and is itself at version 1.10.2. 
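To make that dependency idea concrete, here's a minimal Python sketch of how a package manager expands transitive dependencies from per-package "Requires" metadata. This is not rpm's actual resolver; the data structure is made up for illustration, using the package names from the example above.

```python
# Toy model: each package maps to the packages it requires.
# The names mirror the RPM example above; the table itself is invented.
REQUIRES = {
    "python-urllib3": ["python-ipaddress", "python-six"],
    "python-ipaddress": [],
    "python-six": [],
}

def resolve(package, seen=None):
    """Return the package plus all of its transitive dependencies."""
    if seen is None:
        seen = set()
    if package in seen:
        return seen                      # already visited; also avoids cycles
    seen.add(package)
    for dep in REQUIRES.get(package, []):
        resolve(dep, seen)
    return seen
```

Calling resolve("python-urllib3") yields all three packages, which is exactly the closure the package manager has to install before the tool you asked for will work.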
[continue] But now this host _also_ has a bunch of Python modules installed via 'pip'. [pause] This, too, is convenient, as 'pip' shows us which versions of the modules are installed and also sorts out the dependencies between them. But now notice [continue] that we have urllib3 installed twice -- once via an RPM, and once via pip. And they're different versions, too. So now things are getting interesting. [pause] Imagine that you have a tool written in Python that needs urllib3 -- which version will it pull in? And what if there's a vulnerability in urllib3 version 1.9 that was fixed in version 1.15 -- is this system affected? Which Python tools using urllib3 on this host need attention? Does your host inventory correctly identify both versions? And how did we even end up with two versions of the same module installed? [continue] Our Python version here is 3.6, but we don't even _have_ a python-3 RPM installed! The RPM version is 2.7. So now where do we find the Python that is not listed in the RPM database and also not identified via 'pip'? Ooh... here it is, under /opt/python. Which package provides that file? Hmm, no package. Nice mess we've got here, huh? And this is not even a very rare setup -- you will find similar configurations on most production systems. And this doesn't even get us to the question of which version of which package is running in every single container you have. So we see that dependencies cannot be expressed, relied on, or tracked once you break out of the package manager. By its very nature, each package manager assumes that it is in charge of all of the packages and doesn't -- cannot -- know about similar software added in another way. --- But ok, so dependencies can become a problem. What about integrity? How do we know that the software we're installing doesn't contain a backdoor? 
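Before we get to integrity, one note on the "which urllib3 will it pull in?" question from above: the answer comes down to path ordering, because Python imports whichever copy it finds first on sys.path. Here's a toy sketch of that first-match rule; the directory names are hypothetical, and the dict stands in for the filesystem.

```python
def first_match(module, search_path, installed):
    """Return the directory whose copy of `module` would be imported.

    `installed` maps directory -> set of modules present there,
    standing in for the actual filesystem."""
    for directory in search_path:
        if module in installed.get(directory, set()):
            return directory
    raise ImportError(module)

# Hypothetical layout: one copy installed by pip, one by an RPM.
installed = {
    "/opt/python/lib/site-packages": {"urllib3"},    # pip-installed copy
    "/usr/lib/python2.7/site-packages": {"urllib3"}, # RPM-installed copy
}

# If pip's install location precedes the system location on the path,
# the pip copy shadows the RPM copy -- and vice versa.
path = ["/opt/python/lib/site-packages", "/usr/lib/python2.7/site-packages"]
```

So the version a given tool gets depends on how that tool's interpreter was started and what its path looks like, which is exactly why answering "is this host affected?" is so hard.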
You all have probably seen install directions like - this, and oftentimes people criticize this approach because obviously pulling random code off the internet and running it as root is not a very good idea. But consider the alternative: - The traditional method of installing software looked like this. You'd download a tarball of source, extract it, build it, and then install it. Sure, you _could_ inspect the source code and the configure script and the Makefile to verify it doesn't do anything nefarious, but honestly nobody does that, and it simply would not be a scalable approach. So this really isn't that much different. Similarly, the various native language package managers - do something similar in the background: they download some files and execute some scripts. - Likewise for your desktop-targeted package management solution du jour. Now one thing that you _can_ do to make things a bit better is to ensure that you use - https instead of http, so that you get at least some authenticity assurance. We'll get back to this discussion in a future lecture, though. But let's think about what we are really concerned about here: with the given examples, we fetch some software from the internet and install it, without having any assurance that the software is what we think it is. How can we improve on that? --- Now your package manager may include support for _signed packages_. That is, the package contains some metadata that includes a digital signature, so that we can verify that the data we downloaded is indeed what was uploaded to the repository by an entity we trust. For that, we again need to deal with asymmetric cryptographic keys, frequently using PGP. So here, this RPM-based system includes a handful of PGP keys from different entities from which we may want to install software. This key over here, for example, is the Fedora EPEL repository key, which the Fedora project uses to sign the "Extra Packages for Enterprise Linux". 
The 'rpm' command can be used to verify the information embedded in the package, which - includes the checksum of the data as well as a signature, created using this key here. - When we install this package, the signature is automatically verified, meaning we get assurance that what we are installing is what the Fedora project uploaded. - That's a strong assurance: it means we at least know that the repository itself was not compromised and that nobody in the middle was able to manipulate the data in transit. So this topic of trust, then, quickly becomes interesting, by which I mean: complicated. We trust the Fedora project, and thus anybody able to sign packages using that key - which may be an automated release system. But our trust also extends in a few other directions, which are worth taking a quick look at: --- I don't know if you remember the "left-pad" incident from a few years ago. In that case, there was a widely used and surprisingly tiny nodejs module called "left-pad" that could be used to, well, as the name suggests, pad text blocks on the left with the appropriate amount of whitespace. The entire module is shown here -- 12 lines of code. It turns out this was a surprisingly popular module -- many other libraries included it. But for various reasons the author one day decided to remove the module, to "unpublish" it. Now as the author, that's entirely his prerogative, but of course that means that every single package that -- either directly or indirectly -- depends on this module is now broken! And so it turned out that apparently a third of the internet _did_ depend -- eventually -- on this tiny little nodejs module, and various large internet companies encountered serious problems when their internal products could no longer be built! 
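For reference, here's a rough Python translation of what that module did. The original was a dozen lines of JavaScript; this is just a sketch of the same behavior, not the actual code.

```python
def left_pad(value, length, ch=" "):
    """Pad `value` on the left with `ch` until it is `length` characters wide."""
    s = str(value)
    ch = str(ch)            # the original also accepted e.g. 0 as the pad character
    while len(s) < length:
        s = ch + s
    return s

left_pad("foo", 5)          # "  foo"
left_pad(17, 5, 0)          # "00017"
```

That's it. That's the module a third of the internet turned out to depend on.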
Now aside from wreaking havoc on the internet, this provided an excellent illustration of the fact that if you depend on another module, you literally _depend_ on it, meaning you break if _it_ breaks or goes away. The npm team has since put in place some restrictions surrounding what happens when a module gets unpublished, but by and large this is a problem that's inherent in package management and requires you to understand the impact of your dependencies. --- But with the large number of public repositories used by the different native language package managers -- repositories to which just about anybody can upload anything, mind you -- there are other concerns. Earlier this month, this blog post went around, and it described an interesting attack method called "dependency confusion", in which the author was able to compromise a number of large companies by way of the logic employed by these package managers. The way this worked was really quite simple: --- In step one, you find out what types of modules your target organization might be using. A simple way of doing that is looking at the website sources or sifting through public github repositories, which oftentimes include build information, such as which packages a given module depends on. Turns out - there are lots of results if you know what to search for... --- Now many companies have internal repositories that they use for their internal-only, proprietary modules. But they _also_ pull in code from public repositories. So once you know the name of a package that your target uses, but that hasn't been published to a _public_ repository, you create your own, malicious module and publish it to a _public_ repository. When you do that, you use a large version number, so that your module appears to be "newer" than the version found on an internal repository. --- Step 3 is the easiest... and perhaps the hardest. You wait. You wait, hoping that at some point somebody will find and use your package. 
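Why the large version number matters can be sketched in a few lines of Python. This is a toy model of the resolver logic, not npm's actual implementation, and the package name and registry contents are made up for illustration.

```python
def parse(version):
    """'1.10.2' -> (1, 10, 2), so versions compare numerically.

    Note that plain string comparison would get this wrong:
    "1.9" > "1.15" as strings, but (1, 9) < (1, 15) as tuples."""
    return tuple(int(part) for part in version.split("."))

def pick(name, registries):
    """Return (version, registry) of the highest version found anywhere."""
    candidates = [
        (parse(reg[name]), reg[name], source)
        for source, reg in registries.items()
        if name in reg
    ]
    _, version, source = max(candidates)
    return version, source

# Hypothetical scenario: the real module lives only on the internal
# registry, and the attacker published the same name publicly with
# an absurdly large version number.
registries = {
    "internal": {"acme-billing-utils": "1.2.3"},   # the real, private module
    "public":   {"acme-billing-utils": "99.0.0"},  # the attacker's upload
}
```

Under a "highest version anywhere wins" rule, pick() hands back the attacker's 99.0.0 from the public registry, which is the whole trick.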
--- And whattayknow - at some point, somebody is going to run 'npm install'. Now 'npm install', as noted here, by default always looks for the "latest" version of a package; that is, if it finds two versions, the one with the larger version number is picked. Which, of course, is exactly why you picked a large version for your malicious payload package. Furthermore, by default the command is configured to look at the public NPM repository in addition to any internal repositories. So if you know the name of a package, create your copy, and give it a large version number, at some point something is going to - pull it in. Congratulations, you now own the host in question. Just like before, there _are_ ways to address this -- not allowing pulling code from public repositories, for example -- but it's not always easy. The main lesson remains that you _do_ trust upstream repositories, whether you like it or not, and verifying the integrity of a repository or a package is something that requires more care and attention than most organizations pay it. --- So let's review some of the pitfalls we discussed in this video: First of all, - it's important to remember that you do indeed _depend_ on your dependencies. That is, if they go away, you break; if they introduce an incompatible change, you break; and if you don't control them, then you are relying on others, possibly outside of your organization. Now to ensure that you don't pull in untrusted code from the internet, or that you don't break if some random person on the internet gets into a beef with the repository maintainers and takes their modules offline, you might decide to mirror an upstream repository internally, - but that really doesn't solve the problem at all. If you are _mirroring_ a repository, you are mirroring all its changes. Mirroring can help in some of these circumstances, but it does not address all of them, and not completely. As so often, there's no silver bullet. 
Next, we've seen the use of signed packages and RPMs containing checksums, which is all nice and well, but - ultimately only solves one part of the problem, unless you have established a trust relationship with the provider. I think we mentioned this concept in an earlier video. Finally, and this reflects our first point, - you need to remember that trust chains all the way down to the weakest link. The integrity of your software build is gated by that of your dependencies. Now we'll get back into the discussions around trust, integrity, and what to do about the various threats in a more abstract context later in the semester, but --- for now, perhaps try to think a bit about the incidents alluded to in this video and research what similar issues may have occurred before. Put some thought into how you might want to protect your environment against these problems. - Take a look at the various OS- and language-specific package managers and find out where they pull their packages from. What transport mechanisms do they use, and how do they assert integrity and authenticity? How do _you_ know that you can trust them? - And think about the problem we illustrated earlier: installing language-specific modules in an environment that uses an OS-native package manager. How do you record or resolve dependencies? As I illustrated, this is far from a solved problem, so there are no right answers to these questions, but they are all well worth your time to research. And with that, we're coming to the end of our discussion of package managers for the time being. In our next videos, we'll move on to discuss - multiuser fundamentals -- what it means for a system to support multiple users and how that affects design decisions system administrators have to make -- as well as a first discussion of certain authentication basics. Until the next time - thanks for watching! Cheers!