Hello, and welcome back to CS615 System Administration! We are now in week 7, past the midpoint of the semester -- congratulations! In our last few videos we've looked closely at the details of networking with a focus on the TCP/IP stack. In the process we've seen how we make connections between different systems, how the different autonomous systems connect to one another, and how we can trick the routers into sending us a packet back to reveal the path our messages are taking across the internet.

But in almost all of these examples, we've started out with a host _name_, not an IP address directly. That is, we began all our communications with a lookup of the hostname to translate that into an IP address, which, as you all know, happens via the use of the DNS. By the way, it's "the DNS", not "the DNS system", since that'd be saying "the Domain Name System system", so kinda like an "ATM machine". But anyway, the DNS is a critical part of the internet's infrastructure and one of the fundamental components a System Administrator needs to understand, if only to come to the realization that at the end of a long day of troubleshooting, when nothing makes sense, it's probably a DNS problem. Either that, or somebody monkeyed around with /etc/hosts, which never, not even once, has failed to come back to bite you, so don't do that.

But we're getting ahead of ourselves. To better understand the DNS, let's rewind and go back to the early days of the internet... If you have only two hosts that you're connecting, then all you really need is a network interface card in each, hook them up to a cable, and there you go. You know how to reach the other system, without any need for looking up addresses or anything. But of course as soon as you add just a few other systems, you have a problem: how do you know how to reach each system? Well, we assign them IP addresses, but now you have a new problem!
How do you remember that that system up here in the upper left that you want to talk to has the IP address 198.51.100.195, that the other one, host B, has address 192.0.2.80, and so on? People are really bad at remembering numbers, and as the internet grew, this became an unscalable solution. What you see here is a map of the early days of the ARPANET, as the initial sites were connected and the network grew. During those days, how did we keep track of which hosts were located where? How does the host at the Stanford Research Institute know what the address is of the system at MIT it wants to talk to?

Well, let's recall how _our_ system looked up this information. From our earlier video, you'll remember that our tools consulted /etc/nsswitch.conf on how to resolve hostnames to IP addresses, and by default we start out by looking in... /etc/hosts, which might contain a set of IP address to hostname mappings. But hold on a second -- doesn't this pose a chicken-and-egg problem? How do we know what information to put into that file to begin with? Let's look at the manual page once more: On this system, the manual page actually still contains this paragraph here, where it notes that this file may actually originate from the central Network Information Center, or NIC! Funny aside: this comment was removed from the NetBSD manual page literally just last week, as rather serendipitously the topic of the hosts database came up just as I was preparing this lecture video. Anyway, so this man page tells us that the /etc/hosts file used to come from the NIC! Meaning... yes, we did indeed have one central /etc/hosts file that contained all the IP addresses of all the hosts on the internet, and we copied that file around and updated it whenever a new host was connected to the internet. One of the earliest documents describing this host database was RFC 597, which looked like this: In December 1973, this document provided the "latest network maps".
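To make the /etc/hosts format concrete, here's a minimal sketch reusing the two example addresses from above. The hostnames and the file path are made up for illustration, and the awk one-liner just mimics what the "files" backend does: scan the file top to bottom and return the address from the first line whose name or alias matches.

```shell
# A minimal sketch of the classic /etc/hosts format: an IP address,
# the canonical name, then optional aliases (hostnames are invented).
cat > /tmp/hosts.example <<'EOF'
198.51.100.195 hostA.example.com hostA
192.0.2.80     hostB.example.com hostB
EOF

# Resolve "hostB" the way the files backend would: first match wins.
awk -v name=hostB '{ for (i = 2; i <= NF; i++) if ($i == name) { print $1; exit } }' /tmp/hosts.example
```

Running this prints 192.0.2.80, the address of host B from the slide.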
Scrolling down, we'll note that the systems' addresses were listed in octal and decimal -- not yet as IP addresses -- and included information about their location as well as what type of system they were. You'll find a whole bunch of PDPs and old IBM systems here. After some time, as the networks grew, it became clear that the format used here was no longer suitable, and a new RFC, RFC 810, was published in 1982 by Elizabeth Feinler -- who we'll talk about a bit more in a second -- and others. This RFC defined the DoD host table, which included information not only about the hosts or systems, but also about the networks and gateways, and the operating systems of the remote systems. This document included information about where you could get this host table from -- namely the Stanford Research Institute's Network Information Center, SRI-NIC, via anonymous FTP. The format now looked like this, with entries named NET, GATEWAY, or HOST. You can still find some of the historical host tables -- I've included a link at the end of the slides for this video segment. Here's a copy from 1985, to show you what the internet looked like back then. Note all the detailed information about the systems included here in this file. Here we see a whole bunch of network definitions... then some gateway systems indicating which hosts bridge which networks... and down here there are our HOST entries, giving you the IP address, the operating system, and the services the host offers. Let's see just how many hosts there were on the internet back in 1985... looks like the internet back then consisted of 1325 hosts. And so back at that time, this was how we resolved names, more or less: all addresses were manually assigned. The management of this information was handled by the Network Information Center at the Stanford Research Institute under the direction of Elizabeth Jocelyn Feinler, whose name we just saw on the earlier RFCs.
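As a rough sketch of how one might count those HOST entries, here's a tiny fabricated excerpt in the RFC 810 style -- the field layout is approximated from the format described above, and the specific entries are invented for illustration:

```shell
# The RFC 810 host table used one record per line, tagged NET,
# GATEWAY, or HOST (this is a small made-up excerpt, not real data).
cat > /tmp/hosts.txt <<'EOF'
NET : 10.0.0.0 : ARPANET :
GATEWAY : 10.0.0.77, 18.8.0.4 : MIT-GW : : MOS : IP/GW :
HOST : 26.0.0.73 : SRI-NIC : DEC-2060 : TOPS20 : TCP/TELNET,TCP/FTP :
HOST : 18.10.0.8 : MIT-XX : DEC-2060 : TOPS20 : TCP/TELNET :
EOF

# Count the hosts, the same way one might have counted the
# 1325 hosts of the 1985 internet in the real table.
grep -c '^HOST' /tmp/hosts.txt
```

Here that prints 2; against the real 1985 table it would print the full host count.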
Feinler, together with Jon Postel, whom we encountered in earlier videos, basically ran the internet back then: if you wanted to connect a new host, you had to call SRI-NIC and ask them to assign you an address; they would then update the hosts database and publish it, with all other systems then having to fetch the new copy. During that time, Elizabeth Feinler came up with the concept of using "domains" to group names based on their function, such as, for example, using "edu" for educational institutions. But of course, copying around a text file was obviously not a scalable solution, so under the direction of Jon Postel, Paul Mockapetris developed a proposal for a "Domain Name System" in RFCs 882 and 883 back in 1983, and four grad students at UC Berkeley were the first to implement this proposal as a Unix name server, named the "Berkeley Internet Name Domain", or BIND, in 1984. BIND is to this day the most widely used DNS server on the internet, and is currently maintained by the Internet Systems Consortium, or ISC.

This Domain Name System is based on the concept of a domain _name space_ in a tree-like hierarchical structure comprising so-called "domain names". This tree is rooted in a node simply known as "dot" and is subdivided into individual _zones_. Each zone may itself consist of one or more domains, which themselves may contain subdomains, and so on. The domains directly under the root are referred to as "top-level domains", or TLDs, which are then further divided into second-level domains, which in turn can be divided into third-level domains, and so on. But who controls how a given zone may be divided? That decision power is called having "authority" over a zone, and this authority is not central, but may be delegated within each zone. Now since the entire DNS tree is rooted in "dot", the root zone, this zone must delegate authority over the top-level zones, which then delegate authority over each of the individual second-level domains to the rightful operators.
Now note that a zone may further delegate authority any way it sees fit; so, for example, Stevens, which has authority over the "stevens" zone under the "edu" top-level domain, may decide to delegate authority over the "cs" subdomain to a different entity. Next, within this tree, every node must have a label -- a name -- and each node may have additional information associated with it. There are a number of different pieces of information that you can associate with a given node, but it's not completely free form. Instead, we have defined a number of so-called Resource Record types that describe the information. The most common Resource Records are of course the IP address associations via the "A" and "AAAA" records, but I'm sure you all are also familiar with the CNAME, or "canonical name", resource record, which kind of works like a symbolic link in the file system in that it merely points to another node in the DNS tree. But there are other types. One resource record we'll see again in a future video is the MX record, defining the mail servers responsible for a domain. But let's stick with the DNS itself, where we have NS records, defining the name servers responsible for a given zone, or, say, the records associated with DNSSEC, an extension to the DNS that adds cryptographic authentication.

Now when we construct a domain name, we simply walk up the tree, concatenating the labels of the nodes with a dot. So, for example, we get www dot cs dot stevens dot edu. Now in order for this name to become a _fully qualified domain name_, or FQDN, the last label -- the dot for the root -- needs to be appended. This trailing dot signals to the resolving libraries that no further walking up of the tree is to be attempted. Of course most people leave out this dot, and a shameful number of systems have been coded up that break or behave in unexpected ways when you actually feed them a trailing dot. Or you may be able to evade a paywall simply by using the FQDN of the domain.
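To tie these record types together, here's a hypothetical zone file fragment for the cs.stevens.edu example; all addresses and TTLs are made up for illustration (using documentation address ranges), and the names ending in the trailing root dot are fully qualified:

```
; Hypothetical zone fragment (invented addresses, for illustration only)
www.cs.stevens.edu.   3600  IN  A      192.0.2.10
www.cs.stevens.edu.   3600  IN  AAAA   2001:db8::10
web.cs.stevens.edu.   3600  IN  CNAME  www.cs.stevens.edu.
cs.stevens.edu.       3600  IN  MX     10 mail.cs.stevens.edu.
cs.stevens.edu.       3600  IN  NS     ns1.cs.stevens.edu.
```

Note how the CNAME for "web" merely points at the "www" node, symlink-style, while the MX and NS records hang off the zone's own node rather than off any particular host.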
This, by the way, is one of the more benign things that can happen when software developers do not understand the DNS... Anyway, so let's take a quick look at the top-level domains we have. Originally, RFC 920 defined the following: com, for commercial use -- yahoo.com is the example we used, as Yahoo is a commercial entity; edu, for educational use -- stevens.edu is the example we used, for obvious reasons; gov, for government use; mil, for military use; and org, for any other organization, nowadays often used for non-profits or otherwise to distinguish a domain from a commercial counterpart. Then there's arpa, which was supposed to be "temporarily" for ARPANET administrative purposes, but as you all already know from our discussion of the "temporary" IPv4 space, "temporary" rarely is.

But as you all know, we have several other TLDs nowadays: we have country-code TLDs, such as "de" for Germany, "fr" for France, "ar" for Argentina, and so on. This nowadays also includes several internationalized country-code TLDs, such as those shown here for Egypt, Hong Kong, and the EU using Cyrillic letters. Then we have a number of so-called "sponsored" generic TLDs, each backed by a narrow community, but not, for example, a single commercial entity. Here we show the "cat" TLD, representing the Catalan linguistic and cultural community; "jobs", for human resource managers; or "xxx", as a voluntary option for pornographic websites, although of course those generally operate out of .com. After all that, we also got a whole lot of "new" generic TLDs, when ICANN announced that anybody could propose and become a sponsor for a generic top-level domain, which led to a kind of explosion of words becoming TLDs. How many TLDs does that make in total? Let's take a look: we can fetch the root zone file from InterNIC.
That file looks like so: we see various records, including the resource records for the root name servers, and as we scroll through this file, we find the nameservers for all the different TLDs, since this is where this information necessarily must be kept. So let's extract all the NS record entries... ...there. Now let's count how many unique TLDs we find... ...and the answer is: as of March 16th, 2021, there are 1504 TLDs.

Now... how do we manage all these domains? Remember, we're operating up here on layer 9, and control of the DNS seems to have some pretty obvious implications. For starters, and this really isn't very surprising if you've paid attention to our last few videos, we find that IANA manages the root zone as well as the infrastructure-critical zones arpa and int. All other domains -- that is, the management of all the TLDs -- are delegated to so-called "domain name registries". For the gTLDs, there are specific gTLD registries, and each country manages its own country-code TLD via its own registry. The domain name registries may then outsource the registration of names to domain name registrars, which they accredit to ensure compliance with their rules and requirements. That is, a registry may either be a registrar themselves, or delegate that function, while the registries control the policies of the allocation, such as placing restrictions on the use of domains within their TLD. For example, the "cat" domain we just mentioned a minute ago is not actually for cat pictures on the internet, but rather for the promotion of the Catalan language and community, so you can't randomly register a domain under ".cat" unless your page is in Catalan. Now one thing to note here is that the registries do have control over the entire name space within their domain, and they have the power to make your website disappear if they don't like it.
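The extraction and count we just walked through might be sketched like this. Rather than fetching the real root zone over the network, this uses a small fabricated excerpt in the same format, so the pipeline itself is the focus; the real file's location is noted in the comment:

```shell
# In practice you'd first fetch the real root zone, e.g.:
#   curl -sO https://www.internic.net/domain/root.zone
# Here we use an invented five-line sample in the same format.
cat > /tmp/root.zone.sample <<'EOF'
. 518400 IN NS a.root-servers.net.
com. 172800 IN NS a.gtld-servers.net.
com. 172800 IN NS b.gtld-servers.net.
edu. 172800 IN NS a.edu-servers.net.
cat. 172800 IN NS ns1.nic.cat.
EOF

# Keep only NS records, skip the root zone's own entry, strip the
# trailing dot, and count the unique TLD labels.
awk '$4 == "NS" && $1 != "." { sub(/\.$/, "", $1); print $1 }' /tmp/root.zone.sample | sort -u | wc -l
```

Against this sample the count is 3; run against the real root zone, the same pipeline yields the current TLD count (1504 in the March 2021 snapshot shown in the video).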
For example, the "ly" TLD is a popular choice for several websites, but this domain is actually the country-code TLD for Libya, which is not known to be the most progressive country in the world. A few years ago, there was a case where the adult blogger Violet Blue had registered the domain "vb.ly", but the Libyan government deemed the content to not follow Sharia law and thus took it offline. So when you choose your domain name, you probably want to make sure that you don't become subject to the rules of a country or organization that doesn't align with your principles. This is particularly easy to forget when you want to grab a cool-sounding name ending in another country's TLD, such as "ly" or "io" (which is the ccTLD for the British Indian Ocean Territory), for example.

This is also a good thing to keep in mind: since the DNS name space is a tree, if you control any one branch, you control all the branches, subtrees, and nodes below that particular branch. Meaning, if you compromise the registry for .com, or the registrar for example.com, then you control all the other nodes underneath. For this reason, it's quite critical to ensure that your NS records are not compromised -- if I can trick you into pointing these to _my_ nameserver, then I have control over your entire zone.

Alright, I think at this point we can take a break. Having seen the history and logical structure of the DNS, we'll next dive back down into the network packets and begin tracing DNS requests to better understand how we are traversing this tree. Until then, thanks for watching, and make sure to get your tcpdump ready for use! Cheers!