CS615 -- Aspects of System Administration

Know A Unix Command: tar(1)

NAME
     tar -- tape archiver

SYNOPSIS
     tar [-]{crtux}[-014578befHhjklmOoPpqSvwXZz] [archive] [blocksize]
         [-C directory] [-s replstr] [-T file] [file ...]

DESCRIPTION
     The tar command creates, adds files to, or extracts files from an archive
     file in ``tar'' format.  A tar archive is often stored on a magnetic
     tape, but can be stored equally well on a floppy, CD-ROM, or in a regular
     disk file.
          

History and Command-Line Options Parsing

As the name suggests, tar(1) was originally created to archive files on magnetic tape. According to the manual page, the command first appeared in Version 7 AT&T UNIX. Since then, different implementations have been distributed, including the popular BSD-licensed libarchive and GPL-licensed GNU tar versions.

The older UNIX commands had not yet standardized on using command-line options prefixed with a '-' (or, later, the --long-options). tar(1) (and a number of other commands) interpreted the first argument string as a list of single-letter options, and this behavior was retained for backwards compatibility.

Many a sysadmin's muscle memory has been primed on typing, for example, "tar tvf file.tar", even though the dash-prefixed form ("tar -tvf file.tar") works just as well. For consistency, we will use dash-prefixed flags throughout this document.
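
To see that both spellings are treated alike, here is a minimal sketch that builds a throwaway archive and lists it both ways. The directory and file names (demo, file.txt, demo.tar) are made up for this illustration, and the script assumes a tar(1) that accepts both forms, as the common implementations do.

```shell
#!/bin/sh
# Sketch: old-style bundled options vs. dash-prefixed options.
# All names below are invented for this demonstration.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

mkdir demo
echo "hello" > demo/file.txt
tar -cf demo.tar demo

# Old style: the first argument is read as a string of single-letter options.
old_style=$(tar tf demo.tar)
# Dash-prefixed: the same options in the now conventional form.
new_style=$(tar -tf demo.tar)

if [ "$old_style" = "$new_style" ]; then
    echo "identical listings"
else
    echo "listings differ"
fi
```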

File Format

tar(1) operates on data in a specific archive format, the Uniform Standard Tape ARchive or UStar format, described in the tar(5) manual page. Reflecting its history of being used with tape drives, this format is a series of 512-byte records.

Let's look at a tar(5) file:

$ wget -q http://ftp.gnu.org/gnu/tar/tar-latest.tar.gz
$ gzip -d tar-latest.tar.gz
$ file tar-latest.tar
tar-latest.tar: POSIX tar archive
$ hexdump -c tar-latest.tar | more
0000000   t   a   r   -   1   .   2   8   /  \0  \0  \0  \0  \0  \0  \0
0000010  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000060  \0  \0  \0  \0   0   0   0   0   7   5   5  \0   0   0   0   1
0000070   7   5   0  \0   0   0   0   1   7   5   0  \0   0   0   0   0
0000080   0   0   0   0   0   0   0  \0   1   2   3   6   5   2   6   2
0000090   3   6   6  \0   0   1   2   2   3   3  \0       5  \0  \0  \0
00000a0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000100  \0   u   s   t   a   r  \0   0   0   g   r   a   y  \0  \0  \0
0000110  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000120  \0  \0  \0  \0  \0  \0  \0  \0  \0   g   r   a   y  \0  \0  \0
0000130  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000140  \0  \0  \0  \0  \0  \0  \0  \0  \0   0   0   0   0   0   0   0
0000150  \0   0   0   0   0   0   0   0  \0  \0  \0  \0  \0  \0  \0  \0
0000160  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000200   t   a   r   -   1   .   2   8   /   a   c   i   n   c   l   u
0000210   d   e   .   m   4  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
0000220  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0  \0
*
0000260  \0  \0  \0  \0   0   0   0   0   6   4   4  \0   0   0   0   1
0000270   7   5   0  \0   0   0   0   1   7   5   0  \0   0   0   0   0
--More-- 

Comparing the above output with the file format description from the tar(5) manual page, we can see that this archive contains a directory named tar-1.28, a file named tar-1.28/acinclude.m4, and so on.

(Note: the file(1) command identified the correct file type by looking at the "magic" sequence at offset 257: u s t a r \0 0 0.)
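
The two properties mentioned above (the magic string at offset 257 and the 512-byte record size) can be checked by hand. The following is a rough sketch on a throwaway archive; the file names are invented for the example. Only the first five bytes of the magic are compared, since the bytes following "ustar" may differ between archive format variants.

```shell
#!/bin/sh
# Sketch: inspect the header magic and the record alignment of a tar archive.
# file.txt and demo.tar are hypothetical names used only for this demo.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

echo "hello" > file.txt
tar -cf demo.tar file.txt

# Read 5 bytes starting at offset 257, where the header stores its magic.
magic=$(dd if=demo.tar bs=1 skip=257 count=5 2>/dev/null)
echo "magic: $magic"

# The archive is a series of 512-byte records, so its size
# should be a multiple of 512.
size=$(( $(wc -c < demo.tar) ))
remainder=$(( size % 512 ))
echo "size: $size bytes, remainder after dividing by 512: $remainder"
```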

Common Invocations

tar(1) has the following primary use cases:

  • viewing contents of an archive, using the -t option
  • extracting contents of an archive, using the -x option
  • creating an archive, using the -c option
  • more rarely: appending files to an existing archive, using the -r option (or the -u option to add only files newer than the copy already in the archive)

Hence, tar(1) requires exactly one of these options to be present.

tar(1) also frequently requires a filename to operate on; this may be the name of an actual file, a pathname for a tape device, or, per Unix convention, the string "-" to denote standard input (when reading an archive) or standard output (when writing one). If no file is specified via the -f flag, different implementations may default to standard input, to a default tape device such as /dev/nrst0, or to the value of an environment variable.

Finally, you will frequently want to enable verbose output to see tar(1)'s progress by adding the -v flag.

Viewing Contents

One of the most common use cases for tar(1) is to extract software distributed in such an archive file. Before doing so, however, it may be desirable to view the contents of the archive.

An example of the most common invocation here would then be:

$ tar -tvf tar-latest.tar | more
drwxr-xr-x gray/gray         0 2014-07-27 16:45 tar-1.28/
-rw-r--r-- gray/gray      3126 2014-02-14 17:13 tar-1.28/acinclude.m4
-rw-r--r-- gray/gray    206633 2013-09-24 03:19 tar-1.28/ChangeLog.1
-rw-r--r-- gray/gray     86714 2014-07-27 16:34 tar-1.28/config.h.in
-rw-r--r-- gray/gray      3359 2014-02-03 14:46 tar-1.28/Make.rules
drwxr-xr-x gray/gray         0 2014-07-27 16:45 tar-1.28/doc/
-rw-r--r-- gray/gray       439 2014-02-10 12:42 tar-1.28/doc/value.texi
...
Note how the entries listed here (file names, types, sizes, owner, etc.) reflect the data we saw in the raw hexdump(1) output above.

Extracting Contents

Extracting contents using the same example file would then look like this:

$ tar -xf tar-latest.tar
$ ls -l tar-1.28
total 2299
-rw------- 1 jschauma professor  79584 Sep 29  2013 ABOUT-NLS
-rw------- 1 jschauma professor    601 Sep 24  2013 AUTHORS
-rw------- 1 jschauma professor  35147 Sep 24  2013 COPYING
-rw------- 1 jschauma professor 477038 Jul 27  2014 ChangeLog
-rw------- 1 jschauma professor 206633 Sep 24  2013 ChangeLog.1
-rw------- 1 jschauma professor  15752 Mar 24  2014 INSTALL
-rw------- 1 jschauma professor   3359 Feb  3  2014 Make.rules
-rw------- 1 jschauma professor   1243 Jul  7  2014 Makefile.am
-rw------- 1 jschauma professor  65796 Jul 27  2014 Makefile.in
-rw------- 1 jschauma professor  57810 Jul 27  2014 NEWS
-rw------- 1 jschauma professor   9868 Feb 10  2014 README
-rw------- 1 jschauma professor  20118 Feb 14  2014 THANKS
-rw------- 1 jschauma professor   2168 Feb 10  2014 TODO
-rw------- 1 jschauma professor   3126 Feb 14  2014 acinclude.m4
...
Note that we skipped the -v flag when extracting. Note also that tar(1) did not restore the ownership and permissions recorded in the archive: the extracted files are owned by the current user, and their permissions were restricted by the current umask.

Adding the -p flag allows tar(1) to preserve the permissions when extracting. Since setting/changing file ownership requires superuser privileges, the file owner will still remain the current user. (Different implementations may behave differently or require additional flags to (attempt to) retain the ownership as prescribed in the archive.)
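
A small sketch of the -p behavior, using a throwaway archive: a file is archived with mode 755, the umask is then tightened, and the file is extracted with -p. The names (src, out, script.sh, demo.tar) are made up for this example.

```shell
#!/bin/sh
# Sketch: -p keeps the archived permissions regardless of the umask.
# All names below are hypothetical and only serve this demonstration.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

mkdir src out
echo "data" > src/script.sh
chmod 755 src/script.sh
tar -cf demo.tar -C src script.sh

# Tighten the umask; an extraction without -p by an unprivileged user
# would typically be limited by it.
umask 077
tar -xpf demo.tar -C out

# Show the permission string of the extracted file.
mode=$(ls -l out/script.sh | cut -c1-10)
echo "$mode"
```

With -p, the extracted file should show the archived mode (-rwxr-xr-x) rather than a mode reduced by the umask.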

Extracting Partial Contents

Sometimes you may wish to extract only some of the files from an archive. You can tell tar(1) which files you are looking for by specifying them on the command-line:

$ tar -xvf tar-latest.tar tar-1.28/README tar-1.28/ChangeLog
tar-1.28/README
tar-1.28/ChangeLog
$ 
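
The same idea as a self-contained sketch that does not depend on the downloaded archive. The project layout and file names are invented for this example; the member name must be given exactly as it appears in the output of "tar -t".

```shell
#!/bin/sh
# Sketch: extract a single named member from an archive.
# The "project" tree below is made up for this demonstration.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

mkdir -p project/doc out
echo "read me" > project/README
echo "notes"   > project/doc/notes.txt
echo "code"    > project/main.c
tar -cf project.tar project

# Only the named member is extracted; everything else stays in the archive.
tar -xf project.tar -C out project/README

# List what actually ended up on disk.
extracted=$(cd out && find . -type f | sort)
echo "$extracted"
```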

You can specify wildcards to extract multiple files, although different implementations may require you to use different syntax. For example, using GNU tar(1), you could extract only the C source and header files like so:

$ tar -xf tar-latest.tar --wildcards '*.[ch]'
$ find tar-1.28 -name '*.[ch]' -print | wc -l
417
$ 
Note that the wildcards are single-quoted to prevent the current shell from interpreting the globs. If you had files ending in .c or .h in the current working directory and didn't quote the wildcards, your command would have failed:
$ touch foo.c bar.h
$ tar -xf tar-latest.tar --wildcards *.[ch]
tar: bar.h: Not found in archive
tar: foo.c: Not found in archive
tar: Exiting with failure status due to previous errors
$ 

Creating an Archive

To create an archive, specify the name of the archive you wish to create and the files or directories to include:

$ tar -cvf archive.tar /bin ../../some/path file1 dir2/
tar: Removing leading `/' from member names
/bin/
/bin/bzmore
/bin/ed
...
tar: Removing leading `../../' from member names
../../some/path
file1
dir2/
dir2/file
dir2/subdir/
...
Note that tar(1) will descend into any directories given and will retain the resulting hierarchy. As a security precaution, so that extracting the archive later cannot accidentally overwrite files outside of the directory you extract into, it will remove pathname prefixes that point outside of the current working directory, both absolute and relative. That is, "/bin/ed" will become "bin/ed" in your archive; "../../some/path" would become "some/path". You can verify this by inspecting the contents of the archive you just created.
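
One way to perform that verification is sketched below, using a temporary directory in place of /bin. The directory and archive names are made up; the script assumes that mktemp(1) returns an absolute path and that your tar(1) strips the leading "/" by default, as described above.

```shell
#!/bin/sh
# Sketch: confirm that member names are stored without a leading '/'.
# "data" and "abs.tar" are hypothetical names for this demonstration.
workdir=$(mktemp -d) || exit 1
mkdir "$workdir/data"
echo "x" > "$workdir/data/file"

# Archive an absolute path; the warning about the leading '/' goes to stderr.
tar -cf "$workdir/abs.tar" "$workdir/data" 2>/dev/null

# List the stored member names and count those still starting with '/'.
members=$(tar -tf "$workdir/abs.tar")
echo "$members"
absolute=$(echo "$members" | grep -c '^/' || true)
echo "members with a leading slash: $absolute"
```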

Compression

Nowadays, tar(5) files are frequently compressed using, for example, Lempel-Ziv (compress(1)), LZ77 (gzip(1)), or Burrows-Wheeler (bzip2(1)) coding, and most tar(1) implementations have support for compression built in. Most commonly, you can use the -z flag to enable gzip(1) compression handling, or the -j flag for bzip2(1) compression handling. Some implementations may also let you specify an arbitrary command to invoke to handle compression.

Archives are now usually distributed using a filename ending in .tar.gz (or .tgz) to indicate gzip(1) compression, or .tar.bz2 (or .tbz) to indicate bzip2(1) compression.

In other words, what used to be separate commands:

$ gzip -d tar-latest.tar.gz
$ tar -tvf tar-latest.tar
...
or, more idiomatic:
$ gzip -d -c tar-latest.tar.gz | tar -tvf -
can usually be handled by using:
$ tar -ztvf tar-latest.tar.gz
and similarly for archive creation or extraction.
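
As a quick check of that equivalence, the sketch below creates a gzip(1)-compressed archive with -z and lists it both with the built-in handling and via an explicit pipeline. The names demo and demo.tar.gz are invented for this example, and the script assumes gzip(1) is installed.

```shell
#!/bin/sh
# Sketch: built-in gzip handling (-z) vs. an explicit gzip pipeline.
# All names below are made up for this demonstration.
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1

mkdir demo
echo "hello" > demo/file.txt

# Create a gzip-compressed archive in one step.
tar -czf demo.tar.gz demo

# List it with the built-in handling, then via a separate gzip process.
builtin_listing=$(tar -tzf demo.tar.gz)
piped_listing=$(gzip -d -c demo.tar.gz | tar -tf -)

if [ "$builtin_listing" = "$piped_listing" ]; then
    echo "listings match"
else
    echo "listings differ"
fi
```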

To illustrate the use of another, external compression program, consider the use of xz(1):

$ tar -cf archive.tar.xz --use-compress-program=xz directory
$ file archive.tar.xz
archive.tar.xz: XZ compressed data
$ tar -tvf archive.tar.xz --use-compress-program=xz
directory/
...
$ xz -c -d archive.tar.xz | tar -tvf -
directory/
...

This One Weird Trick

Like many a good Unix utility, tar(1) can read input from stdin and write to stdout, making it a flexible tool to use in a pipe.

For example, to copy a directory hierarchy from one part of the filesystem to another, one might run:

$ tar -cf - -C /usr share | tar -xf - -C /backup
Here, we are effectively copying the contents of /usr/share to /backup/share. Note the use of the -C flag to change the working directory of tar(1) during archive creation and extraction.
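
If you would like to try this without touching /usr or /backup, here is a sketch of the same pipeline with temporary directories standing in for both; the directory and file names are made up for the example. diff(1) is used afterwards to compare the two trees.

```shell
#!/bin/sh
# Sketch: copy a directory hierarchy through a tar pipe and verify the copy.
# "src" plays the role of /usr and "backup" that of /backup.
workdir=$(mktemp -d) || exit 1
mkdir -p "$workdir/src/share/doc" "$workdir/backup"
echo "manual" > "$workdir/src/share/doc/manual.txt"
echo "readme" > "$workdir/src/share/README"

# The first tar writes the archive to stdout, the second reads it from stdin.
tar -cf - -C "$workdir/src" share | tar -xf - -C "$workdir/backup"

# diff -r exits with status 0 only if both trees have identical contents.
if diff -r "$workdir/src/share" "$workdir/backup/share"; then
    copy_status="identical"
else
    copy_status="different"
fi
echo "$copy_status"
```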

This concept becomes powerful when you realize that you can use this approach to copy files from one host to another. Consider, for example, two hosts hostA and hostB on which you have an account, but which can't talk to each other directly (for example due to firewall restrictions).

From your current system, you could copy a file system hierarchy from one host to the other without any intermediary files by running:

$ ssh hostA "tar -czf - dir" | ssh hostB "tar -xzf -"

Similar Tools

cpio(1)

Another popular archiving tool is cpio(1), which nowadays finds its most widespread use by way of the rpm(1) package manager as well as part of the initramfs during the Linux boot process.

pax(1)

The original archive format did not include all the information about a file one might wish to retain, and POSIX.1 provided a definition for a new file format as implemented by the pax(1) utility. This tool is backwards compatible and can generally read and write tar(5) archives, but allows for additional features. However, tar(1) remains the most popular archive utility, even though on some systems it actually is implemented by way of pax-as-tar (i.e., pax(1) invoked as tar(1)).

pax(1) can also read cpio(1) format.

tar(1), pax(1), and cpio(1) equivalent invocations

viewing contents
  tar(1):   tar -tvf archive.tar
  pax(1):   pax < archive.tar
  cpio(1):  cpio -i -t < archive.cpio

extracting contents
  tar(1):   tar -xvf archive.tar
  pax(1):   pax -rv < archive.tar
  cpio(1):  cpio -i -v -d < archive.cpio

creating an archive
  tar(1):   tar -cvf archive.tar directory file1 file2
  pax(1):   pax -wvf archive.tar directory file1 file2
  cpio(1):  find directory file1 file2 -print | cpio -ov > archive.cpio
