Name

system-manager — manage the system as process #1

Synopsis

system-manager [args...]

init [args...]

Description

system-manager is meant to be invoked as process #1, either as the first user process of an entire system, or as the first process of a "container" running within a Linux PID namespace or a BSD jail. It will not operate correctly if it is not process #1. To manage per-user (non-system-wide) things, use per-user-manager(8) instead. It should also not be confused with service-manager(1).

Its design is intended to keep process #1 simple, since the operating system regards it as a vital system process. In particular:

  • system-manager doesn't contain (or link to) library code for complex parsing and communications functionality, such as XML parsers and libraries for D-Bus, PAM, and udev. No parsing or RPC marshalling is done by process #1. It is also not involved in any Plug-and-Play device management or Desktop bus systems.

  • Process #1 is the system manager, as distinguished from the service manager which is another process. Process #1 neither contains nor manages service state tables. It does not have open file handles to the service control FIFOs, and its operation is not complicated by mixing the system state with individual service states.

  • Process #1 has no hand in calculating the details of system state changes. That's done by a separate program running as another process.

The operation of system-manager falls into four parts: process setup, system setup, reaping, and responding to system events.

Process setup

system-manager expects to be started in the normal state for process #1 (of the system or of a container/jail). It does very little to its process state, which is inherited by the service manager and the logger:

  • It sets itself as a session leader, as if by setsid(1). If, as is the case on FreeBSD/TrueOS, the session already has a controlling TTY device, the association from the session to that device is removed.

  • (On operating systems that support this) It calls setlogin(2) to set the session's login name to root.

  • It changes current directory to / as if by chdir(1), on the grounds that on some systems there is an "initrd" mechanism that might have left the current directory somewhere else.

  • It resets the file/directory creation mask to 0000 as if by umask(1), on the same grounds.

  • It sets the hardwired default environment:

    • PATH=/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin

    • LANG=C.UTF-8 (Linux operating systems, per the GNU C library project and consequent initiatives in Gentoo, Fedora, Debian, and others)

    • LANG=C (others)

  • It reads the administrator-configurable default environment. If the directory /etc/locale.d exists, it processes it as if by envdir(1). Otherwise it processes, as if by read-conf(1), the first file that is found (and can be opened for reading) in the list:

    1. /etc/locale.conf

    2. /etc/default/locale

    3. /etc/sysconfig/i18n

    4. /etc/sysconfig/language

    5. /etc/sysconf/i18n

    As the names indicate, this default environment is only expected to comprise locale-controlling variables such as LANG.
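
The lookup order described above can be illustrated with a minimal sketch. It is not the program's actual code: the function names and the root parameter are hypothetical, introduced only so that the lookup can be tried against a scratch directory rather than the real /etc.

```python
import os

# Candidate files, relative to the root, in the order listed above.
LOCALE_FILES = [
    "etc/locale.conf",
    "etc/default/locale",
    "etc/sysconfig/i18n",
    "etc/sysconfig/language",
    "etc/sysconf/i18n",
]


def hardwired_environment(is_linux):
    """Return the built-in defaults that apply before configuration is read."""
    return {
        "PATH": "/usr/local/bin:/usr/local/sbin:/usr/bin:/usr/sbin:/bin:/sbin",
        "LANG": "C.UTF-8" if is_linux else "C",
    }


def pick_locale_source(root="/"):
    """Return ("envdir", path), ("file", path), or None if nothing is usable.

    `root` is a hypothetical parameter for experimenting outside the real /.
    """
    envdir = os.path.join(root, "etc/locale.d")
    if os.path.isdir(envdir):
        return ("envdir", envdir)
    for relative in LOCALE_FILES:
        path = os.path.join(root, relative)
        try:
            # "Can be opened for reading" is checked by actually opening it.
            with open(path, "rb"):
                return ("file", path)
        except OSError:
            continue
    return None
```

The directory takes precedence over every file, and a file that exists but cannot be opened is skipped rather than treated as an error.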

System setup

system-manager performs various setup actions so that the full kernel "API" is visible to itself and its descendants:

  • It mounts the "API" filesystems in their accustomed places.

  • It creates the device nodes for various "early" devices that are required to exist before any plug-and-play device management services start up.

  • If control groups are available and it is in one, it enables the CPU, memory, IO, and tasks control group controllers for its own control group and for the service-manager.slice control group immediately below it. It moves itself into a me.slice control group, so that the controllers can be enabled for sub-groups.

  • It instructs the kernel to send the signals for various optional system events such as secure-attention-key and kbrequest.

  • It corrects the system clock.
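
As an illustration of the control group step, the following sketch assumes a version 2 control group hierarchy, where controllers are enabled by writing to a cgroup.subtree_control file and a process is moved by writing its ID to cgroup.procs. The controller names (in particular pids for the tasks controller) and the function names are assumptions of this sketch, not taken from the program itself.

```python
import os

# Assumption: version 2 names for the CPU, memory, IO, and tasks controllers.
CONTROLLERS = ["cpu", "memory", "io", "pids"]


def subtree_control_request(controllers=CONTROLLERS):
    """Build the string that enables controllers for child groups."""
    return " ".join("+" + name for name in controllers)


def prepare_cgroups(own_group, pid):
    """Move `pid` into me.slice, then enable controllers below `own_group`.

    `own_group` is the directory of the manager's original control group.
    """
    me = os.path.join(own_group, "me.slice")
    svc = os.path.join(own_group, "service-manager.slice")
    os.makedirs(me, exist_ok=True)
    os.makedirs(svc, exist_ok=True)
    # The move comes first, so that the original group holds no processes
    # of its own when controllers are enabled for its sub-groups.
    with open(os.path.join(me, "cgroup.procs"), "w") as f:
        f.write(str(pid))
    for group in (own_group, svc):
        with open(os.path.join(group, "cgroup.subtree_control"), "w") as f:
            f.write(subtree_control_request())
```

Run against an ordinary scratch directory, this only creates plain files, which makes the order of operations easy to inspect without touching a real hierarchy.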

Reaping

system-manager operates as a "grim reaper", cleaning up after any child processes that exit. The operating system re-parents a few orphaned processes (mainly ones started directly by the kernel) to it. system-manager spawns exactly three processes itself:

  • After creating a local domain socket at /run/service-manager/control, it spawns an instance of service-manager(1). If control groups are available, it is run in its own dedicated service-manager.slice subordinate control group below system-manager's original control group. This is the global service manager for the system, controlled through the socket. It is not expected to ever terminate (before shutdown). If it does, system-manager re-spawns it.

    Most orphaned processes in the system are re-parented to this sub-process, or further subordinate per-user service manager processes, and not to system-manager.

  • As system events occur, it spawns (ephemeral) instances of system-control(1). If control groups are available, they are run in their own dedicated system-control.slice subordinate control group below system-manager's original control group. These calculate the details of service and target dependencies for system state changes, and pass instructions to the global service manager for bringing services up and down. Only one instance is spawned at a time.

  • It spawns an instance of cyclog(1) with its input connected to the read end of a pipe. If control groups are available, it is run in its own dedicated system-manager-log.slice subordinate control group below system-manager's original control group. This process is expected to only terminate when the pipe is closed. If it terminates otherwise, system-manager simply re-spawns it.

The write end of the aforementioned pipe is connected to the standard outputs and standard errors of the service manager, the (ephemeral) service controllers, and of system-manager itself. (Their standard input is connected to /dev/null.) system-manager retains open file descriptors to this pipe, so that no unsaved log data are lost should the logger unexpectedly exit.
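
Why retaining the descriptors protects log data can be shown with a small in-process sketch. No real logger is involved here; the function name is hypothetical, and the "re-spawned logger" is simply a later read from the same pipe.

```python
import os


def demo_retained_pipe():
    """Show that data written while no logger is reading is not lost."""
    read_end, write_end = os.pipe()
    # The manager keeps both descriptors open for its whole lifetime, so
    # the pipe (and its buffered contents) outlives any one logger.
    os.write(write_end, b"message written while the logger was down\n")
    # A re-spawned logger inherits the same read end and picks up
    # whatever is still buffered in the pipe.
    data = os.read(read_end, 4096)
    os.close(read_end)
    os.close(write_end)
    return data
```

Had the read end been held only by the logger, its exit would have destroyed the pipe's reader, and writers would have received SIGPIPE or EPIPE instead.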

The logger is intended to be just for the system manager, the service manager, and the service controllers. Actual services should be plumbed to their own logging services. The logger is told to write its logfiles to /run/system-manager/log, which will by default be in a tmpfs filesystem, and to cap their maximum total size at 1MiB.
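
The reaping behaviour described in this section can be sketched as a non-blocking wait loop. This is only an illustration using the standard waitpid(2) interface; the function names and the long_lived table are hypothetical and are not the program's own data structures.

```python
import os


def reap_exited_children():
    """Collect every child that has already exited, without blocking.

    Returns a list of (pid, exit_code) pairs; exit_code is None for a
    child that was killed by a signal.
    """
    reaped = []
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # there are no children at all
        if pid == 0:
            break  # children exist, but none has exited yet
        code = os.WEXITSTATUS(status) if os.WIFEXITED(status) else None
        reaped.append((pid, code))
    return reaped


def needs_respawn(pid, long_lived):
    """Return the name of the long-lived child with this pid, or None.

    `long_lived` is a hypothetical pid-to-name table covering the
    service manager and the logger.
    """
    return long_lived.get(pid)
```

A process #1 would typically run such a loop each time it receives SIGCHLD, re-spawning the service manager or the logger when one of them turns up among the reaped processes and simply discarding the status of everything else.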

System event response

The only IPC mechanism provided by system-manager is signals. (Commands to manipulate services are sent to the spawned service manager, not to the system manager.) System-wide events are flagged, by the kernel and by other programs, by sending various signals to process #1. system-manager responds to these signals as follows:

SIGRTMIN + 3, SIGRTMIN + 4, SIGRTMIN + 5, and SIGRTMIN + 7 (and, for compatibility, respectively SIGUSR1, SIGUSR2, SIGINT, and SIGWINCH on BSD)

Spawn (respectively) system-control start halt, system-control start poweroff, system-control start reboot, or system-control start powercycle. This will activate the halt, poweroff, reboot, or powercycle target.

Activating these targets activates the shutdown target. Other targets do not imply shutdown. shutdown is configured to conflict with login services and all normal server and workstation services, and will hence cause them to be stopped. (This is written into the packaged target definitions, not hardwired into system-control(8).)

SIGRTMIN + 2

Spawn system-control start emergency. This will activate the emergency target.

SIGRTMIN + 1

Spawn system-control start rescue. This will activate the rescue target.

SIGRTMIN + 0

Spawn system-control start normal. This will activate the normal target.

SIGPWR

Spawn system-control activate powerfail. This will activate the powerfail target, which is expected to take action to deal with impending power failure.

SIGWINCH (on Linux)

Spawn system-control activate kbrequest. This will activate the kbrequest target.

SIGINT (on Linux)

Spawn system-control activate secure-attention-key. This will activate the secure-attention-key target.

SIGRTMIN + 13, SIGRTMIN + 14, SIGRTMIN + 15, SIGRTMIN + 17

Close the pipe, terminate the service manager, and wait a short while for it. If the system manager is the system-wide process #1, tell the kernel to flush its disc cache and (respectively) halt, power off, reboot, or power cycle the system. Otherwise, if the system manager is running in a container/jail, just exit.

When the reboot, halt, powercycle, and poweroff targets are fully active, they are expected to send the SIGRTMIN + 15, SIGRTMIN + 13, SIGRTMIN + 17, and SIGRTMIN + 14 signals (respectively) to process #1. In the packaged target definitions, they use the --force option to the reboot, halt, poweroff, and powercycle subcommands of system-control(8) to do this. Do not send these signals directly, as this does not shut down services in order.

SIGRTMIN + 10

Spawn system-control activate sysinit. This will activate the sysinit target.
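
The real-time signal responses listed above amount to a lookup table, sketched below. The table and function names are hypothetical. The value of SIGRTMIN is passed in as a parameter because it varies between platforms, and the non-real-time signals (SIGPWR, SIGWINCH, SIGINT, and the BSD compatibility signals) are left out for the same reason.

```python
# Offsets from SIGRTMIN, and the system-control arguments spawned for each.
SPAWN_FOR_OFFSET = {
    0: ["start", "normal"],
    1: ["start", "rescue"],
    2: ["start", "emergency"],
    3: ["start", "halt"],
    4: ["start", "poweroff"],
    5: ["start", "reboot"],
    7: ["start", "powercycle"],
    10: ["activate", "sysinit"],
}

# Offsets that make process #1 itself perform the final action.
FINAL_ACTION_FOR_OFFSET = {
    13: "halt",
    14: "poweroff",
    15: "reboot",
    17: "powercycle",
}


def response(signo, rtmin):
    """Classify a real-time signal number, given the platform's SIGRTMIN.

    Returns ("spawn", argv), ("final", action), or None if unhandled.
    """
    offset = signo - rtmin
    if offset in SPAWN_FOR_OFFSET:
        return ("spawn", ["system-control"] + SPAWN_FOR_OFFSET[offset])
    if offset in FINAL_ACTION_FOR_OFFSET:
        return ("final", FINAL_ACTION_FOR_OFFSET[offset])
    return None
```

Note that each final-action offset is the corresponding target-activation offset plus ten, which keeps the orderly request and the forced action visibly paired.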

What the kbrequest and secure-attention-key targets do is configured by the system administrator.

  • For traditional Linux and BSD semantics, secure-attention-key should run the reboot(8) command (or some wrapper around it) and kbrequest should run the rescue(8) or emergency(8) command.

  • For semantics more akin to those of Microsoft Windows NT, secure-attention-key should run login(1) on a (secure) console, or the GUI equivalent on a secure desktop; and kbrequest should run vlock(1) or the GUI equivalent, similarly.

system-manager startup is also treated as a system event. In response to this "event", system-manager spawns system-control init, passing it the [args...] that were supplied on its own command line. (For process #1 of the entire system, these options are supplied to the initial program by the boot loader via the kernel. In a container/jail, they are supplied by the container/jail configuration.) This calculates what to initialize, deduced from those arguments, and sends appropriate signals back to the system manager process.

API filesystems and early devices

"API" filesystems are filesystems that do not employ persistent backing storage, and that provide means for interrogating and configuring kernel mechanisms. They are thus effectively extensions to the kernel's system call API.

Linux API filesystems and early devices

/proc

A proc filesystem is mounted here with options nodev and nosuid. (noexec is not used because that would disallow the trick of using /proc/N/exe to re-execute a process' executable.)

/sys

The sysfs filesystem is mounted here with options nodev, noexec, and nosuid.

/run

A tmpfs filesystem is mounted here with options nodev, nosuid, strictatime, size=20%, and mode=0755.

/run/shm

A tmpfs filesystem is mounted here with options nodev, noexec, nosuid, strictatime, size=50%, and mode=01777.

/dev

A devtmpfs filesystem is mounted here with options nosuid, strictatime, size=10M, and mode=0755. (noexec is not used because old versions of programs such as /sbin/v86d memory map devices such as /dev/zero with PROT_EXEC access for no good reason. The newer versions of such programs were fixed in the first decade of the 21st century.)

/dev/pts

A devpts filesystem is mounted here with options noexec, nosuid, ptmxmode=0666, gid=tty, newinstance, and mode=0620. tty is currently hardwired to 5, because the library functions for reading the system account database require dynamic link library and network functionality that are inappropriate for process #1.

/dev/ptmx

This is symbolically linked to /dev/pts/ptmx to take advantage of the fact that the devpts filesystem nowadays provides a ptmx device node that is guaranteed correct for its own set of PTY devices. With this, obtaining PTYs will work correctly even in a container.

/dev/fd

This is symbolically linked to /proc/self/fd for compatibility with BSD programs that expect a single /dev/fd tree for the current process.

/dev/core

This is symbolically linked to /proc/kcore.

/dev/stdin, /dev/stdout, and /dev/stderr

These are symbolically linked to /proc/self/fd/0, /proc/self/fd/1, and /proc/self/fd/2, respectively.

/dev/shm

This is symbolically linked to /run/shm for compatibility with C/C++ libraries.

/sys/fs/cgroup

A tmpfs filesystem is mounted here with options size=1M and mode=0755. This is so that subdirectories for actual (version 1) control group hierarchies can be created here as further mount points. (With version 2 control groups, this would be the root of a single hierarchy.)

/sys/fs/cgroup/systemd

A cgroup filesystem is mounted here with options name=systemd and none. This sets up the root of a version 1 control group hierarchy that other toolsets' tools will understand. (With version 2 control groups, this would be at /sys/fs/cgroup and have no name parameter.)
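
The Linux mounts and symbolic links above can be summarized as data, as in the following sketch. The table and function names are hypothetical, and rendering each entry as a mount(8) command line is purely for readability; nothing here implies that the program shells out to mount(8).

```python
# (mount point, filesystem type, options), parents before children.
LINUX_API_MOUNTS = [
    ("/proc", "proc", ["nodev", "nosuid"]),
    ("/sys", "sysfs", ["nodev", "noexec", "nosuid"]),
    ("/run", "tmpfs", ["nodev", "nosuid", "strictatime", "size=20%", "mode=0755"]),
    ("/run/shm", "tmpfs",
     ["nodev", "noexec", "nosuid", "strictatime", "size=50%", "mode=01777"]),
    ("/dev", "devtmpfs", ["nosuid", "strictatime", "size=10M", "mode=0755"]),
    # gid=5 is the hardwired numeric ID of the tty group.
    ("/dev/pts", "devpts",
     ["noexec", "nosuid", "ptmxmode=0666", "gid=5", "newinstance", "mode=0620"]),
    ("/sys/fs/cgroup", "tmpfs", ["size=1M", "mode=0755"]),
    ("/sys/fs/cgroup/systemd", "cgroup", ["name=systemd", "none"]),
]

# (link path, link target)
LINUX_SYMLINKS = [
    ("/dev/ptmx", "/dev/pts/ptmx"),
    ("/dev/fd", "/proc/self/fd"),
    ("/dev/core", "/proc/kcore"),
    ("/dev/stdin", "/proc/self/fd/0"),
    ("/dev/stdout", "/proc/self/fd/1"),
    ("/dev/stderr", "/proc/self/fd/2"),
    ("/dev/shm", "/run/shm"),
]


def mount_command(point, fstype, options):
    """Render one table entry as an equivalent mount(8) argument vector."""
    return ["mount", "-t", fstype, "-o", ",".join(options), fstype, point]
```

For example, mount_command(*LINUX_API_MOUNTS[0]) yields the argument vector for mounting /proc with nodev and nosuid.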

BSD API filesystems and early devices

/proc

A procfs filesystem is mounted here with option nosuid.

/run

A tmpfs filesystem is mounted here with options nosuid and size=20%.

/run/shm

A tmpfs filesystem is mounted here with options nosuid and size=50%.

/dev

A devfs filesystem is mounted here with option nosuid.

/dev/fd

A fdescfs filesystem is mounted here with option nosuid.

/dev/shm

This is symbolically linked to /run/shm for compatibility with C/C++ libraries.

The system clock

When the system starts process #1, the operating system kernel's system clock will have been initialized from a hardware real-time clock. On an all-BSD/Linux system, that hardware real-time clock will be running in UTC, and the system clock will thus be initially set to a proper UTC value. On a more heterogeneous system, the hardware real-time clock may be mistakenly running in a local time. This usually leads to some program or other, during the bootstrap process, having to determine the offset between RTC local time and UTC and correct the system clock. A consequence of this is that the system clock jumps by hours partway through the system bootstrap. In particular, system time leaps backwards for machines whose RTC local time is ahead of UTC, which is not something that POSIX programs are written to expect.

Furthermore, the operating system tries to do silly things with FAT volumes. Instead of just taking file and directory timestamps to be UTC, the filesystem driver takes the timestamps to be local time, so needs to know how to convert FAT local time (as on disc) to UTC (as seen at the system call interface with stat(2) and so forth). Because this is done in-kernel, a simplistic and hence broken mechanism is used. A single offset between FAT local time and UTC is applied to all timestamps.

fsck(8) also needs to have the correct system time and the local time offset to hand. Otherwise, it miscalculates timestamps on FAT volumes, and compares the wrong system time against the superblock's "last checked" timestamps on EXT and other volumes. This means that the FAT local time offset must be provided to the kernel before any fsck(8) is run by the system bootstrap.

system-manager performs adjustments to the system clock and supplies the kernel with the one offset from FAT/RTC local time to UTC before the service manager or logger are started up, to ensure that time running backwards happens at a predictable point during the system bootstrap, and that it happens before any filesystem checks can run.
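
The size and direction of the clock step can be worked through with a little arithmetic. The sketch below is illustrative only; the function name is hypothetical, and times are plain counts of seconds.

```python
def clock_correction(rtc_reading, utc_offset_seconds):
    """Return (true_utc, step) for an RTC that keeps local time.

    rtc_reading: the seconds value that the kernel read from the RTC
    and mistakenly took to be UTC.
    utc_offset_seconds: local time minus UTC (positive east of Greenwich).
    """
    true_utc = rtc_reading - utc_offset_seconds
    step = true_utc - rtc_reading
    return true_utc, step
```

With an RTC two hours ahead of UTC, clock_correction(43200, 7200) gives a step of -7200: the system clock must leap backwards by two hours, which is the case that the text above singles out.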

The Linux system clock

When the hardware clock is mistakenly running in local time, the system clock is initialized to the wrong value, since the kernel is always expecting to read UTC from the hardware clock at that point. The shifting back to UTC is done using a special once-only variant of the settimeofday(2) system call. The FAT time offset is set by the normal variant of settimeofday(2).

So as part of system initialization, system-manager calls the once-only special variant of settimeofday(2) to set the system time back to UTC and calls the normal variant of the same function to provide the RTC-local-time-to-UTC and FAT-local-time-to-UTC offsets.
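
The offset handed to settimeofday(2) travels in the tz_minuteswest field of struct timezone, which counts minutes west of Greenwich and so has the opposite sign to the usual "UTC+N" notation. A minimal sketch of the conversion follows; the function name is hypothetical, and the offset is assumed to be a whole number of minutes.

```python
def tz_minuteswest(utc_offset_seconds):
    """Convert "local minus UTC" in seconds to minutes west of Greenwich."""
    return -(utc_offset_seconds // 60)
```

A zone two hours ahead of UTC thus yields -120, and a zone five hours behind yields 300.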

The BSD system clock

BSDs have a machdep.adjkerntz variable and a machdep.wall_cmos_clock variable (see sysctl(1)) that can be set by the kernel loader from loader.conf(5). These supply the offset between local time, as on FAT volumes and as in the hardware clock, and UTC. Since they can be set before the kernel first sets the system clock (with inittodr(9)) and thus the local time offset can be applied from the get-go when first transferring from the hardware clock to the system clock, there is no requirement to step the system clock later in the bootstrap process. However, they are often not set in loader.conf(5), and the system clock is initialized incorrectly.

So as part of system initialization, system-manager calculates machdep.adjkerntz and machdep.wall_cmos_clock (the latter from the existence of /etc/wall_cmos_clock and the former from the timezone database), updates them, and changes the system clock with settimeofday(2).

Warts

The signal numbers should be uniform across BSD and Linux. They aren't because of the BSD shutdown(8) command, which sends signals directly to process #1, meaning that system-manager has to align with whatever signals it sends.

Because /usr/lib and its ilk aren't necessarily present at mount time, the system-manager program image file is statically linked and also incorporates (copies of) the service-manager(1), system-control(1), and cyclog(1) commands as built-in commands.

Author

Jonathan de Boyne Pollard