My Linux Note

Linux调度

  • linux内核设计与实现第4章
  • 不同进程有不同的调度算法(这种模块化结构称为调度器类)。每个调度器都有一个优先级,不同调度器之间如何协调?拥有一个可执行进程的最高优先级的调度器会胜出,去选择接下来要执行哪个进程(The highest priority scheduler class that has a runnable process wins, selecting who runs next.)

CFS

  • 从Linux2.6.23开始引入
  • 针对普通进程的调度器类(Linux中称为SCHED_NORMAL,POSIX中称为SCHED_OTHER)
  • 基于一个简单的理念:每个进程都能获得1/n的处理器时间(n是可运行的进程数量),任何可测量的周期内都是1/n(一个先运行5ms,另一个运行5ms是不满足这个理念的,要做到好像他们在10ms内同时运行,各自使用了一半的处理器能力)
    • Ref: https://www.cs.columbia.edu/~junfeng/13fa-w4118/lectures/l13-adv-sched.pdf
    • Approximate fair scheduling. Run each process once per schedule latency period(就是sched_latency_ns,用sysctl -a可以看到). Time slice for process Pi: T * Wi/(Sum of all Wi)
    • Too many processes? Lower bound on smallest time slice. Schedule latency = lower bound * number of procs
    • Pick proc with weighted minimum runtime so far. Virtual runtime: task->vruntime += executed time / Wi
    • Red-black tree, Balanced binary search tree, Ordered by vruntime as key, O(lgN) insertion, deletion, update, O(1): find min(min_vruntime caches smallest value)
    • Update vruntime and min_vruntime: When task is added or removed, On every timer tick, context switch
    • Converting nice level to weight: static const int prio_to_weight[40] (sched.h). A change of 1 nice level means roughly a 10% change in weight. (Pre-computed to avoid floating point operations and runtime overhead.) (In the current implementation, each unit of difference in the nice values of two processes results in a factor of 1.25 in the degree to which the scheduler favors the higher priority process. 来自sched和nice的manual)
      // from https://elixir.bootlin.com/linux/v3.4/source/kernel/sched/sched.h#L804
      /*
       * Nice levels are multiplicative, with a gentle 10% change for every
       * nice level changed. I.e. when a CPU-bound task goes from nice 0 to
       * nice 1, it will get ~10% less CPU time than another CPU-bound task
       * that remained on nice 0.
       *
       * The "10% effect" is relative and cumulative: from _any_ nice level,
       * if you go up 1 level, it's -10% CPU usage, if you go down 1 level
       * it's +10% CPU usage. (to achieve that we use a multiplier of 1.25.
       * If a task goes up by ~10% and another task goes down by ~10% then
       * the relative distance between them is ~25%.)
       */
      static const int prio_to_weight[40] = {
       /* -20 */ 88761, 71755, 56483, 46273, 36291,
       /* -15 */ 29154, 23254, 18705, 14949, 11916,
       /* -10 */  9548,  7620,  6100,  4904,  3906,
       /*  -5 */  3121,  2501,  1991,  1586,  1277,
       /*   0 */  1024,   820,   655,   526,   423,
       /*   5 */   335,   272,   215,   172,   137,
       /*  10 */   110,    87,    70,    56,    45,
       /*  15 */    36,    29,    23,    18,    15,
      };
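      A quick back-of-the-envelope sketch of how these weights turn into CPU shares (illustrative only; the 6 ms period and the pair of nice levels are arbitrary choices, not kernel defaults):

      #include <stdio.h>

      int main(void)
      {
          /* weights for nice 0 and nice 5, copied from prio_to_weight[] above */
          int w_nice0 = 1024, w_nice5 = 335;
          double period_ms = 6.0;                 /* assume a 6 ms scheduling period */
          double total = w_nice0 + w_nice5;

          printf("nice 0 slice: %.2f ms\n", period_ms * w_nice0 / total); /* ~4.52 ms */
          printf("nice 5 slice: %.2f ms\n", period_ms * w_nice5 / total); /* ~1.48 ms */
          return 0;
      }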
  • Hierarchical, modular scheduler
    class = sched_class_highest;
    for (;;) {
        p = class->pick_next_task(rq);
        if (p)
            return p;
        /*
         * Will never be NULL as the idle class always
         * returns a non-NULL p:
         */
        class = class->next;
    }

Ref: SUSE-doc-Tuning the Task Scheduler

sched_latency_ns
Targeted preemption latency for CPU bound tasks. Increasing this variable increases a CPU bound task’s timeslice. A task’s timeslice is its weighted fair share of the scheduling period:

timeslice = scheduling period * (task’s weight/total weight of tasks in the run queue)

The task’s weight depends on the task’s nice level and the scheduling policy. Minimum task weight for a SCHED_OTHER task is 15, corresponding to nice 19. The maximum task weight is 88761, corresponding to nice -20.

Timeslices become smaller as the load increases. When the number of runnable tasks exceeds sched_latency_ns/sched_min_granularity_ns, the slice becomes number_of_running_tasks * sched_min_granularity_ns. Prior to that, the slice is equal to sched_latency_ns.
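A sketch of the period rule just described (it mirrors the documented behaviour rather than the kernel's exact code; the function name is mine):

    unsigned long long effective_period(unsigned long long latency_ns,
                                        unsigned long long min_granularity_ns,
                                        unsigned int nr_running)
    {
        unsigned int nr_latency = latency_ns / min_granularity_ns;

        if (nr_running > nr_latency)              /* too many tasks: stretch the period */
            return (unsigned long long)nr_running * min_granularity_ns;
        return latency_ns;                        /* otherwise the period is sched_latency_ns */
    }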

sched_min_granularity_ns

Minimal preemption granularity for CPU bound tasks. See sched_latency_ns for details. The default value is 4000000 (ns).

This value also specifies the maximum amount of time during which a sleeping task is considered to be running for entitlement calculations. Increasing this variable increases the amount of time a waking task may consume before being preempted, thus increasing scheduler latency for CPU bound tasks. The default value is 6000000 (ns).

Real-time scheduling

  • Linux has soft real-time scheduling: No hard real-time guarantees
  • All real-time processes are higher priority than any conventional processes
  • Processes with priorities [0, 99] are real-time
  • Process can be converted to real-time via sched_setscheduler system call
  • First-in, first-out: SCHED_FIFO: Static priority, Process is only preempted for a higher-priority process, No time quanta; it runs until it blocks or yields voluntarily, RR within same priority level
  • Round-robin: SCHED_RR: As above but with a time quanta
  • Normal processes have SCHED_NORMAL scheduling policy
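    A minimal sketch of the sched_setscheduler() call mentioned above (the priority value 10 is an arbitrary example; CAP_SYS_NICE or root is required, and error handling is trimmed):

    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        struct sched_param sp = { .sched_priority = 10 };

        if (sched_setscheduler(0, SCHED_FIFO, &sp) == -1)   /* 0 = the calling process */
            perror("sched_setscheduler");
        return 0;
    }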

Multiprocessor scheduling

  • Per-CPU runqueue
  • Possible for one processor to be idle while others have jobs waiting in their run queues
  • Periodically, rebalance runqueues
  • Migration threads move processes from one runqueue to another
  • The kernel always locks runqueues in the same order for deadlock prevention

Adjusting priority

  • Goal: dynamically increase priority of interactive process
  • How to determine interactive? Sleep ratio, Mostly sleeping: I/O bound, Mostly running: CPU bound

misc

  • Scheduling policies(Ref: Scheduling policies)
    1. Scheduling policies

    CFS implements three scheduling policies:

    • SCHED_NORMAL (traditionally called SCHED_OTHER): The scheduling
      policy that is used for regular tasks.

    • SCHED_BATCH: Does not preempt nearly as often as regular tasks
      would, thereby allowing tasks to run longer and make better use of
      caches but at the cost of interactivity. This is well suited for
      batch jobs.

    • SCHED_IDLE: This is even weaker than nice 19, but its not a true
      idle timer scheduler in order to avoid to get into priority
      inversion problems which would deadlock the machine.

    SCHED_FIFO/_RR are implemented in sched/rt.c and are as specified by
    POSIX.

  • SCHEDULING CLASSES

    The new CFS scheduler has been designed in such a way to introduce “Scheduling
    Classes,” an extensible hierarchy of scheduler modules. These modules
    encapsulate scheduling policy details and are handled by the scheduler core
    without the core code assuming too much about them.

    sched/fair.c implements the CFS scheduler described above.

    sched/rt.c implements SCHED_FIFO and SCHED_RR semantics, in a simpler way than
    the previous vanilla scheduler did. It uses 100 runqueues (for all 100 RT
    priority levels, instead of 140 in the previous scheduler) and it needs no
    expired array.

    Scheduling classes are implemented through the sched_class structure, which
    contains hooks to functions that must be called whenever an interesting event
    occurs.

  • Linux有两种不同的优先级范围

    • nice值:从-20到19,越大优先级越低。所有unix系统的标准化概念,但是不同unix系统的调度算法不同,所以nice值的运用方式也不同。ps -el中NI那一列就是进程对应的nice值
    • 实时优先级:从0到99,越大优先级越高。任何实时进程的优先级都高于普通进程,也就是实时优先级跟nice值是两个不相交的范畴。ps -eo stat,uid,pid,ppid,rtprio,time,comm中,rtprio就是实时优先级,如果是-则说明不是实时进程
  • nice value(来自sched和nice的manual)
    • It affects the scheduling of SCHED_OTHER and SCHED_BATCH (see below) processes.
    • The nice value can be modified using nice(2), setpriority(2), or sched_setattr(2).
    • According to POSIX.1, the nice value is a per-process attribute; that is, the threads in a process should share a nice value. However, on Linux, the nice value is a per-thread attribute: different threads in the same process may have different nice values.
    • The range of the nice value varies across UNIX systems. On modern Linux, the range is -20 (high priority) to +19 (low priority). On some other systems, the range is -20..20. Very early Linux kernels (Before Linux 2.0) had the range -infinity..15.
    • The degree to which the nice value affects the relative scheduling of SCHED_OTHER processes likewise varies across UNIX systems and across Linux kernel versions.
    • With the advent of the CFS scheduler in kernel 2.6.23, Linux adopted an algorithm that causes relative differences in nice values to have a much stronger effect.
    • In the current implementation, each unit of difference in the nice values of two processes results in a factor of 1.25 in the degree to which the scheduler favors the higher priority process. This causes very low nice values (+19) to truly provide little CPU to a process whenever there is any other higher priority load on the system, and makes high nice values (-20) deliver most of the CPU to applications that require it (e.g., some audio applications).
    • On Linux, the RLIMIT_NICE resource limit can be used to define a limit to which an unprivileged process’s nice value can be raised; see setrlimit(2) for details.
    • Traditionally, only a privileged process could lower the nice value (i.e., set a higher priority). However, since Linux 2.6.12, an unprivileged process can decrease the nice value of a target process that has a suitable RLIMIT_NICE soft limit; see getrlimit(2) for details.
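    A small illustrative sketch of reading and raising the calling process's nice value with getpriority(2)/setpriority(2) (the +5 adjustment is arbitrary):

    #include <sys/resource.h>
    #include <errno.h>
    #include <stdio.h>

    int main(void)
    {
        errno = 0;
        int cur = getpriority(PRIO_PROCESS, 0);          /* 0 means the calling process */
        if (cur == -1 && errno != 0)
            perror("getpriority");

        if (setpriority(PRIO_PROCESS, 0, cur + 5) == -1) /* raising nice = lowering priority */
            perror("setpriority");

        printf("nice: %d -> %d\n", cur, getpriority(PRIO_PROCESS, 0));
        return 0;
    }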

进程状态

  • 来自linux内核设计与实现
  • TASK_RUNNING:进程是可执行的,可能正在执行,或在运行队列中等待执行。这是进程在用户空间中执行的唯一可能状态。内核空间中正在执行的进程也是这种状态
  • TASK_INTERRUPTIBLE:进程正在睡眠(也就是被阻塞),等待某些条件达成,一旦条件达成,内核就会把进程状态设置为RUNNING。处于此状态的进程会因为接收到信号而提前被唤醒并随时准备投入运行
  • TASK_UNINTERRUPTIBLE:与TASK_INTERRUPTIBLE的差别在于,就算接收到信号也不会被唤醒或准备投入运行。通常在进程必须在等待时不受干扰或等待的事件很快就会发生时出现(比如这个任务正在进行重要的操作,甚至可能持有一个信号量)。因为不对信号做响应,所以较之可中断的状态,用的较少。(ps中看到的D状态的进程就是这种状态)
  • __TASK_TRACED:被其他进程跟踪的进程,比如通过ptrace对调试程序进行跟踪
  • __TASK_STOPPED:进程停止执行,没有投入运行也不能投入运行。发生在接收到SIGSTOP、SIGTSTP、SIGTTIN、SIGTTOU信号时,调试期间接收到任何信号也会使得进程进入这种状态
  • 转换图见书P24(41)
  • Ref: ps manual
    the state of a process:
    D uninterruptible sleep (usually IO)
    I Idle kernel thread
    R running or runnable (on run queue)
    S interruptible sleep (waiting for an event to complete)
    T stopped by job control signal
    t stopped by debugger during the tracing
    W paging (not valid since the 2.6.xx kernel)
    X dead (should never be seen)
    Z defunct ("zombie") process, terminated but not reaped by its parent

僵尸进程、孤儿进程

  • 子进程的父进程终止后,成为孤儿进程,init(pid为1,所有进程的祖先)会接管孤儿进程,所以判断父进程是否终止的方法可以是检查getppid是否返回1(前提是这个子进程原本的父进程不是init)
  • 系统允许父进程在子进程终止后还去执行wait,以确定子进程是如何终止的。这是通过把子进程转为僵尸进程来处理的,僵尸进程释放了该进程持有的大部分资源,唯一保留的是内核中的进程表中的一条记录(包含pid、终止状态、资源使用数据等)。僵尸进程无法被SIGKILL杀死(SIGKILL、SIGSTOP信号无法被捕获和忽略)
  • 父进程调用wait之后,内核将删除僵尸进程,如果父进程没有执行wait就退出,init将接管子进程并自动调用wait
  • 如果父进程没有退出,而又没有wait子进程,那么就可能导致僵尸进程填满内核进程表,从而阻碍新进程的创建。将他们移除的唯一方法是:杀死父进程(或是等待父进程终止)从而init会wait
  • 父进程wait的方式
    • 父进程不带WNOHANG标志调用wait
    • 父进程带有WNOHANG标志轮询
    • 为SIGCHLD建立handler(默认是忽略)。需要在handler内部带WNOHANG不断调用wait以避免两个问题:调用信号Handler时,会阻塞引发该handler的信号;同一个信号不排队
    • 应该在创建任何子进程之前就设置好Handler,因为在设置Handler时,如果已有子进程终止,那么有的系统会立刻产生SIGCHLD信号,有的不会
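    Sketch of the SIGCHLD-handler approach just described: reap every dead child with a WNOHANG loop, since SIGCHLD is not queued (installation is shown in a comment; error handling trimmed):

    #include <signal.h>
    #include <sys/wait.h>
    #include <errno.h>

    static void sigchld_handler(int sig)
    {
        int saved_errno = errno;            /* waitpid() may clobber errno */
        (void)sig;
        while (waitpid(-1, NULL, WNOHANG) > 0)
            continue;                       /* reap all children that have already exited */
        errno = saved_errno;
    }

    /* Install before creating any children:
     *     struct sigaction sa;
     *     sa.sa_handler = sigchld_handler;
     *     sa.sa_flags = SA_RESTART;
     *     sigemptyset(&sa.sa_mask);
     *     sigaction(SIGCHLD, &sa, NULL);
     */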

进程内存布局

  • 共享库的虚拟地址是随着运行确定的,使用pmap拿地址布局可以看到,两次运行,共享库的地址布局不一样
  • ldd显示的lib地址也是随着ldd的不同调用而改变的

    Be aware that in some circumstances (e.g., where the program specifies an ELF interpreter other than ld-linux.so), some versions of ldd may attempt to obtain the dependency information by attempting to directly execute the program, which may lead to the execution of whatever code is defined in the program’s ELF interpreter, and perhaps to execution of the program itself

MISC

  • 识别僵尸进程ps -ef | grep "defunct"

SO_REUSEPORT

引用自

  • The new socket option allows multiple sockets on the same host to bind to the same port, and is intended to improve the performance of multithreaded network server applications running on top of multicore systems.
  • So long as the first server sets this option before binding its socket, then any number of other servers can also bind to the same port if they also set the option beforehand
  • The requirement that the first server must specify this option prevents port hijacking—the possibility that a rogue application binds to a port already used by an existing server in order to capture (some of) its incoming connections or datagrams.
  • To prevent unwanted processes from hijacking a port that has already been bound by a server using SO_REUSEPORT, all of the servers that later bind to that port must have an effective user ID that matches the effective user ID used to perform the first bind on the socket.
  • The second of the traditional approaches used by multithreaded servers operating on a single port is to have all of the threads (or processes) perform an accept() call on a single listening socket in a simple event loop of the form:
    while (1) {
        new_fd = accept(...);
        process_connection(new_fd);
    }

The problem with this technique, as Tom pointed out, is that when multiple threads are waiting in the accept() call, wake-ups are not fair, so that, under high load, incoming connections may be distributed across threads in a very unbalanced fashion

  • the SO_REUSEPORT implementation distributes connections evenly across all of the threads (or processes) that are blocked in accept() on the same port.
  • SO_REUSEPORT can be used with both TCP and UDP sockets.
  • Tom noted that the traditional SO_REUSEADDR socket option already allows multiple UDP sockets to be bound to, and accept datagrams on, the same UDP port. However, by contrast with SO_REUSEPORT, SO_REUSEADDR does not prevent port hijacking and does not distribute datagrams evenly across the receiving threads.
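    A minimal sketch of setting SO_REUSEPORT before bind() as described above (IPv4, port 8080 chosen arbitrarily; error handling omitted):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>

    int make_reuseport_listener(void)
    {
        int fd = socket(AF_INET, SOCK_STREAM, 0);
        int one = 1;
        struct sockaddr_in addr = { 0 };

        setsockopt(fd, SOL_SOCKET, SO_REUSEPORT, &one, sizeof(one)); /* must precede bind() */

        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(8080);
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));
        listen(fd, 128);
        return fd;
    }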

PTHREAD_STACK_MIN

Ref: Setting the Stack Size

  • The size attribute defines the size of the stack (in bytes) that the system allocates. The size should not be less than the system-defined minimum stack size. See About Stacks for more information.
  • size contains the number of bytes for the stack that the new thread uses. If size is zero, a default size is used. In most cases, a zero value works best.
  • PTHREAD_STACK_MIN is the amount of stack space that is required to start a thread. This stack space does not take into consideration the threads routine requirements that are needed to execute application code.

IO

  • 慢速syscall被信号中断时,已经read、write部分数据的syscall的处理:POSIX2001版是规定为成功返回,返回已read、write的字节数
  • POSIX.1要求只有中断信号的SA_RESTART标志有效时,才重启。不同厂商的处理方式不同
  • 所有的IO设备都被模型化为文件,所有的输入输出都可以被当做相应文件的读和写来处理,这种将设备映射成文件的方式,允许Linux内核引用一个简单、低级的应用接口(Unix IO),这使得所有的输入输出都能以一种统一且一致的方式进行
  • 异步IO。要注意区分信号驱动IO,后者以前也叫AIO

    The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn’t yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.

网络编程

  • 因特网的socket地址存放在sockaddr_in的struct中
  • connect、bind、accept函数需要接受各种类型的socket地址结构(不仅仅是因特网socket),所以要求调用者把各自的地址结构强制转换成sockaddr类型再传给它
  • socket函数返回一个socketFD,可以硬编码socket(AF_INET, SOCK_STREAM, 0),其返回的fd仅仅是部分打开,还不能用于读写
  • 客户端使用connect函数来完成三次握手
  • socket():返回一个socketFd,这是一个主动套接字
  • bind()
    • 把一个本地协议地址赋予一个socket,对于ip协议,就是32bit的ipv4地址或128bit的ipv6地址与16bit的tcp或udp端口号的组合
    • 调用bind时,可以指定一个端口号和(或)一个ip地址,也可以都不指定
  • listen():把socket()创建的主动套接字转为监听套接字
  • 在大多数Linux架构上(除了Alpha和IA-64),所有这些socket系统调用实际上被实现成了通过单个系统调用socketcall()进行多路复用的库函数
  • socket中的一方调用close之后,当对等应用程序试图从连接的另一端读取数据时将会收到文件结束(当所有缓冲数据都被读取之后)
  • 如果connect()失败并且希望重新进行连接,那么SUSv3规定完成这个任务的可移植的方法是关闭这个socket,创建一个新socket,在该新socket上重新进行连接
  • 如果多个文件描述符引用了同一个socket,那么当所有描述符被关闭之后(使用close())连接就会终止
  • AF_INET需经过多个协议层的编解码,消耗系统cpu,并且数据传输需要经过网卡,受到网卡带宽的限制。AF_UNIX数据到达内核缓冲区后,由内核根据指定路径名找到接收方socket对应的内核缓冲区,直接将数据拷贝过去,不经过协议层编解码,节省系统cpu,并且不经过网卡,因此不受网卡带宽的限制。AF_UNIX的传输速率远远大于AF_INET(Ref from)
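    A compact sketch of the socket()/bind()/listen()/accept() sequence the bullets above walk through (IPv4, port 9000 picked arbitrarily; no error handling):

    #include <sys/socket.h>
    #include <netinet/in.h>
    #include <arpa/inet.h>
    #include <unistd.h>

    int main(void)
    {
        int lfd = socket(AF_INET, SOCK_STREAM, 0);         /* active socket, only partially opened */
        struct sockaddr_in addr = { 0 };

        addr.sin_family = AF_INET;
        addr.sin_addr.s_addr = htonl(INADDR_ANY);
        addr.sin_port = htons(9000);
        bind(lfd, (struct sockaddr *)&addr, sizeof(addr)); /* assign a local protocol address */
        listen(lfd, 128);                                  /* active -> listening socket */

        int cfd = accept(lfd, NULL, NULL);                 /* returns once a client connects */
        /* read()/write() on cfd ... */
        close(cfd);
        close(lfd);
        return 0;
    }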

epoll

  • epoll支持水平触发和边缘触发

    • 水平触发:能在FD上以非阻塞的方式执行I/O操作(poll、select提供的也是这个)
    • 边缘触发

      • 自从上一次调用epoll_wait后FD上是否已经有I/O活动(或者是FD被打开,如果之前没有调用的话),语义上类似信号驱动I/O,如果有多个I/O事件发生,epoll会把他们合并成一次单独的通知(信号驱动上可能会产生多个信号)

      • 可以减少syscall的次数

        在eventloop类型(包括各类fiber/coroutine)的程序中, 处理逻辑和epoll_wait都在一个线程, ET相比LT没有太大的差别. 反而由于LT醒的更频繁, 可能时效性更好些. 在老式的多线程RPC实现中, 消息的读取分割和epoll_wait在同一个线程中运行, 类似上面的原因, ET和LT的区别不大.
        但在更高并发的RPC实现中, 为了对大消息的反序列化也可以并行, 消息的读取和分割可能运行和epoll_wait不同的线程中, 这时ET是必须的, 否则在读完数据前, epoll_wait会不停地无谓醒来.

        引用自

      • 程序基本框架如下(本节末尾附有一个最小的边缘触发处理循环示意代码)

        • 所有待监视的FD都设置为非阻塞
        • 通过epoll_ctl建立epoll interest list
        • 使用如下循环处理IO事件
          • epoll_wait取得ready的FD list
          • 针对每一个处于ready的FD,不断调用IO syscall,直到返回EAGAIN或EWOULDBLOCK
        • 解决饥饿的方法
          • 应用程序自己维护一个list,该list存储到目前为止的所有ready FD(如果某个FD返回EAGAIN或EWOULDBLOCK,则移出去)
          • 每次都调用epoll_wait去监视并添加ready FD到list中。如果list中已经有FD,则这次监视的timeout(应该指的是epoll_wait的timeout参数)应该为0或较小的正数。这样子就不用在epoll_wait上花太多时间,可以进行IO
          • 对于list上的FD,可以使用round-robin方式去循环处理,而不是每次从epoll_wait返回后都从list头开始处理
          • 相对之下,使用水平触发时,因为我们只是在ready 的FD上做一些IO而已(不是循环到不能读为止),所以不会有饥饿问题。(英文原文说的是blocking 的FD,译文说的是非阻塞,我觉得也是非阻塞比较合适,因为如果阻塞那么一旦没东西读了,会不会阻塞在那里呢?还是说只要读了一点东西,然后接下来不能读,就立刻可以返回)
  • 核心数据结构称为epoll实例,与epoll_fd关联,是内核数据结构的句柄。这些内核数据结构实现了两个目的

    • interest list
    • ready list(前者的子集)
  • epoll_create:创建epoll实例,size参数已废弃。epoll_create1去掉了该参数并且可设置flags,目前只支持EPOLL_CLOEXEC

    close-on-exec
    which causes the file descriptor to be automatically (and atomically) closed when any of the exec-family functions succeed.
    This is useful to keep from leaking your file descriptors to random programs run by e.g. system().

  • epoll_ctl

    • fd参数(第三个参数)可以是pipe、FIFO、socket、POSIX消息队列、inotify实例、终端、设备、另一个epoll实例的fd(可以用于建立层次关系),不能是普通文件或目录的FD(EPERM)
    • In a multithreaded program, it is possible for one thread to use epoll_ctl to add file descriptors to the interest list of an epoll instance that is already being monitored by epoll_wait() in another thread. These changes to the interest list will be taken into account immediately, and the epoll_wait() call will return readiness information about the newly added file descriptors.
  • EPOLLONESHOT:在下一次epoll_wait返回该FD后,该FD在兴趣列表中被标记为非激活状态,可以用EPOLL_CTL_MOD重新激活

  • max_user_watches:每个注册到epoll实例上的FD都会占用一小段不能被交换的内核内存空间。这个参数用于定义可以注册到epoll实例上的FD总数。/proc/sys/fs/epoll/max_user_watches

  • epoll_wait

    • 返回后,需要检查EINTR,如果在执行期间被一个信号打断,然后又通过SIGCONT信号恢复,就可能出现这个错误

    • EPOLLHUP

      Note that when reading from a channel such as a pipe or a stream socket, this event merely indicates that the peer closed its end of the channel. Subsequent reads from the channel will return 0 (end of file) only after all outstanding data in the channel has been consumed.

      if (evlist[j].events & EPOLLIN) {
          // do read
      } else if (evlist[j].events & (EPOLLHUP | EPOLLERR)) {
          // close fd
      }
      // After the epoll_wait(), EPOLLIN and EPOLLHUP may both have been set. But we'll only
      // get here, and thus close the file descriptor, if EPOLLIN was not set. This ensures
      // that all outstanding input (possibly more than MAX_BUF bytes) is consumed (by
      // further loop iterations) before the file descriptor is closed.
    • EPOLLHUP和EPOLLERR会在FIFO对端关闭或终端挂起时出现

  • epoll的语义
    • 创建epoll实例时,内核在内存中创建一个新的i-node并打开文件描述(open file description:一个打开的文件的上下文信息,内核的数据结构)
    • 随后在调用进程中为打开的这个文件描述分配一个新的文件描述符(这就是用户空间的应用程序操作这个file descriptor的句柄)
    • 同epoll实例的interest list相关联的是打开的文件描述,而不是文件描述符。所以可以使用dup复制该描述符,fork也可以,因为子进程复制了父进程的epoll描述符。复制后的描述符指向相同的文件描述
    • 如果一个文件描述是epoll interest list的成员,那么在这个文件描述的所有描述符都关闭后,这个文件描述才会自动从epoll的interest list中移除
    • 所以如果某个fd A被复制(复制出来的叫fd B),之后即使关闭了fd A,只要fd B还打开着,epoll_wait仍会在该文件描述就绪后返回fd A的事件
  • epoll vs select vs poll
    • 每次调用select、poll,内核必须检查所有在调用中指定的FD。epoll只需要在FD就绪时将其加入ready list,然后epoll_wait去fetch这些FD就行
    • 调用select、poll时,我们传递一个标记了所有待监视的FD的数据结构给内核,返回时,内核将所有标记为ready的FD的数据结构在传给我们。epoll则使用epoll_ctl在内核空间建立数据结构,之后的epoll_wait不需要传递数据结构,返回的也只是ready的FD
    • epoll的性能会随着发生I/O事件的数量而线性变化,所以适用于需要监视大量FD但其中大部分处于空闲状态的场景
  • epoll与信号
    • epoll_wait的过程中是会因为SIGINT而返回,errno为EINTR的
    • 如果使用了信号,应该使用epoll_pwait来避免race condition
    • 值得一提的是,epoll的效率在某些情况下比select或者poll要低,这个是曾经做过测试的。具体的业务场景就是:几乎所有连接都处于活跃状态,并且有频繁的业务数据发送和接受,这个时候select或者poll的能力要强于epoll。下面贴一张比较久远的测试数据表格,可以看到这个现象。一个用select实现的echo server,传输的数据量是52B一次的小报文。 (Ref)
    • select

      |连接数量|cpu(%)|一次回射平均延时(μs)|
      |:-:|:-:|:-:|
      |100|94.5|1394|
      |500|95|9507|
      |1000|95|19852|

    • epoll

      |连接数量|cpu(%)|一次回射平均延时(μs)|
      |:-:|:-:|:-:|
      |100|93|1930|
      |500|93|11120|
      |1000|95|23857|

    • (或许是因为使用了中断,所以不如轮询快?另外,select是数组,epoll应该是链表(?)所以cache miss高?不过听说kqueue可以合并中断,还可以把中断改成轮询)
    • 原理其它答案讲得差不多了,我就补一句,从kernel层面讲,事件产生有可能不是由硬件中断触发的,在一定情况下kernel的确会轮询,因为响应硬件中断是一个成本比较高的操作。以网卡为例,当数据量很少的时候,每来一个数据包网卡都会产生一个中断,kernel响应这个中断,从网卡缓冲区中读出数据放进协议栈处理,当满足一定条件时,kernel回调用户代码,这里的“回调”一般情况下是指从一个kernel syscall中返回(在此之前用户代码一直处于block状态)。当数据量很大时,每个包都产生一个中断就划不来了,此时kernel可以启动interrupt coalescing机制,让网卡做中断合并,也就是说来足够多的数据包或者等待一个timeout才会产生一个中断,kernel在响应中断时会把所有数据一起读出来处理,这样可以有效的降低中断次数。当数据量更大时,网卡缓冲区里几乎总是有未处理的数据,此时kernel干脆会禁掉网卡的中断,切换到轮询处理的模式,说白了就是跑一个忙循环不停地读网卡缓冲区里的数据,这样综合开销更低。
      作者:徐辰
      链接:https://www.zhihu.com/question/20122137/answer/54153089
      来源:知乎
      著作权归作者所有。商业转载请联系作者获得授权,非商业转载请注明出处。
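    To tie together the edge-triggered framework described earlier in this section, here is a minimal sketch (a single, already non-blocking socket, a fixed-size event array, and no error handling):

    #include <sys/epoll.h>
    #include <unistd.h>
    #include <errno.h>

    void et_loop(int sockfd)                       /* sockfd must already be O_NONBLOCK */
    {
        int epfd = epoll_create1(0);
        struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = sockfd };
        struct epoll_event evlist[16];
        char buf[4096];

        epoll_ctl(epfd, EPOLL_CTL_ADD, sockfd, &ev);

        for (;;) {
            int n = epoll_wait(epfd, evlist, 16, -1);
            for (int i = 0; i < n; i++) {
                /* with EPOLLET, drain the fd until EAGAIN/EWOULDBLOCK,
                 * otherwise we may never be notified again */
                for (;;) {
                    ssize_t r = read(evlist[i].data.fd, buf, sizeof(buf));
                    if (r > 0)
                        continue;                      /* consume buf[0..r) here */
                    if (r == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
                        break;                         /* drained; wait for the next edge */
                    close(evlist[i].data.fd);          /* r == 0 (EOF) or a real error */
                    break;
                }
            }
        }
    }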

惊群

  • 很多个进程都block在server socket的accept(),一旦有客户端进来,所有进程的accept()都会返回,但是只有一个进程会读到数据,就是惊群。实际上现在的Linux内核实现中不会出现惊群了,只会有一个进程被唤醒(Linux2.6内核)(ref)

  • 为了确保只有一个进程(线程)得到资源,需要对资源操作进行加锁保护,加大了系统的开销。目前一些常见的服务器软件有的是通过锁机制解决的,比如 Nginx

  • 什么是惊群,如何有效避免惊群? - 滴滴云的回答 - 知乎

命令

top

  • Ref。影响load average值大小的直接因素是系统中活动的进程数。Linux的系统负载指运行队列的平均长度,也就是等待CPU的平均进程数。活动的进程数可以很直观表明系统负载情况。值越大,表明CPU上正在运行和待运行的进程数越多,也就是负载越大。查看loadAvg的方法:top、uptime、w、cat /proc/loadavg,显示的三个数值分别是最近一分钟、五分钟、十五分钟的平均值
  • top的manual
     2b. TASK and CPU States
    This portion consists of a minimum of two lines. In an SMP environ‐
    ment, additional lines can reflect individual CPU state percentages.

    Line 1 shows total tasks or threads, depending on the state of the
    Threads-mode toggle. That total is further classified as:
    running; sleeping; stopped; zombie

    Line 2 shows CPU state percentages based on the interval since the
    last refresh.

    As a default, percentages for these individual categories are dis‐
    played. Where two labels are shown below, those for more recent ker‐
    nel versions are shown first.
    us, user : time running un-niced user processes
    sy, system : time running kernel processes
    ni, nice : time running niced user processes
    id, idle : time spent in the kernel idle handler
    wa, IO-wait : time waiting for I/O completion
    hi : time spent servicing hardware interrupts
    si : time spent servicing software interrupts
    st : time stolen from this vm by the hypervisor

    In the alternate cpu states display modes, beyond the first
    tasks/threads line, an abbreviated summary is shown consisting of
    these elements(就是按了`t`键之后显示的内容):
    a b c d
    %Cpu(s): 75.0/25.0 100[ ...

    Where: a) is the combined us and ni percentage; b) is the sy percent‐
    age; c) is the total; and d) is one of two visual graphs of those rep‐
    resentations. See topic 4b. SUMMARY AREA Commands and the `t' command
    for additional information on that special 4-way toggle.

    2c. MEMORY Usage
    This portion consists of two lines which may express values in
    kibibytes (KiB) through exbibytes (EiB) depending on the scaling fac‐
    tor enforced with the `E' interactive command.

    As a default, Line 1 reflects physical memory, classified as:
    total, free, used and buff/cache

    Line 2 reflects mostly virtual memory, classified as:
    total, free, used and avail (which is physical memory)

    The avail number on line 2 is an estimation of physical memory avail‐
    able for starting new applications, without swapping. Unlike the free
    field, it attempts to account for readily reclaimable page cache and
    memory slabs. It is available on kernels 3.14, emulated on kernels
    2.6.27+, otherwise the same as free.

    In the alternate memory display modes, two abbreviated summary lines
    are shown consisting of these elements:
    a b c
    GiB Mem : 18.7/15.738 [ ...
    GiB Swap: 0.0/7.999 [ ...

    Where: a) is the percentage used; b) is the total available; and c) is
    one of two visual graphs of those representations.

    In the case of physical memory, the percentage represents the total
    minus the estimated avail noted above. The `Mem' graph itself is
    divided between used and any remaining memory not otherwise accounted
    for by avail. See topic 4b. SUMMARY AREA Commands and the `m' command
    for additional information on that special 4-way toggle.

    This table may help in interpreting the scaled values displayed:
    KiB = kibibyte = 1024 bytes
    MiB = mebibyte = 1024 KiB = 1,048,576 bytes
    GiB = gibibyte = 1024 MiB = 1,073,741,824 bytes
    TiB = tebibyte = 1024 GiB = 1,099,511,627,776 bytes
    PiB = pebibyte = 1024 TiB = 1,125,899,906,842,624 bytes
    EiB = exbibyte = 1024 PiB = 1,152,921,504,606,846,976 bytes

free

free  displays the total amount of free and used physical and swap memory
in the system, as well as the buffers and caches used by the kernel. The
information is gathered by parsing /proc/meminfo. The displayed columns
are:

total Total installed memory (MemTotal and SwapTotal in /proc/meminfo)

used Used memory (calculated as total - free - buffers - cache)

free Unused memory (MemFree and SwapFree in /proc/meminfo)

shared Memory used (mostly) by tmpfs (Shmem in /proc/meminfo)

buffers
Memory used by kernel buffers (Buffers in /proc/meminfo)

cache Memory used by the page cache and slabs (Cached and SReclaimable
in /proc/meminfo)

buff/cache
Sum of buffers and cache

available
Estimation of how much memory is available for starting new appli‐
cations, without swapping. Unlike the data provided by the cache
or free fields, this field takes into account page cache and also
that not all reclaimable memory slabs will be reclaimed due to
items being in use (MemAvailable in /proc/meminfo, available on
kernels 3.14, emulated on kernels 2.6.27+, otherwise the same as
free)
  • Buffers vs Cached(Ref, Ref)
    • Buffers: Memory in buffer cache, so relatively temporary storage for raw disk blocks. This shouldn’t get very large.
    • Cached: Memory in the pagecache (Diskcache and Shared Memory)
        Buffers: Relatively temporary storage for raw disk blocks
      shouldn't get tremendously large (20MB or so)
      Cached: in-memory cache for files read from the disk (the
      pagecache). Doesn't include SwapCached
      SwapCached: Memory that once was swapped out, is swapped back in but
      still also is in the swapfile (if memory is needed it
      doesn't need to be swapped out AGAIN because it is already
      in the swapfile. This saves I/O)

vmstat

FIELD DESCRIPTION FOR VM MODE
Procs
r: The number of runnable processes (running or waiting for run time).
b: The number of processes in uninterruptible sleep.

Memory
swpd: the amount of virtual memory used.
free: the amount of idle memory.
buff: the amount of memory used as buffers.
cache: the amount of memory used as cache.
inact: the amount of inactive memory. (-a option)
active: the amount of active memory. (-a option)

Swap
si: Amount of memory swapped in from disk (/s).
so: Amount of memory swapped to disk (/s).

IO
bi: Blocks received from a block device (blocks/s).
bo: Blocks sent to a block device (blocks/s).

System
in: The number of interrupts per second, including the clock.
cs: The number of context switches per second.

CPU
These are percentages of total CPU time.
us: Time spent running non-kernel code. (user time, including nice time)
sy: Time spent running kernel code. (system time)
id: Time spent idle. Prior to Linux 2.5.41, this includes IO-wait time.
wa: Time spent waiting for IO. Prior to Linux 2.5.41, included in idle.
st: Time stolen from a virtual machine. Prior to Linux 2.6.11, unknown.
  • buff vs cache

    tldp-buffer-cache

    • By reading the information from disk only once and then keeping it in memory until no longer needed, one can speed up all but the first read. This is called disk buffering, and the memory used for the purpose is called the buffer cache.
    • To make the most efficient use of real memory, Linux automatically uses all free RAM for buffer cache, but also automatically makes the cache smaller when programs need more memory. Under Linux, you do not need to do anything to make use of the cache, it happens completely automatically.

HugePage

  • 好处
    • 增加TLB命中率
    • 减小页表大小
    • 如果可以把多级页表的级数减少,则可以减小查页表时间
  • 缺点

    • 有一些需要页对齐的场景,那么为了对齐浪费的空间就比较多
    • 页表提供的内存保护是以页为单位的
    • 缺页时需要调入的东西增加
    • 共享内存时需要共享整数个页那么大的块(我认为如此,因为是通过在页表中填入相同的物理页地址来共享内存的)
    • brk上调brk指针时,需要上调到页的边界,所以如果使用HugePage,会导致每次都分配很多虚拟内存(当然只要这些内存没有被读写,就不会分配物理内存)
    • Cache的命中率可能会降低,比如如果有的东西需要放在相邻的两个页上,那么4K页时,其实两个东西都可以缓存在L1,但是对于4M的页则不行


      Model name: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz
      L1d cache: 32K
      L1i cache: 32K
      L2 cache: 256K
      L3 cache: 8192K

    • 因为程序所占用的物理内存都需要在页表有记录,那么hugePage意味着小应用会占用更多内存

    • mmap

      on Linux, the mapping will be created at a nearby page boundary.
      offset must be a multiple of the page size as returned by sysconf(_SC_PAGE_SIZE).

  • 其他影响

    • 块(文件系统的一种抽象,文件系统的最小寻址单元,只能基于块来访问文件系统)不能比page大(This is an artificial constraint that could go away in the future),所以hugepage会使得块可以很大
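    A hedged sketch of asking for huge pages explicitly with mmap(MAP_HUGETLB) (Linux-specific; huge pages must have been reserved beforehand, e.g. via /proc/sys/vm/nr_hugepages, and 2 MiB is assumed as the huge page size):

    #include <sys/mman.h>
    #include <stdio.h>

    int main(void)
    {
        size_t len = 2 * 1024 * 1024;                  /* one assumed 2 MiB huge page */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap(MAP_HUGETLB)");               /* fails if no huge pages are reserved */
            return 1;
        }
        munmap(p, len);
        return 0;
    }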

malloc

  • 调用并分配1GB的内存,是否物理内存就会少1G?一方面,CopyOnWrite,所以如果没有写,是不会的。另一方面,空闲链表

  • malloc的内存对齐

    • The malloc() and calloc() functions return a pointer to the allocated memory that is suitably aligned for any kind of variable. (Ref from malloc’s man)

    • 在我的x86-64的机器上试了几次,返回的是16B对齐的内存
      #include <stdio.h>
      #include <stdlib.h>

      int main(void)
      {
          printf("%p\n%p\n%p\n", malloc(sizeof(int)), malloc(sizeof(char)), malloc(sizeof(short)));
      }
      // 0x14bd260
      // 0x14bd280
      // 0x14bd2a0
    • nginx在core/ngx_palloc.c的ngx_create_pool中以16字节对齐调用了ngx_memalign,ngx_memalign的一个实现是直接使用malloc
  • 堆内存边界被称为program break,使用brk和sbrk可以调整。一开始program break位于未初始化的数据段末尾之后。使用brk等抬高program break后,物理内存页尚未分配,内核会在进程首次试图访问这些虚拟内存地址时自动分配新的物理内存页

  • 虚拟内存以页为单位进行分配,所以调用brk指定的end addr实际上会round up到下一个page boundary

    The brk() system call sets the program break to the location specified by end_data_segment. Since virtual memory is allocated in units of pages, end_data_segment is effectively rounded up to the next page boundary.

  • program break可以设定的精确上限取决于

    • 进程中对数据段大小的资源限制RLIMIT_DATA
    • 内存映射
    • 共享内存段
    • 共享库的位置(因为堆向上增长就会遇到共享库的位置)
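    A small sketch for watching the program break move (sbrk(0) reports the current break; whether malloc() extends the heap or falls back to mmap() depends on the allocator and the request size, so treat the output as illustrative):

    #include <unistd.h>
    #include <stdlib.h>
    #include <stdio.h>

    int main(void)
    {
        void *before = sbrk(0);
        void *p = malloc(64 * 1024);      /* usually small enough to come from the heap */
        void *after = sbrk(0);

        printf("break before: %p\nbreak after:  %p\n", before, after);
        free(p);
        return 0;
    }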

page cache和buffer cache

  • 页缓存(page cache)针对以页为单位的所有操作,并考虑了特定体系结构上的页长度。内存映射技术、其他类型的文件访问也是基于内核中的这一技术实现的,所以页缓存实际上负责了块设备的大部分缓存工作

  • buffer cache

    • a buffer is the in-memory representation of a single physical disk block. Buffers act as descriptors that map pages in memory to disk blocks;
    • the page cache(这个page cache应该是指buffer cache这种page cache)also reduces disk access during block I/O operations by both caching disk blocks and buffering block I/O operations until later
    • this caching is often referred to as the buffer cache, although as implemented it is not a separate cache but is part of the page cache
    • In earlier kernels, there were two separate disk caches: the page cache and the buffer cache. The former cached pages; the latter cached buffers.The two caches were not unified: A disk block could exist in both caches simultaneously.
    • Today, we have one disk cache: the page cache. The kernel still needs to use buffers, however, to represent disk blocks in memory. Conveniently, the buffers describe the mapping of a block onto a page, which is in the page cache.

GCC内联汇编

  • 以下来自CSAPP Assembly Webaside
  • 扩展形式的汇编的目的
    • 帮助编译器correctly set up the required source values, execute the assembly instructions, and make use of the computed results
    • important program values are not overwritten by the assembly code instructions
  • 形式:以:分割,依次为code-string、output-list(results generated by the assembly code)、input-list(source values used by the assembly code)、overwrite-list(registers that are overwritten by the assembly code)
  • code-string
    • GCC3.1之前,在code-string中用%0, %1, ..., %9来引用output-list和input-list中的操作数(%nn, based on the order of the operands in the two lists)
    • GCC3.1开始,可以使用%[name]来引用
    • 寄存器名字用%%reg_name来引用
    • If multiple instructions are given, it is critical that return characters be inserted between them. 一种方法是(finish all but the final string with the sequence \n\t)
  • output-string

    • a comma-separated list
    • each list element containing three components in the form [name] tag (expr)(giving the name of the operand, the output constraint(specifies constraints on the use and location of the output operand), and the C expression indicating the destination for the instruction result)

      "=r" Update value stored in a register
      "+r" Read and update value stored in a register
      "=m" Update value stored in memory
      "+m" Read and update value stored in memory
      "=rm" Update value stored in a register or in memory
      "+rm" Read and update value stored in a register or in memory
    • The expression expr can be any assignable value (known in C as an lvalue). The compiler will generate the necessary code sequence to perform the assignment

  • input-string
    • The input list entries have the same format as output-string
    • the tags are of the form tag “r”, “m”, or “rm”, indicating that the operand will be read from a register, memory, or either, respectively
    • Each input operand can be any C expression. The compiler will generate the necessary code to evaluate the expression
  • overwrite-list

    • Lists the registers that are overwritten by the assembly code, so that the compiler does not try to keep values it still needs in those registers
  • we can simplify the code even more and make use of GCC’s ability to work with different data types. GCC uses the type information for an operand when determining which register to substitute for an operand name in the code string
    原先的版本,手动指定寄存器

    int tmult_ok2(long x, long y, long* dest)
    {
        int result;
        *dest = x * y;
        asm("setae %%bl            # Set low-order byte\n\t"
            "movzbl %%bl,%[val]    # Zero extend to be result"
            : [val] "=r"(result)   /* Output */
            :                      /* No inputs */
            : "%bl"                /* Overwrites */
        );
        return result;
    }

    新版本,让GCC指定寄存器

    /* Uses extended asm to get reliable code */
    int tmult_ok3(long x, long y, long* dest)
    {
        unsigned char bresult;
        *dest = x * y;
        asm("setae %[b]    # Set result"
            : [b] "=r"(bresult) /* Output */
        );
        return (int)bresult;
    }
  • 例子

    int umult_ok(unsigned long x, unsigned long y, unsigned long* dest)
    {
        unsigned char bresult;
        asm("movq %[x],%%rax    # Get x\n\t"
            "mulq %[y]          # Unsigned long multiply by y\n\t"
            "movq %%rax,%[p]    # Store low-order 8 bytes at dest\n\t"
            "setae %[b]         # Set result"
            : [p] "=m"(*dest), [b] "=r"(bresult) /* Outputs */
            : [x] "r"(x), [y] "r"(y)             /* Inputs */
            : "%rax", "%rdx"                     /* Overwrites */
        );

        return (int)bresult;
    }
  • 如果我们的汇编语句必须在我们放置它的地方执行(即,必须不被作为一个优化而移出循环),则将volatile放在asm之后。为了防止它被移动、删除和任何改变,我们如此声明asm volatile(... : ... : ... : ...); 当我们必须非常小心时,使用__volatile__。如果我们的汇编只是做一些计算而没有任何副作用,最好不要使用volatile关键字。忽略它将帮助GCC优化代码使其更优美。(Ref from译:GCC内联汇编入门)

  • 受影响列表中的memory字样(Ref from)
    • The “memory” clobber tells the compiler that the assembly code performs memory reads or writes to items other than those listed in the input and output operands (for example, accessing the memory pointed to by one of the input parameters).
    • To ensure memory contains correct values, GCC may need to flush specific register values to memory before executing the asm.
    • Further, the compiler does not assume that any values read from memory before an asm remain unchanged after that asm; it reloads them as needed.
    • Using the “memory” clobber effectively forms a read/write memory barrier for the compiler.
    • Note that this clobber does not prevent the processor from doing speculative reads past the asm statement. To prevent that, you need processor-specific fence instructions.
  • 受影响列表中的cc字样(Ref from)
    • The “cc” clobber indicates that the assembler code modifies the flags register
    • On some machines, GCC represents the condition codes as a specific hardware register; “cc” serves to name this register.
    • On other machines, condition code handling is different, and specifying “cc” has no effect. But it is valid no matter what the target.

mutex

  • 文件锁(使用fcntl加锁、解锁一片文件区域)和信号量的锁定和解锁都需要syscall。mutex使用机器语言的原子操作(在内存中执行、对所有线程可见)来实现,只有发生锁争用时才需要syscall,所以mutex比前两者性能高
  • linux上的mutex使用futex(快速用户空间互斥量)实现,发生锁争用时调用futex syscall
  • SUSv3规定,初始化一个已经初始化过的互斥量将导致未定义行为
  • 依照SUSv3的规定,对某一互斥量的副本(copy)执行各种操作将导致未定义的结果。此类操作只能施之于如下两类互斥量的“真身”
  • 静态分配的mutex可以使用PTHREAD_MUTEX_INITIALIZER初始化
  • 在如下情况下,必须使用函数pthread_mutex_init(),而非静态初始化互斥量
    • 动态分配于堆中的互斥量
    • 互斥量是在栈中分配的自动变量
    • 初始化经由静态分配,且不使用默认属性的互斥量
  • destroy mutex
    • 当不再需要经由自动或动态分配的互斥量时,应使用pthread_mutex_destroy()将其销毁
    • 对于使用PTHREAD_MUTEX_INITIALIZER静态初始化的互斥量,无需调用pthread_mutex_destroy()
    • 只有当互斥量处于未锁定状态,且后续也无任何线程企图锁定它时,将其销毁才是安全的
    • 若互斥量驻留于动态分配的一片内存区域中,应在释放(free)此内存区域前将其销毁。对于自动分配的互斥量,也应在宿主函数返回前将其销毁
    • 经由pthread_mutex_destroy()销毁的互斥量,可调用pthread_mutex_init()对其重新初始化
  • 死锁解决方案
    • 定义互斥锁的层级关系,按照层级去获取锁
    • 对于第一个锁使用pthread_mutex_lock,其他锁使用try_lock。如果有一个try_lock失败,则释放所有已获得的锁,一段时间后再试。这是通过解决“占有并等待”这个条件来解决问题的
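    A sketch of the trylock back-off idea in the last bullet: take the first mutex normally, try the second, and on failure release everything and retry after a short pause, which breaks the "hold and wait" condition (the 1 ms sleep is arbitrary):

    #include <pthread.h>
    #include <unistd.h>

    void lock_both(pthread_mutex_t *m1, pthread_mutex_t *m2)
    {
        for (;;) {
            pthread_mutex_lock(m1);
            if (pthread_mutex_trylock(m2) == 0)
                return;                     /* got both locks */
            pthread_mutex_unlock(m1);       /* back off: never hold one lock while waiting */
            usleep(1000);
        }
    }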

close-on-exec

  • 使用O_CLOEXEC标志(打开文件),可以免去程序执行fcntl()的F_GETFD和F_SETFD操作来设置close-on-exec标志的额外工作。在多线程程序中执行fcntl()的F_GETFD和F_SETFD操作有可能导致竞争状态,而使用O_CLOEXEC标志则能够避免这一点。可能引发竞争的场景是:线程某甲打开一文件描述符,尝试为该描述符标记close-on-exec标志,与此同时,线程某乙执行fork()调用,然后调用exec()执行任意一个程序。(假设在某甲打开文件描述符和调用fcntl()设置close-on-exec标志之间,某乙成功地执行了fork()和exec()操作。)此类竞争可能会在无意间将打开的文件描述符泄露给不安全的程序

  • 某些描述符可能是由库函数打开的。但库函数无法使主程序在执行exec()之前关闭相应的文件描述符。作为基本原则,库函数应总是为其打开的文件设置执行时关闭(close-on-exec)标志,稍后将介绍所使用的技术。如果exec()因某种原因而调用失败,可能还需要使描述符保持打开状态。如果这些描述符已然关闭,将它们重新打开并指向相同文件的难度很大,基本上不太可能
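    A short sketch contrasting the two ways of getting close-on-exec that this section discusses (the racy fcntl() variant is shown only in a comment):

    #include <fcntl.h>

    int open_private(const char *path)
    {
        int fd = open(path, O_RDONLY | O_CLOEXEC);   /* flag set atomically at open time */
        /* equivalent but racy in multithreaded programs:
         *     int fd = open(path, O_RDONLY);
         *     fcntl(fd, F_SETFD, fcntl(fd, F_GETFD) | FD_CLOEXEC);
         */
        return fd;
    }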

sched_yield

sched_yield()的操作是比较简单的。如果存在与调用进程的优先级相同的其他排队的可运行进程,那么调用进程会被放在队列的队尾,队列中队头的进程将会被调度使用CPU。如果在该优先级队列中不存在可运行的进程,那么sched_yield()不会做任何事情,调用进程会继续使用CPU。非实时进程使用sched_yield()的结果是未定义的

mlock

  • Under Linux, mlock(), mlock2(), and munlock() automatically round addr down to the nearest page boundary. However, the POSIX.1 specification of mlock() and munlock() allows an implementation to require that addr is page aligned, so portable applications should ensure this.(Ref from man page)

  • 可以避免被交换到swap分区,一般用于real-time进程。跟信息安全有关的进程也会这样做,避免内存page被交换到swap分区,从而hack可以从磁盘上读取这些信息,甚至是在进程结束后

    cryptographic security software often handles critical bytes like passwords or secret keys as data structures. As a result of paging, these secrets could be transferred onto a persistent swap store medium, where they might be accessible to the enemy long after the security software has erased the secrets in RAM and terminated.

  • But be aware that the suspend mode on laptops and some desktop computers will save a copy of the system’s RAM to disk, regardless of memory locks.
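    A sketch of locking a secret-holding buffer out of swap with mlock(2), page-aligned as the man page above recommends (may fail if RLIMIT_MEMLOCK is too low):

    #include <sys/mman.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        long pagesize = sysconf(_SC_PAGESIZE);
        unsigned char *secret = aligned_alloc(pagesize, pagesize);

        if (secret == NULL || mlock(secret, pagesize) != 0)
            return 1;

        /* ... store and use the secret here ... */

        memset(secret, 0, pagesize);       /* wipe before unlocking and freeing */
        munlock(secret, pagesize);
        free(secret);
        return 0;
    }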

pthread

  • joinable vs detached

    • A thread may either be joinable or detached
    • If a thread is joinable, then another thread can call pthread_join(3) to wait for the thread to terminate and fetch its exit status
    • Only when a terminated joinable thread has been joined are the last of its resources released back to the system(看起来就像进程的wait一样).
    • When a detached thread terminates, its resources are automatically released back to the system: it is not possible to join with the thread in order to obtain its exit status. Making a thread detached is useful for some types of daemon threads whose exit status the application does not need to care about.
    • By default, a new thread is created in a joinable state, unless attr was set to create the thread in a detached state (using pthread_attr_setdetachstate(3))
    • 以下代码输出Resource deadlock avoided

      #include <pthread.h>
      #include <string.h>
      #include <stdio.h>

      int main() {
          int t = pthread_join(pthread_self(), NULL);
          if (t != 0) {
              printf("%s\n", strerror(t));
          }
      }
    • 以下代码不退出

      #include <pthread.h>
      #include <unistd.h>
      #include <stdio.h>
      #include <string.h>

      static void* func(void* arg)
      {
          while (1);
      }

      int main()
      {
          pthread_t tid;
          int t = pthread_create(&tid, NULL, func, NULL);
          if (t != 0) {
              printf("%s\n", strerror(t));
          }
          pthread_exit(NULL);
      }
      • 以下代码的问题是,主线程退出后,另一个线程对主线程栈上的内容进行操作,结果难以预料
        #include <pthread.h>

        struct someStruct { int dummy; };   /* stand-in definition so the snippet compiles */

        static void *threadFunc(void *arg) {
            struct someStruct *pbuf = (struct someStruct *)arg;
            /* Do some work with structure pointed to by 'pbuf' */
            return NULL;
        }

        int main(int argc, char *argv[]) {
            pthread_t thr;
            struct someStruct buf;
            pthread_create(&thr, NULL, threadFunc, (void *)&buf);
            pthread_exit(NULL);
        }

条件变量

  • 静态分配的条件变量可以使用PTHREAD_COND_INITIALIZER来完成初始化
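    A classic usage sketch to go with the initializer above: the predicate is re-checked in a while loop because pthread_cond_wait() may wake up spuriously:

    #include <pthread.h>

    static pthread_mutex_t mtx  = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
    static int ready = 0;

    void wait_for_ready(void)
    {
        pthread_mutex_lock(&mtx);
        while (!ready)
            pthread_cond_wait(&cond, &mtx);   /* atomically unlocks mtx and sleeps */
        pthread_mutex_unlock(&mtx);
    }

    void signal_ready(void)
    {
        pthread_mutex_lock(&mtx);
        ready = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&mtx);
    }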

PIPE and FIFO

  • 可以确保写入不超过PIPE_BUF字节的操作是原子的。如果多个进程写入同一个管道,那么如果它们在一个时刻写入的数据量不超过PIPE_BUF字节,那么就可以确保写入的数据不会发生相互混合的情况
  • 当写入管道的数据块的大小超过了PIPE_BUF字节,那么内核可能会将数据分割成几个较小的片段来传输,在读者从管道中消耗数据时再附加上后续的数据(write()调用会阻塞直到所有数据被写入到管道为止)。
  • 只有在数据被传输到管道的时候PIPE_BUF限制才会起作用
    • 当写入的数据达到PIPE_BUF字节时,write()会在必要的时候阻塞直到管道中的可用空间足以原子地完成操作
    • 如果写入的数据大于PIPE_BUF字节,那么write()会尽可能地多传输数据以充满整个管道,然后阻塞直到一些读取进程从管道中移除了数据。如果此类阻塞的write()被一个信号处理器中断了,那么这个调用会被解除阻塞并返回成功传输到管道中的字节数,这个字节数会少于请求写入的字节数
  • PIPE_BUF(Ref from man 7 pipe)
    • POSIX.1 says that write(2)s of less than PIPE_BUF bytes must be atomic: the output data is written to the pipe as a contiguous sequence.
    • Writes of more than PIPE_BUF bytes may be nonatomic: the kernel may interleave the data with data written by other processes.
    • POSIX.1 requires PIPE_BUF to be at least 512 bytes. (On Linux, PIPE_BUF is 4096 bytes.)
    • The precise semantics depend on whether the file descriptor is nonblocking (O_NONBLOCK), whether there are multiple writers to the pipe, and on n, the number of bytes to be written:

      • O_NONBLOCK disabled, n <= PIPE_BUF
        All n bytes are written atomically; write(2) may block if there is not room for n bytes to be written immediately

      • O_NONBLOCK enabled, n <= PIPE_BUF
        If there is room to write n bytes to the pipe, then write(2) succeeds immediately, writing all n bytes; otherwise write(2) fails, with errno set to EAGAIN.

      • O_NONBLOCK disabled, n > PIPE_BUF
        The write is nonatomic: the data given to write(2) may be interleaved with write(2)s by other process; the write(2) blocks until n bytes have been written.

      • O_NONBLOCK enabled, n > PIPE_BUF

        **If the pipe is full**, then write(2) fails, with errno set to EAGAIN.  **Otherwise, from 1 to n bytes may be written** (i.e., a "partial write" may occur; the caller should check the return value from write(2) to see how many bytes were actually written), and these  bytes  may  be  interleaved with writes by other processes.
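        A sketch of the PIPE_BUF guarantee described above: two writers emit fixed-size records well under PIPE_BUF, so records from different processes never interleave mid-record (sizes and counts are arbitrary):

        #include <unistd.h>
        #include <string.h>
        #include <sys/wait.h>

        int main(void)
        {
            int pfd[2];
            pipe(pfd);

            for (int i = 0; i < 2; i++) {
                if (fork() == 0) {
                    char msg[64];
                    memset(msg, 'A' + i, sizeof(msg));   /* 64 bytes << PIPE_BUF (4096 on Linux) */
                    for (int j = 0; j < 100; j++)
                        write(pfd[1], msg, sizeof(msg)); /* each write lands as one contiguous block */
                    _exit(0);
                }
            }

            close(pfd[1]);
            char buf[64];
            while (read(pfd[0], buf, sizeof(buf)) > 0)
                ;                                        /* drain; records are never split by the other writer */
            while (wait(NULL) > 0)
                ;
            return 0;
        }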
        

文件系统

调优方法

  • ext2有快速符号链接:如果链接目标的路径足够短,则将其存储在inode自身中(不是存储在数据区中),所以可以减小链接目标路径的长度来加速

用户凭证

passwd(5)

  • The encrypted password field may be blank, in which case no password is required to authenticate as the specified login name. However, some applications which read the /etc/passwd file may decide not to permit any access at all if the password field is blank.(我去掉x后,ubuntu图形界面可以登录,无需密码,zsh的sudo不行,输了原来的密码也不行)
  • If the password field is a lower-case “x”, then the encrypted password is actually stored in the shadow(5) file instead; there must be a corresponding line in the /etc/shadow file, or else the user account is invalid.
  • If the password field is any other string, then it will be treated as an encrypted password, as specified by crypt(3).

文件操作

open

  • 如果在open()调用中指定O_CREAT标志,那么还需要提供mode参数,否则,会将新文件的权限设置为栈中的某个随机值。

同一个bin,第一次启动慢的原因

  • 需要加载共享库

Linux dev

  • NPTL dev: Ulrich Drepper和Ingo Molnar

MISC

  • 需要对目录有x权限才能cd进目录