L08 Machine-Level Programming IV: Data

Source: https://www.bilibili.com/video/BV1iW411d7hd?p=8

Today

数组 Arrays

一维数组 One-dimensional

多维数组（嵌套）multi-dimensional (nested)

多级数组 multi-level

结构体

分配空间 allocation

访问 access

对齐 alignment

浮点数

数组

数组分配空间

T A[L]：一个长度为 L 的变量类型为 T 的数组 A[].

在内存中需要分配一块 T * sizeof(T) 字节长度的区域。

数组访问

标识符（identifier） A 可以作为指向下标为 0 的数组元素的指针：C 的指针、数组、数组标识符可互换的思想。

int val[5] = {1, 5, 2, 1, 3};

reference    type    value
val[4]       int     3
val          int *   x
val + 1      int *   x + 4
&val[2]      int *   x + 8
val[5]       int     ??
*(val + 1)   int     5
val + i      int *   x + 4 *  i

#define ZLEN 5
typedef int zip_dig[ZLEN]
typedef long long i64
typedef A B C
    means
zig_dip x === int x[ZLEN]
i64 x     === long long x
B x       === A x C
// 我以前从来不知道还有这种用法

例子：

int get_digit(zip_dig z, int digit) {
    return z[digit]
}

  # for IA32
  # %rdi = z
  # %rsi = digit
movl (%rdi,%rsi,4), %eax # z[digit]

又一例子

void zincr(zip_dig z) {
    size_t i;
    for (i = 0; i < ZLEN; i++)
        z[i]++;
}

  # %rdi = z
    movl    $0, %eax          # i = 0
    jmp     .L3               # goto middle
.L4:                          # loop:
    addl    $1, (%rdi,%rax,4) # z[i]++
    addq    $1, %rax          # i++
.L3:                          # middle
    cmpq    $4, %rax          # i:4
    jbe     .L4               # if <=, goto loop
    rep; ret

理解指针和数组：异同比较

关键的差别在于，声明一个数组时，为其分配了空间，但是声明一个指针，则只为指针分配了空间，而并不一定为其指向的地方分配了空间。

Decl           A1, A2                 *A1, *A2
               Comp    Bad    Size    Comp    Bad    Size
int A1[3]      Y       N      12      Y       N      4
int *A2        Y       N      8       Y       Y      4
此处 A2 作为野指针，*A2 作为右值返回野值，作为左值则可能赋值失败，所以是一个可能的坏引用
Decl           An                     *An                    **An
               Comp    Bad    Size    Comp    Bad    Size    Comp    Bad    Size
int A1[3]      Y       N      12      Y       N      4       N       -      -
int *A2[3]     Y       N      24      Y       N      8       Y       Y      4
int (*A3)[3]   Y       N      8       Y       Y      12      Y       Y      4
T (*A)[Z] 为指向一个元素类型为 T，长度为 Z 的数组的指针 T，所以首先解引用 *A 得到的是数组，再解引用 **A 得到数组首元素.
具体来说，[] 是后缀运算符，*是前缀运算符，后缀运算符比前缀的优先级高，所以 A3 是指向数组的指针，A2 是指针构成的数组
教授认为：空方括号 [] 是指针 * 的另一种表示，即 A[] 和 *A 等价

Comp: Compiles (Y/N), Bad: Possible bad pointer reference (Y/N), Size: Value returned by sizeof().

二维等多维数组

对于 T A[R][C] 这样一个 R 行 C 列，元素类型为 T 的二维数组，假设一个 T 类型元素需要 K 字节，则数组大小为 R * C * K 字节，其排布应为行优先序（row-major ordering），即按如下排布：

A[0][0] ... A[0][C-1] A[1][0] ... A[1][C-1] ... A[R-1][0] ... A[R-1][C-1]

我们作这样的理解：A[R][C] 实际上是 (A[R])[C] 即长度为 R 的数组，每个元素为长度为 C 的数组，这类似于 (*p)[Z].

嵌套数组的例子：

#define PCOUNT 4
zip_dig pgh[PCOUNT] = // 相当于 int pgh[4][5]
  {{1, 5, 2, 0, 6}.
   {1, 5, 2, 1, 3},
   {1, 5, 2, 1, 7},
   {1, 5, 2, 2, 1}}
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|1|5|2|0|6|1|5|2|1|3|1|5|2|1|7|1|5|2|2|1|
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
76                                      156

嵌套数组列访问：行向量：

T A[R][C] 中 A[i] 是 C 个元素的数组，每个 T 类型元素需要 K 个字节，也就是说 A[i][j] 的地址是 A + (i * C + j) * K

显然其中的乘法过多，而且因为是常数，所以编译器往往会做自己的小巧思，我们当然可以利用这些小巧思，比如将数组长度设定为二的幂，这样指令更少，运算会更快，比如下面是 16*16 的矩阵（使用数组表示），我们访问之的例子：

/* Get element a[i][j] */
int fix_ele(fix_matrix a, size_t i, size_t j) {
    return a[i][j];
}

# a in %rdi, i in %rsi, j in %rdx
    salq    $6, %rsi             # 64 * i  
    addq    %rsi, %rdi           # a + 64 * i
    movl    (%rdi,%rdx,4), %eax  # M[a + 64 * i + 4 * j]
    ret

n * n 矩阵访问：

/* Get element a[i][j] */
int val_ele(size_t n, int a[n][n], size_t i, size_t j) {
    return a[i][j];
}

# n in %rdi, a in %rsi, i in %rdx, j in %rcx
    imulq    %rdx, %rdi          # n * i
    leaq     (%rsi,%rdi,4), %rax # a + 4 * n * i
    movl     (%rax,%rcx,4), %eax # a + 4 * n * i + 4 * j
    ret

多级数组

#define ZLEN 5
typedef int zip_dig[ZLEN]
zip_dig cmu = {1, 5, 2, 1, 3};
zip_dig mit = {0, 2, 1, 3, 9};
zip_dig ucb = {9, 4, 7, 2, 0};
#define UCOUNT 3
int *univ[UCOUNT] = {mit, cmu, ucb};

     +-+-+-+-+-+
cmu  |1|5|2|1|3| <---+
     +-+-+-+-+-+     |
     16        36    |   univ
     +-+-+-+-+-+   +-|-- 36 <- 160
mit  |0|2|1|3|9| <-+ +-- 16 <- 168
     +-+-+-+-+-+      +- 56 <- 176
     36        56     |
     +-+-+-+-+-+      |
cmu  |1|5|2|1|3| <----+
     +-+-+-+-+-+
     56        76

多级数组的访问

int get_univ_digit(size_t index, size_t digit) {
    return univ[index][digit];
}

    salq    $2, %rsi            # 4 * digit
    addq    univ(,%rdi,8), %rsi # p = univ[index] + 4 * digit
    movl    (%rsi), %eax        # return *p
    ret

1999 年发行的 C99 以来，你可以将数组作为一个参数传递。

结构体

struct rec {
    int a[4];
    size_t i;
    struct rec *next;
}

r
|
v
+--------+----+----+
|a       |i   |next|
+--------+----+----+
0        16   24   32

结构体被表示为一块内存 Structure represented as block of memory

足够大能够包括所有字段（field） Big enough to hold all of the fields

每一字段的顺序为声明顺序 Fields ordered according to declaration

即使另一种顺序可以使用更少的空间 Even if another ordering couyld yield a more compact representation

编译器决定了整体的大小和各字段的位置 Compiler determines overall size + positions of fields

机器级程序不会理解源代码中的结构体 Machine-level program has no understanding of the structures in the source code

int *get_ap(struct rec *r, size_t idx) {
    return &r->a[idx];
}

# r in %rdi, idx in %rsi
    leaq    (%rdi,%rsi,4), %rax # r + 4 * idx
    ret

void set_val(struct rec *r, int val) {
    while (r) {
        size_t i = r->i;
        r->a[i] = val;
        r = r->next;
    }
}
/*
%rdi    r
%rsi    val
*/

.L11:                           # loop:
    movslq  16(%rdi), %rax      #   i = M[r+16]
    movl    %esi, (%rdi,%rax,4) #   M[r+4*i] = val
    movq    24(%rdi), %rdi      #   r = M[r+24]
    testq   %rdi, %rdi          #   Test r
    jne     .L11                #   if != 0 goto loop

结构体和对齐

struct S1 {
    char c;
    int i[2];
    double v;
} *p;

未对齐的数据：

+-+----+----+--------+
|c|i[0]|i[1]|v       |
+-+----+----+--------+
p      p+5  p+9      p+17

对齐的数据：

+----+----+--------+--------+
|c\3B|i[0]|i[1]\4B |v       |
+----+----+--------+--------+
p    p+4  p+8      p+16     p+32
可以看到其中有 3Bytes 和 4 Bytes 的空白字节，只是为了对齐
原始数据需要 K 字节，那么地址必须是 K 的倍数 Primitive data type requires K bytes. Address must be multiple of K.
具体来说，c 的相对地址是 0，是 1 的倍数；i[0] 和 i[1] 的相对地址是 4 和 8，是 4 的倍数，v 的相对地址是 16，是 8 的倍数.

对齐的引入是为了硬件运行：实际的内存系统一次不取一个字节，现在的大多数机器一次大约取 64 字节——这取决于硬件中的各种宽度（various widths），如果没有一个对齐的地址使得一个数据跨越了两个块之间的边界，将会导致硬件甚至操作系统来采取一些额外的步骤来处理，使得运行效率受到很大影响，对齐原则这一原则由此确立。

对齐原则 Alignment Principles

对齐数据 Aligned Data

原始数据需要 K 字节 Primitive data type requires K bytes

地址必须是 K 的倍数 Address must be multiple of K

这被某些机器要求，被 x86-64 建议 Required on some machines; advised on x86-64 【在这样的某些机器上会导致内存错误（memory fault）】

对齐数据的动机 Motivation for Aligned Data

内存被（对齐的）4 或 8 字节的块访问（系统依赖）Memory accessed by (aligned) chunks of 4 or 8 bytes (system-dependent)

不充分加载和存储数据，会形成 4 字节边界 Inefficient to load or store datum that spans quad word boundaries

当数据被边界切开会使得虚拟内存更难以处理 Virtual memory tricker when datum spans 2 pages

编译器 Compiler

在结构体各字段间插入间隙以对齐 Inserts gaps in structure to ensure correct alignment of fields

在结构体上满足对齐原则 satisfying alignment with structures

结构体内 Within structure:

必须满足每个元素的对齐需求 Must satisfy each element's alignment requirement

结构体整体 Overall structure placement

每个结构体必须对齐 K Each structure has alignment requirement K

此处 K 是结构体内最大元素的对齐要求 K = Largest alignment of any element

初始地址和结构体长度必须是 K 的倍数 Initial address & structure length must be multiples of K

另一例子：

struct S2 {
    double v;
    int i[2];
    char c;
} *p;

+--------+----+----+--------+
|v       |i[0]|i[1]|c\7B    |
+--------+----+----+--------+
p        p+8  p+12 p+16     p+24

struct S2 {
    double v;
    int i[2];
    char c;
} a[10]

+------------------------+------------------------+-----
|a[0]                    |a[1]                    | ...
+------------------------+------------------------+-----
a+0                      a+24                     a+48

struct S3 {
    short i;
    float v;
    short j;
} a[10];

+------------+------------+-----
|a[0]        |a[1]        | ...
+------------+------------+-----
a+0          a+12         a+24

+----+----+----+
|i \2Bv   |j \2B
+----+----+--+-+ <- a+12*idx+12
a+12*idx  a+12*idx+8

short get_j(int idx) {
    return a[idx].j;
}

# %rdi = idx
    leaq (%rdi,%rdi,2), %rax   # 3 * idx
    movzwl a+8(,%rax,4), %eax  # 这里没有写出存储 a 地址的寄存器

需要手动优化：将大的数据类型放在前面：（贪心算法（greedy algorithm）保证了这会最小化占据的空间）

struct S4 {
    char c;
    int i;
    char d;
} *p // 需要 12 字节 空间：c + 3 + i + d + 3
struct S5 {
    int i;
    char c;
    char d;
} *p // 需要 8 字节空间：i + c + d + 2

浮点数（Floating Point）

背景 Background

历史 History

x87 FP

遗产，相当丑陋 Legacy, very ugly

SSE FP

鲨鱼机支持 supported by Shark machine 【鲨鱼机是 CMU 的课程实验机器】

在 vector 指令使用时的特殊情况 Special case use of vector instructions

AVX FP

最新版本 Newest version

和 SSE 相近 Similar to SSE

书中记载的版本 Documented in book

以前，8086 处理器上有一块称为 8087 的芯片，能够完成所有所需的硬件以在单个芯片上实现完整的 IEEE 浮点数在工程上堪称杰作，然而这编程模型非常丑陋、糟糕。由于浮点数的实用性，迫切需要一种新的高效的实现其运算的硬件实现，由此诞生了 SSE，它替 SIMD 执行 SIMD 要执行的部分指令。随后工程师们又提炼 SSE 中的精华，构建了 AVX.

SSE3：XMM Registers 总共有 16 个特殊的寄存器，每个都有 16 个字节。每个寄存器有如下方式处理：

存储 16 个单字节整型。
存储 8 个 16 比特整型。
存储 4 个 32 比特整型。
存储 4 个单精度浮点型。
存储 2 个双精度浮点型。
存储 1 个单精度浮点型，其余空间随意。
存储 1 个双精度浮点型，其余空间随意。

标量和单指令多数据操作 Scalar & SIMD Operations【单指令多数据：SIMD, single instruction multiple data】

addss %xmm0, %xmm1 # add scalar single 两个 单精度/4字节 标量相加（标量指令）
addps %xmm0, %xmm1 # add pack single **八**个 单精度/4字节标量 **分别** 相加（SIMD 指令）
addsd %xmm0, %xmm1 # add scalar double 两个 双精度/8字节 标量相加（标量指令）

SIMD 指令是精华：一个寄存器存储一组数据，一条指令执行一组加法

浮点数内存引用 FP Memory Referencing

整型（和指针）参数用常规寄存器传递 Integers (and pointer) arguments passed in regular registers

浮点型使用 XMM 寄存器传递 FP values passed in XMM registers

不同的 mov 指令来移动 XMM 寄存器中的值以及 XMM 寄存器和内存中的值 Different mov instructions to move between XMM registers, and between memory and XMM registers

double dincr(double *p, double v) {
    double x = *p;
    *p = x + v;
    return x;
}

# p in %rdi, v in %xmm0
    movapd  %xmm0, %xmm1  # copy v
    movsd   (%rdi), %xmm0 # x = *p
    addsd   %xmm0, %xmm1  # t = x + v
    movsd   %xmm1, (%rdi) # *p = t
    ret

【现在好像进化到使用 YMM 寄存器了，反正浮点数机制后面怎么发展的也没讲，这毕竟是 15 年的课程】

其他浮点数代码相关 Other Aspects of FP code

大量指令 Lots of instructions

不同的指令，不同的格式…… Different operations, different formats

浮点数比较 Floating-point comparisons

ucomiss 和 ucomisd 指令 Instructions ucomiss and ucomisd

设置如 CF、ZF、PF 条件码 Set conditions codes CF, ZF and PF

使用常数值 Using constant values

用 xorpd %xmm0, %xmm0 指令来设置 XMM0 寄存器为 0 Set XMM0 register to 0 with instruction xorpd %xmm0, %xmm0

其他从内存取值 Others loaded from memory