RISC-V

Introduction

Terminology

  • Core A component is termed a core if it contains an independent instruction fetch unit. A core might support multiple hardware threads(harts) through multithreading

Execution Environment Interface(EEI)

Initial state of the program Number and type of harts Accessibility and attributes of memory and I/O regions Behavior of all legal instructions executed on each hart (ISA) Handling of any interrupts or exceptions raised during execution

EEI is reponsible for ensuring the eventual forward progress of each of its harts. The following events constitute forward progress:

  • The retirement of an instruction
  • A trap
  • Any other event defined by an extension

Distinction between a hard and a software thread context: The software running inside an execution environment is not responsible for causing progress of each of its harts; that is the responsibility of the outer execution environment.

ISA Overview

RISC-V ISA = a base integer ISA + optional extensions A base is carfully restricted to a minimal set of instructions

XLEN = the width of an integer register in bits

Actually, there are currently four base ISAs.

  • RV32I and RV64I
  • RV32E, which has half the number of integer registers
  • RV128I for future

RISC-V instruction-set encoding space

  • Standard: defined by the Foundation
  • Reserved: currently not defined but are saved for future standard
  • Custom: never be used for standard and are made available for vendor-specific extensions

The term non-conforming describes a non-standard extension that uses either a standard or a reserved encoding (i.e., conflict with standard)

  • I : the base interger ISA
  • M : integer multiplication and division extension
  • A : atomics for inter-processor synchronization
  • F : single-precision floating-point extension
  • D : double-precision floating-point extension
  • C : compressed instruction (16-bit forms)

Memory

A RISC-V hart has a single byte-addressable address space of 2^XLEN bytes for all memory accesses. word = 32 bits = 4 bytes halfword = 16 bits = 2 bytes doubleword = 64 bits = 8 bytes quadword = 128 bits Memory address space is circular, so that the byte at address 2^XLEN-1 is adjacent to the byte at address zero.

EE determines the mapping of hardware resources into a hart’s address space. Each range may (1) be vacant, (2) contain main memory, or (3) contain one or more I/O devices. Reads and writes of I/O devices may have side effects, but accesses to main memory cannot. Although it is possible for EE to call every range an I/O device, it is usually expected that some portion will be specified as main memory.

When RISC-V platform has multiple harts, the address space of any two harts may be entirely the same, or entirely different, or may be partly different but sharing some subset of resources, mapped into the same or different address ranges.

Executing each RISC-V machine instruction entails one or more memory accesses, subdivided into implicit and explicti accesses. For each instruction executed, an implicit memory read (instruction fetch) is done. EE may dictate other implicit memory accesses (such as to implement address translation). Specific load and store instructions perform an explicit read or write.

EE determines what portions of the non-vacant address space are accessible for each kind of memory access(implicit or explicit, read or write). Vacant locations are never accessible.

Except when specified otherwise, implicit reads that do not raise an exception and that have no side effects may occur arbitrarily early and speculatively, even before the machine could possibly prove that the read will be needed. To ensure that certain implicit reads are ordered only after write to the same memory locations, software must execute specific fence or cache-control instruction defined for this purpose(such as FENCE.I). (의문: 왜 implicit reads에만 예외를 뒀을까???)

The memory accesses (implicit or explicit) made by a hart may appear to occur in a different order as perceived by another hart or by any other agent that can access the same memory. This perceived reordering of memory accesses is always constrained, however, by the applicable memory consistency model. The default memory consistency model for RISC-V is the RISC-V Weak Memory Ordering (RVWMO), defined in Chapter 14 and in appendices. Optionally, an implementation may adopt the stronger model of Total Store Ordering, as defined in Chapter 23. The execution environment may also add constraints that further limit the perceived reordering of memory accesses. Since the RVWMO model is the weakest model allowed for any RISC-V implementation, software written for this model is compatible with the actual memory consistency rules of all RISC-V implementations. As with implicit reads, software must execute fence or cache-control instructions to ensure specific ordering of memory accesses beyond the requirements of the assumed memory consistency model and execution environment.

Instruction Encoding

IALIGN = instruction-address alignment constraint the implementation enfores. measure in bits. 16 or 32. 32 in the base ISA, but relaxed to 16 in some ISA extensions(such as C). ILEN = maximum instruction length supported by an implementation. multiple of IALIGN. For implementations supporting only a base instruction set, ILEN is 32 bits.

Rationale: 압축 포맷을 지원하되, 압축 포맷이 필수는 아닌 방향으로 설계함. 표준 IMAFD ISA 구현시 32-bit 명령이므로 I-cache에 30-bit만 집어넣으면 됨(6.25% 절약). RV32I 기준 encoding space를 1/8도 사용하지 않으므로 커스텀 확장을 많이 집어 넣을수 있음. 상위 2비트가 11이 아니게 하여 사용할수도 있음(non-conforming이지만). All-zero-bits 명령은 illegal instruction이 되도록 설계되어 있음.

Instructions are stored in memory as a sequence of 16-bit little-endian parcels, regardless of memory system endianness. (i.e., least-significant 16-bit goes to the lowest address. Byte order inside 16-bit depends on memory system endianness.) Rationale: Length-encoding bits always appear first in halfword address order. Big-endian JIT compilers, for example, must swap the byte order when storing to instruction memory.

Exceptions, Traps, and Interrupts

  • Exception: an unusual condition occurring at runtime associated with an instruction in the current RISC-V hart
  • Interrupt: an external asynchronous event that may cause a RISC-V hart to experience an unexpected transfer of control
  • Trap: the transfer of control to a trap handler caused by either an exception or an interrupt

The general behavior of most RISC-V EEIs is that a trap to some handler occurs when an exception is signaled on an instruction. The manner in which interrupts are generated, routed to, and enable by a hart depends on the EEI. How traps are handled and made visible to software running on the hart depends on the enclosing EE.

RISC-V’s exception and trap is compatible with that in the IEEE-754 floating-point standard. (의문: IEEE-754에 exception과 trap 정의가 있나?)

Four effects from traps:

  • Contained trap: The trap is visible to, and handled by, software running inside the EE. For example, when a hart is interrupted, an interrupt handler will be run.
  • Requested trap: The trap is a synchronous exception that is an explicit call to the EE. An example is a system call. Execution may or may not resume on the hart.
  • Invisible trap: The trap is handled transparently by the EE and EE resumes normally after the trap is handled. Examples include emulating missing instructions, or handling non-resident page faults.
  • Fatal trap: The trap represents a fatal failure and causes the EE to terminate execution. Examples include failing a virtual-memory page-protection check.

RV32I Base Integer Instruction Set

40개 명령으로 구성. 교육목적 이외에는 부분집합만 사용했을때 이득이 없도록 설계, i.e., 진짜 필요한 것만 넣었으니 다 구현하세요.

Registers

XLEN = 32 x0-x31, PC x0는 0으로 고정

No dedicated stack pointer or link register. However, the standard software calling convention is: x1 = return address x5 = link register x2 = stack pointer Hardware might choose to accelerate function calls and returns that use x1 or x5. See JAL and JALR. Compressed format is designed around the assumption that x1 is the return address and x2 is stack pointer.

Instructions

There are 4 formats, R/I/S/U, and 2 variants, B/J. R = register I = immediate S = store U = upper? B = branch J = jump

IALIGN=32

An instruction-address-misaligned exception is generated on a taken branch or unconditional jump if the target address is not four-byte aligned. This exception is reported on the branch or jump instruction, not on the target instruction. No instruction-address-misaligned exception is generated for a conditional branch that is not taken.

레지스터 디코딩이 보통 크리티컬 패스에 있어서 rs1, rs2, rd는 항상 같은 위치에 있도록 디자인. 대신 imm 위치는 포맷마다 다름. S-type에서는 imm이 심지어 둘로 쪼개진다.

imm의 최상위 비트는 항상 bit 31에 있게 하여 sign-extension이 instruction decoding과 병렬로 일어날 수 있도록 설계. Zero-extension은 쓸모 있는 경우가 안보여서 아예 안넣었다고 함.

S-type의 변형으로 B-type이 있음. imm이 2의 배수라고 가정하고 imm[0] 자리에 imm[11]을 넣음. 최상위 비트 위치는 그대로.(imm[11] -> imm[12]) 마찬가지로 U-type의 변형인 J-type이 있음.

Integer Computational Instructions

Integer Register-Immediate Insts. (mostly I-type format) ADDI: rd = rs1 + imm SLTI: rd = rs1 < imm (signed) SLTIU: rd = rs1 < imm (unsigned) ANDI: rd = rs1 & imm ORI: rd = rs1 | imm XORI: rd = rs1 ^ imm SLLI: rd = rs1 « imm SRLI: rd = rs1 » imm (logical) SRAI: rd = rs1 » imm (arithmetic) LUI: rd = {imm[31:12], 12’b0} AUIPC: rd = {imm[31:12], 12’b0} + PC

Integer Register-Register Ops. (mostly R-type format) ADD: rd = rs1 + rs2 SLT: rd = rs1 < rs2 (signed) SLTU: rd = rs1 < rs2 (unsigned) AND: rd = rs1 & rs2 OR: rd = rs1 | rs2 XOR: rd = rs1 ^ rs2 SLL: rd = rs1 « rs2 SRL: rd = rs1 » rs2 (logical) SUB: rd = rs1 - rs2 SRA: rd = rs1 » rs2 (arithmetic)

MV rd, rs1 = ADDI rd, rs1, 0 SEQZ rd, rs = SLTIU rd, rs1, 1 NOT rd, rs = XORI rd, rs1, -1 SNEZ rd, rs = SLTU rd, x0, rs2 NOP = ADDI x0, x0, 0

의문:

  • imm가 32-bit가 안되는데 큰수는 어떻게 더하지? -> There are 20-bit imm instruction(LUI, AUIPC) and 12-bit imm instructions(JALR, load/store). A combination of two can acheive arbitrary 32-bit calculation.
  • SRAI는 왜 imm[11:5]를 0100000으로 주지? 나머지는 all-zero인데? -> OP-IMM에서 funct3가 3비트라 8개 가능한데, OP가 9개라서 SRAI is distinguished by imm bit
  • Why no SUBI?

Control Transfer Instructions

Control transfer instructions in RV32I do not have architecturally visible delay slots.

Unconditional Jumps

JAL: jumps to (address of jump + imm), 1MiB range, rd=pc+4 JAL = jump and link J = JAL with x0 (pseudo) JALR: jumps to (rs1 + imm), rd=pc+4

Return-address prediction

Return-address prediction stacks are a common feature of high-performance instruction-fetch units, but require accurate detection of instructions used for procedure calls and returns to be effective. For RISC-V, hints as to the instructions’ usage are encoded implicitly via the register numbers used.

Rule

+-------+-------+--------+----------------+-----------+
|  rd   |  rs1  | rs1=rd |   RAS action   |  Comment  |
+-------+-------+--------+----------------+-----------+
| !link | !link |      - | none           |           |
| !link | link  |      - | pop            | Return    |
| link  | !link |      - | push           | Call      |
| link  | link  |      0 | pop, then push | Coroutine |
| link  | link  |      1 | push           | Call      |
+-------+-------+--------+----------------+-----------+
* link is true when the register is either x1 or x5

The fourth case supports ping-pong between two routines(i.e., coroutine). The fifth case is considered as call(not return) to support LUI-JALR and AUIPC-JALR patterns such as lui ra, imm20; jalr ra, imm12(ra).

Conditional Branches

4KiB range BEQ: jumps to (pc + imm) if rs1==rs2 BNE: jumps to (pc + imm) if rs1!=rs2 BLT: jumps to (pc + imm) if rs1<rs2 (signed) BLTU: jumps to (pc + imm) if rs1<rs2 (unsigned) BGE: jumps to (pc + imm) if rs1>=rs2 (signed) BGEU: jumps to (pc + imm) if rs1>=rs2 (unsigned)

Signed array bounds may be checked with a single BLTU instruction, since any negative index will compare greater than any nonnegative bound. Software should be optimized such that the sequential code path is the most common path. Software should also assume that backward branches will be predicted taken and forward branches as not taken. Unlike some other architectures, the RISC-V jump(JAL with rd=x0) instruction should always be used for unconditional branches instead of a conditional branch instrtuction with always-true condition.

The conditional branches were designed to include arithmetic comparison operations between two registers, rather than:

  • using condition codes (x86, ARM, SPARC, PowerPC)
  • comparing one register against zero (Alpha, MIPS) and allowing two registers only for equality (MIPS) because:
  • combined compare-and-branch instruction fits into a regular pipeline, and reduces static code size and dynamic instruction fetch traffic
  • comparisons agains zero requires a circuit almost as expensive as arithmetic comparison anyway
  • branches are observed earlier, and so can be predicted earlier
  • in the case where multiple branches can be taken based on the same condition codes, a design with condition codes has an advantage, but this case seems to be rare

Static branch hints are not included because:

  • can reduce the pressure on dynamic predictors but,
  • require more instruction encoding space
  • require profiling, and can result in poor performance if production runs do not match profiling runs

Conditional moves and predicated instructions are not included because:

  • cmov is difficult to use with code that might cause exceptions
  • predication adds additional state to a system, addition instructions to set and clear flags, and additional encoding overhead
  • add complexity to OoO microarchitecture
  • both are designed to replace unpredictable short forward branches, but various microarchitectural techniques exist to dynamically convert unpredictable short forward branches into internally predicated code to avoid the cost of flushing pipelines on a branch mispredict and have been implemented in commercial processors. (IBM POWER7?)

Load and Store Instructions

RV32I is a load-store architecture, where only load and store instruction access memory.

In RISC-V, endianness is byte-address invariant. (i.e., if a byte is stored to memory at some address in some endianness, then a byte-sized load from that address in any endianness returns the stored value.)

In a little-endian configuration, multi-byte stores write the least-significnat register byte at the lowest memory byte address. In a big-endian configuraiton, multi-byte stores write the most-significant register byte at the lowest memory byte address.

LW: rd = mem[rs1+imm] (32-bit) LH: rd = mem[rs1+imm] (16-bit, sign-extend) LHU: rd = mem[rs1+imm] (16-bit, zero-extend) LB: rd = mem[rs1+imm] (8-bit, sign-extend) LBU: rd = mem[rs1+imm] (8-bit, zero-extend) SW: mem[rs1+imm] = rs2 (32-bit) SH: mem[rs1+imm] = rs2 (low 16-bit) SB: mem[rs1+imm] = rs2 (low 8-bit)

Memory Ordering Instructions

FENCE instruction

Any combination of device input(I), device output(O), memory reads(R), and memory write(W) may be ordered w.r.t. any combination of the same. Informally, no other RISC-V hart or external device can observe any operation in the successor set following a FENCE before any operation in the predecessor set preceding the FENCE.

FENCE.TSO = fm=1000, pred=RW, succ=RW (backward-compatible since old RISC-V core see this as FENCE RW, RW) FENCE.TSO orders all load operations in its predecessor set before all memory ops in its successor set, and all store operations in its predecessor set before all store operations in its successor set. (i.e., only relax write-to-read) (TSO = total store order)

Environment Call and Breakpoints

ECALL: make a service request to EE EBREAK: return control to a debugging environment

Similar to SVC and BKPT in ARM.

HINT instructions

HINTs are usually used to communicate performance hints to the microarchitecture. HINTs are encoded as integer computational instructions with rd=x0. Like NOP, HINTs do not change any architecturally visible state, except for advancing the pc and any applicable performance counters.

이렇게 정의된 이유는, HINT를 skip하는 기능조차 구현되지 않은 간단한 코어에서는 그냥 HINT 명령을 실제로 실행해도 상관없도록 하기 위해서임.

Summary

  • Integer Computational Instructions: ADDI SLTI SLTIU ANDI ORI XORI SLLI SRLI SRAI LUI AUIPC ADD SLT SLTU AND OR XOR SLL SRL SUB SRA (21)
  • Control Transfer Instructions: JAL JALR BEQ BNE BLT BLTU BGE BGEU (8)
  • Load and Store Instructions: LW LH LHU LB LBU SW SH SB (8)
  • Memory Ordering Instructions: FENCE(1)
  • Environment Call and Breakpoints: ECALL EBREAK (2) 21 + 8 + 8 + 1 + 2 = 40 진짜 딱 40개. 학부 플젝 내기 딱 좋을듯.

“Zifencei” Instruction-Fetch Fence

“Zifencei” extension includes the FENCE.I instruction that provides explicit synchronization between write to instruction memory and instruction fetches on the same hart. Currently, this instruction is the only standard mechanism to ensure the stores visible to a hart will also be visible to its instruction fetches.

왜 필요한가? implicit read의 경우 스펙상 explicit write랑 ordering이 되지 않는다. 예를 들어, 모든 인스트럭션을 캐시에 미리 올려놓을 수 있고, 나중에 explicit write를 통해 인스트럭션을 고치더라도 (e.g., JIT compiler) 반영이 되지 않는다. 여기에 ordering을 주기 위한 인스트럭션이 FENCE.I이다.

왜 기존에는 base I에 있었는데 별도의 extension이 되었는가? 첫째로, 특정 시스템에서는 FENCE.I가 비싼 동작일 수 있다. 예를 들어, I-cache와 D-cache가 incoherent하면 FENCE.I에서 모두 flush해야 한다. 두번째로, Unix-like 시스템의 user-level에서 FENCE.I가 별로 powerful 하지 않을 수 있다. OS에서 컨텍스트 스위칭이 일어나면 hart 입장에서는 instruction을 새로 쓴 것이고 FENCE.I를 실행해줘야만 한다. 그래서 Linux ABI에서는 user-level에서 FENCE.I를 사용하는 대신 시스템 콜을 하게 하고, OS에서 최소한의 FENCE.I로 instruction-fetch coherence를 유지하게 했다고 한다.

FENCE.I는 local hart에만 동작한다. 모든 hart와 동기화하려면, 먼저 FENCE로 write한 데이터를 동기화하고, 각 hart에서 FENCE.I를 실행해야 한다. Linux 사례에서 보듯이, 실행중인 코드가 다른 hart로 migration 될 수 있는 경우 FENCE.I에 의존하면 안되고 EEI가 매커니즘을 제공해줘야 한다.

RV32E Base Integer Instruction Set

RV32E is a reduced version of RV32I designed for embedded systems. The only change is to reduce the number of inter registers to 16. Registers are x0-x15 instead of x0-x31.

작은 RV32I 코어 디자인에서 레지스터 16개가 면적을 25% 정도 먹는다고 함. 그래서 RV32E가 RV32I 대비 25% 정도 면적 절약.

RV64I Base Integer Instruction Set

XLEN=64

Most integer computational instructions operate on XLEN-bit values. Additional instruction variants are provided to manipulate 32-bit values in RV64I, indicated by a ‘W’ suffix to the opcode. These W instructions ignore the upper 32 bits of their inputs and always produce 32-bit signed values, i.e., bits XLEN-1 through 31 are equal.

  • New Integer Computational Instructions for 32-bit: ADDIW SLLIW SRLIW SRAIW ADDW SLLW SRLW SUBW SRAW (9)
  • New Load and Store Instructions: LWU LD SD (3)
  • Modification to RV32I: SLLI SRLI SRAI (one more bit to the shift amount)

40 + 9 + 3 = 52 instructions in RV64I.

RV128I Base Integer Instruction Set

XLEN=128

A new set of D instructions are added to support 64-bit operations.

  • New instructions: ADDID SLLID SRLID SRAID ADDD SLLD SRLD SUBD SRAD LDU LQ SQ (12)

52 + 12 = 64 instructions in RV128I.

M, Standard Extension for Integer Multiplication and Division

  • MUL: rd = lower XLEN bits of rs1 * rs2
  • MULH: rd = upper XLEN bits of signed rs1 * signed rs2
  • MULHU: rd = upper XLEN bits of unsigned rs1 * unsigned rs2
  • MULHSU: rd = upper XLEN bits of signed rs1 * unsigned rs2
  • MULW: multiplies lower 32 bits of source registers, placing sign-extension of the lower 32 bits of result (for RV64)

When full product are required, then the recommended code sequence is MULH[[S]U] rdh, rs1, rs2; MUL rdl, rs1, rs2. Microarchitectures can then fuse these into a single multiply operation.

  • DIV[U]: rd = rs1 / rs2 (rouding towards zero)
  • REM[U]: corresponding remainder. the sign of result equals the sign of the dividend.
  • DIV[U]W/REM[U]W: for RV64

When both the quotient and remainder are required, the recommended code sequence is DIV[U] rdq, rs1, rs2; REM[U] rdr, rs1, rs2.

A, Standard Extension for Atomic Instructions

The two forms of atomic instruction provided are load-reserved/store-conditional instructions and atomic fetch-and-op memory instructions. These instructions allow RISC-V to support the RCsc memory consistency model.

Each atomic instruction has two bits, aq and rl. The bits order accesses to one of the two address domains, memory or I/O, depending on which address domain the atomic instruction is accessing. No ordering constraint is implied to accesses to the other domain, and a FENCE instruction should be used to order across both domains.

  • If both bits are clear, no additional ordering constraints are imposed on the atomic memory operation.
  • If only the aq bit is set, the atomic memory operation is treated as an acquire access, i.e., no following memory operations on this RISC-V hart can be observed to take place before the acquire memory operation.
  • If only the rl bit is set, the atomic memory operation is treated as a release access, i.e., the release memory operation cannot be observed to take place before any earlier memory operations on this RISC-V hart.
  • If both the aq and rl bits are set, the atomic memory operation is sequentially consistent and cannot be observed to happen before any earlier memory operations or after any later memory operations in the same RISC-V hart and to the same address domain.

Load-Reserved/Store-Conditional Instructions

Called Load-Reserved/Store-Conditional(LR/SC) or Load-Linked/Store-Conditional(LL/SC)

Load-link returns the current value of a memory location, while a subsequent store-conditional to the same memory location will store a new value only if no updates have occurred to that location since the load-link.

  • LR.W: loads a word from the address in rs1, places the sign-extended value in rd, and registers a reservation set - a set of bytes that subsumes the bytes in the addressed word.
  • SC.W: conditionally writes a word in rs2 to the address in rs1: the SC.W succeeds only if the reservation is still valid and the reservation set contains the bytes being written. If the SC.W succeeds, the instruction writes the word in rs2 to memory, and it writes zero to rd. If fails, the instruction does not write to memory, and it writes a nonzero value to rd. Regardless of success or failure, executing an SC.W instruction invalidates any reservation held by this hart.
  • LR.D, SC.D: for RV64

Both compare-and-swap (CAS) and LR/SC can be used to build lock-free data structures, but RISC-V opted for LR/SC because:

  • CAS suffers from the ABA problem while LR/SC monitors all accesses to the address rather than only checking for changes in the data value
  • CAS requires a new integer instruction format to support three source operands (address, compare value, swap value) as well as a different memory system message format
  • To avoid the ABA problem, some systems provide a double-wide CAS (DW-CAS) to allow a counter to be tested and incremented along with a data word. This requires reading five registers and writing two in one instruction, and also a new larger memory system message type
  • LR/SC usually provides a more efficient implementation as it only requires one load as opposed to two with CAS (one load before the CAS instruction to obtain a value for speculative computation, then a second load as part of the CAS instruction to check if value is unchanged before updating)

The main disadvantage of LR/SC over CAS is livelock. (which can be avoided with an architected guarantee of eventual forward progress under certain circumstances) Another concern is whether the influence of the current x86 architecture, with its DW-CAS, will complicate porting of synchronization libraries.

How LR/SC works?

  • An implementation can register an arbitrarily large reservation set on each LR, provided the reservation set includes all bytes of the addressed data word or doubleword.
  • An SC can only pair with the most recent LR in program order.
  • An SC must fail if there is another SC (to any address) between the LR and the SC in program order.
  • The SC must fail if the address is not within the reservation set of the most recent LR in program order.
  • The SC must fail if a store to the reservation set from another hart can be observed to occur between the LR and SC.
  • The SC must fail if a write from some other device to the bytes accessed by the LR can be observed to occur between the LR and SC. If such a device writes the reservation set but does not write the bytes accessed by the LR, the SC may or may not fail.
  • 해석: SC는 program order 상 가장 가까운 LR이랑 페어링된다. LR-LR-SC 패턴이 있으면 첫 LR은 무시되고 뒤의 LR-SC가 페어링 될것임. LR-SC-SC 패턴이 있으면 앞의 LR-SC가 페어링 되고 뒤의 SC는 fail. 즉, reservation set은 1개까지만 유지된다. SC가 페어링된 LR이 등록한 reservation set 밖을 접근하면 fail. LR과 SC 사이에 다른 hart가 reservation set에 write한 것이 관찰되면 SC는 fail. LR과 SC 사이에 다른 device가 LR이 접근한 bytes에 write한 것이 관찰되면 SC는 fail. Device의 경우 hart와 달리 byte 단위로 fail 여부를 판단하고 reservation set에 접근했을때 처리는 자유. 이 모든 케이스에 해당하지 않으면 successful 할 수 있음.

Synchronization

  • An SC instruction can never be observed by another RISC-V hart before the LR instruction that established the reservation.
  • The LR/SC sequence can be given acquire semantics by setting the aq bit on the LR instruction.
  • The LR/SC sequence can be given release semantics by setting the rl bit on the SC instruction.
  • Setting the aq bit on the LR instruction, and setting both the aq and the rl bit on the SC instruction makes the LR/SC sequence sequentially consistent. (TODO: why no rl bit on the LR?)
  # CAS function using LR/SC
  # a0 holds address of memory location
  # a1 holds expected value
  # a2 holds desired value
  # a0 holds return value, 0 if successful, !0 otherwise
cas:
  lr.w t0, (a0) # Load original value.
  bne t0, a1, fail # Doesn’t match, so fail.
  sc.w t0, a2, (a0) # Try to update.
  bnez t0, cas # Retry if store-conditional failed.
  li a0, 0 # Set return to success.
  jr ra # Return.
fail:
  li a0, 1 # Set return to failure.
  jr ra # Return.

TODO: why does CAS gaurantees eventual progress, while LR/SC can livelock? reservation set을 invalidate 시키는 것이 may or may not으로 자유라서 그런걸까?

Atomic Memory Operations

AMO instructions atomically load a data value from the address in rs1, place the value into register rd, apply a binary operator to the loaded value and the original value in rs2, then store the result back to the address in rs1.

  • AMOSWAP.W/D AMOADD.W/D AMOAND.W/D AMOOR.W/D AMOXOR.W/D AMOMAX[U].W/D AMOMIN[U].W/D

RISC-V provides fetch-and-op style atomic primitives as they scale to highly parallel systems better than LR/SC or CAS. A simple microarchitecture can implement AMOs using the LR/SC primitives. More complex implementations might also implement AMOs at memory controllers.

Although the FENCE R, RW instruction suffices to implement the acquire operation and FENCE RW, W instruction suffices to implement release, both imply additional unnecessary ordering as compared to AMOs with the corresponding aq or rl bit set.

  • 해석: acquire는 acquire 아래의 operation이 acquire 위로 못올라가게 하는 역할. FENCE R, RW는 다른 쓰레드로부터의 영향을 모두 받은 후(R) fence 아래의 operation(RW)가 실행되게 보장하므로 acquire 역할. release는 release 위의 operation이 release 아래로 못 내려가게 하는 역할. FENCE RW, W는 fence 위의 일을 다 한 후(RW) 다른 쓰레드에 영향을 미치게(W) 보장하므로 release 역할.

A sequentially consistent load can be implemented as an LR with both aq and rl set. A sequentially consistent store can be implemented as an AMOSWAP that writes the old value to x0 and has both aq and rl set.

  # Mutual exclusion using AMO (test-and-test-and-set spinlock)
  li t0, 1 # Initialize swap value.
again:
  lw t1, (a0) # Check if lock is held. (test)
  bnez t1, again # Retry if held.
  amoswap.w.aq t1, t0, (a0) # Attempt to acquire lock. (test-and-set)
  bnez t1, again # Retry if held.
  # ...
  # Critical section.
  # ...
  amoswap.w.rl x0, x0, (a0) # Release lock by storing 0.

Zicsr, Control and Status Register (CSR) Instructions

RISC-V defines a separate address space of 4096 CSRs associated with each hart.

  • CSRRW: Atomic Read/Write CSR
  • CSRRS: Atomic Read and Set Bits in CSR
  • CSRRC: Atomic Read and Clear Bits in CSR
  • CSRRWI, CSRRSI, CSRRCI: Immediate forms

Counters

RISC-V ISAs provide a set of up to 32x64-bit performance counters and timers. The first three of these (CYCLE, TIME, and INSTRET) have dedicated functions.

F, Standard Extension for Single-Precision Floating-Point

F adds 32 floating-point registers, f0-f31, each 32 bits wide, and a floating-point control and status register fcsr. (FLEN = 32)

LOAD-FP: rd = mem[rs1 + imm] STORE-FP: mem[rs1 + imm] = rs2 FADD: rd = rs1 + rs2 FSUB: rd = rs1 - rs2 FMUL: rd = rs1 * rs2 FDIV: rd = rs1 / rs2 FSQRT: rd = sqrt(rs1) FMIN-MAX: rd = min or max(rs1, rs2) FMADD: rd = rs1 * rs2 + rs3 FMSUB: rd = rs1 * rs2 - rs3 FNMSUB: rd = -(rs1 * rs2) + rs3 FNMADD: rd = -(rs1 * rs2) - rs3 FCVT.fmt.fmt: Converts between floating-point and integer FSGNJ, FSGNJN, FSGNJX: sign-injection, base for pseudoinstruction FMV, FNEG, FABS FCMP: EQ/LT/LE FCLASS: rd = a 10-bit mask that indicates the class of the floating-point number

D, Standard Extension for Double-Precision Floating-Point

D extension widens the 32 floating-point registers, f0-f31, to 64 bits.

When multiple floating-point precisions are supported, then valid values of narrower n-bit types are represented in the lower n bits of an FLEN-bit NaN value, in a process termed NaN-boxing.

Q, Standard Extension for Quad-Precision Floating-Point

SAME

RVWMO Memory Consistency Model

A memory consistency model is a set of rules specifying the values that can be returned by loads of memory. RISC-V uses a memory model called RVWMO(RISC-V Weak Memory Ordering).

Under RVWMO, code running on a single hart appears to execute in order from the perspective of other memory instructions in the same hart, but memory instructions from another hart may observe the memory instructions from the first hart being executed in a different order.

Written on May 3, 2021