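This article reads through the relevant parts of the Intel® 64 and IA-32 Architectures Software Developer’s Manual to understand the lock instruction prefix.
The manual's introduction to and description of lock are scattered across many chapters; they are collected below by volume and chapter.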
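Volumes of the Intel® 64 and IA-32 Architectures Software Developer’s Manual:
Volume 1: Basic Architecture
Volume 2 (2A, 2B, 2C & 2D): Instruction Set Reference
Volume 3 (3A, 3B, 3C & 3D): System Programming Guide
For full details, consult the manual itself; the excerpts here omit some unrelated material.
Note: the summaries and interpretations below reflect the author's own understanding and may not be fully accurate. Discussion is welcome.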
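1. Conclusions First
With the LOCK prefix, early processors (the Pentium and earlier) always trigger a bus lock; later processors (the P6 family onward) prefer a cache lock and fall back to a bus lock only when a cache lock cannot be used, for example when the memory is not cacheable or the operand crosses a cache line boundary.
Suppose several cores have cached the data at the same memory address. If one core modifies that data (read-modify-write) with a LOCK-prefixed instruction, the LOCK prefix together with the cache coherency protocol (MESI) invalidates the copies held in the other cores' caches and guarantees that the instruction executes atomically. The modified data becomes visible to the other cores through the coherency protocol; it is written back to main memory when the protocol requires it, not necessarily immediately.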
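2. Terminology
LOCK (prefix) = the LOCK instruction prefix
LOCK# signal = the processor's LOCK# signal
processor = core = one core of a multi-core CPU
cache lock = cache locking = locking of the CPU cache
asserted
Based on the material, “asserted” is best read as “the signal is driven (sent)”.
- Early processors guarantee atomicity by locking the bus, so “asserted” means “the signal is driven on the bus”.
- Later processors guarantee atomicity by cache locking combined with cache coherency, so “asserted” can loosely be read as “the cache controller is signalled”.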
3. What the Manual Says
3.1. Volume 1:Basic Architecture
3.1.1. Chapter 5 Instruction Set Summary
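Location: <5.20 System Instructions>
Summary: the LOCK prefix performs an atomic access to memory; it can be applied to a number of general-purpose instructions that have a memory source/destination operand.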
LOCK (prefix) - Perform atomic access to memory (can be applied to a number of general purpose instructions that provide memory source/destination access).
3.1.2. Chapter 7 Programming With General-Purpose Instructions
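Location: <7.3 Summary of GP Instructions> - <7.3.1 Data Transfer Instructions> - <7.3.1.2 Exchange Instructions>
Summary: CMPXCHG (compare and exchange) and CMPXCHG8B (compare and exchange 8 bytes) are used to synchronize operations in multiprocessor systems.
CMPXCHG takes three operands: a source operand in a register, a second source operand in EAX, and a destination operand. If the destination equals EAX, the destination is replaced with the register source operand; otherwise the destination's original value is loaded into EAX. The EFLAGS status flags reflect the result of subtracting the destination operand from the value in EAX.
CMPXCHG is commonly used to test and modify a semaphore: if the semaphore is free it is marked allocated, otherwise the caller obtains the ID of the current owner, all in one uninterruptible operation.
On a single-processor system this removes the need to switch to protection level 0 (to disable interrupts) before running a multi-instruction test-and-modify sequence.
On a multiprocessor system, CMPXCHG can be combined with the lock prefix so that the compare and exchange is performed atomically.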
The CMPXCHG (compare and exchange) and CMPXCHG8B (compare and exchange 8 bytes) instructions are used to synchronize operations in systems that use multiple processors.
The CMPXCHG instruction requires three operands: a source operand in a register, another source operand in the EAX register, and a destination operand. If the values contained in the destination operand and the EAX register are equal, the destination operand is replaced with the value of the other source operand (the value not in the EAX register). Otherwise, the original value of the destination operand is loaded in the EAX register. The status flags in the EFLAGS register reflect the result that would have been obtained by subtracting the destination operand from the value in the EAX register.
The CMPXCHG instruction is commonly used for testing and modifying semaphores. It checks to see if a semaphore is free.
If the semaphore is free, it is marked allocated; otherwise it gets the ID of the current owner. This is all done in one uninterruptible operation.
In a single-processor system, the CMPXCHG instruction eliminates the need to switch to protection level 0 (to disable interrupts) before executing multiple instructions to test and modify a semaphore.
For multiple processor systems, CMPXCHG can be combined with the LOCK prefix to perform the compare and exchange operation atomically.
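The operand rules above can be sketched as a small sequential model. This is only an illustration of the semantics: Python cannot issue a LOCK CMPXCHG, and the names `cmpxchg`, `CmpxchgResult`, `try_acquire`, and the encoding `FREE = 0` are hypothetical, not taken from the manual.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class CmpxchgResult:
    dest: int  # value left in the destination operand
    eax: int   # value left in EAX
    zf: bool   # zero flag: True when the exchange happened


def cmpxchg(dest: int, eax: int, src: int) -> CmpxchgResult:
    """Model of CMPXCHG: compare EAX with dest, store src on a match."""
    if dest == eax:
        # Match: the destination receives the register source operand.
        return CmpxchgResult(dest=src, eax=eax, zf=True)
    # Mismatch: the destination is unchanged and its value is loaded into EAX.
    return CmpxchgResult(dest=dest, eax=dest, zf=False)


FREE = 0  # assumption for this sketch: 0 means "semaphore is free"


def try_acquire(semaphore: int, my_id: int) -> Tuple[int, Optional[int]]:
    """Return (new semaphore value, owner id), with owner None on success."""
    r = cmpxchg(dest=semaphore, eax=FREE, src=my_id)
    return (r.dest, None) if r.zf else (r.dest, r.eax)
```

On real hardware the compare and the store happen as one instruction, and on a multiprocessor system the LOCK prefix is what makes that step atomic with respect to other processors; the model only captures the data flow.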
3.2. Volume 2 : Instruction Set Reference
3.2.1. Chapter 2 Instruction Format
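Location: <2.1.1 Instruction Prefixes>
Summary: the LOCK prefix (byte F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor environment.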
The LOCK prefix (F0H) forces an operation that ensures exclusive use of shared memory in a multiprocessor environment. See “LOCK—Assert LOCK# Signal Prefix” in Chapter 3, “Instruction Set Reference, A-L,” for a description of this prefix.
3.2.2. Chapter 3 Instruction Set Reference
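Location: <3.2 INSTRUCTIONS> - <LOCK—Assert LOCK# Signal Prefix>
Summary: the LOCK prefix causes the processor's LOCK# signal to be asserted while the accompanying instruction executes, which turns that instruction into an atomic instruction. In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
The LOCK prefix is typically used with BTS to perform a read-modify-write on a location in shared memory. Its integrity does not depend on alignment: memory locking is observed even for arbitrarily misaligned fields.
The behaviour is the same in non-64-bit modes and 64-bit mode.
Beginning with the P6 family, when an instruction carries the LOCK prefix and the accessed memory is cached inside the processor, the LOCK# signal is generally not asserted. Instead, only the processor's cache is locked, and the cache coherency mechanism ensures that the operation is atomic with regard to memory.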
Causes the processor’s LOCK# signal to be asserted during execution of the accompanying instruction (turns the instruction into an atomic instruction). In a multiprocessor environment, the LOCK# signal ensures that the processor has exclusive use of any shared memory while the signal is asserted.
The LOCK prefix is typically used with the BTS instruction to perform a read-modify-write operation on a memory location in shared memory environment. The integrity of the LOCK prefix is not affected by the alignment of the memory field. Memory locking is observed for arbitrarily misaligned fields.
This instruction’s operation is the same in non-64-bit modes and 64-bit mode.
Beginning with the P6 family processors, when the LOCK prefix is prefixed to an instruction and the memory area being accessed is cached internally in the processor, the LOCK# signal is generally not asserted. Instead, only the processor’s cache is locked. Here, the processor’s cache coherency mechanism ensures that the operation is carried out atomically with regards to memory.
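The read-modify-write that BTS performs can be pictured with a short model. This is a sketch of the data flow only: it is sequential Python, not an atomic instruction, and the helper names `bts` and `try_lock_bit` are made up for illustration.

```python
from typing import Tuple


def bts(value: int, bit: int) -> Tuple[int, int]:
    """Model of BTS: return (new value, CF), where CF is the old bit."""
    carry = (value >> bit) & 1          # read: copy the selected bit into CF
    return value | (1 << bit), carry    # modify and write: set the bit


def try_lock_bit(flags: int, bit: int) -> Tuple[int, bool]:
    """Use one bit as a lock; acquisition succeeds only if it was clear."""
    new_flags, carry = bts(flags, bit)
    return new_flags, carry == 0
```

Without LOCK, another processor could change the word between the read and the write; with LOCK, the whole read-modify-write is treated as one atomic step, which is why the old bit in CF reliably tells the caller whether it won the lock.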
3.3. Volume 3 : System Programming Guide
3.3.1. Chapter 2 System Architecture Overview
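Location: <2.8 System Instruction Summary> - <2.8.5 Controlling the Processor>
Summary: when an instruction modifies a memory operand, the LOCK prefix invokes a locked (atomic) read-modify-write operation. This mechanism allows reliable communication between processors in a multiprocessor system:
- In the Pentium and earlier IA-32 processors, the LOCK prefix causes the processor to assert LOCK# during the instruction, which always results in an explicit bus lock.
- In the Pentium 4, Intel Xeon, and P6 family processors, the locked operation is handled with either a cache lock or a bus lock.
- If the memory access is cacheable and affects only a single cache line, a cache lock is used; neither the system bus nor the actual location in system memory is locked. Other Pentium 4, Intel Xeon, or P6 family processors on the bus write back any modified data and invalidate their caches as necessary to keep system memory coherent.
- If the memory access is not cacheable and/or crosses a cache line boundary, the processor asserts LOCK# and does not respond to requests for bus control during the locked operation.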
The LOCK prefix invokes a locked (atomic) read-modify-write operation when modifying a memory operand. This mechanism is used to allow reliable communications between processors in multiprocessor systems, as described below:
- In the Pentium processor and earlier IA-32 processors, the LOCK prefix causes the processor to assert the LOCK# signal during the instruction. This always causes an explicit bus lock to occur.
- In the Pentium 4, Intel Xeon, and P6 family processors, the locking operation is handled with either a cache lock or bus lock. If a memory access is cacheable and affects only a single cache line, a cache lock is invoked and the system bus and the actual memory location in system memory are not locked during the operation. Here, other Pentium 4, Intel Xeon, or P6 family processors on the bus write-back any modified data and invalidate their caches as necessary to maintain system memory coherency. If the memory access is not cacheable and/or it crosses a cache line boundary, the processor’s LOCK# signal is asserted and the processor does not respond to requests for bus control during the locked operation.
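The cache-lock versus bus-lock rule for the Pentium 4, Intel Xeon, and P6 family can be written as a small decision function. This is a sketch under stated assumptions: the 64-byte line size is an assumption (the real size depends on the processor), and `crosses_cache_line` and `lock_mechanism` are hypothetical helper names.

```python
CACHE_LINE_SIZE = 64  # assumption: 64-byte lines; the real size is model-specific


def crosses_cache_line(addr: int, size: int, line: int = CACHE_LINE_SIZE) -> bool:
    """True if the bytes [addr, addr + size) span more than one cache line."""
    return addr // line != (addr + size - 1) // line


def lock_mechanism(addr: int, size: int, cacheable: bool) -> str:
    """Pick the mechanism the quoted rule describes for a locked access."""
    if cacheable and not crosses_cache_line(addr, size):
        return "cache lock"
    # Not cacheable and/or split across two lines: LOCK# is asserted.
    return "bus lock"
```

For example, an 8-byte operand at offset 56 of a line stays inside it, while the same operand at offset 60 spills into the next line and would fall back to a bus lock under this rule.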
3.3.2. Chapter 8 Multiple-Processor Management
3.3.2.1. 8.1 LOCKED ATOMIC OPERATIONS
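Summary: the 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try to modify the same field or flag at the same time. The processor relies on three interdependent mechanisms to carry out locked atomic operations:
- Guaranteed atomic operations
- Bus locking, using the LOCK# signal and the LOCK instruction prefix
- Cache coherency protocols that allow atomic operations to be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors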
The 32-bit IA-32 processors support locked atomic operations on locations in system memory. These operations are typically used to manage shared data structures (such as semaphores, segment descriptors, system segments, or page tables) in which two or more processors may try simultaneously to modify the same field or flag. The processor uses three interdependent mechanisms for carrying out locked atomic operations:
- Guaranteed atomic operations.
- Bus locking, using the LOCK# signal and the LOCK instruction prefix.
- Cache coherency protocols that ensure that atomic operations can be carried out on cached data structures (cache lock); this mechanism is present in the Pentium 4, Intel Xeon, and P6 family processors.
3.3.2.1.1. 8.1.2 Bus Locking
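Summary: Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or an equivalent link.
While the signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can request LOCK semantics on other occasions by prepending the LOCK prefix to an instruction.
On the Intel386, Intel486, and Pentium processors, explicitly locked instructions result in the assertion of LOCK#. It is the hardware designer's responsibility to make the LOCK# signal available in system hardware so that it can control memory accesses among processors.
On the P6 and more recent processor families, if the accessed memory area is cached inside the processor, LOCK# is generally not asserted; instead, locking is applied only to the processor's caches.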
Intel 64 and IA-32 processors provide a LOCK# signal that is asserted automatically during certain critical memory operations to lock the system bus or equivalent link.
While this output signal is asserted, requests from other processors or bus agents for control of the bus are blocked. Software can specify other occasions when the LOCK semantics are to be followed by prepending the LOCK prefix to an instruction.
In the case of the Intel386, Intel486, and Pentium processors, explicitly locked instructions will result in the assertion of the LOCK# signal. It is the responsibility of the hardware designer to make the LOCK# signal available in system hardware to control memory accesses among processors.
For the P6 and more recent processor families, if the memory area being accessed is cached internally in the processor, the LOCK# signal is generally not asserted; instead, locking is only applied to the processor’s caches (see Section 8.1.4, “Effects of a LOCK Operation on Internal Processor Caches”).
3.3.2.1.1.1. 8.1.2.2 Software Controlled Bus Locking
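Summary: to force LOCK semantics explicitly, software can use the LOCK prefix with the following instructions when they modify a memory location:
- The bit test and modify instructions (BTS, BTR, BTC)
- The exchange instructions (XADD, CMPXCHG, CMPXCHG8B)
- XCHG, for which the LOCK prefix is automatically assumed
- The single-operand arithmetic and logical instructions INC, DEC, NOT, NEG
- The two-operand arithmetic and logical instructions ADD, ADC, SUB, SBB, AND, OR, XOR
A locked instruction is guaranteed to lock only the memory area defined by the destination operand, although the system may treat the lock as covering a larger area.
Software should access a semaphore (shared memory used for signalling between processors) with identical addresses and operand lengths. For example, if one processor accesses a semaphore with a word access, other processors should not access it with a byte access.
The integrity of a bus lock does not depend on the alignment of the memory field: LOCK semantics are followed for as many bus cycles as it takes to update the entire operand. For better system performance, however, locked accesses should be aligned on their natural boundaries:
- Any boundary for an 8-bit access (locked or otherwise)
- A 16-bit boundary for locked word accesses
- A 32-bit boundary for locked doubleword accesses
- A 64-bit boundary for locked quadword accesses
Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetches and page table accesses can pass a locked instruction. Locked instructions can be used to synchronize data written by one processor and read by another.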
To explicitly force the LOCK semantics, software can use the LOCK prefix with the following instructions when they are used to modify a memory location.
- The bit test and modify instructions (BTS, BTR, and BTC).
- The exchange instructions (XADD, CMPXCHG, and CMPXCHG8B).
- The LOCK prefix is automatically assumed for XCHG instruction.
- The following single-operand arithmetic and logical instructions: INC, DEC, NOT, and NEG.
- The following two-operand arithmetic and logical instructions: ADD, ADC, SUB, SBB, AND, OR, and XOR.
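The list above amounts to a lookup, which can be sketched as follows. The sets contain only the mnemonics quoted in this section; `lock_semantics` is a hypothetical helper, and the model merely classifies an instruction. It does not model what the processor does when the prefix is used in a way the list does not allow.

```python
# Mnemonics quoted in this section as accepting an explicit LOCK prefix.
LOCKABLE = frozenset({
    "BTS", "BTR", "BTC",
    "XADD", "CMPXCHG", "CMPXCHG8B",
    "INC", "DEC", "NOT", "NEG",
    "ADD", "ADC", "SUB", "SBB", "AND", "OR", "XOR",
})

# The LOCK prefix is automatically assumed for XCHG.
IMPLICITLY_LOCKED = frozenset({"XCHG"})


def lock_semantics(mnemonic: str, has_lock_prefix: bool, dest_is_memory: bool) -> bool:
    """True if the quoted rules give this instruction LOCK semantics."""
    name = mnemonic.upper()
    if not dest_is_memory:
        return False  # the rules only cover instructions that modify memory
    if name in IMPLICITLY_LOCKED:
        return True   # locked with or without an explicit prefix
    return has_lock_prefix and name in LOCKABLE
```

For instance, `lock_semantics("xchg", False, True)` is true because XCHG with a memory operand is locked implicitly, while `lock_semantics("add", False, True)` is false because ADD needs the explicit prefix.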
A locked instruction is guaranteed to lock only the area of memory defined by the destination operand, but may be interpreted by the system as a lock for a larger memory area.
Software should access semaphores (shared memory used for signalling between multiple processors) using identical addresses and operand lengths. For example, if one processor accesses a semaphore using a word access, other processors should not access the semaphore using a byte access.
The integrity of a bus lock is not affected by the alignment of the memory field. The LOCK semantics are followed for as many bus cycles as necessary to update the entire operand. However, it is recommended that locked accesses be aligned on their natural boundaries for better system performance:
- Any boundary for an 8-bit access (locked or otherwise).
- 16-bit boundary for locked word accesses.
- 32-bit boundary for locked doubleword accesses.
- 64-bit boundary for locked quadword accesses.
Locked operations are atomic with respect to all other memory operations and all externally visible events. Only instruction fetch and page table accesses can pass locked instructions. Locked instructions can be used to synchronize data written by one processor and read by another processor.
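The natural-boundary recommendation for locked accesses can be checked with a one-line test. This is a minimal sketch; `is_naturally_aligned` is a hypothetical helper, and it covers only the four operand sizes listed above.

```python
def is_naturally_aligned(addr: int, size_bytes: int) -> bool:
    """True if an access of size_bytes at addr sits on its natural boundary."""
    if size_bytes not in (1, 2, 4, 8):
        raise ValueError("only byte, word, doubleword and quadword accesses are modelled")
    # A byte access is aligned anywhere; larger accesses need addr to be
    # a multiple of their size (16-, 32- or 64-bit boundary).
    return addr % size_bytes == 0
```

A misaligned locked access is still atomic according to the text above; the check is about performance, not correctness.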
3.3.2.1.2. 8.1.4 Effects of a LOCK Operation on Internal Processor Caches
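Summary: on the Intel486 and Pentium processors, LOCK# is always asserted on the bus during a LOCK operation, even if the locked memory area is cached in the processor.
On the P6 and more recent processor families, if the locked memory area is cached in the processor performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert LOCK# on the bus. Instead, it modifies the memory location internally and relies on its cache coherency mechanism to make the operation atomic.
This is called “cache locking”: the cache coherency mechanism automatically prevents two or more processors that have cached the same memory area from modifying data in that area at the same time.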
For the Intel486 and Pentium processors, the LOCK# signal is always asserted on the bus during a LOCK operation, even if the area of memory being locked is cached in the processor.
For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically.
This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
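How coherency alone can serialize two writers is easier to see in a toy model. The sketch below tracks a single cache line with only three states (M, S, I) and is a deliberate simplification: real protocols have more states and run in hardware, and the class `CoherentLine` and its methods are hypothetical names, not anything from the manual.

```python
from typing import Callable, Dict, List


class CoherentLine:
    """Toy model of one cache line shared by several cores (states M, S, I)."""

    def __init__(self, cores: List[str], value: int = 0) -> None:
        self.memory = value                                   # copy in system memory
        self.state: Dict[str, str] = {c: "I" for c in cores}  # per-core line state
        self.cached: Dict[str, int] = {}                      # per-core cached value

    def read(self, core: str) -> int:
        if self.state[core] == "I":
            for other, st in self.state.items():
                if st == "M":
                    # A modified copy elsewhere is written back and demoted.
                    self.memory = self.cached[other]
                    self.state[other] = "S"
            self.cached[core] = self.memory
            self.state[core] = "S"
        return self.cached[core]

    def locked_rmw(self, core: str, update: Callable[[int], int]) -> int:
        """Locked read-modify-write: take exclusive ownership, then update."""
        old = self.read(core)
        for other in self.state:
            if other != core:
                self.state[other] = "I"  # invalidate every other copy
        self.cached[core] = update(old)
        self.state[core] = "M"           # only this core holds the new value
        return old
```

In this model, after core `c0` performs `locked_rmw`, core `c1` holds the line as invalid and must re-read it, at which point `c0`'s modified value is written back. No bus lock appears anywhere, and system memory is updated only when another core asks for the line.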
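4. References
- Intel® 64 and IA-32 Architectures Software Developer’s Manual
- 就是要你懂Java中volatile关键字实现原理#lock指令做了什么 (a Chinese-language article on how Java's volatile keyword is implemented; section on what the lock instruction does)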