wiki:AtomicOperations

Atomic Operations

1. Goals

The TSAR architecture implements two read-then-write atomic operations:

  • The LL/SC (Linked Load / Store Conditional) operation is implemented as two specific VCI transactions. As the LL/SC instructions are implemented in the MIPS32 instruction set, these instructions can be used by both the kernel code and by the application code to read a data at address X, test and modify this data, and write the modified data at the same address X, with the guaranty that no other access to this address was done between the read and the write access.

  • The CAS (Compare and Swap) operation is implemented as a specific VCI transaction. As there is no CAS instruction in the MIPS32 instruction set, this operation is only used by the L1 cache controller for some low-level synchronisation mechanisms, such as updating a page table entry in kernel memory.

For both types of operation, the addresses is supposed to be aligned on 32 bits word boundaries, and the data are supposed to be 32 bits words.

2. LL/SC operation

As we want to support commodity operating systems and existing software applications, any memory address can be the target of an atomic access. As the atomic access can be used to implement spin-locks, the address must be cacheable in order to benefit from the general coherence protocol, and avoid unnecessary transactions on the interconnection network.

2.1 General Principle

From a conceptual point of view, the atomicity is handled on the memory controller side, that is actually the L2 cache controller in the TSAR architecture. Each L2 cache controller contains a list of all pending LL/SC atomic operations in an associative reservation table, that contains 32 entries. Each entry registers the X dress, plus a 32 bits K key identifying a given LL/SC operation. It does not register the client identifier, but the K key avoids to mix two successive LL/SC operations with the same address X.

  • When a processor P executes the LL(X) instruction for an address X, this réservation request is sent to the L2 cache by the L1 cache. The L2 cache allocates a 32 bits authentication key for this reservation. It registers both the X address and the K key in the associative reservation table, and returns both the value stored at address X and the K value to the L1. Both the X address and the K key are also registered in the L1 cache. If another processor P' request a reservation for the same address X, it receives the registered K value from the L2 cache.
  • When a processor P executes the SC(X,D) instruction to an address X, this conditional write is sent to the L2 cache by the L1 cache, and the command contains both the reservation key K and the data D to be written. The L2 cache makes an associative search in the reservation table. If a valid reservation with the same address X and the same key K is found, the atomic operation is a success : The reservation is canceled in the reservation table, the D value is written at address X, and a success value is returned to the L1 cache. If there is no match in the reservation table, the atomic operation is a failure: the D value is not written at address X, the reservation table is not modified, and a failure value is returned to the L1 cache.

Clearly, in case of concurrent LL/SC access to the same address X by two or more L1 caches, the winner is defined by the first SC(X,D) instruction received by the L2 cache.

2.2 Key allocation policy

Tha key allocator is a simple 32 bits counter, that is incremented each time a new K value is allocated to satisfy a LL requiring a new reservation. As there is a finite number of values, the mechanism requires that a K value allocated for a given reservation have a finite time of life.

  • In the réservation table implemented in the L2 cache, the maximum time of life for an entry containing the K value is defined by a maximum number of (231) allocated keys: Each time a new K' value is allocated, any entry in the reservation table containing the K value, such as (|K-K'| == 231) is invalidated.
  • In the L1 cache, the bounded time of life of the registered reservation is implemented as a cycle counter: The counter starts when a new reservation is initiated in the L1 cache, and the reservation is invalidated when the counter reaches 231 cycles.

2.3 Replacement policy

As the capacity of the reservation table is limited to 32 entries, this table can be full when a reservation request LL(X) is received by the L2 cache controller. In this case, an existing reservation entry must be invalidated in the associative table. In order to improve the robustness of the mechanism against malicious attacks, the victim selection algorithm is unbalanced: All slots have not the same probability to be evicted from the reservation table. This probability is of the form 1/2i and ranges from 1/4096 to 1/2. Some slots are selected with a high probability (1/2), and some other slots are selected with a very low probability (1/4096).

2.4 Detailed specification for the L1 cache

Each L1 cache controller contains 4 registers to store one single reservation:

  • physical address : 40 bits
  • reservation key : 32 bits
  • cycle counter : 31 bits
  • valid reservation : 1 bit

We summarize below the actions done by the L1 cache controller receiving a LL(X), SC(X,D) or SW(X,D) request from the processor:

  • LL(X) : The L1 cache registers the X address and its local reservation register, activates the cycle counter, and send a single flit VCI LL command containing the X address to the L2 cache. When the response is received from the L2 cache containg the value and the key K, the key is saved in the local reservation register and the value is send to the processor.
  • SC(X,D) : The L1 cache checks the X address agains the registered address. In case of miss, it returns a failure code to the processor, without any VCI transaction on the network. In case of hit, it invalidates the local reservation and sent a two flits VCI SC command containing the X address, the registered K value, and the D value.
  • SW(X,D) : The L1 cache checks the X address against the registered address. In case of hit the reservation is invalidated. In case of miss, the reservation is not modified. In both cases the write request is sent to the L2 cache.

2.5 Detailed specification for the L2 cache

Each entry in the reservation table contains 3 fields to store one reservation:

  • physical address : 40 bits
  • reservation key : 32 bits
  • valid reservation : 1 bit

We summarize below the actions done by the L2 cache controller receiving a LL(X), SC(X,D,K) or SW(X,D) VCI command from a L1 cache controller:

  • LL(X) : The L2 cache makes an associative search on the X address in the reservation table. In case of hit (X = Xr), the L2 cache returns both the D value stored at address X, and the K value stored in the reservation table to the L1 cache. In case of miss, the L2 cache allocates a new K value from the key allocator, registers a new entry in the reservation table(this can require a victim eviction), and returns the D and K values to the L1 cache.
  • SC(X,D,K) : The L2 cache makes an associative search on both the the X address and the K key in the reservation table. In case of hit, the reservation is invalided in the reservation table, the D value is written at address X, and a success value is returned to the L1 cache. In case of miss, the D value is not written at address X, the reservation table is not modified, and a failure value is returned to the L1 cache.
  • SW(X,D) : The L2 cache makes an associative search on the X address in the reservation table. As the write command can be a burst, with a Xmin and Xmax addresses, the HIT condition for each entry containing an address Xr is actually Xmin <= Xr <= Xmax. In case of hit, the reservation Xr is invalidated. In case of miss, the reservation table is not modified. In both cases the D value is written at address X.

2.6 Failure / Success encoding

The actual encoding of the (success/failure) response for a SC(X,D) instruction depends on the processor core: For the MIPS2 and ARM processors, a success is encoded as a non-zero value. For the PPC processor, a success is encoded as a zero value. In the TSAR architecture, the L2 cache controller returns the value 0 for a success, and the value 1 for a failure to a SC(X,D,K) VCI command. If the architecture uses a MIPS or ARM processor, the value returned by the L2 cache must be transcoded by the L1 cache controller before to be transmitted to the processor core.

2.7 Software example on MIPS32 processor

As described below, the LL/SC mechanism can be used to implement a spin-lock, using any memory address :

  • The lock acquisition is done by an atomic LL/SC operation.
  • The lock release is done by a simple WRITE instruction.

Remember that a SC failure in encoded by a zero value for the MIPS processor.

			_itmask		        # enter critical section
# lock acquisition
loop		        ll r1,   0(r4)		# r1 <= M[r4]
			bnez r1, loop	        # retry if lock already taken (r1 != 0)
			ori r1,  r0, 1	        # r1 <= 1
			sc  r1,  0(r4)	        # if atomic (M[r4] <= 1 / r1 <= 1) else (r1 <= 0)
			beqz r1, loop	        # retry if not atomic (r1 == 0)
			...
# lock release
			ori r1, r0, 0           # r1 <= 0
			sw r1, 0(r4)	        # M[r4] <= 0

			_itunmask		# exit critical section

3. CAS operation

3.1 General Principle

The semantic of the CAS(X,D_old,D_new) VCI transaction (three arguments in the VCI command) is the following:

  • If the value stored at address X is equal to the D_old value, the CAS returns a success code in the VCI response, and the value D_new is written at address X.
  • If the value stored at address X is not equal to the D_old value, the CAS returns a failure code in the VCI response, and no write is done at address X.

3.2 Implementation and Usage

This CAS operation is implemented in the L2 cache controller. It is actually used by the MMU (Memory Management Unit) implemented in the L1 cache controller, to update the DIRTY bit in a page table entry. It is also used by the vci_mwmr_dma component, to atomically access the lock protecting a shared hardware/software communication channel). It cannot be directly used by the software.

Last modified 2 weeks ago Last modified on Dec 1, 2017, 6:39:22 PM