# arm

## NUMA Aware PER-CPU Framework

Rohit Mathew 18<sup>th</sup> July 2024

© 2024 Arm

### Agenda

#### Problem

- Overview
- PER-CPU Objects

#### Proposal – NUMA Aware PER-CPU Framework

- How do we do it?
- Platform's responsibility
- Definer Interface
- Accessor Interface
- Optimization 1: tpidr\_el3 magic
- Optimization 2: avoid cache thrashing
- Stack migration
- Interface variants

### Problem

#### Overview

- Homogeneous multichip platforms have physically segregated SRAM in each chiplet.
- The TF-A runtime image for a multichip platform can exceed the size of a single SRAM.
  - Can we reduce the runtime image size on the primary SRAM?
- Additionally, CPUs on non-primary chiplets incur NUMA latency on cross-chip accesses.
  - Can we move parts of the image to the SRAM local to the CPU in context?



### Problem

**PER-CPU Objects** 

#### The NUMA problem

- TF-A has a *lot of* global objects that are per CPU.
  - RMM context, NS PSCI context, SPMD context, etc.
  - They are part of BSS, which is not loaded explicitly but still forms part of the runtime footprint.
- Per-CPU objects are reused throughout the lifetime of the system.
  - Cross-chip reads, writes and snoops add NUMA latency.

#### The Storage problem





### How do we do it?

- Every PER-CPU object should *ideally* be defined using the framework's **Definer Interface**.
- The Accessor Interface is used to access these objects.
- For single-chiplet systems, there is no change in how things work\*\*.
- For multi-chiplet systems, the PER-CPU framework allocates these globals across the chiplets' SRAMs.
- A new section called ".per\_cpu" is introduced, for multichip systems only, to tie PER-CPU globals to a single chip.

\*\*Certain optimizations can bring changes for single chip as well.

### How does it look?

#### **Single Chip**

| SRAM0 (single chip) |
|---------------------|
| SRAM0 End           |
| BL31 XLAT           |
| BL31 BSS            |
| BL31 Stack          |
| BL31 Data           |
| BL31 RO             |
| BL31 code           |
| SRAM0 Start         |

#### **Multi-Chip**





### Platform's Responsibility

#### Single chip

- Nothing to be done.

#### Multi-chip

- Set the build option PER\_CPU\_MULTICHIP := 1.
- Set up page tables for the remote regions at the desired locations.
- Implement
  - uintptr\_t plat\_per\_cpu\_section\_base(int cpu);
  - This should return the address of the ".per\_cpu" section corresponding to a CPU.
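The platform hook could be sketched as below. This is only an illustration: it assumes linear CPU IDs are grouped by chiplet and that every chiplet's ".per\_cpu" copy is mapped at a fixed stride. CHIPLET\_COUNT, PER\_CPU\_BASE\_CHIP0, PER\_CPU\_CHIP\_STRIDE and all numeric values are hypothetical; only plat\_per\_cpu\_section\_base and CHIPLET\_CORE\_COUNT come from the slides.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical platform constants, for illustration only. */
#define CHIPLET_COUNT        4U
#define CHIPLET_CORE_COUNT   4U
#define PLATFORM_CORE_COUNT  (CHIPLET_COUNT * CHIPLET_CORE_COUNT)
#define PER_CPU_BASE_CHIP0   0x04000000UL /* assumed address of chiplet 0's .per_cpu */
#define PER_CPU_CHIP_STRIDE  0x01000000UL /* assumed distance between chiplet copies */

/*
 * Return the base of the .per_cpu section that is local to the chiplet
 * owning `cpu`. Assumes CPUs 0..3 sit on chiplet 0, 4..7 on chiplet 1, etc.
 */
uintptr_t plat_per_cpu_section_base(int cpu)
{
	assert(cpu >= 0 && (unsigned int)cpu < PLATFORM_CORE_COUNT);

	unsigned int chiplet = (unsigned int)cpu / CHIPLET_CORE_COUNT;

	return (uintptr_t)(PER_CPU_BASE_CHIP0 + chiplet * PER_CPU_CHIP_STRIDE);
}
```

A real platform would derive these addresses from its memory map and from the page tables it set up for the remote regions.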

### Definer Interface

#### Single chip

- No changes internally.
- Here is an example use-case:

DEFINE\_PER\_CPU(rmmd\_rmm\_context\_t, rmm\_context);

#### Multi-chip

#define DEFINE\_PER\_CPU(TYPE, NAME) \
 TYPE NAME[CHIPLET\_CORE\_COUNT] \
 \_\_section(PER\_CPU\_MULTICHIP\_SECTION)

- The object is tied to a different section (.per\_cpu).
- The array size is reduced from the total core count to the number of cores per chiplet.
- This region is duplicated in each chiplet's SRAM.
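The effect of the definer on storage can be sketched as follows. The topology numbers and the size of the stand-in rmmd\_rmm\_context\_t are made up, the single-chip branch is an assumption (the slides only say nothing changes), and the section attribute is spelled out in GCC/Clang syntax for an ELF toolchain instead of the \_\_section helper.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical topology, for illustration only. */
#define CHIPLET_COUNT        4U
#define CHIPLET_CORE_COUNT   16U
#define PLATFORM_CORE_COUNT  (CHIPLET_COUNT * CHIPLET_CORE_COUNT)

#define PER_CPU_MULTICHIP          1
#define PER_CPU_MULTICHIP_SECTION  ".per_cpu"

#if PER_CPU_MULTICHIP
/* Multi-chip: one slot per core of a chiplet, placed in the dedicated section. */
#define DEFINE_PER_CPU(TYPE, NAME) \
	TYPE NAME[CHIPLET_CORE_COUNT] \
	__attribute__((section(PER_CPU_MULTICHIP_SECTION)))
#else
/* Single chip (assumed): a plain array with one slot per core in the system. */
#define DEFINE_PER_CPU(TYPE, NAME) \
	TYPE NAME[PLATFORM_CORE_COUNT]
#endif

/* Stand-in for the real context type; the 256-byte size is made up. */
typedef struct { uint64_t regs[32]; } rmmd_rmm_context_t;

DEFINE_PER_CPU(rmmd_rmm_context_t, rmm_context);

/* Each chiplet's SRAM only carries the slots of its own cores. */
static_assert(sizeof(rmm_context) ==
	      CHIPLET_CORE_COUNT * sizeof(rmmd_rmm_context_t),
	      "per-chiplet footprint");
```

With these made-up numbers the object costs 4 KiB in each chiplet's SRAM, where a single array covering all 64 cores would cost 16 KiB in the primary SRAM.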

### FOR\_CPU\_PTR accessor Interface

#### Single chip

- #define FOR\_CPU\_PTR(NAME, CPU) &NAME[CPU]
- No changes internally.
- Here is an example use-case:

rmmd\_rmm\_context\_t \*rmm\_ctx = FOR\_CPU\_PTR(rmm\_context, linear\_id);

If this is used in an environment where multiple CPUs can access the object concurrently, make sure to use proper locking primitives!

#### **Multi-chip**

#define PER\_CPU\_OFFSET(x) (x - PER\_CPU\_START)
#define FOR\_CPU\_PTR(NAME, CPU) \_\_extension\_\_ \
 ((\_\_typeof\_\_(&NAME[0])) \
 (plat\_per\_cpu\_section\_base(CPU) + \
 PER\_CPU\_OFFSET((uintptr\_t)&NAME[CPU%CHIPLET\_CORE\_COUNT])))

- plat\_per\_cpu\_section\_base is implemented by the platform to return the section base for the CPU in context.
- plat\_per\_cpu\_section\_base is one way of doing it; tpidr\_el3 would be another.
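The address translation done by the multi-chip accessor can be tried on a host with the sketch below. It is a simulation under stated assumptions: struct per\_cpu\_image, the per\_cpu\_image object and the chiplet\_sram array are hypothetical stand-ins for the linked ".per\_cpu" section and for each chiplet's SRAM copy, and the topology is invented. The two macros follow the slides, with extra parentheses around the macro arguments.

```c
#include <assert.h>
#include <stdint.h>

#define CHIPLET_COUNT       2U /* hypothetical topology */
#define CHIPLET_CORE_COUNT  4U

typedef struct { uint64_t state; } rmmd_rmm_context_t; /* stand-in type */

/* Link-time image of the .per_cpu section: one slot per core of a chiplet. */
struct per_cpu_image {
	rmmd_rmm_context_t rmm_context[CHIPLET_CORE_COUNT];
};
static struct per_cpu_image per_cpu_image;
#define PER_CPU_START ((uintptr_t)&per_cpu_image)

/* One copy of the section per chiplet; stands in for each chiplet's SRAM. */
static struct per_cpu_image chiplet_sram[CHIPLET_COUNT];

uintptr_t plat_per_cpu_section_base(int cpu)
{
	assert(cpu >= 0 &&
	       (unsigned int)cpu < CHIPLET_COUNT * CHIPLET_CORE_COUNT);
	return (uintptr_t)&chiplet_sram[(unsigned int)cpu / CHIPLET_CORE_COUNT];
}

/* Offset of an object inside the linked section. */
#define PER_CPU_OFFSET(x) ((x) - PER_CPU_START)

/* Rebase that offset onto the section copy local to CPU's chiplet. */
#define FOR_CPU_PTR(NAME, CPU) __extension__ \
	((__typeof__(&NAME[0])) \
	 (plat_per_cpu_section_base(CPU) + \
	  PER_CPU_OFFSET((uintptr_t)&NAME[(CPU) % CHIPLET_CORE_COUNT])))
```

In this model, FOR\_CPU\_PTR(per\_cpu\_image.rmm\_context, 5) resolves to slot 1 of the copy held by chiplet 1, so CPU 5 never touches chiplet 0's SRAM for this object.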

### THIS\_CPU\_PTR accessor Interface

#### Single chip

```
#define THIS_CPU_PTR(NAME) &NAME[plat_my_core_pos()]
```

- No changes internally.

#### **Multi-chip**

#define PER\_CPU\_OFFSET(x) (x - PER\_CPU\_START)
#define THIS\_CPU\_PTR(NAME) \_\_extension\_\_ \
 ((\_\_typeof\_\_(&NAME[0])) \
 (plat\_per\_cpu\_section\_base(plat\_my\_core\_pos()) + \
 PER\_CPU\_OFFSET( \
 (uintptr\_t)&NAME[plat\_my\_core\_pos() % CHIPLET\_CORE\_COUNT])))

- Here is an example use-case:

rmmd\_rmm\_context\_t \*ctx = THIS\_CPU\_PTR(rmm\_context);



### **Op1** – tpidr\_el3 magic

- The unoptimized accessor involves function calls and extra arithmetic, for example:

bl 6e698 <plat\_my\_core\_pos>
mov w19, w0
bl 75d14 <plat\_per\_cpu\_section\_base>
and w19, w19, #0x3

- This can be optimized to something as simple as:

mrs x0, tpidr\_el3
add x0, x0, #0x9f8

This should be many times faster; it can even beat an access through a cached pointer when that access misses the cache.

- The optimized variant relies on a system register to get the offset for a particular CPU.
- The unoptimized variant relies on multiple memory accesses to calculate the right offset.
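The register-based accessor might look roughly like the sketch below. It is a host-side model, not the proposed implementation: it assumes each CPU owns a private region holding one instance of every per-CPU object, so that an object's offset is a compile-time constant. struct per\_cpu\_region, psci\_ctx\_t, per\_cpu\_init and the sim\_tpidr\_el3 variable are hypothetical, and read\_tpidr\_el3/write\_tpidr\_el3 are redefined here as stand-ins for the system register accessors.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLATFORM_CORE_COUNT 8U /* hypothetical */

typedef struct { uint64_t state; } rmmd_rmm_context_t; /* stand-in type */
typedef struct { uint64_t flags; } psci_ctx_t;         /* stand-in type */

/* Assumed layout: one private region per CPU, one instance of each object. */
struct per_cpu_region {
	psci_ctx_t psci_context;
	rmmd_rmm_context_t rmm_context;
};

/* Stands in for the per-CPU regions spread across the chiplets' SRAMs. */
static struct per_cpu_region regions[PLATFORM_CORE_COUNT];

/* Host stand-in for the TPIDR_EL3 register of the executing CPU. */
static uintptr_t sim_tpidr_el3;
static inline uintptr_t read_tpidr_el3(void) { return sim_tpidr_el3; }
static inline void write_tpidr_el3(uintptr_t v) { sim_tpidr_el3 = v; }

/* Done once per CPU during boot: cache the base of this CPU's region. */
void per_cpu_init(unsigned int cpu)
{
	assert(cpu < PLATFORM_CORE_COUNT);
	write_tpidr_el3((uintptr_t)&regions[cpu]);
}

/*
 * Accessor: on the target this is one register read plus a constant add,
 * with no function call and no lookup of the core position.
 */
#define THIS_CPU_PTR(NAME) \
	((__typeof__(&((struct per_cpu_region *)0)->NAME)) \
	 (read_tpidr_el3() + offsetof(struct per_cpu_region, NAME)))
```

The constant produced by offsetof plays the role of the immediate in the add instruction shown above; the single global variable models only the CPU that is currently executing.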

### **Op2** – Avoid Cache Thrashing

Contiguous arrays *can* place data for different CPUs on the same cache line:

TYPE NAME[CPU\_MAX]

This introduces false sharing, or cache thrashing: ownership of the cache line keeps switching between CPUs, even though each CPU only uses its own element.

[Figure: animation of the caches of CPU 1, CPU 2 and CPU 3 connected through the interconnect to memory. The line at address 0x1000 holds D1 D2 D3 D4; in the frame shown, CPU 1's cache holds the line while the copies in CPU 2 and CPU 3 are marked Invalid.]

- The extra cache-line alignment, coupled with breaking down the array, prevents the same cache line from being held by multiple CPUs.
- It *could* take up a bit more storage, as alignment is costly.
  - The Definer and Accessor implementations change; the interface should stay the same.
- Single-chip could be left untouched; however, this would be the better design if performance is a priority. (Remember \*\*)
  - This is a multi-CPU problem, not a multi-chip one!
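The difference between the two layouts can be checked with a small host-side sketch. It assumes a 64-byte cache line (the value given to CACHE\_WRITEBACK\_GRANULE here is an assumption); small\_ctx\_t, padded\_ctx\_t and line\_of are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define CACHE_WRITEBACK_GRANULE 64U /* assumed cache-line size */
#define CPU_MAX                 4U

typedef struct { uint32_t counter; } small_ctx_t; /* 4-byte stand-in object */

/* Contiguous layout: four 4-byte elements share one 64-byte line. */
static alignas(CACHE_WRITEBACK_GRANULE) small_ctx_t packed[CPU_MAX];

/*
 * Padded layout: aligning the member raises the alignment of the struct,
 * so sizeof rounds up to a full line and every CPU gets its own line.
 */
typedef struct {
	alignas(CACHE_WRITEBACK_GRANULE) small_ctx_t obj;
} padded_ctx_t;
static padded_ctx_t padded[CPU_MAX];

/* Index of the cache line, counted from line-aligned `base`, holding `p`. */
static size_t line_of(const void *base, const void *p)
{
	return (size_t)((const char *)p - (const char *)base) /
	       CACHE_WRITEBACK_GRANULE;
}
```

Here packed[0] and packed[3] land on the same line, while padded[1] and padded[3] sit on lines 1 and 3, at a cost of 256 bytes instead of 16 for the whole array.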

### Migration of Stack

- The stack today uses its own section, ".tzfw\_normal\_stacks", and is a big consumer of memory like the other context globals.
- The plan is to move the stack to the PER-CPU framework as we progress with the migration.
- At a high level this would mean:
  - Removing the stack section from the BL31 linker script
  - The stack would now be defined by the framework
  - SP is switched via the accessor interface

### Interface variants

- For both FOR\_CPU\_PTR and THIS\_CPU\_PTR, it would be beneficial to have a non-pointer (object) accessor interface (FOR\_CPU/THIS\_CPU).
  - Eg: FOR\_CPU(spm\_core\_context, core\_id).state = SPMC\_STATE\_OFF;
- The Definer interface should also support aligned definitions, for objects requiring stricter alignment.
  - Eg: \_\_aligned(64) some\_struct\_t some\_struct[PLATFORM\_CORE\_COUNT];
  - Could be defined as DEFINE\_PER\_CPU\_ALIGNED(some\_struct\_t, some\_struct, 64)
  - The .per\_cpu section has to be aligned to the max of (SORT\_BY\_ALIGNMENT(.per\_cpu))
- Support for arrays
  - Eg: uint64\_t shadow\_registers[16][PLATFORM\_CORE\_COUNT];
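These variants could be layered on the pointer accessor roughly as follows. This is a single-chip sketch under assumptions: the macro bodies are one possible implementation, the types and enum values are stand-ins, and array support is shown through a hypothetical typedef (shadow\_regs\_t), which makes the CPU index the outer dimension, unlike the declaration in the example above.

```c
#include <assert.h>
#include <stdint.h>

#define PLATFORM_CORE_COUNT 4U /* hypothetical */

/* Single-chip flavour of the definer and pointer accessor. */
#define DEFINE_PER_CPU(TYPE, NAME)  TYPE NAME[PLATFORM_CORE_COUNT]
#define FOR_CPU_PTR(NAME, CPU)      (&NAME[CPU])

/* Object accessor: dereference the pointer accessor to get an lvalue. */
#define FOR_CPU(NAME, CPU)          (*FOR_CPU_PTR(NAME, CPU))

/* Aligned definer: same array, with an explicit alignment on the object. */
#define DEFINE_PER_CPU_ALIGNED(TYPE, NAME, ALIGN) \
	TYPE NAME[PLATFORM_CORE_COUNT] __attribute__((aligned(ALIGN)))

/* Stand-in types, for illustration only. */
typedef enum { SPMC_STATE_RESET = 0, SPMC_STATE_OFF, SPMC_STATE_ON } spmc_state_t;
typedef struct { spmc_state_t state; } spmd_spm_core_context_t;
typedef struct { uint64_t regs[4]; } some_struct_t;
typedef uint64_t shadow_regs_t[16]; /* array payload wrapped in a typedef */

DEFINE_PER_CPU(spmd_spm_core_context_t, spm_core_context);
DEFINE_PER_CPU_ALIGNED(some_struct_t, some_struct, 64);
DEFINE_PER_CPU(shadow_regs_t, shadow_registers);

/* Example use: set the state of one core and one of its shadow registers. */
void mark_core_off(unsigned int core_id)
{
	assert(core_id < PLATFORM_CORE_COUNT);
	FOR_CPU(spm_core_context, core_id).state = SPMC_STATE_OFF;
	FOR_CPU(shadow_registers, core_id)[0] = 0U;
}
```

Because FOR\_CPU expands to an lvalue, callers can assign to members directly without first taking a pointer.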

- Support for initialized PER-CPU variables is a use-case we should cover.
  - Eg: /plat/st/common/stm32mp\_gic.c: static unsigned int target\_mask\_array[PLATFORM\_CORE\_COUNT] = {1, 2};
- It is possibly useful to add support in BL32.
  - Eg: ./bl32/tsp/tsp\_timer.c: static timer\_context\_t pcpu\_timer\_context[PLATFORM\_CORE\_COUNT];
- This would be a long-term activity, where less crucial objects can be migrated down the line.

Thank You · Danke · Gracias · Grazie · 谢谢 · ありがとう · Asante · Merci · 감사합니다 · धन्यवाद · Kiitos · شکرًا · ধন্যবাদ · ధన్యవాదములు


The Arm trademarks featured in this presentation are registered trademarks or trademarks of Arm Limited (or its subsidiaries) in the US and/or elsewhere. All rights reserved. All other marks featured may be trademarks of their respective owners.

www.arm.com/company/policies/trademarks
