#### Safely Exploiting Multithreaded Processors to Tolerate Memory Latency in Real-Time Systems

#### Ali El-Haj-Mahmoud and Eric Rotenberg

Center for Embedded Systems Research (CESR) Electrical & Computer Engineering North Carolina State University





#### **Embedded Processor Trends**







- More demanding applications and user expectations
- Higher frequency
  - ARM-11 (0.13µ): 500 MHz
  - ARM-11 (0.10µ): 1 GHz
- Processor-memory speed gap



# Multithreading

- Switch-on-event coarse-grain multithreading
  - Multiple register contexts for fast switching
  - Switch to alternate task when current task accesses memory
  - Overlap memory accesses with computation
- But what about multithreading in hard-real-time?
  - Require *analyzability*
  - Cannot rely on dynamic schemes

## **Exploiting Multithreading in Hard-Real-Time Systems**

- Safety
  - Guarantee all tasks meet deadlines (worst-case)
  - Statically bound overlap under all scenarios
- Tractability
  - Confirm/disconfirm schedulability mathematically using closed-form tests
  - Consider each task individually



- Utilization-based schedulability test
- No need to **construct** the schedule *a priori*

### **Performance vs. Tractability**

- Performance: Memory overlap
- Tractability: Closed-form schedulability test



- Only basic parameters of tasks are known
  - WCET = C + M (from conventional WCET analysis)
  - Period = deadline



- Memory overlap: No
- Closed-form test: Yes

$$U = \frac{WCET_A}{period_A} + \frac{WCET_B}{period_B} > 1$$



- Memory overlap: No
- Closed-form test: Yes

$$U = \frac{WCET_A}{period_A} + \frac{WCET_B}{period_B} > 1$$



- Memory overlap: No
- Closed-form test: Yes

$$U = \frac{WCET_A}{period_A} + \frac{WCET_B}{period_B} > 1$$



- Memory overlap: No
- Closed-form test: Yes

$$U = \frac{WCET_A}{period_A} + \frac{WCET_B}{period_B} > 1$$



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No
  - Must examine memory positioning!



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No
  - Must examine memory positioning!



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No
  - Must examine memory positioning!



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No
  - Must examine memory positioning!



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No

- Must examine memory positioning!



- Memory overlap: Possible, unfair for low priority
- Closed-form test: No

- Must examine memory positioning!



- Memory overlap: Yes (fair and bounded)
- Closed-form test: Yes



- Memory overlap: Yes (fair and bounded)
- Closed-form test: Yes



- Memory overlap: Yes (fair and bounded)
- Closed-form test: Yes



- Memory overlap: Yes (fair and bounded)
- Closed-form test: Yes



- Memory overlap: Yes (fair and bounded)
- Closed-form test: Yes

#### **Tractability through Deterministic Switching**

- Fully decouple independent tasks by forcing periodic switches
  - Every task gets a chance to initiate/overlap memory accesses
  - No scheduling dependences
  - No specificity regarding memory positioning

#### Weighted-Round-Robin (WRR)



#### **Analytical Framework for WRR**

```
d: duty cycle 0 < d \le 1
P: period = deadline
WCET: Worst-Case Execution Time
   WCET = C + M
   C: aggregate computation time
   M: aggregate memory time
WCET': dilated Worst-Case Execution Time
   WCET' = (C/d) + M
```

#### **Schedulability Test**

1. Dilated task meets deadline on its virtual processor



#### **Schedulability Test**

1. Dilated task meets deadline on its virtual processor

$$WCET' \le P \longrightarrow \frac{C}{d} + M \le P$$



#### **Schedulability Test**

1. Dilated task meets deadline on its virtual processor

$$WCET' \leq P \longrightarrow \frac{C}{d} + M \leq P$$
$$d \geq \frac{C}{P - M}$$



WCET' 
$$\leq P \longrightarrow \frac{C}{d} + M \leq P$$
  
 $d \geq \frac{C}{P - M}$   
 $d = \frac{C}{P - M} = \frac{(C/P)}{1 - (M/P)}$ 



WCET' 
$$\leq P \longrightarrow \frac{C}{d} + M \leq P$$
  
 $d \geq \frac{C}{P - M}$   
 $d = \frac{C}{P - M} = \frac{(C/P)}{1 - (M/P)}$ 

2. Sum of all duty cycles less than or equal to 1



WCET' 
$$\leq P \longrightarrow \frac{C}{d} + M \leq P$$
  
 $d \geq \frac{C}{P - M}$   
 $d = \frac{C}{P - M} = \frac{(C/P)}{1 - (M/P)}$ 

2. Sum of all duty cycles less than or equal to 1

$$\sum_{i=1}^n d_i \le 1$$

El-Haj-Mahmoud © 2004





#### **Generalized Analytical Framework**

Multiple tasks per VP:



## **Modeling Bus and Memory System**

- Addressed in detail in paper
- Analysis accounts for:
  - Worst-case task serialization on memory bus
  - DRAM bank conflicts
  - Multiple VPs sharing single DRAM bank

#### NC STATE UNIVERSITY



# Results



#### NC STATE UNIVERSITY



## **Results**



#### NC STATE UNIVERSITY **JCESR Results** EDF (no memory) EDF ■ WRR 1.2 **norst-case utilization** 0.0 0.4 0.2 mem component 0 1GHz 2GHz 1GHz 2GHz 1GHz 2GHz LOW HIGH MED **TASK-SETS** 8 tasks (2 tasks/VP for WRR)

#### **JCESR** NC STATE UNIVERSITY **Results** EDF (no memory) EDF ■ WRR 1.2 **norst-case utilization** 8.0 0.4 0.2 mem component 28% 50% 0 1GHz 2GHz 1GHz 2GHz 1GHz 2GHz LOW HIGH MED **TASK-SETS** 8 tasks (2 tasks/VP for WRR)

## Novel

|                                           | Memory<br>overlap | Formalism<br>(safety & tractability) |
|-------------------------------------------|-------------------|--------------------------------------|
| Classic real-time                         | No                | Yes                                  |
| Classic multithreading                    | Yes               | No                                   |
| Our real-time multithreading<br>framework | Yes               | Yes                                  |

## Useful

- Fully capitalize on high-frequency embedded microprocessors
- Exceed schedulability limit of conventional real-time theory for uniprocessors by analytically bounding WCET overlap

#### NC STATE UNIVERSITY

# Deployable

- Software-only solution
  - Use Ubicom IP3023 8-thread embedded microprocessor
  - Analytical framework + scheduling policy





El-Haj-Mahmoud © 2004

#### Summary

- Safely expose multithreading to hard-real-time schedulability analysis
- Bound computation / memory overlap
  - Offline closed-form schedulability test
  - Safe
  - Tractable
- Scale "Memory Wall" in embedded systems
  - Expose full benefits of high-frequency embedded processors



OCESR

# **Questions?**