## VLIW-based FPGA Computation Fabric with Streaming Memory Hierarchy for Medical Imaging Applications

Joost J. Hoozemans – Computer Engineering, TU Delft

Tuesday, 21 March 2017













#### ρ-VEX Overview









- Dynamically reconfigurable VLIW processor
- Adapt processor to workload
- Assign resources (datapaths) to threads
- Developed at TU Delft
- Prototyped on FPGA (softcore)





#### EU Project ALMARVI consortium








#### **PHILIPS** Healthcare



- Real-time feedback (continuous radiation)
- Reduce radiation doses
- Keep/improve image quality
- Limit latency
- Maintainability: ~15 years













- Network latency between machines
- Hardware availability too short
- Multiple machines -> difficult to debug





#### Acceleration

- Possibility: FPGA
- Large amounts of resources (parallelism!)
- Long availability





#### **FPGA**

FPGA design:

- VHDL (hand-written)
- High-Level Synthesis (HLS) generated from C code
- Synthesis time-consuming (hours)



#### Alternative: Softcore

- Processor running on FPGA
- Synthesize once
- Compiling, running, debugging as normal









#### Softcore processor: ρ-VEX



- Written in VHDL
- Highly generic: configurable at design time as well as at run time





#### Softcore processor: ρ-VEX

- Workload characteristics:
  - Image processing
  - Lots of parallelism
    - ILP (instruction-level parallelism)
    - DLP (data-level parallelism)
  - Streaming
- Aims: small area footprint, high clock frequency
  - 2-issue VLIW
  - Forwarding disabled (decreases area, increases clock frequency)
  - Loop unrolling limits the performance penalty of disabling forwarding
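The interaction between disabled forwarding and loop unrolling can be illustrated with a toy issue model. This is only a sketch under stated assumptions: the function name `schedule_cycles`, the three-operation dependency chain per pixel, and the result latencies (1 cycle with forwarding, 3 cycles without) are hypothetical values chosen for illustration, not measured ρ-VEX figures. Each unrolled iteration contributes one independent chain, and a greedy in-order scheduler fills the two issue slots.

```python
def schedule_cycles(num_chains, chain_len, issue_width, latency):
    """Cycles needed to issue `num_chains` independent dependency chains.

    Each operation depends on the previous one in its chain and may issue
    no earlier than `latency` cycles after it. `latency` stands in for the
    forwarding setting (hypothetical values, see the text above).
    """
    next_op = [0] * num_chains    # index of the next unissued op per chain
    ready_at = [0] * num_chains   # earliest cycle that op may issue
    remaining = num_chains * chain_len
    cycle = 0
    while remaining:
        issued = 0
        for c in range(num_chains):
            if issued == issue_width:
                break
            if next_op[c] < chain_len and ready_at[c] <= cycle:
                next_op[c] += 1
                ready_at[c] = cycle + latency
                issued += 1
                remaining -= 1
        cycle += 1
    return cycle


# One chain per unrolled iteration: e.g. load -> compute -> store (3 ops).
for unroll in (1, 4, 6):
    without_fwd = schedule_cycles(unroll, 3, issue_width=2, latency=3)
    with_fwd = schedule_cycles(unroll, 3, issue_width=2, latency=1)
    print(unroll, without_fwd / unroll, with_fwd / unroll)
```

In this model, the cost per pixel without forwarding drops from 7 cycles (no unrolling) to 2 cycles (unrolled 4 times) and 1.5 cycles (unrolled 6 times), where it matches the forwarding case. The model ignores register pressure, memory ports and branch overhead, so it shows the trend rather than real performance.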



#### Memory Hierarchy

- Image processing pipeline
- Stages stream data from one filter to the next
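The streaming idea can be sketched in software with Python generators: each stage consumes pixels as soon as the previous stage produces them, so no stage needs to buffer a full frame. The filters used here (`brighten`, `threshold`) and their parameters are hypothetical placeholders, not the filters of the actual imaging pipeline.

```python
def source(pixels):
    """First stage: emit pixels one at a time."""
    for p in pixels:
        yield p


def brighten(stream, offset):
    """Hypothetical filter: add an offset, saturating at 255."""
    for p in stream:
        yield min(p + offset, 255)


def threshold(stream, level):
    """Hypothetical filter: binarize each pixel against a level."""
    for p in stream:
        yield 255 if p >= level else 0


def run_pipeline(pixels):
    # Stages are chained; each pixel flows through all filters in turn.
    return list(threshold(brighten(source(pixels), 40), 128))


print(run_pipeline([0, 100, 200, 250]))  # → [0, 255, 255, 255]
```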












#### Implementation






Block properties of `rvex_stream_ip_top_0` (values ending in … are cut off in the screenshot):

| Property                | Value       |
|-------------------------|-------------|
| CLASS                   | bd_cell     |
| AXI_ADDRW_G             | 28          |
| CORE_ID_OFFS            | 0           |
| Component_Name          | system rve… |
| DEBUG_BUS_MUX_BIT       | 14          |
| DMEM_DEPTH_LOG2         | 12          |
| ENABLE_FORWARDING       | 0           |
| ENABLE_TRAPS            | 0           |
| IMEM_DEPTH_LOG2         | 12          |
| NUM_CORES_PER_STREAM    | 4           |
| NUM_STREAMS             | 16          |
| STREAM_OUTPUT_BASE_ADDR | 0           |
| STREAM_WIDTH            | 33          |
| S_ACLK_F_KHZ            | 25000       |
| LOCATION                | 7 2750 31(… |
| NAME                    | rvex strear… |
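A few derived quantities follow directly from the block properties shown above. The sketch below assumes that the total core count is the number of streams times the cores per stream, and that the `*_DEPTH_LOG2` properties give the base-2 logarithm of the memory depth; the unit of that depth (bytes or words) is not stated in the slide and is left open. The dictionary keys use underscores in place of the spaces shown in the screenshot, and `derive` is a hypothetical helper name.

```python
# Values copied from the block properties; key spelling is an assumption.
config = {
    "NUM_STREAMS": 16,
    "NUM_CORES_PER_STREAM": 4,
    "IMEM_DEPTH_LOG2": 12,
    "DMEM_DEPTH_LOG2": 12,
    "S_ACLK_F_KHZ": 25000,
}


def derive(cfg):
    """Compute simple derived figures from the configuration."""
    return {
        # Assumption: every stream holds the same number of cores.
        "total_cores": cfg["NUM_STREAMS"] * cfg["NUM_CORES_PER_STREAM"],
        # Depth in entries; the entry size is not given in the slide.
        "imem_entries": 1 << cfg["IMEM_DEPTH_LOG2"],
        "dmem_entries": 1 << cfg["DMEM_DEPTH_LOG2"],
        "s_aclk_mhz": cfg["S_ACLK_F_KHZ"] / 1000,
    }


print(derive(config))
```

Under these assumptions the configuration describes 64 cores, each with 4096-entry instruction and data memories, and a 25 MHz `S_ACLK`.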



#### Synthesized Design











- **Applications:** image processing pipeline (Rolf, Joost); Doom (Koray, Jeroen); demos (Muneeb, Joost, Jeroen); benchmarks SPEC, MiBench, Mälardalen, Powerstone (Anthony, Joost)
- **Operating system support:** Linux (mainly Joost; some low-level code written/fixed/updated by Anthony & Jeroen); FreeRTOS (Jeroen, Muneeb)
- **Runtime libraries:** Newlib (Joost, Anthony); uClibc (Tom, Joost); floating point & division, math (Joost)
- **Compilers:** HP VEX; GCC (IBM, Anthony, Joost); Cosy (Hugo); LLVM (Maurice, Hugo); Open64 (Joost)
- **Binutils:** assembler, linker, etc. (Anthony); VEXparse (Anthony, Jeroen)
- **Architectural simulator** (Joost)
- **Debug:** hardware, tools and interface (Jeroen)
- **Hardware design:** VHDL (Jeroen)
- **ASIC manufacturing effort:** core (Lennart), interface (Shizao), supported by Jeroen





#### ρ-VEX: the Dynamically Reconfigurable VLIW Processor

#### What is the ρ-VEX?

The ρ-VEX is a reconfigurable and extensible VLIW processor.

- *VLIW:* stands for "very long instruction word". The processor can issue multiple instructions in parallel, and the choice of which instructions run in parallel is made at compile time. The name reflects that each instruction word has to describe multiple independent instructions.

The ρ-VEX can work as:

- One 8-way datapath
- Two independent 4-way datapaths
- Four independent 2-way datapaths
- One 4-way + two 2-way datapaths
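The four datapath configurations listed above are exactly the ways of splitting 8 issue lanes into groups of 2, 4 or 8 lanes. The short sketch below enumerates them; it assumes that all 8 lanes are always assigned and that only these group widths are allowed, and it does not model which physical lanes end up in which group. The function name `lane_group_configs` is hypothetical.

```python
def lane_group_configs(total_lanes=8, widths=(8, 4, 2)):
    """List all ways to split `total_lanes` into groups of the given widths.

    Each configuration is a list of group widths in non-increasing order,
    so that [4, 2, 2] and [2, 4, 2] are counted once.
    """
    def rec(remaining, max_width):
        if remaining == 0:
            return [[]]
        out = []
        for w in widths:
            if w <= max_width and w <= remaining:
                out += [[w] + rest for rest in rec(remaining - w, w)]
        return out

    return rec(total_lanes, total_lanes)


print(lane_group_configs())
# → [[8], [4, 4], [4, 2, 2], [2, 2, 2, 2]]
```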

http://rvex.ewi.tudelft.nl





#### Applied Reconfigurable Computing ARC 2017

- Dates: April 3 - 7
- Website: arc2017.tudelft.nl
- Location (keynotes & paper sessions): EWI, Collegezaal Boole






