Baseline analysis

Analysis

Vortex has three simulators: SimX, Vlsim, and Rtlsim. SimX is a cycle-level software simulator for Vortex and is well suited to architectural design-space exploration. Vlsim uses Verilator to simulate the full RTL design, with the AFU interface and the memory model implemented in software. Rtlsim simulates the processor RTL without the command processor (AFU), emulating an SoC environment in which the host and the accelerator share the same memory interface.

I chose the Verilator-based Vlsim to find the ideal hardware configuration for BFS because it is more accurate than SimX and faster than Rtlsim. The results were then used to configure Vortex on Intel's Arria 10 FPGA and to analyze the relationship between the application and the architecture. The BFS analyzed here is the push-based OpenCL implementation from the Rodinia benchmark suite, run on an input graph of 64k vertices.
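To make "push-based" concrete, here is a minimal Python sketch of a level-synchronous push-style BFS. It is an illustration only, not the Rodinia OpenCL kernel: the function name bfs_push, the adjacency-list representation, and the tiny example graph are all assumptions made for this sketch. In each level, every frontier vertex pushes its distance to its unvisited neighbors; on a GPU, the outer loop over the frontier would be spread across threads.

```python
def bfs_push(adj, source):
    """Level-synchronous push-based BFS (illustrative sketch).

    adj is a hypothetical adjacency list: adj[u] lists the neighbors of u.
    Returns the hop distance from source per vertex, or -1 if unreachable.
    """
    cost = [-1] * len(adj)
    cost[source] = 0
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:      # on a GPU: roughly one thread per frontier vertex
            for v in adj[u]:    # trip count equals the degree of u, so it varies per thread
                if cost[v] == -1:
                    cost[v] = cost[u] + 1   # u "pushes" its level to v
                    next_frontier.append(v)
        frontier = next_frontier
    return cost


if __name__ == "__main__":
    # Made-up 5-vertex graph; vertex 4 cannot be reached from vertex 0.
    example_graph = [[1, 2], [3], [3], [], [0]]
    print(bfs_push(example_graph, 0))
```

The inner loop is the point of interest for the hardware analysis: its length depends on each vertex's degree, so threads that share a warp can finish at very different times.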

Results for BFS

If the number of threads per warp is set to 2 (the minimum):

Optimal hardware configuration (based on cycle count): maximum cores, maximum warps, minimum threads per warp

Comments

The minimum threads-per-warp configuration is beneficial because of thread divergence in BFS. Prior work on a variable warp size architecture points out the same advantage but picks 4 rather than 2 threads per warp, citing area advantages. Some area analysis is therefore required to settle on a resource-aware configuration.
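The divergence argument can be illustrated with a toy model. The Python sketch below is not derived from the Vlsim runs. It assumes that each thread handles one vertex, that a thread's loop runs once per neighbor, and that a warp keeps all of its lanes occupied until its slowest thread finishes. The function name lane_utilization and the degree list are made up for illustration, and the model ignores memory latency, scheduling, and area.

```python
def lane_utilization(degrees, threads_per_warp):
    """Fraction of issued lane-iterations that do useful work (toy model).

    degrees[i] is the hypothetical loop trip count of thread i. Threads are
    grouped into warps in order, and each warp runs in lockstep for as many
    iterations as its highest-degree thread needs.
    """
    useful = sum(degrees)
    issued = 0
    for start in range(0, len(degrees), threads_per_warp):
        group = degrees[start:start + threads_per_warp]
        # Every lane of the warp is held for max(group) iterations; lanes in a
        # partially filled last warp are counted as idle.
        issued += max(group) * threads_per_warp
    return useful / issued if issued else 1.0


if __name__ == "__main__":
    # Made-up vertex degrees, not measured data.
    sample_degrees = [1, 8, 2, 1, 1, 1, 6, 2]
    for width in (2, 4, 8):
        print(width, round(lane_utilization(sample_degrees, width), 3))
```

For this made-up degree list, utilization falls as the warp widens from 2 to 4 to 8 threads, because a single high-degree vertex stalls more lanes. The model says nothing about the area cost of narrower warps, which is why a separate area analysis is still needed.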