Barra - NVIDIA GPU Architecture Simulator

What is Barra?

Barra simulates CUDA programs at the assembly language level (Tesla ISA). Its ultimate goal is to provide a 100% bit-accurate simulation, offering bug-for-bug compatibility with NVIDIA G80-based GPUs. It works directly with CUDA executables; neither source modification nor recompilation is required.

Barra is primarily intended as a tool for research on computer architecture, although it can also be used to debug, profile and optimize CUDA programs at the lowest level.

Project page

Getting Barra

Source Tarballs

Development source repository

Quick start

See the Barra Tutorial.

What's new?

See the Changelog.

Installation

Supported Platforms

Barra has been tested on GNU/Linux, both i386 and x86_64. It should be easily portable to Win32 with Cygwin and to Mac OS X, but neither port has been tested so far.

Binary packages require a CPU with SSE2 support (Intel Pentium 4, AMD Athlon 64 or newer), and a compatible Linux distribution.

Tested platforms:

  • Fedora 18 x86_64
  • Ubuntu 8.04 x86_64
  • Ubuntu 8.10 i386
  • Debian Lenny x86_64

CUDA 2.0, 2.1, 2.2 and 2.3 are supported, as well as CUDA 3.0 beta (configured for non-ELF cubins).

Requirements

Binary package:

  • libc6 (>= 2.7)
  • libstdc++6 (>= 4.2)
  • libxml2 (2.6.30 recommended)
  • ncurses (>= 5.6 recommended)
  • zlib (1.2.3 recommended)
  • libboost-thread1.35

Sources:

  • Development versions of the above packages
  • automake (1.10 recommended), autoconf
  • libtool (2.2.4 recommended)

Installing

From sources:

Refer to Compiling Barra.
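
Building from sources typically follows the standard autotools workflow (the requirements above list automake, autoconf and libtool). The commands below are only a sketch with an example prefix; Compiling Barra remains the authoritative reference for the exact steps and configure options:

autoreconf -i                 # repository checkouts only; release tarballs usually ship configure
./configure --prefix=/usr/local/barra-0.4-linux_x86_64
make
make install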

If any of these steps fail, please read Barra Troubleshooting.

Working With Barra

Barra consists of a simulator and a driver. The Barra driver is a dynamic library that exports the same symbols and API as the NVIDIA CUDA Driver library (libcuda.so/cuda.dll).

How Does It Work?

To simulate a CUDA program which uses the Driver API, we temporarily replace the NVIDIA-provided CUDA Driver library with our Barra library by setting the LD_LIBRARY_PATH or PATH variable. This way, cuXxx calls are redirected to the Barra driver, which can then configure and run the simulator as if it were an actual GPU. When cuLaunch or cuLaunchGrid is called, control is transferred to the simulator for execution.

Programs that use the CUDA Runtime API are still linked with the official CUDA Runtime library provided by NVIDIA (libcudart.so/cudart.dll). This Runtime library is only a wrapper over the Driver library, which translates cudaXxx calls into cuYyy calls. We can then trick the Runtime library into using our driver instead of NVIDIA's driver, just as we do with programs using the Driver API directly.
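
One quick way to verify that the redirection is in place is to ask the dynamic linker which libcuda.so an executable will load. This is only a sketch, reusing the installation path from the example in the next section:

export LD_LIBRARY_PATH="/usr/local/barra-0.4-linux_x86_64/lib/:$LD_LIBRARY_PATH"
ldd ./matrixMul | grep libcuda    # should now resolve to Barra's library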


How To Use It?

Let's assume we want to simulate the matrixMul sample of the NVIDIA CUDA SDK under GNU/Linux and Barra was installed in /usr/local/barra-0.4-linux_x86_64.

We temporarily prepend the Barra lib directory to the default library search path, in addition to specifying where libcudart.so resides (the latter may not be required, depending on how the CUDA toolkit was installed):

export LD_LIBRARY_PATH="/usr/local/barra-0.4-linux_x86_64/lib/:/usr/local/cuda/lib"

Then, we can launch our executable:

cd NVIDIA_CUDA_SDK/bin/linux/debug
./matrixMul

Barra then outputs lots of debug information (CUDA function calls, cubin data, disassembly, memory allocation, thread scheduling...) during program execution.
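
Since this output can be voluminous, one way to capture it for later inspection is to redirect both standard output and standard error to a file (trace output, described below, goes to stderr):

./matrixMul > barra.log 2>&1
less barra.log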

Multithreaded simulation

By default, a single Streaming Multiprocessor is simulated by one host thread. Multiple SMs can be simulated by independent host threads to accelerate simulation on multi-core and multi-processor machines. To enable this feature, set the environment variable CORE_COUNT to the number of threads (and SMs) to use, typically the number of logical processors of the host computer:

export CORE_COUNT=4
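
To pick up the host's processor count automatically, the variable can be set from getconf (a portable sketch):

export CORE_COUNT=$(getconf _NPROCESSORS_ONLN)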

Hacking, Debugging

As a computer architecture research tool, Barra is designed to be modified to suit the user's needs (e.g. gathering statistics on instructions, generating traces, etc.).

Statistics gathering can be enabled by setting the environment variable EXPORT_STATS:

export EXPORT_STATS=1

For each kernel run, a file named <kernelname>.csv will be created in the current directory. (Note that for C++ applications, the kernel name will be the mangled name, such as __globfunc__Z9matrixMulPfS_S_ii.)

This file can be opened in any spreadsheet application, and provides the following data for each kernel instruction:

  • Address: instruction address
  • Name: instruction mnemonic
  • Executed: number of times it was executed
  • Exec. scalar: number of SIMD channels it was executed on
  • Integer: if it is an integer instruction
  • FP32: if it is a single-precision floating-point instruction
  • Flow: if it is a control-flow instruction
  • Memory: if it accesses global or local memory
  • Shared: if it accesses shared memory
  • Constant: if it accesses constant memory
  • Input regs: number of input operands from the register file
  • Output regs: number of output operands to the register file
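
The statistics can also be processed directly from the shell. Below is a minimal sketch, assuming comma-separated fields, one header row, and the column order listed above (Executed is then the third field; the kernel name is the mangled example from earlier):

export EXPORT_STATS=1
./matrixMul
# Total dynamic instruction count: sum the "Executed" column
awk -F, 'NR > 1 { total += $3 } END { print total }' __globfunc__Z9matrixMulPfS_S_ii.csv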

Some support is also present for generating debug traces. These are intended for manual debugging, and the trace format may change without notice. Several environment variables control the verbosity of the traces:

  • TRACE_INSN outputs to stderr each instruction executed along with the warp number and program counter. All subsequent trace types are designed to be used with TRACE_INSN.
  • TRACE_MASK outputs the current predication mask of the warp after (not during) each instruction.
  • TRACE_REG outputs the destination register of each instruction executed, in hex.
  • TRACE_REG_FLOAT outputs the same values as TRACE_REG, but interpreted as floating-point. Requires TRACE_REG.
  • TRACE_LOADSTORE traces every load and store from/to any memory type. *Very* verbose.
  • TRACE_BRANCH controls the output of the SIMD branching algorithms.
  • TRACE_SYNC outputs which warps are waiting at synchronization barriers.

Enabling tracing is done by setting the variable to 1 before running Barra. For example:

export TRACE_INSN=1

Tracing is disabled by setting each variable to 0:

export TRACE_INSN=0
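
Trace variables can also be set for a single run only, using the usual shell syntax. For example, to trace executed instructions together with their register results and save the stderr trace to a file:

TRACE_INSN=1 TRACE_REG=1 ./matrixMul 2> trace.log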

Features

Supported Features

  • Simulator
    • Integer arithmetic on 32-bit and 16-bit registers, floating-point arithmetic, bitwise instructions.
    • Memory scatter/gather instructions from/to global and local memory.
    • Shared and constant memory.
    • Control flow instructions.
    • Reciprocal, reciprocal square root and transcendental instructions (not bit-accurate).
    • Synchronization barrier instruction.
    • Integer texture sampling over linear memory.
  • Driver
    • Most of the CUDA Driver API.
    • CUDA runtime API, through NVIDIA-provided libcudart.so/cudart.dll.
    • Support for cubin files and Fat Executables (through CUDA Runtime).
    • Host <-> Device linear memory copy.
    • Multithreaded simulation.

Regression testing matrix

Unimplemented Features (aka TODO List)

  • Simulator
    • Atomic instructions.
    • Warp vote instructions.
    • Double precision instructions.
    • Complete texture sampling.
    • Bit-accurate transcendentals.
    • Run-time checks. As Barra is primarily designed to run valid CUDA benchmarks, few safety checks are performed on instruction validity, memory addresses, etc. Running an invalid or buggy CUDA program is likely to result in a segmentation fault.
  • Driver
    • Asynchronous execution.
    • Streams.
    • Complete texture support.
    • Arrays.
    • Multiple contexts.
    • Multiple devices.

How Fast (Or How Slow) Is It?

On a dual-core CPU, Barra runs 4 times faster on average (ranging from 8 times slower to 10 times faster) than source-level emulation in debug mode (nvcc --deviceemu), which is itself several orders of magnitude slower than execution on a high-end GPU. Barra is competitive with the emulator of the Ocelot project, and an order of magnitude faster than the CUDA Debugger.



Test platform: Core 2 Duo E8400, GeForce 9800 GX2, CUDA 2.2, gcc 4.3, Ocelot 0.4.46

What We Plan To Do Next

  • More statistics about the CUDA code: coalesced memory accesses, bank conflicts in shared and constant memory...
  • Transaction-Level Modeling of the G80 memory architecture to provide a realistic timing model.
  • (Close to) cycle-accurate modeling of Streaming Multiprocessors.
  • Modeling of power consumption.
  • Simulation speed optimization.

About Us

Authors

Credits

Barra is supported by:

Feedback

Contact: sylvain.collange at inria.fr

Bug reports and comments are welcome.

Thanks

  • Wladimir J. van der Laan, for his amazing work on recovering the G80 instruction set and for developing the decuda and cudasm tools, without which this work would not have been possible.
  • Hendra Sumilo for his feedback on installing Barra on CentOS 5.2.
  • Fabrice Ferrand for his work on a Win32 port.
  • Guillaume Yziquel for providing a Debian package of Barra.

Publications related to Barra