Fast GPU math using bfMap
=========================

A key under-the-hood part of ``bifrost`` is the ``map`` function, that
provides a simple way to do fast arithmetic operations on the GPU. To
get acquainted, we will ignore all of the pipeline infrastructure that
bifrost provides, and just call the map function directly.

Let's suppose you have two ``ndarrays`` and wish to add them together on
the GPU. Here is how you would do that with ``map``:

.. code:: python

    import bifrost as bf

    # Create two arrays on the GPU, A and B, and an empty output C
    a = bf.ndarray([1,2,3,4,5], space='cuda')
    b = bf.ndarray([1,0,1,0,1], space='cuda')
    c = bf.ndarray(np.zeros(5), space='cuda')

    # Add them together
    bf.map("c = a + b", data={'c': c, 'a': a, 'b': b})
    print c
    # ndarray([ 2.,  2.,  4.,  4.,  6.])

The map function figures out what to do based on the string you give it.
These would also work:

.. code:: python

    bf.map("c = a - b", data={'c': c, 'a': a, 'b': b})
    bf.map("c = a * b", data={'c': c, 'a': a, 'b': b})
    bf.map("c = a / b", data={'c': c, 'a': a, 'b': b})
    bf.map("c += a + b", data={'c': c, 'a': a, 'b': b})
    bf.map("c /= a + b", data={'c': c, 'a': a, 'b': b})
    bf.map("c *= a + b", data={'c': c, 'a': a, 'b': b})

Getting deeper requires a look at the docstring:

.. code:: python

    def map(func_string, *args, **kwargs):
        """Apply a function to a set of ndarrays.

        Args:
          func_string (str): The function to apply to the arrays, as a string (see
                       below for examples).
          data (dict): Map of string names to ndarrays or scalars.
          axis_names (list): List of string names by which each axis is referenced
                       in func_string.
          shape:       The shape of the computation. If None, the broadcast shape
                       of all data arrays is used.
          func_name (str): Name of the function, for debugging purposes.
          extra_code (str): Additional code to be included at global scope.
          block_shape: The 2D shape of the thread block (y,x) with which the kernel
                       is launched.
                       This is a performance tuning parameter.
                       If NULL, a heuristic is used to select the block shape.
                       Changes to this parameter do _not_ require re-compilation of
                       the kernel.
          block_axes:  List of axis indices (or names) specifying the 2 computation
                       axes to which the thread block (y,x) is mapped.
                       This is a performance tuning parameter.
                       If NULL, a heuristic is used to select the block axes.
                       Values may be negative for reverse indexing.
                       Changes to this parameter _do_ require re-compilation of the
                       kernel.

        Note:
            Only GPU computation is currently supported.

        Examples::

          # Add two arrays together
          bf.map("c = a + b", {'c': c, 'a': a, 'b': b})

          # Compute outer product of two arrays
          bf.map("c(i,j) = a(i) * b(j)",
                 {'c': c, 'a': a, 'b': b},
                 axis_names=('i','j'))

          # Split the components of a complex array
          bf.map("a = c.real; b = c.imag", {'c': c, 'a': a, 'b': b})

          # Raise an array to a scalar power
          bf.map("c = pow(a, p)", {'c': c, 'a': a, 'p': 2.0})

          # Slice an array with a scalar index
          bf.map("c(i) = a(i,k)", {'c': c, 'a': a, 'k': 7}, ['i'], shape=c.shape)
        """

Let's look a bit closer at that outer product example. Here, by
convention of summation notation, the indexes 'i', 'j' on the two arrays
A and B, create an outer product. A full example:

.. code:: python

    import bifrost as bf

    # Create two arrays on the GPU, A and B, and an empty output C
    a = bf.ndarray([1,2,3,4,5], space='cuda')
    b = bf.ndarray([1,0,1,0,1], space='cuda')
    c = bf.ndarray(np.zeros((5, 5)), space='cuda')

    # Compute outer product
    bf.map("c(i,j) = a(i) * b(j)",
           axis_names=['i', 'j'],
           data={'c': c, 'a': a, 'b': b})
    print c

    # ndarray([[ 1.,  0.,  1.,  0.,  1.],
    #          [ 2.,  0.,  2.,  0.,  2.],
    #          [ 3.,  0.,  3.,  0.,  3.],
    #          [ 4.,  0.,  4.,  0.,  4.],
    #          [ 5.,  0.,  5.,  0.,  5.]])

The first example of ``c = a + b`` could also be written more explicitly as:

.. code:: python

    bf.map("c(i) = a(i) + b(i)", axis_names=['i'], data={'c': c, 'a': a, 'b': b})

Note, however, that implicit indexing should be preferred where possible, as
explicit indexing may exhibit worse performance.