Skip to content

Latest commit

 

History

History
 
 

section-2d

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 

Section 2d - Programming for multiple cores


This section will focus on the process of taking code written for a single core and transforming it into a design with multiple cores relatively quickly. For this we will start with the code in aie2.py which contains a simple design running on a single compute tile, and progressively turn it into the code in aie2_multi.py which contains the same design that distributes the work to three compute tiles.

The first step in the design is the tile declaration. In the simple design we use one Shim tile to bring data from external memory into the AIE array inside of a Mem tile that will then send the data to a compute tile, wait for the output and send it back to external memory through the Shim tile. Below is how those tiles are declared in the simple design:

ShimTile = tile(0, 0)
MemTile = tile(0, 1)
ComputeTile = tile(0, 2)

For our scale out design we will keep using a single Shim tile and a single Mem tile, but we will increase the number of compute tiles to three. We can do so cleanly and efficiently in the following way:

n_cores = 3

ShimTile = tile(0, 0)
MemTile = tile(0, 1)
ComputeTiles = [tile(0, 2 + i) for i in range(n_cores)]

Each compute tile can now be accessed by indexing into the ComputeTiles array.

Once the tiles have been declared, the next step is to set up the data movement using Object FIFOs. The simple design has a total of four double-buffered Object FIFOs and two object_fifo_links. The Object FIFOs move objects of datatype <48xi32>. of_in brings data from the Shim tile to the Mem tile and is linked to of_in0 which brings data from the Mem tile to the compute tile. For the output side, of_out0 brings data from the compute tile to the Mem tile where it is linked to of_out to bring the data out through the Shim tile. The corresponding code is shown below:

data_size = 48
buffer_depth = 2
data_ty = np.ndarray[(48,), np.dtype[np.int32]]


# Input data movement

of_in = object_fifo("in", ShimTile, MemTile, buffer_depth, data_ty)
of_in0 = object_fifo("in0", MemTile, ComputeTile, buffer_depth, data_ty)
object_fifo_link(of_in, of_in0)


# Output data movement

of_out = object_fifo("out", MemTile, ShimTile, buffer_depth, data_ty)
of_out0 = object_fifo("out0", ComputeTile, MemTile, buffer_depth, data_ty)
object_fifo_link(of_out0, of_out)

We can apply the same method as in the tile declaration to generate the data movement from the Mem tile to the three compute tiles and back (see distribute and join patterns). The object_fifo_link operations change from the 1-to-1 case to distributing the original <48xi32> data tensors to the three compute tiles as smaller <16xi32> tensors on the input side, and to joining the output from each compute tile to the Mem tile on the output side. Lists of Object FIFOs are used to keep track of the input and output Object FIFOs. With these changes the code becomes:

n_cores = 3
data_size = 48
tile_size = data_size // 3

buffer_depth = 2
data_ty = np.ndarray[(data_size,), np.dtype[np.int32]]
tile_ty = np.ndarray[(tile_size,), np.dtype[np.int32]]

# Input data movement
inX_fifos = []

of_in = object_fifo("in", ShimTile, MemTile, buffer_depth, data_ty)
for i in range(n_cores):
    inX_fifos.append(object_fifo(
        f"in{i}", MemTile, ComputeTiles[i], buffer_depth, tile_ty
    ))

# Calculate the offsets into the input/output data for the join/distribute
if n_cores > 1:
    of_offsets = [16 * i for i in range(n_cores)]
else:
    of_offsets = []
object_fifo_link(of_in, inX_fifos, [], of_offsets)


# Output data movement
outX_fifos = []

of_out = object_fifo("out", ShimTile, MemTile, buffer_depth, data_ty)
for i in range(n_cores):
    outX_fifos.append(object_fifo(
        f"out{i}", ComputeTiles[i], MemTile, buffer_depth, tile_ty
    ))
object_fifo_link(outX_fifos, of_out, of_offsets, [])

The core of this simple design acquires one object of each Object FIFO, adds 1 to each entry of the incoming data, copies it to the object of the outgoing Object FIFO, then releases both objects:

@core(ComputeTile)
def core_body():
    # Effective while(1)
    for _ in range_(0xFFFFFFFF):
        elem_in = of_in0.acquire(ObjectFifoPort.Consume, 1)
        elem_out = of_out0.acquire(ObjectFifoPort.Produce, 1)
        for i in range_(data_size):
            elem_out[i] = elem_in[i] + 1
        of_in0.release(ObjectFifoPort.Consume, 1)
        of_out0.release(ObjectFifoPort.Produce, 1)

Once again we apply the same logic and use a for-loop over our three cores to write the code which will be executed on the three compute tiles. Each tile will index the inX_fifos and outX_fifos maps to retrieve the Object FIFOs it will acquire and release from. This process results in the following code:

for i in range(n_cores):
    # Compute tile i
    @core(ComputeTiles[i])
    def core_body():
        for _ in range_(0xFFFFFFFF):
            elem_in = inX_fifos[i].acquire(ObjectFifoPort.Consume, 1)
            elem_out = outX_fifos[i].acquire(ObjectFifoPort.Produce, 1)
            for i in range_(tile_size):
                elem_out[i] = elem_in[i] + 1
            inX_fifos[i].release(ObjectFifoPort.Consume, 1)
            outX_fifos[i].release(ObjectFifoPort.Produce, 1)

[Prev - Section 2c] [Up] [Next - Section 2e]