Hello, dear friend, you can consult us at any time if you have any questions, add WeChat: daixieit

KIT308 Multicore Architecture and Programming

Assignment 2 — SIMD

Aims of the Assignment

The purpose of this assignment is to give you experience at writing a program using SIMD programming techniques. This assignment will give you an opportunity to demonstrate your understanding of:

the where there is one there is many approach (WTIOTIM);

structures of arrays;

the use of SIMD for calculation; and

the translation of conditional statements into SIMD.

Due Date

11:55pm Friday 22nd of September (Week 10 of semester)

Late assignments will only be accepted in exceptional circumstances and provided that the proper procedures have been followed (see the School Office or this link for details) assignments which are submitted late without good reason will be subject to mark penalties if they are accepted at all (see the School office or this link for details on this as well).

Forms to request extensions of time to submit assignments are available from the Discipline of ICT office. Requests must be accompanied by suitable documentation and should be submitted before the assignment due date.

Assignment Submission

Your assignment is to be submitted electronically via MyLO and should contain:

A .zip (or .rar) containing a Visual Studio Solution containing a project for each attempted stage of the assignment (in the format provided in the downloadable materials provided below).

A document containing:

A table of timing information comparing the provided multi-threaded base code, the provided base code with SIMD sphere intersections, and the Stages 1–4 implementation on each scene file (all running with the maximum threads natively supported by your CPU).

An analysis of the above timing data.

You do not need to (and shouldn't) submit executables, temporary object files, or images. In particular, you must delete the ".vs" diretory before submission as it just Visual Studio temporary files and 100s of MBs. Do not however delete the "Scenes" folder or the "Outputs" folder (but do delete the images within this one).

Marking

This assignment will be marked out of 100 (NOTE: there are 120 marks available, but it's only possible to receive a maximum of 100). NOTE: if your code for a particular stage does not correctly produce the expected output images, you will only be able to receive a maximum of half marks for that stage — see below for more details. A perfect solution of Stages 1–3 with timings and analysis would obtain 88%.

The following is the breakdown of marks:

Task/Topic        Marks

1. Conversion of Distance Calculation to Where There is One there is Many (WTIOTIM) 10%

Correct implementation of WTIOTIM approach to isTriangleIntersected function to calculate distance to nearest object (in Intersection.cpp), and conversion of its usage in the objectIntersection function (also in Intersection.cpp) 5%

Correct implementation of WTIOTIM approach to create a short-circuiting isTriangleIntersected function to return (as soon as possible) whether any triangle intersects (in Intersection.cpp), and conversion of its usage in the isInShadow function (in Lighting.cpp) 5%

2. Conversion of Scene Objects to SoA 15%

Correct declarations of data structures for SoA SIMD form of Triangle container data (don't delete the AoS versions), and SoA SIMD form of data uses minimal (sensible) amount of memory 5%

Correct code to convert AoS container to dynamically declared SoA structures (conversion should happen after the scene has been loaded) 5%

Correct conversion of isTriangleIntersected functions to use SoA copy of Triangle data 5%

3. AVX SIMD Conversion of Triangle Intersection Calculation 55%

A. SIMD Conversion of short-circuiting isTriangleIntersected function 40%

Correct conversion of scalar inputs to SIMD 5%

Correct and efficient intersection point t0 calculation 5%

Correct and efficient (e.g. no use of if-statements) handling of cases where no intersection occurs due to ray being parallel(ish) with triangle (i.e. first if-statement) 5%

Correct and efficient (e.g. no use of if-statements) handling of cases where no intersection occurs due to point being outside of triangle (i.e. second and third if-statements) 5%

Correct and efficient (e.g. no use of if-statements) handling of cases where intersection does occur (i.e. fourth if statement) 5%

Correct and efficient (e.g. no use of if-statements, for-loops, or scalar code) calculation of intersection data 5%

Correct declaration of loop constants before loop 5%

No remaining scalar code (except WTIOTIM for-loop) and correct final output of SIMD code (including cases where number of triangles isn't divisible by 8) 5%

B. SIMD Conversion of non-short-circuiting isTriangleIntersected function 15%

Correct (i.e. corresponding to scalar minimum) and efficient (e.g. no use of if-statements) SIMD calculation of closest intersection (i.e. t)

(NOTE: full marks can be gained for this component if SIMD is correctly used in the main loop, but a scalar loop is required to determine the final minimum from a SIMD vector of minimums) 5%

Correct (i.e. corresponding to scalar index) and efficient (e.g. no use of if-statements) SIMD calculation of corresponding triangle index (i.e. index)

(NOTE: full marks can be gained for this component if SIMD is correctly used in the main loop, but a scalar loop is required to determine the final index from a SIMD vector of indexes) 5%

Correct (i.e. corresponding to scalar values) and efficient (e.g. no use of for-statements) SIMD ("horizontal") calculation to determine closest intersection and index (i.e. t and index) 5%

4. AVX SIMD Conversion and General Optimisation of Lighting Calculation 30%

(NOTE: this section will require SoA declarations and conversion from AoS for Light container data)

Correct SIMD conversion of applySpecular function (i.e. so it handles 8 lights at once) 5%

Correct and efficient (i.e. common terms identified) inlining of applyCheckboard, applyCircles, and applyWood functions within applyDiffuse function 5%

Correct and efficient (e.g. no use of switch-statements) SIMD conversion of applyDiffuse function (i.e. so it handles 8 lights at once) 5%

Correct SIMD conversion of applyLighting function (NOTE: calling isInShadow should use a scalar loop, as it's not possible to use SIMD here) 5%

Correct determination and declaration of all loop constants before main loop in applyLighting function (NOTE: this requires either inlining of one or more subfunctions or splitting these function(s) into multiple parts) 10%

Documentation 10%

Outputs showing timing information for the base assignment code, base code with SIMD sphere intersections, and Stages 1–4 (with the maximum threads natively supported by your CPU) on all applicable scene files 5%

Analysis of data (comparisons across the base code, base code with SIMD sphere intersections, and Stages 1–4) 5%

Penalties

Failure to comply with submission instructions (eg. no cover sheet, incorrect submission of files, abnormal solution/project structure, etc.) -10%

Poor programming style (eg. insufficient / poor comments, poor variable names, poor indenting, obfuscated code without documentation, compiler warnings, etc.) up to -20%

Lateness (-20% for up to 24 hours, -50% for up to 7 days, -100% after 7 days) up to -100%

Failure to use the techniques outlined in this unit's lectures and tutorials, resulting in a wildly different solution, using un-taught libraries (eg. an external library for SIMD, etc.), un-taught techniques, or a vastly different rendering implementation up to -100%

Marking and Correct Images

As SIMD code is very very difficult to debug — as grows exponentially harder as the amount of it grows — there will be limited opportunity in the marking process to determine where exactly mistakes have been made, and even less chance of being able to provide fixes to those mistakes.

As a result, if your code for a stage does not produce the expected image, then you will only be eligible for half marks for that stage of the assignment.

In order to work within this constraint, you should attempt translations into SIMD in very small steps (i.e. a single line at a time, or even a partial line at a time). The lectures and live-coding sessions will demonstrate this approach to SIMD translation. You will likely receive more marks for a partially complete SIMD translation than a fully complete translation that doesn't work 100%.

Stage 4

As this stage may require the use of less accurate approximations to some mathematical functions, the concept of "correct" image will be somewhat relaxed from having to be an exact match as specified by ImageMagick — an acceptable visual match will likely be enough may be considered as correct enough if the ImageMagick comparison values are very small, and the cause of the difference can't be easily corrected by the marker. For reference, the solution to Stage 4 of this assignment (at the time of writing this specification document) only differs by: a single channel being –1 different, on only one pixel, on only one image of the 15 in the testing set.

Programming Style

This assignment is not focussed on programming style (although it is concerned with efficient code), but you should endeavour to follow good programming practices. You should, for example:

comment your code;

use sensible variables names;

use correct and consistent indenting; and

internally document (with comments) any notable design decisions.

[NOTE: any examples in the provided assignment materials that don't live up to the above criteria, should be considered to be deliberate examples of what not to do and are provided to aid your learning ;P]

The Assignment Task

You are to modify the (square-based) multithreaded implementation of a "simple" raytracer from the first assignment to take advantage of SIMD instructions. As time is tight for this assignment you only need to SIMD-ify the triangle/ray intersection code and the lighting code. This requires changes across multiple files.

To aid you in this conversion, the sphere/ray intersection code has already (mostly) been translated into SIMD code (by following the techniques required to do stages 1–3 of the assignment). Note that this provided example still has a scalar loop for determining the final values in the non-short-circuiting version of isSphereIntersected and that part of Stage 3B is to create a similar function for triangle intersections that does NOT have this limitation (i.e. having a similar for loop would not get you all of the Stage 3B marks).

From the provided (square-based) multithreaded raytracer implementation, for Stages 1–3 of the assignment you will create multiple subsequent versions that modify the triangle implementation as follows:

1. Rewritten functions for the triangle/ray intersection tests in a where there is one there is many approach.

2. Translation of array-of-structures (AoS) with structures-of-arrays (SoA) for the triangle container.

3. Optimisation of the triangle/ray intersection test to take advantage of SIMD.

For Stage 4, you will follow a similar pattern, but for lights and the lighting calculations.

Implementation — SIMD Distance (Stages 1–3)

The following section describes in detail the steps required to complete Stages 1–3 of the assignment. If you complete these steps, only then should you attempt Stage 4 (as you will likely find it more difficult).

1. Where there is One there is Many

This stage involves replacing the isTriangleIntersected function with two new versions that use the where there is one there is many paradigm.

In order to complete this step you will need to:

Write two functions that take the entire triangles container (or at least a pointer/reference) as an argument and will perform whatever action took place at the calling site. There are two versions as the isInShadow function (in Lighting.cpp) uses a short circuited approach, exiting as soon as any triangle collision is discovered, and the objectIntersection function (in Intersection.cpp) finds the closest triangle intersected with (which means it must examine them all).

A version of isTriangleIntersected that returns a boolean depending on whether or not a triangle intersects with the ray (i.e. stopping as soon as one is found).

A version of isTriangleIntersected that returns a boolean (as above), updates its t parameter to be the closest triangle that intersects with the ray, and updates its index parameter to specify the index of the closest triangle.

2. Structures of Arrays

This stage involves modifying the Scene struct (Scene.h) to store a duplication of the triangle data via structure of arrays rather than the currently existing array of structures (currently in the triangleContainer variable).

In order to complete this step you will need to:

Create a SoA copy of the data that is stored in triangleContainer in the Scene struct.

NOTE: this means the program has two copies of the triangle data, the original one in AoS form and a second one in SoA form. As this assignment predominately only requires changes to isTriangleIntersected (and the sites of the calls to this function) the rest of the code will still use the original AoS version of the data.

Fill this SoA copy of the triangle data after the Scene struct has been loaded (dynamically allocating memory for the SoA representation).

Rewrite the isTriangleIntersected function to make use of the SoA.

Hints:

The best place to do the conversion is immediately after loading the scene (i.e. in Main.cpp).

Triangle (and Sphere) is defined in SceneObjects.h. Both the AoS and SoA versions of the sphere containers are defined in the Scene struct in Scene.h.

You should not need to attempt to modify Scene.cpp to complete this step.

At times this code update may require conversions to and/or from normal structs (such as Vector and Point) to the equivalently stored data in the SoA.

It is up to you whether your SoA datastructure uses SIMD datatypes or not.

At the end of this stage the program should still produce the same results as the base code. Be thorough in your testing (i.e. test all the appropriate scenes) to ensure that everything works correctly before progressing.

3A. SIMD Conversion of isTriangleIntersected function

This stage involves converting the short-circuiting version of the isTriangleIntersected function from Stage 2 into a SIMD implementation.

In order to complete this step you will need to:

Convert many of the input parameters (or parts of them) into SIMD-ready values (e.g. the ray starting location and direction need to be converted into a SIMD format).

Step through the SoA in chunks proportional to the number of value stored in each SIMD variable (i.e. 8).

Convert the calculation to use SIMD. Care must be taken to correctly deal with the various conditional expressions in the loop. There are four if-statements that must be converted to SIMD code via the use of appropriate masking statements.

Carefully deal with the situation where the triangle count isn't equally divisible by the number in each SIMD calculation (e.g. 9 triangle with 4 in each SIMD value). NOTE: this will likely be handled by your AoS to SoA conversion from Stage 2, but if it isn'tyou'll need to write special masking code to handle this situation.

3B. SIMD Conversion of isTriangleIntersected function

This stage involves converting the non-short-circuiting version of the isTriangleIntersected function from Stage 2 into a SIMD implementation.

Most of this conversion is identical to step 3, and the code can just be copied from that function. It does however require the addition of:

New variables to keep track of the closest intersection found, and which triangle index it was found at.

Consolidation of the values calculated using SIMD into a single scalar return value. This should be done after the loop finishes.

Hints:

You do not want to return the minimum index value (i.e. the smallest of all the indices in a SIMD vector), you want to return the index corresponding to the minimum distance found (i.e. the index in the same lane as the minimum distance).

General Hints / Tips

The techniques required to complete each stage rely heavily on work done in the SIMD tutorials and the step-by-step approach shown in lectures — refer to them often.

When implementing the SIMD stage, it's best to SIMD-ify small portions of code at a time (e.g. even a single line is often a lot) and then perform the rest of the calculation through the existing scalar code, or even compare the result of the SIMD version to the existing scalar code.

Again for SIMD, it's helpful to render scenes at a greatly reduced size for testing, with an equally small block size, one sample per pixel, and with one thread (e.g. try using ‑path .\ ‑input Scenes/allmaterials.txt ‑size 8 8 ‑blockSize 8 ‑samples 1 ‑threads 1). It may even be helpful to fix the number of rays cast (MAX_RAYS_CAST) to be 1, so that reflection doesn't occur. This doesn't produce a very nice image, but images can be easily compared visually for problems (the example is only 64 pixels) and printf statements can be used to verify the correctness of calculations without outputting a huge amount of data (the use of debuggers is somewhat perilous in a multithreaded environment, but that shouldn't be an issue here).

When converting conditional statements to masks it can be helpful to do this with scalars first (although remember that C/C++ implementations use 1 rather than 0xFFFFFFFF for true, so the calculations may be different).

Write functions to output all the elements in a SIMD value for easier printf debugging (I am not a fan of how these types are shown in the debugger — unless they have changed recently).

Implementation — SIMD Lighting (Stage 4)

The following section describes in detail the steps required to complete Stage 4 of the assignment. You should only attempt this step if you have fully completed (and tested) Stages 1–3 (as much of this stage is likely much harder to complete).

This stage involves rewriting the applyLighting function and a number of functions it calls. In order to complete this step you will need to:

Create SoA declarations and conversions from AoS for Light container data (in a similar way that this was done for spheres and triangles).

Convert the applySpecular function to SIMD (i.e. perform this for 8 lights and produce 8 colours).

Create a normal non-SIMD applyDiffuse function that has applyCheckboard, applyCircles, and applyWood functions manually inlined before attempting the conversion to SIMD.

Convert this new applyDiffuse function to SIMD (i.e. perform this for 8 lights and produce 8 colours).

Convert the applyLighting function to SIMD (i.e. handle multiple lights simultaneously).

NOTE: the call to isInShadow is not suitable for SIMD conversion, so a simple for loop that repeatedly calls the function and stores each result in turn in a SIMD vector is acceptable.

Determine what parts of within the main loop in applyLighting are the same regardless of where the light is, and move these calculations outside the loop (i.e. so they are only calculated once). This step involves thinking about the loop itself and the functions that it calls.

Missing Mathematical Functions

There are no definitions of a few needed mathematical functions in the AVX/AVX2 instruction sets:

The powf function, as used in the applySpecular function in Lighting.cpp.

The sinf and cosf functions, as used in the applyWood function in Texture.cpp.

Definitions for these functions can either be sourced from:

The SVML api — if your processor / compiler supports them (e.g. _mm256_sin_ps).

By writing your own function that just calls the scalar function (e.g. sinf) 8 times for each slot in the vector.

Hints / Tips

All the hints/tips from the stages 1–3 apply here (i.e. complete this in small steps, use 8x8 images as tests, use a lot of printfs, etc.).

You have to change complex conditional control structures (e.g. continues, switch statements, etc.) into SIMD code. It's easier to do this using scalar code first, and then check your work, before attempting SIMD conversion.

You may find it helpful to define some helper classes similar to Vector8 to aid with the completion of this step, e.g. Colour8 (and maybe even Ray8). Also in a similar fashion to Vector8, defining useful operators can help with translation and code readability.

These stages are much more work than Stages 1–3 and they are worth less marks. Decide if you think it's worth it.