
COMP0015 2021-22 LSA Coursework


This document explains the arrangements for the coursework. You will create an application that draws a variation of a Sankey diagram.

You will draw a simplified variant of the one shown on the left, in which a single source value is split into different destination components. The width of each arrow is proportional to the amount of flow.

Figure 1 Sankey diagram from dataviz.com


Your work will be checked for plagiarism using a world-class plagiarism detection tool, MOSS. MOSS is used in the Department of Computer Science and across UCL more widely. According to UCL policy, plagiarism is defined as the presentation of another person’s thoughts or words or artefacts or software as though they were your own.

Plagiarism includes copying work from other students, submitting work completed by students in previous years of the course, and copying from journal articles, books and internet sources without correct referencing. Plagiarism seriously undermines the integrity of the College and its graduates and if a deliberate case of plagiarism is suspected in this course it will be treated as cheating under the University of London Proceedings in Respect of Examination Irregularities. Further details of the policy and proceedings can be found on the College website at: https://www.ucl.ac.uk/academic-manual/chapters/chapter-6-student-casework-framework/section-9-student-academic-misconduct-procedure

It is most important that if you feel that you are not able to deal with the study requirements in this course or if you are unsure about referencing conventions, then please ask your lecturer for help. Do not feel tempted to risk your personal reputation and progress through your degree program by plagiarizing or cheating.

It is also most important to remember that each assessment task is an opportunity for you to learn and to develop skills that will be of great value in professional and other areas of your life. While you may feel under pressure to complete each assessment task, you should not waste important learning opportunities by dishonestly fulfilling the assessment requirements, including copying material directly from the internet.

If you are in any way unsure about the rules and interpretations relating to plagiarism, please contact your personal tutor or the module leader for clarification. Plagiarism will not be tolerated in this module.


You are expected to show that you can code competently using the programming concepts covered so far in the course including (but not limited to): use of files, strings, dictionaries, variables, conditions, loops, and functions.

Marking criteria will include:

•    Correctness: your code must perform as specified.

•    You must apply the Python concepts appropriately.

•    Programming style: see section ‘Style Guide’ for more detail.

Your assignment will be marked using the rubric at the end of this document. This is the standard rubric used in the Department of Computer Science. Categories 5 and 6 will be used for coding assignments.


Your task is to build a program that reads data values from a file and draws a Sankey diagram that represents those values.

At a high-level, the aim of this coursework is for you to demonstrate that you can use text files and


You are given some starter code (sankey.py) and some text files (data sets). Using the starter code, you are required to complete the functions described in this document. You will also need to write your own functions.

To complete the coursework:

1.   Download and save the starter code before starting the assignment.

2.   Add your student number (not your name) in comments at the top of your program (sankey.py).

3.   Add code to your program as required to complete the functions described in section 1 through to section 6. Do not change the code that is given; you should only add to it.

4.   You must ensure that your program works properly on your own computer before you submit the code.

5.   Upload your program at the submission link on Moodle. You were advised of the link when you were sent the instructions for this coursework. Do not upload a folder containing your files because this can cause compatibility issues for the marking team.

Running the sankey program

There are two ways to run the program from the terminal, depending on whether you want to provide the data file name on the command line or have the user prompted for the file name. The code in main() already handles both cases; you are not required to edit main().

1. Enter a file name on the command line

To run the program in the terminal, specifying the data file name, you will need to type:

python3 sankey.py netball_2018.txt

The meaning of the terms on this line is:

python3: The Python interpreter. On macOS this will be python3 and on Windows this will be py or python.

sankey.py: The name of the Python program.

netball_2018.txt: The name of the input file.

2. Run the program in your IDE

To prompt the user for a file name, simply run the program in your editor (IDE) as you would normally.

Using the data sets

Each data set is stored in a text file where the first line in the file is the title for the graph and the second line in the file is the label for the source value on the top of the diagram.

These two lines are followed by a variable number of lines, each of which consists of the name of a destination category that is to appear on the right side of the diagram and the amount of flow to that category.

Please note we have provided several datasets. Some of these datasets contain a minimum number of fields (BlueHatGreenHat.txt, California_Electricity.txt, Enmax_Bill.txt, netball_2018.txt), as we have just described.

The optional datasets (netball_2018_opt1.txt, and netball_2018_opt2.txt) contain a variable number of fields specifying colours per data line, and are meant to be used to test the functionality of Additional Challenge 2.

Section 1: Reading datasets

Begin by examining the code in main():

# Try to read the file contents
try:
    title, left_axis_label, data_list = read_file(input_file)
    file_read = True
except FileNotFoundError:
    print(f"File {input_file} not found or is not readable.")
    input_file = ""

The function read_file() takes the file given as a parameter, opens the file for reading, and returns the title, the top label and a list in which each element is a line of data in the file.

Your first task is to complete the function read_file(). The function read_file() should raise an exception if the file does not exist or is not readable. The program will terminate if the file cannot be opened and read.

Please note we have provided dummy code for this function, returning some data in the correct format (but not read from a file). This allows you to work on other sections of the coursework even if you do not complete this one. However, once you do read the dataset files, you should remove this dummy code.

For your reference, for the file netball_2018.txt the function read_file() should return the following:

'Highest Goal Scorers 2018'

'Country'

['Australia, 529\n', 'Jamaica, 466\n', 'England, 450\n', 'New Zealand, 391\n', 'South Africa, 363']
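The behaviour described above can be sketched roughly as follows. This is only one possible approach, not the reference solution: it assumes the title and label should have their trailing newline stripped (as the expected output suggests) while the data lines are returned unchanged, and it relies on open() raising FileNotFoundError by itself when the file is missing.

```python
def read_file(file_name):
    """Return (title, label, data lines) read from the given text file.

    Sketch only: open() raises FileNotFoundError if the file does not
    exist, which is the exception main() is written to catch.
    """
    with open(file_name, "r") as infile:
        # Line 1 is the graph title, line 2 is the source label.
        title = infile.readline().strip()
        left_axis_label = infile.readline().strip()
        # Remaining lines are kept exactly as read, newline included.
        data_list = infile.readlines()
    return title, left_axis_label, data_list
```

Using a with block means the file is closed even if reading fails part-way through.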

Section 2: Displaying the graph window

Take a look at the following code in main():

# Section 2: Create a window and canvas

win = set_up_graph(title)

Complete the function set_up_graph(). It accepts the title of our graph and must use it to create and display a window with the correct title.

Please note that the size of the window is defined by the global variables WIDTH and HEIGHT, but you can use 2 arguments instead.

The output of this step for the data set netball_2018.txt is shown on the next page:

Section 3: Process the data

Take a look at the following code in main():

# Section 3: Process the data
try:
    data_dic = process_data(data_list)
except ValueError as error:
    print("Content of file is invalid: ")



You must implement the function process_data(). This function processes the list of data entries read from the file and returns a dictionary containing a series of key-value pairs.

If we take the file netball_2018.txt, data_list contains the following elements:

['Australia, 529\n', 'Jamaica, 466\n', 'England, 450\n', 'New Zealand, 391\n', 'South Africa, 363']

The function process_data() will return a dictionary containing the items shown on the right. You must validate the contents of each line to ensure that:

•    Neither the key nor the value are empty

•    The value can be converted to a float successfully

If the content of a line is invalid, you must print an appropriate error message and raise a ValueError exception. The message should include the line number. Here is an example of an error message:

Error in line 5: Value provided is not a number (as363)
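The validation rules above could be sketched along these lines. This is a hedged illustration, not the required implementation: it handles only the minimal two-field format, it assumes line numbers count data lines starting from 1 (consistent with the example message, where the fifth data entry is the faulty one), and the wording of the "missing" message is invented for illustration.

```python
def process_data(data_list):
    """Convert 'name, value' lines into a {name: float} dictionary."""
    data_dic = {}
    for line_number, line in enumerate(data_list, start=1):
        # Split at the first comma; both halves may carry stray spaces.
        key, _, value = line.partition(",")
        key = key.strip()
        value = value.strip()

        if key == "" or value == "":
            # Hypothetical wording; the brief only asks for "appropriate".
            message = f"Error in line {line_number}: Missing name or value"
            print(message)
            raise ValueError(message)

        try:
            data_dic[key] = float(value)
        except ValueError:
            message = (f"Error in line {line_number}: "
                       f"Value provided is not a number ({value})")
            print(message)
            raise ValueError(message) from None
    return data_dic
```

For the netball data this would yield entries such as 'Australia' mapped to 529.0.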

Section 4: Plotting the source and destinations

Take a look at the following code in main():

# Section 4: Draw the graph

draw_sankey(win, left_axis_label, data_dic)

Now you will begin to write a function that draws the Sankey diagram. Your function takes a reference to the window, the label to show and the dictionary containing the key/value pairs to be plotted.

The source will be plotted at the top of the graph and the destinations will be plotted at the bottom of the graph, with the number of pixels allocated to the source and each destination being directly proportional to their magnitude.

In addition, a small gap (e.g., 10 pixels, see global variable GAP) should be left between each destination to make it easy to determine the relative sizes of the destinations, and the bar for the source should be centred within the horizontal space needed for the destinations. The width of the bar drawn for each destination can be calculated in the following manner:

•    Determine the total flow to all of the destinations. This should be 2199 for the netball data set.

•    Assume that the diagram will be drawn in the window with a margin of 100 pixels at the top, bottom, left and right. The usable width in pixels (total width, minus margins and gaps) will be particularly important to scale everything correctly, and we will refer to this here as the diagram width.

•    Compute the number of available pixels (diagram width) as:

diagram width = window width - 2 * 100 - (number of destinations - 1) * GAP

Given a window width of 1000 pixels and the netball data set (5 categories), this calculation will be: 1000 - 200 - (5 - 1) * 10, which is 760 pixels.

•    As you can see, the units in the data-flow are not necessarily correlated to the pixels that you have available to display them. For instance, in the netball example we need to show arrows of up to 2199 units of flow in up to 760 pixels. It will be useful to compute a conversion ratio, telling you how many pixels you must use to represent a given value (in units of flow). You can compute this (number of pixels per unit of flow) as the number of available pixels divided by the total flow.

•    Use this ratio to compute the width of each destination arrow (i.e., amount of flow to that destination multiplied by the number of pixels per unit of flow). The width of the source bar must be computed as the sum of the flows to all destinations, multiplied by the number of pixels per unit of flow. The result of performing this task for the netball data set is shown below, together with some annotations (red arrows), highlighting key distances such as margins and gaps.

•    Notice we use different shapes for the source and destinations. Note also that the labels have been drawn for each destination, centred horizontally over each shape, but the rules are slightly different. Also note that the top half of the text on the destinations is currently not visible (it is written in white on a white background). Do not worry about this, we will soon be adding some coloured background.
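The arithmetic in the steps above can be checked with a short sketch. The function name compute_widths and the MARGIN constant are assumptions made for illustration (the brief only names WIDTH, HEIGHT and GAP); the formulas themselves are the ones given in the list.

```python
WIDTH = 1000   # window width in pixels (value used in the worked example)
GAP = 10       # gap between destinations, in pixels
MARGIN = 100   # hypothetical constant for the 100-pixel margin


def compute_widths(data_dic, window_width=WIDTH, gap=GAP, margin=MARGIN):
    """Return (diagram width, pixels per unit, source width, destination widths)."""
    total_flow = sum(data_dic.values())
    # Usable pixels once both side margins and the gaps are removed.
    diagram_width = window_width - 2 * margin - (len(data_dic) - 1) * gap
    pixels_per_unit = diagram_width / total_flow
    dest_widths = {name: flow * pixels_per_unit
                   for name, flow in data_dic.items()}
    source_width = total_flow * pixels_per_unit
    return diagram_width, pixels_per_unit, source_width, dest_widths


netball = {"Australia": 529.0, "Jamaica": 466.0, "England": 450.0,
           "New Zealand": 391.0, "South Africa": 363.0}
diagram_width, ratio, source_width, widths = compute_widths(netball)
print(diagram_width)  # 760, matching the worked example
```

The widths are left as floats here; rounding is best postponed until the values are used as pixel coordinates, so that rounding errors do not accumulate across destinations.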

Section 5: Connecting the sources to the destinations

A polygon can be used to connect the source to each destination. In order to draw the polygon the (x,y) position within the source and the (x,y) position of the destination must both be known. The y positions are easy to determine from the size of the window, the margins and the sizes you gave to each shape.

However, you will probably find that you want one variable to keep track of the x position at the source and a second variable to keep track of the x position at the destination, where the arrows start. As each polygon is drawn, the x position at the source should be increased by the width of the destination, while the x position at the destination should be increased by the width of the destination plus the gap size (to account for the gap between the destinations). You likely already introduced such a variable for the destinations when drawing the bars in the previous part of the assignment, and you can continue to use that variable in this part of the assignment.

The output generated when the polygons are added is shown below (note the top part of the label is now visible).
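The bookkeeping with two running x positions might look like the sketch below. It only computes corner coordinates and does no drawing; the function name polygon_corners, the corner ordering and the y_top / y_bottom parameters are assumptions for illustration, since the actual y values depend on the shapes you drew in Section 4.

```python
def polygon_corners(dest_widths, source_x, dest_x, y_top, y_bottom, gap=10):
    """Return one list of four (x, y) corners per destination.

    source_x and dest_x are the starting x positions at the source bar
    and at the first destination respectively.
    """
    polygons = []
    for width in dest_widths:
        polygons.append([
            (source_x, y_top),            # top-left, on the source bar
            (source_x + width, y_top),    # top-right, on the source bar
            (dest_x + width, y_bottom),   # bottom-right, at the destination
            (dest_x, y_bottom),           # bottom-left, at the destination
        ])
        # The source side is contiguous; the destination side also skips the gap.
        source_x += width
        dest_x += width + gap
    return polygons


corners = polygon_corners([100, 200], source_x=105, dest_x=100,
                          y_top=150, y_bottom=450)
```

Each list of corners can then be passed to whatever polygon-drawing call your graphics library provides.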

Section 6: Adding flat colours

Without colour, the Sankey diagram above is not very attractive. In this section, you will add colours so that a different colour is assigned to the source and all the destinations on the graph. Add the following constant to the top of your program:

COLOURS = [(230, 25, 75), (60, 180, 75), (255, 225, 25), (0, 130, 200),
           (245, 130, 48), (145, 30, 180), (70, 240, 240), (240, 50, 230),
           (210, 245, 60), (250, 190, 212), (0, 128, 128), (220, 190, 255),
           (170, 110, 40), (255, 250, 200), (128, 0, 0), (170, 255, 195),
           (128, 128, 0), (255, 215, 180), (0, 0, 128), (128, 128, 128)]

The list COLOURS contains colours in RGB format. A colour expressed in RGB format is a tuple of 3 values: a red value, a green value and a blue value. Together, the red, green and blue values comprise the colour. You can experiment with RGB values using this colour picker. Note: all RGB values must be within the range 0-255; this fact will become important later on when you have to do some calculations with these values.

Now update your program so that the rectangle at each destination is drawn in a different colour, as shown in the diagram below. You can implement any policy that you like to select the colour, but you need to make sure it is one from the COLOURS list. Also, the colour of the text at the end of each arrow should be the inverse of the colour of the arrow (e.g., if the arrow head is red (255,0,0), the text of its label should be (0,255,255)).

Check the documentation for ezgraphics for information on how to use canvas.setFill() with R,G,B values.
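As a small illustration of the colour arithmetic, the sketch below inverts an RGB tuple and picks a colour by position. The helper names and the "cycle through the list by index" policy are assumptions; the brief allows any selection policy as long as the colour comes from COLOURS.

```python
def inverse_colour(colour):
    """Return the inverse of an (R, G, B) tuple with channels in 0-255."""
    red, green, blue = colour
    return (255 - red, 255 - green, 255 - blue)


def pick_colour(index, palette):
    """One possible policy: cycle through the palette by destination index."""
    return palette[index % len(palette)]


# Shortened palette for the example; in the program this would be COLOURS.
palette = [(230, 25, 75), (60, 180, 75), (255, 225, 25)]
arrow_colour = pick_colour(4, palette)       # wraps around to (60, 180, 75)
label_colour = inverse_colour(arrow_colour)  # (195, 75, 180)
```

The modulo keeps the index valid even when a data set has more destinations than there are colours in the list.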

Section 7: Adding a colour gradient

Your next task is to create the effect of the colour transitioning from top to bottom. This will require you to fill the polygon with a series of lines. A line is 1 pixel thick and each line should be drawn in a slightly different colour.

How you do this is up to you but you may wish to create the code to do this in two stages:

1.   Draw a series of horizontal lines to fill each polygon.

2.   Colour the lines so that the colour graduates from the source colour (black) to the destination colour.

Drawing the filling lines

The aim here is to draw a series of lines to fill the polygon shape, following the idea shown in the diagram below (we show it horizontally, for convenience, but your arrows run from top to bottom). In this diagram, only one in every 4 lines has been coloured/filled, to make it easier to see what is happening. In your case you want to draw every line, adjusting the colour of each line slowly, so as to create the final gradient.

Keep the polygon outline that you drew earlier, as it looks neater. When you draw all the lines, you will just see a solid block. Your job is to work out the algorithm to draw this shape.

Use a for loop to draw the lines from the source to the destination. Think about what you know, given that a line is specified by 4 values x1, y1, x2 and y2:

•    All the lines are the same length, x2 = x1 + the width (in pixels) of the destination rectangle.

•    y1 and y2 are equal and increase by 1 pixel for each line.

•    x1 increases by a small increment, delta, with each line. Again, it’s up to you how you calculate this delta. You can use a variable that describes how close you are to the destination, with 0 representing being at the source and 1 representing being at the destination. The computed value can be used to calculate x1 for each line. x1 must be an integer (as it refers to a pixel coordinate).
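One way to turn the three observations above into code is sketched below. It computes the end points and colour of every line without drawing anything. The function name gradient_lines, the returned tuple layout and the use of linear interpolation for both x1 and the colour are assumptions; the brief leaves the choice of delta and colour calculation to you.

```python
def gradient_lines(source_x, dest_x, width, y_top, y_bottom,
                   source_colour, dest_colour):
    """Return a list of (x1, y1, x2, y2, colour), one per pixel row.

    Assumes y_bottom > y_top. The variable t runs from 0 at the source
    to 1 at the destination, as suggested in the text.
    """
    lines = []
    height = y_bottom - y_top
    for y in range(y_top, y_bottom + 1):
        t = (y - y_top) / height
        # Slide the left edge from the source position to the destination.
        x1 = round(source_x + t * (dest_x - source_x))
        x2 = x1 + round(width)
        # Blend each channel; results stay within 0-255 for valid inputs.
        colour = tuple(round(s + t * (d - s))
                       for s, d in zip(source_colour, dest_colour))
        lines.append((x1, y, x2, y, colour))
    return lines


rows = gradient_lines(100, 200, 50, 0, 100, (0, 0, 0), (200, 100, 50))
```

Each tuple can then be drawn as a one-pixel line in its own colour, with the polygon outline drawn afterwards so that it stays visible.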