
CS 33 (Fall '17)

Kevin Hsieh edited this page Feb 2, 2018 · 1 revision

Parallelism comparison

Contributed by Chenyu Xu.

For the following forms of machine parallelism, give an example of an application that will likely work better with this form of parallelism.

  1. instruction level parallelism
  2. multiplexing
  3. process parallelism

Solution

  1. When we have a single CPU and core, and our program does not contain long chains of dependent values, instruction-level parallelism lets the processor execute multiple micro-operations at the same time. This form of parallelism operates alongside the other forms of machine parallelism, and is achieved largely through how the code is written and compiled.

  2. When we have only one CPU and core, and we need fine-grained control over the application. For example, we may want to give preferred service or priority to a specific client or part of the program. Also, when the whole application needs to share data between flows, multiplexing lets every logical flow access the entire address space of the process.

  3. When we have several CPUs and cores, and several clients working on the application. Besides enabling parallelism for each client, every client gets its own separate address space, so none of them can overwrite another's memory, keeping each client's memory safe.


Bitwise Less Than or Equal To

Contributed by Caleb Chau.

Given two numbers x and y, determine if x is less than or equal to y. Return 1 if x <= y and 0 if not.

Example

isLessOrEqual(5, 6) returns 1, isLessOrEqual(6, 5) returns 0, isLessOrEqual(5, 5) returns 1

Solution

int isLessOrEqual(int x, int y) {
    /*
     * If x <= y, then either x is negative and y is non-negative, or
     * they have the same sign and y - x is non-negative (sign bit is
     * 0). xneg_ypos is 1 only when x is negative and y is
     * non-negative. nonneg_difference is 1 only when x and y have the
     * same sign and their difference is non-negative.
     */

    int difference = y + (~x + 1);
    int xneg_ypos = (x & (~y)) >> 31 & 0x01;
    int nonneg_difference = (~(x ^ y) & ~difference) >> 31 & 0x01;

    return xneg_ypos | nonneg_difference;
}

Unions and Floating Point Representation

Contributed by Joey Ganley.

What will the following code output? Hint: The binary representation of "1065353216" is "0 01111111 00000000000000000000000"

A) 1065353216
B) FLT_MAX
C) 1
D) NaN

#include <iostream>
using namespace std;

union MagicUnion {
    int a;
    float b;
};

int main() {
    MagicUnion test;
    test.a = 1065353216;
    cout << test.b << endl;
}

Solution

The answer is C. A union stores one set of bytes and reinterprets them as whichever member type is requested. Since "test.b" is requested, the program reads the binary data as a float. The floating point value "0 01111111 00000000000000000000000" has a biased exponent of 127 (so 2^(127-127) = 2^0) and a zero fraction, which equals 1, so the program outputs 1.


Picky Positives

Contributed by Anirudh Veeraragavan.

Write a function isPositive that returns 1 if x > 0 and 0 otherwise.

Restrictions:

  1. You may only use the operators ! ~ & ^ | + << >>
  2. You may only use the above operators 8 times.
  3. You may not use any control flow constructs. I.E. no if, do, while, for, switch, etc.
  4. You may not cast and may not use data types other than int.

Example

isPositive(-1) = 0

Solution

x is positive if and only if

  1. The sign bit is 0
  2. x is not 0

int isPositive(int x) {
  int sign = (x >> 31) & 1;
  int isZero = !x;
  return !sign & !isZero;
}

Assembled Transpose

Contributed by Arvind Ramachandran.

The following code transposes the elements of a square matrix.

#define N ___

void transpose(int A[N][N]) {
  int i, j;
  for (i=0; i < N; i++) {
    for (j=0; j < i; j++) {
      int t = A[i][j];
      A[i][j] = A[j][i];
      A[j][i] = t;
    }
  }
}

When compiled with optimization level -O1, GCC generates the following code for the inner loop:

        movl    (%rdx), %ecx
        movl    (%rax), %esi
        movl    %esi, (%rdx)
        movl    %ecx, (%rax)
        addq    $4, %rdx
        addq    $400, %rax
        cmpq    %rdi, %rax
        jne     .L4

  1. Which register holds a pointer to array element A[i][j]?
  2. Which register holds a pointer to array element A[j][i]?
  3. What is the value of N?
    (A) 4
    (B) 100
    (C) 400

Solution

  1. %rdx (it advances by 4 bytes, one int, per iteration, matching the inner-loop stride of A[i][j] as j increases)
  2. %rax (it advances by 400 bytes per iteration, i.e. one full row, matching A[j][i])
  3. (B) 100. The row stride is 400 bytes and each int is 4 bytes, so N = 400 / 4 = 100.

Thread-safe counter

Contributed by Benjamin Yang.

Create a counter that is incremented by two threads, 5 times each. Print out the thread number and current value of count after each increment. Print out the final value after both threads are finished as well. Use POSIX threads and semaphores to implement your solution. Use global variables as appropriate.

Example

Output (one possible interleaving; the exact order may vary between runs):
Thread 1 incremented count to 1
Thread 2 incremented count to 2
Thread 1 incremented count to 3
Thread 2 incremented count to 4
Thread 1 incremented count to 5
Thread 2 incremented count to 6
Thread 1 incremented count to 7
Thread 2 incremented count to 8
Thread 1 incremented count to 9
Thread 2 incremented count to 10
The value of count is 10

Solution

#include <pthread.h>
#include <stdio.h>
#include <semaphore.h>

sem_t s;
int count;

void *countup(void *p) {
    int *id = (int*)p;
    for (int i = 0; i < 5; i++) {
        sem_wait(&s);
        count++;
        printf("Thread %d incremented count to %d\n", *id, count);
        sem_post(&s);
    }
    return NULL;
}

int main() {
    pthread_t tid1, tid2;
    count = 0;
    sem_init(&s, 0, 1);
    int id1 = 1;
    int id2 = 2;
    pthread_create(&tid1, NULL, countup, &id1);
    pthread_create(&tid2, NULL, countup, &id2);

    pthread_join(tid1, NULL);
    pthread_join(tid2, NULL);
    printf(""The value of count is %d\n"", count);
}

How to subtract?

Contributed by Vaibhav Aggarwal.

How can you subtract two integers using only bitwise operators?

int subtract(int a, int b);

Solution

If we create a table of the different bitwise scenarios:

a | b | diff | borrow
0 | 0 |   0  |    0
0 | 1 |   1  |    1
1 | 0 |   1  |    0
1 | 1 |   0  |    0

We can see that we only borrow if (~a) & b, and the difference bit is a ^ b. Therefore:

int subtract(int a, int b) {
    while (b) {
        int borrow = (~a) & b;
        a = a ^ b;
        b = borrow << 1;
    }
    return a;
}

Toggling Bits

Contributed by Nyan Tun.

Given an integer A and an integer B, determine the number of bits it would take to toggle to convert A to B.

Example

in: a, b = 2, 3  // in binary: 10, 11
out: 1
in: a, b = 10, 20 // in binary: 01010, 10100
out: 4

Solution

You iterate from the least significant bit upward. Each bit of a^b is 1 where a and b differ, and c & 1 masks off all but the lowest bit of c so it can be added to the count.

#include <iostream>

using namespace std;

int toggle_bit(int a, int b) {
    int count = 0;
    // unsigned so the shift is logical and the loop terminates
    // even when a ^ b is negative
    for (unsigned c = a ^ b; c != 0; c = c >> 1) {
        count += c & 1;
    }
    return count;
}

int main() {
    cout << toggle_bit(2, 3) << endl;
    return 0;
}

Solid State Drives

Contributed by Shawye Ho.

1a. Discuss the difference between a Solid State Drive (SSD) and a regular Hard Drive. What advantages and disadvantages does an SSD have over a regular hard disk drive due to the differences in their design? (Remember: SSDs use flash memory)

1b. Some SSD manufacturers display a warning to leave at least 25% of the storage space empty for optimal write performance and SSD longevity. Why do you think this is? Other manufacturers do not display such a warning for their SSDs. What differences between the SSDs of different manufacturers might lead to some manufacturers recommending that 25% of the storage space on their SSD be left empty, while other SSD manufacturers do not have such a recommendation?

1c. Samsung is currently working on a new type of SSD called a Key SSD. This SSD can bypass the flash controller on the SSD, and stores "keys" that map to certain blocks of memory on the SSD. Thus, to find one of these "blocks" of memory, the SSD can use the keys, rather than working through the flash controller. What advantage(s) (or disadvantage(s)) might the Key SSD have over a regular SSD? What types of objects or files might the Key SSD be best suited to store?

Solution

1a. An SSD uses flash memory: large blocks of integrated-circuit assemblies accessed through a flash controller, which lets the hardware (usually a computer) communicate with the drive. This controller performs many of the operations needed to maintain the flash memory, including bad-block mapping, encryption, caching, error detection and correction, garbage collection, and wear leveling, among other things. Because this memory consists of integrated circuits rather than spinning disks and read/write heads, it has no moving parts. As such, SSDs tend to be more resistant to physical shock and damage, and they run silently. Furthermore, because of the nature of their memory access, they tend to read faster (lower access time and lower latency) than hard disk drives. However, due to manufacturing costs, SSDs cost much more per GB than hard disk drives.

1b. Flash memory (as in SSDs) is most effective when written in large blocks, because it is typically accessed in blocks. The flash controller will therefore always try to write to a free block, to optimize the SSD's read performance. If there are no free blocks and something new must be written, the flash controller must first read a block (or several) and write its contents elsewhere so that the new item can go into a freed block. Thus, when the SSD is nearly full, writing is slower. Furthermore, this situation requires MORE writing (an old block must be rewritten before the new block can be written), which wears out the flash memory sooner. For these reasons, some manufacturers recommend keeping at least 25% of the SSD free to preserve write performance and drive longevity. Manufacturers without this recommendation may use flash controllers that optimize differently (perhaps always wiping and writing at the same time), or may include extra space on the drive that is not user-accessible, acting as a buffer in these scenarios so that write performance and longevity do not suffer.

1c. The Key SSD can bypass the flash controller, which is comparatively slow, and access memory directly through a key, which is faster. Thus, the Key SSD should be faster than a regular SSD. However, because each key is associated with a particular region, this scheme could be insecure: while traditional flash controllers encrypt the data stored on the drive (making it inaccessible without the flash controller associated with the drive), a key points directly at the region holding the information, making encryption more difficult. As a result, the Key SSD could be less secure than traditional SSDs. Because the keys map to large blocks of memory, the Key SSD is best suited to objects or files that occupy large blocks: videos, images, and large data blocks (large amounts of data organized in some particular way, such as a big spreadsheet).


Locality of loops

Contributed by Koji Kusumi.

Examine the function loop1 below:

#include <stdio.h>
void loop1(void) {
    int arr[1000][1000];
    for ( int j = 0; j < 1000; j++ ) {
        for ( int i = 0; i < 1000; i++ )
            arr[i][j] = i*j;
    }
    int sum = 0;
    for ( int j = 0; j < 1000; j++ ) {
        for (int i = 0; i < 1000; i++ )
            sum += arr[i][j];
    }
    printf(""the sum is %d \n"", sum);
}

int main(void) {
    loop1();
    //loop2();
}

With your understanding of locality and caching, write an improved function, loop2. This function should compute the same sum and still use loops in a meaningful way. Explain what changes you made and how they improve the speed of the function.

Solution

A two-dimensional array is stored as a single line of values. If a two-dimensional array is indexed by [row][col], values in the same row are stored adjacent to one another. Notice that loop1 sums values in the same col. For each index in the array, an entire new block of memory must be loaded. Blocks of memory pulled into a cache are pieces of memory stored adjacently. Since values in the same col are not adjacent, we cannot use more than one value from a single block.

To speed up the program, we can sum values by row. Now, when we load a block, we can use all of the values, because the values are adjacent and we are summing adjacent indices. In code, this manifests itself as swapping the order of our loops.

#include <stdio.h>

void loop2(void) {
    int arr[1000][1000];
    for ( int i = 0; i < 1000; i++ ) {
        for ( int j = 0; j < 1000; j++)
            arr[i][j] = i*j;
    }
    int sum = 0;
    for ( int i = 0; i < 1000; i++ ) {
        for ( int j = 0; j < 1000; j++ )
            sum += arr[i][j];
    }
    printf(""the sum is %d \n"", sum);
}

void loop1(void) {
    int arr[1000][1000];
    for ( int j = 0; j < 1000; j++ ) {
        for ( int i = 0; i < 1000; i++ )
            arr[i][j] = i*j;
    }
    int sum = 0;
    for ( int j = 0; j < 1000; j++ ) {
        for (int i = 0; i < 1000; i++ )
            sum += arr[i][j];
    }
    printf(""the sum is %d \n"", sum);
}

int main(void) {
    loop1();
    loop2();
}

Performance and Optimization

Contributed by Kevin Tan.

Consider the following C code:

void strIncrement(int *n, int x, const char *word) {
  int i;
  for (i = 0; i < strlen(word); i++) {
    *n += i * x;
  }
}

  1. Suppose word is a very long C-string. How well does speculative execution perform when evaluating the condition of the for loop?
  2. Without modifying the body of the for loop, propose one way to improve the performance of this code.
  3. Our function is rewritten to conform to the syntax required for thread routines:
// Global struct definition
struct arguments {
  int *n;
  int x;
  const char *word;
};

void *strIncrement(void *args) {
  // Cast args to a pointer to a "struct arguments"
  struct arguments *arg = (struct arguments *)args;
  
  // Get parameters
  int *n = arg->n;
  int x = arg->x;
  const char *word = arg->word;
  
  // Main thread routine
  int i;
  for (i = 0; i < strlen(word); i++) {
    *n += i * x;
  }
  
  return NULL;
}

Is strIncrement always thread-safe? Explain. If no, also propose a way to make the function always thread-safe.

Solution

  1. If word is long, we evaluate the for-loop condition many times. Speculative execution performs well here: the branch predictor will guess that the condition evaluates to true, since that is what happens on every iteration except the last. The condition only ever evaluates to false when we exit the loop!

  2. We can hoist out the call to strlen so that it isn't called every time the program evaluates the condition for the for loop (function calls are treated as black boxes that cannot be optimized - with the exception of calls to inline functions). It would look something like:

    void strIncrement(int *n, int x, const char *word) {
      int i;
      unsigned long length = strlen(word);
      for (i = 0; i < length; i++) {
        *n += i * x;
      }
    }
  3. strIncrement is not always thread-safe. It never accesses any global or static variables, but it is not explicitly reentrant - it is implicitly reentrant. If the parameter n (or arg->n) points to the same address for two or more different threads, we have race conditions. One thread might update the value of n after another thread has already read its old value.

    One way we can fix this is with semaphores, or more specifically a mutex. We could create a mutex, then surround the body of the for loop with P and V operations.

    Another way we could fix this is to change the function so that it calculates and returns the total value that it would've added to n. Then, the main thread can reap all child threads, sum up all of their return values, and manually update n. This solution would notably require a lot more code rewriting and overhead than the above solution.


Caching and Locality

Contributed by Kyle Romero.

Which of the following statements is NOT true regarding the general use of caches?
A. Caches utilize spatial locality to offer fast access to elements in arrays
B. Accessing data in a cache is faster than accessing data in RAM
C. If we write data to the cache, this value is always changed immediately in the backing store
D. Caches optimize data accesses while iterating through loops
E. All of the above statements are true regarding general use of caches

Solution

C. "If we write data to the cache, this value is always changed immediately in the backing store." This statement is not always true because it depends on the cache's write policy. It would hold if our cache implemented a write-through policy; however, under a write-back policy the backing store is only updated when the dirty block is evicted.


Struct Memory Layout

Contributed by An Nguyen.

typedef struct {
  char a;
  int b;
  char c;
  double d;
} X;

What is the size of X? How can you reduce memory consumption of X? How small can X be?

Solution

a: 1 byte
padding: 3 bytes (to align b on a 4-byte boundary)
b: 4 bytes
c: 1 byte
padding: 7 bytes (to align d on an 8-byte boundary)
d: 8 bytes
=> Size of X is 24 bytes

To reduce memory consumption of X, declare the members in decreasing order of size.

typedef struct {
  double d;
  int b;
  char a;
  char c;
} X;

Now, size of X is 16 bytes.


Rotate Bits

Contributed by Andrew Ding.

Rotate the bits of x to the right by n positions, where 0 <= n < 32.

Solution

int rotateRight(int x, int n) {
    int shift = (!n + ~0) & (32 + ~n + 1);  // 32 - n, or 0 when n == 0
    int mask1 = (x >> n) & ~(~0 << shift);  // low 32 - n bits of x >> n
    int mask2 = x << shift;                 // low n bits of x moved to the top
    return mask1 | mask2;
}

Thread-Safe/Reentrant Functions

Contributed by Marko Sevo.

Are the following thread-safe functions in the Rio I/O package also reentrant? ssize_t rio_readn(int fd, void *usrbuf, size_t n); ssize_t rio_writen(int fd, void *usrbuf, size_t n);

Solution

Just because the functions in the Rio I/O package are thread-safe does not mean that they are reentrant. These functions are definitely not explicitly reentrant because they both take in pointers as arguments, but they are implicitly reentrant. This means that if they are used correctly and no shared data is passed as arguments (global data, static data, etc), then they are reentrant. However, since reentrancy is a property of the caller and callee, if shared data is passed into the functions, then they are not reentrant.


Stack and Heap Memory Allocations

Contributed by Max (Yingbo) Wang.

Consider the new memory allocations on each line in the following function (note that line 3 uses the C++ new operator). Is the memory allocated on the stack or the heap? After the function returns, is there a memory leak? If so, which line of code is causing the leak?

int cs33Rocks()
{
  char *ptr;  // 1
  char buffer[128];  // 2
  ptr = new char[128];  // 3
  return 0;
}

A. stack, stack, heap; #2 is causing a memory leak
B. stack, stack, heap; #3 is causing a memory leak
C. stack, heap, heap; #3 is causing a memory leak
D. stack, heap, heap; no memory leak

Solution

Correct answer: B

Local variables, pointers, and function return addresses are stored in stack memory. The stack is managed automatically, and all space allocated within a scope is reclaimed when the function returns. Memory allocated manually with new or malloc goes onto the heap instead, and the heap is where leaks occur: the developer is responsible for deallocating the space (with delete or free) when the object is no longer needed.


Is it reentrant?

Contributed by Michael Mallett

Four functions are shown in the code below. Find the one which is not explicitly reentrant.

#include <stdio.h>

int t;

// function a
void hello() {
  printf(""Hello World"");
}

// function b
int f(int i) {
  hello();
  return i + 2;
}

// function c
void swap(int *x, int *y) {
  t = *x;
  *x = *y;
  *y = t;
}

// function d
unsigned int fib_rec(unsigned int n) {
  if (n == 0) {
    return 0;
  } 
  if (n == 1) {
    return 1;
  }
  return fib_rec(n - 1) + fib_rec(n - 2);
}

Solution

Function c is not reentrant because it uses the global variable t.
Function a is reentrant: it accesses no shared state.
Function b is also reentrant, despite calling another function, because the function it calls is itself reentrant.
Function d is reentrant because it works only with its argument and local values; recursion by itself does not change that.


Add Two Numbers

Contributed by Tina Luo.

Write a function add(int x, int y) that returns the sum of two integers. The function cannot use any of the arithmetic operators (+, ++, --, -, etc.).

Solution

#include<stdio.h>

int add(int x, int y) {  
  while (y != 0) {
      int carry = x & y;  
      x = x ^ y; 
      y = carry << 1; 
  }
  return x;  
}

Bit Manipulation to find single number

Contributed by Yizhu Zhang.

Given an array of integers, every element appears twice except for one. Find that single one.

Example

Given [2, 3, 5, 3, 5]. Returns 2.

Solution

int singleNumber(int A[], int n) {
    int result = 0;
    for (int i = 0; i < n; i++)
        result ^= A[i];
    return result;
}

Thinking about locality

Contributed by Daniel Li.

Explain the two types of locality used in caching.

Solution

Temporal locality: data that we access is likely to be accessed again in the near future.

Spatial locality: if we access a piece of data, we are likely to access nearby data soon.


Floating Point and Bitwise Operations

Contributed by Derek Xu

Prof. Eggert has developed a new programming language that acts exactly like C except that it uses a new data type called a bloat instead of a double. A bloat is a number inspired by floating point and has the following structure:

+-----------+----------+----------+
| sign byte | mantissa | exponent |
+-----------+----------+----------+
|<--1 bit-->|<-32bits->|<-31bits->|

Assume that casting a bloat to a long preserves the binary representation. Assume that bloats, like floats, can be in normalized, denormalized, or special form. Assume any operation on a bloat behaves the same as it would on a float (including NaN behavior).

a) Can bloating point numbers represent all ints? Why or why not?
b) Can bloating point numbers represent all doubles? Why or why not?
c) You are tasked with writing a C function that typecasts longs into bloats. You do not need to take rounding into consideration when finding the closest bloat representation. You may only use bitwise operations, addition, subtraction, while loops, if statements, comparison operators, and unions. Try to make your solution as efficient as possible, rather than using brute force (whatever that may be). HINT: 2^30 = 1073741824

Complete the following function:
bloat long2bloat (long x){
        
}

Solution:

a) Yes: the bloat's mantissa is 32 bits and an int is 32 bits, so every int fits exactly in the mantissa. Thus bloats can represent all ints.
b) No: a double's mantissa is 52 bits, while the bloat's is only 32 bits. A double can therefore represent numbers more fine-grained than any bloat can (although the bloat's larger exponent field gives it more range).
c)

typedef union {
        long l;
        bloat b;
} conversion;

bloat long2bloat (long x) {
        conversion c;
        c.l = 0;
        //zero has no leading 1 bit; return the all-zero bloat directly
        if (x == 0)
                return c.b;
        //convert negative to positive 2s complement
        if (x < 0) {
                c.l = 0x8000000000000000;
                //hard code LONG_MIN case (can use unsigned long too)
                if (x != LONG_MIN)
                        x = (x^0xffffffffffffffff) + 1;
                else {
                        //sign bit set, zero mantissa, exponent = bias + 63
                        //= 1073741823 + 63 = 0x4000003e
                        c.l = 0x800000004000003e;
                        return c.b;
                }
        }
        //calculate exponent and make mantissa: normalize so the
        //leading 1 sits at bit 32, tracking the shifts made
        long shifts = 0;
        while ((x & 0x7fffffff00000000) != 0) {
                x = x >> 1;
                shifts--;
        }
        while ((x & 0x0000000100000000) == 0) {
                x = x << 1;
                shifts++;
        }
        //calculate bias: 2^30 - 1 = 1073741823
        long exponent = 1073741823 + 32 - shifts;
        //combine everything to form new number (implicit leading 1 dropped)
        c.l = c.l | exponent | ((x & 0x00000000ffffffff) << 31);
        //use union to return the same binary representation
        return c.b;
}