prep

The time has come to dust off those old study materials, and get re-acquainted with some long forgotten concepts. The only issue is I don’t have old study materials – I’m self taught, so it’s Google-fu I’m using this time around. An added benefit this round is I’ll document what I’m learning here, which will do two things: (1) help the non-existent readers of this fine blog and (2) generate content! Which is something I really badly need to do (it’s been 21 days since I wrote dmidecode). Let’s do this.

Find the known unknowns

I don’t miss the days of being junior. Just getting the attention of recruiters and hiring managers anywhere hiring junior level was incredibly difficult, and a lot of it had to do with a ton of unknown unknowns. That being said, you only come to know what you don’t know through experience (and some brutal code reviews). At this point in my career, and given the fact I’ve worked predominantly with PHP (don’t do this – learn something actually marketable, and “professional”), I’m starkly aware of the fact my knowledge of data structures and algorithms is about as strong as my pinky toe.

It’s time to change that. I’ve worked in great places, and I’ve worked in really, really shitty places, and frankly I’m over the hot-cold nature of this industry. As much as I detest the idea of technical interviews and the concept that “a real professional needs to know these theories like they’re their gospel”, the powers that be require strong knowledge in these areas, so to cite Coding Interview for Dummies:

In reality, being aware of existing data structures and selecting the appropriate ones to tackle the problem at hand is more important than knowing the intricate implementation details.

Basically, my objective is pretty simple right now. Learn some shit, write some shit, and apply some shit. Seems pretty simple, so this post will serve as a roadmap of sorts going forward. I’ll tackle some definitions and get some high level explanations compiled, and will likely expound on this post a bit in the coming weeks as things come into fruition. Please note that some of the code was copied (citations are all at the bottom of this page). Some I wrote myself for the sake of this article. Consider this to be a compilation of a bunch of resources, condense into bite-sized chunks for your easy consumption.

Data structures

What are data structures.. Well, we use them to organize data, which helps enable us to use the data more effectively. Oftentimes, junior developers (especially self taught ones) learn just enough about arrays to brute force their way into a really shitty Wordpress install, and then have people screaming at them because their site is slower than a 56k modem connection trying to download some pixelated, oversized 90s album art.

By utilizing data structures, we are able to optimize our resource consumption, while maximizing the throughput of data over a given time period. ~~Basically, we use them to play Einstein, in a really remotely, not really related kind of way.~~ Data structures do have a lot of to do with our use of computational space and time. Here’s the major ones.

Linked List

Linear collection of data elements (nodes), each pointing to the next node by means of a pointer. Linked lists consist of a group of nodes and represent a sequence; there are three types:

Single-linked list: each node points to the next node, and the final points to null Double-linked list: each node has two pointers [P and N], where P points to the previous node and N points to the next node. The last node’s N points to null. Circular-linked list: each node points to the next, and the last node points back to the first node.

# Python

class DoublyNode:
    def __init__(self, val):
        self.val = data
        self.next = None
        self.prev = None

    def traverse(self):
        node = self # start from the head node
        while node != None:
            print node.val # access the node value
            node = node.next # move on to the next node

// PHP

$list = new SplDoublyLinkedList();
$list->push('a');
$list->push('3');
$list->push('v');
$list->push('1');
$list->push('p');
$list->rewind();

while ($list->valid()){
    echo $list->current() . PHP_EOL;
    $list->next();
}

When and why:

Linked lists save memory by only allocating the memory required for the values to be stored. In addition, nodes can live anywhere in memory as opposed to an array, where indexes live sequentially. However, this comes at the cost of look up time which is linear due to how linked lists are constructed. Each node in the chain must be checked, one at a time, for a matching value.

Time Complexity:
  Access: O(n)
  Search: O(n)
  Insert: O(1)
  Remove: O(1)

Stack

A collection of elements with two principal operations: push (add/append) and pop (remove). Stacks are LIFO (last in, first out) data structures.

# Python

stack = ['First', 'Second', 'Third']
stack.append('Fourth')
stack.append('Fifth')
print(stack)
print(stack.pop())
print(stack)
print(stack.pop())
print(stack)

# Output:
['First', 'Second', 'Third', 'Fourth', 'Fifth']
Fifth
['First', 'Second', 'Third', 'Fourth']
Fourth
['First', 'Second', 'Third']

// PHP
$stack = ['First', 'Second', 'Third'];
array_push('Fourth');
array_push('Fifth');

var_dump($stack);
array_pop($stack);
var_dump($stack);
array_pop($stack);

// Output:
array(5) {
  [0] =>
  string(5) "First"
  [1] =>
  string(6) "Second"
  [2] =>
  string(5) "Third"
  [3] =>
  string(6) "Fourth"
  [4] =>
  string(5) "Fifth"
}
array(4) {
  [0] =>
  string(5) "First"
  [1] =>
  string(6) "Second"
  [2] =>
  string(5) "Third"
  [3] =>
  string(6) "Fourth"
}
array(3) {
  [0] =>
  string(5) "First"
  [1] =>
  string(6) "Second"
  [2] =>
  string(5) "Third"
}

Note: As of PHP 7, PHP now offers various data structures as an alternative to arrays. See requirements and installation for more information.

When and why:

Stacks are useful for back tracing access to previous elements or operations and matching recursive elements or operations. Consider the Undo functionality in your code editor – stacks work very simliarly to that (and that’s where the usefulness comes in).

Time Complexity:
  Access: O(n)
  Search: O(n)
  Insert: O(1)
  Remove: O(1)

Queue

A collection of elements supporting two operations, enqueue and dequeue. Queues are FIFO (first in, first out) data structures. You can think of a queue as a pipeline, or single lane road, where the first item to enter is the first to leave, ad infinitum.

# Python

from collections import deque
queue = deque(['Fourth', 'Third', 'Second', 'First'])
print(queue)
queue.append('Fifth')
print(queue)
queue.append('Sixth')
print(queue)
print(queue.popleft())
print(queue.popleft())
print(queue)

# Output
deque(['Fourth', 'Third', 'Second', 'First'])
deque(['Fourth', 'Third', 'Second', 'First', 'Fifth'])
deque(['Fourth', 'Third', 'Second', 'First', 'Fifth', 'Sixth'])
Fourth
Third
deque(['Second', 'First', 'Fifth', 'Sixth'])

When and why:

Queues are useful when you want to process one thing at a time. For example, we have a cron job that runs nightly which is responsible for enabling and disabling user accounts in various platforms. A queue is used in this instance as we don’t want to bomb the mail server with 200 requests to send emails out and we also don’t want to block resources during larger batches. By using a queue, we sequentially run through the affected accounts, and everything runs smoothly.

Time Complexity:
  Access: O(n)
  Search: O(n)
  Insert: O(1)
  Remove: O(1)

Tree

A Tree is an undirected, connected, acyclic graph. See All you need to know about tree data structures for a much more thorough explanation than I will provide at this time.

There is some terminology that needs to be understood when discussing trees:

Root is the very first, or root, node in a tree
Edge is a link between two nodes
Child is a node that has a parent node
Parent is a node that has nodes underneath it
Leaf is a node that has no child nodes
Height is the longest path in the tree
Depth is the distance from a node to the root node

Binary Tree

A Binary Tree is a tree data structure in which each node has at most two children, which are referred to as the left child and right child. This is not to be confused with a binary search tree, which is slightly different in the regard that any value in each node must be greater than or equal to the value stored in the left child node, while also being less than or equal to any value stored in the right child node. Confusing, yes, but we’ll get to that momentarily.

# Python

class BinaryTree:
    def __init__(self, value):
        self.value = value
        self.left_child = None
        self.right_child = None

    def insert_left(self, value):
        if self.left_child == None:
            self.left_child = BinaryTree(value)
        else:
            new_node = BinaryTree(value)
            new_node.left_child = self.left_child
            self.left_child = new_node

    def insert_right(self, value):
        if self.right_child == None:
            self.right_child = BinaryTree(value)
        else:
            new_node = BinaryTree(value)
            new_node.right_child = self.right_child
            self.right_child = new_node

top = BinaryTree('root')
top.insert_left('child_a')
top.insert_right('child_b')
child_a = top.left_child
child_b = top.right_child
child_a.insert_left('child_c')
child_a.insert_right('child_d')
child_b.insert_left('child_e')
child_a.insert_right('child_f')

# Structure
    root
    /\
  a    b
 / \  / \
c  d  e  f

An important detail about trees is the flexiblity for traversing a tree. We have two methods available, Depth-First Search (DFS) or Breadth-First Search (BFS). Let’s cover various methods of traversal now, then I’ll discuss when and why to use trees.

Depth-First Search follows a path all the way to the final node before backtracking and navigating another path. The algorithm would go something like:

Start at root
Go to child_a; print it.
Go to child_c; print it. Since this node doesn’t have any children, backtrack to child_a.
Go to child_d; print it. Since this node doesn’t have children, backtrack to root as all children of child_a have been enumerated.
Go to child_b; print it.
Go to child_e; print it. Since this node doesn’t have any children, backtrack to child_b.
Go to child_f; print it. Since this node doesn’t have children, and is the last node in the tree, finalize execution.

Now that we’re familiar with the general algorithm, we can discuss a few types of DFS.

Pre-order: Exactly what we did in the algorithm above. Prints the value of a node. If, and only if it has a left child, goes to the left child, and prints it. Repeats for the right child and subsequent nodes. Output looks like root-a-c-d-b-e-f.

def pre_order(self):
    print(self.value)

    if self.left_child:
        self.left_child.pre_order()
    if self.right_child:
        self.right_child.pre_order()

In-order: this runs left first, middle second, and right last. Using the tree above as an example, the resulting output of this style would be c-a-d-root-e-b-f.

def in_order(self):
    if self.left_child:
        self.left_child.in_order()

    print(self.value)

    if self.right_child:
        self.right_child.in_order()

Post-order: this runs bottom up, consuming each level of nodes as it works its way up the tree. Output would look like c-d-a-e-f-b-root

def post_order(self):
    if self.left_child:
        self.left_child.post_order()

    if self.right_child:
        self.right_child.post_order()

    print(self.value)

Breadth-First Search traverses by level and depth. A rudimentary illustration of what this means, using the same tree we used in our previous example, would look something like this:

    root        <-- Level 0
    /\
  a    b        <-- Level 1
 / \  / \
c  d  e  f      <-- Level 2

def breadthfirst(self):
    queue = Queue()
    queue.put(self)

    while not queue.empty():
        current_node = queue.get()
        print(current_node.value)

        if current_node.left_child:
            queue.put(current_node.left_child)

        if current_node.right_child:
            queue.put(current_node.right_child)

# Output
root-a-b-c-d-e-f

Binary Search Tree

As I mentioned earlier, we’d get to this. The most important distinction you need to remember is that the value of a node is greater than the value of its left child, but less than the value of its right child. Like this, right_child > root > left_child, or like:

And implemented:

def BinarySearchTree:
    def __init__(self, value):
        self.value = value
        self.left_child = None
        self.right_child = None

    def insert_node(self, value):
        if value <= self.value and self.left_child:
            self.left_child.insert_node(value)
        elif value <= self.value:
            self.left_child = BinarySearchTree(value)
        elif value > self.value and self.right_child:
            self.right_child.insert_node(value)
        else:
            self.right_child = BinarySearchTree(value)

    def find_node(self, value):
        if value < self.value and self.left_child:
            return self.left_child.find_node(value)
        if value > self.value and self.right_child:
            return self.right_child.find_node(value)

        return value == self.value

When and why:

The main advantage for using a binary search tree is it extends functionality of a normal array. While arrays are constant time for reads, adding an index at P position will result in any indices after P to be re-indexed, which is a linear time operation (O(n)). This operation isn’t terribly slow, but BST can perform better. They perform better because instead of relying on indexes for reference to values, they rely on their implicit structure. As a result, insertions and deletions are done at logarithmic time (O(log n)).

One of the main use cases for this data structure is indexing database entries. For example, if you have 1 million entries, and you need to find a specific field, you can locate this user in at most 21 iterations. Compare this to iterating over an array, which would require at minimum 20,000 iterations. Quite a big difference.

Time Complexity:
  Access: O(log(n))
  Search: O(log(n))
  Insert: O(log(n))
  Remove: O(log(n))

Trie

A trie, sometimes called a radix or prefix tree, is a kind of search tree that is used to store a dynamic set or associative array where the keys are usually Strings. No node in the tree stores the key associated with that node. A node’s position in the tree defines the key with which it is associated. All the descendants of a node have a common prefix of the String associated with that node, and the root is associated with the empty String.

Python implementation

PHP implementation

When and why:

Typically tries are used when doing lookups, such as searching for a specific word or string of words within a body of text. Other use cases include when you have a fixed dictionary that you want to perform lookups aginst quickly. When compared to a hashtable, it may require less storage but may also take longer to look up.

Fenwick Tree

A Fenwick tree, sometimes called a binary indexed tree, is implemented as an implicit data structure using an array. Given an index in the array representing a vertex, the index of a vertex’s parent or child is calculated through bitwise operations on the binary representation of its index. Each element of the array contains the pre-calculated sum of a range of values, and by combining that sum with additional ranges encountered during an upward traversal to the root, the prefix sum is calculated.

This stack exchange thread has some excellent discussion about Fenwick trees, and I highly recommend reviewing it for more details than I can provide in regards to this data structure.

# Python, see link [10]

class Fenwick():
    def update(self, idx, val):
        while idx <= self.MaxVal:
            self.tree[idx] += val
            idx += idx & (-idx)

    def query(self, idx):
        total = 0
        while idx > 0:
            total += self.tree[idx]
            idx -= idx & (-idx) # subtract the least significant, non-zero bit from idx
        return total

    def __init__(self, size = []):
        self.MaxVal = len(size)
        self.tree = [0 for i in xrange(self.MaxVal)]
        for idx in xrange(0, self.MaxVal):
            self.update(idx, size[idx -1])

// PHP

class Fenwick
{
    public $max;
    public $tree;

    public function __construct(array $data = [])
    {
        $this->max = sizeof($data);
        echo "Max: " . $this->max . PHP_EOL;
        for ($i = 0; $i < $this->max; $i++) {
            $this->tree[$i] = $data[$i];
        }

        echo "Tree: " . print_r($this->tree, true) . PHP_EOL;
        for ($j = 0; $j < $this->max; ++$j) {
            $this->update($j, $data[$j]);
        }
    }

    /**
     * Don't rely on this method -- I'm not sure it's working
     * The while loop was causing an infinite loop, so I removed it.
     *
     * I believe this was caused by the bitwise operation.
     */
    public function update(int $idx, int $val): void
    {
        // while ($idx <= $this->max) {
            $this->tree[$idx] += $val;
            $idx += ($idx & (-$idx));
        // }
    }

    public function query(int $idx): int
    {
        $total = 0;

        while ($idx > 0) {
            $total += $this->tree[$idx];
            $idx -= ($idx & (-$idx));
        }
        return $total;
    }
}

$t = new Fenwick(['0', '15', '325', '6', '11', '17']);
echo 'Result: ' . $t->query(3) . PHP_EOL;
exit;

// Output:
Max: 6
Tree: Array
(
    [0] => 0
    [1] => 15
    [2] => 325
    [3] => 6
    [4] => 11
    [5] => 17
)

Result: 662

When and why:

Fenwick trees, or binary indexed trees, are most useful when calculating the sum of a range. They’re also commonly used in coding challenges. I personally have not found a practical use for this data structure in my day to day tasks.

Time Complexity:
  Range Sum: O(log(n))
  Update: O(log(n))

Segment Tree

A Segment tree, is a tree data structure for storing intervals, or segments. It allows querying which of the stored segments contain a given point. A node may store one or more data members of an interval, which may be queried later.

# Python, see link [12]

class SegmentTree:
    def __init__(self):
        self.tree = None

    def buildTree(self,List):
        n = len(List)
        self.tree = [0 for i in range(2*n)]

        # Initialize leaf nodes
        for i in range(n):
            self.tree[i+n] = List[i]

        # Calculate parents and build the tree
        for i in range(n-1,0,-1):
            self.tree[i] = self.tree[i<<1] + self.tree[i<<1 | 1]

    def updateInRange(self, value, left, right):
        n = len(self.tree)//2

        for i in range(left+n, right+n+1):
            # Update the leaf node
            self.tree[i] += value

            j = i
            # Move upwards and update parents
            while j>1:
                self.tree[j>>1] = self.tree[j] + self.tree[j^1]
                j >>= 1

    def sumInRange(self, left, right):
        n = len(self.tree)//2
        result = 0
        left += n
        right += n

        # Add the current node to the sum if its parent is not included in the same
        # (resulting in faster computation of result)
        while left<right:
            # If left is odd then it means that it is the right child of it's parent
            # and the interval includes only left and not it's parent. So, include
            # this node to sum and move to the parent of it's next node.
            # Similar condition is applied to right index.
            if left&1:
                result += self.tree[left]
                left += 1
            if right&1:
                right -= 1
                result += self.tree[right]

            left >>= 1
            right >>= 1

        return result


def main():
    segmentTree = SegmentTree()

    #Sample test case
    sampleList = [1,3,5,7,9,11]
    segmentTree.buildTree(sampleList)

    print("Array representation of tree :",segmentTree.tree)
    print("Sum of elements in the interval [1,3] :",segmentTree.sumInRange(1,4))

if __name__ == "__main__":
    main()

When and why:

Used as a more versatile, heavier handed version of FT/BIT. I need to do further research on this data structure as well to fully comprehend what its valid use cases are.

Time Complexity:
  Range Query: O(log(n))
  Update: O(log(n))

Heap

A Heap is a specialized tree-based data structure that satisfies the heap property. If $a is a parent node of $b, then the key (the value) of $a is ordered with respect to the key of node $b with the same ordering applying across the entire heap. There can be two types of heap:

Max heap: the keys of parent nodes are always greater than or equal to those of the children and the highest key is in the root node

// PHP

$heap = new SplMaxHeap();
$heap->insert(65);
$heap->insert(13);
$heap->insert(94);
$heap->insert(0);

foreach($heap as $number) {
    echo $number . PHP_EOL;
}

// Output
94
65
13
0

Min heap: the keys of parent nodes are less than or equal to those of the children and the lowest key is in the root node

# Python

import heapq

heap = [12, 45, 9, 1, 53, 8, 27, 66, 20]

heapq.heapify(heap)
print(heap)

# Output
[1, 8, 9, 12, 20, 27, 45, 53, 66]

// PHP

$heap = new SplMinHeap();
$heap->insert([22,333]);
$heap->insert([2,33]);
$heap->insert([222,3]);

var_dump($heap->extract());
var_dump($heap->extract());
var_dump($heap->extract());

// Output
array ( 0 => 2, 1 => 33, )
array ( 0 => 22, 1 => 333, )
array ( 0 => 222, 1 => 3, )

When and why:

Useful whenever you need quick access to the largest or smallest item in a set. The reason for this is the item you need will always be the first in the array, or at the root of the tree. Insertions are fast, and this makes heaps a good way of dealing incoming events or data and always having access to the earliest or biggest indexes. Also valuable for queues, schedulers, or anything else where the earliest item is desired.

Time Complexity:
  Access Max / Min: O(1)
  Insert: O(log(n))
  Remove Max / Min: O(log(n))

Hashing

Hashing is used to map data of an arbitrary size to data of a fixed size. If two hash keys map to the same value, a collision occurs. Sometimes, hashes can be used in what are call hash tables, which are nothing more than a structure that can map keys to values, for example:

// PHP
return [
    'key' => 'value',
    'foo' => 'bar'
];

# Python

# Shorthand
hashmap = {'key': 'value', 'foo': 'bar'}

# Using dict()
hashmap = dict({'key': 'value', 'foo': 'bar'})

return hashmap

To facilitate collision resolution we have a couple of tools available to us:

Separate Chaining: in separate, independent buckets, each bucket contains a list of entries for each index. The time for hash map operations is the time to find the bucket (constant time), plus the time to iterate through the list Open Addressing: when a new entry is inserted into a bucket, the buckets are examined (starting with the hashed-to-slot and proceeding in some sequence) until an unoccupied slot is found. This is called open addressing because sometimes the hash does not point to the location of the value.

Note: Hashmaps resembles JSON syntax for a reason. They’re technically the same, and can also be referred to as associative arrays, depending on the language.

When and why:

Hashing is useful in a lot of instances. One use case would be an unordered list of cities that you want to index by state so you can use JavaScript to dynamically populate some drop downs in a form. You could use the user’s state selection as the key lookup in the hash, and this would ensure you have a one to many relationship between your state and city names. Really, hashes are incredibly useful far beyond such a small use case. As the note says, hashmaps are identical to JSON, and JSON has definitely taken off as the dominant structure of API communications in the past decade. That didn’t happen just by chance.

Graph

A Graph is a data structure consisting of a finite (and possibly mutable) set of nodes, or points, together with a set of unordered pairs of these vertices (for an undirected graph) or a set of of ordered pairs (for a directed graph). These pairs are known as edges, and for a directed graph can be referred to as arrows. See this wikipedia for more information, because I’m still super fuzzy on these as I’ve never had a reason to implement this before.

// PHP

// This is a Dijkstra implementation, see link [14]
class Graph
{
  protected $graph;

  public function __construct($graph) {
    $this->graph = $graph;
  }

  public function shortestPath($source, $target) {
    // array of best estimates of shortest path to each
    // vertex
    $d = array();
    // array of predecessors for each vertex
    $pi = array();
    // queue of all unoptimized vertices
    $Q = new SplPriorityQueue();

    foreach ($this->graph as $v => $adj) {
      $d[$v] = INF; // set initial distance to "infinity"
      $pi[$v] = null; // no known predecessors yet
      foreach ($adj as $w => $cost) {
        // use the edge cost as the priority
        $Q->insert($w, $cost);
      }
    }

    // initial distance at source is 0
    $d[$source] = 0;

    while (!$Q->isEmpty()) {
      // extract min cost
      $u = $Q->extract();
      if (!empty($this->graph[$u])) {
        // "relax" each adjacent vertex
        foreach ($this->graph[$u] as $v => $cost) {
          // alternate route length to adjacent neighbor
          $alt = $d[$u] + $cost;
          // if alternate route is shorter
          if ($alt < $d[$v]) {
            $d[$v] = $alt; // update minimum length to vertex
            $pi[$v] = $u;  // add neighbor to predecessors
                           //  for vertex
          }
        }
      }
    }

    // we can now find the shortest path using reverse
    // iteration
    $S = new SplStack(); // shortest path with a stack
    $u = $target;
    $dist = 0;
    // traverse from target to source
    while (isset($pi[$u]) && $pi[$u]) {
      $S->push($u);
      $dist += $this->graph[$u][$pi[$u]]; // add distance to predecessor
      $u = $pi[$u];
    }

    // stack will be empty if there is no route back
    if ($S->isEmpty()) {
      echo "No route from $source to $targetn";
    }
    else {
      // add the source node and print the path in reverse
      // (LIFO) order
      $S->push($source);
      echo "$dist:";
      $sep = '';
      foreach ($S as $v) {
        echo $sep, $v;
        $sep = '->';
      }
      echo "n";
    }
  }
}

When and why:

A lot of the big companies are offering graph APIs now, and there is a lot of excitement around graphs. This is another area I’m weak in and need to spend some time studying and figuring out what the application of this data structure is. Suppose that just means more content needs to be written.

Algorithms

So, I was gonna write about some algorithms at this point. As I started to dig into them (and it being quite late in the day, post work and pre-beer), I’ve decided that the algorithms portion will probably be a multipart series. There are simply too many (a few small ones are even implemented in this post) to be able to cram them all into this one-stop-learn-everything-you-need-so-one-of-the-Big-Four-will-hire-you post. However, what I will leave you with are some additional questions and resources that will surely keep you busy while I take another month to write something new. If you made it here, thank you. If not, well, you probably were never here to begin with.

Questions to ask

How big is the size of the input?
How big is the range of values?
What kind of values are there? Are there negative numbers? Floating points? Will there be empty inputs?
Are there duplicates within the input?
What are some extreme cases of the input?
How is the input stored? If you are given a dictionary of words, is it a list of strings or a trie?

Links:

1. Awesome Algorithms - Online Courses 2. Coding Interviews for Dummies 3. Time Complexity 4. 10 Most Popular Coding Challenge Websites (2016) 5. Short Circuit Evaluation 6. Tech Interview Handbook 7. Base CS 8. Dynamic Programming 9. Linked Lists 10. Fenwick Trees 11. Binary Indexed Tree 12. Segment Tree (Wikipedia) 13. Data Structures 4 14. All you need to know about tree data structures 15. How to implement a binary tree 16. Why Use Binary Search Tree 17. Binary Indexed Trees 18. A simple approach to segment trees