Data Structures & Algorithms

Lecture 1: Algorithms and Analysis

Pseudocode

  • Selection Sort
  • Input: A collection \(A = \{a_1, \ldots, a_n\}\)
  • Output: \(A\), Sorted
  • For (\(i = 1\ldots n-1\))
    • \(min \gets i\)
    • For (\(j = i+1 \ldots n\))
      • if (\(a_{min} > a_j\))
        • \(min \gets j\)
    • swap \(a_{min}\), \(a_i\)
  • output \(A\)
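  • A minimal C sketch of the pseudocode above (array interface is illustrative):

    /* Selection sort: repeatedly select the minimum of the unsorted
       suffix a[i..n-1] and swap it into position i. */
    void selection_sort(int *a, int n) {
        for (int i = 0; i < n - 1; i++) {
            int min = i;                      /* index of current minimum */
            for (int j = i + 1; j < n; j++) {
                if (a[j] < a[min])
                    min = j;
            }
            int tmp = a[i]; a[i] = a[min]; a[min] = tmp;  /* swap */
        }
    }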

Analysis

  1. Identify the input
  2. Identify the input size
    • collection: size/cardinality
    • tree/graph: number of nodes/edges
  3. Identify the elementary operation(s)
    • most common or most expensive
    • loop-control comparisons, assignments, and increments are ignored
  4. Analyze how many times the elementary operation executes with respect to the input size
  5. Provide an asymptotic characterization from step 4

Example

  • Collection A
  • Size or cardinality of \(A\), \(n\)
  • The comparison \(a_{min} > a_j\) in the inner loop
  • \(\sum_{i = 1}^{n-1} (n - i) = \sum_{i = 1}^{n - 1} n - \sum_{i = 1}^{n - 1} i = n(n - 1) - \frac{n(n - 1)}{2} = \frac{n(n-1)}{2}\)
  • \(O(n^2)\), \(\Theta(n^2)\) (Prefer \(\Theta\))

Brute Force

  • Given: a set of \(n\) points, \(P = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}\)
  • Find: the two points \(p\) and \(q\) that are closest to each other with regard to \(L_2\) norm (\(\sqrt{(x_1-x_2)^2 + (y_1 - y_2)^2}\))

Looping Method

  • \(minDist \gets \infty\)
  • For (\(i = 1 \ldots n-1\))
    • For(\(j = i+1 \ldots n\))
      • If (\(\mathrm{distance}(P_i, P_j) < minDist\))
        • \(p \gets P_i\)
        • \(q \gets P_j\)
        • \(minDist \gets distance(P_i, P_j)\)
  • output p, q

Simpler

  • \(minDist \gets \infty\)
  • For Each Pair of Points \(p = (x, y)\), \(q = (x^\prime, y^\prime)\)
    • d ← distance(p, q)
    • if (d < minDist)
      • \(minDist \gets d\)
  • Finish

Analysis

  • \(\binom{n}{2} = \frac{n!}{(n - 2)!2!} = \frac{n(n-1)}{2} = \Theta(n^2)\)
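  • A C sketch of the brute-force method (point type and interface are illustrative):

    #include <float.h>
    #include <math.h>

    typedef struct { double x, y; } point;

    /* Store the indices of the closest pair in *p and *q.
       Theta(n^2) distance computations, matching the analysis above. */
    void closest_pair(const point *P, int n, int *p, int *q) {
        double minDist = DBL_MAX;
        for (int i = 0; i < n - 1; i++) {
            for (int j = i + 1; j < n; j++) {
                double dx = P[i].x - P[j].x, dy = P[i].y - P[j].y;
                double d = sqrt(dx * dx + dy * dy);   /* L2 norm */
                if (d < minDist) {
                    minDist = d;
                    *p = i;
                    *q = j;
                }
            }
        }
    }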

Common Big Os

  • Constant
  • Logarithmic
  • Linear
  • Iterated Linear
  • Quadratic
  • Polynomial
  • \(2^n\)
  • Factorial

Lecture 2: Brute Force Approach

  • Examining all possibilities
  • optimization problems: find the best solution
  • decision problems: find a solution, or determine none exists
  • generally infeasible or inefficient
  • may involve:
    • generating combinatorial objects
    • backtracking
  • Ex: What percentage of the \(52!\) possible orderings of a 52-card deck has ever been generated by humans?
  • Ex: Compute \(a^n \mod m\) for a very large \(n\): Dumbly

    /* Naive approach: n multiplications, reducing mod m at each step.
       a, n, m are assumed to be initialized elsewhere. */
    int a, n, m;
    int result = 1;

    for (int i = 1; i <= n; i++) {
        result = (result * a) % m;
    }
    
  • Pairs: 2 for loops
  • triples: 3 for loops
  • for each subset of \(k\) elements: would require \(k\) nested for loops, but \(k\) is a variable, so an explicit generation algorithm is needed (below)
  • Given a set of n elements \(A = \{a_1, a_2, \ldots, a_n\}\) and an integer \(k\), \(1 \leq k \leq n\), output all subsets of \(A\) with cardinality \(k\).
    • \(\binom{n}{k} \in O(n^k)\)
  • It suffices to consider combinations of integers \(1, 2, \ldots, n\).
  • Algorithm:
    • Start with \(a_1, a_2, \ldots, a_k\)
    • Assume we have \(a_1, a_2, \ldots, a_k\), we want to generate the next combination
    • Locate the last element \(a_i\) in the current combination such that \(a_i \neq n - k + i\)
    • Replace \(a_i\) with \(a_i + 1\), \(a_j\) with \(a_i + j - i\) for \(j = i+1, i+2, \ldots k\)
  • Ex: \(k = 3\) on \(\{1, 2, 3, 4, 5\}\)
    • \(\{1, 4, 5\}\)
    • \(\{2, 3, 4\}\)
    • \(\{2, 3, 5\}\)
    • \(\{2, 4, 5\}\)
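  • A C sketch of the next-combination step (0-indexed array holding values from 1 to \(n\); interface illustrative):

    /* Advance a[0..k-1], an increasing combination of 1..n, to the next
       combination in lexicographic order; returns 0 when none remains. */
    int next_combination(int *a, int k, int n) {
        int i = k - 1;
        while (i >= 0 && a[i] == n - k + i + 1)  /* a[i] at its max value */
            i--;
        if (i < 0)
            return 0;                            /* was the last combination */
        a[i]++;
        for (int j = i + 1; j < k; j++)
            a[j] = a[i] + j - i;                 /* consecutive values after a[i] */
        return 1;
    }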
  • Goal: generate all permutations of a set of size \(n\), \(A = \{a_1, a_2, \ldots, a_n\}\)
    • Steinhaus–Johnson–Trotter Algorithm
  • Generate all \(n!\) permutations in lexicographic order
    • Start with \(1, 2, 3, \ldots, n\)
    • Given \(a_1, a_2, \ldots, a_n\)
    • Find the rightmost pair \(a_i, a_{i + 1}\) such that \(a_i < a_{i + 1}\)
    • Find the smallest element to the right of this pair that is larger than \(a_i\); call it \(a^{\prime}\)
    • Swap \(a_i\) and \(a^{\prime}\), then sort all elements to the right of position \(i\)
  • ex: \(n = 6\)
    • \(\{1, 6, 3, 5, 4, 2\}\)
    • \(\{1, 6, 4, 2, 3, 5\}\)
    • \(\{1, 6, 4, 2, 5, 3\}\)
    • \(\{1, 6, 4, 3, 2, 5\}\)
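  • A C sketch of the next-permutation step; since the suffix after the swap is descending, reversing it is equivalent to sorting it (interface illustrative):

    /* Advance a[0..n-1] to its lexicographic successor;
       returns 0 if a is already the last (descending) permutation. */
    int next_permutation(int *a, int n) {
        int i = n - 2;
        while (i >= 0 && a[i] >= a[i + 1])  /* rightmost pair a[i] < a[i+1] */
            i--;
        if (i < 0)
            return 0;
        int j = n - 1;
        while (a[j] <= a[i])                /* smallest element right of i larger than a[i] */
            j--;
        int t = a[i]; a[i] = a[j]; a[j] = t;
        for (int l = i + 1, r = n - 1; l < r; l++, r--) {  /* reverse the suffix */
            t = a[l]; a[l] = a[r]; a[r] = t;
        }
        return 1;
    }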
  • Ordered combinations: given \(n\) distinct elements and a length \(k\), generate all ordered arrangements of length \(k\), \(P(n, k)\)
    • Generate all subsets of size k, for each, generate all permutations of it
  • Permutations with repetition: given \(n\) distinct elements, generate all ordered sequences of length \(k\), where each element can appear multiple times. \(n^k\)
    • Count from 0 to \(n^k - 1\), converting to base \(n\), map to the set.
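  • A C sketch of the counting idea (fixed digit buffer, assumes \(k \leq 64\) and that \(n^k\) fits in a long; interface illustrative):

    /* Enumerate all n^k sequences over {0,...,n-1} by counting
       from 0 to n^k - 1 and decoding each counter in base n. */
    void all_sequences(int n, int k) {
        long total = 1;
        for (int i = 0; i < k; i++)
            total *= n;                 /* n^k sequences */
        int digits[64];
        for (long c = 0; c < total; c++) {
            long t = c;
            for (int i = k - 1; i >= 0; i--) {
                digits[i] = t % n;      /* i-th base-n digit of c */
                t /= n;
            }
            /* digits[0..k-1] now encodes one sequence; map each
               digit to the corresponding element of the set here */
        }
    }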

Satisfiability (SAT)

  • Given a predicate \(P(x_1, x_2, \ldots x_n)\) on \(n\) boolean variables \(x_1, x_2, \ldots x_n\)
  • Determine if there exists an assignment of \(x_1, x_2, \ldots, x_n\) such that \(P\) evaluates to true
  • Ex: \((x_1 \lor x_2 \lor x_3) \land (x_1 \lor \lnot x_2) \land (x_2 \lor \lnot x_3) \land (x_3 \lor \lnot x_1) \land (\lnot x_1 \lor \lnot x_2 \lor \lnot x_3)\)
    • by brute force: No. All \(2^n\) possibilities were checked
  • Possible solutions:
    • First:
      • Input: A predicate \(P(x_1,\ldots, x_n)\)
      • Output: True if satisfiable, false otherwise
      • For each Truth assignment T
        • if \(\mathrm{apply}(P, T)\)
          • output True
      • Output False
      • \(O(2^n)\)
    • Second: Backtracking, pruning, think trees

Lecture 3: Knapsack, Hamiltonian Path and Backtracking

  • Backtracking requires recursion
  • SAT traversal, w/ backtracking: SAT(P, \(\overrightarrow{t}\))
    • Input: A predicate \(P(x_1, \ldots, x_n)\), a partial truth assignment \(\overrightarrow{t} = t_1, \ldots, t_k\)
    • Output: True if P is satisfiable, false otherwise
    • If \(k = n\)
      • Return \(P(\overrightarrow{t})\)
    • Else
      • \(t_{k + 1} \gets 0\)
      • If SAT(P, \(\overrightarrow{t}\))
        • return True
      • Else
        • \(t_{k + 1} \gets 1\)
        • return SAT(P, \(\overrightarrow{t}\))
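  • A C sketch of the backtracking traversal; the predicate is abstracted as a function pointer, and names are illustrative:

    /* P: predicate over a complete assignment t[0..n-1] of 0/1 values */
    typedef int (*predicate)(const int *t, int n);

    /* Try t[k] = 0 then t[k] = 1, recursing until all n variables are
       set; mirrors the SAT(P, t) pseudocode above. */
    int sat(predicate P, int *t, int k, int n) {
        if (k == n)
            return P(t, n);         /* complete assignment: evaluate P */
        t[k] = 0;
        if (sat(P, t, k + 1, n))
            return 1;
        t[k] = 1;
        return sat(P, t, k + 1, n);
    }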

Hamiltonian Path/Cycle

  • given an undirected graph \(G = (V, E)\)
  • output true if G has a Hamiltonian path, false otherwise
  • A Hamiltonian Path is a path in \(G\) that visits every vertex exactly once.
  • A Hamiltonian Cycle is a path in \(G\) that visits every vertex exactly once, save for the first and last, which must be the same.
  • Given a graph \(G = (V, E)\) and a vertex \(v \in V\), the neighborhood of \(v\) is \(N(v) = \{x \mid (v, x) \in E\}\)
  • Naive approach: Go by permutations
  • Smart approach: Go by neighborhood
    • Given a Graph and a partial path:
    • Return true if the path reaches length \(n - 1\) (all \(n\) vertices), false otherwise
  • Algorithm (HamWalk(G, \(p = v_1, \ldots, v_k\))):
    • Input \(G = (V, E)\) and partial path \(p\)
    • Return true if \(G\) contains a hamiltonian path, false otherwise
    • If \(k = n\), return True
    • for each \(v \in N(v_k)\)
      • if (\(v \not\in p\))
        • if HamWalk(G, \(p + v\)), return true
    • return false
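  • A C sketch of HamWalk over an adjacency matrix (interface illustrative). To decide the whole problem, call it once per start vertex \(s\), with \(p[0] = s\), \(in\_p[s] = 1\), \(k = 1\):

    /* Extend partial path p[0..k-1]; in_p[v] marks vertices already on it.
       adj is a flattened n x n 0/1 adjacency matrix. */
    int ham_walk(const int *adj, int n, int *p, int *in_p, int k) {
        if (k == n)
            return 1;                          /* all vertices visited */
        int u = p[k - 1];                      /* current endpoint */
        for (int v = 0; v < n; v++) {
            if (adj[u * n + v] && !in_p[v]) {  /* v in N(u), v not in p */
                p[k] = v; in_p[v] = 1;
                if (ham_walk(adj, n, p, in_p, k + 1))
                    return 1;
                in_p[v] = 0;                   /* backtrack */
            }
        }
        return 0;
    }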

0-1 Knapsack Problem

  • Given a collection of items \(A = \{a_1, \ldots, a_n\}\), a weight function \(w: A \to \mathbb{R}^+\), a value function \(v: A \to \mathbb{R}^+\), and a capacity \(W\)
  • Output \(S \subseteq A\) such that \(\sum_{s \in S} w(s) \leq W\) but \(\sum_{s \in S} v(s)\) is maximized
  • Ex:

    item  value  weight
      1     15      1
      2     10      5
      3      9      3
      4      5      4
    • \(W = 8\)
    • Candidate solutions:
      • \(\{1, 2, 3\}\): weight \(9 > W\), infeasible
      • \(\{1, 3, 4\}\): weight \(8 \leq W\), value \(29\) (the optimum)
  • For each subset \(S\) of \(A\)
    • if the total weight of \(S\) is at most \(W\), and its total value is greater than the best value so far
    • update solution
    • update best value
  • Optimally:
    • Use backtracking: prune a partial solution as soon as its weight exceeds \(W\), never checking any of its children (see the C sketch after this list)
  • Algorithm
    • Input: Knapsack \(A = \{a_1, \ldots, a_n\}\), weight function \(w(a_i)\), value function \(v(a_i)\), max weight \(W\), and a partial solution \(S \subseteq A\) containing only elements with index \(\leq j\)
    • Output: A solution \(S^{\prime} \subseteq A\) at least as good as \(S\).
    • If \(j = n\):
      • Return \(S\)
    • \(S_{best} \gets S\)
    • For \(k = j+1 \ldots n\)
      • \(S^{\prime} \gets S \cup \{a_k\}\)
      • If \(w(S^{\prime}) \leq W\):
        • \(T \gets\) Knapsack(\(A\), \(w\), \(v\), \(W\), \(S^{\prime}\), \(k\))
        • If \(v(T) > v(S_{best})\)
          • \(S_{best} \gets T\)
    • Return \(S_{best}\)
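  • A C sketch of the backtracking search, using an equivalent include/exclude formulation with the same weight-based pruning (globals and fixed capacity for brevity; all names illustrative):

    int n;                      /* number of items */
    double W, *w, *v;           /* capacity, weights, values */
    int take[64], best_take[64];
    double best_val = 0.0;

    /* Decide items j..n-1, given the weight/value of the partial solution */
    void knapsack(int j, double wt, double val) {
        if (val > best_val) {                       /* record improvement */
            best_val = val;
            for (int i = 0; i < n; i++)
                best_take[i] = (i < j) ? take[i] : 0;
        }
        if (j == n)
            return;
        if (wt + w[j] <= W) {                       /* prune overweight branches */
            take[j] = 1;
            knapsack(j + 1, wt + w[j], val + v[j]); /* take item j */
        }
        take[j] = 0;
        knapsack(j + 1, wt, val);                   /* skip item j */
    }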

Lecture 4: Divide and Conquer: Repeated Squaring, Karatsuba Multiplication

Traveling Salesman Problem

  • Given: A collection of cities \(C = \{c_1, \ldots, c_n\}\)
  • Output: an ordering of the cities \(\pi(C) = (c_1^{\prime}, \ldots, c_n^{\prime})\)
    • such that the total distance travelled, \(\left(\sum_{i = 1}^{n - 1} \delta(c_i^{\prime}, c_{i+1}^{\prime})\right) + \delta(c_1^{\prime}, c_n^{\prime})\), is minimized
    • The cities form a complete graph with weighted edges, so every solution (a permutation of the set) is a Hamiltonian cycle
  • an optimization problem: backtracking is applicable but not the most appropriate; it requires being smart
  • Question: Ham Path or TSP: Which is harder or more complex? Which requires more resources?
    • Their decision versions are equivalent, as both are NP-complete

Convex Hull

  • Given a field of points, find a polygon that encloses all points, is convex, and has a minimal number of sides
  • Given a set of \(n\) points, compute a convex hull that contains them
    • brute force is \(O(n^3)\)
    • the smart way presorts the points (\(O(n \log n)\)) and then scans in \(O(n)\)

Assignment Problem

  • Given a collection of tasks \(T\) and persons \(P\), and an \(n \times n\) cost matrix \(C\) such that \(C_{i, j}\) is the cost of assigning person \(p_i\) to task \(t_j\).
  • Output an assignment (a bijection) \(f: P \to T\) such that \(\sum_{p_i \in P} C_{i, f(p_i)}\) is minimized.
  • Naive approach is \(O(n!)\)
  • Better Solutions
    • Hungarian Algorithm – Polynomial Time
    • Solving a Linear Program – Much faster, simpler

Integer Mod

  • Given integers \(a\), \(n\), \(m\)
  • Compute \(a^n \mod{m}\)
  • Naively: perform \(n\) multiplications (the input size is \(\log n\) bits, so \(n = 2^{\log n}\) multiplications is exponential in the input size: \(O(2^N)\) for \(N = \log n\))
  • Ex: \(a^{77} = a^{64} \cdot a^{8} \cdot a^{4} \cdot a\)
    • Repeated Squaring, Binary Exponentiation
  • Algo: \(a, m, n = (b_k, \ldots, b_0)\)
    • \(term \gets a\)
    • \(prod \gets 1\)
    • if \(b_0 = 1\)
      • \(prod \gets a\)
    • for \(i = 1 \ldots k\)
      • \(term \gets term^2 \mod{m}\)
      • if \(b_i = 1\)
        • \(prod \gets (prod \cdot term) \mod{m}\)
    • output \(prod\)
    • \(O(\log n)\) multiplications; linear in the number of bits, \(O(k)\) (see the C sketch below)
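  • A C sketch of repeated squaring, scanning the bits of \(n\) right to left (assumes positive inputs and that \((m-1)^2\) fits in a long):

    long mod_pow(long a, long n, long m) {
        long term = a % m;               /* a^(2^i) mod m */
        long prod = 1;
        while (n > 0) {
            if (n & 1)                   /* bit b_i is set */
                prod = (prod * term) % m;
            term = (term * term) % m;    /* square for the next bit */
            n >>= 1;
        }
        return prod;
    }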

Karatsuba Multiplication

  • \(ad + bc = (a + b)(c + d) - ac - bd\)
  • Given two \(n\)-digit numbers \(a, b\)
    • split into 4 numbers
    • \(a = a_1 \cdot 10^{\frac{n}{2}} + a_0\)
    • \(b = b_1 \cdot 10^{\frac{n}{2}} + b_0\)
    • \(ab = a_1 b_1 10^n + (a_1 b_0 + a_0 b_1)10^{\frac{n}{2}} + a_0 b_0 = a_1 b_1 10^n + \left[(a_1 + a_0)(b_1 + b_0) - a_1 b_1 - a_0 b_0\right] 10^{\frac{n}{2}} + a_0 b_0\)
  • Do this recursively
  • Is \(O(n^{\log_2 3})\)
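  • A C sketch of one level of the recursion, on 16-bit halves of 32-bit operands so the arithmetic is exact; a real implementation recurses on arrays of digits:

    #include <stdint.h>

    uint64_t karatsuba_step(uint32_t a, uint32_t b) {
        uint64_t a1 = a >> 16, a0 = a & 0xffff;         /* a = a1*2^16 + a0 */
        uint64_t b1 = b >> 16, b0 = b & 0xffff;
        uint64_t hi  = a1 * b1;                         /* multiplication 1 */
        uint64_t lo  = a0 * b0;                         /* multiplication 2 */
        uint64_t mid = (a1 + a0) * (b1 + b0) - hi - lo; /* 3: a1*b0 + a0*b1 */
        return (hi << 32) + (mid << 16) + lo;           /* exact product */
    }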

Lecture 5: Karatsuba and Strassen Matrix Multiplication

  • Karatsuba alternatives
    • Toom–Cook: \(O(n^{1.465})\)
    • Fast Fourier transform (Schönhage–Strassen): \(O(n \log n \log \log n)\)
    • Fürer: \(O(n \log n \cdot 2^{O(\log^* n)})\)
  • \(\log^* 64 = 4\)

Matrix Multiplication

  • \(A \cdot B = C\)
    • \(c_{ij} = \sum_{k = 1}^{n} a_{ik}\cdot b_{kj}\)
    • Traditional Method is \(O(n^3)\) for multiplications, \(O(n^3)\) for additions
    • If the input size is taken as the number of matrix entries, \(N = n^2\), this is \(O(N^{1.5})\)

Strassen’s Matrix Multiplication Algorithm

  • For \(2 \times 2\) matrices:
    • \(C_{00} = M_1 + M_4 - M_5 + M_7\), \(C_{01} = M_3 + M_5\), \(C_{10} = M_2 + M_4\), \(C_{11} = M_1 - M_2 + M_3 + M_6\)
    • \(M_1 = (A_{00} + A_{11})(B_{00} + B_{11})\)
    • \(M_2 = (A_{10} + A_{11})B_{00}\)
    • \(M_3 = A_{00}(B_{01} - B_{11})\)
    • \(M_4 = A_{11}(B_{10} - B_{00})\)
    • \(M_5 = (A_{00} + A_{01})B_{11}\)
    • \(M_6 = (A_{10} - A_{00})(B_{00} + B_{01})\)
    • \(M_7 = (A_{01} - A_{11})(B_{10} + B_{11})\)
    • 7 multiplications, 18 additions/subtractions for Strassen's
    • 8 multiplications, 4 additions for the traditional method
  • For an \(n \times n\) matrix, split into quadrants \(A_{00}, A_{01}, A_{10}, A_{11}\), each of size \(\frac{n}{2} \times \frac{n}{2}\), and the same for \(B\)
    • Use the previous identities recursively
  • \(M(n)\) is the number of scalar multiplications performed by Strassen's algorithm on two \(n \times n\) matrices
    • \(= 7M(\frac{n}{2})\)
    • \(M(2) = 7\)
  • By Master Theorem:
    • \(a = 7, b = 2, d = 0\)
    • \(7 > 2^0\)
    • \(\Theta(n^{\log_2 7})\)
  • Alternatives:
    • Coppersmith–Winograd: \(O(n^{2.375477})\)
    • Williams: \(O(n^{2.3727})\) (2012)
  • Algorithm Details
    • design and implement splitting, or have your functions handle arbitrary index ranges
    • matrix add and subtract
    • matrix combine
    • \(M_1, \ldots, M_7\) functions
    • pad-with-zeros function, for when \(n\) is not a power of 2 and unpad

Lecture 6: Closest Pair, Convex Hull and Presorting & Gaussian Elimination and Applications

Closest Pairs, Revisited

  • Divide and Conquer
    • Divide the points into 2 sets along the \(x\) axis
    • Conquer by recursively finding the two closest points in each partition
    • Combine by considering combinations that cross the split
    • \(D(n)\) the number of distance calculations for divide and conquer
      • \(2D\left(\frac{n}{2}\right) + \frac{k}{2}n\)
      • By Master Theorem \(a = 2\), \(b = 2\), \(d = 1\)
      • \(a = b^d\), thus \(\mathcal{O}(n\log n)\)
  • Use sorting to help reduce work

Gaussian Elimination

  • \(3x + 2y = 35\)
  • \(x + 3y = 27\)
  • \(a_{11} x_1 + a_{12} x_2 + \cdots + a_{1n} x_n = b_1\)
  • \(\vdots\)
  • \(a_{n1} x_1 + a_{n2} x_2 + \cdots + a_{nn} x_n = b_n\)
  • Provides an \(n \times n\) matrix, and with \(\overrightarrow{b}\) added on the rhs, you get an \(n \times (n + 1)\) augmented matrix
  • Goal: transform the augmented matrix into an upper triangular matrix
  • Tools: We can perform any elementary row operation (any op that does not change the solution set)
    • row multiply by any non-zero constant
    • reorder rows
    • add any constant multiple of one row to another
  • This can be done with Gauss-Jordan elimination
  • Gaussian Elimination Algorithm
    • Input: \(n \times n\) matrix \(A\), \(n \times 1\) right-hand-side vector \(\overrightarrow{b}\)
    • \(A^{\prime} \gets \left[ A | \overrightarrow{b} \right]\)
    • For \(i = 1 \ldots n - 1\)
      • \(\mathrm{pivotrow} \gets i\)
      • for \(j = i + 1 \ldots n\)
        • if \(|A^{\prime}_{j,i}| > |A^{\prime}_{\mathrm{pivotrow},i}|\)
          • \(\mathrm{pivotrow} \gets j\)
      • for \(k = i \ldots n + 1\)
        • swap \(A^{\prime}_{i,k}\), \(A^{\prime}_{\mathrm{pivotrow},k}\)
      • for \(j = i + 1 \ldots n\)
        • \(t \gets \frac{A^{\prime}_{ji}}{A^{\prime}_{i,i}}\)
        • for \(k = i \ldots n + 1\)
          • \(A^{\prime}_{jk} \gets A^{\prime}_{jk} - tA^{\prime}_{ik}\)
    • return \(A^{\prime}\)
  • Most common op: the multiplication inside the \(i\), \(j\), \(k\) loops
    • \(\sum_{i = 1}^{n - 1} \sum_{j = i + 1}^{n} \sum_{k = i}^{n + 1} 1 = \sum_{i = 1}^{n - 1} (n - i)(n - i + 2) = \frac{n(n-1)(2n+5)}{6}\)
    • Therefore \(\mathcal{O}(n^3)\)
  • Back Solving
    • Input: \(n \times (n + 1)\) augmented upper triangular matrix \(A\)
    • Output a solution vector \(\overrightarrow{x}\)
    • For \(i = n \ldots 1\)
      • \(t \gets A_{i n+1}\)
      • For \(j = i + 1 \ldots n\)
        • \(t \gets t - x_j A_{ij}\)
      • \(x_i \gets \frac{t}{A_{ii}}\)
    • return \(\overrightarrow{x}\)
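  • A C sketch of back solving over a flattened \(n \times (n+1)\) augmented matrix (0-indexed; column \(n\) holds \(\overrightarrow{b}\); interface illustrative):

    /* A must already be upper triangular; x receives the solution. */
    void back_solve(int n, const double *A, double *x) {
        for (int i = n - 1; i >= 0; i--) {
            double t = A[i * (n + 1) + n];       /* augmented entry b_i */
            for (int j = i + 1; j < n; j++)
                t -= x[j] * A[i * (n + 1) + j];  /* subtract known terms */
            x[i] = t / A[i * (n + 1) + i];       /* divide by the pivot */
        }
    }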
  • Gotchas:
    • pivot so you divide by larger numbers, or use a language with good numerics support
    • Some systems have infinitely many solutions: an indeterminate system
    • Some have no solution: a degenerate (inconsistent) system
  • After Gaussian Elimination:
    • if the last row is all zeros (including the augmented entry), infinitely many solutions
    • if the last row is all zeros except the augmented entry, no solutions

Lecture 7: Gaussian Elimination, LU Decomposition, Linear Programming and Coding Theory

LU Decomposition

  • Uses Gaussian elimination to factor \(A\) into two matrices, \(A = LU\)
    • \(L\) is lower triangular, with 1's on the diagonal
    • \(U\) is upper triangular
    • Works even when \(\overrightarrow{b}\) is not known in advance: factor once, then solve cheaply for each \(\overrightarrow{b}\)

Matrix Inverses

  • A Matrix \(A^{-1}\) is a matrix such that \(A \cdot A^{-1} = I\)
  • A matrix is invertible (non-singular) iff its determinant is non-zero.
    • \([A | I]\) and row ops to \([I | A^{-1}]\) (Gauss-Jordan)
  • The determinant is \(\mathrm{det}(A) = \sum_{\sigma \in S_n} \mathrm{sign}(\sigma)\prod_{i = 1}^{n} A_{i,\sigma(i)}\)
    • The Determinant of an Upper Triangular Matrix is equal to the product of its diagonal, \(\mathrm{det}(A) = \prod_{i = 1}^{n} a_{ii}\)
    • The determinant is invariant with regard to adding/subtracting a constant multiple of a row to another
    • The sign flips whenever rows are exchanged.
    • multiplying a row by a constant \(c\) multiplies the determinant by \(c\)
      • so divide by \(c\) afterwards to recover the original determinant

Linear Programming

  • Not the modern sense of programming
  • A mathematical model for linear optimization
  • there's an objective function: cost (minimize) / benefit (maximize)
  • constraints: linear inequalities/equations that cannot be violated
  • \(n\) variables, \(x_1, \ldots, x_n\)
  • Objective: output a solution vector \(\overrightarrow{x}\) that is feasible and optimizes the objective function
  • Standard Form:
    • maximize \(\overrightarrow{c}^{T} \cdot \overrightarrow{x}\)
    • Subject To \(A\overrightarrow{x} \leq \overrightarrow{b}\), \(x_i \geq 0\)
  • Ex:
    • Maximize \(4x + 3y\)
    • \(x \geq 0\), \(y \geq 0\), \(x \leq 9\), \(2y + x \leq 12\), \(2y - x \leq 8\)
  • For an LP in standard form, the optimal solution always lies at a vertex (corner point) of the feasible region
    • this is what the simplex method exploits, walking from vertex to vertex (loosely, a discrete "gradient descent")
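  • Working the example (arithmetic mine, not from the lecture): the feasible region's vertices are \((0,0)\), \((9,0)\), \((9,\frac{3}{2})\), \((2,5)\), and \((0,4)\); the objective \(4x + 3y\) evaluates to \(0\), \(36\), \(40.5\), \(23\), and \(12\) respectively, so the optimum \(40.5\) is attained at the vertex \((9, \frac{3}{2})\).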

Use of Linear Programs

  • Knapsack to ILP (Integer Linear Program)
    • Maximize the sum of the values of the items taken
    • Subject to the total weight constraint, \(W\), and an Integer Constraint: each item \(a_i\) can be taken, or not (0 or 1) \[ x_i = \left\{ \begin{matrix}0 & \mathrm{not\ taken}\\1 & \mathrm{taken}\end{matrix} \right. \]
    • Maximize \(\sum_{i = 1}^{n} x_iv_i\), subject to \(\sum_{i = 1}^{n}x_iw_i \leq W\) and \(x_i \in \{0, 1\} \forall x_i\)
    • Can do a relaxation, converting the ILP to an LP with \(0 \leq x_i \leq 1\), but the relaxed optimum may be fractional, which is not a valid 0-1 solution
  • In general, ILPs are NP-complete
  • Some problems, however, can be relaxed while preserving all optimal solutions (totally unimodular constraint matrices)
  • In general, the simplex method may be inefficient
    • though good in practice
    • there are, however, provably polynomial-time algorithms
      • Khachiyan's ellipsoid method (1979)
      • Interior point methods (Karmarkar, 1984)

Coding Theory

  • Claude Shannon
  • Compressing Data for efficient storage and/or transmission
    • lossless: tarballs (encodes all information)
    • lossy: jpegs, mpegs, mp3s, oggs (loses information)
  • Adding redundancies to make it more reliable
    • sending \(n + \log n\) bits, where \(n\) is the number of data bits and \(\log n\) is the redundancy
    • Reed-Solomon codes

Huffman Coding

  • Lossless compression scheme for text and text-like data
  • Let \(\Sigma\) be a fixed alphabet of characters appearing in a file, \(|\Sigma| = n\), (n distinct characters)
  • A block encoding scheme encodes every character with a binary codeword of equal length, ex: ASCII (256 chars, in codewords of 8 bits)
    • in general \(n\) characters requires \(\log n\) length codewords in a block encoding scheme
  • But some characters may have a higher frequency, and others a lower frequency, thus variable length codewords
    • high-freq gets short codewords
    • low-freq gets long codewords
    • \(\{0, 01, 101, 010\}\) is not prefix-free (0 is a prefix of 01 and 010)
    • variable-length codewords require a prefix-free code (no codeword appears as the prefix of another)
    • \(\{10, 010, 110, 0110\}\) is prefix-free
  • Goal: Encode a file using prefix-free variable-length encoding such that common characters get short codewords, less common get longer codewords

Lecture 8: Huffman Coding and AVL Trees

Huffman Coding Continued

  • Create \(n\) binary tree nodes with the characters.
  • Combine 2 trees at a time with the smallest frequencies
  • Do this until 1 Huffman tree exists
  • Characters only appear as leaves, thus the path from the root to any leaf is unique
  • high frequency leaves are shallow, low frequency leaves are deeper
  • Minimize average codeword length
  • Make sure to generate codewords from the root!
  • compression ratio is the original codeword length minus the average codeword length, over the original: \(\frac{\ell - \bar{\ell}}{\ell}\)
  • in general, English text files get 25-40% compression ratios
  • pack (a Unix utility) is the most common implementation
  • Algorithm:
    • Input: An alphabet of \(n\) symbols, \(\Sigma\), and a frequency mapping \(\mathrm{freq}: \Sigma \to [0, 1]\)
    • Output: A Huffman tree
    • \(H \gets \mathrm{minHeap}\)
    • For each \(x \in \Sigma\)
      • \(T_x \gets \mathrm{single\ node\ tree}\)
      • \(wt(T_x) \gets \mathrm{freq}(x)\)
      • \(\mathrm{insert}(H, T_x)\)
    • While \(\mathrm{size}(H) > 1\)
      • \(T_r \gets \mathrm{new\ tree\ node}\)
      • \(T_a \gets \mathrm{get}(H)\)
      • \(T_b \gets \mathrm{get}(H)\)
      • \(\mathrm{left}(T_r) \gets T_a\)
      • \(\mathrm{right}(T_r) \gets T_b\)
      • \(\mathrm{wt}(T_r) \gets \mathrm{wt}(T_a) + \mathrm{wt}(T_b)\)
      • \(\mathrm{insert}(H, T_r)\)
    • Return \(\mathrm{get}(H)\)
  • Analysis
    • The while loop executes \(n - 1\) times
    • Input: the alphabet and frequency function; input size \(n = |\Sigma|\)
    • heap insertion is performed \(2n - 1\) times, so the number of heap operations is \(\mathcal{O}(n)\)
    • counting the \(\mathcal{O}(\log n)\) comparisons within each heap operation gives \(\mathcal{O}(n \log n)\)
    • Or \(\widetilde{\mathcal{O}}(n)\)

BSTs

  • Every node has a unique key, \(k\)
  • Every node in a node \(u\)’s left subtree has a key value less than \(u\)’s key
  • Every node in a node \(u\)’s right subtree has a key value greater than \(u\)’s key
  • In general, the depth of a BST with \(n\) elements has an \(\Omega(\log n)\) lower bound and an \(\mathcal{O}(n)\) upper bound
  • Goal: Have a BST that is balanced: \(d \in \mathcal{O}(\log n)\)

AVL Trees

  • Accomplishes the goal
  • The depth of a node is the length of the unique path from the root to it; the height of a node is the length of the longest path from it down to a leaf
  • The balance factor of a node \(u\) is the difference of its subtrees' heights: \(\mathrm{bf}(u) = \mathrm{h}(u_l) - \mathrm{h}(u_r)\)
  • The height of an empty tree is always \(-1\); the height of a single-node tree is \(0\)
  • balance factors of \(-1\), \(0\), \(1\) indicate a balanced tree
  • positive balance factors indicate left skew
  • negative balance factors indicate right skew
  • left rotation pulls the tree to the left
  • right rotation pulls the tree to the right
  • Can have left and right rotations within subtrees, such as RL or LR rotations
  • Suppose \(r\) has \(T_1\) on its left and \(c\) on its right, where \(c\) has \(T_2\) on its left and \(T_3\) on its right, with \(\mathrm{bf}(r) = -2\) and \(\mathrm{bf}(c) = -1\)
    • perform a generalized left rotation: \(c\) becomes the new root, \(r\) becomes its left child with \(T_1\) and \(T_2\) as children, and \(T_3\) remains \(c\)'s right subtree
  • Suppose \(r\) has \(c\) on its left and \(T_3\) on its right, where \(c\) has \(T_1\) and \(T_2\) on its left and right respectively
    • Perform a generalized right rotation: \(c\) has \(T_1\) on its left and \(r\) on its right; \(r\) has \(T_2\) and \(T_3\) on its left and right respectively
  • (r (c T_1 (g T_2 T_3)) T_4)
    • (r (g (c T_1 T_2) T_3) T_4) -> (g (c T_1 T_2) (r T_3 T_4))
  • Swinging over lets you cut one off and put it on the other side of the tree
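  • A C sketch of the two single rotations (illustrative node type; height/balance-factor bookkeeping omitted):

    typedef struct node { int key; struct node *left, *right; } node;

    /* Right rotation at r (left-heavy case): c = r->left becomes the
       subtree root; c's right subtree T2 swings over to become r->left. */
    node *rotate_right(node *r) {
        node *c = r->left;
        r->left = c->right;   /* T2 swings over */
        c->right = r;
        return c;             /* new subtree root */
    }

    /* Left rotation at r (right-heavy case): mirror image. */
    node *rotate_left(node *r) {
        node *c = r->right;
        r->right = c->left;
        c->left = r;
        return c;
    }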

Lecture 9: AVL Trees, 2-3 Trees and Balancing

AVL Trees

  • Insert 8, 4, 7, 3, 2, 5, 1, 10, 6
    • (8)
    • (8 4)
    • (8 (4 nil 7)) -> LR rotation -> (7 4 8)
    • (7 (4 3) 8)
    • (7 (4 (3 2)) 8) -> (7 (3 2 4) 8)
    • (7 (3 2 (4 nil 5)) 8) -> (7 (4 (3 2) 5) 8) -> (4 (3 2) (7 5 8))
    • (4 (3 (2 1)) (7 5 8)) -> (4 (2 1 3) (7 5 8))
    • (4 (2 1 3) (7 5 (8 nil 10)))
    • (4 (2 1 3) (7 (5 nil 6) (8 nil 10)))
  • Observations:
    1. Search is \(\mathcal{O}(d)\)
      • \(d \in \mathcal{O}(\log n)\)
    2. Insertion is \(\mathcal{O}(d)\)
      • \(d \in \mathcal{O}(\log n)\)
    3. A single rotation is \(\mathcal{O}(1)\)
    4. Deletion is \(\mathcal{O}(d)\)
      • \(\mathcal{O}(d)\) to find replacement
      • May need to rotate all the way up to the root. (\(\mathcal{O}(\log n)\))
    5. Worst case \(\Theta(\log n)\); guaranteed to be \(\mathcal{O}(\log n)\) in all cases

B-Trees

  • Each node has at most \(m\) children
  • Every node except the root has at least \(\left\lceil\frac{m}{2}\right\rceil\) children
  • The root has at least 2 children unless it is a leaf
  • Non-leaf nodes with \(k\) children have \(k - 1\) keys
  • All leaves are at the same level and non-empty
  • commonly used in databases
  • Ex:
    • Each node has at most 3 children
    • Each node has at least 2 children

2-3 trees

  • This is a specialization of the B-tree where \(m = 3\)
  • 2 kinds of nodes:
    • 2-nodes: A single key \(k\) and 2 children, \(\ell, r\)
    • 3-nodes: 2 keys \(k_1, k_2\) such that \(k_1 < k_2\) and 3 children \(\ell, c, r\)
  • Ex: (10 (5 (3 4) 7) ((15 20) 12 17 (21 25)))
  • Search is simple, but varies based on node size
  • Insertion is always done at a leaf
    • sometimes requires changing node size and ordering
    • sometimes requires reordering of node keys
  • Ex: 8 4 7 3 2 5 1 10 6
    • (8)
    • ((4 8))
    • ((4 7 8)) -> (7 4 8)
    • (7 (3 4) 8)
    • (7 (2 3 4) 8) -> ((3 7) 2 4 8)
    • ((3 7) 2 (4 5) 8)
    • ((3 7) (1 2) (4 5) 8)
    • ((3 7) (1 2) (4 5) (8 10))
    • ((3 5 7) (1 2) 4 6 (8 10)) -> (5 (3 (1 2) 4) (7 6 (8 10)))
  • Analysis:
    • Min key count depth \(d\)
      • all be 2-nodes (1 key per node)
      • all nodes are present at all levels (always 2 children)
      • \(2^{d + 1} - 1\) nodes
      • \(2^{d + 1} - 1\) keys
    • Max key count for depth \(d\)
      • every node is a 3-node (2 keys per node, 3 children per node)
      • \(\frac{3^{d + 1} - 1}{2}\) nodes, \(3^{d + 1} - 1\) keys
    • From the min count: \(n \geq 2^{d+1} - 1\), so \(d \leq \log_2 (n + 1) - 1\)
    • From the max count: \(n \leq 3^{d+1} - 1\), so \(\log_3(n + 1) - 1 \leq d\)
    • Therefore \(d \in \Theta(\log n)\)
  • Don’t even think about deletion. It’s a pain.

Other Balanced BSTs

  • Splay Trees
  • Tries
  • Red-Black Trees (the implementation behind Java's TreeSet)
  • AA Trees
  • Scapegoat Trees
  • Treap

Heaps

  • A (min) heap is a binary tree data structure with the following properties
    • all nodes are present at all levels except possibly the last level
    • at the last level, it is full to the left
    • the key of any node is less than both of its children's keys
    • The minimum is therefore always at the root
    • The maximum is always at a leaf, somewhere in the bottom level(s); it is on the very last level only if that level is full
  • 2 efficient operations
    • remove min
    • insert
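  • A C sketch of both operations on a 1-indexed array heap (fixed capacity, no error handling; children of \(i\) live at \(2i\) and \(2i+1\)):

    #define CAP 1024
    int heap[CAP + 1];
    int sz = 0;

    void heap_insert(int key) {        /* O(log n): sift up */
        int i = ++sz;
        heap[i] = key;
        while (i > 1 && heap[i / 2] > heap[i]) {
            int t = heap[i]; heap[i] = heap[i / 2]; heap[i / 2] = t;
            i /= 2;
        }
    }

    int remove_min(void) {             /* O(log n): sift down; assumes sz > 0 */
        int min = heap[1];
        heap[1] = heap[sz--];          /* move last element to the root */
        int i = 1;
        while (2 * i <= sz) {
            int c = 2 * i;
            if (c + 1 <= sz && heap[c + 1] < heap[c])
                c++;                   /* pick the smaller child */
            if (heap[i] <= heap[c])
                break;
            int t = heap[i]; heap[i] = heap[c]; heap[c] = t;
            i = c;
        }
        return min;
    }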

Date: 2017-04-18 Tue 13:40

Author: Sam Flint

Created: 2017-07-20 Thu 16:58
