Monday, May 28, 2012

SICP 2.59 - 2.60: Sets as unordered lists

From SICP section 2.3.3  Example: Representing Sets

Section 2.3.3 shows us several ways to represent sets in Scheme, starting with unordered sets. To start, we define what a 'set' is by specifying the operations that are to be used on sets. These are:

  • element-of-set? - a predicate function that determines whether a given element is a member of a set.
  • adjoin-set - takes an object and a set as arguments and returns the set that contains the elements of the original set and the adjoined object.
  • intersection-set - computes the intersection of two sets, which is the set containing only elements that appear in both arguments.
  • union-set - computes the union of two sets, which is the set containing each element that appears in either argument.

We represent a set as an unordered list by making sure no element appears more than once. The text provides the following implementations for representing sets as unordered lists.

(define (element-of-set? x set)
  (cond ((null? set) false)
        ((equal? x (car set)) true)
        (else (element-of-set? x (cdr set)))))

(define (adjoin-set x set)
  (if (element-of-set? x set)
      set
      (cons x set)))

(define (intersection-set set1 set2)
  (cond ((or (null? set1) (null? set2)) '())
        ((element-of-set? (car set1) set2)        
         (cons (car set1)
               (intersection-set (cdr set1) set2)))
        (else (intersection-set (cdr set1) set2))))

We can define a few lists to test and see how these procedures work together.

> (define odds '(1 3 5 7))
> (define evens '(2 4 6 8))
> (define primes '(2 3 5 7))
> (element-of-set? 5 odds)
#t
> (element-of-set? 5 evens)
#f
> odds
(1 3 5 7)
> (adjoin-set 9 odds)
(9 1 3 5 7)
> (intersection-set odds evens)
()
> (intersection-set odds primes)
(3 5 7)
> (intersection-set evens primes)
(2)

Exercise 2.59 asks us to implement the union-set operation for the unordered-list representation of sets.

(define (union-set set1 set2)
  (cond ((null? set1) set2)
        ((null? set2) set1)
        ((element-of-set? (car set1) set2) (union-set (cdr set1) set2))
        (else (cons (car set1) (union-set (cdr set1) set2)))))

This implementation starts by checking to see if either set is null. If so, the other set is returned immediately. The procedure then checks to see if the first element of set1 is an element of set2. If so, that element is discarded from the first set, and the union of the remaining elements of set1 and set2 are returned. If the first element of set1 is not an element of set2, it is included in the list returned by the procedure.

> (union-set odds evens)
(1 3 5 7 2 4 6 8)
> (union-set odds primes)
(1 2 3 5 7)
> (union-set evens primes)
(4 6 8 2 3 5 7)


Exercise 2.60 asks us what would need to change in the implementations above if we redefine 'set' to allow duplicate elements. For example, the set {1, 2, 3} could be represented as the list (2 3 2 1 3 2 2). We need to compare the efficiency of each of the new procedures with their original (non-duplicate) implementations. The exercise also asks if there are applications where the new representation would be preferred over the original.

The implementation of element-of-set? doesn't need to change. It returns #t when it finds an element that matches the input element, otherwise it returns #f. The implementation of adjoin-set is simplified by the new definition. Since we no longer need to check to see if the input element already exists in the set, we can just cons the element to the existing set.

(define (adjoin-set x set)
   (cons x set))

Like adjoin-set, union-set is also simplified by allowing duplicate elements.

(define (union-set set1 set2) 
   (append set1 set2))

We have an interesting choice when it comes to intersection-set. The intersection of two sets under the original (non-duplicate) definition doesn't seem like it requires any change in implementation since duplicates are now allowed, not necessarily required. However, look what happens when you execute intersection-set with duplicate elements in the first set vs. the second set.

> (define primes '(2 3 5 7))
> (define twos '(2 2 2))
> (intersection-set primes twos)
(2)
> (intersection-set twos primes)
(2 2 2)

The result is different depending on which set has duplicate elements. This is because the original implementation checks each element of the first set independently to see if they're present in the second set. When the first set contains duplicates, so will the result. Since the instructions in the exercise are ambiguous (and being the typical lazy programmer that I am), I'm going to say that this implementation does not need to change.

Since element-of-set? and intersection-set haven't changed, neither will their performance. Since adjoin-set and union-set no longer need to check for duplicate elements, the performance of both of these procedures is improved. The number of steps required by adjoin-set has gone from $O(n)$ to $O(1)$. The number of steps required by union-set has gone from $O(n^2)$ to $O(n)$.

The penalty that we pay for these performance improvements is that sets now require more memory to accommodate duplicate elements. This representation would be preferred over the original in cases where we don't care about that added memory overhead, and where most of our operations are either adjoin-set or union-set.


 Related:

For links to all of the SICP lecture notes and exercises that I've done so far, see The SICP Challenge.

Sunday, May 6, 2012

Which Open Source License?

I was doing a little research on open source software licenses with a student when I came across this flow chart.


I like the balance it strikes between being funny and still being at least somewhat useful and informative.  Credit for the original flowchart goes to the creators Dan Bentley and Brian Fitzpatrick.  (Thanks to @jldeev for pointing me to the chart.)

For a short break down of the key differences between some of the more popular software licenses, see Jeff Atwood's post Pick a License, Any License on the Coding Horror blog.  The main point to take away from the article is that if you don't explicitly declare a license on software you publish, your code is copyrighted by default.  To use the code people have to contact you and ask permission.  So if you post a lot of code examples on your blog, take a few minutes to pick a license then add a notice to the footer of your site.