EDOOFUS

my commit log as a blog

Basic Set Theory

Recently, I was explaining to someone the basics of set theory and how the various basic operations translate to the real world. I used the example of the project I’m currently working on, which is a web front end to my ebook library. This is a very quick introduction aimed at people with a programming background but who don’t have a strong math background; the goal is to help you to learn to use them without having to delve deep into the math behind them.

Basic Properties of Sets

The first thing we have to do is to explain what is meant by a set -

definition: set
A set is any collection of items where each item is unique and the order of items in the collection is not important.

The uniqueness property is very important to sets: there are no duplicates in a set.

So what does a set look like? In my database, I have a list of all the books I have electronic copies of. Each book comes in at least one of three formats: PDF, epub, or mobi. We’ll call the superset (the universal set of all the items under consideration) the list of all the books in the library. We’ll call this set ‘L’ (for Library). Part of the set might look like:

1
2
3
4
5
L = { 'Natural Language Processing with Python', 'Learning OpenCV', 
      'Code Complete', 'Mastering Algorithms with C', 
      'The Joy of Clojure', 'Mining the Social Web', 
      'Algorithms In A Nutshell', 'Introduction to Information Retrieval', 
      ... }

We use '{}' to denote the members of a set. The order of books in the library doesn’t matter here, and it doesn’t make sense to have more than one entry for a book in the library.

Building a set in Python is very easy:

1
2
3
4
5
6
library = set(['Natural Language Processing with Python', 'Learning OpenCV',
               'Code Complete', 'Mastering Algorithms with C',
               'The Joy of Clojure', 'Mining the Social Web',
               'Algorithms In A Nutshell',
               'Introduction to Information Retrieval',
               'Network Security With OpenSSL', 'RADIUS'])

Clojure has set notation built in using the #{ } syntax, and any collection can be turned into a set with (set coll):

1
2
3
4
5
6
7
8
9
10
(def library #{ "Natural Language Processing with Python",
                "Learning OpenCV",
                "Code Complete",
                "Mastering Algorithms with C",
                "The Joy of Clojure",
                "Mining the Social Web",
                "Algorithms In A Nutshell",
                "Introduction to Information Retrieval",
                "Network Security With OpenSSL",
                "RADIUS"})

So now we need to build some subsets.

definition: subset
A subset is some part of a set.

definition: proper subset
A proper subset is some part of a set, but is not the whole set.

For example, we’ll create a subset of books P that are on or in Python. We’ll also create a subset of books E that are in the English language. For my library, because not all of my books are in or about Python, the number of members of P is smaller than the number of elements in L. However, all of my books are in English, so the number of elements in E is the same as the number of elements in L. Therefore P is a proper subset, while E is not.

The Basic Set Operations

Now let’s consider two proper subsets of the library to explain some of the basic set operations: M is the subset of ebooks that I have in mobi format, and we’ll redefine E to be the list of ebooks in epub format. For the sake of the rest of this article, let’s note the following:

1
2
3
M = { 'Natural Language Processing with Python', 'Code Complete',
      'Introduction to Information Retrieval' }
E = { 'The Joy of Clojure', 'Mining the Social Web', 'Code Complete' }

In practical terms, this means in my library I have copies of:

  • “Natural Language Processing with Python,” “Introduction to Information Retrieval,” and “Code Complete” in mobi format
  • “The Joy of Clojure,” “Mining the Social Web,” and “Code Complete” in epub format.

In Python:

1
2
3
mobi = set(['Natural Language Processing with Python', 'Code Complete',
           'Introduction to Information Retrieval'])
epub = set(['The Joy of Clojure', 'Mining the Social Web', 'Code Complete'])

In Clojure:

1
2
3
(def mobi #{"Natural Language Processing with Python", "Code Complete",
           "Introduction to Information Retrieval"})
(def epub #{"The Joy of Clojure", "Mining the Social Web", "Code Complete"})

Union

A union is the set of members that appear in either set - if it’s in at least one of the sets, it will appear in a union of the two sets. So we could define a subset of L that contains all the books I have in a mobile format, which for our purposes means copies exist in epub or mobi format. In Python, you can use the set.union method, and in Clojure you can use the functions in the clojure.set namespace.

In Python:

1
2
3
mobile = set.union(mobi, epub)
for book in mobile:
    print book

which yields the output:

1
2
3
4
5
Natural Language Processing in Python
Code Complete
Introduction to Information Retrieval
The Joy of Clojure
Mining the Social Web

Remember that one of the properties of sets is that order is irrelevant, so you might get the books in a different order (this applies to Clojure as well).

The same thing, in Clojure:

1
2
(def mobile (clojure.set/union mobi epub))
(doseq [book mobile] (println (str book)))

You would see a similar output to the Python example.

Again, the practical result of this is a set of all the books I have in my library in a mobile format.

Intersection

The intersection of two sets is a list of all the members that only appear in both sets. In the library example, taking the intersection of the mobi and epub sets gives me a set of my books that I have in both epub and mobi format. The intersection function gives me this result.

The Python example:

1
2
3
both_formats = set.intersection(mobi, epub)
for book in both_formats:
    print book

And in Clojure:

1
2
(def both-formats (clojure.set/intersection mobi epub))
(doseq [book both-formats] (println (str book)))

For either example, the output should be just one book, given the sample sets:

1
Code Complete

I could use this result to know which books I can use on any mobile device.

Difference

The difference of one set from another is a list of all the members in the first set that are not in the second set. This operation is a bit different from the first two; the first two operations are commutative, but the result of a difference is dependent on the order of the sets. I’ll illustrate this with some code examples:

In Python:

1
2
3
4
5
6
7
8
9
only_mobi = set.difference(mobi, epub)
print 'books only in mobi format:'
for book in only_mobi:
    print '\t' + book

only_epub = set.difference(epub, mobi)
print 'books only in epub format:'
for book in only_epub:
    print '\t' + book

In Clojure:

1
2
3
4
5
6
7
8
9
(println "books only in mobi format:")
(def only-mobi (clojure.set/difference mobi epub))
(doseq [book only-mobi]
    (println "\t" book))

(println "books only in epub format:")
(def only-epub (clojure.set/difference epub mobi))
(doseq [book only-epub]
    (println "\t" book))

As the output messages show, this gives us the set of books that are only in mobi and the set of books that are only in epub. The output should look something like:

1
2
3
4
5
6
books only in mobi format:
  Introduction to Information Retrieval
  Natural Language Processing with Python
books only in epub format:
  The Joy of Clojure
  Mining the Social Web

Complements

When discussing complements, we do so when considering a subset and it’s superset. The complement of a subset is the difference of subset from the superset; i.e., the set of all members in the superset that are not in the subset. For example, if I wanted to check my library for all ebooks I have that are not in mobi format, I would use the superset library and take the difference of mobi from library:

In Python:

1
2
3
4
not_mobi = set.difference(library, mobi)
print 'books not in mobi format, using the library superset:'
for book in not_mobi:
    print '\t' + book

and in Clojure:

1
2
3
4
(println 'books not in mobi format, using the library superset:')
(def not-mobi (clojure.set/difference library mobi))
(doseq [book not-mobi]
    (println "\t" book))

This gives us the output:

1
2
3
4
5
6
7
8
books not in mobi format, using the library superset:
   Mining the Social Web
   Algorithms In A Nutshell
   Mastering Algorithms with C
   RADIUS
   The Joy of Clojure
   Network Security With OpenSSL
   Learning OpenCV

Conclusion

This has been a very basic look at set theory and what it means in practise. There is a lot more to set theory (see the references) but this should help get you started. There are a lot of applications for set theory, such as in data mining and natural language processing; it is a powerful tool that is worth spending some time to get to know.

Stay tuned for the next post, which will be on how to use sets in your code. We’ll develop the library idea a bit more.

UPDATE: The next post is up!

References

Reviewers

I’d like to thank the following people for reviewing this:

Code Samples

The complete python source code, which you can save to a file and run directly:

(set_theory.py) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
#!/usr/bin/env python
# -*- coding: utf-8 -*-
#
# author: kyle isom <coder@kyleisom.net>
# date: 2012-01-23
# license: ISC / public domain (brokenlcd.net/license.txt)
#
"""
Python illustrations for blog article "Basic Set Theory"
    (see http://kisom.github.com/blog/2012/01/23/basic-set-theory/)

Note that this is slightly tweaked from the examples in the article:
    1. PEP8 dictates that all globals be in all caps; as all the variables
    in this illustration are globals, they have been modified to be all caps.
    2. There is a little extra output to explain what is going on; namely,
    tabs are added before printing books and there is an output line showing
    which example the book set is associated with.
"""

# variables are in all caps because they are globals, and PEP8 dictates
# that globals be in caps.

# the superset
LIBRARY = set(['Natural Language Processing with Python', 'Learning OpenCV',
               'Code Complete', 'Mastering Algorithms with C',
               'The Joy of Clojure', 'Mining the Social Web',
               'Algorithms In A Nutshell',
               'Introduction to Information Retrieval',
               'Network Security With OpenSSL', 'RADIUS'])

# the subsets
MOBI = set(['Natural Language Processing with Python', 'Code Complete',
           'Introduction to Information Retrieval'])
EPUB = set(['The Joy of Clojure', 'Mining the Social Web',
            'Code Complete'])


print '[+] list all the books in a mobile format (union example):'
MOBILE = set.union(MOBI, EPUB)
for book in MOBILE:
    print '\t' + book

print '[+] list all the books in both mobile formats (intersection example)'
BOTH_FORMATS = set.intersection(MOBI, EPUB)
for book in BOTH_FORMATS:
    print '\t' + book

ONLY_MOBI = set.difference(MOBI, EPUB)
print '[+] books only in mobi format:'
for book in ONLY_MOBI:
    print '\t' + book

ONLY_EPUB = set.difference(EPUB, MOBI)
print '[+] books only in epub format:'
for book in ONLY_EPUB:
    print '\t' + book

You’ll want to run this with python set_theory.py (or whatever you choose to name the file, obviously).

The complete Clojure source code, which you can likewise save and run:

(set-theory.clj) download
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
;; set-theory.clj
;; author: kyle isom <coder@kyleisom.net>
;; date: 2012-01-23
;; license: ISC / public domain (brokenlcd.net/license.txt)
;;
;; code examples for blog post "Basic Set Theory"
;;     http://kisom.github.com/blog/2012/01/23/basic-set-theory/
;;

(require 'clojure.set)

; the superset
(def library #{ "Natural Language Processing with Python",
                "Learning OpenCV",
                "Code Complete",
                "Mastering Algorithms with C",
                "The Joy of Clojure",
                "Mining the Social Web",
                "Algorithms In A Nutshell",
                "Introduction to Information Retrieval"
                "Network Security With OpenSSL",
                "RADIUS" })

; the subsets
(def mobi #{"Natural Language Processing with Python", "Code Complete",
           "Introduction to Information Retrieval"})
(def epub #{"The Joy of Clojure", "Mining the Social Web", "Code Complete"})

; union illustration
(println "union illustration (books in either mobile format)")
(def mobile (clojure.set/union mobi epub))
(doseq [book mobile]
    (println "\t" book))

; intersection illustration
(println "intersection illustration (books in both mobile formats)")
(def both-formats (clojure.set/intersection mobi epub))
(doseq [book both-formats]
    (println "\t" book))


(println "books only in mobi format:")
(def only-mobi (clojure.set/difference mobi epub))
(doseq [book only-mobi]
    (println "\t" book))

(println "books only in epub format:")
(def only-epub (clojure.set/difference epub mobi))
(doseq [book only-epub]
    (println "\t" book))

; complement illustration
(println "books not in mobi format, using the library superset:")
(def not-mobi (clojure.set/difference library mobi))
(doseq [book not-mobi]
    (println "\t" book))

You’ll want to run this with clj set-theory.py - I’ve deliberately chosen not to make this a lein project in order to make it easier to share, but I did upload a lein project. You should be able to just run lein deps, test, run.