In the last post, we took a look at the
basics of set theory. Now, I’d like to take a look at how to actually make use
of it in your code.
One of the issues with practically using the code in the last post is that the
initial subsets were defined arbitrarily and not derived from the superset. In
this post, all the examples are derived from the superset. We’ll use a couple
techniques for doing this illustrate some of the various ways to do it.
In Python, we’ll use an object-oriented approach, creating a few classes and
working on Book objects. In Clojure, we’ll use records. Though we’ll approach
language a little differently, I hope they still bring clarity to the subject.
Foundation: A Collection of Books
The first thing we need to do in a useful system is determine what we mean by
book. The last post represented each book as a string denoting the title; while
that worked for a brief introduction, in practise it gives us very limited
options for building subsets. What we need to do is identify more information,
called attributes or fields, that give us the information we need to build our
subsets.
Python
In Python, we’ll approach this using a class. I’ve saved them in library.py
in the Python example code
# used to validate the list of formats passed to a bookSUPPORTED_FORMATS=['epub','mobi','pdf']classBook:""" Represents a book, with title, author, and summary text fields. A book should be given a list of formats supported as a dictionary in the form {fmt: True}, and optionally a list of tags. """title=Noneauthor=Nonesummary=Noneformats=Nonetags=Nonedef__init__(self,title,author,summary,formats):"""Initalise a new book. The format shoud be a dictiontary in the form { 'epub': True } where each key is a format that we have the book in."""self.title=titleself.author=authorself.summary=summaryassert(notFalsein[fmtinSUPPORTED_FORMATSforfmtinformats])self.formats=formatsdef__str__(self):""" Return string representation of a book. """out="%s\n\tby %s\n\t%s\n\tformats: %s"out=out%(self.title,', '.join(self.author),self.summary,', '.join(self.formats))returnout
We’ll also want a BookCollection class to store a set of books and provide
some utility methods for dealing with the collection:
classBookCollection:"""Representation of a collection of books. Internally, they are stored as a set. It's main utility is in its helper methods that make accessing the books easier."""def__init__(self,books,book_filter=None):"""Instantiate a collection of books. It expects a collection of books, e.g. a list or set, and optionally takes a filter to only put some of the books into the collection."""ifbook_filter:self.books=set([bookforbookinbooksifbook_filter(book)])else:self.books=set(books)def__len__(self):returnlen(self.books)defshow_titles(self,description=None):"""Print a list of titles in the collection. If the description argument is supplied, it is printed first and all the books are printed with a preceding tab."""ifdescription:printdescriptionfmt='\t%s'else:fmt='%s'forbookinself.books:printfmt%(book.title,)defget_titles(self):"""Return a list of titles in the collection."""return[book.titleforbookinself.books]
These two classes are very short (and we’ll extend them later to make them
more useful) but provide a solid foundation to begin building on. You’ll want
to load the books in the class.
To load an example book, you would do use code similar to this:
12345678910
books=set([Book("Natural Language Processing with Python",['Steven Bird','Ewan Klein','Edward Loper'],'A highly accessible introduction to natural language processing.',['mobi',]),Book('Learning OpenCV',['Gary Bradski','Adrian Kaehler'],'Puts you in the middle of the rapidly expanding field of '+'computer vision.',['pdf',])])
Manually entering all these details is tedious. Fortunately for you, I put up
with the tedium to create a sample dataset in sample_library.py. You use the
function get_library() from the file to use it.
Clojure
In Clojure, we’ll use a record to define a book:
12345
;; define a book record(defrecordBook#^{:doc"Representation of a book. title is a string, authors a vector, summary is text, and formats is a vector."}[titleauthorssummaryformats])
We’re not using objects, so we don’t need a record to store a collection.
(If we wanted to validate formats, we could do it using a Ref and a
:validator argument - that’s left as an exercise for the reader). I have,
however, defined a few helper functions.
(defn in?"Check whether val is in coll."[collval](if (map? coll)(val coll)(not= -1(.indexOfcollval))));; format validation(def valid-format?"Check a record or object with a :formats key to ensure it fits the listof valid formats."#(or(in?(:formats%):epub)(in?(:formats%):mobi)(in?(:formats%):pdf)))(defn list-titles"Print a list of titles of a book."[books&description](let [titles(map :titlebooks)](if description(do(println description)(doseq [titletitles](println "\t"title)))(doseq [titletitles](println title)))))(defn get-titles"Get a list of titles of a book collection."[books](map :titlebooks))(defn book-str"Return a book as a string."[book](format"%s\n(by %s\n\t%s\n\tformats: %s\n"(str (:titlebook))(join ", "(:authorsbook))(str (:summarybook))(join ", "(map #'name(:formatbook)))))
Adding books is a simple affair:
123456789
(set [(Book."Natural Language Processing with Python"["Steven Bird""Ewan Klein""Edward Loper"]"A highly accessible introduction to natural language processing."[:mobi])(Book."Learning OpenCV"["Gary Bradski""Adrian Kaehler"](str "Puts you in the middle of the rapidly expanding field of ""computer vision")[:pdf])])
I’ve loaded a sample dataset into the sample_library.clj source file, available
from the Clojure example code.
Building Subsets
Now that we have a way to represent a book (with more useful information than
simply the title), we can start to build some subsets. Let’s start by
looking at set notation (aka how to write a set both mathematically and
in code), and then continue on to recreate the two subsets in the previous
article, epub and mobi.
which means the set of numbers that are members of (∈ means ‘element of’)
the set of positive integers and are less than 10. You might generalise this
as such:
given the universal set S, which defines all the elements under
consideration, and some predicate P which is a function that returns either
true if the element satisfies the predicate (and thus should be included
in the set):
{ x | x ∈ S, P(x) }
# a Python list comprehension isn't aware that once N is above 10, it should# terminate, so we cheat and create a list of integers from 1 to 100.# define NN=range(100)# build the setA=[xforxinNifx<10]
And in Clojure, we could use something similar:
123456
;;defineN(defN(iterateinc0))(defN#^{:doc "Representation of the set of positive integers."} N);;buildtheset(filter#(< % 10) N)
Building the Subsets
As mentioned earlier, I have already built sample datasets for both Python
and Clojure, so be sure to use those and save yourself from having to build
your own just yet!
Python
In Python, we can use the built-in filter function to build a list. It will
serve as our predicate function.
12345678910111213
importlibraryimportsample_library# my_library is our supersetMY_LIBRARY=sample_library.get_library()# build our filtersIS_EPUB=lambdabook:'epub'inbook.formatsIS_MOBI=lambdabook:'mobi'inbook.formats# build the subsetsEPUB=library.BookCollection(MY_LIBRARY.books,IS_EPUB)MOBI=library.BookCollection(MY_LIBRARY.books,IS_MOBI)
This gives me the output:
12345678910111213
In[7]:formats.EPUB.show_titles("books in epub format:")booksinepubformat:CodeCompleteTheJoyofClojureMiningtheSocialWebIn[8]:formats.MOBI.show_titles("books in mobi format:")booksinmobiformat:IntroductiontoInformationRetrievalCodeCompleteNaturalLanguageProcessingwithPythonIn[9]:
If you recall the definition of BookCollection, the filter method is
called as filter(predicate, collection). In the case of the mobi
subset, it filters out anything that fails the test
'mobi' in book.formats. We might write this as
{ book | book ∈ my_library, is_mobi(book) }
in set notation. I’ve predefined some filters in the file formats.py
which is again in the example code.
Clojure
Likewise, Clojure has a built-in filter function, in the form
(filter pred coll). We’ll use two
anonymous functions
to do our filtering:
12345678910111213
(use'using_set_theory.library)(use'using_set_theory.sample_library)(def epub?#(in?(:formats%):epub))(def mobi?#(in?(:formats%):mobi))(def my-library(get-library))(def epub(filter epub?my-library))(def mobi(filter mobi?my-library))(list-titlesepub"list of books in epub format:")(list-titlesmobi"list of books in mobi format:")
In the repl, this gives me:
12345678910111213
using_set_theory.core=>(list-titlesepub"list of books in epub format:")(list ofbooksinepubformat:)TheJoyofClojureMiningtheSocialWebCodeCompletenilusing_set_theory.core=>(list-titlesmobi"list of books in mobi format:")(list ofbooksinmobiformat:)IntroductiontoInformationRetrievalNaturalLanguageProcessingwithPythonCodeCompletenilusing_set_theory.core=>
I’ve put these filters in the filters.clj source file, along with definitions
for epub-books and mobi-books:
123456789101112131415161718192021222324
(nsusing_set_theory.filters(:use[using_set_theory.library][using_set_theory.sample_library]))(def epub?#^{:doc"Filter a collection of books by those supporting the epub format."}(fn [book](in?(:formatsbook):epub)))(def mobi?#^{:doc"Filter a collection of books by those supporting the mobi format."}(fn [book](in?(:formatsbook):mobi)))(defn- get-epub"Takes a collection of books and returns the list of books in epub format."[books](filter epub?books))(defn- get-mobi"Takes a collection of books and returns the list of book in mobi format."[books](filter mobi?books))(def epub-books(set (get-epub(get-library))))(def mobi-books(set (get-mobi(get-library))))
Parallels with SQL
This introduction of filters might remind you of SQL, and for good reason.
Edgar Codd designed SQL with
set theory in mind. You can think of tables as sets (provided, of course,
proper data preparation is done to ensure there are no duplicates in the
database), and operations like SELECT return subsets. For example, if we
were storing the books in a library, we would write something like
1
SELECT*FROMbooksWHEREhas_epub=TRUE;
Moving On
Now that we have a programmatic way to build subsets, we can automate the entire
set of sequences in the last post:
Python
12345678910
importformatsimportlibraryeither_format=set.union(formats.EPUB.books,formats.MOBI.books)either_format=library.BookCollection(either_format)either_format.show_titles("books in either format:")both_formats=set.intersection(formats.EPUB.books,formats.MOBI.books)both_formats=library.BookCollection(both_formats)both_formats.show_titles("books in both formats:")
which gives me the results:
123456789101112
In[31]:either_format.show_titles("books in either format:")Out[31]:booksineitherformat:CodeCompleteMiningtheSocialWebNaturalLanguageProcessingwithPythonIntroductiontoInformationRetrievalTheJoyofClojureIn[32]:both_formats.show_titles("books in both formats:")Out[32]:booksinbothformats:CodeComplete
Clojure
1234567891011
(require'clojure.set)(use'using_set_theory.filters)(use'using_set_theory.library)(def either-format(clojure.set/unionepub-booksmobi-books))(def both-formats(clojure.set/intersectionepub-booksmobi-books))(show-titleseither-format"books in either format:")(show-titlesboth-formats"books in both formats:")
In the Clojure REPL, I get the following output:
12345678910111213
using_set_theory.core=>(show-titleseither-format"books in either format:")(booksineitherformat:)IntroductiontoInformationRetrievalTheJoyofClojureNaturalLanguageProcessingwithPythonCodeCompleteMiningtheSocialWebnilusing_set_theory.core=>(show-titlesboth-formats"books in both formats:")(booksinbothformats:)CodeCompletenilusing_set_theory.core=>
Sets v. Lists
Remember that one of the key attributes of a set is that each member is distinct.
Let’s compare a set with a list; we’ll do this with an intersection.
Python
12345678910111213141516
importlibraryimportformatsfromsample_libraryimportget_libraryepub_list=[bookforbookinget_library().booksif'epub'inbook.formats]mobi_list=[bookforbookinget_library().booksif'mobi'inbook.formats]both_formats=[]both_formats.extend(list(epub_list))both_formats.extend(list(mobi_list))print'books in both formats:'forbookinboth_formats:print'\t%s'%book.title
The result:
1234567
books in both formats:
Mining the Social Web
Code Complete
The Joy of Clojure
Code Complete
Natural Language Processing with Python
Introduction to Information Retrieval
Clojure
In Clojure, we’ll use the vector type, which is like a list but the first
element isn’t evaluated:
clojure.core=> (doseq [title (union epub-list mobi-list)] (println title))
The Joy of Clojure
Code Complete
Mining the Social Web
Introduction to Information Retrieval
Natural Language Processing with Python
nil
clojure.core=>
So what?
You’ll notice “Code Complete” shows up twice in the list. The advantage of sets
here is that only unique items are returned. A union is actually the list of
elements in both sets, minus the list of items that are in both
sets.
In[30]:union(epub_list,mobi_list)Out[30]:['Mining the Social Web','The Joy of Clojure','Code Complete','Introduction to Information Retrieval','Natural Language Processing with Python']In[31]:
This has been just a quick introduction to the topic, but hopefully you
can see the relevance to areas like data mining. Coincidentally, datasets
tend to conform to the mathematical idea of sets, and typically with some
data massaging (i.e. to filter out duplicates), those that don’t can
be made more like mathemtical sets. Once appropriately represented in the
computer, they can be acted upon with the basic set operations.
I’ve created an additional example: a web service providing a rest API to
the book collection. As with the code in this post, there is an example in
Python
and in
Clojure. The
README in either example explains what dependencies are required. You can also
view the Bitbucket repo for the
Python example, or the GitHub repo
for the Clojure example.
Acknowledgements
Stephen Olsen reviewed many iterations of this
article and helped me to properly articulate the important points (like illustrating
that unions require the subtraction of the intersection). I originally wrote
the bulk of this article on the 25th, but it took me until the 28th to finish
writing the API example code, until the 31st to add in the additional union
explanation, and until the 1st to polish it up.