Through the Looking Glass

Matching DNS hosts

Posted on January 22, 2020

I originally wrote this on my dev blog, and moved it over here.

One of the things that I really don’t know nearly enough about is DNS, so I started a slow-burner project to write a local resolver that can filter domains. It’s inspired by adsuck and pi-hole and really just designed to force to me to learn a bunch of stuff. It’s part of a larger effort to slowly write replacements for a lot of system software just to learn them while simplifying my computing environments.

So, right now I’ve got the following working:

In this post, I wanted to write up my trie implementation and talk about some improvements I’d like to make to it.

The problem

The problem is basically this: in order to filter DNS domains, I have to have some way of specifying a bunch of domains to blacklist 1 that works in a performant way. A first (and pretty solid approach) would be to use a hash table. However, I’d like to be able to support wildcards too, which I don’t know how you’d do with a hash table. A wildcard could be thought of as a note that any prefix following the domain is valid, which means we need to be able to do prefix searches. Turns out there’s a data structure called a prefix tree, or a trie.

What’s a trie?

The name ‘trie’ comes from “retrieval”: they’re useful for retrieving strings from a set, and doing so in a way that lets us look up prefixes for strings.

For example, we might store ‘hi’ and ‘hello’ in a trie as

t := Trie()
t.Add("hi")
t.Add("hello")

                .     // root
        |
        h     // level 1
       / \
      i   e   // level 2
      |   |
      #   l   // level 3
          |
          l   // level 4
          |
          o   // level 5
          |
                  #   // level 6

Both ‘hi’ and ‘hello’ share a starting byte: 0x68, so we have one node at level 1. This node splits off into our two words. Traversing this node, we keep following children until we find a ‘#’ (or in a future version, a * or another trie). Why is it important to keep going until we see a ‘#’? Consider this naive approach that also lacks sanity checks:

func (t *Trie) Find(term []byte) bool {
        if len(term) == 0 {
                fmt.Println("found term")
                return true
        }

        if t.children[term[0]] {
            fmt.Println("searching ", string(term))
                return t.children[term[0]].Find(term[1:])
        }

    fmt.Println("not found")
        return false
}

Let’s try it on the trie above:

// We didn't 'hell' to the set, so it shouldn't be found.
t.Find([]byte("hell"))
searching  ell
searching  ll
searching  l
searching
found term

With that said, here’s my basic trie code:

package trie

type node struct {
    leaf     bool
    children map[byte]*node
}

func (n *node) add(k []byte) {
    if len(k) == 0 {
        n.leaf = true
        return
    }

    c := k[0]
    if n.children == nil {
        n.children = map[byte]*node{}
    }

    if n.children[c] == nil {
        n.children[c] = &node{}
    }

    n.children[c].add(k[1:])
}

func (n *node) find(b []byte) bool {
    if len(b) == 0 {
        return n.leaf
    }

    if len(n.children) == 0 {
        return false
    }

    cn := n.children[b[0]]
    if cn == nil {
        return false
    }

    return cn.find(b[1:])
}

type Trie struct {
    root *node
}

func (t *Trie) Add(v []byte) {
    if t.root == nil {
        t.root = &node{}
    }

    t.root.add(v)
}

func (t *Trie) AddString(v string) {
    t.Add([]byte(v))
}

func (t *Trie) Find(v []byte) bool {
    if t.root == nil {
        return false
    }

    return t.root.find(v)
}

func (t *Trie) FindString(v string) bool {
    return t.root.find([]byte(v))
}

So, for example in matching domain names:

// example from a blocklist I found on the internet
// TODO: put this in a config file
var domainList = []string{
    "facebook.com",
    "facebook.net",
    "fbcdn.com",
    "fbsbx.com",
    "fbcdn.com",
    "instagram.com",
    "google.com",
}

func NewBlacklist() *trie.Trie {
  blacklist := trie.Trie()
  for _, domain := range domainList {
      trie.Add(domain)
  }
  return trie
}

func AllowQuery(q dns.Question) bool {
        // dns.Question has a name []string section.
    return !trie.Find(strings.Join(q.Name, "."))
}

Next steps

In DNS, names are treated as labels and their own hierarchy; a DNS question represents a name as a series of strings like

{"org", "kernel", "edge", "mirrors"}

I was thinking of representing tries this way, but haven’t done so yet. I’m not sure it’s valuable to do it that way, except that it models the problem more closely and having wildcard labels is useful. One approach to this is to have the leaf field be a trie pointer instead of a bool. If this is nil, it’s not a leaf. Then I need to add a wildcard field to the Trie structure… which takes some thinking to think about how this changes the behaviour of the other methods. For example, can you specify “static.*.somecdn.net”? What happens when you call add on a wildcard?

The nice thing is that this isn’t open source, so I don’t have to worry about the needs of internet randos. Right now, I don’t need that behaviour; I do need wildcards.

An alternative approach is adding wildcard nodes with a ‘wildcard bool’ field, like the leaf field that’s there now. I don’t have any signaling that the wildcard field wouldn’t support add, but since this is for my own stuff, I’d feel okay making it panic there. Why? Because add is called at one point in the program: at startup, when the block list is being loaded, which is a good time to crash.

As a for-fun thing and not yet motivated by real-world performance measurements (because this doesn’t even work yet), I’d like to make a trie that doesn’t necessarily have to allocate new nodes for every deep node; I need to dig up some papers on them.

I’m going to have to think about it.

Reading

In the meantime, here’s some links to papers:

Some discussion

My friend Jonathan brought up a good point: unless you have a lot (maybe tens of thousands) of entries, the hash table is going to be more performant. I used the block list combinations from Ad Guard on my phone as a reference point; I’ve gone over the 50k rule limit there already; another list I found online has some 770,422 entries. I thought the hash table wouldn’t be so performant based on some of the stuff I saw while working on advent of code problems.

I wrote a quick comparing my implementation to a hash table-backed implementation. It uses the standard system dictionary 2; on this machine, it has 235,886 entries. The first column is the number of loops that were run, the second column is how long it took per operation, and the last two columns are the size and number of allocations per operation.


  BenchmarkTrieLoad-4 | 3	 | 460086285 ns/op | 99638976 B/op | 1993252 allocs/op
  BenchmarkTrieFind-4 | 566652	 |      2664 ns/op |        0 B/op |       0 allocs/op
  BenchmarkHashLoad-4 | 38	 |  27544054 ns/op |   563808 B/op |     246 allocs/op
  BenchmarkHashFind-4 | 2372006 |       501 ns/op |        0 B/op |       0 allocs/op
PASS

So it seems even with hundreds of thousands of entries, the hash table is more performant in terms of space and time. The thing I can’t yet reconcile is the question of how to implement wildcard matches. One thing I could do is use a hybrid approach, with a trie for wildcard rules and a hash table for plain entries. I’ll have to think about it. I also want to consider compressed tries, e.g. storing h -> {i, “ello”}, which would end up as h -> {i, “ell” -> {#, “o”}}. In this case, I’m fortunate in that my tries never delete entries. That lets me simplify things.


  1. I will say I thought about whitelisting domains on the premise that most of the internet is useless, but decided against it. For now, at least.

  2. e.g. /usr/share/dict/words