Sharding Data Collections with MongoDB

Objective

The objective of this exercise is to illustrate the concept of sharding (horizontal partitioning), a database technique for storing large data collections across multiple servers called “shards” (cf. image below). Sharding increases query performance since each server handles (small) parts of a data set.

Requirements

Creating and Populating a Database in MongoDB

In MongoDB a database is composed of a set of collections, each collection containing JSON documents. The following instructions guide you in creating a MongoDB database collection. The devil is in the details!

Start a MongoDB Instance:

mkdir -p ~/db/shard1 # Folder containing the DB files 
mongod --shardsvr --dbpath ~/db/shard1 --port 27021

Connect to the CouchDB:

mongo --host localhost:27021

Create the database mydb and the collection cities:

use mydb
db.createCollection("cities")

Verify the existence of mydb and cities:

show dbs
show collections

Populate the database (requires a new terminal !!):

# Assumes that cities.txt is in HOME
mongoimport --host localhost:27021 --db mydb --collection cities --file ~/cities.txt

Query the database:

db.cities.find().pretty()
db.cities.count()

Configuring a Sharded Cluster

MongoDB supports sharding through a sharded cluster composed of the following components (cf. image below):

Shards: store the data.
Query routers: direct operations from clients to the appropriate shard(s) and return results to clients.
Config servers: store cluster’s metadata. The query router uses this metadata to target operations to specific shards.

The following instructions illustrate how to create a simple sharded cluster with 1 config server, 1 query router and 1 shard.

Config server:

mkdir ~/db/configdb
mongod --configsvr --dbpath ~/db/configdb --port 27020

Query router (mongos instance) connected to the config server:

mongos --configdb localhost:27020 --port 27019

Connect to the query router :

mongo --host localhost:27019

Add MongoDB as a shard (the instance containing the mydb database):

use admin
db.runCommand( { addShard: "localhost:27021", name: "shard1" } )

Verify the cluster:

sh.status()

Sharding a Database

In MongoDB sharding is enabled on a per-basis collection. When enabled, mongo partitions and distributes data into the cluster based on a shard key (cf. sharding introduction). MongoDB uses two kinds of partitioning strategies (cf. images below):

Range based: data is partitioned into ranges having [min, max] values determined by the shard key.

Hash based: data is partitioned using a hash function.

Range based sharding example

Connect to the query router and create the collection cities1 in database mydb:

use mydb
db.createCollection("cities1")
show collections

Enable sharding on the collection mydb.cities1. Use state as shard key:

sh.enableSharding("mydb") 
sh.shardCollection("mydb.cities1", { "state": 1} )

Verify the cluster state:

sh.status()

Populate collection cities1 using (my)db.cities:

db.cities.find().forEach( 
    function(d) {
        db.cities1.insert(d); 
    }
)

Verify the cluster state:

sh.status()

Hash-based sharding example

Connect to the query router and create the collection cities2 in database mydb:

use mydb
db.createCollection("cities2")
show collections

Enable sharding on the collection mydb.cities2. Use state as shard key:

sh.enableSharding("mydb") 
sh.shardCollection("mydb.cities2", { "state": 1} )

Verify the cluster state:

sh.status()

Populate collection cities2 using (my)db.cities:

db.cities.find().forEach( 
    function(d) {
        db.cities2.insert(d); 
    }
)

Verify the cluster state:

sh.status()

Guiding the partitioning process using Tags

MongoDB also supports tagging a range of shard key values. These are some of the benefits of using tags:

Isolate a subset of the data on a specific shard.
Ensure that relevant data resides on shards that are geographically close to the user.

The next example illustrates tag-based partitioning on a cluster composed of 3-shards.

Start 2 more MongoDB instances (future shards):

# Shard 2: requires a new terminal!
mkdir -p ~/db/shard2 # Folders containing DB files 
mongod --shardsvr --dbpath ~/db/shard2 --port 27022 

# Shard 3: requires a new terminal!
mkdir -p ~/db/shard3 # Folders containing DB files 
mongod --shardsvr --dbpath ~/db/shard3 --port 27023

Add shards to the cluster:

use admin
db.runCommand( { addShard: "localhost:27022", name: "shard2" } )
db.runCommand( { addShard: "localhost:27023", name: "shard3" } ) 

sh.status()

Associate tags to shards:

sh.addShardTag("shard1", "CA") 
sh.addShardTag("shard2", "NY") 
sh.addShardTag("shard3", "Others")

Create, shard and populate a new collection named cities3:

db.createCollection("cities3") 
sh.shardCollection("mydb.cities3", { "state": 1} ) 

use mydb 
db.cities.find().forEach( 
    function(d) { 
       db.cities3.insert(d); 
    } 
)

Associate shard key ranges to tagged shards:

sh.addTagRange("mydb.cities3", { state: MinKey }, { state: "CA" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "CA" }, { state: "CA_" }, "CA") 
sh.addTagRange("mydb.cities3", { state: "CA_" }, { state: "NY" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "NY" }, { state: "NY_" }, "NY") 
sh.addTagRange("mydb.cities3", { state: "NY_" }, { state: MaxKey }, "Others")

Verify the cluster state:

sh.status()

Resources

MongoDB manual

Big Data Visualization

Tools and techniques for exploring data