Sharding Data Collections with MongoDB

Objective

The objective of this exercise is to illustrate the concept of sharding (horizontal partitioning), a database technique for storing large data collections across multiple servers called “shards” (cf. image below). Sharding increases query performance since each server handles (small) parts of a data set.

Requirements

Creating and Populating a Database in MongoDB

In MongoDB a database is composed of a set of collections, each collection containing JSON documents. The following instructions guide you in creating a MongoDB database collection.  The devil is in the details!

  • Start a MongoDB Instance:
mkdir -p ~/db/shard1 # Folder containing the DB files 
mongod --shardsvr --dbpath ~/db/shard1 --port 27021
  • Connect to the CouchDB:
mongo --host localhost:27021
  • Create the database mydb and the collection cities:
use mydb
db.createCollection("cities")
  •  Verify the existence of mydb and cities:
show dbs
show collections
  • Populate the database (requires a new terminal !!):
# Assumes that cities.txt is in HOME
mongoimport --host localhost:27021 --db mydb --collection cities --file ~/cities.txt
  • Query the database:
db.cities.find().pretty()
db.cities.count()

Configuring a Sharded Cluster

MongoDB supports sharding through a sharded cluster composed of the following components (cf. image below):

  • Shards: store the data.
  • Query routers: direct operations from clients to the appropriate shard(s) and return results to clients.
  • Config servers: store cluster’s metadata. The query router uses this metadata to target operations to specific shards.

The following instructions illustrate how to create a simple sharded cluster with 1 config server, 1 query router and 1 shard.

  • Config server:
mkdir ~/db/configdb
mongod --configsvr --dbpath ~/db/configdb --port 27020
  • Query router (mongos instance) connected to the config server:
mongos --configdb localhost:27020 --port 27019
  • Connect to the query router :
mongo --host localhost:27019
  • Add MongoDB as a shard (the instance containing the mydb database):
use admin
db.runCommand( { addShard: "localhost:27021", name: "shard1" } )
  • Verify the cluster:
sh.status()

Sharding a Database

In MongoDB sharding is enabled on a per-basis collection. When enabled, mongo partitions and distributes data into the cluster based on a shard key (cf. sharding introduction). MongoDB uses two kinds of partitioning strategies (cf. images below):

  • Range based: data is partitioned into ranges having [min, max] values determined by the shard key.

  • Hash based: data is partitioned using a hash function.

Range based sharding example

  • Connect to the query router and create the collection cities1 in database mydb:
use mydb
db.createCollection("cities1")
show collections
  • Enable sharding on the collection mydb.cities1. Use state as shard key:
sh.enableSharding("mydb") 
sh.shardCollection("mydb.cities1", { "state": 1} )
  • Verify the cluster state:
sh.status()
  • Populate collection cities1 using (my)db.cities:
db.cities.find().forEach( 
    function(d) {
        db.cities1.insert(d); 
    }
)
  • Verify the cluster state:
sh.status()

Hash-based sharding example

  • Connect to the query router and create the collection cities2 in database mydb:
use mydb
db.createCollection("cities2")
show collections
  • Enable sharding on the collection mydb.cities2. Use state as shard key:
sh.enableSharding("mydb") 
sh.shardCollection("mydb.cities2", { "state": 1} )
  • Verify the cluster state:
sh.status()
  • Populate collection cities2 using (my)db.cities:
db.cities.find().forEach( 
    function(d) {
        db.cities2.insert(d); 
    }
)
  • Verify the cluster state:
sh.status()

Guiding the partitioning process using Tags

MongoDB also supports tagging a range of shard key values. These are some of the benefits of using tags:

  • Isolate a subset of the data on a specific shard.
  • Ensure that relevant data resides on shards that are geographically close to the user.

The next example illustrates tag-based partitioning on a cluster composed of 3-shards.

  • Start 2 more MongoDB instances (future shards):
# Shard 2: requires a new terminal!
mkdir -p ~/db/shard2 # Folders containing DB files 
mongod --shardsvr --dbpath ~/db/shard2 --port 27022 

# Shard 3: requires a new terminal!
mkdir -p ~/db/shard3 # Folders containing DB files 
mongod --shardsvr --dbpath ~/db/shard3 --port 27023
  • Add shards to the cluster:
use admin
db.runCommand( { addShard: "localhost:27022", name: "shard2" } )
db.runCommand( { addShard: "localhost:27023", name: "shard3" } ) 

sh.status()
  • Associate tags to shards:
sh.addShardTag("shard1", "CA") 
sh.addShardTag("shard2", "NY") 
sh.addShardTag("shard3", "Others")
  • Create, shard and populate a new collection named cities3:
db.createCollection("cities3") 
sh.shardCollection("mydb.cities3", { "state": 1} ) 

use mydb 
db.cities.find().forEach( 
    function(d) { 
       db.cities3.insert(d); 
    } 
)
  • Associate shard key ranges to tagged shards:
sh.addTagRange("mydb.cities3", { state: MinKey }, { state: "CA" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "CA" }, { state: "CA_" }, "CA") 
sh.addTagRange("mydb.cities3", { state: "CA_" }, { state: "NY" }, "Others") 
sh.addTagRange("mydb.cities3", { state: "NY" }, { state: "NY_" }, "NY") 
sh.addTagRange("mydb.cities3", { state: "NY_" }, { state: MaxKey }, "Others")
  • Verify the cluster state:
sh.status()

Resources