Expressing MapReduce Queries with CouchDB

Objective

The objective of this exercise is to learn to query data collections using the map-reduce programming model. For this purpose you will use CouchDB, a NoSQL document-oriented database in which data is stored and retrieved as JSON documents.

Requirements

CouchDB in a nutshell

CouchDB is a NoSQL database that completely embraces the web:

  • Data is stored as JSON documents.
  • Documents are created and accessed via HTTP (e.g., using curl or a browser).
  • Queries are expressed as JavaScript map-reduce functions.

The following instructions illustrate how to create and populate a database in CouchDB using data from the Deezer music catalogue.

Create and populate a database

  • Create the deezer database:
# Assuming CouchDB default address (http://localhost:5984)
curl -X PUT http://localhost:5984/deezer
  • Download Muse albums and similar artists:
curl -X GET http://api.deezer.com/artist/705/albums > MuseAlbums.json
curl -X GET http://api.deezer.com/artist/705/related > MuseRelatedArtists.json

# Verify the existence of the files
ls *.json
  • Populate the deezer database with the retrieved data:
curl -X PUT http://localhost:5984/deezer/muse_albums --upload-file "MuseAlbums.json"
curl -X PUT http://localhost:5984/deezer/muse_related_artists --upload-file "MuseRelatedArtists.json"
  • Verify the content of the database:
curl -v http://localhost:5984/deezer/muse_albums
curl -v http://localhost:5984/deezer/muse_related_artists
  • Access and inspect the deezer database in Futon (CouchDB web user interface):
http://127.0.0.1:5984/_utils/index.html

Querying the database

Queries are defined in Futon as temporary views composed of a map function and (optionally) a reduce function. For instance:

  • Retrieve the name and the web page of the groups that are similar to the rock band Muse.
// Map 
function(doc) {

    var artists = doc.data;

    if(doc._id == "muse_related_artists") {
        for(var i in artists) {
            emit(artists[i].name, artists[i].link);
        }
    } 
}
  • Compute the total number of albums produced by the rock band Muse. Use the reduce check button (see figure). If it does not appear, refresh the page.
// Map
function(doc) {

    var albums = doc.data;

    // Emit one row per album stored in the muse_albums document
    if(doc._id == "muse_albums") {
        for(var i in albums) {
            emit('muse_albums', 1);
        }
    }
}

// Reduce
function(keys, values) {
    return sum(values);
}
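
To check a map or reduce function before pasting it into Futon, it can help to run it locally. The following is only a minimal sketch of how a view is evaluated, not CouchDB's actual implementation: emit, sum and runView are simplified local stand-ins, and the sample documents are made up (they only assume a "data" array, as in the Deezer files).

```javascript
// Simplified local stand-ins for CouchDB's built-in emit() and sum().
var rows = [];
var currentId = null;

function emit(key, value) {
    rows.push({ id: currentId, key: key, value: value });
}

function sum(values) {
    return values.reduce(function (a, b) { return a + b; }, 0);
}

// Hypothetical helper (not part of CouchDB): applies map to every document,
// then optionally reduces all emitted rows in one call, which roughly
// corresponds to querying a view without grouping.
function runView(docs, map, reduce) {
    rows = [];
    docs.forEach(function (doc) {
        currentId = doc._id;
        map(doc);
    });
    if (!reduce) {
        return rows.map(function (r) { return { key: r.key, value: r.value }; });
    }
    var keys = rows.map(function (r) { return [r.key, r.id]; });
    var values = rows.map(function (r) { return r.value; });
    return reduce(keys, values, false);
}

// Made-up sample documents, for illustration only.
var sampleDocs = [
    { _id: "muse_albums", data: [{ title: "Album A" }, { title: "Album B" }, { title: "Album C" }] },
    { _id: "muse_related_artists", data: [{ name: "Band X", link: "http://example.org/x" }] }
];

// Map: name and web page of each related artist.
function mapRelated(doc) {
    if (doc._id == "muse_related_artists") {
        var artists = doc.data;
        for (var i in artists) {
            emit(artists[i].name, artists[i].link);
        }
    }
}

// Map: one row per album, all under the same key.
function mapAlbums(doc) {
    if (doc._id == "muse_albums") {
        for (var i in doc.data) {
            emit("muse_albums", 1);
        }
    }
}

// Reduce: add up the 1s emitted by the map.
function reduceCount(keys, values) {
    return sum(values);
}

console.log(runView(sampleDocs, mapRelated));             // one row: Band X and its link
console.log(runView(sampleDocs, mapAlbums, reduceCount)); // 3
```

The sketch ignores key ordering, grouping and re-reduce, which CouchDB handles on the server; it is only meant to make the data flow from emit to reduce visible.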

TODO

For this practical work you will use the Allocine Data Collection, which contains JSON documents with information about the films shown in Grenoble in 2011 (cf. allocine.fr). Each document contains the films shown in one cinema of Grenoble at that time (i.e., there is one file per cinema, 9 in total).

The following commands help you create and populate the allocine database:

# Create database allocine
curl -X PUT http://localhost:5984/allocine

# Populate database allocine
curl -T "allocineGrenoble1.txt" http://localhost:5984/allocine/allocineGrenoble1
curl -T "allocineGrenoble2.txt" http://localhost:5984/allocine/allocineGrenoble2
curl -T "allocineGrenoble3.txt" http://localhost:5984/allocine/allocineGrenoble3
curl -T "allocineGrenoble4.txt" http://localhost:5984/allocine/allocineGrenoble4
curl -T "allocineGrenoble5.txt" http://localhost:5984/allocine/allocineGrenoble5
curl -T "allocineGrenoble6.txt" http://localhost:5984/allocine/allocineGrenoble6
curl -T "allocineGrenoble7.txt" http://localhost:5984/allocine/allocineGrenoble7
curl -T "allocineGrenoble8.txt" http://localhost:5984/allocine/allocineGrenoble8
curl -T "allocineGrenoble9.txt" http://localhost:5984/allocine/allocineGrenoble9

Using the allocine database, try to answer the following questions. Do not forget to save your queries as: _design: answers, view name: qX (where X is the question number).

  1. Define a view in MapReduce that contains, for each theatre, the films presented in it. Hint: You do not need a reduce here.
  2. Modify your previous answer to filter out the theatres outside Grenoble (e.g., do not include the theatres in Saint Martin d’Hères).
  3. Give the number of films that each theatre is presenting. Hint: You need a reduce here.
  4. Give the list of films with a press rating higher than 4 stars. Note: remove duplicates.
  5. Give the list of films presented on 10.12.2011, and for each film, the theatre where it was presented and its schedule.
  6. BONUS! Give the list of films, and for every film, the list of theatres that present it (this question is a challenge but we encourage you to try to solve it).
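
Saving a query under the design document answers with view name qX stores it in a JSON document whose _id is _design/answers and whose views field maps each view name to its map (and optional reduce) source code. The sketch below shows that shape under stated assumptions: buildDesignDoc is a hypothetical helper, and the view q0 is a placeholder that is not an answer to any question.

```javascript
// Hypothetical helper: builds a design document from plain JavaScript
// functions by storing their source code as strings.
function buildDesignDoc(name, views) {
    var doc = { _id: "_design/" + name, language: "javascript", views: {} };
    Object.keys(views).forEach(function (viewName) {
        var view = views[viewName];
        doc.views[viewName] = { map: view.map.toString() };
        if (view.reduce) {
            doc.views[viewName].reduce = view.reduce.toString();
        }
    });
    return doc;
}

// Placeholder view (q0): lists document ids only.
var designDoc = buildDesignDoc("answers", {
    q0: {
        map: function (doc) { emit(doc._id, null); }
    }
});

// Print the JSON that would be stored in the database.
console.log(JSON.stringify(designDoc, null, 2));
```

Once such a document exists in the allocine database, a view can be read over HTTP at http://localhost:5984/allocine/_design/answers/_view/q0 (assuming the default address). Views saved through Futon end up in the same kind of document, so building it by hand is optional.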

Resources