MongoDB (www.mongodb.org) is a scalable, high-performance, document-oriented NoSQL database. The rmongodb package provides an interface from the statistical software R (www.r-project.org) to MongoDB and back using the mongodb-C library.
This vignette provides a first introduction to the rmongodb package and offers plenty of code to get started. If you need any help getting started with MongoDB itself, please check the resources provided by MongoDB: http://docs.mongodb.org/manual/tutorial/getting-started/
There is a stable CRAN version of rmongodb available:
install.packages("rmongodb")
You can also install the latest development version from the GitHub repository:
library(devtools)
install_github(repo = "mongosoup/rmongodb")
Installation should be straightforward. No local MongoDB installation is required, unless you install from source: in that case the RUnit tests need a local MongoDB installation.
After you've installed the rmongodb package you can load it just like any other package:
library(rmongodb)
First of all we have to create a connection to a MongoDB installation. If no parameters are provided, we connect to the MongoDB installation running on localhost. Parameters for external installations and user authentication are implemented and documented in the help page:
help("mongo.create")
mongo <- mongo.create()
mongo
## [1] 0
## attr(,"mongo")
## <pointer: 0x58df750>
## attr(,"class")
## [1] "mongo"
## attr(,"host")
## [1] "127.0.0.1"
## attr(,"name")
## [1] ""
## attr(,"username")
## [1] ""
## attr(,"password")
## [1] ""
## attr(,"db")
## [1] "admin"
## attr(,"timeout")
## [1] 0
mongo.is.connected(mongo)
## [1] TRUE
It's always a good idea to check if there is a working connection to your MongoDB to avoid errors.
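A connection to a remote or authenticated server might look like the sketch below. The helper name `connect_or_null` is hypothetical, and the argument names are assumed to match the attributes printed above (`host`, `username`, `password`, `db`); please check `help("mongo.create")` for your installed version.

```r
# Hypothetical helper: returns a live connection or NULL, never an unusable handle.
connect_or_null <- function(host = "127.0.0.1", username = "", password = "",
                            db = "admin") {
  # rmongodb may not be installed on every machine running this sketch
  if (!requireNamespace("rmongodb", quietly = TRUE)) return(NULL)
  m <- tryCatch(
    rmongodb::mongo.create(host = host, username = username,
                           password = password, db = db),
    error = function(e) NULL
  )
  if (is.null(m) || !rmongodb::mongo.is.connected(m)) return(NULL)
  m
}

mongo_checked <- connect_or_null()   # localhost defaults
if (is.null(mongo_checked)) message("No MongoDB connection available")
```

Returning `NULL` instead of a disconnected handle lets later code test a single condition before running any query.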
Get all databases of your MongoDB connection:
if(mongo.is.connected(mongo) == TRUE) {
  mongo.get.databases(mongo)
}
## [1] "grid" "rmongodb" "scorr" "test"
Get all collections in a specific database of your MongoDB (in this case, the “rmongodb” database):
if(mongo.is.connected(mongo) == TRUE) {
  db <- "rmongodb"
  mongo.get.database.collections(mongo, db)
}
## [1] "rmongodb.zips"
coll <- "rmongodb.zips"
We will use the 'zips' collection in the following examples. The 'zips' collection holds the MongoDB example data set called “Zip Code Data Set” (http://docs.mongodb.org/manual/tutorial/aggregation-zip-code-data-set/). This data set is available as JSON and contains zip code data from the US.
Using the command mongo.count, we can check how many documents are in the collection or in the result of a specific query. More information for all functions can be found in the help files.
if(mongo.is.connected(mongo) == TRUE) {
  help("mongo.count")
  mongo.count(mongo, coll)
}
## [1] 29470
In order to run queries it is important to know some details about the available data. First of all you can run the command mongo.find.one to get one document from your collection.
if(mongo.is.connected(mongo) == TRUE) {
  mongo.find.one(mongo, coll)
}
## _id : 7 545a7c2e08a841baf228f6cd
## city : 2 ACMAR
## loc : 4
## 0 : 1 -86.515570
## 1 : 1 33.584132
##
## pop : 1 6055.000000
## state : 2 AL
## orig_id : 2 35004
The command mongo.distinct provides a list of all values for a specific key.
if(mongo.is.connected(mongo) == TRUE) {
  res <- mongo.distinct(mongo, coll, "city")
  head(res, 2)
}
## [1] "ACMAR" "ADAMSVILLE"
Now we can run the first queries on our MongoDB. In this case we ask for one document providing zip code data for the city “COLORADO CITY”. Please be aware that the output of mongo.find.one is a BSON object, which cannot be used directly for further analysis in R. The command mongo.bson.to.list creates an R list object from the BSON object.
if(mongo.is.connected(mongo) == TRUE) {
  cityone <- mongo.find.one(mongo, coll, '{"city":"COLORADO CITY"}')
  print( cityone )
  mongo.bson.to.list(cityone)
}
## _id : 7 545a7c2e08a841baf228fa9b
## city : 2 COLORADO CITY
## loc : 4
## 0 : 1 -112.952427
## 1 : 1 36.976266
##
## pop : 1 3065.000000
## state : 2 AZ
## orig_id : 2 86021
## $`_id`
## { $oid : "545a7c2e08a841baf228fa9b" }
##
## $city
## [1] "COLORADO CITY"
##
## $loc
## [1] -112.95 36.98
##
## $pop
## [1] 3065
##
## $state
## [1] "AZ"
##
## $orig_id
## [1] "86021"
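Once the document is an R list, its fields can be used like any other list elements. The sketch below uses a stand-in list with the values copied from the output above (the `_id` field is left out), so it runs without a database; the variable names `city` and `label` are illustrative only.

```r
# Stand-in for the list returned by mongo.bson.to.list(cityone);
# values copied from the printed output, `_id` omitted for brevity.
city <- list(city = "COLORADO CITY", loc = c(-112.952427, 36.976266),
             pop = 3065, state = "AZ", orig_id = "86021")

city$pop       # population as a plain numeric value
city$loc[1]    # longitude, first element of the coordinate vector

# Combine several fields into one label
label <- sprintf("%s, %s: %d inhabitants",
                 city$city, city$state, as.integer(city$pop))
label
```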
The most convenient way to construct BSON objects is to use the mongo.bson.from.list function. This is very natural, because R lists are very similar to JSON objects: each level of a JSON object can be represented by a named list, and each array can be represented by an unnamed list:
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY'))
query
## city : 2 COLORADO CITY
query <- mongo.bson.from.list(list('city' = 'COLORADO CITY', 'loc' = list(-112.952427, 36.976266)))
query
## city : 2 COLORADO CITY
## loc : 4
## 0 : 1 -112.952427
## 1 : 1 36.976266
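The named versus unnamed distinction can be checked in plain R before anything is sent to MongoDB. The helper `json_kind` below is hypothetical and only illustrates the rule described above; it is not part of rmongodb.

```r
# Hypothetical helper: which JSON construct would this R value correspond to?
json_kind <- function(x) {
  if (!is.list(x)) return("value")
  if (is.null(names(x))) "array" else "object"
}

doc <- list(city = "COLORADO CITY", loc = list(-112.952427, 36.976266))

json_kind(doc)        # "object": a named list
json_kind(doc$loc)    # "array": an unnamed list
json_kind(doc$city)   # "value": a plain character string
```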
Internally, mongo.bson.from.list calls the functions mongo.bson.buffer.create, mongo.bson.buffer.append and mongo.bson.from.buffer. In most cases you do not need to know anything about these internals:
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")
## [1] TRUE
query <- mongo.bson.from.buffer(buf)
query
## city : 2 COLORADO CITY
Alternatively, the same BSON object can be created from JSON with one line of code:
mongo.bson.from.JSON('{"city":"COLORADO CITY", "loc":[-112.952427, 36.976266]}')
## city : 2 COLORADO CITY
## loc : 4
## 0 : 1 -112.952427
## 1 : 1 36.976266
mongo.bson.from.list automatically converts R primitive data types (integer, numeric, logical, character) into MongoDB data types. Date types need some extra work: to build BSON with ISODate data, pass the value as a POSIXct object:
date_string <- "2014-10-11 12:01:06"
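# Pay attention to the timezone argument. 'MSK' is not an Olson time zone name,
# so R may fall back to UTC (as in the output below); 'Europe/Moscow' is safer.
query <- mongo.bson.from.list(list(date = as.POSIXct(date_string, tz='MSK')))
# Note that MongoDB internally stores dates as milliseconds since the Unix epoch: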
query
## date : 9 1413028866000
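The number printed above can be reproduced in base R, which also shows why the time zone matters. This is a minimal sketch; the variable names are illustrative, and the `Europe/Moscow` line assumes that zone is available in the local time zone database.

```r
date_string <- "2014-10-11 12:01:06"

# Seconds since the Unix epoch, read as UTC, scaled to milliseconds
ms_utc <- as.numeric(as.POSIXct(date_string, tz = "UTC")) * 1000
format(ms_utc, scientific = FALSE)   # "1413028866000", the value printed above

# The same wall-clock time read in another zone is a different instant
ms_moscow <- as.numeric(as.POSIXct(date_string, tz = "Europe/Moscow")) * 1000
(ms_utc - ms_moscow) / 3600000       # offset in hours between the two readings

# Converting a stored millisecond value back to an R date-time
back <- as.POSIXct(ms_utc / 1000, origin = "1970-01-01", tz = "UTC")
format(back, "%Y-%m-%d %H:%M:%S")    # "2014-10-11 12:01:06"
```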
If a BSON object contains MongoDB-specific data types such as BSON_OID, BSON_REGEX, etc., you should construct it manually using mongo.bson.buffer.create, mongo.bson.buffer.append and mongo.bson.from.buffer.
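A manual construction with an ObjectId and a regular expression might look like the following sketch. It assumes that the typed append functions `mongo.bson.buffer.append.oid` and `mongo.bson.buffer.append.regex`, together with `mongo.oid.from.string` and `mongo.regex.create`, are available in your rmongodb version; please verify this in the help files. The wrapper name `build_special_query` is hypothetical.

```r
# Hypothetical wrapper; returns NULL when rmongodb is not installed.
build_special_query <- function() {
  if (!requireNamespace("rmongodb", quietly = TRUE)) return(NULL)
  buf <- rmongodb::mongo.bson.buffer.create()

  # ObjectId copied from the mongo.find.one output earlier in this vignette
  oid <- rmongodb::mongo.oid.from.string("545a7c2e08a841baf228fa9b")
  rmongodb::mongo.bson.buffer.append.oid(buf, "_id", oid)

  # Regular expression: city names starting with "COLORADO", no options
  rx <- rmongodb::mongo.regex.create("^COLORADO", options = "")
  rmongodb::mongo.bson.buffer.append.regex(buf, "city", rx)

  # Turn the filled buffer into a BSON object
  rmongodb::mongo.bson.from.buffer(buf)
}

special_query <- build_special_query()
```

No database connection is needed to build the object; it is only required once the query is passed to a function such as mongo.find.one.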
For real analyses it is important to get more than one document of data from MongoDB. As an example, we first use the command mongo.distinct to get an overview of the population distribution. Then we check for all cities with two or fewer inhabitants (errors in the data set?).
if(mongo.is.connected(mongo) == TRUE) {
  pop <- mongo.distinct(mongo, coll, "pop")
  hist(pop)
  boxplot(pop)
  nr <- mongo.count(mongo, coll, list('pop' = list('$lte' = 2)))
  print( nr )
  pops <- mongo.find.all(mongo, coll, list('pop' = list('$lte' = 2)))
  head(pops, 2)
}
## [1] 87
## [[1]]
## [[1]]$`_id`
## [1] "545a7c2e08a841baf228f85a"
##
## [[1]]$city
## [1] "ALLEN"
##
## [[1]]$loc
## [1] -87.67 31.62
##
## [[1]]$pop
## [1] 0
##
## [[1]]$state
## [1] "AL"
##
## [[1]]$orig_id
## [1] "36419"
##
##
## [[2]]
## [[2]]$`_id`
## [1] "545a7c2e08a841baf228f91c"
##
## [[2]]$city
## [1] "CHEVAK"
##
## [[2]]$loc
## [1] -164.78 61.58
##
## [[2]]$pop
## [1] 0
##
## [[2]]$state
## [1] "AK"
##
## [[2]]$orig_id
## [1] "99563"
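A list of documents like the one above is often easier to analyse as a data frame. The sketch below flattens two stand-in documents, with values copied from the output (`_id` omitted), using base R only; the column names `lon` and `lat` are an assumption about the order of the `loc` coordinates, based on the values shown.

```r
# Stand-ins for the first two documents returned by mongo.find.all
docs <- list(
  list(city = "ALLEN",  loc = c(-87.67, 31.62),
       pop = 0, state = "AL", orig_id = "36419"),
  list(city = "CHEVAK", loc = c(-164.78, 61.58),
       pop = 0, state = "AK", orig_id = "99563")
)

# One single-row data frame per document, splitting `loc` into two columns
rows <- lapply(docs, function(d) {
  data.frame(city = d$city, lon = d$loc[1], lat = d$loc[2],
             pop = d$pop, state = d$state, orig_id = d$orig_id,
             stringsAsFactors = FALSE)
})

# Stack the rows into one data frame
zips_df <- do.call(rbind, rows)
str(zips_df)
```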
The analysis gets more interesting when creating a more complex query with two conditions.
# build the query the R way, from a list, as recommended above:
if(mongo.is.connected(mongo) == TRUE) {
  pops1 <- mongo.find.all(mongo, coll, query = list('pop' = list('$lte' = 2), 'pop' = list('$gte' = 1)))
  head(pops1, 2)
}
## [[1]]
## [[1]]$`_id`
## [1] "545a7c2e08a841baf228f926"
##
## [[1]]$city
## [1] "CROOKED CREEK"
##
## [[1]]$loc
## [1] -158.00 61.82
##
## [[1]]$pop
## [1] 1
##
## [[1]]$state
## [1] "AK"
##
## [[1]]$orig_id
## [1] "99575"
##
##
## [[2]]
## [[2]]$`_id`
## [1] "545a7c2e08a841baf228fac2"
##
## [[2]]$city
## [1] "HUALAPAI"
##
## [[2]]$loc
## [1] -113.30 35.54
##
## [[2]]$pop
## [1] 2
##
## [[2]]$state
## [1] "AZ"
##
## [[2]]$orig_id
## [1] "86412"
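The query above repeats the key `pop`. How repeated keys are treated can depend on the driver and the server version, so a commonly used alternative is to put both operators under a single `pop` key. The sketch below only compares the two R lists and has not been run against the data set; the names `q_dup` and `q_range` are illustrative.

```r
# Query as written above: the key "pop" appears twice
q_dup <- list(pop = list("$lte" = 2), pop = list("$gte" = 1))
anyDuplicated(names(q_dup)) > 0   # TRUE: a key is repeated

# In R itself, `$` only reaches the first of the duplicated elements
names(q_dup$pop)                  # "$lte"

# Alternative: both range operators under one key
q_range <- list(pop = list("$gte" = 1, "$lte" = 2))
names(q_range$pop)                # "$gte" "$lte"

# It would be passed in the same way, for example:
# mongo.find.all(mongo, coll, query = q_range)
```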
Using the package jsonlite you can check and visualize your JSON syntax first. Afterwards we query MongoDB with this JSON query.
library(jsonlite)
json <- '{"pop":{"$lte":2}, "pop":{"$gte":1}}'
cat(prettify(json))
## {
## "pop": {
## "$lte": 2
## },
## "pop": {
## "$gte": 1
## }
## }
validate(json)
## [1] TRUE
if(mongo.is.connected(mongo) == TRUE) {
  pops1 <- mongo.find.all(mongo, coll, query = list('pop' = list('$lte' = 2), 'pop' = list('$gte' = 1)))
  pops2 <- mongo.find.all(mongo, coll, json)
  identical(pops1, pops2)
}
## [1] TRUE
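Validation is most useful when the JSON is wrong. The sketch below wraps jsonlite's validate in a hypothetical helper, `check_query`, that returns a plain TRUE or FALSE, or NA when jsonlite is not installed.

```r
# Hypothetical helper: TRUE/FALSE for valid/invalid JSON, NA without jsonlite
check_query <- function(json) {
  if (!requireNamespace("jsonlite", quietly = TRUE)) return(NA)
  # validate() attaches the parser message as an attribute; keep only the flag
  isTRUE(as.logical(jsonlite::validate(json)))
}

good <- '{"pop":{"$lte":2}}'
bad  <- '{"pop":{"$lte":2}'    # closing brace missing

check_query(good)
check_query(bad)
```

Running such a check before calling mongo.find.all keeps malformed query strings away from the database.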
Another interesting point is inserting data into MongoDB.
# insert data
a <- mongo.bson.from.JSON( '{"ident":"a", "name":"Markus", "age":33}' )
b <- mongo.bson.from.JSON( '{"ident":"b", "name":"MongoSoup", "age":1}' )
c <- mongo.bson.from.JSON( '{"ident":"c", "name":"UseR", "age":18}' )
if(mongo.is.connected(mongo) == TRUE) {
  icoll <- paste(db, "test", sep=".")
  mongo.insert.batch(mongo, icoll, list(a,b,c) )
  dbs <- mongo.get.database.collections(mongo, db)
  print(dbs)
  mongo.find.all(mongo, icoll)
}
## [1] "rmongodb.zips" "rmongodb.test"
## [[1]]
## [[1]]$`_id`
## [1] "545a7c2f08a841baf22969eb"
##
## [[1]]$ident
## [1] "a"
##
## [[1]]$name
## [1] "Markus"
##
## [[1]]$age
## [1] 33
##
##
## [[2]]
## [[2]]$`_id`
## [1] "545a7c2f08a841baf22969ec"
##
## [[2]]$ident
## [1] "b"
##
## [[2]]$name
## [1] "MongoSoup"
##
## [[2]]$age
## [1] 1
##
##
## [[3]]
## [[3]]$`_id`
## [1] "545a7c2f08a841baf22969ed"
##
## [[3]]$ident
## [1] "c"
##
## [[3]]$name
## [1] "UseR"
##
## [[3]]$age
## [1] 18
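Data to be inserted often starts out as a data frame rather than as JSON strings. The sketch below turns each row into a named list in base R; the commented lines at the end indicate how those lists could then be converted with mongo.bson.from.list and inserted, which is an untested assumption here. The data frame `people` mirrors the three documents above.

```r
people <- data.frame(ident = c("a", "b", "c"),
                     name  = c("Markus", "MongoSoup", "UseR"),
                     age   = c(33, 1, 18),
                     stringsAsFactors = FALSE)

# One named list per row: the shape mongo.bson.from.list expects
docs <- lapply(seq_len(nrow(people)), function(i) as.list(people[i, ]))
str(docs[[2]])

# With rmongodb loaded and a working connection (not run here):
# bsons <- lapply(docs, mongo.bson.from.list)
# mongo.insert.batch(mongo, icoll, bsons)
```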
You can also update your data in MongoDB from R and add indices for more efficient queries.
if(mongo.is.connected(mongo) == TRUE) {
  mongo.update(mongo, icoll, list('ident' = 'b'), list('$inc' = list('age' = 3)))
  res <- mongo.find.all(mongo, icoll)
  print(res)
  # Creating an index for the field 'ident'
  mongo.index.create(mongo, icoll, list('ident' = 1))
  # check mongoshell!
}
## [[1]]
## [[1]]$`_id`
## [1] "545a7c2f08a841baf22969eb"
##
## [[1]]$ident
## [1] "a"
##
## [[1]]$name
## [1] "Markus"
##
## [[1]]$age
## [1] 33
##
##
## [[2]]
## [[2]]$`_id`
## [1] "545a7c2f08a841baf22969ec"
##
## [[2]]$ident
## [1] "b"
##
## [[2]]$name
## [1] "MongoSoup"
##
## [[2]]$age
## [1] 4
##
##
## [[3]]
## [[3]]$`_id`
## [1] "545a7c2f08a841baf22969ed"
##
## [[3]]$ident
## [1] "c"
##
## [[3]]$name
## [1] "UseR"
##
## [[3]]$age
## [1] 18
## NULL
Of course there are also commands to drop databases and collections in MongoDB. After you have finished all your analyses, it's a good idea to destroy the connection to your MongoDB.
if(mongo.is.connected(mongo) == TRUE) {
  mongo.drop(mongo, icoll)
  #mongo.drop.database(mongo, db)
  res <- mongo.get.database.collections(mongo, db)
  print(res)
  # close connection
  mongo.destroy(mongo)
}
## [1] "rmongodb.zips"
## NULL
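To make sure a connection is closed even when a query fails, the open and close steps can be wrapped in a function that uses `on.exit`. The helper `with_resource` below is a hypothetical sketch in base R; it is demonstrated with stand-in functions so that it runs without a database.

```r
# Hypothetical helper: open a resource, run the work, always close it again
with_resource <- function(open_fn, close_fn, work) {
  con <- open_fn()
  on.exit(close_fn(con), add = TRUE)   # runs even if work() raises an error
  work(con)
}

# Stand-ins so the sketch runs without MongoDB
state <- new.env()
state$closed <- FALSE
fake_open  <- function() "connection"
fake_close <- function(con) state$closed <- TRUE

result <- tryCatch(
  with_resource(fake_open, fake_close, function(con) stop("query failed")),
  error = function(e) conditionMessage(e)
)
result         # "query failed"
state$closed   # TRUE: the close step ran despite the error

# With rmongodb this could become (not run here):
# with_resource(mongo.create, mongo.destroy,
#               function(m) mongo.count(m, "rmongodb.zips"))
```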
Please do not hesitate to contact us if there are any issues using rmongodb. Issues or pull requests can be submitted via GitHub: https://github.com/mongosoup/rmongodb