MongoDB Journalling

A quick note if you are wondering about MongoDB and journalling (especially on Debian/Ubuntu from the MongoDB repo).

By default MongoDB will preallocate up to 3GB of journal files in /var/lib/mongodb/journal, which is far from ideal if you are running on a Rackspace CloudServer or another small VPS.
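You can check how much space the journal is actually taking with a quick du (the journal path on the Debian/Ubuntu packages is as above):

# how much space is the journal using right now?
du -sh /var/lib/mongodb/journal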

There are two easy solutions:

  • If you are running as part of a replica set then you can choose to go without journalling altogether with the
    nojournal=true

    option in mongodb.conf.

  • Alternatively you can ask MongoDB to use smaller preallocated files for journalling with the
    smallfiles=true

    option in mongodb.conf.

To also trim some usage in your datafiles there is also the

noprealloc = true

option, however I don’t see this as particularly useful considering the preallocation starts pretty small and only grows as your data does. A combined example config is sketched below.
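For reference, a minimal sketch of how these options sit in the ini-style /etc/mongodb.conf shipped with the 2.0-era Debian/Ubuntu packages (enable only what applies to your setup):

# /etc/mongodb.conf (2.0-era ini-style format)
# Skip journalling entirely -- only sensible on replica-set members:
nojournal = true
# ...or keep journalling but with smaller preallocated files:
smallfiles = true
# Optionally skip data-file preallocation as well:
noprealloc = true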

Mongo Benchmarking #1

These aren’t perfect benchmarks – far from it in fact – but I just wanted to get a rough idea of the relative cost of fsync and safe writes over normal unsafe writes…

Time taken for 10,000 inserts (no indexes, journalling enabled, MongoDB 2.0.1 on Debian, Rackspace CloudServer 256MB):

default      - time taken: 0.2401921749115 seconds
fsync        - time taken: 358.55523014069 seconds
safe=true    - time taken: 1.1818060874939 seconds

Edit: and for journalling disabled and smallfiles=true in mongodb.conf:

default      - time taken: 0.15036606788635 seconds 
fsync        - time taken: 34.175970077515 seconds 
safe=true    - time taken: 1.0593159198761 seconds

The results aren’t perfect, but they do show how big the difference is…

Source:

<?php
// Benchmark: 10,000 inserts per write mode, using the legacy PECL "mongo" driver.
$mongo = new Mongo();
$db = $mongo->selectDB('bench');
$collection = new MongoCollection($db, 'bench');

// Default: unacknowledged "fire and forget" writes.
$start = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
  $collection->insert(array('data' => sha1(rand())));
}
$end = microtime(TRUE);
echo 'default      - time taken: '.($end - $start)." seconds \n";

// fsync: force a flush to disk for every insert.
$start = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
  $collection->insert(array('data' => sha1(rand())), array('fsync' => true));
}
$end = microtime(TRUE);
echo 'fsync        - time taken: '.($end - $start)." seconds \n";

// safe: wait for the server to acknowledge each write.
$start = microtime(TRUE);
for ($i = 0; $i < 10000; $i++) {
  $collection->insert(array('data' => sha1(rand())), array('safe' => true));
}
$end = microtime(TRUE);
echo 'safe=true    - time taken: '.($end - $start)." seconds \n";
?>

I’m not sure the existing number of records makes a massive amount of difference, other than through the preallocation of files, which we have little control over anyway. The times don’t seem to increase between runs even once there are a lot of entries… (perhaps we’d see more with indexes enabled).
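If you wanted to test that hunch, the legacy driver makes it a one-line change before the benchmark loops (a hypothetical tweak; the runs below do not include it):

// Hypothetical: index the benchmarked field so every insert also has to
// update a secondary B-tree (legacy PECL mongo driver call):
$collection->ensureIndex(array('data' => 1));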

Each run below adds an extra 20,000 entries to the collection (just the default and safe tests this time) with little perceptible slowdown.

root@test:/var/www/test# php bench1.php
default      - time taken: 0.53534507751465 seconds
safe=true    - time taken: 1.2793118953705 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.203537940979 seconds
safe=true    - time taken: 1.2887620925903 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.22933197021484 seconds
safe=true    - time taken: 1.6565799713135 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.19606184959412 seconds
safe=true    - time taken: 1.5315411090851 seconds
root@test:/var/www/test# php bench1.php
default      - time taken: 0.2510199546814 seconds
safe=true    - time taken: 1.2419080734253 seconds

Testing on a cloud server is hard, as you are at the mercy of other users impacting the available bandwidth and processor utilisation, but you can at least see trends. I hope this has been enlightening, and I hope to expand on it in future…

Edit: for the one person who asked me about storage efficiency… here goes…

 > db.bench.stats()
{
    "ns" : "bench.bench",
    "count" : 140001,
    "size" : 10640108,
    "avgObjSize" : 76.00022856979594,
    "storageSize" : 21250048,
    "numExtents" : 7,
    "nindexes" : 1,
    "lastExtentSize" : 10067968,
    "paddingFactor" : 1,
    "flags" : 1,
    "totalIndexSize" : 4562208,
    "indexSizes" : {
        "_id_" : 4562208
    },
    "ok" : 1
}

So based on this we can work out that size/storageSize ≈ 50% efficiency, i.e. MongoDB on this dataset is using about the same again on top of the raw data for storage.

If we add in indexes, size/(storageSize+totalIndexSize), then the result is only about 41% efficient. Personally I think this is a reasonable tradeoff for the raw speed it gives… The quick calculation below makes the arithmetic explicit.
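To check the numbers, a quick sketch plugging in the figures from the stats output above:

<?php
// Storage-efficiency arithmetic from db.bench.stats() above.
$size           = 10640108; // bytes of actual BSON documents
$storageSize    = 21250048; // bytes allocated to data extents
$totalIndexSize = 4562208;  // bytes used by the _id_ index

printf("data only:    %.0f%%\n", 100 * $size / $storageSize);                     // ~50%
printf("with indexes: %.0f%%\n", 100 * $size / ($storageSize + $totalIndexSize)); // ~41%
?>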