Profiling Queries
A built-in profiling tool lets you see how MongoDB works out which documents to return. This is useful because,
in many cases, a query can be easily improved simply by adding an index. If you have a complicated query, and
you’re not really sure why it’s running so slowly, then the query profiler can provide you with extremely valuable
information. Again, you’ll learn more about the MongoDB Profiler in Chapter 10.
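To give you a taste of what that looks like in practice, here is a minimal sketch that switches the profiler on through the Python driver (pymongo) and reads back its findings. The connection string, the blog database, and the posts collection are just placeholder names.

# A minimal sketch: turn the profiler on and inspect its output via pymongo.
# The connection string, database, and collection names are placeholders.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

# Profiling level 2 records every operation; level 1 records only slow ones.
db.command("profile", 2)

# Run a query you would like to inspect.
list(db.posts.find({"author": "peter"}))

# The profiler writes its findings to the system.profile collection.
for entry in db["system.profile"].find().sort("ts", -1).limit(1):
    print(entry.get("op"), entry.get("ns"), entry.get("millis"), "ms")

Setting the profiling level back to 0 turns the profiler off again once you’ve gathered what you need.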
Updating Information In-Place
When a database updates a row (or in the case of MongoDB, a document), it has a couple of choices about how to do
it. Many databases choose the multi-version concurrency control (MVCC) approach, which allows multiple users to see different versions of the data. This approach is useful because it ensures that each transaction sees a consistent view of the data, even if another program changes it partway through.
The downside to this approach is that the database needs to track multiple copies of the data. For example,
CouchDB provides very strong versioning, but this comes at the cost of writing out a complete new copy of the document on every update. While this ensures that the data is stored in a robust fashion, it also increases complexity and reduces performance.
MongoDB, on the other hand, updates information in-place. In contrast to CouchDB, it can modify the data wherever it happens to be stored, which typically means that no extra space needs to be allocated and the indexes can be left untouched.
Another benefit of this method is that MongoDB performs lazy writes. Reading from and writing to memory is very fast, but doing the same on disk is thousands of times slower, so you want to limit disk access as much as possible. This isn’t possible in CouchDB, because it ensures that each document is quickly written to disk. While this approach guarantees that the data is written safely to disk, it also impacts performance significantly.
MongoDB only writes to disk when it has to, which is usually once every second or so. This means that if a value is being updated many times a second (a not uncommon scenario if you’re using a value as a page counter or for live statistics), then the value will be written to disk only once, rather than the thousands of times that CouchDB would require.
This approach makes MongoDB much faster, but, again, it comes with a tradeoff. CouchDB may be slower, but it
does guarantee that data is stored safely on the disk. MongoDB makes no such guarantee, and this is why a traditional
RDBMS is probably a better solution for managing critical data such as billing or accounts receivable.
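Before moving on, here is a rough sketch of the page-counter scenario mentioned above, again using the Python driver; the counters collection and views field are hypothetical names chosen for this example.

# A sketch of an in-place page-counter update, using pymongo.
# The "counters" collection and "views" field are hypothetical names.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["blog"]

# $inc asks the server to bump the stored value itself; the client never has
# to read, modify, and rewrite the whole document, and the update is atomic
# per document.
db.counters.update_one(
    {"_id": "homepage"},
    {"$inc": {"views": 1}},
    upsert=True,  # create the counter document if it does not exist yet
)

print(db.counters.find_one({"_id": "homepage"}))

Because the increment happens on the server, many clients can bump the same counter at once without overwriting each other’s updates.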
Storing Binary Data
GridFS is MongoDB’s solution to storing binary data in the database. A single BSON document can hold up to 16MB of data, and this may well be enough for your needs. For example, if you want to store a profile picture or a sound clip, then 16MB is likely more space than you need. On the other hand, if you want to store movie clips, high-quality audio clips, or even files that are several hundred megabytes in size, then MongoDB has you covered here, too.
GridFS works by storing the information about the file (called metadata) in the files collection. The data itself is
broken down into pieces called chunks that are stored in the chunks collection. This approach makes storing data both
easy and scalable; it also makes range operations (such as retrieving specific parts of a file) much easier to use.
Generally speaking, you would use GridFS through your programming language’s MongoDB driver, so it’s
unlikely you’d ever have to get your hands dirty at such a low level. As with everything else in MongoDB, GridFS is
designed for both speed and scalability. This means you can be confident that MongoDB will be up to the task if you
want to work with large data files.
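To make that concrete, here is a minimal sketch of storing and retrieving a file with GridFS through the Python driver; the media database and the file name are only examples.

# A minimal GridFS sketch using pymongo; the names here are examples only.
import gridfs
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["media"]
fs = gridfs.GridFS(db)

# put() splits the data into chunks (stored in the fs.chunks collection) and
# records the metadata (in fs.files), returning the new file's _id.
with open("movie-clip.mp4", "rb") as f:
    file_id = fs.put(f, filename="movie-clip.mp4")

# get() looks up the metadata and reassembles the chunks on demand.
stored = fs.get(file_id)
print(stored.filename, stored.length, "bytes")

Removing a file is just as simple: fs.delete(file_id) cleans up both the metadata document and its chunks.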
Replicating Data
When we talked about the guiding principles behind MongoDB, we mentioned that relational databases offer certain
guarantees for data storage that are not available in MongoDB. These guarantees weren’t implemented for a handful
of reasons. First, these features would slow the database down. Second, they would greatly increase the complexity of
the program. Third, it was felt that the most common failure on a server would be hardware, which would render the
data unusable anyway, even if the data were safely saved to disk.