Streaming MongoDB Databases
As more companies move from on-prem to the cloud, developers and administrators will inevitably run into headaches getting their data transferred quickly. One such headache I ran into recently was uploading multiple large MongoDB databases into MongoDB Atlas.
There are currently three widely recommended methods for copying a MongoDB database up into Atlas. The first is MongoDB's Live Migration tool, which connects directly to your on-prem database and syncs its documents, along with any ongoing updates, to Atlas in preparation for a cutover.
The second is to use mongodump to create a copy of the database on disk, then use mongorestore to upload that copy to Atlas.
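For reference, a typical dump-and-restore looks something like the following. The host, database name, dump directory, and Atlas URI are placeholders you would replace with your own:

    # Dump the on-prem database to a local directory (placeholder host, db, and path)
    mongodump --host source-host:27017 --db appdata --out /data/dump

    # Restore the dump into Atlas (placeholder connection string)
    mongorestore --uri "mongodb+srv://user:password@cluster0.example.mongodb.net" /data/dump

The drawback is that the data is written to disk in full before any of it reaches Atlas, which is exactly the intermediate copy we wanted to avoid.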
The third method is mongomirror, a binary that operates much like Live Migration: it copies an on-prem database to an Atlas database and keeps the two in sync.
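A mongomirror run looks roughly like the command below; the replica set name, hosts, and credentials are placeholders, and you should check MongoDB's mongomirror documentation for the full option list:

    mongomirror --host "rs0/source-host:27017" \
        --destination "cluster0-shard-00-00.example.mongodb.net:27017" \
        --destinationUsername user \
        --destinationPassword password \
        --ssl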
My Scenario
In my scenario, we had to upload databases residing in a Docker MongoDB container holding slightly more than 1TB of data to Atlas. The container was running in a firewalled network, so Live Migration was not feasible. Mongomirror would not work either, since it requires the source database to be a replica set, and our container was a standalone instance. Finally, mongodump and mongorestore took in excess of three days to complete in our tests.
To speed up the transfer, we wanted to copy directly from one database to the other, as mongomirror does, without the intermediate on-disk copy that mongodump/mongorestore requires. So we created mongodb-stream-rs, a tool written in Rust that uploads MongoDB collections in parallel to a remote database.
During testing, we were able to transfer the entire 1TB database in under 24 hours. Since no updates were being written to documents in the source database, we also included a restart capability in mongodb-stream-rs (the --continue flag in the usage output below), so that the application can pick up where a previous upload finished. This way, incremental runs upload only the documents added to the source since the last run.
Our Solution
You can view our code here. The tool is written in Rust and leverages the tokio runtime to send multiple collections to the destination database at once; by default, mongodb-stream-rs uploads four collections in parallel. Uploads are transmitted in batches of 2000 docs by default, a size you can change with the --bulk flag, or you can pass --nobulk to upload one doc at a time.
If only a database name is passed to the app, the tool uploads every collection in that database; alternatively, you can upload a single collection with --collection. Example invocations follow the usage output below.
USAGE:
    mongodb-stream-rs [FLAGS] [OPTIONS] --db <MONGODB_DB> --destination_uri <STREAM_DEST> --source_uri <STREAM_SOURCE>

FLAGS:
    -c, --continue    Restart streaming at the newest document
    -h, --help        Prints help information
    -n, --nobulk      Do not upload docs in batches
        --validate    Validate docs in destination
    -V, --version     Prints version information

OPTIONS:
    -b, --bulk <STREAM_BULK>                 Bulk stream documents [env: STREAM_BULK=]
    -c, --collection <MONGODB_COLLECTION>    MongoDB Collection [env: MONGODB_COLLECTION=]
    -d, --db <MONGODB_DB>                    MongoDB Database [env: MONGODB_DB=]
        --destination_uri <STREAM_DEST>      Destination MongoDB URI [env: STREAM_DEST=]
        --source_uri <STREAM_SOURCE>         Source MongoDB URI [env: STREAM_SOURCE=]
    -t, --threads <STREAM_THREADS>           Concurrent collections to transfer [env: STREAM_THREADS=]
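To make the options above concrete, here are a few hypothetical invocations. The database name, collection name, and connection URIs are placeholders; only the flags come from the usage output above:

    # Stream every collection in the "appdata" database to Atlas
    mongodb-stream-rs --db appdata \
        --source_uri "mongodb://source-host:27017" \
        --destination_uri "mongodb+srv://user:password@cluster0.example.mongodb.net"

    # Stream only a single collection
    mongodb-stream-rs --db appdata --collection events \
        --source_uri "mongodb://source-host:27017" \
        --destination_uri "mongodb+srv://user:password@cluster0.example.mongodb.net"

    # Tune throughput: 8 collections in parallel, 5000 docs per batch
    mongodb-stream-rs --db appdata --threads 8 --bulk 5000 \
        --source_uri "mongodb://source-host:27017" \
        --destination_uri "mongodb+srv://user:password@cluster0.example.mongodb.net"

    # Re-run later and pick up where the previous upload finished
    mongodb-stream-rs --db appdata --continue \
        --source_uri "mongodb://source-host:27017" \
        --destination_uri "mongodb+srv://user:password@cluster0.example.mongodb.net"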