MongoDB Alerts & Rules

ScaleGrid's alerts and rules help you stay on top of your cluster activity. Every database type has its own set of alert rules that you can configure to suit your needs. Alerts can be sent directly to you via SMS, email, PagerDuty, Opsgenie or Slack. While alert thresholds can be configured, the alerts themselves can't be turned off completely. If you modify the alerts, the new settings apply to all users within your account.

MongoDB Global Alert Rules

These are the default alert rules automated by ScaleGrid for all MongoDB clusters:

  • Current Connections greater than 3000
  • CPU - Total (%) greater than 60%
  • Available Disk Space (%) less than 20%
  • Replica - Window Size (Hours) less than 12 Hours
  • Replica - Lag (sec) greater than 60 sec
  • Ticket Reads less than 50
  • Ticket Writes less than 50
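
To make the defaults above easier to reason about, here is a minimal sketch that encodes them as data and checks a set of sampled values against them. The `Rule` class, the `breached` helper and the sample values are hypothetical and exist only for illustration; they are not part of any ScaleGrid API.

```python
from dataclasses import dataclass
import operator


@dataclass(frozen=True)
class Rule:
    metric: str       # display name as it appears in the list above
    comparison: str   # "greater" or "less"
    threshold: float


# Default thresholds copied from the list of global rules above.
GLOBAL_RULES = [
    Rule("Current Connections", "greater", 3000),
    Rule("CPU - Total (%)", "greater", 60),
    Rule("Available Disk Space (%)", "less", 20),
    Rule("Replica - Window Size (Hours)", "less", 12),
    Rule("Replica - Lag (sec)", "greater", 60),
    Rule("Ticket Reads", "less", 50),
    Rule("Ticket Writes", "less", 50),
]

_COMPARE = {"greater": operator.gt, "less": operator.lt}


def breached(rules, sample):
    """Return the rules whose condition holds for the sampled values.

    Metrics missing from `sample` are skipped rather than treated as breaches.
    """
    return [
        rule
        for rule in rules
        if rule.metric in sample
        and _COMPARE[rule.comparison](sample[rule.metric], rule.threshold)
    ]


# Hypothetical sample: only the CPU value crosses its default threshold.
sample = {
    "CPU - Total (%)": 72.5,
    "Available Disk Space (%)": 35,
    "Ticket Reads": 128,
}
for rule in breached(GLOBAL_RULES, sample):
    print(rule.metric)
```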

📘

Override Global Rules

You can override a Global Alert Rule by creating a new Cluster-Level Alert Rule or an Account-Level Alert Rule. For example, our CPU usage rule is set to 60% by default, but if you create a new Cluster-Level Alert and set it to 80%, this will override the Global Alert Rules for this cluster.

Account Level Rules

How To Create: MongoDB Account-Level Rules

These rules are enforced by you and apply to all clusters in your ScaleGrid account based on database type. You can configure account level rules under:

Settings > Global Rules > Alert Rules

🚧

Account-Level rules override Global Rules and are overridden by Cluster-Level rules.

Cluster Level Rules

How To Create: MongoDB Cluster-Level Rules

These rules are enforced by you and apply to specific clusters in your ScaleGrid account. You can configure cluster level rules on the details page:

Details Page > Alerts > Rules > Create New Rule

🚧

Cluster-Level rules apply only to the specific clusters they are created on. They override both Global Rules and Account-Level rules.
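
The precedence between the three rule levels can be sketched as a simple lookup: the most specific level that defines a rule for a metric wins. The function name and the dictionary layout below are assumptions made for this example, not ScaleGrid's implementation. The thresholds mirror the CPU example from the "Override Global Rules" note.

```python
def effective_threshold(metric, cluster_rules, account_rules, global_rules):
    """Return (level, threshold) from the most specific level defining `metric`.

    Lookup order follows the documented precedence:
    Cluster-Level, then Account-Level, then Global.
    """
    levels = (
        ("cluster", cluster_rules),
        ("account", account_rules),
        ("global", global_rules),
    )
    for level, rules in levels:
        if metric in rules:
            return level, rules[metric]
    return None, None


# Hypothetical rule sets keyed by the metric's API enum.
global_rules = {"CPU_TOTAL": 60}
account_rules = {}
cluster_rules = {"CPU_TOTAL": 80}

print(effective_threshold("CPU_TOTAL", cluster_rules, account_rules, global_rules))
# prints ('cluster', 80)
```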

MongoDB Alert Types

ScaleGrid supports three top-level alert types:

  • Metrics
  • Disk Free (only for MongoDB clusters)
  • Server Role Change

If you create an alert rule based on a specific metric, you can set the following conditions:

  • Threshold
  • Whether to trigger when the current value is greater than or less than the threshold
  • How long the condition must last: 2 minutes, 6 minutes or 1 hour
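
The three conditions combine into a single check: the metric has to stay on the wrong side of the threshold for the whole configured duration. The sketch below shows one way to express that. It assumes samples arrive as `(timestamp, value)` pairs sorted oldest first; the function name and the sampling interval are hypothetical, and ScaleGrid's actual evaluation logic may differ.

```python
from datetime import datetime, timedelta


def should_trigger(samples, threshold, comparison, duration):
    """Return True if the most recent run of breaching samples spans `duration`.

    samples: list of (timestamp, value) tuples, sorted oldest first.
    comparison: "greater" or "less", relative to the threshold.
    """
    if comparison not in ("greater", "less"):
        raise ValueError("comparison must be 'greater' or 'less'")
    if not samples:
        return False

    newest_ts = samples[-1][0]
    run_start = None
    # Walk backwards from the newest sample until the condition stops holding.
    for ts, value in reversed(samples):
        breach = value > threshold if comparison == "greater" else value < threshold
        if not breach:
            break
        run_start = ts

    return run_start is not None and newest_ts - run_start >= duration


# Hypothetical CPU readings taken one minute apart.
start = datetime(2024, 1, 1, 12, 0)
samples = [(start + timedelta(minutes=i), v) for i, v in enumerate([55, 65, 70, 68])]

print(should_trigger(samples, 60, "greater", timedelta(minutes=2)))  # True
print(should_trigger(samples, 60, "greater", timedelta(minutes=6)))  # False
```

A longer duration reduces noise from short spikes at the cost of a later alert.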

MongoDB Metrics

Here is a list of MongoDB metrics that you can use to create alerts:


CPU - User (%)
API Enum: CPU_USER
The percentage of time the CPU spent on user applications.


CPU - System (%)
API Enum: CPU_SYSTEM
The percentage of time the CPU spent in the operating system.


CPU - Nice (%)
API Enum: CPU_NICE
The percentage of time the CPU spent running low-priority (niced) user processes.


CPU - IO Wait (%)
API Enum: CPU_IOWAIT
The percentage of time the CPU spent waiting for IO operations to complete.


CPU - Total (%)
API Enum: CPU_TOTAL
The total percentage of time the CPU spent on user applications, the operating system, nice mode, and IO wait.


Memory - Total (KB)
API Enum: MEM_TOTAL
Total available system memory in KB.


Memory - Used (KB)
API Enum: MEM_USED
KB of system memory used.


Memory - Free (KB)
API Enum: MEM_FREE
KB of system memory free.


Memory - Buffers (KB)
API Enum: MEM_BUFFERS
KB of system memory used for buffers.


Memory - Cached (KB)
API Enum: MEM_CACHED
KB of system memory used for page cache.


Disk - Read (KB/sec)
API Enum: DISK_READ
Total disk reads in KB/sec.


Disk - Write (KB/sec)
API Enum: DISK_WRITE
Total disk writes in KB/sec.


Lock (%)
API Enum: LOCK_EFFECTIVE
(version < 3.0) Effective lock percentage is the sum of the global lock percentage and the lock percentage of the most-locked database at the time.


Memory - Virtual (MB)
API Enum: PROCESS_MEM_VIRTUAL
Virtual memory of the mongod process.


Memory - Resident (MB)
API Enum: PROCESS_MEM_RESIDENT
Resident memory of the mongod process.


Memory - Mapped (MB)
API Enum: PROCESS_MEM_MAPPED
(MMAPv1 only) Amount of memory mapped by the mongod process. This value reflects the data size.


Memory - Non Mapped (MB)
API Enum: PROCESS_MEM_NONMAPPED
Amount of virtual MongoDB memory not used by MMAP files. High usage can indicate a high number of connections to the database.


Assert - Regular
API Enum: ASSERT_REGULAR
Regular Asserts raised/sec.


Assert - Warning
API Enum: ASSERT_WARNING
Warning Asserts raised/sec.


Assert - Message
API Enum: ASSERT_MSG
Message Asserts raised/sec.


Assert - User
API Enum: ASSERT_USER
User Asserts raised/sec.


Operation - Insert (per sec)
API Enum: OPCOUNTERS_INSERT
Number of Insert operations/sec.


Operation - Query (per sec)
API Enum: OPCOUNTERS_QUERY
Number of Find operations/sec.


Operation - Update (per sec)
API Enum: OPCOUNTERS_UPDATE
Number of Update operations/sec.


Operation - Delete (per sec)
API Enum: OPCOUNTERS_DELETE
Number of Delete operations/sec.


Operation - Getmore (per sec)
API Enum: OPCOUNTERS_GETMORE
Number of Getmore operations/sec.


Operation - Command (per sec)
API Enum: OPCOUNTERS_CMD
Number of Command operations/sec.


Index - Access
API Enum: INDEX_ACCESSES
(version < 3.0) Number of times that operations have accessed indexes. This value is the combination of the hits and misses.


Index - Hits
API Enum: INDEX_HITS
(version < 3.0) Number of times that an index has been accessed and mongod is able to return the index from memory.


Index - Miss
API Enum: INDEX_MISS
(version < 3.0) Number of times that an operation attempted to access an index that was not in memory.


Index - Reset
API Enum: INDEX_RESET
(version < 3.0) Number of times that the index counters have been reset since the database last restarted.


Index - Miss Ratio
API Enum: INDEX_MISSRATIO
(version < 3.0) The ratio of hits to misses.


Current Connections
API Enum: CONNECTIONS_CURRENT
Currently active connections.


Background Flush Average (ms)
API Enum: BACKGROUND_FLUSH_AVG
(MMAPv1 only) Amount of time, in milliseconds, that the last flush operation took to complete.


Queues - Total
API Enum: QUEUES_TOTAL
Number of operations queued waiting for a lock.


Queues - Read
API Enum: QUEUES_READ
Number of operations queued waiting for a read lock.


Queues - Write
API Enum: QUEUES_WRITE
Number of operations queued waiting for a write lock.


Network - Bytes In (KB/sec)
API Enum: NETWORK_BYTES_IN
Average KB/sec sent to the database server.


Network - Bytes Out (KB/sec)
API Enum: NETWORK_BYTES_OUT
Average KB/sec sent out from the database server.


Page Faults
API Enum: PAGE_FAULT
Average Page Faults/sec.


Journalled MB (MB)
API Enum: JOURNALED_MB
(MMAPv1 only) Amount of data in megabytes (MB) written to the journal during the last journal group commit interval.


Write To Data File MB (MB)
API Enum: WRITETODATAFILES_MB
(MMAPv1 only) Amount of data in megabytes (MB) written from the journal to the data files during the last journal group commit interval.


Cursor - Open
API Enum: CURSOR_OPEN
Total open cursors on the server.


Cursor - TimeOut
API Enum: CURSOR_TIMEOUT
Average timed out cursors/sec.


Replica - Window Size (hrs)
API Enum: REPL_WINDOW_SIZE
Replication oplog window in hours. If a secondary falls behind by more than this, it will need to be resynced.
The replication window is a buffer of the MongoDB oplog that the secondary uses to sync with the primary. The Replica Window Size is the time difference between the oldest and newest entries in the primary's oplog. That’s the amount of time an operation remains in the oplog before being overwritten by a new entry (and thus stays available for syncing to the secondary). Typically, we like to have a window size of 10-12 hours.
The window size shrinks when there is an influx of write operations.
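
As a worked example of the definition above, the window size is the newest oplog entry's timestamp minus the oldest one's, expressed in hours. The helper name and the timestamps below are made up for illustration. In practice the two timestamps would come from the primary's oplog.

```python
from datetime import datetime


def oplog_window_hours(oldest_entry, newest_entry):
    """Time span covered by the oplog, in hours."""
    return (newest_entry - oldest_entry).total_seconds() / 3600


# Hypothetical timestamps of the oldest and newest oplog entries.
oldest = datetime(2024, 1, 1, 0, 0)
newest = datetime(2024, 1, 1, 9, 30)

window = oplog_window_hours(oldest, newest)
print(window)       # 9.5
print(window < 12)  # True: below the default 12-hour global rule
```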


Replica - Lag (sec)
API Enum: REPL_LAG
Delay between an operation on the primary and the application of that operation from the Oplog to the secondary.


Head Room (sec)
API Enum: REPL_HEADROOM
Difference between the primary’s Oplog window and the replication lag of the secondary.
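
Read literally, head room is the oplog window minus the replication lag. The sketch below assumes the window is reported in hours and the lag in seconds, matching the units of the two metrics above, and converts to seconds before subtracting. The function name is hypothetical.

```python
def head_room_seconds(window_hours, lag_seconds):
    """Oplog window minus replication lag, in seconds.

    A value at or below zero would mean the secondary has fallen
    outside the oplog window and needs a resync.
    """
    return window_hours * 3600 - lag_seconds


# A 12-hour window with 90 seconds of lag leaves 43110 seconds of head room.
print(head_room_seconds(12, 90))  # 43110
```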


Ticket Reads
API Enum: TICKETS_READS
The number of read tickets available to the WiredTiger storage engine. Read tickets represent the number of concurrent read operations allowed into the storage engine. When this value reaches zero, new read requests may queue until a read ticket becomes available.


Ticket Writes
API Enum: TICKETS_WRITES
The number of write tickets available to the WiredTiger storage engine. Write tickets represent the number of concurrent write operations allowed into the storage engine. When this value reaches zero, new write requests may queue until a write ticket becomes available.


Cache Dirty (b)
API Enum: CACHE_DIRTY
The number of tracked dirty bytes currently in the WiredTiger cache.


Cache Used (b)
API Enum: CACHE_USED
The number of bytes currently in the WiredTiger cache.


Cache Activity Read (KB/sec)
API Enum: CACHE_ACTIVITY_READ
The average rate of kilobytes/sec read into WiredTiger's cache over the selected sample period.


Cache Activity Write (KB/sec)
API Enum: CACHE_ACTIVITY_WRITE
The average rate of kilobytes/sec written from WiredTiger's cache over the selected sample period.