This is one of the most common messages a DBA could
see in alert log file of a database. Exact message should be similar to the
following
Wed
Mar 04 12:02:01 2015
Thread 1 cannot allocate new log, sequence 201
Checkpoint
not complete
|
Reason
Whenever a redo log switch occurs i.e. current redo log group (suppose log group 1) is full and LGWR starts writing to the next available redo log group (suppose log group 2), a CHECKPOINT also occurs which means that all database block changes which are in database buffer cache; and have not yet been written to the database files; should be written to the disk (datafiles) by the database writer process (DBWR). Now DBWR starts writing changed blocks to the datafiles and this redo log group (group 1) will no longer be available for the LGWR to reuse until DBWR completes writing all changed blocks from the buffer cache (buffer cache blocks which are related to the data changes recorded in this redo log group). Now if redo log group 2 is also full and LGWR tries to allocate redo log group 1 again for writing; and DBWR has still not finished writing changed blocks to the data files, LGWR would need to wait, and message “Checkpoint no complete” would be written the alert log file. All transactions would hang here as these can’t continue until LGWR is allocated next redo log group for writing.
Whenever a redo log switch occurs i.e. current redo log group (suppose log group 1) is full and LGWR starts writing to the next available redo log group (suppose log group 2), a CHECKPOINT also occurs which means that all database block changes which are in database buffer cache; and have not yet been written to the database files; should be written to the disk (datafiles) by the database writer process (DBWR). Now DBWR starts writing changed blocks to the datafiles and this redo log group (group 1) will no longer be available for the LGWR to reuse until DBWR completes writing all changed blocks from the buffer cache (buffer cache blocks which are related to the data changes recorded in this redo log group). Now if redo log group 2 is also full and LGWR tries to allocate redo log group 1 again for writing; and DBWR has still not finished writing changed blocks to the data files, LGWR would need to wait, and message “Checkpoint no complete” would be written the alert log file. All transactions would hang here as these can’t continue until LGWR is allocated next redo log group for writing.
Solution
Most of the time the solution of “checkpoint not complete”
is to add more redo log groups in the database so that DBWR has enough time to
write changed blocks from database buffer cache to the datafiles before LGWR
tries to acquire same redo log group again. For example, if you have 2 redo log
groups (group 1 and group 2) and you face checkpoint not complete warning,
adding 2 more groups (group 3 and group 4) would mean that turn of a redo log
group for writing by LGWR would come after longer time as compared to having
only 2 log groups, and this would ensure that LGWR have enough free redo log
groups available to be allocated and write rather than waiting for a redo log
group to become available before it could be reused.
How Many redo log groups, and redo log group size
This is a very tricky question. As per my experience
and knowledge, for a new database, I would recommend 5 redo log groups with 100
MB size for each redo log group Start monitoring the log switch frequency
during peak hours and if log switch
frequency is very less (suppose less than 5 minutes), increase the size of redo
log groups to make log switch frequency to not less than 10 minutes. If checkpoint
not complete message appears in alert log file, start adding more redo log
groups until you no longer see this message appearing in alert log file.
Other things to consider
As I mentioned above
in “Solution” section, most of the time solution is to add more redo log
groups, but this may not be a true solution for some scenarios (could be a work
around though). If your disk drives where you have placed your redo log groups are
very slow, you should consider putting your redo log groups on faster disks. For
example this is not recommended to put your redo log groups on RAID 5, as this
RAID type is very slow for sequential writing because of computing parity during
writing to the disks.There is another misconception of putting redo logs on SSD (I also had this misconception, but thanks to an Oracle community discussion which cleared this confusion). SSDs are also not good for sequential writes (also a couple of other reasons) and redo logs should not be placed on these (unless you are using some appliance and it is recommended by the vendor, for example Exadata machine)
Thanks, this is awesome!
ReplyDelete