Recently I performed a failover of my Oracle database (running on Linux) to my standby database, and after the switchover the application team started complaining about extreme slowness. I was using OEM Cloud Control, and its graph was showing high waits on "free buffer waits", while the alert log started showing "Checkpoint not complete" messages. Since I had never seen these waits on my previous primary server (now the standby), the first thing that came to my mind was that the disks on the standby server (now the primary) were probably very slow, because the hardware of these servers was very old. The servers also had internal disks (not SAN or NAS). I generated an AWR report for a time when the database was running fine without any performance issue, and then a report based on the latest snapshots to see what was going wrong with the IO.
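For reference, a report like this can be generated interactively with the awrrpt.sql script, or directly through the DBMS_WORKLOAD_REPOSITORY package; the DBID and snapshot IDs below are placeholders that you would take from V$DATABASE and DBA_HIST_SNAPSHOT:

    -- Interactive: prompts for report type, snapshot range and file name
    @?/rdbms/admin/awrrpt.sql

    -- Or generate the HTML report from SQL; 1234567890, 100 and 101 are
    -- placeholder values for DBID, begin snap ID and end snap ID
    SELECT output
    FROM   TABLE(DBMS_WORKLOAD_REPOSITORY.AWR_REPORT_HTML(
                   l_dbid => 1234567890, l_inst_num => 1,
                   l_bid  => 100,        l_eid      => 101));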
In the following I will show a comparison of both AWR reports (good time and bad time) and explain how to check IO related issues in AWR reports. At the end, I will also explain how, despite the slow disks/IO, I was able to ease the latency issue. The AWR snapshot interval was set to 30 minutes in this case.
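As a side note, the current snapshot interval and retention can be checked in DBA_HIST_WR_CONTROL, and changed with DBMS_WORKLOAD_REPOSITORY if needed (the interval parameter is in minutes):

    -- Current AWR snapshot settings
    SELECT snap_interval, retention FROM dba_hist_wr_control;

    -- Set a 30-minute snapshot interval
    BEGIN
      DBMS_WORKLOAD_REPOSITORY.MODIFY_SNAPSHOT_SETTINGS(interval => 30);
    END;
    /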
The left portion (AWR 1, with the green caption) of the following images is from the time when the database was running on the primary node without any disk IO latency issue. The right portion (AWR 2, with the red caption) is from when the database was running on the standby host (after the switchover), which had the disk IO latency issues.
In the first section of the AWR report we don't see much difference in the load profile that would need our attention to investigate IO latency.

Next, we compare instance efficiency and the Top 10 Foreground Wait Events. During the performance issue window, Buffer Hit % appears higher than during the "good" time, but that can be ignored. The next alarming thing is the "free buffer waits" event at the top of the right-side AWR report, and this is the first sign of disk IO latency (we will confirm it later). Because of the slow IO, DBWR was not fast enough in cleaning the buffer cache (moving dirty blocks from the buffer cache to the data files to make room for new data blocks to be cached). "Configuration", the wait class to which "free buffer waits" belongs, can also be seen listed as the top wait class. "free buffer waits" was the wait event causing the application to respond so slowly.
In the following section of the AWR reports we see the Foreground Wait Events. We can easily identify latency in the disk IO because the average wait (in milliseconds) is quite high in AWR 2 compared to AWR 1 for the IO related wait events.
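Average waits can hide outliers; on a live system, the latency distribution in V$EVENT_HISTOGRAM (also cumulative since startup) gives a finer-grained picture of the same IO events:

    SELECT event, wait_time_milli, wait_count
    FROM   v$event_histogram
    WHERE  event IN ('db file sequential read', 'db file scattered read',
                     'free buffer waits')
    ORDER  BY event, wait_time_milli;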
In the next section of the AWR reports, background wait events related to IO are also showing extremely high wait times in AWR 2.
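The numbers AWR prints for these background events can also be pulled manually from DBA_HIST_SYSTEM_EVENT by differencing two snapshots; the snap IDs 100 and 101 below are placeholders for your own begin/end snapshots:

    SELECT b.event_name,
           e.total_waits - b.total_waits AS waits,
           ROUND((e.time_waited_micro - b.time_waited_micro)
                 / NULLIF(e.total_waits - b.total_waits, 0) / 1000, 2) AS avg_wait_ms
    FROM   dba_hist_system_event b
    JOIN   dba_hist_system_event e
      ON   e.event_name = b.event_name
     AND   e.dbid = b.dbid
     AND   e.instance_number = b.instance_number
    WHERE  b.snap_id = 100
    AND    e.snap_id = 101
    AND    b.event_name IN ('db file parallel write', 'log file parallel write');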
Lastly, let's have a look at the IO Stat section of both AWR reports. We can see that writing by the database writer (DBWR) and log writer (LGWR) processes is very slow in AWR 2. The last column shows the average time in milliseconds that each of these processes waited.
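Outside the AWR report, V$IOSTAT_FUNCTION breaks cumulative IO down by database function, including DBWR and LGWR; as far as I know, the wait_time column is in milliseconds:

    SELECT function_name,
           small_write_reqs + large_write_reqs AS write_reqs,
           number_of_waits,
           ROUND(wait_time / NULLIF(number_of_waits, 0), 2) AS avg_wait_ms
    FROM   v$iostat_function
    WHERE  function_name IN ('DBWR', 'LGWR');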
Once it is identified that the IO is very slow, the disks/storage need to be replaced (or installing the latest firmware sometimes helps). If that cannot be done immediately, we can look at some alternatives as a temporary solution.
In my case, I found that the FILESYSTEMIO_OPTIONS parameter was set to NONE (although Oracle recommends setting it to SETALL). Setting this parameter to NONE means that asynchronous IO is not enabled for the file system. I set this value to SETALL, and after that I observed that IO performance improved enough that I could survive until we switched back to the primary or replaced the disks of this server.
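For completeness, this is roughly how the check and the change look. FILESYSTEMIO_OPTIONS is a static parameter, so it has to be changed with SCOPE=SPFILE and needs an instance restart to take effect:

    -- Check the current value
    SHOW PARAMETER filesystemio_options

    -- SETALL enables both asynchronous and direct IO for file system files
    ALTER SYSTEM SET filesystemio_options = SETALL SCOPE=SPFILE;

    -- Restart the instance for the new value to take effect
    SHUTDOWN IMMEDIATE
    STARTUP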
I have written another article on the same topic that can be used to investigate IO latency using an AWR report. Click here to view that article.
You may also find this article relevant and interesting for checking disk IO stats at the OS level in Linux environments.