Sunday, September 21, 2008

Oracle Wait Interface

When a database is up and running, every connected process is either busy performing work or waiting to do so. A process that is waiting may mean nothing in the overall scheme of things—the user may have left to get a cup of coffee, for example—or it can be an indicator that a database bottleneck exists. This is where wait-based or bottleneck analysis comes into play.

If latch contention or heavy table-scan activity has been dragging a database's performance down, you can use the wait interface to confirm the actual root cause. Once one or more wait events or other bottlenecks have been pinpointed as possible performance vampires, you can drill down and often discover a fair amount of detail about the sessions and objects that are causing the problem.

How do you correctly practice bottleneck or wait-based analysis? First, it is imperative that you obtain not only the number of waits for a particular event but also the times involved for each wait. Having both lets you form a complete picture regarding the magnitude of wait-initiated performance degradation. (Of course, the CPU may be the actual bottleneck, but that's a subject for a different article.)

The next prerequisite to using bottleneck analysis is that certain wait events should sometimes be filtered out of any metrics used to diagnose performance bottlenecks. For example, Oracle records a wait statistic that represents how long a particular user sits at the SQL tool prompt between every issued database request and the next one. Such a statistic provides no real value to a DBA who is trying to figure out where a database bottleneck exists but is useful for determining if a bottleneck is inside or outside the database. Any SQL scripts that are used to collect database wait statistics can exclude such events.

Depending on the database engine, there are several levels of detail you can work through when collecting wait statistics. The first level is the system view, which provides a global, cumulative snapshot of all the waits that have occurred on a system. Viewing these numbers can help you determine which wait events have caused the most commotion in a database thus far.

After looking at system-level wait activity, you can drill down further to discover which current connections may be responsible for any reported waits that have been observed at the system level. Such a drill-down, for example, can indicate the processes responsible for most of the index scan waits reported in a global system overview query. For systems that have been up for an extended period, it's hard to spot recent wait activities or trends. This is where it's nice to be able to automatically capture snapshots of wait-based metrics and view the delta measurements for the time periods you're interested in.