If the issue is an instance or a server crashing in the RAC environment, data related to specific modules has to be collected: for example, interconnect statistics, heartbeat verification across the interconnect, and heartbeat verification against the voting disks. A detailed look at the Grid Infrastructure log files may be required after enabling debug (crsctl debug log css "CSSD:9") so that the clusterware writes more diagnostic data into these files. If the issue is a performance-related concern, then collecting a trace from the user session is very helpful in analyzing it. Tools such as Lightweight Onboard Monitor (LTOM)¹, or at a minimum a trace collected using event 10046, provide the required level of detail.
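As a minimal sketch of the session-level tracing mentioned above, event 10046 at level 12 captures both wait events and bind variables; the SID and SERIAL# values passed to DBMS_MONITOR below are placeholder values, not taken from any example in this chapter.

-- Enable extended SQL trace (waits and binds) in the current session
ALTER SESSION SET EVENTS '10046 trace name context forever, level 12';

-- ... run the slow operation that is being diagnosed ...

-- Turn tracing off once the workload of interest has completed
ALTER SESSION SET EVENTS '10046 trace name context off';

-- To trace another user's session instead, DBMS_MONITOR can be used;
-- 123 and 4567 are placeholder SID and SERIAL# values
BEGIN
  DBMS_MONITOR.SESSION_TRACE_ENABLE(session_id => 123,
                                    serial_num => 4567,
                                    waits      => TRUE,
                                    binds      => TRUE);
END;
/

The resulting trace file, found in the session's trace directory, can then be analyzed (for example with tkprof) to identify the statements and wait events responsible for the slowness.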
Often, instance or server crashes in a RAC environment are caused by an overload on the system that affects its overall performance. In these situations, attention may shift to the availability or stability of the cluster; however, the root cause analysis may point to other reasons.
Area Drilldown
Drilling down to identify the cause or area of a performance issue is probably the most critical of the steps: with all the data collected, it is time to narrow down to the actual reason that led to the problem. Whether the issue is an instance/server crash caused by overload or a poorly performing module or application, the actual problem should be identified at this stage and documented. For example, which query in the module or application is slowing down the process, or is contention from another application (such as a batch job) causing the online application to slow down?
At this level of drilldown, the details of the application area need to be identified: which service, which module, and which action is responsible for the slowness. To get this level of detail, the DBMS_APPLICATION_INFO package discussed earlier is a very helpful feature.
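As an illustrative sketch (the module name ORDER_ENTRY and action name INSERT_ORDER are assumed placeholders, not names from a real application), the application tags its sessions with DBMS_APPLICATION_INFO, and the DBA can then map slow sessions back to a service, module, and action through GV$SESSION:

-- Application code tags the session before running its SQL
BEGIN
  DBMS_APPLICATION_INFO.SET_MODULE(module_name => 'ORDER_ENTRY',
                                   action_name => 'INSERT_ORDER');
END;
/

-- Identify which service, module, and action the slow sessions belong to
SELECT inst_id, service_name, module, action, sql_id, event
FROM   gv$session
WHERE  module = 'ORDER_ENTRY';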
Problem Resolution
Working to resolve the performance issue is the next critical step. Resolution can take several forms: database parameters may have to be changed; host bus adapter (HBA) controllers, network capacity, or additional infrastructure such as CPU or memory may have to be added; a badly performing structured query language (SQL) query may have to be tuned; the batch application may have to be scheduled outside the time frame of the primary online application; or, better still, the workload may be distributed using database services so that resource contention on any one server/instance does not cause poor response times. It is important that the entire application is taken into consideration when fixing problems; a fix that helps one part of the application should not adversely affect the other parts.
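As a sketch of workload distribution with database services (the database name PROD, the instance names PROD1 through PROD3, and the service names OLTP_SRV and BATCH_SRV are assumed placeholders; the short-form srvctl options shown are the pre-12c syntax), the online and batch workloads can be assigned separate services with different preferred instances:

# Online service runs on PROD1 and PROD2, with PROD3 available as a fallback
srvctl add service -d PROD -s OLTP_SRV -r "PROD1,PROD2" -a "PROD3"

# Batch service is kept on PROD3 so it does not contend with the online workload
srvctl add service -d PROD -s BATCH_SRV -r "PROD3" -a "PROD1"

srvctl start service -d PROD -s OLTP_SRV
srvctl start service -d PROD -s BATCH_SRV

Because each workload connects through its own service, the batch job no longer competes for resources on the instances serving the online users.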
Testing Against Baseline
Once the problem identified has been fixed and unit tested, the code is integrated with the rest of the application and tested to verify that the performance issue has been resolved. In the case of hardware-related changes or fixes, such a test may be very hard to verify; however, if the fix is applied over a weekend or during a maintenance window, the application can at least be tested to ensure it is not broken by these changes. The complexity of the situation and the maintenance window available will determine how extensive these tests can be. A great benefit of database services is that they allow a certain server or database instance to be removed from regular usage, or allow limited access to a certain part of the application functionality, so the fix can be tested against an isolated instance or workload until it is verified and made available for others to use.
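A sketch of how this isolation might look, reusing the placeholder names from the earlier example (PROD, PROD2, OLTP_SRV) plus a hypothetical TEST_SRV service assumed to have been created beforehand with srvctl add service: the regular service is stopped on one instance, a test-only service is started there, and the change is reversed after verification.

# Remove instance PROD2 from regular usage by stopping the online service there
srvctl stop service -d PROD -s OLTP_SRV -i PROD2

# Start a test-only service on PROD2 so the fix can be verified in isolation
srvctl start service -d PROD -s TEST_SRV -i PROD2

# After the fix has been verified, return PROD2 to the regular workload
srvctl stop service -d PROD -s TEST_SRV -i PROD2
srvctl start service -d PROD -s OLTP_SRV -i PROD2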
¹ Usage and implementation of LTOM will be discussed in Chapter 6.