Getting to Know EMDIAG: repvfy execute optimize

In my group, we work with a lot of customers with very large EM environments, in the range of 2000+ agents. As you can imagine, there’s a bit of optimizing that needs to be done to account for these numbers.

A few of these standard tweaks have been put into the repvfy execute optimize command.    You can make all these changes individually, but if you want to get them all done at once, optimize is your tool.

There are 3 categories of optimization handled at this point: Internal Tasks, Repository Settings and Target System. The script first evaluates the size of your repository based on the number of agents, and from there determines which optimizations need to be done now or recommended for future implementation.

Internal Task Tuning

Enterprise Manager uses short and long workers, depending on the task activity. We typically recommend 2 of each for most larger systems, so this is what repvfy execute optimize sets. Smaller systems are usually fine with the default of 1 each. You can view the configuration in EM on the Manage Cloud Control -> Repository page, where you can also configure the short workers, but not the long. A high collection backlog is an indication that you’re in need of additional task workers.

The next step is to evaluate the current settings of the job system and ensure that there are enough connections available for it. This change is not implemented automatically; it is printed out for you to apply with emctl, as it requires an OMS restart to take effect. Recommendations for Large Job System Load can be found in the Sizing chapter of the Advanced Installation Guide. Increasing the number of connections may also require increasing the database processes parameter.

Repository Settings Tuning

EM tracks system errors in one of its tables. In larger systems, the MGMT_SYSTEM_ERROR_LOG table can become quite large over the 31 day default retention. The optimize script reduces log retention to 7 days for normal operation.

Various levels of tracing are also enabled by default, which can generate a lot of extra activity during normal operations if you’re not using the traces. The optimize command turns tracing off. It can be re-enabled at any time with the repvfy send start_trace -name <name> and repvfy send start_repotrace commands.

Finally, this step looks for any invalid SYSMAN objects and recompiles them, then checks for stale optimizer statistics and recommends gathering them as needed.

System Tuning

After an EM outage or downtime, all the agents will attempt to upload and update their status (or heartbeat) with the OMS.  There’s a grace period in which no alerts are sent.  In larger systems, this grace period may not be long enough to get all agents updated before alerts start going out.   This can be adjusted by increasing that grace period.

In 12.1.0.3 and higher, you can also increase the number of threads that perform the ping heartbeat tasks.  This should be done if you have more than 2000 agents per OMS.  The optimize command will make this calculation for you and recommend the appropriate emctl command to set the heartbeatPingRecorderThreads property.  Recommendations for Large Number of Agents can be found in the Sizing chapter of Advanced Installation Guide.

The optimize command will only output those items that require attention, so not every item will appear in the output on every site.
The recommended values reported in the output are specific to THAT environment and should not simply be copied to another environment. To tune another EM environment, run the optimize script on that environment.

Sample output from a small EM system:

bash-4.1$ ./repvfy execute optimize

Please enter the SYSMAN password:
SQL*Plus: Release 11.1.0.7.0 - Production on Thu Jul 9 07:59:35 2015

Copyright (c) 1982, 2008, Oracle. All rights reserved.

SQL> Connected.

Session altered.
Session altered.

========== ========== ========== ========== ========== ========== ==========
== Internal task system tuning ==
========== ========== ========== ========== ========== ========== ==========

- Setting the number of short workers to 2 (1->2)
- Setting the number of long workers to 2 (1->2)
========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Job system tuning ==
========== ========== ========== ========== ========== ========== ==========

- On each OMS, run this command:
  $ emctl set property -name oracle.sysman.core.conn.maxConnForJobWorkers -value 72 -module emoms
  This change will require a bounce of the OMS

========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Repository tuning ==
========== ========== ========== ========== ========== ========== ==========
- Setting retention for MGMT_SYSTEM_ERROR_LOG table to 7 days (31->7)

- Disabling PL/SQL tracing for module (EM.GDS)
- Disabling PL/SQL tracing for module (EM_DBM)

- Disabling repository metric tracing for ID (1234)

- Recompiling invalid object (foo,TRIGGER)
- Recompiling invalid object (bar,CONSTRAINT)

- Stale CBO statistics in the repository. Gather statistics for the SYSMAN schema
  Command to use:
  $ repvfy send gather_stats
  Or:
  SQL> exec emd_maintenance.gather_sysman_stats_job(p_gather_all=>'YES');

========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Target system tuning ==
========== ========== ========== ========== ========== ========== ==========

- Setting the PING grace period to (90) (60->90)

- Set the parameter oracle.sysman.core.omsAgentComm.ping.heartbeatPingRecorderThreads to 3
  $ emctl set property -module emoms -name oracle.sysman.core.omsAgentComm.ping.heartbeatPingRecorderThreads -value 3

========== ========== ========== ========== ========== ========== ==========
not spooling currently

Getting to Know EMDIAG – repvfy show score_card

Continuing with a series of EMDIAG commands that I find useful, today we’re going to look at repvfy show score_card. You may also want to review the previous post on repvfy diag all.

One of the main functions of repvfy is the verify function (e.g. repvfy verify -details -level 9). This goes through a long list of checks to identify areas you might need to investigate. This could be anything from Agents with Clock Skew to Expired User accounts.

Now there’s a scorecard (show score_card) that reads the output of the verify log, summarizes the categories of errors that were found, and generates a score based on the weight, number of tests and number of violations. This can help you track improvements as you work through any cleanup and issues.

repvfy show score_card (also try score_card_details)

Category                  Score  #Tests #Viol
------------------------- ------ ------ ------
Best Practice              81.59      6    780
Configuration              21.64     20   2298
Data Integrity             25.09     29   2793
Monitoring/Operations      60.71     23    886
------------------------- ------ ------ ------

The output of show score_card_details gives the breakdown by category (Best Practice, Configuration, Data Integrity, Monitoring/Operations) and module (Targets, Agents, etc.) along with the test to run (e.g. repvfy verify targets -test 6002).

Note: The sample output below has been modified and shortened to fit the blog.

$ repvfy show score_card_details

Category Rank Module  ID   Test                                         Score #Viol
BP          1 TARGETS 6002 OMS mediated targets without backup Agent     10.0   639
BP          2 AGENTS  6006 Deployed Agent plugins lower than OMS plugin   6.6   106
BP          3 REPO    6039 Newer version avail for deployed OMS plugin    1.1     9
BP          4 REPO    6005 Tables with locked statistics                  0.3    23
BP          5 JOBS    6006 Job steps running more than two hours          0.3     2
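
Each details row maps onto a verify command of the form quoted above (repvfy verify targets -test 6002). Here is a minimal Python sketch of that mapping. It assumes the module name is simply lowercased, as in that example; the helper name verify_command is hypothetical, and because the sample output was shortened for the blog, use the module names your own output shows. Not every test has an automated fix, so the -fix flag is off by default.

```python
def verify_command(module, test_id, fix=False):
    """Build the repvfy command line for one score card row.

    Assumption: the module is passed in lowercase, mirroring the
    'repvfy verify targets -test 6002' example from the post.
    """
    parts = ["repvfy", "verify", module.lower(), "-test", str(test_id)]
    if fix:
        # Only some tests support an automated fix.
        parts.append("-fix")
    return " ".join(parts)


# (module, test id) pairs taken from the sample details output.
rows = [("TARGETS", 6002), ("AGENTS", 6006), ("JOBS", 6006)]
for module, test_id in rows:
    print(verify_command(module, test_id))
```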

Running repvfy verify -details will help you identify targets to investigate. Some tests have automated fixes that can be run with repvfy verify <module> -test <test#> -fix. Other issues may have manual steps or suggestions, and some may require opening an SR with Oracle Support. After fixing issues, rerun the full repvfy verify -level 9 and regenerate the score card to track your progress!
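
If you save the score card after each run, tracking progress can be scripted. The following is a rough Python sketch, not part of EMDIAG: it assumes the four-column layout shown in the score card sample, and the function names are my own. It compares violation counts only, since the post does not say how to read the direction of the score itself.

```python
import re

# One data row: category name, score, number of tests, number of violations.
ROW = re.compile(
    r"^(?P<category>\S.*?)\s+(?P<score>\d+\.\d+)\s+(?P<tests>\d+)\s+(?P<viol>\d+)\s*$"
)


def parse_score_card(text):
    """Return {category: (score, tests, violations)} from score card text."""
    card = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # header and separator lines do not match and are skipped
            card[match["category"]] = (
                float(match["score"]),
                int(match["tests"]),
                int(match["viol"]),
            )
    return card


def violation_changes(before, after):
    """Return {category: change in violation count} for shared categories."""
    return {
        name: after[name][2] - before[name][2]
        for name in before
        if name in after
    }


# First card: rows taken from the sample output in this post.
before = parse_score_card("""\
Category                  Score  #Tests #Viol
Best Practice              81.59      6    780
Configuration              21.64     20   2298
""")

# Second card: HYPOTHETICAL violation counts, invented only to show the
# comparison. The scores are left unchanged as placeholders.
after = parse_score_card("""\
Category                  Score  #Tests #Viol
Best Practice              81.59      6    500
Configuration              21.64     20   2298
""")

for name, change in violation_changes(before, after).items():
    print(f"{name}: {change:+d} violations")
```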

Getting to Know EMDIAG – repvfy diag all

If you’ve worked with me, or called me about a problem with your Enterprise Manager, or even attended any of my sessions, you’ve probably heard me talk about EMDIAG.   One of the most popular components of EMDIAG is the Repvfy tool. This is basically a series of scripts and queries that will provide data from the repository to help diagnose configuration and data issues.  You can get more details on downloading and installing in EMDIAG Troubleshooting Kits Master Index (Doc ID 421053.1).

There are 3 components that make up EMDIAG: repvfy, omsvfy and agtvfy. Today I’m introducing one of the features of repvfy, the component that pulls data from the EM repository.

repvfy diag all  

This is my go-to these days. Instead of telling the customer I need X, Y, Z and A, B, C, I ask for this. The diag all option runs through various EMDIAG reports that are frequently used in troubleshooting issues with support and development. It runs the different reports and zips them into a file that can then be uploaded easily. There’s also a shorter version, repvfy diag core.

adding: advisor_day_2015_07_06_084304.log (deflated 81%)
adding: advisors_2015_07_06_084304.log (deflated 83%)
adding: agent_health_2015_07_02_120925.log (deflated 83%)
adding: analyze.log (deflated 83%)
adding: backlog_2015_07_06_084304.log (deflated 84%)
adding: body1.log (stored 0%)
adding: body2.log (stored 0%)
adding: body3.log (stored 0%)
adding: cursor_2015_07_06_084304.log (deflated 84%)
adding: custom_2015_07_06_084304.log (deflated 86%)
adding: deinstall.log (deflated 79%)
adding: details_2015_07_02_082451.log (deflated 82%)
adding: details_2015_07_02_082451.sql (deflated 68%)
adding: details_2015_07_06_084304.log (deflated 87%)
adding: details_2015_07_06_084304.sql (deflated 74%)
adding: errors_2015_07_06_084304.log (deflated 83%)
adding: install.log (deflated 80%)
adding: job_health_2015_07_06_084304.log (deflated 83%)
adding: loader_health_2015_07_06_084304.log (deflated 91%)
adding: metric_stats_2015_07_06_084304.log (deflated 92%)
adding: mtm_2015_07_06_084304.log (deflated 89%)
adding: notif_health_2015_07_06_084304.log (deflated 85%)
adding: performance_2015_07_06_084304.log (deflated 88%)
adding: ping_health_2015_07_06_084304.log (deflated 76%)
adding: pkg.log (deflated 62%)
adding: space_2015_07_06_084304.log (deflated 91%)
adding: system_2015_07_06_084304.log (deflated 86%)
adding: task_health_2015_07_06_084304.log (deflated 83%)
adding: upgrade2.log (stored 0%)
adding: verify.log (deflated 89%)
adding: verify_2015_07_02_082451.log (deflated 49%)
adding: verify_2015_07_06_084304.log (deflated 45%)
adding: views.log (deflated 82%)

File created: /u01/oracle/em12r5/oms/emdiag/tmp/repvfy_2015_07_06_084304.zip
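
The zip can contain several timestamped runs of the same report (two details and two verify logs in the listing above). Below is a small Python sketch for picking the newest log per report. It assumes the <report>_YYYY_MM_DD_HHMMSS.log naming seen in the listing; the function names are hypothetical and not part of EMDIAG.

```python
import re
import zipfile

# Matches names such as agent_health_2015_07_02_120925.log
STAMPED = re.compile(r"^(?P<report>.+)_(?P<stamp>\d{4}_\d{2}_\d{2}_\d{6})\.log$")


def latest_reports(names):
    """Map report name -> newest timestamped .log file name."""
    latest = {}
    for name in names:
        match = STAMPED.match(name)
        if not match:
            continue  # e.g. analyze.log, install.log, *.sql
        report, stamp = match["report"], match["stamp"]
        # The zero-padded YYYY_MM_DD_HHMMSS stamp sorts correctly as a string.
        if report not in latest or stamp > latest[report][0]:
            latest[report] = (stamp, name)
    return {report: name for report, (stamp, name) in latest.items()}


def latest_reports_in_zip(path):
    """Same as latest_reports, reading the member names from a zip file."""
    with zipfile.ZipFile(path) as archive:
        return latest_reports(archive.namelist())


# File names copied from the listing above.
names = [
    "details_2015_07_02_082451.log",
    "details_2015_07_06_084304.log",
    "verify.log",
    "verify_2015_07_02_082451.log",
    "verify_2015_07_06_084304.log",
]
for report, name in sorted(latest_reports(names).items()):
    print(report, "->", name)
```

To run it against a real archive, pass the path printed after "File created:" to latest_reports_in_zip.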

So just what does it gather information about?   Here’s a one line summary of each report:

advisors  – ADDM, ASH and AWR reports from the repository database
agent_health – summary of deployed agents, plugins and targets as well as availability and ping statistics
backlog – statistics from dbms_scheduler, loader subsystem, job subsystem, notification subsystem and the task/worker subsystem
cursor – cursor parameters and statistics for EM SQL
custom – summary of EM customizations done
errors – targets, agents, plugins, metrics, collections, jobs in error
job_health – summary of job subsystem configuration, statistics and performance
loader_health – summary of loader subsystem configuration, statistics and performance
metric_stats – performance summary of repository, loader subsystem, purge policies and metrics including top targets and metrics
mtm – summary of Repository and OMS configuration, housekeeping jobs, agent and plugin deployments
notif_health – summary of notification subsystem configuration, statistics and performance
performance – performance summary of repository, OMS, agents and internal subsystems.
ping_health – summary of agent ping jobs and communication
space – summary of schema statistics collections, table/index sizes and fragmentation
system – full configuration summary
task_health – summary of task subsystem configuration, statistics and performance
verify/details – the standard verification checks with detailed output

Depending on the issue you’re seeing, I will look at different reports. If you have problems with notifications, I’m obviously going to go through the notif_health and probably the backlog and job_health reports. If I’m just trying to get a good understanding of how your system is built, what targets you manage and what you’re doing with them, I’d start with the system and custom reports.

In future posts, we’ll break down some of these reports in detail, but that’s it for today!

Preventing Alerts on OS Audit File Size when Upgrading DB Plug-in

In January, the DB Plug-in 12.1.0.7.0 was released for EM 12c. Not long after, my friend Brian found they added a new metric with default thresholds. The new metric group is Operating System Audit Files and the metric alerts on Size of Audit Files. Depending on the size and age of your environment, you may immediately start getting notifications or pages, as the default thresholds are 10MB/20MB, which can be quite small.

An example of the notification you might receive:

 Host=xxxxxx.us.oracle.com 
 Target type=Database Instance 
 Target name=emrep 
 Categories=Capacity 
 Message=35.39 MB of Audit Trail files collected (.aud: 35.39, .xml: 0, .bin: 0) 
 Severity=Critical 
 Event reported time=Feb 6, 2015 6:53:56 AM PST 
 Target Lifecycle Status=Production 
 Operating System=Linux
 Platform=x86_64
 Department=DBA
 Associated Incident Id=2103 
 Associated Incident Status=New 
 Associated Incident Owner= 
 Associated Incident Acknowledged By Owner=No 
 Associated Incident Priority=None 
 Associated Incident Escalation Level=0 
 Event Type=Metric Alert 
 Event name=sizeOfOSAuditFiles:FILE_SIZE 
 Metric Group=Operating System Audit Records
 Metric=Size of Audit Files
 Metric value=35.39
 Key Value= 
 Rule Name=DBA_Incident_Rule,Create incident for critical metric alerts 
 Rule Owner=SYSMAN 
 Update Details:
 3.39 MB of Audit Trail files collected (.aud: 3.39, .xml: 0, .bin: 0)
 Incident created by rule (Name = DBA_Incident_Rule, Create incident for critical metric alerts; Owner = SYSMAN).


So if you’re planning to upgrade 1000 agents with the new Database Plugin, you might start getting a little nervous about receiving all of these pages.   Since the metric didn’t exist before, it’s not included in your templates to be disabled.  Even if it were in the templates, it would likely alert before you could reapply templates.

Luckily, Incident Rules provide a method to exclude a particular event when evaluating an Incident Rule.

From Setup -> Incidents -> Incident Rules, you’ll want to edit your defined Incident Rule. If you haven’t customized an Incident Rule, you can select the default Incident Ruleset and use Create Like to clone it into a copy you can edit.

Select the Metric Alert rule, and click Edit.

On this first screen, you’ll see the Advanced Selection Options. If you expand this, you’ll see an option for Event name. This is where you can exclude a specific metric event by selecting Not Equals and entering the event name. In the case of this metric, the event name is sizeOfOSAuditFiles:FILE_SIZE.

Click Next and Continue until you finally get to Save. To validate, you can use Simulate Rules or trigger an alert to see if it sends the email.

This concept can be applied to help filter out other events, or categories of metrics as needed.
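
To make the effect of the Not Equals filter concrete, here is a purely conceptual Python model of the rule's selection logic. It is not how EM implements Incident Rules; the dictionary keys and the function name are made up for illustration, and only the event name and the "critical metric alerts" criteria come from the notification shown earlier.

```python
# Event names the rule should ignore (the Not Equals filter).
EXCLUDED_EVENT_NAMES = {"sizeOfOSAuditFiles:FILE_SIZE"}


def rule_applies(event, excluded=EXCLUDED_EVENT_NAMES):
    """Conceptual model only: a critical metric alert rule with exclusions."""
    return (
        event["type"] == "Metric Alert"
        and event["severity"] == "Critical"
        and event["name"] not in excluded
    )


# Values taken from the sample notification above.
audit_event = {
    "type": "Metric Alert",
    "severity": "Critical",
    "name": "sizeOfOSAuditFiles:FILE_SIZE",
}
print(rule_applies(audit_event))
```

Any other critical metric alert still matches the rule, so only the new audit file metric stops creating incidents.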

 

Resolving Conflicts While Patching the EM Agent

Unless you’re in a cave asleep, you’ve seen the recent Oracle PSUs and patches that were released in January.  This has many of my customers patching their agents, and a few have noticed a problem with some previously applied patches.  Thanks to Brian for pointing this out!

If you applied the recommended Java patches sometime last year (18502187, 18721761), and you go to apply the 12.1.0.4.5 agent patch 20282974 (or an earlier bundle), the analyze step will fail as the Java patches are included in the agent bundle, but for some reason it can’t “ignore” them.    Checking the detailed results of the failed Analyze job will show which step failed.   In this specific case, the step PrerequisiteCheckForApply will be marked Failed and the error will look like this:

[Screenshot: error details for the failed PrerequisiteCheckForApply step]

To remedy this with EM, you can create a patch plan with the Java patches (18502187, 18721761) and select Rollback in the Deployment Options.  This will run the rollback for these patches.

Once rolled back, re-analyze your 12.1.0.4.5 patch plan and it should be successful and allow you to deploy!

When you update Agents, keep these things in mind:

  • Be sure to update any Agent clones/gold images with the newly updated patch and plug-ins if they’ve been upgraded or patched.
  • If you staged patches for auto-deployment in the OMS $ORACLE_HOME/install/oneoffs, remove the old patches; you should now need only the 12.1.0.4.5 patch, plus any current Discovery patches.
  • Update your procedural documents to reflect the current patches required.

For more details on how to patch your agent using EM, see the Administrator’s guide.

 

Notifications for Expiring DBSNMP Passwords

Most user accounts these days have a password profile on them that automatically expires the password after a set number of days. Depending on your company’s security requirements, this may be as little as 30 days or as long as 365 days, although typically it falls between 60 and 90 days. For a normal user, this can cause a small interruption in your day as you have to go get your password reset by an admin. When this happens to privileged accounts, such as the DBSNMP account that is responsible for monitoring database availability, it can cause bigger problems.

In Oracle Enterprise Manager 12c you may notice the error message “ORA-28002: the password will expire within 5 days” when you connect to a target, or worse you may get “ORA-28001: the password has expired”. If you wait too long, your monitoring will fail because the password is locked out. Wouldn’t it be nice if we could get an alert 10 days before our DBSNMP password expired? Thanks to Oracle Enterprise Manager 12c Metric Extensions (ME), you can! See the Oracle Enterprise Manager Cloud Control Administrator’s Guide for more information on Metric Extensions.
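
The check behind such an alert is simple date arithmetic. Here is a minimal Python sketch of the threshold logic only; it is not the Metric Extension itself, and the function names are hypothetical. In a real Metric Extension the expiry date would more likely come from a SQL query (for example against DBA_USERS.EXPIRY_DATE), which is not shown in this post.

```python
from datetime import datetime

WARN_DAYS = 10  # alert this many days before expiry, as suggested above


def days_until_expiry(expiry_date, now):
    """Whole days remaining before the password expires (negative if past)."""
    return (expiry_date - now).days


def needs_alert(expiry_date, now, warn_days=WARN_DAYS):
    """True once the password is within warn_days of expiring."""
    return days_until_expiry(expiry_date, now) <= warn_days


# Hypothetical dates, for illustration only.
now = datetime(2015, 2, 1)
expiry = datetime(2015, 2, 9)
print(days_until_expiry(expiry, now), needs_alert(expiry, now))
```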

Read more here

Simplified Agent and Plug-in Deployment

On a site with hundreds or thousands of hosts, have you had to patch agents immediately after they get deployed? For this reason I’ve always been a big fan of cloning an agent that has the required plug-ins and all the recommended core agent and plug-in patches, then using that clone for all new agent deployments. With EM 12c this got even easier, as you can now clone the agent using the console “Add Host” method. You still have to rely on the EM users to use the clone.

The one problem I have with cloning is that you have to have a reference target for each platform that you support. If you have a consolidated environment and only have Linux x64, this may not be a problem. If you are managing a typical data center with a mixture of platforms, it can become quite the maintenance nightmare just to maintain your golden images.

You must update golden image agents whenever you get a new patch (generic or platform specific) for the agent or plug-in, and recreate the clone for each platform. Typically, I find people create a clone for their most common platforms, and forget about the rest. That means maybe 80% of their agents meet their standard patch requirements and plug-ins upon deployment, but the other 20% have to be patched post-deploy, or worse, never get patched!

Deployed agents and plug-ins can be patched easily using EM Patches & Updates, but what about the agents still getting deployed or upgraded? Wouldn’t it be nice if they got patched as part of the deployment or upgrade? This article will show you two new features in EM 12.1.0.3 (12cR3) that will help you deploy the most current agent and plug-in versions. Whether you have 100s or 1000s of agents to manage, reducing maintenance and keeping the agents up to date is an important task, and being able to deploy or upgrade to a fully patched agent will save you a lot of time and effort.

Read original post here.