Standardize Target Monitoring with Templates

Enterprise Manager is a critical tool for monitoring database and middleware targets, as well as Engineered Systems and hosts.  Each target has its own set of metrics.  If you read my previous posts on viewing metrics and setting thresholds, you’ve got a good understanding of how to set thresholds on a single target.  What if you have 100 targets?  Or 1,000?  Your production targets may even have different thresholds than non-production.  Do you really want to manually set these metrics up on all targets?  Not likely.  If you have more than 3 databases or targets, you should probably consider standardizing your monitoring by using Monitoring Templates.  Templates allow you to reuse the metrics you’ve defined for like targets.

From the Enterprise menu, select Monitoring / Monitoring Templates.


In the search box, you can see an option to display Oracle Certified templates.

If you check this, you’ll find a long list of templates for various middleware and application situations.


Create Template from Target

The first method to create a template is based on an existing target.  This allows you to configure your monitoring on one sample target, and copy this to a template.

Click Create.  Notice that the copy monitoring settings from Target option is selected.


Click the search icon to find the sample target you want to copy metrics from and click Select.

First we need to give our template a name.  If you’re going to have multiple templates, it’s best to give them a detailed name to make them distinct and easily identified.     Notice the Default Template checkbox – if you check this, this template will be automatically applied to all new (not existing) Cluster Database targets as they are discovered in Enterprise Manager.  Only one default template per target type can be identified.


Click on Metric Thresholds and you will see a familiar screen with the target metrics and Warning and Critical thresholds.


If there are additional metrics you want to add to this template, or some you want to remove from it, click the Remove or Add metrics button.


When adding metrics, you’ll be able to search for another target, template or metric extension that you wish to add to this template.


When you’ve made your adjustments, click on the OK button to save your template.  You’ll get a confirmation when your template is created.

Create from Target Type

From Monitoring Templates, click Create, and this time select the option for Target Type.  This option will pull the default registered metrics for that particular target type.


Next you’ll select a category and the target type.  For Database, we will select Database Instance.    From here, the process is the same.  This template will have all default recommended metrics and you can make your adjustments from here.


Apply Templates

Now that you have a new template, you can select this template and click Apply to apply it to any existing targets.

The Apply Options are important to consider.  By default, templates override only metrics common to template and target.  This means if there’s a metric on the target that is not included in the template, it is not removed or replaced.  If a common metric has different thresholds or no thresholds on the target, it is updated to match the template.  The top option, to completely replace settings on the target, will make the target identical to the template.  This means if there are metrics on the target that are not in the template, the apply will remove thresholds for those metrics and they will no longer alert.
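To make the difference between the two apply options concrete, here is a small Python sketch.  It is only a model of the behavior described above, not how Enterprise Manager implements it.  The function name, the dictionary layout and the threshold values are all made up for illustration.

```python
def apply_template(target, template, replace_all=False):
    """Return the target's thresholds after a (simulated) template apply.

    target, template: dict mapping metric name -> (warning, critical).
    replace_all=False models the default option: only metrics present on
    both the template and the target are overwritten.
    replace_all=True models "completely replace": metrics missing from the
    template lose their thresholds, so they can no longer alert.
    """
    if replace_all:
        # Start with every target metric cleared, then lay the template on top.
        result = {metric: (None, None) for metric in target}
        result.update(template)
        return result
    result = dict(target)
    for metric, thresholds in template.items():
        if metric in result:  # only metrics common to both are touched
            result[metric] = thresholds
    return result


# Hypothetical metrics and thresholds, for illustration only.
target = {
    "Tablespace Space Used (%)": (85, 97),
    "Archive Area Used (%)": (None, None),
    "Global Cache Blocks Lost": (1, 3),
}
template = {
    "Tablespace Space Used (%)": (80, 90),
    "Archive Area Used (%)": (70, 85),
}

print(apply_template(target, template))
print(apply_template(target, template, replace_all=True))
```

With the default option, Global Cache Blocks Lost keeps its thresholds because the template does not mention it.  With the replace option, its thresholds are cleared.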


The Key Values section tells the template apply how you want to handle those metrics such as Tablespace that might have multiple key values, say different thresholds for SYSTEM and SYSAUX tablespaces.


Click Add to select the targets or group you would like to apply the template to, and click Select.  Then click OK to submit the Apply job.



You can view the apply status from the Past Apply Operations button and get information on succeeded and failed operations.

So now you can take some time up front, standardize your metrics, and enforce them with templates.


Hands on Monitoring Exercises with Enterprise Manager

Dive deeper into the areas that interest you!   All steps can be done on your lab box or on your own EM system.

View Data with All Metrics

Modifying Metrics and Collections

Create a Template

Create a Metric Extension to notify on expiring DBSNMP accounts

Create a Metric Extension for Fast Recovery Area

Create a Repository-Side Metric Extension

Filter out a specific alert from incident rules

Managing Metric Thresholds in Enterprise Manager

One of the most critical steps in monitoring your targets with Enterprise Manager is to set your metrics and thresholds properly for your environment.  All targets will have predefined metrics that will be enabled and thresholds set based on recommendations from Oracle product teams.  These may or may not be good for your environment.  Customers all have different requirements for what they want to be e-mailed, paged or notified by ticket about.

The most common metrics for databases are going to be the ones that cause service outages:  availability, space issues, archiver issues, data guard gaps, critical ORA- errors.   Some things, you just don’t need to know about at 2am though, things like global cache blocks lost.

From the target menu, select Monitoring / Metric and Collection Settings.  This will show you the current settings of your target.  Notice the default view is Metrics with Thresholds.  Other items are collected and can be seen in the All Metrics view.


Let’s take a closer look at what we see here.  First we have the metric grouping or category.  Then for each metric in the group, you’ll have the operator, warning and critical thresholds.  These are the most important.  If you don’t provide a value, alerts will not be triggered as there will be no threshold violations.  The next column displays whether a corrective action job has been registered on this metric, followed by the collection schedule and the Edit icon.



Clicking on the link in Collection Schedule will bring you to the collection settings.  You can enable or disable a metric collection, change the frequency, and determine whether alert only or historical trending data will be saved.   If you select alert only, it will only store occurrences where thresholds are violated.  Pay careful attention to the Affected Metrics section, as some metrics are collected in a group, and modifying these settings will affect all metrics in that group.


Returning to the main screen, click on the pencil icon to edit the metric.


This first section is where you can add a Corrective Action job if you want to automatically fix your alerts.  An example would be kicking off an RMAN archive log backup job when the Archive Area Used % event is triggered.


In the Advanced Threshold section, you can determine how many times a threshold must be exceeded in a row to trigger an alert.  So if you want to alert if CPU is 95% for over 3 collections (15 minutes), then you would set Number of Occurrences to 3.
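The Number of Occurrences idea can be sketched in a few lines of Python.  This is just a model of the logic described above, assuming a "greater than or equal" operator and a made-up function name.  The agent's real evaluation is not shown in this post.

```python
def first_alert_index(samples, threshold, occurrences):
    """Return the index of the collection where the alert would first fire.

    The alert fires once the value has met the threshold for `occurrences`
    collections in a row.  Returns None if that never happens.
    Assumption: the metric operator is >= (other operators work the same way).
    """
    streak = 0
    for index, value in enumerate(samples):
        # Any collection below the threshold resets the count.
        streak = streak + 1 if value >= threshold else 0
        if streak == occurrences:
            return index
    return None


# Hypothetical CPU % values, one per 5-minute collection.
cpu = [96, 97, 80, 95, 98, 99]
print(first_alert_index(cpu, threshold=95, occurrences=3))
```

In this sample the first two high readings do not alert, because the third collection drops to 80 and resets the count.  The alert only fires on the last collection, after three high readings in a row.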


Template override allows an administrator to prevent a particular metric from being changed when templates are applied.  You want to avoid this as a common practice and reserve for special exceptions.




The Threshold Suggestion section allows you to evaluate what warning and critical severity alerts would be generated if you changed thresholds.  You can look at the last month of collected metrics to make the best threshold estimates.

If your metric has multiple keys, you will have an additional screen where you can add additional keys.  A key would be a filesystem, or a tablespace that you want to monitor with different thresholds than the rest.
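One way to picture key-specific thresholds is a lookup with a fallback.  The sketch below is an illustration only: the function names, the ">=" operator and every threshold value are assumptions, not something taken from Enterprise Manager.

```python
def thresholds_for(key, overrides, default):
    """Return (warning, critical) for a key, falling back to the default."""
    return overrides.get(key, default)


def severity(value, warning, critical):
    """Classify a value, assuming a >= operator.  None means no threshold."""
    if critical is not None and value >= critical:
        return "critical"
    if warning is not None and value >= warning:
        return "warning"
    return "clear"


# Hypothetical thresholds: SYSTEM gets its own, everything else the default.
default = (85, 97)
overrides = {"SYSTEM": (90, 98)}

for tablespace, used_pct in [("USERS", 88), ("SYSTEM", 88)]:
    warning, critical = thresholds_for(tablespace, overrides, default)
    print(tablespace, severity(used_pct, warning, critical))
```

The same 88% reading is a warning for USERS but clear for SYSTEM, because SYSTEM has its own key entry.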


When you’re finished making changes, click Continue and OK to save metric changes to the repository and push them out to the Agent.  Once you get a target set up for monitoring the way you want, you can create a template to push the same settings to all like targets.  I’ll cover this in another post soon!

Getting to Know Your Target with All Metrics View

Every target in Enterprise Manager has a set of target related metrics.  These metrics control what is collected, how frequently, and whether alerts and notifications are sent.  They are defined by target metadata and are specific to a particular target type.  The metric is collected by the Agent at regular intervals, and then batch uploaded to the EM repository.  Exploring these collected metrics can provide you with a wealth of information about your target.

From the target, click the target menu / Monitoring / All Metrics.


In this view you will get all possible metrics for this target.   You’ll also see a list of the Open Metric Events (a metric that has crossed a threshold), and the top 5 events over the last 7 days.


If you click on a metric category on the left, you’ll get the real-time values of those metrics.   The Last Upload is telling you when these metrics were last collected and uploaded to the repository.


To see those values, expand the category by clicking on the expand icon and selecting a specific metric, in this example Tablespace Space Used %.


This view is now showing you the last collection, by tablespace with average, low, high and last known values.   You will see the severity is clear for all tablespaces at this time.  If you have an open event, you may see a warning or critical icon here.    When you select an individual tablespace, a chart will appear in the lower half of the screen.


In this lower section, you can do a variety of actions.  At the top you’ll see a summary of the metric data, as well as the option to Modify Thresholds.  Thresholds saved will be sent out to the agent for changes.


If you want to see the metrics in table view to see the exact values and timestamps over the last several days, click the Table View link.
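If you ever want to double check the average, low, high and last known values shown per tablespace, they are easy to recompute from the raw values.  Here is a minimal Python sketch, assuming you already have the samples as (key, value) pairs in collection order.  The function name and the sample numbers are made up.

```python
from statistics import mean


def summarize(samples):
    """Group (key, value) samples by key and compute summary values.

    samples must be in collection order so that "last" is the newest value.
    """
    by_key = {}
    for key, value in samples:
        by_key.setdefault(key, []).append(value)
    return {
        key: {
            "average": mean(values),
            "low": min(values),
            "high": max(values),
            "last": values[-1],
        }
        for key, values in by_key.items()
    }


# Hypothetical Tablespace Space Used % samples.
samples = [("USERS", 40.0), ("SYSTEM", 70.0), ("USERS", 50.0), ("USERS", 45.0)]
for tablespace, stats in summarize(samples).items():
    print(tablespace, stats)
```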


Under Options, you can also export this metric data to a CSV file.   Or maybe you want to see related metrics or problem analysis to identify what might have caused an issue with this metric.


When viewing Related Metrics, the predefined related metrics will be displayed, but you can add your own from any targets.


Additionally, you can compare to other keys, which would be other tablespaces in this example.  Or you can compare to other targets, say if you wanted to compare CPU utilization on 2 hosts.


By default, the data is shown for a 24 hour period.  Options to view 7 days, 31 days, and custom time periods are also available.


There’s a wealth of information collected and stored, and the best place to start looking at it is in the All Metrics view.  This can help you identify collection category, additional metrics you might be interested in, and patterns and trends on alerts.


Getting to Know EMDIAG: repvfy execute optimize

In my group, we work with a lot of customers with very large EM environments, in the range of 2000+ agents.  So as you can imagine there’s a little bit of optimizing that needs to get done to account for these numbers.

A few of these standard tweaks have been put into the repvfy execute optimize command.    You can make all these changes individually, but if you want to get them all done at once, optimize is your tool.

There are 3 categories of optimization handled at this point:  Internal Tasks, Repository Settings and Target system.  The script will first evaluate the size of your repository based on the number of agents, and from there determine what optimizations need to be done or recommended for future implementation.

Internal Task Tuning

Enterprise Manager uses short and long workers, depending on the task activity.  We typically recommend 2 workers of each type for most larger systems, so in repvfy execute optimize this is what gets set.  Smaller systems are usually fine with the default settings of 1 each.  You can view the configuration in EM on the Manage Cloud Control -> Repository page.  Here you can also configure the short workers, but not the long.  If you see a high collection backlog, this is an indication that you’re in need of additional task workers.


The next step is to evaluate the current settings of the job system and ensure that there are enough connections available for the job system.  This change is not implemented automatically, but is printed out for you to change with emctl, as it will require a restart to take effect.   Recommendations for Large Job System Load can be found in the Sizing chapter of Advanced Installation Guide.  Increasing the number of connections may require an increase in database processes value.

Repository Settings Tuning

EM tracks system errors in one of its tables.  In larger systems, the MGMT_SYSTEM_ERROR_LOG table can become quite large over the 31 day default retention.  The optimize script reduces log retention to 7 days for normal operation.

There are also various levels of tracing enabled by default.  This can generate a lot of extra activity during normal operations if you’re not utilizing the traces.  Tracing is turned off by the optimize command.  It can be enabled at any time by using the repvfy send start_trace -name <name> and repvfy send start_repotrace commands.

Finally this step looks for any invalid SYSMAN objects and validates them, then checks for stale optimizer statistics and makes a recommendation as needed.

System Tuning

After an EM outage or downtime, all the agents will attempt to upload and update their status (or heartbeat) with the OMS.  There’s a grace period in which no alerts are sent.  In larger systems, this grace period may not be long enough to get all agents updated before alerts start going out.   This can be adjusted by increasing that grace period.

In and higher, you can also increase the number of threads that perform the ping heartbeat tasks.  This should be done if you have more than 2000 agents per OMS.  The optimize command will make this calculation for you and recommend the appropriate emctl command to set the heartbeatPingRecorderThreads property.  Recommendations for Large Number of Agents can be found in the Sizing chapter of Advanced Installation Guide.

The optimize command will only output those items that require attention, so not every item will appear in the output on every site.
The recommended values reported in the output are specific for THAT environment  and should not be copied over to another environment just like that.  To tune another EM environment, run the optimize script on that environment.

Sample output from a small EM system:

bash-4.1$ ./repvfy execute optimize

Please enter the SYSMAN password:
SQL*Plus: Release – Production on Thu Jul 9 07:59:35 2015

Copyright (c) 1982, 2008, Oracle. All rights reserved.

SQL> Connected.

Session altered.
Session altered.

========== ========== ========== ========== ========== ========== ==========
== Internal task system tuning ==
========== ========== ========== ========== ========== ========== ==========

– Setting the number of short workers to 2 (1->2)
– Setting the number of long workers to 2 (1->2)
========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Job system tuning ==
========== ========== ========== ========== ========== ========== ==========

– On each OMS, run this command:
  $ emctl set property -name oracle.sysman.core.conn.maxConnForJobWorkers -value 72 -module emoms
  This change will require a bounce of the OMS

========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Repository tuning ==
========== ========== ========== ========== ========== ========== ==========
– Setting retention for MGMT_SYSTEM_ERROR_LOG table to 7 days (31->7)

– Disabling PL/SQL tracing for module (EM.GDS)
– Disabling PL/SQL tracing for module (EM_DBM)

– Disabling repository metric tracing for ID (1234)

– Recompiling invalid object (foo,TRIGGER)
– Recompiling invalid object (bar,CONSTRAINT)

– Stale CBO statistics in the repository. Gather statistics for the SYSMAN schema
  Command to use:
  $ repvfy send gather_stats
  SQL> exec emd_maintenance.gather_sysman_stats_job(p_gather_all=>'YES');

========== ========== ========== ========== ========== ========== ==========
========== ========== ========== ========== ========== ========== ==========
== Target system tuning ==
========== ========== ========== ========== ========== ========== ==========

– Setting the PING grace period to (90) (60->90)

– Set the parameter to 3
  $ emctl set property -module emoms -name -value 3

========== ========== ========== ========== ========== ========== ==========
not spooling currently

Changing OMS or Agent Properties in OEM Console

There was a thread going on earlier this week about how to change a value for an OMS property setting.   This is typically done when working with support to adjust a timing or enable debug or tracing.

The most common way is using emctl set property at command level.

$ emctl set property -name oracle.sysman.eml.maxInactiveTime -value 60 

To get the current setting there’s an emctl get property command:

$ emctl get property -name oracle.sysman.eml.maxInactiveTime
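If you script these, one option is to build the command lines in one place so the property name only gets typed once.  The Python sketch below only assembles the argument lists, using the -name, -value and -module options shown in this blog.  It does not run emctl, and the helper names are my own.

```python
import shlex


def get_property_cmd(name):
    """Build the argument list for: emctl get property -name <name>."""
    return ["emctl", "get", "property", "-name", name]


def set_property_cmd(name, value, module=None):
    """Build the argument list for: emctl set property -name <name> -value <value>.

    module is optional and adds -module <module> when given.
    """
    cmd = ["emctl", "set", "property", "-name", name, "-value", str(value)]
    if module:
        cmd += ["-module", module]
    return cmd


prop = "oracle.sysman.eml.maxInactiveTime"
print(shlex.join(get_property_cmd(prop)))
print(shlex.join(set_property_cmd(prop, 60)))
```

To actually execute a command you could hand the list to subprocess.run on the OMS host, but check the output by eye first.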

That usually works great if support has just given you the exact command to run, or you’re using a MOS note to reference the exact syntax.   However, if you’re getting older like me, and syntax is just one of those things that you tend to put in the way back corners of your brain… you tend to forget was it oracle.sysman.eml.maxInactiveTime or was it oracle.em.sysman.maxTimeout or…   There’s just too many to remember.   Lucky for us, there is now a place to view and set these properties in the EM  console.   You can find it under Setup / Manage Cloud Control / Management Services.


Then under the Management Servers menu select Configuration properties.


From here you’ll get a window that lists the non-default properties.  Understand that some properties will not show up.  That doesn’t mean they are not set, just that they have the system default value.



By switching the Show view to All, you’ll see a larger list of properties.  Not all of them are modifiable, as indicated by the lock icon.   If you want to view more information about a property, or modify it, click on the Name.    This will bring up a new window, with the ability to modify the value and save.  This view will also tell you whether the property is Dynamic (can be changed without OMS restart).   If you expand the Change History, you can also view the previous changes for this parameter.


Of course, not all properties are OMS based, so there’s an equivalent option on the Agent side.  From the Agent home page, click on Agent menu then select Properties.


You will get a list of properties, some of which you can edit, some you can’t.  By default the basic properties are shown.


Select Advanced Properties to see additional agent properties such as dynamicPropsComputeTimeout which is often adjusted on very large servers.


This is great right?  But what if you want to change the same property on 1000 agents?   Well, that’s in here too!  Click on Setup / Manage Cloud Control / Agents.


From there, click on the agents (or just one for now) and click Properties.

This will start a job wizard in which you can add additional agents by clicking Add in the Targets section.  Then click on the Parameters tab.


Now you can set the parameter value that you wish to push out to all selected agents.


The caveat: use with caution and common sense.  There’s a lot of parameters in here, and very few are documented.  Some should not be changed unless directed by Oracle Support.  So don’t go cowboy on us and start tweaking them all just to see what they do!

Get the Most from OpenWorld… Or any conference.

Before working for Oracle on the EM team, I was a DBA. I got the chance to go to OpenWorld twice in that role. My first time I was a newbie. I spent a lot of time going to sessions and trying to learn. I walked the demo grounds and came home with lots of junk, but didn’t really get involved or talk to the technical people there, or anybody really!

Having worked OOW for the last 4 years, I have a different perspective. I still enjoy going to sessions (when I can) and learning about new features or products that I don’t have much experience in. There’s so many sessions you could stay busy all day long!  While some folks complain about sessions not being technical enough or deep enough, I disagree.  I find the quality of sessions to be pretty equivalent to other conferences I’ve been to.  I think you have to take a look at the abstract and speaker, and make a decision based on that.  Not all Oracle people are sales.  You also have to realize it’s about 40 min +/- with time for questions.  The idea is to introduce a new concept or idea, and give you the background to research further and implement.  There’s very few sessions that are going to be a step-by-step guide to implement a feature in 40 minutes.

If you feel the sessions are too general, and you haven’t signed up for one of the Hands-On-Labs, you’re missing out.  The HOL are designed to showcase common use cases and allow you to walk through a feature, such as DBaaS or Middleware Diagnostics.  The rooms are usually limited to 40 or 50 participants, and again, there are really bright people there ready and willing to help you out and answer questions.

One thing I feel most people don’t take advantage of is the tremendous number of technical resources that are standing around on the demo grounds!  Yea, Oracle booths are boring because we don’t give away blinkie martini glasses, or iPads or anything. We expect you to just want to talk tech.  I know first-hand that some of the brightest, technical minds are standing at those Oracle booths waiting to talk to customers.  Maybe you want to see the demo on the feature they’re talking about, or maybe you want to talk about what your company is doing or needs to do.  The product managers and developers are there, all day long… for you – the customer.  Now don’t come up with your list of SRs that you’re stuck on, but think about the use cases that you’re stuck on, or that one thing that the product is missing to make it complete in your world.  Reach out, meet the people who create and code the products you’re using.  Get involved, introduce yourself.  If you follow Oracle people on Twitter or LinkedIn, or follow their blogs, say hi and thanks.   Then you can move on and find the free FitBit.

Also, get involved with a Users Group or SIG while you’re there. Most SIGs will have a meeting or event. This is a great way to network with fellow technologists and users, share your ideas and just be in the presence of smart people.  Attend the IOUG sessions on SIG Sunday for more great speakers and sessions.

Visit the Oracle Support Stars bar!  You know there’s actual people on the other end of that SR you’re working?   You’ve probably dealt with the same people over and over again, go say hi.   Talk about how you can help your issues move faster, what tools do they recommend, how to use My Oracle Support better…

Oracle University also offers full day events on Sunday, maybe you want to prepare for your DB 12c Certification, or you want to learn about EM High Availability, these options are additional cost, but you can register for a class while registering for Oracle OpenWorld.

Oracle OpenWorld is more than just presentations and marketing and giveaways and parties.  Take advantage of being in the same spot as some of the brightest technical minds in the industry, and get as much as you can from it!  Oh, and wear really comfortable shoes.  You’ll be walking a lot!

Getting to Know EMDIAG – repvfy show score_card

Continuing on with a series of EMDIAG commands that I find useful, today we’re going to look at the repvfy show score_card.   You may also want to review the previous post on repvfy diag all.

One of the main functions of repvfy is the verify function (i.e. repvfy verify -details -level 9).  This goes through a long list of checks to identify areas you might need to investigate.  This could be anything from Agents with Clock Skew to Expired User accounts.

Now there’s a scorecard (show score_card) that reads the output of the verify log, summarizes the category of errors that were found, and generates a score based on the weight, number of tests and number of violations.  This can help you in tracking improvements as you work through any cleanup and issues.

repvfy show score_card (also try score_card_details)

Category                  Score  #Tests #Viol
------------------------- ------ ------ ------
Best Practice              81.59      6    780
Configuration              21.64     20   2298
Data Integrity             25.09     29   2793
Monitoring/Operations      60.71     23    886
------------------------- ------ ------ ------

The output of show score_card_details will give the breakdown by category (Best Practice, Configuration, Data Integrity, Monitoring/Operations) and Module (Targets, Agents, etc) along with the test to run (i.e. repvfy verify targets -test 6002).

Note: The sample output below has been modified and shortened to fit the blog.

$ repvfy show score_card_details

Category Rank Module  ID   Test                                          Score  #Viol
BP       1    TARGETS 6002 OMS mediated targets without backup Agent      10.0    639
BP       2    AGENTS  6006 Deployed Agent plugins lower than OMS plugin    6.6    106
BP       3    REPO    6039 Newer version avail for deployed OMS plugin     1.1      9
BP       4    REPO    6005 Tables with locked statistics                   0.3     23
BP       5    JOBS    6006 Job steps running more than two hours           0.3      2

Using repvfy verify -details will help you identify targets to investigate.  Some tests have automated fixes that can be run with repvfy verify <module> -test <test#> -fix.  Other issues may have manual steps or suggestions, and some may require opening an SR with Oracle Support.  After fixing issues, rerun the full repvfy verify -level 9 and regenerate the score card to track your progress!
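If you save the score card text after each run, tracking progress can be as simple as comparing violation counts.  Here is a rough Python sketch that parses the summary layout shown in this post (category, score, tests, violations) and diffs two runs.  It assumes that layout holds, the function names are mine, and the "after" numbers are invented for the example.  It deliberately compares violations only and makes no claim about how the score itself is calculated.

```python
import re

# One data row: category text, then score, #tests and #violations.
ROW = re.compile(
    r"^(?P<category>.+?)\s+(?P<score>\d+(?:\.\d+)?)\s+(?P<tests>\d+)\s+(?P<viol>\d+)$"
)


def parse_score_card(text):
    """Return {category: (score, tests, violations)} from score card text.

    Header and separator lines are skipped because they do not match ROW.
    """
    rows = {}
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            rows[match["category"]] = (
                float(match["score"]),
                int(match["tests"]),
                int(match["viol"]),
            )
    return rows


def violation_delta(before, after):
    """Return {category: change in violations} for categories in both runs."""
    return {cat: after[cat][2] - before[cat][2] for cat in before if cat in after}


before = parse_score_card("""
Category                  Score  #Tests #Viol
Best Practice              81.59      6    780
Configuration              21.64     20   2298
""")
# Hypothetical second run after some cleanup.
after = parse_score_card("""
Category                  Score  #Tests #Viol
Best Practice              81.59      6    700
Configuration              21.64     20   2298
""")
print(violation_delta(before, after))
```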

Getting to Know EMDIAG – repvfy diag all

If you’ve worked with me, or called me about a problem with your Enterprise Manager, or even attended any of my sessions, you’ve probably heard me talk about EMDIAG.   One of the most popular components of EMDIAG is the Repvfy tool. This is basically a series of scripts and queries that will provide data from the repository to help diagnose configuration and data issues.  You can get more details on downloading and installing in EMDIAG Troubleshooting Kits Master Index (Doc ID 421053.1).

There are 3 components that make up EMDIAG:  repvfy, omsvfy and agtvfy.   Today, one of the features I am introducing you to is in repvfy.  This is the component that pulls data from the EM repository.

repvfy diag all  

This is my go-to these days.  Instead of telling the customer I need X, Y, Z and A, B, C, I get this.  The diag all runs through various EMDIAG reports that are frequently used in troubleshooting issues with support and development.  It runs the different reports and zips them into a file that can then be uploaded easily.  There’s also a shorter version, repvfy diag core.

adding: advisor_day_2015_07_06_084304.log (deflated 81%)
adding: advisors_2015_07_06_084304.log (deflated 83%)
adding: agent_health_2015_07_02_120925.log (deflated 83%)
adding: analyze.log (deflated 83%)
adding: backlog_2015_07_06_084304.log (deflated 84%)
adding: body1.log (stored 0%)
adding: body2.log (stored 0%)
adding: body3.log (stored 0%)
adding: cursor_2015_07_06_084304.log (deflated 84%)
adding: custom_2015_07_06_084304.log (deflated 86%)
adding: deinstall.log (deflated 79%)
adding: details_2015_07_02_082451.log (deflated 82%)
adding: details_2015_07_02_082451.sql (deflated 68%)
adding: details_2015_07_06_084304.log (deflated 87%)
adding: details_2015_07_06_084304.sql (deflated 74%)
adding: errors_2015_07_06_084304.log (deflated 83%)
adding: install.log (deflated 80%)
adding: job_health_2015_07_06_084304.log (deflated 83%)
adding: loader_health_2015_07_06_084304.log (deflated 91%)
adding: metric_stats_2015_07_06_084304.log (deflated 92%)
adding: mtm_2015_07_06_084304.log (deflated 89%)
adding: notif_health_2015_07_06_084304.log (deflated 85%)
adding: performance_2015_07_06_084304.log (deflated 88%)
adding: ping_health_2015_07_06_084304.log (deflated 76%)
adding: pkg.log (deflated 62%)
adding: space_2015_07_06_084304.log (deflated 91%)
adding: system_2015_07_06_084304.log (deflated 86%)
adding: task_health_2015_07_06_084304.log (deflated 83%)
adding: upgrade2.log (stored 0%)
adding: verify.log (deflated 89%)
adding: verify_2015_07_02_082451.log (deflated 49%)
adding: verify_2015_07_06_084304.log (deflated 45%)
adding: views.log (deflated 82%)

File created: /u01/oracle/em12r5/oms/emdiag/tmp/

So just what does it gather information about?   Here’s a one line summary of each report:

advisors  – ADDM, ASH and AWR reports from the repository database
agent_health – summary of deployed agents, plugins and targets as well as availability and ping statistics
backlog – statistics from dbms_scheduler, loader subsystem, job subsystem, notification subsystem and the task/worker subsystem
cursor – cursor parameters and statistics for EM SQL
custom – summary of EM customizations done
errors – targets, agents, plugins, metrics, collections, jobs in error
job_health – summary of job subsystem configuration, statistics and performance
loader_health – summary of loader subsystem configuration, statistics and performance
metric_stats – performance summary of repository, loader subsystem, purge policies and metrics including top targets and metrics
mtm – summary of Repository and OMS configuration, housekeeping jobs, agent and plugin deployments
notif_health – summary of notification subsystem configuration, statistics and performance
performance – performance summary of repository, OMS, agents and internal subsystems.
ping_health – summary of agent ping jobs and communication
space – summary of schema statistics collections, table/index sizes and fragmentation
system – full configuration summary
task_health – summary of task subsystem configuration, statistics and performance
verify/details – the standard verification checks with detailed output

So depending on the issue you’re seeing, I will typically look at various reports.    If you have problems with notifications, I’m obviously going to go through the notif_health and probably the backlog and job_health reports.   If I’m just trying to get a good understanding of how your system is built, what targets you manage and what you’re doing with them, I’d start with the system and custom reports.
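Since the zip can hold several runs of the same report, a little script can pick the newest copy of each one and point you at the reports to read first.  This Python sketch assumes the file naming seen in the listing above (report name, then a date and time stamp, then .log).  The function name and the two symptom labels are made up; the report lists under them come from the paragraph above.

```python
import re

# e.g. notif_health_2015_07_06_084304.log -> report "notif_health"
STAMPED = re.compile(r"^(?P<report>[a-z_]+?)_(?P<stamp>\d{4}_\d{2}_\d{2}_\d{6})\.log$")

# Where to start reading, per the paragraph above.  The keys are my own labels.
START_WITH = {
    "notifications": ["notif_health", "backlog", "job_health"],
    "overview": ["system", "custom"],
}


def latest_reports(names):
    """Return {report: newest file name} for time-stamped .log files.

    The stamp is fixed width (YYYY_MM_DD_HHMMSS), so comparing the strings
    is the same as comparing the times.  Unstamped files are ignored.
    """
    latest = {}
    for name in names:
        match = STAMPED.match(name)
        if not match:
            continue
        report, stamp = match["report"], match["stamp"]
        if report not in latest or stamp > latest[report][0]:
            latest[report] = (stamp, name)
    return {report: name for report, (stamp, name) in latest.items()}


names = [
    "details_2015_07_02_082451.log",
    "details_2015_07_06_084304.log",
    "notif_health_2015_07_06_084304.log",
    "backlog_2015_07_06_084304.log",
    "job_health_2015_07_06_084304.log",
    "verify.log",
]
newest = latest_reports(names)
for report in START_WITH["notifications"]:
    print(report, "->", newest.get(report))
```

To feed it a real file list, zipfile.ZipFile(path).namelist() from the standard library returns the names inside the zip.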

In future posts, we’ll break down some of these reports in detail, but that’s it for today!