Friday, June 15, 2018

Azure Monitor - Alerting Gets an Upgrade

Earlier this week, Microsoft announced some upgrades to the alerts experience inside Azure Monitor and if you've ever worked with SCOM, then a few of these changes will have a pretty familiar look about them.

New Alert Enumeration Experience
There's a new Alert Enumeration feature which delivers a centralized view of all the alerts that have occurred across your various Azure deployments. You can query alerts across multiple subscriptions and sort them based on severity, signal types, resource type, and even resolution state. The enhanced alert enumeration feature is a serious upgrade on the previous Azure Monitor Alerts experience shown in the following image...

To upgrade to the new feature, click the purple banner at the top of the old Monitor - Alerts view and you will be presented with the following new enhanced user interface...

When you've upgraded, the first thing you will notice (assuming you've already got a few alerts present across your subscriptions), is that Azure Monitor has gathered all of your alerts into a central view and sorted them by Severity.

Now, if you've used SCOM Alert Rules in the past, you'll be familiar with Microsoft's method of defining severity levels using integers (where Critical = 2, Warning = 1 and Informational = 0). In Azure Monitor, Microsoft use a similar mapping process; however, the lowest-numbered severity is the most important (which is the opposite of SCOM). You can read more about the exact Azure Monitor Alert Severity Mappings in my previous blog post here.

Clicking on any of the Severity links will then pivot you into the All Alerts page with a filter that's scoped to that particular severity.

Additional filters can then be applied to scope the view even further with options such as subscriptions, resource groups, time range and conditions to choose from.
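To make the pivot-and-filter idea a bit more concrete, here's a minimal Python sketch that filters a list of alerts and sorts them so the most severe come first. It doesn't call Azure at all - the Alert class, the filter_alerts function and the sample subscription and resource group names are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    # Hypothetical record; the field names are assumptions, not Azure's schema.
    name: str
    subscription: str
    resource_group: str
    severity: int  # 0 (most severe) to 4 (least severe)
    state: str     # "New", "Acknowledged" or "Closed"


def filter_alerts(alerts, *, severity=None, subscriptions=None, state=None):
    """Return the alerts matching every supplied filter, most severe first."""
    selected = [
        a for a in alerts
        if (severity is None or a.severity == severity)
        and (subscriptions is None or a.subscription in subscriptions)
        and (state is None or a.state == state)
    ]
    # Lower severity number means more important, so an ascending sort works.
    return sorted(selected, key=lambda a: (a.severity, a.name))


ALERTS = [
    Alert("High CPU", "sub-prod", "rg-web", 1, "New"),
    Alert("Disk full", "sub-prod", "rg-data", 0, "New"),
    Alert("Heartbeat lost", "sub-dev", "rg-web", 0, "Acknowledged"),
    Alert("Low memory", "sub-dev", "rg-web", 2, "Closed"),
]

if __name__ == "__main__":
    for alert in filter_alerts(ALERTS, severity=0):
        print(alert.name, alert.subscription, alert.state)
```

Leaving a filter as None means "don't filter on this field", which is roughly how the portal behaves when you leave a drop-down on its default selection.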

Alert State Management

The next addition to Azure Monitor alerting is the new Alert State Management feature. Alert states are very similar to SCOM Alert Resolution States and in Azure Monitor, three alert resolution states are currently supported - New, Acknowledged and Closed.

You can manage the alert resolution state by drilling into an alert in the All Alerts view and clicking the Change Alert State button shown in the following image...

From there, you can use the drop-down menu to change the alert resolution state from New to either Acknowledged or Closed as shown here...

After that, you have the option to add a comment as to why you're changing the resolution state before then returning to the All Alerts view - where you should see the new Alert Resolution State assigned to your alert.

If you need to bulk-edit the resolution state of a number of alerts, then Microsoft have made this easy for you too. All you need to do is select each of the alerts that you need to modify, then hit the Change State button as shown in the following image...

Then modify your resolution state, add your comment and hit OK to return to the All Alerts view. Alert resolution states should now be easy to identify for all alerts that you've modified.

Something to keep in mind when working with these new Alert States is that they are completely separate from the Monitoring Condition, which supports two values - Fired and Resolved. The Monitoring Condition indicates whether or not the condition that created a metric alert has subsequently been resolved.

To define the Monitoring Condition, the metric alert rules sample a particular metric at regular intervals and if the criteria in the alert rule are met, then a new alert is created with a condition of Fired. When the metric is sampled again and the criteria are still met, nothing happens. However, if the criteria are no longer met, then the condition of the alert is changed to Resolved. The next time that the criteria are met, a new alert is created with a condition of Fired.
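The sampling behaviour described above can be sketched as a small Python loop. This is only my reading of the behaviour, not Microsoft's implementation - the evaluate_samples function, the simple "greater than threshold" criteria and the sample values are all assumptions for illustration.

```python
def evaluate_samples(samples, threshold):
    """Walk through metric samples and return the alerts they would produce.

    Assumption: the alert rule criteria is simply "value > threshold".
    """
    alerts = []
    active = None  # the alert currently in a Fired condition, if any
    for index, value in enumerate(samples):
        criteria_met = value > threshold
        if criteria_met and active is None:
            # Criteria newly met: create a new alert with a condition of Fired.
            active = {"fired_at": index, "resolved_at": None, "condition": "Fired"}
            alerts.append(active)
        elif not criteria_met and active is not None:
            # Criteria no longer met: flip the existing alert to Resolved.
            active["condition"] = "Resolved"
            active["resolved_at"] = index
            active = None
        # Otherwise nothing changes (still met, or still not met).
    return alerts


if __name__ == "__main__":
    cpu_samples = [50, 95, 97, 60, 40, 99]
    for alert in evaluate_samples(cpu_samples, threshold=90):
        print(alert)
```

With those sample values, the first alert fires on the second sample and resolves on the fourth, then a brand new alert fires on the last sample rather than the old one being re-opened.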

Putting my SCOM hat back on again, the Monitoring Condition is a similar process to how SCOM Alert Monitors fire when a specific threshold is breached and then auto-close when that threshold is no longer breached.

One gotcha that might catch people out, however, is that even though the system may set the Monitoring Condition to Resolved, the alert state isn't changed until the user changes it manually and vice-versa. For example, if I modify the alert resolution state for a number of alerts and set it to Closed, the Monitoring Condition will still show those alerts in a Fired state. The following image shows this exact scenario - where I've set the resolution state of a couple of my alerts to Closed, but as the metric condition that fired the alert in the first place is still present, the alerts are still displaying a Monitoring Condition of Fired.
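One way to picture this gotcha is as two separate fields on the same alert that never update each other. The AlertRecord class below is a hypothetical model I've put together for illustration (it isn't an Azure SDK type), with the user owning one field and the platform owning the other.

```python
from dataclasses import dataclass, field

ALERT_STATES = ("New", "Acknowledged", "Closed")
MONITORING_CONDITIONS = ("Fired", "Resolved")


@dataclass
class AlertRecord:
    # Hypothetical model: two independent fields on one alert.
    name: str
    alert_state: str = "New"             # changed by the user
    monitoring_condition: str = "Fired"  # changed by the platform
    comments: list = field(default_factory=list)

    def change_state(self, new_state, comment=""):
        """User action: only touches alert_state."""
        if new_state not in ALERT_STATES:
            raise ValueError(f"Unknown alert state: {new_state}")
        self.alert_state = new_state
        if comment:
            self.comments.append(comment)

    def set_condition(self, condition):
        """Platform action: only touches monitoring_condition."""
        if condition not in MONITORING_CONDITIONS:
            raise ValueError(f"Unknown monitoring condition: {condition}")
        self.monitoring_condition = condition


if __name__ == "__main__":
    alert = AlertRecord("High CPU on VM01")
    alert.change_state("Closed", "Known issue, patch scheduled")
    # Prints "Closed Fired" - closing the alert did not resolve the condition.
    print(alert.alert_state, alert.monitoring_condition)
```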

Smart Groups

The final new alerting feature that I wanted to post about is Smart Groups. These contain alerts that were automatically grouped together based on either similarity, historical patterns or a combination of both. Smart Groups are automatically created using machine learning algorithms looking for similarity and co-occurrence patterns among alerts originating from a monitor service such as Log Analytics or across the rest of the Azure platform.

There are a couple of ways that you can view/access Smart Groups. The first method is to simply click the Smart Groups button from the All Alerts view in the new Alert Enumeration feature shown here...

The second method is to open the All Alerts view then click the blue banner as shown in this image...

Using Smart Groups, you can significantly reduce the number of alerts to analyze by focusing on only a handful of groups with some handy alert correlation in place.

As an example, if a performance counter such as CPU or RAM spikes on multiple virtual machines in your Azure subscription at the same time, this will generate a lot of alerts in Azure Monitor. When you click the Smart Groups feature, those alerts will get automatically grouped into a single Smart Group - offering up a much clearer picture of a common root cause.
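To illustrate the general idea of alert correlation, here's a deliberately simplified Python sketch. To be clear, this is not how Smart Groups work under the hood - Microsoft use machine learning for that. I've swapped in a plain rule that groups alerts which share a signal name and fire within the same few minutes, and the function name, field names and signal names are all made up.

```python
from collections import defaultdict


def group_by_signal_and_window(alerts, window_minutes=5):
    """Group alerts that share a signal and fall in the same time bucket.

    A rule-based stand-in for illustration only, not the Smart Groups algorithm.
    """
    groups = defaultdict(list)
    for alert in alerts:
        bucket = alert["minute"] // window_minutes
        groups[(alert["signal"], bucket)].append(alert["resource"])
    return dict(groups)


ALERTS = [
    {"resource": "vm01", "signal": "Percentage CPU", "minute": 0},
    {"resource": "vm02", "signal": "Percentage CPU", "minute": 1},
    {"resource": "vm03", "signal": "Percentage CPU", "minute": 4},
    {"resource": "vm04", "signal": "Available Memory", "minute": 12},
]

if __name__ == "__main__":
    for key, resources in group_by_signal_and_window(ALERTS).items():
        print(key, resources)
```

Even with a rule this crude, the three CPU alerts collapse into one group, which is the kind of noise reduction that Smart Groups are aiming for on a much smarter scale.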

In the following image, you can see a Smart Group that Azure Monitor has automatically created in my subscription where it has correlated 25 alerts together because they are very similar to other alerts that have fired. From here, I can change the alert resolution state of individual alerts or I can use the Change Smart Group State button to change the resolution state of all alerts contained in the group.

Microsoft kicked the tires with alert correlation in SCOM when they released the Exchange 2010 management pack a few years ago and although it was quite noisy, the event correlation engine it came with was a similar concept to what we now have with Smart Groups. I think this is a pretty handy feature to have in your Azure monitoring toolbox and along with all the other features that have just launched, things are looking good for the next generation of Microsoft monitoring!

Azure Monitor Alert Severity Mappings

When I first started using SCOM, one of the things that I had to quickly get my head around was how alerts that were generated by rules were defined with a Severity that mapped to an integer value (e.g. Critical = 2, Warning = 1, and Informational = 0).

With alerts in Azure Monitor, Microsoft have taken a similar approach where they have defined five alert severity levels - each one mapping to its own integer. These severity levels have been color-coded to help quickly identify alerts that should be treated as more important than others but for clarity, I've detailed the exact mappings as follows:

Azure Monitor Alert Severity Levels

Sev 0 = Critical
Sev 1 = Error
Sev 2 = Warning
Sev 3 = Informational
Sev 4 = Verbose

As you can see from the mappings above, in Azure, the lower the integer, the higher the severity - which is the opposite to alert rule severity mappings in SCOM. Hopefully this post will prove useful for any SCOM administrators who are dipping more into the Azure Monitor world over the coming year and might get slightly confused by the reverse numbering mapping between the two platforms.
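If you ever need to handle both numbering schemes in a script, a small lookup like the Python sketch below might help. The two dictionaries come straight from the mappings in this post; the scom_to_azure helper is my own assumption that simply matches severities by name (so SCOM Warning becomes Azure Sev 2), and it isn't an official conversion.

```python
AZURE_SEVERITY = {0: "Critical", 1: "Error", 2: "Warning", 3: "Informational", 4: "Verbose"}
SCOM_SEVERITY = {2: "Critical", 1: "Warning", 0: "Informational"}

# Reverse lookup so an Azure severity can be found by its label.
AZURE_BY_LABEL = {label: sev for sev, label in AZURE_SEVERITY.items()}


def scom_to_azure(scom_severity):
    """Map a SCOM severity integer to the Azure severity with the same name.

    Assumption: matching by label is good enough; there is no official mapping.
    """
    return AZURE_BY_LABEL[SCOM_SEVERITY[scom_severity]]


def most_severe_azure(severities):
    """In Azure Monitor the lowest integer is the most severe."""
    return min(severities)


def most_severe_scom(severities):
    """In SCOM the highest integer is the most severe."""
    return max(severities)


if __name__ == "__main__":
    for scom_sev, label in sorted(SCOM_SEVERITY.items(), reverse=True):
        print(f"SCOM {scom_sev} ({label}) -> Azure Sev {scom_to_azure(scom_sev)}")
```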

If you'd like to read more about some newly announced feature enhancements in Azure Monitor, then check out my recent post here.

Wednesday, June 13, 2018

The OMS Portal is Moving to Azure

Over the last couple of years, I've worked a lot with the awesome Microsoft Operations Management Suite (aka OMS) and at one of the presentations I attended during Microsoft Ignite last year, it was announced that they would soon be retiring the OMS Portal and integrating all of its functionality directly into the Azure Portal.

Earlier this week, Microsoft confirmed that the OMS Portal would indeed be retired and all its functionality moved into the Azure Portal. The idea behind this move is to deliver a more centralized experience for monitoring and managing your on-premises and Azure-based workloads.

As it stands, nearly all of the existing OMS solutions have been available within the Azure Portal for a number of months and the only solutions still waiting to be ported over are as follows:

If you're using any of these solutions, then you'll still need to manage them within the original OMS Portal and Microsoft have committed to moving these solutions over to Azure by August 2018. When this happens, Microsoft will then communicate an official timeline for 'sunsetting' the original OMS Portal.

When this happens, the old OMS Portal that looks something like this (depending on which solutions you have enabled)...

Will then look like something similar to this in the Azure Portal...

As you can see from the two images above, they're not too dissimilar and in the Azure Portal, we get the added management benefit of being able to quickly pivot directly into Azure Resources using the navigation menu on the left or by simply drilling down into one of the dashboard widgets.

At the time of writing and along with the five OMS solutions mentioned earlier, there are still a few additional gaps that Microsoft need to address. These gaps are as follows:

  • To access a Log Analytics resource in Azure, the user must be granted access through Azure role-based access.
  • Update schedules that were created with the OMS Portal may not be reflected in the scheduled update deployments or update job history of the Update Management dashboard in the Azure Portal. This gap is expected to be addressed by the end of June 2018.
  • The custom logs preview feature can only be enabled through the OMS Portal. By the end of June 2018, this will be automatically enabled for all workspaces.

You can read more about these gaps and the planned migration from the OMS portal to the Azure Portal in Microsoft's original post here.

They've also put together a useful FAQ post to help answer some common questions that you or your customers might have and you can access this post here.

All-in-all, I'm pretty happy with this move as I find that lately, I've been spending all of my time in the Azure Portal instead of the original OMS Portal. Having the additional management capabilities inside the Azure Portal definitely makes it a more seamless user experience and hopefully others will see the benefit of this too.

SCOM - New Community MP to Multi-Home Large Numbers of Agents

Microsoft's Kevin Holman has just released a very useful new community MP for SCOM that enables you to multi-home large numbers of agents in a phased and controlled time-frame. This is perfect for any large side-by-side migrations you might be planning from SCOM 2012 R2 to SCOM 2016 or the latest SCOM 180x release.

On earlier versions of SCOM, I've used the excellent 'Extended Agent Info Management Pack' from Jose Fehse and over the last year or so, I've been using Kevin Holman's 'SCOM Agent Management Pack' to meet the same requirement. Although both of these community MPs enable me to add or remove Management Group name references on agents (which essentially multi-homes the agent), it's still a manual task that needs to be kicked off from the console.

With Kevin's newest 'SCOM Multi-Home Management Pack', this process is made a lot easier through the use of a rule that runs periodically and which is targeted at eight pre-created SQL Query-based groups within the MP.

This means that in large environments (think thousands of agents), the management pack will query the SCOM database and then automatically distribute your agents across each of the pre-defined groups shown below.

The automatic assignment of agents to the different groups is configured by default to distribute in batches of 500 agents per group; however, you can modify this number by editing the group discovery prior to importing the MP into SCOM.
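To get a feel for how the batching works out, here's a rough Python sketch of splitting an agent list into eight groups of 500. The real MP does this with SQL query-based group discoveries, so treat the assign_to_groups function as a hypothetical stand-in, and note that how the MP handles agents beyond the eighth group is not something I've covered here - the sketch just returns them as leftovers.

```python
def assign_to_groups(agents, batch_size=500, group_count=8):
    """Split agents into fixed-size batches, one batch per group.

    Illustrative only: the MP itself populates its groups with SQL queries.
    Agents beyond group_count * batch_size are returned as leftovers.
    """
    groups = [
        agents[i * batch_size:(i + 1) * batch_size]
        for i in range(group_count)
    ]
    leftover = agents[group_count * batch_size:]
    return groups, leftover


if __name__ == "__main__":
    # Hypothetical agent names for a 1,200-agent environment.
    agents = [f"agent{n:04d}" for n in range(1200)]
    groups, leftover = assign_to_groups(agents)
    for number, members in enumerate(groups, start=1):
        print(f"Group {number}: {len(members)} agents")
    print(f"Unassigned: {len(leftover)} agents")
```

With 1,200 agents and the default batch size, the first two groups fill up with 500 agents each, the third takes the remaining 200 and the other five stay empty.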

Once the groups have been populated, the MP will then perform a check once a day to validate if the agents have been multi-homed and if any haven't, then it will update those agents using a random time window - thus ensuring your OpsDB doesn't get hammered with the dreaded Event ID 2115 data insertion errors.
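The random time window is what stops every agent from checking in at the same moment. As a rough illustration of the idea (and not the MP's actual script), the Python sketch below gives each agent a random start offset inside a window. The schedule_updates function and the one-hour default window are assumptions on my part.

```python
import random


def schedule_updates(agents, window_seconds=3600, seed=None):
    """Give each agent a random start offset (in seconds) within the window.

    Spreading the start times out avoids a burst of simultaneous writes.
    The window length here is an assumed value, not the MP's setting.
    """
    rng = random.Random(seed)
    return {agent: rng.uniform(0, window_seconds) for agent in agents}


if __name__ == "__main__":
    # Hypothetical agent names.
    offsets = schedule_updates(["agent01", "agent02", "agent03"])
    for agent, offset in sorted(offsets.items(), key=lambda item: item[1]):
        print(f"{agent} starts after {offset:.0f} seconds")
```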

To conclude, if you're planning any side-by-side migrations that contain large numbers of agents in the near future, then you'll definitely want to try out this MP to make your job easier and to ensure your OpsDB stays healthy.

You can get the full lowdown on the MP from Kevin Holman's blog here and you can download it directly from the TechNet Gallery here.