Overview
Data Collection Rules (DCRs) are configuration resources used by the Azure Monitor Agent (often referred to as AMA). They tell the agent what data to collect and where to send it.
Transformations in Azure Monitor allow us to modify or filter incoming data before it is sent to a Log Analytics workspace. Because the transformation runs in a cloud pipeline before the data reaches its destination (usually a Log Analytics workspace), it is also referred to as an "ingestion-time transformation".
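As a quick, purely illustrative sketch (this is not the transformation we build later), a transformation is just a KQL statement that runs against a virtual input table called source and can filter rows or reshape columns before they land in the workspace. For example, assuming a Syslog stream, the following would drop debug-level events and discard the ProcessID column:
source
| where SeverityLevel != 'debug'
| project-away ProcessID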
Transformations are defined within a data collection rule. We will look at the raw structure of a data collection rule shortly, but before we go there, let's take an architectural view of data collection rules and transformations.
The above architectural diagram gives us a glimpse into the various ways in which a transformation within a data collection rule can be used.
For this tutorial, we will focus on Scenario 6; the following is a closer look at it.
The Scenario
Managing a high volume of log ingestion into a Microsoft Sentinel workspace can be a challenging and daunting task.
By default, a Syslog data collection rule collects logs from all facilities, which can lead to abnormally high costs because the out-of-the-box Syslog table uses the Analytics tier. For example, in the East US region, Analytics-tier ingestion costs around 5.59 USD per GB while Basic-tier ingestion costs only 0.50 USD per GB.
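To put a rough number on this for your own environment, you can estimate last month's Syslog volume and cost from the _BilledSize column (the per-GB rates below are simply the East US list prices quoted above; treat the result as an approximation):
Syslog
| where TimeGenerated > ago(30d)
| summarize IngestedGB = sum(_BilledSize) / 1e9
| extend AnalyticsCostUSD = round(IngestedGB * 5.59, 2), BasicCostUSD = round(IngestedGB * 0.50, 2)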
In this case, we can identify facilities that do not provide much value from a security point of view.
Once these logs are identified, we are left with two options:
Drop these logs from the data collection rule
Send these specific logs into a basic/auxiliary table
Option 1 is generally not recommended: you lose these logs entirely, and if you later want to check past events, the data will not be in your SIEM. You would have to rummage through the files under /var/log/, provided they have not been rotated, overwritten, or deleted.
For this tutorial, we will focus on option 2.
Before we proceed, let us try to understand the different table plans offered by Azure Monitor.
The Analytics plan is the default plan for all tables associated with Microsoft Sentinel. It is suited for continuous monitoring, real-time detection, and performance analytics.
By default, Sentinel workspaces have an interactive retention period of 90 days. Analytics pay-as-you-go is priced at 5.59 USD per GB.
The Basic plan is suited for medium-touch data, troubleshooting, and incident response. It offers discounted ingestion and optimized single-table queries for 30 days by default, which can be extended if required. You can query archived data by using search jobs, and all KQL operators are supported with lookups to Analytics tables. Basic pay-as-you-go is priced at 0.50 USD per GB.
The Auxiliary plan is suited for low-touch data such as verbose logs and data required for auditing and compliance. It offers low-cost ingestion and unoptimized single-table queries for 30 days. Auxiliary pay-as-you-go is priced at 0.10 USD per GB.
Note: Even though the Auxiliary plan is cheaper than Basic, we do not recommend it here because the feature is currently in public preview and does not support transformations. As of today, the Auxiliary plan only supports ingestion via text files or the Logs Ingestion API, and it has limited region support.
For more information on limitations, please visit https://learn.microsoft.com/en-us/azure/azure-monitor/logs/create-custom-table-auxiliary#public-preview-limitations
The Challenge
To address the challenge of high costs, in this tutorial we will create an ingestion-time transformation that sends specific facility logs into a Basic table. This approach significantly reduces costs by routing secondary data to a Basic table based on its Facility while keeping security-relevant primary data in the default Syslog table.
Primary vs Secondary!
Primary data consists of logs that are critical to the organization from a security standpoint; these logs are frequently used in threat hunting, analytics rules, and real-time monitoring.
Secondary data consists of high-volume, verbose or informational logs that have limited security value but can still be helpful when investigating a security incident or breach. These logs are rarely queried unless required, yet they remain useful from a monitoring standpoint because they can provide additional performance insights.
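To make this split concrete, the following query sketches one way of labelling current Syslog traffic as primary or secondary; the facility lists here are only examples and should be adapted to your environment:
Syslog
| where TimeGenerated > ago(24h)
| extend DataClass = iff(Facility in ('auth', 'authpriv'), 'Primary', 'Secondary')
| summarize Events = count() by DataClass, Facility
| order by DataClass asc, Events desc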
High Level Overview
Pre-Requisites
Microsoft Sentinel Workspace
A Linux Virtual Machine that is supported by Azure Monitor Agent
Syslog Solution
Data Collection Rule Toolkit Workbook
And finally, enough RBAC permissions! :)
Low Level Steps
Let us begin by installing the relevant solution in the Microsoft Sentinel workspace. Navigate to Content Management -> Content Hub and search for Syslog.
After installing the above solution, search for Data Collection Rule and install the solution shown below.
Now, let's head over to Configuration -> Data Connectors and open the configuration page for Syslog via AMA.
If a DCR has not already been created, click the Create data collection rule button.
Give your DCR a name and select the resource group.
Click Next and, under Resources, select your machines.
Click Next and select the facilities. I am leaving out local0 to local7, as these facilities are mostly used by network devices.
Once done, click the Next (Review + Create) button and wait for the extension and association to be created for the selected resources. In the background, the AMA package will be pushed and installed onto the selected resources.
In case this machine also acts as a log collector/forwarder, you can run the following script. It edits the rsyslog configuration under the /etc directory, opens port 514 for both TCP and UDP, and restarts the rsyslog and azuremonitoragent services.
sudo wget -O Forwarder_AMA_installer.py https://raw.githubusercontent.com/Azure/Azure-Sentinel/master/DataConnectors/Syslog/Forwarder_AMA_installer.py&&sudo python3 Forwarder_AMA_installer.py
3. Log Validation and Identification
Let's navigate to Microsoft Sentinel -> Logs and query the Syslog table to get a feel for the data being ingested.
Following is the query along with its results.
Syslog
| where TimeGenerated >=ago(1h)
| where HostName =~ 'aniketdemo'
| project TimeGenerated, Facility, SeverityLevel, SyslogMessage, ProcessName, ProcessID
| top 50 by TimeGenerated
As we can see from the above, daemon logs are very chatty and are mostly low/medium-touch informational logs. Let's check the count per facility currently being ingested.
Syslog
| where TimeGenerated >=ago(1h)
| summarize count() by Facility, HostName
| order by count_ desc
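If you also want to see which facilities contribute most to cost rather than just event count, the _BilledSize column can be aggregated per facility (a rough approximation):
Syslog
| where TimeGenerated >= ago(24h)
| summarize IngestedMB = round(sum(_BilledSize) / 1e6, 2) by Facility, HostName
| order by IngestedMB desc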
Let's suppose you have now identified a few facilities, such as daemon, kern, cron, and mail, whose logs can be sent to a Basic table because they are mostly verbose.
However, before we jump into the transformation itself, we need to create a custom table with the Basic plan to hold this data.
4. Custom Table with Basic Plan
Navigate to shell.azure.com and login with your account.
Switch to Azure PowerShell and select the relevant subscription by using the following command.
Select-AzSubscription -SubscriptionName 'XXX'
Now, it's time to copy over the script.
Create a file with the command New-Item table_replication.ps1
Now, open Visual Studio Code or Notepad++ and copy over the PowerShell script, which can be fetched from the following link (credits to the script creator): https://gitlab.com/azurecodes/queries/-/blob/main/Table%20Replication/table_replication.ps1
Navigate to Log Analytics Workspace -> Overview and populate the parameters within the script as shown below.
Once the parameters have been populated, copy the script over into Azure PowerShell by opening the file with nano table_replication.ps1 and pasting the contents.
Once copied over, save the file with CTRL+X, then press Y followed by Enter.
Now, let's try to execute the script by typing ./table_replication.ps1
It should report that the custom table has been created successfully. It might throw errors for one or two columns, but that should be fine.
Let's check the table in the GUI. Navigate to Log Analytics Workspace -> Tables and look for the custom table.
You can check by going to Microsoft Sentinel -> Logs and running the following query:
SyslogMisc_CL
| getschema
Time to Transform!
Now, it's finally time to create our KQL transformation. There are multiple ways to go about this, but I am going to use an easy approach here.
You can also do it via Azure PowerShell or by heading over to Monitor -> Data Collection Rules -> Automation (Export Template) -> Deploy -> Edit Template.
Navigate to Microsoft Sentinel -> Threat Management -> Workbooks and look for Data Collection Rule Toolkit
Open this workbook and on the top select your subscription and log analytics workspace.
After selecting from the dropdown, select the third tab that says "Review/Modify DCR Rules"
Now, select your DCR and click on Modify DCR.
Scroll down the data collection rule until you come across the dataFlows property.
Data flows match input streams with destinations. Each data flow may optionally specify a transformation and specifies the table in the Log Analytics workspace the data will be sent to.
Within the dataFlows structure, we need to add two more properties:
transformKql: an optional transformation applied to the incoming stream. The transformation must understand the schema of the incoming data and output data in the schema of the target table. If you use a transformation, the data flow should only use a single stream.
outputStream: describes which table in the workspace (specified under the destination property) the data will be sent to. The value of outputStream has the format Microsoft-[tableName] (for example, Microsoft-Syslog) when data is being ingested into a standard table, or Custom-[tableName] (for example, Custom-SyslogMisc_CL) when ingesting data into a custom table. Only one destination is allowed per stream.
Following are the changes that I have made:
Following is the updated dataFlows section for your reference:
"dataFlows": [
{
"streams": [
"Microsoft-Syslog"
],
"destinations": [
"DataCollectionEvent"
],
"transformKql": "source | where Facility !in ('kern', 'daemon', 'cron', 'mail')",
"outputStream": "Microsoft-Syslog"
},
{
"streams": [
"Microsoft-Syslog"
],
"destinations": [
"DataCollectionEvent"
],
"transformKql": "source | where Facility in ('kern', 'daemon', 'cron', 'mail')",
"outputStream": "Custom-SyslogMisc_CL"
}
],
From the above, we can see that all facilities apart from kern, daemon, cron, and mail (such as auth, authpriv, syslog, user, and so on) will flow into the built-in Syslog table, and only the excluded facilities will flow into the custom Basic table, as defined by the outputStream property.
Note: when writing a transformation, you always reference the incoming stream as source instead of using the actual table name.
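As an aside, the same source stream can carry additional steps if you later decide to trim the secondary data further; for example (purely illustrative, any column you drop simply arrives empty in the destination table):
source
| where Facility in ('kern', 'daemon', 'cron', 'mail')
| project-away ProcessID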
Now, click Deploy Update -> Update DCR.
You should get a success message if the syntax is correct. If it fails, cross-check your JSON and the KQL statement.
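One easy way to sanity-check the KQL before (or after) deploying is to run the same statement directly in Log Analytics, swapping source for the actual table name:
Syslog
| where TimeGenerated > ago(1h)
| where Facility !in ('kern', 'daemon', 'cron', 'mail')
| summarize count() by Facility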
Transformations usually take anywhere between 20 and 30 minutes to take effect, so we can check back in a while.
6. Verify that logs are arriving
I checked back after 30 minutes, and I can now see a few entries in the Basic table.
You can validate whether the other facilities are now going only to the native table by running the following query:
Syslog
| where TimeGenerated >= ago(30m)
| summarize count() by Facility
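To confirm that the excluded facilities have stopped landing in the Analytics-tier table, the following count should drop to zero once the transformation is active:
Syslog
| where TimeGenerated >= ago(30m)
| where Facility in ('kern', 'daemon', 'cron', 'mail')
| count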
You can also try to generate mock logs by running the logger command on the Linux machine.
logger -p mail.info "This is a test message"
After that, run the following query in log analytics workspace.
SyslogMisc_CL
| where Facility == 'mail'
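Finally, once both tables have been receiving data for a day or so, the Usage table gives a rough view of how the ingested volume is now split between the two tiers (Quantity is reported in MB):
Usage
| where TimeGenerated > ago(1d)
| where DataType in ('Syslog', 'SyslogMisc_CL')
| summarize IngestedMB = round(sum(Quantity), 2) by DataType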
Conclusion
Effective data management is a cornerstone of any robust monitoring strategy and SIEM technology. By leveraging KQL to implement data collection rules and ingestion-time transformations, organizations can optimize their resource utilization and ensure meaningful data segregation.
The Syslog example demonstrates how non-essential facilities, such as cron, daemon, mail, and kern, can be directed to a custom table on the Basic plan. This approach reduces costs while preserving access to critical data for analysis and monitoring in the native Syslog table on the Analytics plan.
This technique empowers security teams to strike a balance between cost-efficiency and functionality, aligning their data ingestion strategies with organizational priorities. By fine-tuning these transformations, organizations can adapt to evolving needs while maintaining operational visibility and control at all times.