Policies¶
Policies can be defined in two locations:
- Per-component policies section: each policy defined here is applied only to the component where it is defined.
- Top-level policies section: policies defined here are applied to all components (with the exception of custom policies with the apply-to key).
The following example shows where policies are allowed:
name: test-policies
description: a sample manifest to describe the usage of policies
components:
  - name: comp1
    type: kubernetes
    manifests:
      - name: c1
    ##
    ## PER-COMPONENT policies section
    ##
    policies:
      - name: ...
        type: ...
  - name: comp2
    type: kubernetes
    manifests:
      - name: c2
##
## TOP-LEVEL policies section
##
policies:
  - name: ...
    type: ...
manifests:
  ...
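As a sketch of how the two sections can be combined, the manifest below fills in the placeholders with policy types documented later on this page. The choice of policies and the threshold values are illustrative assumptions, not recommendations.

```yaml
name: test-policies
description: a sample manifest to describe the usage of policies
components:
  - name: comp1
    type: kubernetes
    manifests:
      - name: c1
    # per-component section: this policy applies to comp1 only
    policies:
      - type: node-resource-usage
        cpu_threshold_perc: 0.8   # illustrative value
  - name: comp2
    type: kubernetes
    manifests:
      - name: c2
# top-level section: this policy applies to every component
policies:
  - type: redeployOnLostTelemetry
    timeout: 15m                  # illustrative value
manifests:
  ...                             # elided, as in the example above
```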
Security¶
Node Security Level¶
This policy defines a security level (based on the Wazuh SCA score) for the nodes where the application should run. ICOS ensures that the application is deployed on a node that satisfies the defined security level. If the node's level changes at runtime, ICOS moves (redeploys) the application to a node that satisfies the policy.
Levels:
- high: SCA score >= 80
- medium: SCA score >= 50 and < 80
- low: SCA score >= 0 and < 50
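The predefined form of this policy is not shown on this page. As a hedged sketch, the same constraint can be written as a templated custom policy using the app-sca-score template and its threshold variable, both listed under Templated Policies. The value 80 is an assumption chosen to match the "high" level above, and a custom policy written this way is enforced only at runtime.

```yaml
policies:
  - type: custom
    fromTemplate: app-sca-score
    remediation: redeploy   # mandatory for templated policies
    variables:
      threshold: 80         # assumed: minimum SCA score, i.e. the "high" level
```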
Resources Usage and Availability¶
Node CPU and Memory¶
This policy defines a threshold for CPU and/or memory usage on the node where the app is running. If usage exceeds the threshold, a violation is triggered. This is useful to make sure that the node where the app is running always has enough free resources (e.g. to absorb app usage spikes).
policies:
  - type: node-resource-usage
    cpu_threshold_perc: 0.8
    memory_threshold_perc: 0.6
    memory_threshold: 200Mi
    exclude_app_resources: false
Parameters:
- cpu_threshold_perc (0-1): the maximum CPU usage allowed, expressed as a value between 0 and 1. If the current CPU usage is above the threshold, a violation is triggered
- memory_threshold_perc (0-1): the maximum memory usage allowed, expressed as a value between 0 and 1. If the current memory usage is above the threshold, a violation is triggered
- memory_threshold: the minimum amount of memory that must remain free. If less memory than the threshold is available, a violation is triggered
- exclude_app_resources: indicates whether the resources used by the app itself should be excluded from the count (default is true). This parameter makes sense only at runtime and, hence, it is ignored by the MatchMaker
The memory_threshold parameter of this policy is equivalent to the memory requirement. If both are specified, the largest one is used.
The default remediation action is: redeploy.
Devices Availability¶
This policy makes sure that the devices specified in the requirements section remain available at runtime. If, at any moment, a device becomes unavailable, a violation is triggered.
For example, if a required camera becomes unavailable at any moment (e.g. it is detached), a violation is triggered.
The default remediation action is: redeploy.
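The same check can also be sketched as a templated custom policy, using the device-available template and its device variable listed under Templated Policies. The device value below is hypothetical; the template matches it against the resource path of the mounted device, so the exact format depends on how devices are named in your environment.

```yaml
policies:
  - type: custom
    fromTemplate: device-available
    remediation: redeploy      # mandatory for templated policies
    variables:
      device: /dev/video0      # hypothetical device name, e.g. a camera
```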
Components¶
Components Reachability¶
This policy monitors the ICOS telemetry data sent automatically for each app deployed. When this data is not received for a given amount of time the policy is violated. If the telemetry data is not received it is very likely that the node has been disconnected or crashed, or the app has been manually removed.
policies:
  # short form
  - redeployOnLostTelemetry: 15m
  # normal form
  - type: redeployOnLostTelemetry
    timeout: 15m
The default timeout is 5m.
The default remediation action is: redeploy.
COMPSS¶
COMPSS Under Allocation¶
This policy works with COMPSS applications. It scales up the application replicas when the estimated time to finish (ETA) is above a given threshold.
policies:
  - type: custom
    fromTemplate: compss-under-allocation
    remediation: scale-up
    variables:
      thresholdTimeSeconds: 120
      compssTask: provesOtel.example_task
Custom¶
Besides the predefined policies presented above, it is possible to define custom policies. These policies are interpreted only by the Policy Manager and not by the Matchmaker, so they are enforced only at runtime, not at deployment time.
Templated Policies¶
This policy is defined using one of the templates defined in the Policy Manager service. To overcome the complexity of writing PromQL Expressions for custom policies, the service defines a set of templates. For instance, all the policies presented in the previous sections have their own template defined in the Policy Manager.
policies:
  - type: custom
    fromTemplate: <template-name>
    remediation: redeploy
    variables:
      var1: val1
      var2: val2
Parameters:
- fromTemplate: specifies the name of the template to use. Currently the following templates are defined:
Name | Variables |
---|---|
compss-under-allocation | |
cpu-usage-host | |
app-host-cpu-usage-perc | |
app-host-cpu-usage-perc-excl | |
app-host-memory-usage-perc | |
app-host-memory-usage-perc-excl | |
app-host-memory-avail-bytes-excl | |
app-host-memory-avail-bytes | |
app-sca-score | threshold: the minimum SCA score to not trigger the violation |
device-available | device: the name of the device that should be available |
redeploy-on-lost-telemetry | |
app-security-events | maxEvents: maximum number of events before triggering a violation (default: 0); severity: Tetragon severity level (default: .+); eventName: filter by Tetragon event name (default: .+) |
- remediation: the name of the remediation action. It is mandatory
- variables: variables for customizing the template
Warning
This section is under construction. More details for each template and the expected variables are coming
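As an illustration of how a template from the table is instantiated, the sketch below uses app-security-events and overrides only maxEvents. The value 3 is an arbitrary assumption; severity and eventName are left at their documented defaults (.+), so all events are counted.

```yaml
policies:
  - type: custom
    fromTemplate: app-security-events
    remediation: redeploy   # mandatory for templated policies
    variables:
      maxEvents: 3          # illustrative: violation after more than 3 events
```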
PromQL Expression Policies¶
This type of policy is defined by a PromQL expression. The Policy Manager will monitor that expression and will trigger a violation when the expression value is outside the specified range. This is the most generic way of defining a policy for the Policy Manager. All other policies enforced by the Policy Manager can also be expressed in this form, although they could have a very complex expression.
policies:
  - type: custom
    spec:
      expr: "... custom PROMQL expression ..."
    remediation: redeploy
    variables:
      myVar: myVal
The spec.expr parameter contains the expression that the Policy Manager should evaluate. It can be any PromQL expression that returns a value when the policy is violated. Under the hood, the Policy Manager uses this expression to define a Prometheus Alerting Rule, so any valid alerting rule expression is also a valid expression for this policy.
Placeholders¶
In the expression it is possible to use placeholders that are replaced at policy creation time. They are delimited by the {{ }} sequence. Any variable can be referenced with a placeholder.
For instance in this policy:
policies:
  - type: custom
    spec:
      expr: "... > {{myVar}}"
    remediation: redeploy
    variables:
      myVar: myVal
the {{myVar}} placeholder will be replaced with myVal when the policy is created. This promotes the readability, maintainability and reusability of policies.
Some placeholders are predefined and replace the labels that identify the subject of the policy (e.g. the app). This is useful, and often the only way, to define a policy that monitors a single app.
For instance, a policy whose expression references a metric such as container_cpu_usage without any label selector will monitor that metric for ALL the apps! In most cases this is not the desired behaviour. One way to filter the expression in PromQL is to use labels, for instance container_cpu_usage{icos_app_name="myapp", icos_app_instance="myapp-123"}. However, the icos_app_instance value is unknown until the app is deployed!
For this reason, there are two predefined placeholders that are automatically replaced with the "coordinates" of the app for which the policy is created:
- subject_label_selector is replaced with icos_app_name="<app_name>", icos_app_component=~"<app_component(s)>", icos_app_instance="<app_instance>". It is used for writing metric selectors. If the policy applies to multiple components, the icos_app_component value will be "comp1|comp2|..." (if it applies to all components, it will be "(.+)")
- subject_label_list is replaced with icos_app_name, icos_app_component, icos_app_instance. It is useful to shorten grouping or aggregation expressions
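A minimal sketch of a PromQL Expression policy that combines both predefined placeholders with a user-defined one is shown below. The metric name container_cpu_usage is taken from the discussion above purely as an illustration; the maxCpu variable, its value and the policy name are hypothetical.

```yaml
policies:
  - type: custom
    name: cpu usage of this app only   # hypothetical name
    spec:
      # the selector restricts the metric to the app the policy is created for;
      # the label list keeps the app "coordinates" in the aggregated result
      expr: 'sum by ({{subject_label_list}}) (container_cpu_usage{ {{subject_label_selector}} }) > {{maxCpu}}'
    remediation: redeploy
    variables:
      maxCpu: 0.8                      # hypothetical threshold
```

Because the expression returns a value only while the aggregated usage is above the threshold, it follows the alerting-rule convention described earlier.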
For instance, the expression of the device-available templated policy is:
(
node_mounted{resource_path=~"{{device}}"} offset 10m
* on(icos_host_id) group_left( {{subject_label_list}} )
tlum_workload_info{ {{subject_label_selector}} } offset 10m
)
unless
(
node_mounted{resource_path=~"{{device}}"}
* on(icos_host_id) group_left( {{subject_label_list}} )
tlum_workload_info{ {{subject_label_selector}} }
)
It makes use of the predefined placeholders plus the {{device}} placeholder, which corresponds to the value of the device variable defined in the app descriptor.
Common Parameters¶
All policies presented on this page accept some common parameters that customize their behaviour.
policies:
  # each policy can be assigned a custom name useful to identify the policy in the system
  # if no name is assigned, one will be generated automatically
  - name: custom name
    # allowed only in top-level "policies" section
    # restrict the policy scope to only a subset of components
    apply-to:
      - c1
      - c2
    # remediation action. It is "redeploy" by default in most of the policies
    remediation: redeploy
    # variables. They are replaced in the templated and PromQL Expression policies if expected
    variables:
      var1: val1
      var2: val2
    # properties that can be used to customize some internal aspects of the Policy Manager
    properties:
      # wait for this interval before triggering a violation. If, in the meantime, the condition
      # for the violation disappears, then the violation is not triggered
      pendingInterval: 5m