[vitrage] I have some problems with Prometheus alarms in vitrage.


[vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
I have some problems with Prometheus alarms in Vitrage.
I receive the list of alarms from the Prometheus Alertmanager correctly, but an alarm does not disappear when the underlying problem is resolved. Once an alarm has appeared in the alarm list and the entity graph, it never disappears from Vitrage. Alarms sent by Zabbix do disappear when they are resolved, so I would like to know how to clear Prometheus alarms from Vitrage and have them updated automatically, as with Zabbix.
Thank you.


Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

In the alertmanager.yml file you should have a receiver for Vitrage. Please verify that it includes "send_resolved: true". This is required for Prometheus to notify Vitrage when an alarm is resolved.

The full Vitrage receiver definition should be:

- name: <receiver name>
  webhook_configs:
  - url: <vitrage event url>  # example: 'http://127.0.0.1:8999/v1/event'
    send_resolved: true
    http_config:
      basic_auth:
        username: <an admin user known to Keystone>
        password: <user's password>

Hope it helps,
Ifat




Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Thank you for your reply, Ifat.

The alertmanager.yml file already contains 'send_resolved: true'.
However, the alarm does not disappear from the alarm list or the entity graph, whether the alarm is resolved, silenced in Alertmanager, or the alert rule is removed from Prometheus entirely.
The only way I have found to remove alarms is to delete them manually from the DB. Is there any other way to remove them?
VM entities running on multiple nodes in the Rocky version show a similar symptom: entities created on the multi-node setup do not disappear from the entity graph even after the VMs are deleted.
Is this a bug in the Rocky version?

Best Regards,
Won



Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

Can you please give us some more details about your scenario with Prometheus? Please try and give as many details as possible, so we can try to reproduce the bug.


What do you mean by "if the alarm is resolved, the alarm manager makes a silence, or removes the alarm rule from Prometheus"? These are different cases. Does none of them work in your environment?

Which Prometheus and Alertmanager versions are you using?

Please try to change the Vitrage log level to DEBUG (set "debug = true" in /etc/vitrage/vitrage.conf) and send me the Vitrage collector, graph and api logs.
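
To confirm the setting took effect, you can run something like this (the file path matches the one above; service names vary by deployment):

grep -n '^debug' /etc/vitrage/vitrage.conf    # expect: debug = true
# restart the Vitrage services afterwards so the new log level is picked up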


Regarding the multi nodes, I'm not sure I understand your configuration. Do you mean there is more than one OpenStack and Nova? More than one host? More than one VM?

Basically, VMs are deleted from Vitrage in two cases:
1. After each periodic call to get_all of the nova.instance datasource. By default this is done once every 10 minutes.
2. Immediately, if you have the following configuration in /etc/nova/nova.conf:
notification_topics = notifications,vitrage_notifications

So, please check your nova.conf, and also whether the VMs are deleted after 10 minutes.
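
For example, a quick way to check this on each node (a sketch; the file path assumes a standard installation):

grep -E 'notification_topics|driver' /etc/nova/nova.conf
# expect to see, among others:
#   notification_topics = notifications,vitrage_notifications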

Thanks,
Ifat





Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi. I'm sorry for the late reply.

My Prometheus version is 2.3.2 and my Alertmanager version is 0.15.2; I have attached the files (vitrage-collector and vitrage-graph logs, the Apache log, prometheus.yml, alertmanager.yml, the alarm rule file, etc.).
I think the reason a resolved alarm does not disappear is a problem with the alarm's timestamp.
alarm list.JPG
vitrage-entity_graph.jpg
-gray alarm info
severity:PAGE
vitrage id: c6a94386-3879-499e-9da0-2a5b9d3294b8  ,  e2c5eae9-dba9-4f64-960b-b964f1c01dfe , 3d3c903e-fe09-4a6f-941f-1a2adb09feca , 8c6e7906-9e66-404f-967f-40037a6afc83 , e291662b-115d-42b5-8863-da8243dd06b4 , 8abd2a2f-c830-453c-a9d0-55db2bf72d46
----------

The alarms marked with the blue circle are already resolved. However, they do not disappear from the entity graph or the alarm list.
The top screenshot showed seven more gray alarms among the active alarms, just like the entity graph. They only disappeared after I deleted the gray alarms from the vitrage-alarms table in the DB, or changed their end timestamp to a time earlier than the current time.

From the logs, it seems that the first problem is that the timestamp shown by Vitrage is 2001-01-01, even though the start time in the Prometheus alarm information is correct.
When the alarm is resolved, the end timestamp is not updated, so the alarm does not disappear from the alarm list.

The second problem is that even if the timestamp problem is solved, the entity graph problem remains. The gray alarm information is not in the vitrage-collector log, but it is in the vitrage-graph and Apache logs.
I want to know how to forcefully delete an entity from the Vitrage graph.


Regarding the multi-node setup, I mean one control node (pc1) and one compute node (pc2), so a single OpenStack.
image.png
The test VM in the picture is an instance on the compute node that has already been deleted. I waited for hours and checked nova.conf, but it was not removed.
This did not occur in the Queens version; in a Rocky multi-node environment there seems to be a bug around VMs created on the compute node.
The same situation occurred in multi-node environments configured with different PCs.

thanks,
Won


Attachment: environment.zip (310K)

Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi Won,

On Wed, Oct 10, 2018 at 11:58 AM Won <[hidden email]> wrote:

my prometheus version : 2.3.2 and alertmanager version : 0.15.2 and I attached files.(vitrage collector,graph logs and apache log and prometheus.yml alertmanager.yml alarm rule file etc..)
I think the problem that resolved alarm does not disappear is the time stamp problem of the alarm.

-gray alarm info
severity:PAGE
vitrage id: c6a94386-3879-499e-9da0-2a5b9d3294b8  ,  e2c5eae9-dba9-4f64-960b-b964f1c01dfe , 3d3c903e-fe09-4a6f-941f-1a2adb09feca , 8c6e7906-9e66-404f-967f-40037a6afc83 , e291662b-115d-42b5-8863-da8243dd06b4 , 8abd2a2f-c830-453c-a9d0-55db2bf72d46
----------

The alarms marked with the blue circle are already resolved. However, it does not disappear from the entity graph and alarm list.
There were seven more gray alarms at the top screenshot in active alarms like entity graph. It disappeared by deleting gray alarms from the vitrage-alarms table in the DB or changing the end timestamp value to an earlier time than the current time.

I checked the files that you sent, and it appears that the connection between Prometheus and Vitrage works well. I can see in the vitrage-graph log that Prometheus notified Vitrage on both the alert firing and the alert resolved statuses.
I still don't understand why the alarms were not removed from Vitrage, though. Can you please send me the output of the 'vitrage topology show' CLI command?
Also, did you happen to restart vitrage-graph or vitrage-collector during your tests?
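
For example (assuming the admin credentials are already sourced in your shell):

vitrage topology show > topology.json    # capture the topology exactly as Vitrage sees it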
 
At the log, it seems that the first problem is that the timestamp value from the vitrage comes to 2001-01-01, even though the starting value in the Prometheus alarm information has the correct value. 
When the alarm is solved, the end time stamp value is not updated so alarm does not disappear from the alarm list.  
 
Can you please show me where you saw the 2001 timestamp? I didn't find it in the log.
 
The second problem is that even if the time stamp problem is solved, the entity graph problem will not be solved. Gray alarm information is not in the vitage-collector log but in the vitrage graph and apache log.
I want to know how to forcefully delete entity from a vitage graph.

You shouldn't do it :-) there is no API for deleting entities, and messing with the database may cause unexpected results.
The only thing that you can safely do is to stop all Vitrage services, execute the 'vitrage-purge-data' command, and start the services again. This will rebuild the entity graph.
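
A minimal sketch of that procedure, assuming systemd-managed services whose unit names may differ in your deployment:

sudo systemctl stop vitrage-graph vitrage-collector    # plus any other Vitrage services you run
vitrage-purge-data                                     # clears the Vitrage database tables
sudo systemctl start vitrage-graph vitrage-collector   # the entity graph is rebuilt on startup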
 
Regarding the multi nodes, I mean, 1 controll node(pc1) & 1 compute node(pc2). So one openstack.  

The test VM in the picture is an instance on compute node that has already been deleted. I waited for hours and checked nova.conf but it was not removed.
This was not the occur in the queens version; in the rocky version, multinode environment, there seems to be a bug in VM creation on multi node.
The same situation occurred in multi-node environments that were configured with different PCs.

Let me make sure I understand the problem.
When you create a new VM in Nova, does it immediately appear in the entity graph?
When you delete a VM, does it remain? Does it remain in a multi-node environment but get deleted in a single-node environment?
 
Br,
Ifat



Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi Ifat,
I'm sorry for the late reply. 

I solved the problem of the Prometheus alarms not being updated.
Alarms with the same Prometheus alert name are recognized as the same alarm in Vitrage.

------- alert.rules.yml
groups:
- name: alert.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 60s
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down
        for more than 30 seconds.'
      summary: Instance {{ $labels.instance }} down
------
This is the content of the alert.rules.yml file before I modified it.
It is a YAML file that generates an alarm when cAdvisor stops (instance down). An alarm is triggered for whichever instance is down, but all of the alarms have the same name, 'InstanceDown'. Vitrage recognizes all of these alarms as the same alarm. Thus, until every 'InstanceDown' alarm was cleared, the alarm was considered unresolved and was never removed.


------alert.rules.yml(modified)
groups:
- name: alert.rules
  rules:
  - alert: InstanceDown on Apigateway
    expr: up{instance="192.168.12.164:31121"} == 0
    for: 5s
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down
        for more than 30 seconds.'
      summary: Instance {{ $labels.instance }} down

  - alert: InstanceDown on Signup
    expr: up{instance="192.168.12.164:31122"} == 0
    for: 5s
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down
        for more than 30 seconds.'
      summary: Instance {{ $labels.instance }} down
.
.
.
---------------
By modifying the rules as above, the problem has been solved.
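
After changing the rule file it is worth validating it and reloading Prometheus; a sketch (the /-/reload endpoint only works if Prometheus was started with --web.enable-lifecycle, otherwise send the process a SIGHUP):

promtool check rules alert.rules.yml
curl -X POST http://127.0.0.1:9090/-/reload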

 
Can you please show me where you saw the 2001 timestamp? I didn't find it in the log.

 
 
image.png
The timestamp is recorded correctly in the logs (vitrage-graph, vitrage-collector, etc.), but in the Vitrage dashboard it is shown as 2001-01-01.
However, the timestamp seems to be handled correctly internally, because the alarm can be resolved and it is logged correctly.


Let me make sure I understand the problem. 
When you create a new vm in Nova, does it immediately appear in the entity graph?
When you delete a vm, it remains? does it remain in a multi-node environment and deleted in a single node environment? 
 
image.png
The host named 'ubuntu' is my main server. I installed OpenStack all-in-one on this server, and I installed a compute node on the host named 'compute1'.
When I create a new VM in Nova on compute1, it immediately appears in the entity graph, but it does not disappear from the entity graph when I delete the VM. No matter how long I wait, it does not disappear.
After I execute the 'vitrage-purge-data' command and reboot OpenStack (by running the reboot command on the 'ubuntu' server), it disappears. Executing 'vitrage-purge-data' alone does not work; a reboot is needed for it to disappear.
When I create a new VM in Nova on 'ubuntu', there is no problem.




I implemented a web service with a microservice architecture and applied RCA to it. The attached picture shows the structure of the web service I implemented. I wonder what data I can receive and what I can do when I link Vitrage with Kubernetes.
As far as I know, the Vitrage graph does not present information about the containers or pods inside a VM. If that is correct, I would like to make pod-level information appear on the entity graph.

I followed the steps in https://docs.openstack.org/vitrage/latest/contributor/k8s_datasource.html. I attached the vitrage.conf file and the kubeconfig file. The contents of the kubeconfig file were copied from the admin.conf file on the master node.
I want to check that my settings are correct and that the connection works, but I don't know how. It would be very much appreciated if you could let me know how.

Br,
Won


Attachments: kubeconfig.txt (7K), micro service architecture.png (482K), vitrage.conf (1K)

Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

On Fri, Oct 26, 2018 at 10:34 AM Won <[hidden email]> wrote:

I solved the problem of not updating the Prometheus alarm.
Alarms with the same Prometheus alarm name are recognized as the same alarm in vitrage.

------- alert.rules.yml
groups:
- name: alert.rules
  rules:
  - alert: InstanceDown
    expr: up == 0
    for: 60s
    labels:
      severity: warning
    annotations:
      description: '{{ $labels.instance }} of job {{ $labels.job }} has been down
        for more than 30 seconds.'
      summary: Instance {{ $labels.instance }} down
------
This is the contents of the alert.rules.yml file before I modify it.
This is a yml file that generates an alarm when the cardvisor stops(instance down). Alarm is triggered depending on which instance is down, but all alarms have the same name as 'instance down'. Vitrage recognizes all of these alarms as the same alarm. Thus, until all 'instance down' alarms were cleared, the 'instance down' alarm was recognized as unresolved and the alarm was not extinguished.

This is strange. I would expect your original definition to work as well, since the alarm key in Vitrage is defined by a combination of the alert name and the instance. We will check it again. 
BTW,  we solved a different bug related to Prometheus alarms not being cleared [1]. Could it be related?
 
Can you please show me where you saw the 2001 timestamp? I didn't find it in the log. 
 
image.png
The time stamp is recorded well in log(vitrage-graph,collect etc), but in vitrage-dashboard it is marked 2001-01-01.
However, it seems that the time stamp is recognized well internally because the alarm can be resolved and is recorded well in log.

Does the wrong timestamp appear if you run the 'vitrage alarm list' CLI command? Please try running 'vitrage alarm list --debug' and send me the output.
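
For example, to capture the table together with the underlying REST requests in one file:

vitrage alarm list --debug > vitrage-alarm-list.txt 2>&1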
  
image.png
Host name ubuntu is my main server. I install openstack all in one in this server and i install compute node in host name compute1.
When i create a new vm in nova(compute1) it immediately appear in the entity graph. But in does not disappear in the entity graph when i delete the vm. No matter how long i wait, it doesn't disappear.
Afther i execute 'vitrage-purge-data' command and reboot the Openstack(execute reboot command in openstack server(host name ubuntu)), it disappear. Only execute 'vitrage-purge-data' does not work. It need a reboot to disappear.
When i create a new vm in nova(ubuntu) there is no problem. 
Please send me vitrage-collector.log and vitrage-graph.log from the time when the problematic VM was created and deleted. Please also create and delete a VM on your 'ubuntu' server, so I can check the differences in the logs.
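
If the full logs are too large, grepping around the instance name should be enough; for example (the log paths are an assumption about a typical installation):

grep -i '<instance-name>' /var/log/vitrage/vitrage-collector.log /var/log/vitrage/vitrage-graph.log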

I implemented the web service of the micro service architecture and applied the RCA. Attached file picture shows the structure of the web service I have implemented. I wonder what data I receive and what can i do when I link vitrage with kubernetes.
As i know, the vitrage graph does not present information about containers or pods inside the vm. If that is correct, I would like to make the information of the pod level appear on the entity graph.

I follow (https://docs.openstack.org/vitrage/latest/contributor/k8s_datasource.html) this step. I attached the vitage.conf file and the kubeconfig file. The contents of the Kubeconconfig file are copied from the contents of the admin.conf file on the master node.
I want to check my settings are right and connected, but I don't know how. It would be very much appreciated if you let me know how.
Unfortunately, Vitrage does not hold pod and container information at the moment. We discussed the option of adding it in the Stein release, but I'm not sure we will get to do it.
 
Br,
Ifat
 
 


Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi,

This is strange. I would expect your original definition to work as well, since the alarm key in Vitrage is defined by a combination of the alert name and the instance. We will check it again. 
BTW,  we solved a different bug related to Prometheus alarms not being cleared [1]. Could it be related?

With the original definition, no matter how different the instances are, alarms with the same name are recognized as the same alarm in Vitrage.
I also tried installing the Rocky version and the master version on a new server and retesting, but the problem was not solved. The latest bugfix seems unrelated.

Does the wrong timestamp appear if you run 'vitrage alarm list' cli command? please try running 'vitrage alarm list --debug' and send me the output.  

I have attached 'vitrage-alarm-list.txt'.
 
Please send me vitrage-collector.log and vitrage-graph.log from the time that the problematic vm was created and deleted. Please also create and delete a vm on your 'ubuntu' server, so I can check the differences in the log.

I have attached the 'vitrage_log_on_compute1.zip' and 'vitrage_log_on_ubuntu.zip' files.
When I created a VM on compute1 there were vitrage-collector log entries, but no entries appeared when it was deleted.

Br,
Won




Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

On Wed, Oct 31, 2018 at 11:00 AM Won <[hidden email]> wrote:
Hi,

This is strange. I would expect your original definition to work as well, since the alarm key in Vitrage is defined by a combination of the alert name and the instance. We will check it again. 
BTW,  we solved a different bug related to Prometheus alarms not being cleared [1]. Could it be related?

Using the original definition, no matter how different the instances are, the alarm names are recognized as the same alarm in vitrage.
And I tried to install the rocky version and the master version on the new server and retest but the problem was not solved. The latest bugfix seems irrelevant.

OK. We will check this issue. For now, your workaround is OK, right?
 
Does the wrong timestamp appear if you run 'vitrage alarm list' cli command? please try running 'vitrage alarm list --debug' and send me the output.  

I have attached 'vitrage-alarm-list.txt.'

I believe that you attached the wrong file. It seems to be another vitrage-graph log.
 
 
Please send me vitrage-collector.log and vitrage-graph.log from the time that the problematic vm was created and deleted. Please also create and delete a vm on your 'ubuntu' server, so I can check the differences in the log.

I have attached 'vitrage_log_on_compute1.zip' and 'vitrage_log_on_ubuntu.zip' files. 
When creating a vm on computer1, a vitrage-collect log occurred, but no log occurred when it was removed.
 
Looking at the logs, I see two issues:
1. On the ubuntu server you get a notification about the VM deletion, while on compute1 you don't.
Please make sure that Nova sends notifications to 'vitrage_notifications' - it should be configured in /etc/nova/nova.conf.

2. Once every 10 minutes (by default) the nova.instance datasource queries all instances. The deleted VM is supposed to be removed from Vitrage at this stage, even if the notification was lost.
Please check your collector log for a message containing "novaclient.v2.client [-] RESP BODY" before and after the deletion, and send me its content.
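
For example (again, the log path is an assumption - adjust it to your deployment):

grep 'RESP BODY' /var/log/vitrage/vitrage-collector.log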

Br,
Ifat




Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

We solved the timestamp bug. There are two patches for master [1] and stable/rocky [2].
I'll check the other issues next week.

Regards,
Ifat




Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi,

We solved the timestamp bug. There are two patches for master [1] and stable/rocky [2]. 
 
I applied the patch and confirmed that it works well.

Looking at the logs, I see two issues:
1. On ubuntu server, you get a notification about the vm deletion, while on compute1 you don't get it. 
Please make sure that Nova sends notifications to 'vitrage_notifications' - it should be configured in /etc/nova/nova.conf.
2. Once in 10 minutes (by default) nova.instance datasource queries all instances. The deleted vm is supposed to be deleted in Vitrage at this stage, even if the notification was lost. 
Please check in your collector log for the a message of "novaclient.v2.client [-] RESP BODY" before and after the deletion, and send me its content. 

I attached two log files. I created a VM on compute1, which is a compute node, and deleted it a few minutes later; the logs cover 30 minutes from the VM creation.
The first is the vitrage-collector log, grepped for the instance name.
The second is the 'novaclient.v2.client [-] RESP BODY' log.
After I deleted the VM, no log entry for the instance appeared in the collector log, no matter how long I waited.

I added the following to nova.conf on the compute1 node (see the attached file 'compute_node_local_conf.txt'):
notification_topics = notifications,vitrage_notifications
notification_driver = messagingv2
vif_plugging_timeout = 300
notify_on_state_change = vm_and_task_state
instance_usage_audit_period = hour
instance_usage_audit = True

However, the problem has not been resolved.
I have been testing Vitrage for the Prometheus alarm recognition problem and for the problem where instances on the multi-node setup do not disappear from the entity graph, but I have not yet found the cause.
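
To check whether the compute node's notifications reach the message bus at all, I am thinking of running something like this on the controller (assuming RabbitMQ is the transport):

sudo rabbitmqctl list_queues name messages | grep vitrage    # look for a vitrage_notifications queue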

Br,
Won


Attachments: vitrage_collect_log_grep_instancename.txt (2K), vitrage_collect_log_grep_RESP_BODY.txt (167K), compute_node_local_conf.txt (3K)

Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek


Hi,

From the collector log RESP BODY messages I understand that in the beginning there were the following servers:
compute1: deltesting
ubuntu: Apigateway, KubeMaster and others

After ~20 minutes, there was only Apigateway. Does that make sense? Did you delete the instances on ubuntu in addition to deltesting?
In any case, I would expect to see the instances deleted from the graph at this stage, since they were not returned by get_all.
Can you please send me the vitrage-graph log for the same time frame (Nov 15, 16:35-17:10)?

There is still the question of why we don't see a notification from Nova, but let's try to solve the issues one by one.

Thanks,
Ifat




Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi,

I attached four log files.
I collected the logs from about 17:14 to 17:42. I created an instance named 'deltesting3' at 17:17. Seven minutes later, at 17:24, the entity graph showed deltesting3 and the vitrage-collector and vitrage-graph log entries appeared.
When I create an instance on the ubuntu server, it appears immediately in the entity graph and the logs, but when I create an instance on compute1 (the multi-node case), it appears about 5-10 minutes later.
I deleted the 'deltesting3' instance around 17:26.
 
After ~20minutes, there was only Apigateway. Does it make sense? did you delete the instances on ubuntu, in addition to deltesting?

I only deleted 'deltesting'. After that, only log entries for 'apigateway' and 'kube-master' were collected, but the other instances were still running fine. I don't know why only those two instances appear in the log.
In the Nov 19 log, 'apigateway' and 'kube-master' were collected continuously at short intervals, while the other instances were sometimes collected only at long intervals.

In any case, I would expect to see the instances deleted from the graph at this stage, since they were not returned by get_all.
Can you please send me the log of vitrage-graph at the same time (Nov 15, 16:35-17:10)?

Information about 'deltesting3', which has already been deleted, continues to appear in the vitrage-graph service log.

Br,
Won
 

 




Attachments: vitrage_collect_logs_grep_Instance_name (2K), vitrage_collect_logs_grep_RESPBODY (141K), vitrage_graph_logs_grep_Instance_name (1M), vitrage_graph_logs (22M)

Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

Ifat Afek
Hi,

A deleted instance should be removed from Vitrage in one of two ways:
1. By reacting to a notification from Nova
2. If no notification is received, then after a while the instance vertex in Vitrage is considered "outdated" and is deleted

Regarding #1, it is clear from your logs that you don't get notifications from Nova on the second compute.
Do you have, on one of your nodes, a nova-cpu.conf in addition to nova.conf? If so, please make the same change in that file:

notification_topics = notifications,vitrage_notifications

notification_driver = messagingv2


And please make sure to restart nova compute service on that node.

Regarding #2, as a second-best solution, the instances should be deleted from the graph after not being updated for a while. 
I realized that we have a bug in this area and I will push a fix to gerrit later today. In the meantime, you can add the following function to the InstanceDriver class:

    @staticmethod
    def should_delete_outdated_entities():
        return True
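Just to illustrate the idea behind #2 (this is not Vitrage's actual consistency code, only a sketch of the principle; the factor of 2 and the 10-minute snapshot interval are assumptions for the example): a vertex counts as outdated when its last sample is older than a multiple of the snapshot interval, and the flag above allows such instance vertices to be removed.

from datetime import datetime, timedelta

SNAPSHOT_INTERVAL = timedelta(minutes=10)  # assumed default

def is_outdated(last_sample_time, now=None, factor=2):
    # an entity that was not refreshed by get_all for a while is outdated
    now = now or datetime.utcnow()
    return now - last_sample_time > factor * SNAPSHOT_INTERVAL

# a vertex last updated 25 minutes ago would be considered outdated and,
# with should_delete_outdated_entities() returning True, removed from the graph
print(is_outdated(datetime.utcnow() - timedelta(minutes=25)))  # True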

Let me know if it solved your problem,
Ifat





Re: [vitrage] I have some problems with Prometheus alarms in vitrage.

ddaasd
Hi,

I verified that both of the methods you proposed work well.
After I added the 'should_delete_outdated_entities' function to InstanceDriver, it took about 10 minutes for the stale instance to be cleared.
And after I added the two lines you mentioned to nova-cpu.conf, the vitrage-collector now receives the notifications correctly.

Thank you for your help.

Best regards,
Won
