Forum Discussion

Dominique
Advisor
3 years ago

Ping Failing from collector to device and back?

Hello, I am getting these errors: “VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”
“The host VIPEIEEMP01 is down. No data has been received.”
 

If I do a ping from the command prompt of the client to the collector, it works.

The firewall is wide open!

What did I miss?
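
For context, a quick way to measure sustained loss in the direction LogicMonitor actually tests (collector to device) is a loop like this on the collector host; the target name and probe count below are just placeholders, and the flags assume a Windows collector:

    # Minimal loss-tracking sketch. Run it on the collector, not the monitored
    # client -- a test from the client's command prompt goes in the opposite
    # direction from the collector's own ping checks.
    import subprocess
    import time

    TARGET = "VIPEIEEMP01"   # device the collector is alerting on
    PROBES = 60              # one probe per second for a minute
    sent = lost = 0

    for _ in range(PROBES):
        sent += 1
        result = subprocess.run(
            ["ping", "-n", "1", "-w", "1000", TARGET],  # Windows ping syntax
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:   # non-zero exit means no reply (approximate)
            lost += 1
        time.sleep(1)

    print(f"{lost}/{sent} probes lost ({100.0 * lost / sent:.1f}% loss)")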

Thanks,

Dom

  • Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so I'm not hitting firewalls, but some setups do go through one. Is this specific to Windows or Linux collectors?

    Hello,

    It is Windows Server 2019 Standard for the client VIPEIEEMP01 and Windows Server 2016 Standard for the “preferred” collector VIPLGCMON02.

    The client has an external IP, so it is attached to the closest datacenter for its collector.

    Thanks,

    Dom

  •  @Joe Williams 

    What size were your collectors, and what did you end up adjusting the threadpool to?

    I’m new to LM and have been working on implementing it within my company for the past few months. I’ve been facing this issue since December and have been going back and forth with LM Support. First, they had me do a fresh install; then they said there was a bug in v33.x of the collector and I needed to downgrade; the last step was to completely delete the collectors from the portal and add them again. At this point LM Support claims they have done everything they can and have escalated to an internal ticket, but I haven’t been able to get any updates on the status of that ticket. I’m willing to make adjustments to the threadpools/timeouts if needed to see if that resolves the issue for me.

    Hello,

    Windows Server 2016 Standard

    CPU: Quad 2.40 GHz Intel Xeon Silver 4214R

    RAM: 16 GB

    Disk: 75 GB

    Collector Version: 32.003

    Let me know what I should do to adjust the threadpool.

    Thanks,

    Dom

  • I suppose it is possible there are two different issues -- needing threadpool changes indicates internal resources are exhausted for checks. The original issue is still our constant problem -- static ID values cause sessions to become invalidated by firewalls when there is a disruption on the firewall path. A new session must be observed by the firewall to start letting traffic through again, which can only be done (currently) by restarting the collector, since the code allocates those session ID values only once at startup. LM has been informed about this repeatedly, is aware of it, and does nothing to fix it. The only thing I’ve seen is a doc note blaming particular firewalls, but it impacts pretty much any stateful inspection firewall. More and more folks use firewalls for internal segmentation, and there are many cases where a remote collector is needed due to lack of resources to deploy a local collector.
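
    As a toy illustration of that failure mode (this is not LM or firewall code, just a sketch of the behavior described above): the state table below only lets traffic through for a session ID it has not invalidated, so a collector that keeps reusing the one ID it allocated at startup stays blocked until a restart produces a new one.

      # Toy model of the described failure mode -- not real firewall logic.
      class StatefulFirewall:
          def __init__(self):
              self.active, self.invalidated = set(), set()

          def allows(self, session_id):
              # New IDs establish state; IDs tied to an invalidated session stay
              # blocked until the endpoint starts a genuinely new session.
              if session_id in self.invalidated:
                  return False
              self.active.add(session_id)
              return True

          def disrupt(self):
              # e.g. HA failover, policy push, or idle timeout on the firewall path.
              self.invalidated |= self.active
              self.active.clear()

      fw = StatefulFirewall()
      startup_id = 0x1A2B                # allocated once when the collector starts

      print(fw.allows(startup_id))       # True  -> polls succeed
      fw.disrupt()
      print(fw.allows(startup_id))       # False -> every later poll is dropped
      print(fw.allows(0x3C4D))           # True  -> only a restart (new ID) recovers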

    Hello,

    Yes, I think the collectors are overloaded, as all of them for the datacenter involved here are over the load-balancing settings, which would mean there is no more balancing happening at all!

    Thanks,

    Dom

  • Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so I'm not hitting firewalls, but some setups do go through one. Is this specific to Windows or Linux collectors?

    Hello,

    I would have to check with the Linux Team as I have only Windows Server 2016 Collectors on my side.

    Thanks,

    Dom

  • 1 hour ago, Mike Moniz said:

    have collectors located on the same network segment as the devices being monitored so not hitting firewalls

    This is the recommended architecture.

    1 hour ago, mnagel said:

    most organizations these days are moving to internal compartmentalization, which means firewalls of some sort

    If your firewalls are blocking legitimate business traffic, they need to not do that.

    Hello,

    Yes, we are trying to have the collectors and their clients on the same segment.

    Yes, we have multiple firewalls throughout the network, not on the client itself… I will recheck them, as the issue does not seem to be widespread, even within a group of servers belonging to a common application, which are typically all on the same subnet behind the same firewalls.

    Thanks,

    Dom

  • Adjusting the threadpools isn’t something to be done lightly. We have a few standards around them in our deployments based on collector size, but each collector is its own thing. I am hesitant to say what we even do as it probably isn’t the right thing for others to do.

    I suggest adding the collector into monitoring and watching the collector graphs. Look at the queue depth, tasks failing, etc., and see where it makes sense to adjust things.

    We have also noticed that even if a collector is memory-starved, most times it is better served by giving it more vCPU if possible.
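
    If you want to spot-check load across collectors programmatically rather than clicking through graphs, something along these lines works against the REST API. The resource path, response fields, and LMv1 signing details below are from memory, so treat them as assumptions and verify them against the API docs for your portal:

      # Rough sketch: list collectors with a couple of load-related fields via
      # the LogicMonitor REST API. Endpoint path and field names are assumptions.
      import base64
      import hashlib
      import hmac
      import time

      import requests  # third-party: pip install requests

      ACCOUNT = "yourportal"         # yourportal.logicmonitor.com
      ACCESS_ID = "API_ACCESS_ID"    # API token access ID
      ACCESS_KEY = "API_ACCESS_KEY"  # API token access key

      resource_path = "/setting/collectors"          # assumed collector-list resource
      epoch = str(int(time.time() * 1000))
      request_vars = "GET" + epoch + resource_path   # verb + epoch + data + path (no body)

      # LMv1 signature: base64 of the hex HMAC-SHA256 of the request vars.
      digest = hmac.new(ACCESS_KEY.encode(), request_vars.encode(), hashlib.sha256).hexdigest()
      signature = base64.b64encode(digest.encode()).decode()

      resp = requests.get(
          f"https://{ACCOUNT}.logicmonitor.com/santaba/rest{resource_path}",
          headers={"Authorization": f"LMv1 {ACCESS_ID}:{signature}:{epoch}"},
      )
      resp.raise_for_status()

      for c in resp.json().get("data", {}).get("items", []):
          # numberOfHosts / numberOfInstances are assumed field names.
          print(c.get("hostname"), c.get("numberOfHosts"), c.get("numberOfInstances"))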

  • I am pleased to announce that LM (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) has finally fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- here is the update from support last week:

    Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.

    Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.

    So I’ve had some time now on EA 34.300 with one of our “problem children”, and I am saddened to report the SNMP issues have not been addressed, at least not sufficiently. What I have observed during a spate of recent ISP disruptions, for monitoring of a remote site (via IPsec tunnel), is that LogicMonitor eventually seems to figure it out and will begin collecting data, but it takes roughly 2 hours. Having 2-hour gaps is better than indefinite gaps, but it is still unacceptable.

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

    ICMP itself seems to be fine now, actually. The problem that persists is SNMP when an intermediate stateful inspection engine (firewall) invalidates sessions. UDP is stateless, but SNMP uses a session ID that most modern firewalls recognize. Once the session ID is broken, LM stops working, since the developers chose to blindly use the same session ID indefinitely. My guess is that with the new collector code they periodically refresh the session ID, so it eventually recovers, rather than triggering a new session after a failed poll or two. The right way is very often not the way these developers roll, sadly.
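
    To make the “trigger a new session after a failed poll or two” idea concrete, here is a rough sketch using the classic pysnmp hlapi (this is not sbproxy code; the target, community string, and retry policy are placeholders): keep a long-lived engine for normal polling, but discard and rebuild it after a timeout so an intermediate firewall sees a brand-new session instead of one it has already invalidated.

      # Sketch of "re-establish the SNMP session after a failed poll" -- an
      # illustration of the suggested behavior, not how the collector works.
      from pysnmp.hlapi import (
          SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
          ObjectType, ObjectIdentity, getCmd,
      )

      TARGET = ("10.0.0.10", 161)    # placeholder device address
      COMMUNITY = "public"           # placeholder community string

      engine = SnmpEngine()          # long-lived, like the collector's cached session

      def poll_sysuptime():
          global engine
          for _ in range(2):
              error_indication, error_status, _idx, var_binds = next(
                  getCmd(
                      engine,
                      CommunityData(COMMUNITY),
                      UdpTransportTarget(TARGET, timeout=2, retries=0),
                      ContextData(),
                      ObjectType(ObjectIdentity("SNMPv2-MIB", "sysUpTime", 0)),
                  )
              )
              if not error_indication and not error_status:
                  return var_binds[0]
              # Timeout or error: throw away the cached engine and retry once with
              # a fresh one, instead of reusing a possibly invalidated session
              # until a full collector restart.
              engine = SnmpEngine()
          return None

      print(poll_sysuptime())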

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

    That actually is close to how it works now. ICMP does reset the idleInterval datapoint (or whatever the internal flag is), which is what determines host down status. However, it’s not the only thing. Any datasource that can be trusted to actually get a reply from a device should reset the idleInterval datapoint. This includes any SNMP datasources, website/http datasources, etc. It does not include scripted datasources. The thinking there is that a scripted datasource might be contacting a 3rd party system to collect data and not actually getting an actual response from the device itself. So, anything that is guaranteed to return data from the device itself should reset the idle interval counter.

    The bigger feature request here is that customers need a way to modify/override the built-in criteria for considering a device down. For some people, pingability is enough. For others, it needs to be pingable and responding to some other query. Customers need the ability to determine (at the device, group, and global levels) what constitutes a device being down. For example, I would need to be able to say that ping has to be up, but also that x/y of these other datasources must be returning data.
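
    To illustrate the kind of policy being asked for (this is the feature request expressed as a sketch, not how LM evaluates host status today), a configurable rule could look roughly like this: a host is down only if ping is failing and fewer than x of y trusted datasources have reported within their idle window.

      # Sketch of a configurable "host down" rule -- hypothetical policy, not LM's.
      from dataclasses import dataclass

      @dataclass
      class DatasourceStatus:
          name: str
          seconds_since_last_data: float

      def host_is_down(ping_ok: bool,
                       datasources: list[DatasourceStatus],
                       required_alive: int = 2,
                       idle_window: float = 300.0) -> bool:
          alive = sum(1 for ds in datasources if ds.seconds_since_last_data <= idle_window)
          # Down only when both criteria fail: no ping AND fewer than the
          # required number of trusted datasources are still reporting.
          return (not ping_ok) and (alive < required_alive)

      # Example: ping is failing, but SNMP and HTTP checks still return data,
      # so this policy would not flag the host as down.
      statuses = [
          DatasourceStatus("snmp_uptime", 45.0),
          DatasourceStatus("http_probe", 120.0),
          DatasourceStatus("wmi_cpu", 9999.0),
      ]
      print(host_is_down(ping_ok=False, datasources=statuses))   # False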