Forum Discussion

Dominique
Advisor
3 years ago

Ping Failing from collector to device and back?

Hello, I am getting these errors: “VIPEIEEMP01 is suffering ping loss. 100.0% of pings are not returning, placing the host into critical state.”
“The host VIPEIEEMP01 is down. No data has been received.”
 

If I do a ping from the command prompt of the client to the collector, it works.

The firewall is wide open!

What did I miss?
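
For context, a quick way to measure sustained loss in the direction LogicMonitor actually tests (collector to device) is a loop like this on the collector host; the target name and probe count below are just placeholders, and the flags assume a Windows collector:

    # Minimal loss-tracking sketch. Run it on the collector, not the monitored
    # client -- a test from the client's command prompt goes in the opposite
    # direction from the collector's own ping checks.
    import subprocess
    import time

    TARGET = "VIPEIEEMP01"   # device the collector is alerting on
    PROBES = 60              # one probe per second for a minute
    sent = lost = 0

    for _ in range(PROBES):
        sent += 1
        result = subprocess.run(
            ["ping", "-n", "1", "-w", "1000", TARGET],  # Windows ping syntax
            capture_output=True,
            text=True,
        )
        if result.returncode != 0:   # non-zero exit means no reply (approximate)
            lost += 1
        time.sleep(1)

    print(f"{lost}/{sent} probes lost ({100.0 * lost / sent:.1f}% loss)")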

Thanks,

Dom

  • Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so I'm not hitting firewalls, but some setups do go through one. Is this specific to Windows or Linux collectors?

    Hello,

    It is Windows Server 2019 Standard for the client VIPEIEEMP01 and Windows Server 2016 Standard for the “preferred” collector VIPLGCMON02.

    The client has an external IP, so it is attached to the closest datacenter for its collector.

    Thanks,

    Dom

  •  @Joe Williams 

    What size were your collectors, and what did you end up adjusting the threadpool to?

    I’m new to LM and have been working on implementing it within my company for the past few months. I’ve been facing this issue since December and have been going back and forth with LM Support. First, they had me do a fresh install; then they said there was a bug in v33.x of the collector and I needed to downgrade; the last step was to completely delete the collectors from the portal and add them again. At this point LM Support claims they have done everything they can and have escalated to an internal ticket, but I haven’t been able to get any updates on the status of that ticket. I’m willing to make adjustments to the threadpools/timeouts if needed to see if that resolves the issue for me.

    Hello,

    Windows Server 2016 Standard

    CPU: Quad 2.40 GHz Intel Xeon Silver 4214R

    RAM: 16 GB

    Disk: 75 GB

    Collector Version: 32.003

    Let me know what I should do to adjust the threadpool.

    Thanks,

    Dom

  • I suppose it is possible there are two different issues -- needing threadpool changes indicates internal resources are exhausted for checks. The original issue is still our constant problem -- static ID values cause sessions to become invalidated by firewalls when there is a disruption on the firewall path. A new session must be observed by the firewall to start letting traffic through again, which can only be done (currently) by restarting the collector, since the code allocates those session ID values only once at startup. LM has been informed about this repeatedly, is aware of it, and does nothing to fix it. The only thing I’ve seen is a doc note blaming particular firewalls, but it impacts pretty much any stateful inspection firewall. More and more folks use firewalls for internal segmentation, and there are many cases where a remote collector is needed due to lack of resources to deploy a local collector.
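
    As a toy illustration of that failure mode (this is not LM or firewall code, just a sketch of the behavior described above): the state table below only lets traffic through for a session ID it has not invalidated, so a collector that keeps reusing the one ID it allocated at startup stays blocked until a restart produces a new one.

      # Toy model of the described failure mode -- not real firewall logic.
      class StatefulFirewall:
          def __init__(self):
              self.active, self.invalidated = set(), set()

          def allows(self, session_id):
              # New IDs establish state; IDs tied to an invalidated session stay
              # blocked until the endpoint starts a genuinely new session.
              if session_id in self.invalidated:
                  return False
              self.active.add(session_id)
              return True

          def disrupt(self):
              # e.g. HA failover, policy push, or idle timeout on the firewall path.
              self.invalidated |= self.active
              self.active.clear()

      fw = StatefulFirewall()
      startup_id = 0x1A2B                # allocated once when the collector starts

      print(fw.allows(startup_id))       # True  -> polls succeed
      fw.disrupt()
      print(fw.allows(startup_id))       # False -> every later poll is dropped
      print(fw.allows(0x3C4D))           # True  -> only a restart (new ID) recovers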

    Hello,

    Yes, I think the collectors are overloaded, as all of them for the datacenter involved here are over the load-balancing settings, which would mean there is no more balancing happening at all!

    Thanks,

    Dom

  • Interesting, I haven't run into that yet myself. I typically have collectors located on the same network segment as the devices being monitored, so I'm not hitting firewalls, but some setups do go through one. Is this specific to Windows or Linux collectors?

    Hello,

    I would have to check with the Linux Team as I have only Windows Server 2016 Collectors on my side.

    Thanks,

    Dom

  • 1 hour ago, Mike Moniz said:

    have collectors located on the same network segment as the devices being monitored so not hitting firewalls

    This is the recommended architecture.

    1 hour ago, mnagel said:

    most organizations these days are moving to internal compartmentalization, which means firewalls of some sort

    If your firewalls are blocking legitimate business traffic, they need to not do that.

    Hello,

    Yes, we are trying to have the collectors and their clients on the same segment.

    Yes, we have multiple firewalls throughout the network, not on the client itself… I will recheck them, as the issue does not seem to be widespread, even within a group of servers belonging to a common application, which are typically all on the same subnet behind the same firewalls.

    Thanks,

    Dom

  • Adjusting the threadpools isn’t something to be done lightly. We have a few standards around them in our deployments based on collector size, but each collector is its own thing. I am hesitant to say what we even do as it probably isn’t the right thing for others to do.

    I suggest adding the collector into monitoring and watching the collector graphs. Look at the queue depth, tasks failing, etc., and see where it makes sense to adjust things.

    We have also noticed that even if a collector is memory-starved, most times it is better served by giving it more vCPU if possible.
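
    If you want to spot-check load across collectors programmatically rather than clicking through graphs, something along these lines works against the REST API. The resource path, response fields, and LMv1 signing details below are from memory, so treat them as assumptions and verify them against the API docs for your portal:

      # Rough sketch: list collectors with a couple of load-related fields via
      # the LogicMonitor REST API. Endpoint path and field names are assumptions.
      import base64
      import hashlib
      import hmac
      import time

      import requests  # third-party: pip install requests

      ACCOUNT = "yourportal"         # yourportal.logicmonitor.com
      ACCESS_ID = "API_ACCESS_ID"    # API token access ID
      ACCESS_KEY = "API_ACCESS_KEY"  # API token access key

      resource_path = "/setting/collectors"          # assumed collector-list resource
      epoch = str(int(time.time() * 1000))
      request_vars = "GET" + epoch + resource_path   # verb + epoch + data + path (no body)

      # LMv1 signature: base64 of the hex HMAC-SHA256 of the request vars.
      digest = hmac.new(ACCESS_KEY.encode(), request_vars.encode(), hashlib.sha256).hexdigest()
      signature = base64.b64encode(digest.encode()).decode()

      resp = requests.get(
          f"https://{ACCOUNT}.logicmonitor.com/santaba/rest{resource_path}",
          headers={"Authorization": f"LMv1 {ACCESS_ID}:{signature}:{epoch}"},
      )
      resp.raise_for_status()

      for c in resp.json().get("data", {}).get("items", []):
          # numberOfHosts / numberOfInstances are assumed field names.
          print(c.get("hostname"), c.get("numberOfHosts"), c.get("numberOfInstances"))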

  • I am pleased to announce that LM (after nearly 5 years of back-and-forth -- my first attempt to get this addressed was in June 2018) has finally fixed both the SNMP and ping issues impacted by intermediate firewall session invalidation -- here is the update from support last week:

    Our development team has acknowledged the issues you outlined with Ping. Currently the behavior is to have cached sessions for ICMP ping and then reuse them, only refreshing the cache on sbproxy restart. An alternative has been in development and will be fixed in the next EA release. Similar issues with SNMP have been addressed already in EA 34.100.

    Hopefully this is actually the case, but if so it will be very nice to tell our clients this longtime bug has finally been quashed.

    So I’ve had some time now on EA 34.300 with one of our “problem children”, and I am saddened to report the SNMP issues have not been addressed, at least not sufficiently. What I have observed during a spate of recent ISP disruptions, for monitoring of a remote site (via IPsec tunnel), is that LogicMonitor eventually seems to figure it out and will begin collecting data, but it takes roughly 2 hours. Having 2-hour gaps is better than indefinite gaps, but it is still unacceptable.

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

    ICMP itself seems to be fine now, actually. The problem that persists is SNMP when an intermediate stateful inspection engine (firewall) invalidates sessions. UDP is stateless, but SNMP uses a session ID that most modern firewalls recognize. Once the session ID is broken, LM stops working, since the developers chose to blindly use the same session ID indefinitely. My guess is that with the new collector code they periodically refresh the session ID, so it eventually recovers, rather than triggering a new session after a failed poll or two. The right way is very often not the way these developers roll, sadly.
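
    To make the “trigger a new session after a failed poll or two” idea concrete, here is a rough sketch using the classic pysnmp hlapi (this is not sbproxy code; the target, community string, and retry policy are placeholders): keep a long-lived engine for normal polling, but discard and rebuild it after a timeout so an intermediate firewall sees a brand-new session instead of one it has already invalidated.

      # Sketch of "re-establish the SNMP session after a failed poll" -- an
      # illustration of the suggested behavior, not how the collector works.
      from pysnmp.hlapi import (
          SnmpEngine, CommunityData, UdpTransportTarget, ContextData,
          ObjectType, ObjectIdentity, getCmd,
      )

      TARGET = ("10.0.0.10", 161)    # placeholder device address
      COMMUNITY = "public"           # placeholder community string

      engine = SnmpEngine()          # long-lived, like the collector's cached session

      def poll_sysuptime():
          global engine
          for _ in range(2):
              error_indication, error_status, _idx, var_binds = next(
                  getCmd(
                      engine,
                      CommunityData(COMMUNITY),
                      UdpTransportTarget(TARGET, timeout=2, retries=0),
                      ContextData(),
                      ObjectType(ObjectIdentity("SNMPv2-MIB", "sysUpTime", 0)),
                  )
              )
              if not error_indication and not error_status:
                  return var_binds[0]
              # Timeout or error: throw away the cached engine and retry once with
              # a fresh one, instead of reusing a possibly invalidated session
              # until a full collector restart.
              engine = SnmpEngine()
          return None

      print(poll_sysuptime())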

  • ICMP is a terrible way for the Collector to determine “Host Dead” status.  Better would be if any DataSource is able to collect data from it.  If data can be collected, the host isn’t dead.

    That actually is close to how it works now. ICMP does reset the idleInterval datapoint (or whatever the internal flag is), which is what determines host down status. However, it’s not the only thing. Any datasource that can be trusted to actually get a reply from a device should reset the idleInterval datapoint. This includes any SNMP datasources, website/http datasources, etc. It does not include scripted datasources. The thinking there is that a scripted datasource might be contacting a 3rd party system to collect data and not actually getting an actual response from the device itself. So, anything that is guaranteed to return data from the device itself should reset the idle interval counter.

    The bigger feature request here is that customers need a way to modify/override the built-in criteria for considering a device down. For some people, pingability is enough. For others, it needs to be pingable and responding to some other query. Customers need the ability to determine (at the device, group, and global levels) what constitutes a device being down. For example, I would need to be able to say that ping has to be up, but also that x/y of these other datasources must be returning data.
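
    To illustrate the kind of policy being asked for (this is the feature request expressed as a sketch, not how LM evaluates host status today), a configurable rule could look roughly like this: a host is down only if ping is failing and fewer than x of y trusted datasources have reported within their idle window.

      # Sketch of a configurable "host down" rule -- hypothetical policy, not LM's.
      from dataclasses import dataclass

      @dataclass
      class DatasourceStatus:
          name: str
          seconds_since_last_data: float

      def host_is_down(ping_ok: bool,
                       datasources: list[DatasourceStatus],
                       required_alive: int = 2,
                       idle_window: float = 300.0) -> bool:
          alive = sum(1 for ds in datasources if ds.seconds_since_last_data <= idle_window)
          # Down only when both criteria fail: no ping AND fewer than the
          # required number of trusted datasources are still reporting.
          return (not ping_ok) and (alive < required_alive)

      # Example: ping is failing, but SNMP and HTTP checks still return data,
      # so this policy would not flag the host as down.
      statuses = [
          DatasourceStatus("snmp_uptime", 45.0),
          DatasourceStatus("http_probe", 120.0),
          DatasourceStatus("wmi_cpu", 9999.0),
      ]
      print(host_is_down(ping_ok=False, datasources=statuses))   # False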