vicidial.org

Posted: **Tue Jun 14, 2022 12:23 am**

First, let me say that I am (still) running Version 2.14-670a (2.14b0.5), SVN 2973 (on all servers), DB Schema 1542, build 180424-1521, 6 of my telephony servers are running asterisk v11.22.0-vici, 1 is running 11.21.2-vici and 2 are running 13.27.0-vici, and my system was installed from the vicibox v8.1 ISO.

Next a little background...
We have recently added about 30 more agents - we are load balancing across 3 of our best servers - 1 that used to be my DB server (20 core Intel Xeon processor) and 2 that have always been Agent servers (each has 12 core Intel Xeon). With these additional agents, I'm at around 150, so - I am typically seeing 40 - 50 agents on each of these servers on the REPORTS page. Currently, my (40 core) DB server is just that - only DB and Web - and the remaining 6 servers vary in processor cores, RAM, etc - but - they're used exclusively for outbound dialing trunks - which all combine to give me 1500 - 2000 outbound dialing channels if/when I need them. Not sure if it's relevant - but - our 150 agents are spread out over 10+ separate offices (different cities, etc) - so - as much as I'd like to - I can't blame their local internet connection for the problem, since the problem I'm about to describe appears to happen everywhere.

The problem...
Over the past week or so, our agents have been experiencing random pausing between calls as well as randomly showing in a DEAD call on the Real Time screen, while still showing READY (or even PAUSED) on their screen. The random pausing isn't so bad, because the Agent sees it, but the DEAD call thing isn't obvious, so it's not resolved until someone notices and tells them to logout/in.

While trying to figure out what the issue is - I checked each server's Load Average on their Real Time screens (running that report via each server's web server, so the Load Average I see is for each server). It seems that we are experiencing a relatively HIGH Load on the DB server - commonly 5.0 - 17.5 - but the strange thing is, it's this high even when nothing appears in mtop / seems like there is nothing going on in MySQL - PLUS - top/htop shows mysql using 250 - 480% CPU or more - again - even when I don't see any active queries in mtop. The three Agent servers where all of our extensions are registering to do show some random peaks in Load Average, but they typically remain in the single digit range.

My question is this...

Is this something more likely caused by the load on our database server - OR - is it more likely caused by the number of Agent's each load balanced server is now responsible for?
Should I be looking to fix/optimize something on the DB server? Or should I be involving additional servers in our load balancing configuration by repointing the URL of 1 or 2 of each station's 3 extensions to different servers to lower the number of Agents balanced on those three servers?
If this is more likely an issue on the DB server - what suggestions do you have to help make it more efficient? Honestly - it's been running great until about 3 weeks ago, we have had times in the past where we approached 150 reps without these issues, but, it's possible that we didn't stay there long enough to have the problems be noticed.

I am running a mysqlcheck overnight tonight to see if there are any corrupt tables. Tomorrow, unless I hear anything else, I am planning on modifying the log archive crontab job to keep less data in the _log tables and more in the _log_archive ones.

Anyway - thank you for taking the time to read this - I really do appreciate any feedback/suggestions/advice you can give me - thanks!

David

Posted: **Tue Jun 14, 2022 1:57 am**

I would check mariadb slow log for reoccurring slow queries.
Do you do regular archival of the log tables? There is a script available that you can use for that and can be scheduled to run every day/week.
How many records have you got in vicidial_list? Archive all old and unneeded leads to another table.
Mysql/mariadb don't like large tables and if one or more grow beyond the breaking point, simple queries that used to take milliseconds suddenly start taking several seconds to execute and everything goes upside down.

On the other side- what browser are the agents using? Could be browser throttling the background/inactive tabs, thus leading to disconnected agent screens from the web servers. Search the forum for Chrome throttling, there are workarounds available.

Also check httpd and mariadb error logs for any abnormalities.

Posted: **Tue Jun 14, 2022 12:50 pm**

Verify networking is gigabit between all servers on a private network with no chance of bottleneck and no firewall to slow things down.

Verify you are not running out of network ports on any of the servers. Same goes for agents! If your local network has bottlenecks, it may be notable that specific workstations are more prone to failure than others.

Start a LOG of occurrences. Helpful later when you are trying to work from "memory" and "gut feeling". Don't make assumptions that two agents who groaned had the same problem: Ask, and LOG.

Verify you are using enterprise settings on the apache tuning file on web servers (/etc/apache2/server-tuning.conf). Later installs have instructions/options for this. Also consider activating/using/monitoring the Apache Server Status module. It shows the number of processes, those in use and those available. Running out of apache child processes causes erratic behavior.

Check DB (as previously mentioned) for aborted connects and other flaws. Vicidial requires absolute DB responsiveness and connectivity.

Check all services for errors (apache, mysql, asterisk, even ntp). Any service can down the system for a moment and cause irregular behavior.

Check server load at the time of these occurrences, across all servers. Not the values shown in "reports" but the "uptime" and htop Average Server Load values (against the number of cores in the machines).

Sometimes these problems are small but they are still difficult to pin down. 8-)

Posted: **Tue Jun 14, 2022 3:41 pm**

okli - yes - we run the ADMIN_archive_log_tables.pl every Sunday - we were keeping 90 days, but I just changed it to 30. Also - thinking that this could be contributing to the issue - I renamed the vicidial_log table and created a new/empty one today (yes, with the indexes) - but - the issue is still occurring. Of course, there are more log files than just vicidial_log, but, this is one of the main ones (we were already dumping the call_log table every night, starting with an empty one every day for the past few months - so that table shouldn't be an issue either). As far as the browser goes - typically, they're using Chrome, Edge (the Chromium version) or FireFox. Again - this problem is relatively new - and it's occurring across several (10+) offices located in different cities and states - so - although anything is possible - I don't think (or at least I hope) that suddenly everyone started having a browser issue simultaneously...but I will see if the stations having the random pausing / DEAD call situation share any pattern there. The vicidial_list table has about 33MM records - and has had that quantity for several years - but - I will see what I can dump from there (we make over 1MM calls/day - so - as crazy as it might sound - the vast majority of those records are actually active/necessary). I will check the log files you suggested and let you know if there is anything that looks suspect or I don't understand...thanks!

williamconley - FIRST - I found the old thread you mentioned (like you said - I shouldn't rely on my memory - so thanks for the reminder!) - anyway the "show status like '%Aborted%'"; command does show over 3,000 aborted clients, but only 80 Aborted connections so far today, - but every time I "tail /val/log/mysql/mysqld.log" - the only new entries I see are "[Warning] Aborted connection ####### to db:" - so - not sure why it's only showing 80 when tail displays 100+ in the last few minutes - only guess is that the tail is actually displaying client and not connection Aborts (even though the terminology does say "connection")? I also just ran mysqltuner.pl and it showed "Aborted connections: 0.00% (82/16482617)" which does seem to agree with the "show status like" command output. As for the rest of your comments...all of the servers are connected locally to each other in a hosting facility - while I cannot verify that it's a GB or better connection personally since they're not local to me at all - I will get confirmation from the hosting company as to the model of the switches used to connect our servers together. I do know that they are all on the same subnet - so their traffic should not have to pass through any router/firewall to reach each other. For what it's worth, when I do an "iax2 show peers" on any of the servers - ALL of the connected servers in the cluster show up as OK (1 ms) to (3 ms) maximum in the status column - so - I am pretty sure they all have a decent connection to each other. Of course, that one metric doesn't tell the entire story - but- at least it's encouraging rather than showing something like 400ms. I want to clarify that we are having issues across all of our offices - and that none of our agents are local to any of our servers - and the issue started for everyone about the same time - a week or 2 ago. Next, I will look into the server-tuning.conf file and see if anything needs to be changed based on any information I can find on it. I will see what, if anything I can learn from the logs for the services you mentioned and correct anything that appears to be failing (I would LOVE to find something failing so that this makes sense!) I have had the Real-Time screen up for all of my servers, the DEAD calls are difficult to determine the exact time of occurrence - since - there ARE actual DEAD calls from time to time - but - it's the extended duration ones that are usually the issue. I will instead PuTTY into all of my servers and run a split-screen top/htop with all of them so I can see what's going on - learn if all of them are peaking at the same/similar times.

In addition to what I reported in my previous posting, there is one other symptom that we are experiencing - in addition to DEAD calls - we get a handful of agents that show up with QUEUE as their status - some disappear after a bit - but - often I end up with 3 - 6 agents in this status for 10 - 30+ minutes (I am hoping that an actual agent isn't sitting at the desk waiting for a call for that long without letting anyone know or logging out/back in - but - you never know!) Is this status also implicated in a high Load Average/network issue type of situation?

I think I mentioned that top shows mysqld at least 300%CPU and I have seen it hit over 600% - again top shows that our DB server has 40 cores and /proc/cpuinfo shows that they are all Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz. /proc/meminfo shows 264,055,980kb (256GB).

Thank you again for your replies - I will keep digging and see if I can find what is causing this for us...ANYTHING else you can think of is truly appreciated!!!

Posted: **Tue Jun 14, 2022 4:33 pm**

Do not rely on verbal or written "from someone else" to determine if the connections are gigabit.

Code: Select all: ethtool eth0 ... Speed: 1000Mb/s

I do know that they are all on the same subnet - so their traffic should not have to pass through any router/firewall to reach each other.

Subnets can be virtual/logical rather than physical. We have VPN servers that allow agents to be on the "local/same" subnet as the servers for many clients. But that sort of connection can result in far too many delays and dropped packets for Server to Servr connections. If you can not visibly confirm the servers are on a hard-wired local physical network, try passing massive files between them and see what your throughput is. If you perform this task when not otherwise using the servers, your throughput should be spectacular, right? If it's not, then that bottleneck could cap your system capacity. And note that the pathways change with the tide. Servers pass calls to other servers, reports are run ... it's a storm that's constantly moving, so 1G to and from and between all servers is not a bad thing to confirm. Then comes the challenge of colocation: Is any of this shared physically? Some colos will combine all networks into one system and then split it out later with routers. But that means your 1G connection at night with nobody using it ... has sharing issues during the day. So check it again during a break and see if you're sharing anything. (ie: transfer massive files during a quick down-moment in the middle of the day, if possible, and see if those results are worse than the middle of the night). If it's dedicated inter-server switching, you shouldn't see any difference.

In case that didn't quite get through: Just because each of the servers is directly connected to a gigabit switch, does not prove that they are gigabit all the way to the other server(s). Anyhting 10/100 in the path bottlenecks. Anyone sharing that hardware (even on a different logical subnet) reduces capacity WHEN THEY ARE USING IT. We have had incidents where another facility firing up something major for a few minutes daily would generate a problem for a client ONLY during those few minutes. And let's not forget the Airplane interference from the MetroNet (microwave internet a client was using), lol.

ALL of the connected servers in the cluster show up as OK (1 ms) to (3 ms) maximum in the status column - so - I am pretty sure they all have a decent connection to each other.

Good ... was this during a situation when the problem was happening? Or during a "slow time" when you could check such things? If during slow time, then it has no meaning. You need to check settings (all of them) during the problem. Things Change. 8-)

we get a handful of agents that show up with QUEUE as their status - some disappear after a bit

Check your mysql errors (logs and variables) and see if those grow when these abnormalities occur. We've seen situations like this where the (fairly few, but still notable) error counts in the logs grow each time an abnormality occurs, generally indicating that the problem was in fact logged. Then comes the "ok, then what caused this?" battle, of course. Churning rdp ports faster, opening more ports, removing process limitations ... all sorts of things become necessary.

Posted: **Tue Jun 21, 2022 4:30 pm**

So - at this point, I believe that the problem has been resolved - it ended up being the ServerLimit and MaxRequestWorkers settings in the apache server-tuning.conf file - it seems that once we hit 125+ Agents, the previous setting of 384 (each) was queuing up other agents' web requests. We set these to 1024, and haven't seemed to have a problem again.

Fingers crossed!

Posted: **Tue Jun 21, 2022 4:45 pm**

Excellent postback! 8-)

vicidial.org

Agents in DEAD Calls and Randomly Pausing

Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing

Re: Agents in DEAD Calls and Randomly Pausing