vicidial.org

by **rwfaught** » Mon Apr 05, 2010 11:57 am

I've tried most of the solutions I've found on this forum and I'm still getting the "Your Session has Been Paused" message often. It's not every call; when it's bad, it's usually about every 3rd or 4th call per agent. Sometimes we go for a good while without the issue.

For those who have experienced this, what are the most likely causes?

Right now we have a ViciDialNOW CE installation on the dialer with 8 agents, and a database server that is running ViciBOX. We were using the DB server to dial from and were getting the same issue, so I loaded ViciDialNOW on a separate box to try to fix the problem. We are not dialing from the DB server currently. Please let me know what I need to provide to help solve this problem.

by **mflorell** » Mon Apr 05, 2010 12:55 pm

Please describe your network.

Does this happen with only one agent on?

Is this manual dial/inbound/auto-dial-outbound campaign?

CPU/RAM/Operating-System of the agent workstations?

by **rwfaught** » Mon Apr 05, 2010 1:49 pm

mflorell wrote:Please describe your network.

Does this happen with only one agent on?

Is this manual dial/inbound/auto-dial-outbound campaign?

CPU/RAM/Operating-System of the agent workstations?

Thanks Mark for replying.

We are running one campaign right now -- outbound, adapt-tapered or ratio depending on the list. I have all others inactive.

I have not tried dialing with only one agent. The problem does seem to occur more often with a couple of agents. I can have those agents move to another PC and it still happens. I have tried to create a new user for those agents. Still happens.

Our agent PCs are all:

Pentium 4 2.8 GHz
500 MB RAM
Windows XP
Firefox
Zoiper

The network switch is a Catalyst 2980G-A, which we replaced and then put back when the replacement did not solve the problem. That switch is connected to both the DB and Dialer servers directly. The Cisco IAD2800 router is also directly connected to that switch. We have autosensing configured on the switch ports and 10/100 hard-set on the IAD 0/0 and 0/1 interfaces. We have 2 T1 lines and are using Qwest SIP trunking.

by **mflorell** » Mon Apr 05, 2010 2:01 pm

Does this happen if you do manual dial or INBOUND_MAN with manual list dialing?

by **rwfaught** » Mon Apr 05, 2010 2:34 pm

mflorell wrote:Does this happen if you do manual dial or INBOUND_MAN with manual list dialing?

Not sure. We have a shift change at 3:30 CST. I'll keep one agent on until 4:00 CST and see.

by **jamestaylor** » Sun Apr 11, 2010 9:55 am

Recently we installed new web servers and had this problem only when agents were on the new web servers.

I made the following change to my apache2.conf file and I haven't had mass reports of the issue since.

Code:

```apache
KeepAliveTimeout 30
```

by **Nikhil Vohra** » Mon Apr 12, 2010 1:18 am

Vicidial = 2.2.0-252
build = 100207-1109
4 Servers
a). Suse 11.2-Database
b). Suse 11.2-Dialer-1
c). Suse 11.2-Dialer-2
d). Suse 11.2-Dialer-3

asterisk version 1.4.21.2

apache-2.2.14

I am having the same issue and yes this is happening with only one agent.

by **vccsdotca** » Wed Apr 14, 2010 5:02 pm

Not only is this happening with my client's server, but he also keeps seeing an agent in the ready position while the admin summary says they are on pause. At that point the agent's screen and time seem to be locked. The server is remote, but I have also seen this on our internal server using strictly auto dial (ratio) and nothing else. Furthermore, there are serious audio issues with choppiness no matter the latency or the resources we give the server. So far it all seems to come back to vicidialnow 1.3. We have also just upgraded to SVN 2.2.0 with no success. We will be deploying a vicibox server and will post any updated results.

by **mflorell** » Wed Apr 14, 2010 10:04 pm

zaptel timer?

by **vccsdotca** » Fri Apr 16, 2010 10:39 am

This may be related to running the dialer in VMware ESX server. From what I know, with VMware tools installed the timing should be stable and accurate, but I was reading a post (http://www.vicidial.org/VICIDIALforum/v ... ptel+timer) about issues with VMware and the ticks. I was also reading http://www.voip-info.org/wiki/view/Aste ... er+ztdummy in regard to "ztdummy and Kernel 2.6", and apparently there are some issues with IAX and meetme.

Anyway, I have 2 virtual machines with only 3 agents each, and our auto pause issue happens very often, on auto dial only, as well as intermittent audio issues when dialing through conferences only, never while using vicidial as a proxy. Our network is pretty solid and we will do some tests for packet loss and QoS. However, does this seem like a zaptel issue, a vmware issue, or a vicidialnow (centos based) issue? I have also seen posts of people increasing the mysql max connections number ? :s

vmware 1.3 with svn 2.2.0, unlimited ram and cpu at max is 50%

by **vccsdotca** » Fri Apr 16, 2010 11:47 am

Just want to confirm a couple of firewall rules also.
Correct me if I'm wrong, but IAX port 4569 does not need to be open to the outside world in a pure SIP environment; only localhost access is needed, correct? Also, ping requests to my remote network (agent side) are blocked, and I can see these ping packets denied on the server side, but I think this is only for keep-alive, correct?

Thanks a bunch.

by **mflorell** » Fri Apr 16, 2010 12:16 pm

correct

by **vccsdotca** » Mon Apr 19, 2010 12:29 pm

Thanks Matt. On another note, I noticed the server and the agent Windows stations have a 6 second time difference. Could this account for any of these issues?

by **mflorell** » Mon Apr 19, 2010 12:45 pm

it shouldn't

by **vccsdotca** » Mon Apr 19, 2010 1:13 pm

Hmm. Well, we have 2 ideas now. One is to move the machine to a dedicated server and eliminate the shared CPU from the other virtual machines (which shouldn't matter because we have unlimited resources) in order to maximize clock tick reliability. The other is to install and test from a vicibox server to eliminate vicidialnow and the centos distro. The major problem is the agent pause issue happening so frequently (http://www.eflo.net/VICIDIALforum/viewt ... edfd762cf7). I have done multiple jitter and QoS tests from remote to server and that's fine: no dropped firewall packets, latest firefox browsers, and agent qualifying times are normal for remote users at around 150ms. Would there be any obvious suggestions here?

by **mflorell** » Mon Apr 19, 2010 3:32 pm

on a true virtual server there is no way to guarantee proper ticking, even on a dedicated server under high load it is not 100%, which is why we use the Sangoma voicetime USB sticks to guarantee timer accuracy and this is why all of our ViciDial Hosted servers have these in them.

by **vccsdotca** » Mon Apr 19, 2010 4:43 pm

Thanks Matt. With the new zaptel drivers in 2.6 kernels these are not needed any longer, according to the ztdummy link I posted previously. Are you using 2.4 kernels? And from what I understand, your company is actually giving physical machines with your hosting?

by **mflorell** » Mon Apr 19, 2010 7:52 pm

Under high call load using very recent Linux 2.6 kernel versions the ztdummy (and dahdi_dummy) timers can still come out of sync. We have proven this, which is why we are using the Sangoma USB timers.

Yes, we allocate individual physical servers to each of our hosted clients (some clients have more than one as well).

by **vccsdotca** » Mon Apr 26, 2010 2:23 pm

So our testing has provided some useful information. So far these issues, mostly the audio issues while in conferences, come down to the zap timing. For all those wondering what your timing looks like, simply run the zttest command from the Linux shell.

Now for vicidialNOW, which is running on CentOS, we were consistently getting 93.xxx%, while drops went to anywhere from 70 or 89 to even -10%. Upgrading to DAHDI on vicidialnow boosted performance right away to 96-97%, which is a significant improvement.

Ultimately, we upgraded to vicibox on Ubuntu and out of the box were fluctuating between 100% and 99.987793% consistently. Using DAHDI actually decreased the quality. We will be testing the server this week and I am eager to see what results come from it, but I have immediately seen a huge increase in call quality. The GUI timeout issues are the next issue.
It should be noted that I could not find the cause of the time freezing and other related issues. When the clock percent dropped severely (for example 96 --> 80%) it did not trigger the GUI errors. We have also ordered an FXO card with a timer, which will be the final piece of the puzzle if needed.

Will update..

by **vccsdotca** » Tue Aug 10, 2010 12:08 pm

I have done some follow up testing on this issue as mentioned I would.
I am a bit confused with the results of my further tests and would appreciate some feedback.

Upon installing the new VICIbox Redux 2.0.2 I did a dahdi dummy test out of box with these results (in virtual env.):

Best = 100%
Worst = -8.190%
Avg = 99.800738

------------
Now with same software and X100P FXO (non virtual env):

Best = 100%
Worst = 98.864
Avg = 99.913517

I admit that -8 is a significant result, but if you see my prior post on ztdummy you will notice I didn't get that low. I also don't think I let the FXO timer run long enough to see if the numbers would drop heavily in a single tick, as per normal behaviour.
So my remaining question is: what is a hardware timing source really supposed to do?

My next tests will be with SIPp under the same configs...

by **williamconley** » Tue Aug 10, 2010 4:52 pm

my experience: when the clock manages to get "off" by more than a few seconds, your system will kick out agents. in addition to this, if your agent loses contact with the server (from his agent screen), he will be kicked.

now: what causes these two things? there is a random number field in one of the live tables that is checked regularly and is set with a tolerance of (i believe) 6 seconds. if the field does not change for 6 seconds, the agent is kicked.

now: what can cause it to not be updated for 6 seconds?

1) agent loses contact via AJAX (which is what initiates the mysql change to the field).

2) clock is off between systems (mysql timestamp vs apache and box vs box)

the timer is meant to keep the system time (even during heavy load) within that 6 second barrier. since the ztdummy software PC timer, when overloaded, can become off by enough to cause the 6 second flaw, and timing issues can mess with apache (causing disconnects with agents) ... a SOLID timer can reduce or eliminate these issues. remember that a hardware timer can keep time without relying on CPU ticks, so its clock will be right regardless.

now, the deep details of which process looks at the time and under which circumstances and intervals ... you'll have to check the agent screen AJAX and follow it through to the field being updated and find the processes that "check" this field (and that "pause" the agent)
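A minimal Python sketch of the kind of staleness check described above may help show why both causes lead to the same pause. Everything here is hypothetical: the `AgentHeartbeat` record, the function names, and the 6 second tolerance are illustrative stand-ins taken from the description in this post, not ViciDial's actual tables or code.

```python
from dataclasses import dataclass

# Tolerance as recalled in the post above; not checked against the ViciDial source.
TOLERANCE_SECONDS = 6.0


@dataclass
class AgentHeartbeat:
    """Hypothetical stand-in for one agent's row in a live table."""
    token: int           # the "random number field"
    last_changed: float  # time (in seconds) at which the token last changed


def record_ajax_ping(hb: AgentHeartbeat, new_token: int, stamped_at: float) -> None:
    """Called when the agent screen's AJAX request reaches the server."""
    if new_token != hb.token:
        hb.token = new_token
        hb.last_changed = stamped_at


def should_pause(hb: AgentHeartbeat, checker_now: float,
                 tolerance: float = TOLERANCE_SECONDS) -> bool:
    """The checking process pauses the agent if the token looks stale."""
    return (checker_now - hb.last_changed) > tolerance


# Cause 1: the agent loses contact, so no ping arrives for 7 seconds.
hb = AgentHeartbeat(token=1, last_changed=100.0)
lost_contact = should_pause(hb, checker_now=107.0)

# Cause 2: a ping arrived, but the checker's clock runs 7 seconds ahead
# of the clock that stamped the row.
hb = AgentHeartbeat(token=1, last_changed=100.0)
record_ajax_ping(hb, new_token=2, stamped_at=101.0)
clock_skew = should_pause(hb, checker_now=101.0 + 7.0)

# Healthy case: same clock, last ping one second ago.
healthy = should_pause(hb, checker_now=102.0)

print(lost_contact, clock_skew, healthy)
```

In this sketch the checker cannot tell a silent agent from a skewed clock, because all it sees is the difference between two timestamps.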

i am seriously enjoying your research, by the way. keep posting

by **vccsdotca** » Tue Aug 10, 2010 5:18 pm

Hey thanks Will! and thanks for the feedback, great information. Can you elaborate on these issues for me;

there is a random number field in one of the live tables that is checked regularly

what is this doing exactly, keepalive?
This is a big issue right now. I see agents logging in and out every hour because of deadlock.

clock is off between systems (mysql timestamp vs apache and box vs box)

in a virtual system where every service is on the same box, is this referring to the agent/server or the host/guest being off?

Thanks!

by **mflorell** » Tue Aug 10, 2010 5:41 pm

I think you two are talking about different things.

Dahdi/zaptel timer is used for 1 millisecond timing that keeps audio streams synchronized. This affects meetme, music on hold, recording and IAX trunking if the timing is not consistently 1 ms or very close to it.

Time synchronization across machines on a network is a very different subject and has nothing to do with Dahdi/zaptel, and it is much more forgiving if it is off by a couple seconds(but your logs can then also be off by a couple seconds).

by **williamconley** » Tue Aug 10, 2010 6:32 pm

strangely enough they still seem to be connected. of course, it could be that the stress on the system is reduced with a hardware timer and then the agents don't get kicked because the unstressed system doesn't break the connection with the agent as often, but then again, that's why I love this particular thread. it's giving some very interesting details. and reducing my need to research (and read code).

by **vccsdotca** » Tue Aug 10, 2010 7:05 pm

Matt,

I know that you prefer to use HWT's in addition to physical servers. What was your experience with using the servers without a HW timer?

Also, from what I gather, the voice quality issues caused by virtualizing vicidial are because of the zaptel/dahdi timing, since Vici is all Meetme.
I could see that causing issues when zttest shows below 99.8%, for example, but when it goes down to a lower level it is random and jumps back up (99.8 - 99.993 - 99.9999). For poor voice quality to be noticed I would expect at least a 100ms(?) delay from the 8000ms, and it would have to be for an extended period of time. So why would it break unless you were at a consistent 97% or lower?

Also, this wouldn't interrupt the timing between the agent station causing the deadlock, which I'm understanding is the kernel timing. I'm doing some testing using a Perl script.

Code:

```perl
#!/usr/bin/perl
# Get drift
my $lines = `ntpdate -q pool.ntp.org`;
my @lines = split(/\s/,$lines);
my $dummy = pop(@lines);   # discard the last field of the output
my $drift = pop(@lines);   # the field before it is the reported offset
my $nowstring = `date +"%Y%m%d-%H%M%S"`;
chomp $nowstring;
print "$nowstring -> $drift\n";
```

It outputs the time drift of the system. I am seeing anywhere from "-0.012569" to "-0.00083". Certainly nowhere near the 6 second mark. Any ideas?
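For comparing logged drift values against the 6 second figure mentioned earlier in the thread, the same parsing can be sketched in Python. This assumes the last line of `ntpdate -q` output ends in `offset <value> sec`, which is the same assumption the Perl script makes. The sample line below is made up for illustration (timestamp, PID and address are not real output), and `exceeds_tolerance` is a hypothetical helper.

```python
def parse_offset(ntpdate_output: str) -> float:
    """Return the offset in seconds from `ntpdate -q` style output.

    Mirrors the Perl script: the last whitespace-separated token is
    dropped and the one before it is taken as the offset.
    """
    tokens = ntpdate_output.split()
    return float(tokens[-2])


def exceeds_tolerance(offset_seconds: float, tolerance: float = 6.0) -> bool:
    """True if the clock is off by more than the tolerance, in either direction."""
    return abs(offset_seconds) > tolerance


# Illustrative line only, shaped like ntpdate's summary line.
sample = "10 Aug 19:05:00 ntpdate[1234]: adjust time server 192.0.2.1 offset -0.012569 sec\n"

offset = parse_offset(sample)
print(offset, exceeds_tolerance(offset))
```

Taking the absolute value matters here, since a clock that runs behind reports a negative offset.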

by **mflorell** » Tue Aug 10, 2010 7:16 pm

We have noticed that the dahdi/zaptel timer inaccuracy can actually be cumulative in some cases. If the timer hits a bad patch it will keep propagating audio issues long past when the timer has recovered.

Using software timers we try to keep loadavg 20-30% lower than we would using a hardware timer and there is almost no chance of having any audio issues.

by **vccsdotca** » Tue Aug 10, 2010 7:17 pm

Will,

I believe so also. zttest and dahdi_test check whether 8000ms has in fact passed since reading 8192 bytes from the timing source. These commands use gettimeofday, so if the internal clock is off it could be causing the zaptel/dahdi delay illusion.
Furthermore, if the clock is off, perhaps the Z/D timing is accurate but read wrong by the system. However, I can't account for the variance of zttest and dahdi_test on the physical server with the X100P (99.913517 avg.).
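To put numbers on the percentages discussed in this thread, here is a rough Python sketch of how such a score could be computed. It rests on assumptions that are not confirmed anywhere in the thread: that the test reads 8192 samples at 8000 samples per second (so the ideal elapsed time is 1.024 seconds), and that the score is 100 minus the relative error between samples read and sample intervals elapsed on the system clock. The function name `timing_score` is hypothetical.

```python
SAMPLE_RATE_HZ = 8000  # assumed: one sample every 125 microseconds


def timing_score(samples_read: int, elapsed_seconds: float) -> float:
    """Percent score: 100 minus the relative error between the samples read
    and the number of sample intervals that elapsed on the system clock."""
    elapsed_intervals = elapsed_seconds * SAMPLE_RATE_HZ
    return 100.0 - 100.0 * abs(samples_read - elapsed_intervals) / samples_read


expected = 8192 / SAMPLE_RATE_HZ  # ideal time for 8192 samples: 1.024 s

# Clock reading off by a single sample interval (125 microseconds).
one_interval_late = timing_score(8192, expected + 1 / SAMPLE_RATE_HZ)

# Clock reading off by 100 ms over the same read.
hundred_ms_late = timing_score(8192, expected + 0.100)

# The read appears to take more than twice as long as it should.
more_than_double = timing_score(8192, 2.2)

print(f"{one_interval_late:.6f} {hundred_ms_late:.3f} {more_than_double:.3f}")
```

Under these assumptions, a one-interval error gives 99.987793%, a 100 ms error gives roughly 90.2%, and the score only goes negative when the measured time is more than double the expected time. Because the elapsed time comes from the system clock, a jump in that clock would lower the score even if the timing source itself were steady.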

by **vccsdotca** » Tue Aug 10, 2010 7:20 pm

Thanks Matt,

Would you be willing to run the perl script and dahdi_test for a few mins each from one of your servers and post some results?

by **mflorell** » Tue Aug 10, 2010 7:32 pm

I don't have any servers without hardware timers sitting around at the moment, but I can send it on to the guys in the office that build our servers and see if they can try it out the next time they build a new server.

by **williamconley** » Tue Aug 10, 2010 7:55 pm

what versions would you like it run on? we NEVER use hardware, so I have lots of those (i must have cheap customers, LOL)

by **vccsdotca** » Tue Aug 10, 2010 8:53 pm

Actually Matt, I was looking for results from a physical machine with a HW timer. Also, William, if you can do some from your machines that would be cool. I'm not sure if OS is an issue at the moment, however I know Ubuntu has been the best on vmware thus far.

Also Matt, what do you think would be a cause for the deadlock of agent GUI?

thx guys

by **williamconley** » Tue Aug 10, 2010 9:11 pm

define deadlock. it locks up if you miss an ajax call. (every second, miss one and you're done unless you have the latest code, matt added a possible recovery to allow missing a call, but not several in a row)
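The difference between "miss one and you're done" and "allow missing a call, but not several in a row" can be sketched as a consecutive-miss counter. This is a hypothetical illustration in Python, not the agent screen's actual logic: `screen_locks` and `allowed_consecutive_misses` are made-up names, and the real threshold in the newer code is not stated in this thread.

```python
def screen_locks(polls, allowed_consecutive_misses=0):
    """polls: one boolean per one-second AJAX call (True = server answered).

    Returns True if the run of consecutive misses ever exceeds the allowance.
    """
    misses = 0
    for answered in polls:
        if answered:
            misses = 0  # any successful call resets the count
            continue
        misses += 1
        if misses > allowed_consecutive_misses:
            return True
    return False


isolated_miss = [True, False, True, True, False, True]
burst_of_misses = [True, False, False, True]

# Older behaviour as described: a single missed call is fatal.
old_isolated = screen_locks(isolated_miss)

# Newer behaviour as described: one miss is tolerated, a run of misses is not.
new_isolated = screen_locks(isolated_miss, allowed_consecutive_misses=1)
new_burst = screen_locks(burst_of_misses, allowed_consecutive_misses=1)

print(old_isolated, new_isolated, new_burst)
```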

by **vccsdotca** » Tue Aug 10, 2010 10:16 pm

i mean when the agent station freezes. what would cause a missed ajax call.. clock tick issue or network packet loss etc..

by **williamconley** » Tue Aug 10, 2010 10:24 pm

ordinarily packet loss, but if there is a virtual machine involved ... that could easily result from a missed tick on the Virtual server causing it to not respond to the client web request.

by **mflorell** » Wed Aug 11, 2010 7:46 am

We have added code recently to reduce deadlocks in SVN/trunk, it has helped clients with chronic bandwidth issues with about 50% reduction in LAGGED pauses.

Clients of ours with no bandwidth issues usually have no problems at all with LAGGED pauses, or deadlocks as you call them.

by **TroyD** » Tue Feb 15, 2011 7:53 pm

Interesting thread. I found it because agents have just recently started getting the "your session has been paused" error. I am using VMware ESX Server v3.0.2, with my MySQL server on one VM and the web server on another VM. The actual dialer is running on a third server that is a physical box. We have used this configuration for about 3.5 years now and have gone through about 4 or 5 version upgrades, and until about a week ago we had never gotten the error. I elected to use physical hardware for the dialer because of the critical timing issues with the Zaptel services and hardware, as in the past I have had really screwy audio issues running asterisk in a virtual machine.

I did have a weird issue where the MySQL server was about 2 minutes behind the other servers, apparently due to an NTP server having issues. I corrected the ntpd.conf to move to a pair of stratum 2 time servers locally here in Florida and the issue disappeared, at least until I got in the car and started to drive home, then poof, it was back. I think in the past it happened one time and I had to clear the auto_calls table, and it never returned, but alas it has reared its ugly head again, and rather suddenly.

We typically have about 15 agents on average logged into the server and, according to the management interface, the server load remains below 15%. I am going to try to look a bit further into what the issues are. I am hopeful from what I have seen here in the thread. Is there anything I can do to help contribute to the investigation of this issue by supplying results from my virtual/physical hybrid? I use IBM servers and Fibre Channel storage for the infrastructure, as we have done in the past for ESX deployments. I would like to figure out what is causing the pausing all of a sudden when we have run for years without issue. The only thing that sticks out to me is that we have obviously grown and over time have added more agents. Maybe we are hitting the wall here and it is time to add another physical dialer server to load balance.

The timing is done via a T1 PRI card (Digium TE120P), although it is not currently used. We send the calls through the Sangoma CPD and out to another standalone asterisk box that has the Sangoma single T1 card (I forget the model now, I think it's a 101 something), and that box acts as the "GATEWAY" box. We just had a new dual T1 bonded circuit put in that is purely SIP, so I am hopefully turning that up tomorrow and we will use that instead of the standalone box for our gateway, as 23 channels is no longer enough to handle the call load. This will increase the capability to 38 channels, since Sangoma requires the G711 codec to be able to do call progress detection. If there is anything I can do to help with this mystery, please let me know and I will test and post any results.
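One way a figure of 38 channels can come out of a bonded dual T1 carrying G.711 is sketched below in Python. The poster does not say how the number was derived, so the inputs are assumptions: 1536 kbps of payload per T1 (24 timeslots of 64 kbps), 20 ms packetization, and per-packet overhead of IPv4, UDP and RTP headers only, with layer-2 framing ignored. The function names are hypothetical.

```python
T1_PAYLOAD_KBPS = 24 * 64      # 24 DS0 timeslots of 64 kbps = 1536 kbps
G711_PAYLOAD_KBPS = 64         # G.711 codec bitrate
PACKETS_PER_SECOND = 50        # assumed 20 ms packetization
HEADER_BYTES = 20 + 8 + 12     # IPv4 + UDP + RTP headers per packet


def g711_call_kbps() -> float:
    """Bandwidth of one G.711 call in one direction, including IP/UDP/RTP headers."""
    header_kbps = PACKETS_PER_SECOND * HEADER_BYTES * 8 / 1000
    return G711_PAYLOAD_KBPS + header_kbps


def max_g711_calls(t1_count: int) -> int:
    """Whole calls that fit in the payload of the given number of bonded T1s."""
    return int(t1_count * T1_PAYLOAD_KBPS // g711_call_kbps())


print(g711_call_kbps(), max_g711_calls(2))
```

With these inputs each call needs 80 kbps and two T1s carry 3072 kbps, which gives 38 simultaneous calls. Real capacity would be somewhat lower once layer-2 overhead and signalling are counted.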

by **williamconley** » Tue Feb 15, 2011 8:11 pm

You've been running a 4 server system for 3.5 years for 23 channels?

by **TroyD** » Tue Feb 15, 2011 9:09 pm

williamconley wrote:You've been running a 4 server system for 3.5 years for 23 channels?

Yeah. Initially it was built to be able to scale; I didn't want to have to rebuild the infrastructure midway through. I also wanted to make sure that the environment was optimal for performance, since it was my first go-round with astguiclient. It was painful, all done manually, and therefore I think I learned a lot about the system, but looking back, and even today, I am tempted to just throw it all on a nice healthy server. It wasn't always 4 servers, though. Initially I used the physical dialer to handle all call traffic and built trunks (SIP or IAX) to Trixbox to service the in-house PBX, but it has all changed now. The company has gone from 12 to over 100 employees (I give a lot of credit to the dialer for this growth) and we are pleased with the dialer overall. It is tightly integrated with SugarCRM as well, but we now have 8 developers, so the CRM is going to change face to our own custom .NET version soon. SugarCRM is a very painful app to develop for, kind of a love/hate relationship there! Every upgrade of the CRM brings fixes and new bugs, which is not fun, so .NET will be used instead of PHP for the CRM anyway. Long story, but interesting...

by **williamconley** » Tue Feb 15, 2011 9:17 pm

OK: So ... here's what you do. Go get a nice Core i5 processor and associated Motherboard. Install Vicibox Redux (latest) into it. Use it as a standalone system, but continue to use the upline CPD (since you've already paid for those licenses!).

Then report the model # of the motherboard so others can follow

Then when you need more servers, you can just use the built-in clustering system with Vicibox Redux. And honestly if you have 100 employees, you need to have a dedicated separate dialer system that is in no way linked to or dependent upon any other system.

by **TroyD** » Tue Feb 15, 2011 9:32 pm

williamconley wrote:OK: So ... here's what you do. Go get a nice Core i5 processor and associated Motherboard. Install Vicibox Redux (latest) into it. Use it as a standalone system, but continue to use the upline CPD (since you've already paid for those licenses!).

Then report the model # of the motherboard so others can follow

Then when you need more servers, you can just use the built-in clustering system with Vicibox Redux. And honestly if you have 100 employees, you need to have a dedicated separate dialer system that is in no way linked to or dependent upon any other system.

Tempted to do exactly that as far as the dialer is concerned. It used to be linked to the PBX, however now it is its own system and has the dual dedicated T1s (the current T1 PRI is going away). As far as our PBX is concerned, it is totally separate: it has another T1 PRI attached and has the capability to talk to the dialer and vice versa via SIP trunks, but it is a totally separate system that is backed up by an additional Trixbox appliance located in our datacenter. Since the Polycom phones that we use allow for dual servers, the datacenter PBX is only a backup system in the event that the primary PBX goes offline. We maintain a 10MB metro-e circuit between the corp office and the datacenter. All 3 systems run independently of each other (it would be wacky not to at the size of our company), however each has the capability to dial the others via SIP trunks between them. Works great!

I am really tempted to use a separate IBM server for the dialer: an all-in-one solution on a high powered box. But then again, why not just add another dialer node to the system if in fact the load is the issue? It might be silly to re-invent the wheel when what we have works so well (at least until the pause issues that we have had just the past 2 days). I'm building another standalone dialer box anyway for dev purposes, so I will likely just figure out what is causing our recent issues and fix them; it is much more interesting to learn more about the system. I like to master the things that I manage, and have been bringing together custom infrastructure solutions for years, so why not continue to do so?

Vicidial has likely been the most prevalent tool for our success: millions of calls and lots of deals had due to the system. Sometimes I think that our sales team thinks it's just a magic box that finds us money, and in a way it is! We'll get past the pausing issue. I just took a look at the latest trunk and may just upgrade the system to take advantage of the new fixes.
(Plus I may now have a very good argument for getting the owner to finally get me a better network. Since the rest of our infrastructure is enterprise class, I have to get the network up to snuff as well. It's not trash, but I want Cisco in there...)

"Your Session has Been Paused"