by duncanb » Fri Oct 20, 2017 8:30 am
Hi,
This is still not resolved. We did identify one issue where calls to the CRM backend (via Dispo URL / Strat call URL / etc) were taking anywhere from 1 up to 30 seconds. Resolving this made a lot of our problems go away. However we're still seeing some problems with databases not being updated correctly.
We've tracked things down to gaps of time where we see no packets from the agent workstations. We don't think it's network (based on lots of testing), so think it's more likely something on the Ubuntu/Firefox/Jitsi agent workstations.
I've had to remove some useful info from the below notes as there was some private data included when I sent this summary to the client's technical team.
When investigating reported discrepancies between the agent activity log and the agent’s recordings, we found a substantial amount of missing entries in the vicidial_log table (which is used for the agent activity report).
If we run the following 2 queries on Benoni for a specific user:
select *
from vicidial_log
where call_date >= '2017-10-11 00:00:00'
and call_date <= '2017-10-11 23:59:59'
and user = 'XXX';
select *
from recording_log
where start_time >= '2017-10-11 00:00:00'
and start_time <= '2017-10-11 23:59:59'
and user = 'XXX';
We can see a lot of entries in the recording log without a matching entry in the vicidial_log. The recording_log is updated server-side, and vicidial_log updates are mostly triggered by the agent workstations. The following screenshot shows items in the recording log (highlighted in grey) that have corresponding entries in the vicidial_log. The white entries are where the problems lie:
[screenshot note available]
We wrote a quick script (x.php here), that looks for packets coming from a specific IP address and prints the amount of time since the web server last saw a packet from that IP if it's longer than 2 seconds ago). The below web server log extract was run on the same day for the same user (using the IP address instead of user name). What each “Warning” line is showing is a when there has been a delay of more than 2 seconds since the last packet from the workstation hit the web server. For instance the following line means that at 08:29:58 the web server received a packet from the PC, and the last packet the web server received was 7 seconds before that:
08:29:58 Warning 7
[root@vicidial-benoni support]# zcat /var/log/httpd/access_log-20171012.gz | grep "192\.168\.21\.124 " | sed "s/^.*2017:\(.*\) +0200.*$/\1/" | php x.php
08:29:50 Warning 1503988190 // this line can be ignored as there is no previous log entry for this IP today
08:29:58 Warning 7
08:30:01 Warning 3
08:37:01 Warning 264
08:57:43 Warning 913
09:17:10 Warning 676
09:26:00 Warning 344
09:40:49 Warning 319
09:55:13 Warning 100
09:55:21 Warning 3
09:55:39 Warning 6
09:55:42 Warning 3
09:55:46 Warning 4
09:59:08 Warning 149
09:59:13 Warning 4
09:59:18 Warning 5
10:14:23 Warning 226
10:43:22 Warning 1119
10:55:18 Warning 716
11:07:54 Warning 751
11:07:57 Warning 3
11:21:52 Warning 3
13:20:42 Warning 6513
13:30:40 Warning 5
13:59:27 Warning 811
14:17:41 Warning 3
14:50:55 Warning 1074
15:41:47 Warning 70
17:28:00 Warning 1321
17:39:40 Warning 649
17:41:39 Warning 95
17:49:14 Warning 112
18:00:13 Warning 594
18:00:44 Warning 31
18:01:16 Warning 24
If we compare the above web server log extract to the “whitespace” entries in the screenshot above, there appears to be a correlation. Importantly, it seems in every case where there is a delay in the web server log extract, there are no entries in vicidial_log that match the recording_log table for that period. Note that I have only checked this manually thus far. I also checked user Mandla and saw the same symptoms.
I have also performed a packet capture on the Ethernet interface to confirm that we stop receiving packets from the workstation where we see these warnings in the web server log extract, so it is not just a web server logging problem.
Does anybody perhaps have any ideas? I saw a post from Matt in the forums recently that spoke about Firefox auto-update breaking AJAX calls or something. Is this still an issue and could that have something to do with this?