Front end servers tend to run 'in the field' and seldom require user input or intervention. Nevertheless, situations arise where a server needs to be accessed (for instance to check local conditions) or restarted. Moreover, a server's current statistics can be an invaluable debugging tool when trying to pin down hard-to-find connectivity problems. To this end TINE offers numerous remote services to the controls administrator or front end developer.
On the one hand, all TINE servers offer a set of so-called 'Stock' properties, many of which deliver (via the TINE protocol) the pertinent server statistics and activities. On the other hand, when a server is 'hung' or otherwise cannot be reached via the normal TINE protocol, trying to restart it via the TINE protocol is guaranteed not to work. In such cases an independent means of remotely restarting the server is called for, and the solution to this 'remote restart' problem is frequently platform dependent. We shall discuss these platform dependent techniques for remote restart as well as remote 'control' in more detail below.
All TINE servers offer the stock properties, which let the caller pull information from the server. Among these stock properties are the trivial properties "SRVDESC", "SRVLOCATION", "SRVOS", and "SRVVERSION", which offer static information cached at startup time, namely the server's description, physical location, operating system, and TINE version respectively. In addition, a caller can obtain security information, namely the list of users with WRITE access and the list of networks with WRITE access, by calling the stock properties "USERS" and "IPNETS", respectively.
More interesting is the dynamic information obtained by calling the stock properties "ACTIVITY" and "SRVSTATS". A call to "ACTIVITY" must ask for an "ActivityQueryStruct" structure with TINE data type CF_STRUCT. An ActivityQueryStruct looks like:
typedef struct
{
  char FecName[FEC_NAME_SIZE];
  char reserved[4];
  time_t localtime;
  time_t starttime;
  long systemPollingRate;
  short numBkgTsks;
  short numTotalContracts;
  short numTotalClients;
  short numTargetContracts;
  short numTargetClients;
  short numConnections;
  UINT32 numConnectionTimeouts;
  UINT32 numConnectionArrivals;
  UINT32 numUDPpkts;
  UINT32 numTCPpkts;
  UINT32 numIPXpkts;
  UINT32 numSPXpkts;
} ActivityQueryStruct;
We see that the information returned is general information concerning server operation, for instance the server's startup time and current clock time, along with the current number of registered contracts and clients in the server's connection tables. General counters referring to the server's network activity are also available, such as the number of UDP, TCP, and IPX packets received at the server. If the server is itself a client, then the number of connection arrivals and timeouts is also presented in the structure.
A more detailed analysis of the server's activity since startup is offered by the stock property "SRVSTATS". As the information provided consists entirely of long integer counters, this property can and should be called with the TINE datatype of CF_LONG. The information returned is shown in the structure below:
typedef struct
{
  UINT32 AveBusyTime;
  UINT32 CycleCounts;
  UINT32 MaxCycleCounts;
  UINT32 SingleLinkCount;
  UINT32 ClientMisses;
  UINT32 ClientReconnects;
  UINT32 ClientRetries;
  UINT32 ContractMisses;
  UINT32 ContractDelays;
  UINT32 BurstLimitReachedCount;
  UINT32 DataTimeStampOffset;
} ServerStatsStruct;
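Since "SRVSTATS" is read as a plain array of counters, a caller typically wants to map that array back onto named fields. The following is a minimal sketch of that step only; it makes no TINE call. It assumes that the CF_LONG elements are 32-bit values delivered in the same order as the fields of ServerStatsStruct. The helper unpackServerStats() and the UINT32 stand-in typedef are hypothetical names introduced for illustration and are not part of the TINE API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint32_t UINT32; /* stand-in for the TINE typedef (assumption) */

typedef struct
{
  UINT32 AveBusyTime;
  UINT32 CycleCounts;
  UINT32 MaxCycleCounts;
  UINT32 SingleLinkCount;
  UINT32 ClientMisses;
  UINT32 ClientReconnects;
  UINT32 ClientRetries;
  UINT32 ContractMisses;
  UINT32 ContractDelays;
  UINT32 BurstLimitReachedCount;
  UINT32 DataTimeStampOffset;
} ServerStatsStruct;

#define SRVSTATS_NUM_FIELDS 11

/* Hypothetical helper: copies n raw counters (as they might be returned
 * by a CF_LONG read of "SRVSTATS") into the named structure.
 * Returns 0 on success, -1 if the input is missing or too short. */
int unpackServerStats(const UINT32 *raw, size_t n, ServerStatsStruct *stats)
{
  /* the structure holds only 32-bit members, so no padding is expected */
  assert(sizeof(ServerStatsStruct) == SRVSTATS_NUM_FIELDS * sizeof(UINT32));
  if (raw == NULL || stats == NULL || n < SRVSTATS_NUM_FIELDS) return -1;
  memcpy(stats, raw, sizeof(ServerStatsStruct));
  return 0;
}
```

After a successful call the individual counters can be read by name, for example stats.ClientReconnects or stats.ContractMisses.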
Before we launch into a discussion as to what all of these parameters mean, we note that these statistics can be pulled from a server via calling the stock properties or by typing "get stats" at the command line (for console servers running in the foreground or via the 'attachfec' tool for servers running in the background or as a service). An example output of "get stats" is shown below:
get stats
>Running since Tue Apr 1 21:02:21 2003
Total SPX requests : 0
Total IPX requests : 0
Total TCP requests : 0
Total UDP requests : 283254
Total KBD commands : 26
Socket RCV Buffers : 65536
Socket SND Buffers : 32768
Server Work Area : 131070 bytes
Registered clients : 11
Registered contracts: 57
Contract misses : 1663
Contract delays : 2
Client misses : 13
Client reconnects : 1393
Client retries : 0
Synchronous calls : 52829
Bursts : 27
Connection arrivals : 72062966
Connection timeouts : 1028507
Incomplete transfers: 190
Client Work Area : 65536 bytes
System Polling rate : 1000 msec
CPU usage : 0 percent
Average Cycles/sec : 2 Hz
Max Cycles/sec : 23 Hz
We now note that contract misses and contract delays indicate communications problems on the server side. Specifically, when the server sees that it is scheduling communication to a client at a rate greater than twice the requested transmission rate, it records a contract 'miss'. When the server tries to schedule a contract request but notices that output from the previous request to the same contract is still pending, it records a contract 'delay'. When the server notices that it is delivering data after the client-specified polling rate or timeout has expired, it increments the client 'miss' counter. Each of these cases indicates a busy server, and these values should be as low as possible.
When a client requests a value and the CM_RETRY flag is set, the server increments the client 'retry' counter. Likewise, if a client unexpectedly renews a contract subscription (indicating it missed a delivery), the server increments the client 'reconnect' counter. These values can be indicative of busy clients or busy networks.
The synchronous calls indicator can be examined to get an idea of how client applications are making use of the server.
The number of 'bursts' is the number of times that the server has reached its burst limit while sending data packets.
Counters such as 'connection timeouts' and 'incomplete transfers' are client-side counters.
The CPU usage refers to a 'best guess' calculation of a server's busy time versus its idle time.
Cycles/sec refers to passes through SystemCycle(). For DOS machines these numbers should be high. For all other platforms these values reflect the system polling rate. The maximum value gives the most passes through SystemCycle in one second since server startup, and could be a large number if the server has ever been accessed via numerous repetitive synchronous calls for instance. The average number is actually the most recent number of passes through SystemCycle per second.
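The relationship between the 'average' and 'max' cycle figures can be pictured with a small per-second counter. The sketch below is an illustration under stated assumptions and is not taken from the TINE sources: the CycleStats structure and cycleStatsTick() are hypothetical names, and the current time in whole seconds is passed in by the caller so that the logic stays independent of any particular clock.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical bookkeeping for passes through a cycle function */
typedef struct
{
  long windowStart;   /* the second in which the current window began */
  unsigned inWindow;  /* passes counted so far in the current window */
  unsigned last;      /* passes in the most recently completed window */
  unsigned max;       /* largest completed-window count since startup */
} CycleStats;

/* Call once per pass through the cycle; nowSec is the current time
 * in whole seconds. */
void cycleStatsTick(CycleStats *cs, long nowSec)
{
  assert(cs != NULL);
  if (nowSec != cs->windowStart)
  {
    /* a new second has begun: close the previous window */
    cs->last = cs->inWindow;
    if (cs->last > cs->max) cs->max = cs->last;
    cs->inWindow = 0;
    cs->windowStart = nowSec;
  }
  cs->inWindow++;
}
```

In this sketch 'last' plays the role of the reported average (the most recent number of passes per second), while 'max' only ever grows, which is why a single burst of synchronous calls can leave a large maximum behind.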
Many of the other stats give configuration parameters, most of which can be adjusted if necessary.
The above output was generated from a very busy gateway server (NETMEX) after two weeks of up time.
A TINE statistics server can be configured to collect and keep histories of some of the more important counters from designated servers. In addition, the statistics server itself can maintain timeout counters, etc., as well as obtain information from the equipment name server as to the reboot history of a server.
For instance, consider the display of the timeout counts as measured by the statistics server:
In the display, it makes the most sense to view the trend as binned output, since the timeout counter will monotonically increase over the course of the year. The timeout counter is only an indication of the connection timeouts seen by the statistics server itself and does not necessarily mean that the server was down. Connection timeouts could arise due to network problems or a very busy server. To help decide where the root of the problem is, the other statistics should be consulted over the same time span.
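Binning a monotonically increasing counter amounts to taking the difference between consecutive samples. The following minimal sketch shows that calculation on an array of cumulative readings. The function name binCounter() is hypothetical, and the handling of a counter that drops (taken here to mean that the counter was reset, so the new reading is used as the increment) is an assumption made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Converts n cumulative counter samples into n-1 per-interval increments.
 * bins[i] receives the increase between samples[i] and samples[i+1].
 * Returns the number of bins written (0 if fewer than two samples). */
size_t binCounter(const uint32_t *samples, size_t n, uint32_t *bins)
{
  size_t i;
  assert(samples != NULL && bins != NULL);
  if (n < 2) return 0;
  for (i = 1; i < n; i++)
  {
    if (samples[i] >= samples[i - 1])
      bins[i - 1] = samples[i] - samples[i - 1];
    else
      bins[i - 1] = samples[i]; /* assumed counter reset: count from zero */
  }
  return n - 1;
}
```

For the samples 100, 100, 130, 5, 12 this yields the bins 0, 30, 5, 7, so that a quiet interval and a burst of timeouts are immediately distinguishable.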
For instance, consider the display of the reboot counts as obtained from the equipment name server:
This gives the number of times the server was restarted over the time span in question. Note that a server could have been down for hours before a restart, or could have been restarted from a running state. The equipment name server simply keeps track of the number of restarts.
Now consider the display of the client reconnects as obtained directly from the server in question.
This gives an indication as to how often a server's clients are seeing timeouts and are thus forced to re-establish their data links. If this number is high, it is a sure indication that there is either a network problem or that the server is unduly busy.
Now consider the display of the contract misses as obtained directly from the server in question.
This gives an indication of how often a server is returning late contracts. If this correlates well with the number of client reconnects, then it is a sure indication that the primary reason for clients receiving connection timeouts is that the server is busy.
Now consider the display of the number of synchronous calls as obtained directly from the server in question.
If this number is high, it generally means that poorly written clients are synchronously polling the server for data. This is the least efficient way of obtaining data from a server, and if the same client program is run at many stations this could generate an unnecessarily high load on the server and could be one reason why the server is busy.
Finally, consider the display of the average busy time as obtained directly from the server in question.
This statistic gives an estimate of the percentage of the CPU load used by the server thread. It is best viewed as a trend and makes little sense as a binned statistic. This statistic should be regarded as approximate and reflects the amount of time spent 'doing something' versus the server idle time.
Also of potential interest are the performance settings (particularly if performance problems are being investigated). At the console (or via the 'attachfec' tool) one can type
get settings
and view an output such as the following:
>get settings
>
>Server Settings :
>Server Work Area : 262140 bytes
>System Polling rate : 10 msec
>Contract renewal len: 60 items
>Req ack. on change : yes
>Server Burst Limit : 1000 packets
>Burst Cycle Delay : 20 packets
>Server Packet MTU : 1472 bytes
>Server Send Buffers : 65536 bytes
>Server Recv Buffers : 32768 bytes
>Server Scheduling : eager
>Server tasks : not re-entrant
>Server cycle thread : separate
>transport thread : separate
>Client Settings :
>Client Work Area : 65536 bytes
>Client Burst Limit : 1000 packets
>Client Send Buffers : 65536 bytes
>Client Recv Buffers : 32768 bytes
>Client Recv Queue : 0 items
>use loopback addr : no
>use watchdog links : yes
>allow common links : yes
>retry on timeout : yes
>
which provides the user with information concerning working buffer lengths, threading information, socket and transport information, and so on.
All TINE servers maintain a log file called 'fec.log', which is rotated into 'fec.bak' when the allowed file size has been exceeded. By default this log file is kept on the local disk for those platforms which typically have a disk; otherwise the log file is kept in main memory as a ring buffer (and will consequently disappear if the server is restarted). In any event the log file can be pulled from the server by calling the stock property "LOGFILE". This not only allows a control system administrator to access all servers' log files at a central location (without worrying about file mounts), but also allows secondary processes to periodically pull the log files from those servers which do not have disks and store them on a file system elsewhere.
The primary management tool for making use of the remote services offered by TINE servers is the FEC Remote control application shown below.
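The in-memory variant of the log can be pictured as a fixed-size ring of text lines in which a new entry overwrites the oldest one once the ring is full. The sketch below illustrates that idea only; it is not the TINE implementation, and the names RingLog, ringLogInit(), ringLogAppend(), and ringLogGet() as well as the (deliberately tiny) sizes are hypothetical.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define LOG_LINES 4      /* deliberately tiny for illustration */
#define LOG_LINE_LEN 64

typedef struct
{
  char lines[LOG_LINES][LOG_LINE_LEN];
  size_t next;   /* slot which the next entry will occupy */
  size_t count;  /* number of valid entries (at most LOG_LINES) */
} RingLog;

void ringLogInit(RingLog *log)
{
  assert(log != NULL);
  memset(log, 0, sizeof(*log));
}

/* Appends an entry, overwriting the oldest one when the ring is full.
 * Entries longer than LOG_LINE_LEN - 1 characters are truncated. */
void ringLogAppend(RingLog *log, const char *text)
{
  assert(log != NULL && text != NULL);
  snprintf(log->lines[log->next], LOG_LINE_LEN, "%s", text);
  log->next = (log->next + 1) % LOG_LINES;
  if (log->count < LOG_LINES) log->count++;
}

/* Returns the i-th oldest entry (0 = oldest), or NULL if out of range. */
const char *ringLogGet(const RingLog *log, size_t i)
{
  size_t oldest;
  assert(log != NULL);
  if (i >= log->count) return NULL;
  oldest = (log->next + LOG_LINES - log->count) % LOG_LINES;
  return log->lines[(oldest + i) % LOG_LINES];
}
```

A process which periodically pulls such a log and stores it elsewhere would read the entries from oldest to newest via ringLogGet() and so preserve what would otherwise be lost at the next restart.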
The view shown above is trained on the PETRA machine and one can see at a glance the number of different platforms (all speaking TINE) involved at the front end. As seen from the buttons offered, one of the services offered from this control program is the ability to 'restart' the front end process in question. Another service is the ability to 'control' the front end process in question. These services require a bit of explanation.
The first thing to note is that the ability to either 'control' or 'restart' a front end process is platform dependent, and the level or amount of 'control' will vary depending on the platform. We shall discuss these case by case below.
UNIX
The solution used in the case of UNIX servers (including Linux in particular) is to make use of the 'autoproc' daemon, which can manage all manner of processes on the machine in question and offers an independent remote port for accepting commands. The autoproc daemon is also a 'watchdog' in that it will restart a process it is managing if the process dies unexpectedly or is suddenly using too much CPU time, etc. Thus 'restarting' a server process consists of determining that the operating system is UNIX and then issuing a command to the autoproc daemon, giving it the name of the front end process to restart. This name must be the same as the FEC name, which identifies the server process at the control system level.
'Control'ing the process in this case amounts to offering a secure shell connection to the remote computer and starting the autoproc control panel on the local host. Once logged in via ssh on a UNIX machine, you can run the TINE 'attachfec' access program, remembering to supply the FEC name as an argument, as in:
attachfec Ion_Pump.504
attachfec
The attachfec program is part of the TINE package and should be built for the target platform by using the remote.mak make file. If you are logged in under the same account which owns the server processes, then the attachfec program can attach to a named pipe started by the server. You then have the full panoply of console commands at your disposal, as if the server were running in the foreground. For instance:
help
>Currently available commands:
> quit - terminates process (foreground) or service viewer (background)
> kill - terminates process as well as service viewer
> get modules - displays a list of registered equipment modules
> get properties(<eqm>) - displays the registered properties for equipment module <eqm>
> get devices(<eqm>) - displays the registered devices for equipment module <eqm>
> get clients - displays the current consumer list
> get contracts - displays the current contract list
> get contract(#) - displays contract Nr. <#> information
> get globals - displays the current globals list
> get connections - displays the current connection list
> get connection(#) - displays connection Nr. <#> information
> get BurstLimit - displays Burst Limit in packets
> set BurstLimit - sets Burst Limit to num packets input
> get CycleDelay - displays Cycle Delay in msec
> set CycleDelay - sets Cycle Delay to msec input
> get time - displays local time
> get version - displays TINE version number
> get users - displays WRITE access user list
> get nets - displays WRITE access net list
> get stats - displays operational statistics
> get users - displays the users with WRITE permission
> get nets - displays the networks with WRITE access
> get filter - displays current debug output filter string
> set filter - sets debug output filter string
> set debug = 0 - turns debug printing off
> set debug = 1 - sets debug level 1 (trace RPC commands)
> set debug = 2 - sets debug level 2 (trace network activity)
> set debug = 3 - sets debug level 2 (trace data exchange)
> set debug = 4 - sets debug level 3 (full diagnostic trace )
> set logdbg = 0 - turns debug logging off
> set logdbg = 1 - turns debug logging on
> help - display this list
>
>Extra commands:
> freeSessions - get function value
> sessions - get function value
> ArchiveDebug - get or set integer value
> doocsstats - get function value
>
>quit
>Debug level 0
>
>debug logging OFF
>
>debug name filter entered
>
>
Thanks for using attachfec !
Note that by issuing a 'quit' you quit the attachfec session and not the server process. You can stop the server process from an attachfec session by issuing a 'kill' command. This in fact is one way to gracefully close a server process.
attachfec is a powerful tool for investigating the activity of a TINE server. Simply typing 'attachfec' at the command line will produce the following output:
fecadmin@acclxd2facil01:~$
fecadmin@acclxd2facil01:~$ attachfec
usage: attachfec <fecname> (via local pipe - normal usage)
or: attachfec /<context>/<server> (via remote stream)
or: attachfec <ip>:<port> (via remote stream)
fecadmin@acclxd2facil01:~$
From the usage text we see that we can also attach to any running FEC process anywhere by providing the input in the form "/<context>/<server>", which signals attachfec to resolve the host address and attach to a special debug stream socket. This works regardless of the platform of the remote host (provided that the TINE server is of release 4.1.9 or higher).
Note, however, that such debugging (especially with the debug level set to 2 or higher) can introduce a substantial additional load on the server, and that this load is much greater if the debug output is transmitted via a network stream as opposed to a local named pipe.
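The three input forms accepted by attachfec can be told apart by simple inspection of the argument. The sketch below is a hypothetical illustration of such a classification and is not the parsing code used by attachfec itself; the enum AttachTarget and the function classifyAttachArg() are names introduced here for the example.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

typedef enum
{
  TARGET_LOCAL_PIPE,      /* <fecname> */
  TARGET_CONTEXT_SERVER,  /* /<context>/<server> */
  TARGET_IP_PORT,         /* <ip>:<port> */
  TARGET_INVALID
} AttachTarget;

/* Hypothetical classification of an attachfec-style argument */
AttachTarget classifyAttachArg(const char *arg)
{
  const char *p;
  if (arg == NULL || arg[0] == '\0') return TARGET_INVALID;
  if (arg[0] == '/')
  {
    /* expect /<context>/<server> with both parts non-empty */
    p = strchr(arg + 1, '/');
    if (p == NULL || p == arg + 1 || p[1] == '\0') return TARGET_INVALID;
    return TARGET_CONTEXT_SERVER;
  }
  p = strrchr(arg, ':');
  if (p != NULL)
  {
    /* expect <ip>:<port> with a non-empty host and a numeric port */
    if (p == arg || p[1] == '\0') return TARGET_INVALID;
    for (p++; *p != '\0'; p++)
    {
      if (!isdigit((unsigned char)*p)) return TARGET_INVALID;
    }
    return TARGET_IP_PORT;
  }
  return TARGET_LOCAL_PIPE; /* anything else is taken to be a FEC name */
}
```

Only the first form uses the local named pipe; the other two imply a network stream and therefore the additional load and the address-based security checks that come with remote attachment.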
Also note, that 'remotely' attaching to a fec makes use of whatever TINE network address security is in place at the remote host (i.e. IP address lists, but not user lists).
As a final note concerning the attachfec tool, a pure client process can also make use of the API call OpenIpcSocket() to provide a local pipe name (traditionally the local process id) in order to offer an avenue of investigation using 'attachfec'. Here you would simply type, for example, 'attachfec <pid>' in order to have a debugging window into a running client application. This method only works via a named pipe, as a pure client application will not have a systematically known debugging socket available.
'Control'ing the process will also start the autoproc controller on the local host offering quick access to all of its managed processes:
Win32
The solution in the case of the Win32 world is to make use of a separate TINE server running as a watchdog. This server process uses the computer's host name as its device server name and makes use of port offset 20. So as long as the host name is related to the FEC name in the manner described below, it is easy to deduce the TINE watchdog server which needs to be contacted when it is desired to either restart or control a Win32 server.
The suggested naming scheme for Win32 servers is as follows. All FEC names used on a Win32 host should consist of the computer name followed by ".<port offset>", where <port offset> is the assigned port offset as used in either the fecid.csv file or the RegisterFecName() API call. Optionally the computer name can be stripped of its first 5 characters when these are used systematically to identify a Windows group plus operating system. For example, "ACCNTHEEKOLLI" or "ACCXPHEPIDC" might indicate that the Windows group is "ACC" (for accelerator) and the operating systems are "NT" and "XP" respectively. The actual FEC names might then be "HEEKOLLI.4" and "HEPIDC.2", for instance.
As in the case of the autoproc daemon, the Windows watchdog will restart processes which die unexpectedly, and allows remote commands to restart a process. 'Control'ing a Win32 front end amounts to starting the remote watchdog controller on the local host machine:
If one has access to the remote desktop of the machine running the FEC process, one can also make use of the 'attachfec' utility for Windows in the same way as for UNIX. Attachfec takes the FEC name as argument and brings up a GUI application providing the same set of commands as the UNIX attachfec utility:
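The naming scheme can be expressed as a small string operation. The following sketch builds a FEC name from a computer name and a port offset, optionally skipping a fixed-length prefix. The function makeFecName() is a hypothetical helper written for illustration and is not part of the TINE API; any length limit which TINE imposes on FEC names is not checked here beyond the size of the output buffer.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: writes "<name>.<portOffset>" into out, where
 * <name> is computerName without its first prefixLen characters.
 * Returns 0 on success, -1 on bad input or if out is too small. */
int makeFecName(const char *computerName, size_t prefixLen, int portOffset,
                char *out, size_t outSize)
{
  int n;
  if (computerName == NULL || out == NULL || portOffset < 0) return -1;
  /* something must remain after stripping the prefix */
  if (strlen(computerName) <= prefixLen) return -1;
  n = snprintf(out, outSize, "%s.%d", computerName + prefixLen, portOffset);
  if (n < 0 || (size_t)n >= outSize) return -1;
  return 0;
}
```

With a prefix length of 5, the computer name "ACCNTHEEKOLLI" and port offset 4 give "HEEKOLLI.4", and "ACCXPHEPIDC" with port offset 2 gives "HEPIDC.2", matching the examples in the text. A prefix length of 0 leaves the computer name untouched.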
attachfec
The attachfec GUI application likewise makes use (primarily) of a windows named pipe to communicate with the running server process.
But as in the case of the UNIX tool 'attachfec', you can use the Windows GUI to attach to a remotely running server by supplying the input in the form /<context>/<server>.
Starting attachfec.exe without providing an input parameter will produce the following input error box:
VxWorks
The solution in the case of VxWorks is to run a separate task (analogous to the autoproc daemon) which accepts a remote reboot command. As VxWorks runs all tasks in the same address space, there is only one TINE server on a VxWorks CPU. All device servers are then attached to the single TINE server. Although it is possible to stop and remove the TINE server process and then reload it, the 'restart' daemon will take the more drastic step of rebooting the CPU.
'Control'ing a VxWorks server process amounts to starting a remote login session on the VxWorks CPU. You then have all of the task tracing functionality which VxWorks has to offer. Indeed, you are taking complete control of the CPU when you do this.
attachfec
There is no sense or need to provide a VxWorks 'attachfec' utility, as there are no local named pipes (or reason to use them). However one can make use of either the windows GUI executable 'attachfec.exe' or unix command line tool 'attachfec' to remotely attach to a running VxWorks server for debugging purposes.
Java
Java servers will make use of the native restart daemons on the platform on which they are running.
attachfec
Java has no ability to offer named pipes as a form of local interprocess communication. However, as in the case of VxWorks servers, one can make use of either the windows GUI executable 'attachfec.exe' or unix command line tool 'attachfec' to remotely attach to a running java server for debugging purposes.
Note that the debugging commands available to Java servers are similar but not identical to those of the standard C-Lib servers.
DOS
The solution in the case of DOS is to run a separate TSR program prior to the start of the server process. This scenario is similar to the VxWorks case described above. Indeed the TSR program (slave.exe) is capable of accepting keyboard input from your host machine and returning the screen buffers so that you effectively 'take control' of the remote DOS CPU.
Worth mentioning is the fact that the slave.exe uses IPX datagrams to exchange information between the remote and local hosts. This satisfies the wish to use an independent means for communication, but otherwise appears to be a rather unusual solution. We can only say here that this solution is stable. The DOS FECs running TINE make use of the Client32 stack from Novell, which is by far the best from the standpoint of minimizing the footprint in the critical lower 640 Kbytes. Attempts at using UDP datagrams instead of IPX datagrams in the slave.exe TSR program met with only limited success, in that some CPUs crashed following a short life span. As the TCP stack for DOS has been frozen for some time and is extremely unlikely to be improved upon, this will be the state of affairs as long as DOS is needed. We could mention that TINE runs fine on DOS without IPX, so that if IPX is unavailable on a particular subnet, a TINE server would still run fine. However remote reboot would then be impossible by these means.
Win16
The solution in the case of the Win16 world parallels that in the Win32 world to some extent and makes use of a separate TINE server. Here, however, the ability to control server processes is somewhat compromised, as Win16 runs all Windows processes out of the same address space (as in VxWorks, or DOS for that matter). Indeed the watchdog process has the same FEC name as all other TINE server processes on a Win16 machine. This means that in the case of a hanging server, it is unlikely that the watchdog process will be able to respond, since it is attached to the same server process. On the other hand, restarting or rebooting otherwise healthy processes is not a problem.
For operating systems such as Win16 or DOS, where there is no virtual memory, it is advisable to have an additional hardware watchdog card which will issue a cold boot in the event of serious problems.