TINE Alarm System

Introduction

TINE alarms are processed at two levels. The first level is located directly at the front end and is known as the Local Alarm Server (LAS). The LAS is a standard part of all TINE servers, independent of platform. Here, it is most easily determined whether an alarm is oscillating (coming and going), whether an alarm should append a heartbeat flag, whether an alarm has terminated, etc. The second level of processing occurs at a dedicated middle layer TINE server known as the Central Alarm Server (CAS). Here, alarms are collected, filtered, sorted, and made available for client-side review. The CAS also performs several additional tasks more appropriate to central level processing. It can determine whether 'server-down' alarms need to be issued in the case of a non-responsive server (something an individual server of course cannot do). It can also take 'actions' following the receipt of particular alarms. For instance, it can issue post-mortem triggers, send e-mails, or write reports, etc. The TINE Alarm Viewer receives data primarily directly from the CAS. Specific alarm information (descriptions, references, etc.) is always retrieved from the individual servers.

It is also important to note the 'pulling' strategy used in accumulating alarm information. If a server detects an alarm, there are a couple of ways it could disseminate this information. A server could broadcast/multicast the alarm. This scenario was rejected out of hand. As it is desirable that no alarm information is lost, a server would have to broadcast its entire alarm table at all times in order that the CAS be guaranteed not to miss crucial information due to packet loss. A server could 'push' the alarm to the CAS. This scenario was also rejected as unnecessary. If the CAS is not running (for instance temporarily due to a server restart), then once again an alarm might be lost, unless the server caches the alarm until it can successfully contact the CAS. Furthermore there could also be test or secondary servers which might want to process the alarms. The CAS identity is assumed to be a server called "CAS" running in the device server's context. As this identity is systematically known, a server will cache alarms until they have been read by the CAS. This essentially defaults to the same behavior as pushing an alarm to the CAS, but allows any number of interested clients to retrieve alarm information in the same manner.

The TINE LAS offers an alarm 'snapshot' consisting of five long integers, giving

The TINE CAS then monitors these snapshots from all relevant servers (configured by database) in the control system. As the monitor is a TINE REFRESH monitor, the CAS only receives the snapshot if one of the five numbers changes. The CAS can then retrieve alarm information from individual servers incrementally. Furthermore, alarms are not lost, as a server's alarm list is always accessible. If the CAS is restarted, it can quickly re-establish the current alarm situation. The identity of the CAS does not need to be known systematically, and any number of secondary servers (tests, backups, etc.) can likewise retrieve the alarm information.
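The change-driven pull described above can be sketched as follows. This is only an illustration, not TINE code: the type AlarmSnapshot and the functions snapshot_changed() and cas_on_snapshot() are hypothetical names, and the five numbers are treated as opaque values.

```c
#include <assert.h>
#include <stddef.h>

#define SNAPSHOT_LEN 5   /* the text speaks of five long integers */

/* Hypothetical container; the meaning of each entry is not modelled here. */
typedef struct {
    long values[SNAPSHOT_LEN];
} AlarmSnapshot;

/* Returns 1 if any of the five numbers differs, 0 otherwise. */
int snapshot_changed(const AlarmSnapshot *prev, const AlarmSnapshot *cur)
{
    assert(prev != NULL && cur != NULL);
    for (int i = 0; i < SNAPSHOT_LEN; i++) {
        if (prev->values[i] != cur->values[i]) return 1;
    }
    return 0;
}

/* Remembers the latest snapshot and reports whether the alarm list of
 * the server should be read again (1) or can be left alone (0). */
int cas_on_snapshot(AlarmSnapshot *last, const AlarmSnapshot *incoming)
{
    if (!snapshot_changed(last, incoming)) return 0;
    *last = *incoming;
    return 1;
}
```

A collector would call cas_on_snapshot() for every snapshot it receives and only query the server's alarm list when the return value is 1.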

TINE Alarm Structure

A TINE alarm as it exists at the front end consists of the following c-structure:

typedef struct AlarmMsgStruct
{
  char server[EXPORT_NAME_SIZE];
  char device[DEVICE_NAME_SIZE];
  char alarmTag[ALARM_TAG_SIZE];
  UINT32 alarmCode;
  UINT32 timestamp;
  UINT32 timestampUSec;
  UINT32 starttime;
  UINT32 starttimeUSec;
  UINT32 alarmMask;
  BYTE alarmData[ALARM_DATA_SIZE];
  BYTE alarmDataFormat;
  BYTE alarmDataArraySize;
  BYTE severity;
  BYTE descriptor;
  UINT16 alarmSystem;
  BYTE reserved[2];
} AMS;

Some of the above fields warrant an explanation. The 'server' field refers to the device server or sub-system issuing the alarm. Typically this is nothing more than the device server name of the device server issuing the alarm. However it should be remembered that an alarm can be issued at a server on behalf of another (a middle layer, for instance). The 'device' field is the device name of the specific device responsible for the alarm. The 'alarmTag' field is a short alarm description, which should alert the operator to the primary nature of the alarm. The 'alarmCode' field defines the alarm as a number. Typically this will be a user-defined (i.e. server specific) code beginning at 512 which should be unique for each category of server. Alarm codes under 512 should refer to TINE systematic error codes (such as 'sedac_error' or 'gpib_error', etc.) if used. The 'timestamp' field gives the UTC time of the alarm. Higher resolution timestamps should be incorporated in the 'alarmData' field. The 'alarmMask' field is a user-defined (i.e. server specific) mask which can be used to further categorize an alarm. The 'alarmData' field is a place holder for up to 64 bytes of alarm-specific data, which can be used to provide any relevant information pertaining to the alarm in question (such as hardware address, high-resolution time information, threshold limits, a character string etc.). The 'alarmDataFormat' field gives the TINE format of the 'alarmData' field. The 'alarmDataArraySize' field tells in essence how much of the 6-bytes of alarm data are relevant. Note: six bytes does not sound like much, but is generally sufficient to give a hardware address (3 short integers giving line, crate, sub address), high resolution time (nanoseconds to append to UTC, for instance), or a long or float value that exceeded a threshold, etc. The 'severity' field gives the severity of the alarm (0 to 15). The 'descriptor' field gives the ORed descriptor flags pertaining to the alarm in question. Finally, the 'alarmSystem' field gives a system-wide categorical description of the alarm system for which the alarm is relevant (e.g. 150 = Electron RF).
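As an illustration of two of the conventions just described, the following sketch tests whether an alarm code lies in the user-defined range and copies a hardware address (line, crate, sub address) as three short integers into an alarm data buffer. The names ex_is_user_alarm_code() and ex_pack_hw_address() are hypothetical helpers and not part of the TINE API; host byte order is assumed.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define EX_FIRST_USER_ALARM_CODE 512u  /* per the text: user codes begin at 512 */

/* 1 for server-specific codes, 0 for codes in the systematic range. */
int ex_is_user_alarm_code(uint32_t code)
{
    return code >= EX_FIRST_USER_ALARM_CODE;
}

/* Copies line, crate and sub address as three shorts (host byte order)
 * to the start of an alarm data buffer. Returns the number of array
 * elements written (3), or 0 if the buffer is too small. */
int ex_pack_hw_address(uint8_t *data, size_t data_size,
                       int16_t line, int16_t crate, int16_t sub)
{
    int16_t addr[3] = { line, crate, sub };
    assert(data != NULL);
    if (data_size < sizeof addr) return 0;
    memcpy(data, addr, sizeof addr);
    return 3;
}
```

The returned element count corresponds to what a server would report as the relevant array size for a 'short' formatted data field.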

As noted above, TINE alarms have a severity range of 0 to 15, where '0' is entirely informational and levels 1 through 7 are 'warning' level alarms. Severities above '7' are deemed critical, where the most extreme case '15' is usually reserved for imminent beam loss. Furthermore, TINE alarms can carry any of 7 different descriptor flags, defined by

almNEWALARM
almHEARTBEAT
almOSCILLATION
almDATACHANGE
almINSTANT
almTERMINATE
almTEST

In general these descriptor flags can occur in combination. However in principle, certain combinations are mutually exclusive. For instance a 'NEWALARM' cannot carry the 'HEARTBEAT', 'OSCILLATION' or 'DATACHANGE' descriptor. The first appearance of an alarm for a particular device will carry the NEWALARM descriptor. If the alarm is unattended and remains in the LAS alarm list for at least 15 minutes, the 'HEARTBEAT' descriptor will be applied. If an alarm is cleared and ready to be terminated but abruptly returns, it will be assigned the 'OSCILLATION' descriptor. If an alarm continues to be set by the front end's IO loop, but with different data, the 'DATACHANGE' descriptor will be applied. Certain types of alarms do not have a duration associated with them. Alarms such as hardware-error alarms or threshold-exceeded alarms will remain in an alarm state until someone fixes the hardware or makes adjustments so that the threshold is no longer exceeded. However, beam-loss, quenches, and other types of transient alarms simply set an alarm following the occurrence of such a transient event, perhaps marking it with a high-resolution timestamp or issuing a post-mortem event, but do not remain in an alarm state for a duration. Such transient alarms will carry the 'INSTANT' descriptor. When an alarm is marked for termination it will carry the 'TERMINATE' descriptor, signaling the fact that it will be removed from the LAS alarm list in due course. Finally, a server can explicitly mark an alarm as a 'TEST' if desired. It is interesting to note that a transient alarm will by definition carry the 'NEWALARM', 'INSTANT', and 'TERMINATE' descriptors when it is set.
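The severity bands and the descriptor rules above can be expressed in a few lines of C. This is a sketch under stated assumptions: the EX_ constants use illustrative bit values (the real alm* values are defined in the TINE headers and may differ), and the helper functions are hypothetical.

```c
#include <assert.h>

/* Illustrative bit values only; the real alm* constants come from
 * the TINE headers and are not reproduced here. */
enum {
    EX_NEWALARM    = 0x01,
    EX_HEARTBEAT   = 0x02,
    EX_OSCILLATION = 0x04,
    EX_DATACHANGE  = 0x08,
    EX_INSTANT     = 0x10,
    EX_TERMINATE   = 0x20,
    EX_TEST        = 0x40
};

typedef enum { EX_SEV_INFO, EX_SEV_WARNING, EX_SEV_CRITICAL } ExSeverityClass;

/* 0 is informational, 1 through 7 are warnings, 8 through 15 are critical. */
ExSeverityClass ex_classify_severity(int severity)
{
    assert(severity >= 0 && severity <= 15);
    if (severity == 0) return EX_SEV_INFO;
    if (severity <= 7) return EX_SEV_WARNING;
    return EX_SEV_CRITICAL;
}

/* A new alarm cannot also be a heartbeat, an oscillation or a data change. */
int ex_descriptor_is_consistent(unsigned flags)
{
    const unsigned repeat = EX_HEARTBEAT | EX_OSCILLATION | EX_DATACHANGE;
    if ((flags & EX_NEWALARM) && (flags & repeat)) return 0;
    return 1;
}

/* The descriptor a transient alarm carries at the moment it is set. */
unsigned ex_transient_descriptor(void)
{
    return EX_NEWALARM | EX_INSTANT | EX_TERMINATE;
}
```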

Setting Alarms

How do alarms get entered into the LAS alarm table? It should be noted that the only alarm which can be applied systematically to any server is a 'server-down' alarm issued by the CAS. Otherwise, establishing which alarm is to be set when is a task for each individual server.

Automatic Alarms

In the case of 'threshold-exceeded' alarms or 'pattern-not-matched' alarms, one can make use of an Alarm Watch database, which is read at initialization time and instructs the server engine to monitor the values of given properties and issue alarms if the read values fall above or below associated thresholds or do not match a given pattern. This method requires no extra coding on the part of the server developer. The Alarm Watch database can either be a text .csv startup file called almwatch.csv or under an ALARM tag within the PROPERTY section of 'fec.xml', or entries can be appended to the table via an API call appendAlarmWatchTable(). In some cases, server IO generates values of channels which should lie between certain boundaries for safe operation. Instead of using the max and min value range supplied during property registration, it was decided to allow separate registration of "watch" threshold values for relevant property values.

In the almwatch.csv file (or the appendAlarmWatchTable() API call), you can specify high and low thresholds as well as high and low "warning" thresholds. The local alarm server will then monitor those properties given and issue value_too_high, warn_too_high, value_too_low, warn_too_low alarms accordingly. You can likewise specify the severity such alarms should have. Alarms can also be automatically generated if a specific readback either does or does not match a given pattern.

For example consider the following almwatch.csv file:

LOCALNAME,DEVICENAME,PROPERTY,SIZE,FORMAT,SEVERITY,HIGH,LOW,HIGHWARN,LOWWARN
VACEQM,#0,PRESSURE,600,float,15,1.0E-07,0,.5E-07,0

The presence of this file at startup time (in the directory specified by the FEC_HOME parameter) will cause the local alarm server to check the property PRESSURE every 1000 msecs and compare all 600 float values read against the threshold values 0.5E-07 for issuing a warn_too_high alarm and against 1.0E-07 for issuing a value_too_high alarm. Values of 0 are given for the LOW and LOWWARN states, which in this case are not relevant. A single column 'SEVERITY' is given here, which defaults to assigning the given severity (15) to value_too_high or value_too_low alarms and the severity minus two (13) to warn_too_high or warn_too_low alarms. Individual severities can also be specified by instead using the csv columns 'SEVERITY_HIGH', 'SEVERITY_LOW', 'SEVERITY_HIGHWARN', and 'SEVERITY_LOWWARN'.
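The comparison the local alarm server performs for each value can be pictured roughly as follows. This is a minimal sketch and not the TINE implementation: ExWatchEntry and both functions are hypothetical, the check_high and check_low switches are an assumption made here to express "not relevant" thresholds, and the clamping of the warning severity at 0 is likewise an assumption.

```c
#include <assert.h>
#include <stddef.h>

typedef enum {
    EX_WATCH_OK,
    EX_WARN_TOO_LOW,
    EX_VALUE_TOO_LOW,
    EX_WARN_TOO_HIGH,
    EX_VALUE_TOO_HIGH
} ExWatchState;

typedef struct {
    double high, low, high_warn, low_warn;
    int check_high;   /* hypothetical switch: upper thresholds in use */
    int check_low;    /* hypothetical switch: lower thresholds in use */
    int severity;     /* the single SEVERITY column */
} ExWatchEntry;

/* Classifies one value; the alarm thresholds take precedence over
 * the warning thresholds. */
ExWatchState ex_check_value(const ExWatchEntry *w, double v)
{
    assert(w != NULL);
    if (w->check_high) {
        if (v > w->high) return EX_VALUE_TOO_HIGH;
        if (v > w->high_warn) return EX_WARN_TOO_HIGH;
    }
    if (w->check_low) {
        if (v < w->low) return EX_VALUE_TOO_LOW;
        if (v < w->low_warn) return EX_WARN_TOO_LOW;
    }
    return EX_WATCH_OK;
}

/* Default severity rule: full severity for alarms, severity minus two
 * for warnings (clamped at 0 here), -1 when no alarm is raised. */
int ex_watch_severity(const ExWatchEntry *w, ExWatchState s)
{
    int warn;
    assert(w != NULL);
    warn = w->severity - 2;
    if (warn < 0) warn = 0;
    switch (s) {
    case EX_VALUE_TOO_HIGH:
    case EX_VALUE_TOO_LOW:
        return w->severity;
    case EX_WARN_TOO_HIGH:
    case EX_WARN_TOO_LOW:
        return warn;
    default:
        return -1;
    }
}
```

With the thresholds of the PRESSURE example, a reading of 0.7E-07 would be classified as warn_too_high with severity 13, and a reading of 2.0E-07 as value_too_high with severity 15.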

The available csv columns within almwatch.csv or xml tag within the 'ALARM' tag are

Special Cases

A device server can also optionally detect 'low disk space' automatically and issue an alarm. When the API call SetFreeBlocksAlarmThreshold() is used and points to a valid mounted disk, and the remaining space on the disk falls below the given threshold, a disk space alarm will automatically be generated. The available disk space will be supplied in the alarm data.

CDI servers will likewise automatically set 'hardware error' alarms when an attempt to read or write to a hardware address results in an error. The hardware address of the device accessed will be supplied in the alarm data. If the number of hardware alarms surpasses the current 'alarm collapse window', the alarm server will (as always in such cases) issue a single alarm reflecting the best known state of affairs. In such a way 'alarm storms' (with largely uninteresting and irrelevant information) will not overwhelm the central alarm system and archive. The alarm collapse window can be set via the API call SetAlarmCollapseWindow().

Specific Alarms

Other varieties of alarms should be set inside a server's IO-loop. The suggested strategy is to make use of two API calls, ClearDeviceAlarm() and SetDeviceAlarm(). At the beginning of the IO-loop, ClearDeviceAlarm() should be called once for all devices (device number = -1). This will augment the alarm's 'clear' counter. Then, following hardware readout and other processing, alarms should be set as necessary using SetDeviceAlarm() for the device in question. A call to SetDeviceAlarm() will reset the 'clear' counter. If this procedure is followed, the LAS can determine if an alarm is oscillating or not, as the 'clear' counter must exceed a value of '8' in order for an alarm to be marked as terminated. However if the 'clear' counter increases substantially on its way to eight and is suddenly reset, then the alarm is deemed to be oscillating. This initial threshold value of '8' can possibly be augmented by the local alarm system if it 'learns' of other oscillation criteria. Note that a readout error such as a hardware error might issue several alarms of different severity. The initial hardware error could for instance issue a 'hardware_error' alarm at a warning level, so that someone knows to fix the hardware. However the piece of hardware which is in need of repair might be critical for operations and hence a follow-up 'critical-value-undefined' alarm might then be issued.
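The bookkeeping behind the 'clear' counter can be sketched as a small state machine. This is only an illustration of the idea, not the LAS code: ExAlarmState and both functions are hypothetical, and the value EX_OSCILLATION_MIN (what counts as the counter having increased 'substantially') is an assumption chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

#define EX_CLEAR_LIMIT     8  /* initial threshold given in the text */
#define EX_OSCILLATION_MIN 4  /* assumed meaning of 'substantially' */

typedef struct {
    int active;       /* alarm is present in the alarm list */
    int clear_count;  /* clears seen since the alarm was last set */
    int oscillating;
    int terminated;
} ExAlarmState;

/* Effect of the per-cycle clear on one alarm entry. */
void ex_on_clear(ExAlarmState *a)
{
    assert(a != NULL);
    if (!a->active || a->terminated) return;
    a->clear_count++;
    if (a->clear_count > EX_CLEAR_LIMIT) a->terminated = 1;
}

/* Effect of setting the alarm again for the same device. */
void ex_on_set(ExAlarmState *a)
{
    assert(a != NULL);
    if (!a->active || a->terminated) {
        /* first appearance, or a fresh start after termination */
        a->active = 1;
        a->terminated = 0;
        a->oscillating = 0;
    } else if (a->clear_count >= EX_OSCILLATION_MIN) {
        /* the alarm was well on its way out and came back */
        a->oscillating = 1;
    }
    a->clear_count = 0;
}
```

An alarm that is set in every cycle never lets the counter grow beyond 1, whereas an alarm that stays away for several cycles and then returns is flagged as oscillating.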

Transient alarms are likewise set with SetDeviceAlarm() where the Alarm Descriptor is explicitly set to almINSTANT. Such alarms of course do not need to be cleared, as they are marked immediately for termination.

Note in passing: The legacy API calls SetAlarm(), SetAlarmEx(), ClearAlarm() are still supported but will not be discussed here.

Alarm Definitions

The form of the SetDeviceAlarm() API call is fairly simple. In the TINE c-interface it looks like:

int SetDeviceAlarm(char *eqm, char *dev, long code, BYTE *data, BYTE flags);

The first argument 'eqm' refers to the local equipment module name registered with the equipment module and is not present in an object-oriented interface, where SetDeviceAlarm() appears as a method of an equipment module object.

An alarm is then specified in the call by four parameters, 'dev' giving the device name associated with the alarm, 'code' giving the alarm code associated with the alarm, 'data' giving any optional data associated with the alarm, and 'flags' giving any optional alarm descriptors associated with the alarm.

To the extent that 'data' and 'flags' are optional (i.e. data = NULL, and flags = 0 are allowed), it is primarily 'dev' the device name and 'code' the alarm code which determine the alarm.
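The clear-then-set pattern of an IO-loop cycle might look roughly like the sketch below. So that it compiles without the TINE library, the two API calls are replaced by stand-ins: ex_SetDeviceAlarm() mirrors the parameter list shown above, while the parameter list of ex_ClearDeviceAlarm() is an assumption. The equipment module name "EXEQM", the alarm code 513 and the temperature readout are likewise invented for the example.

```c
#include <assert.h>
#include <stddef.h>

typedef unsigned char BYTE;

/* Call counters used by the stand-ins below. */
int  ex_clear_calls = 0;
int  ex_set_calls   = 0;
long ex_last_code   = 0;

/* Stand-in for ClearDeviceAlarm(); the signature is assumed. */
int ex_ClearDeviceAlarm(char *eqm, int devNr)
{
    (void)eqm; (void)devNr;
    ex_clear_calls++;
    return 0;
}

/* Stand-in with the same parameter list as SetDeviceAlarm(). */
int ex_SetDeviceAlarm(char *eqm, char *dev, long code, BYTE *data, BYTE flags)
{
    (void)eqm; (void)dev; (void)data; (void)flags;
    ex_set_calls++;
    ex_last_code = code;
    return 0;
}

#define EX_CODE_OVERTEMP 513L  /* hypothetical server-specific alarm code */

/* One IO-loop cycle: clear once for all devices, then set alarms only
 * for those devices whose readout is above the limit. */
void ex_io_cycle(const double *temps, char **names, int n, double limit)
{
    static char eqm[] = "EXEQM";
    assert(temps != NULL && names != NULL);
    ex_ClearDeviceAlarm(eqm, -1);
    for (int i = 0; i < n; i++) {
        if (temps[i] > limit) {
            /* no alarm data and no extra descriptor flags */
            ex_SetDeviceAlarm(eqm, names[i], EX_CODE_OVERTEMP, NULL, 0);
        }
    }
}
```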

How then does the LAS know what alarm tag to associate with the alarm code? Or if data are given, how does the LAS know how to interpret it?

It should be realized that the alarm code essentially defines the nature of the alarm, which is assumed to be 'static', i.e. alarm of type '578' always refers to the same kind of alarm. Thus a startup database file alarms.csv is required to give the details behind any given alarm code. If the 'fec.xml' file is used, an 'ALARM_DEFINITION' tag within the 'EQM' section can supply the same information. And, as usual, an API alternative exists via the AppendAlarmInfoTable() routine.

This startup database file will supply a human readable alarm tag to the alarm code (and give a longer description as well), supply the information necessary to interpret the alarm data, give the severity of the alarm, provide html links for further documentation, etc. If this startup file is missing, any alarm set will only be able to forward the alarm code along, with 0 severity and no description and no data.

An 'alarms.csv' (or 'ALARM_DEFINITION' xml tag) can contain the following csv columns (or xml tags):

As an example consider the following alarm definition file:

ALARMTAG,ALARMCODE,ALARMMASK,SEVERITY,DATAFORMAT,DATAARRAYSIZE,ALARMTEXT,DEVICETEXT,DATATEXT,URL
SEDAC error,79,0,1,short,3,SEDAC ERROR,Swars BLM Module,"line,crate,subaddress",http://acclxheeblm.desy.de/alarms.html
BLM init error,512,0,7,short,1,SEDAC READ ERROR,Swars BLM Module,,http://acclxheeblm.desy.de/alarms.html

Thus by setting an alarm with a specific alarm code all other information concerning this alarm can be gleaned by consulting a lookup table based on the definition file. The information in the above configuration file is used to fill in the alarm definition information maintained in the following structure:

typedef struct ADStag   /* Alarm Definition Structure */
{
   char alarmTag[ALARM_TAG_SIZE];
   UINT32 alarmCode;    /* alarm code ID */
   UINT32 alarmMask;    /* alarm mask */
   UINT16 alarmSystem;  /* alarm system ID */
   short  alarmSeverity;/* alarm severity */
   BYTE alarmDataFormat;
   BYTE alarmDataArraySize;
   BYTE reserved[2];
   char alarmText[ALARM_TEXT_SIZE];
   char deviceText[ALARM_TEXT_SIZE];
   char dataText[ALARM_TEXT_SIZE];
   char url[ALARM_TEXT_LONGSIZE];
} ADS;
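The lookup by alarm code can be illustrated with a reduced table. The sketch below is not the LAS implementation: ExAlarmDef is a hypothetical, cut-down stand-in for the definition structure, filled with the two rows of the example file, and an unknown code falls back to severity 0 and an empty tag, as described above for a missing definition.

```c
#include <assert.h>
#include <stddef.h>

/* Cut-down, hypothetical stand-in for an alarm definition entry. */
typedef struct {
    unsigned long code;
    int severity;
    const char *tag;
} ExAlarmDef;

/* The two rows of the example definition file. */
static const ExAlarmDef ex_defs[] = {
    { 79,  1, "SEDAC error" },
    { 512, 7, "BLM init error" },
};

/* Linear search by alarm code; NULL if the code has no definition. */
const ExAlarmDef *ex_find_def(unsigned long code)
{
    for (size_t i = 0; i < sizeof ex_defs / sizeof ex_defs[0]; i++) {
        if (ex_defs[i].code == code) return &ex_defs[i];
    }
    return NULL;
}

/* Without a definition only the code is known: severity 0. */
int ex_severity_for(unsigned long code)
{
    const ExAlarmDef *d = ex_find_def(code);
    return d != NULL ? d->severity : 0;
}

/* Without a definition there is no tag: empty string. */
const char *ex_tag_for(unsigned long code)
{
    const ExAlarmDef *d = ex_find_def(code);
    return d != NULL ? d->tag : "";
}
```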

Alarm Viewer

The TINE Alarm Viewer can be started from any station. In the spirit of "one picture is worth a thousand words", we show below a (live) display of the alarm viewer for the DESY2 pre-accelerator. Note that the alarm viewer has several 'views', the default view being 'just' a subsystem panel showing each category and the number of active alarms in each category. In our example we have expanded the view to also show the active alarms in a table (sorted descending by time).

almViewerJava1.jpg

By clicking on any displayed alarm, one can obtain more detailed information. For instance, clicking on the Vertical Corrector Magnet (V.Korrekt.Mag.) 'Mag.Corr-LW/SVL127' we obtain a display containing more detailed information about this specific alarm, including its 'recent' history.

almViewerJava2.jpg

In live mode, the user can acknowledge alarms locally (within his instance of the alarm viewer) or globally, if he has permission. Globally acknowledging alarms is typically only allowed by the operators in the control room, and lets the alarm viewer always display the same information if several instances are running on different stations. As acknowledged alarms can still be viewed if desired, the casual user in his office can always see the current alarm state of the control system.

As can also be noted on the above display, the alarms are archived at the CAS and can always be recalled if desired. Otherwise, the alarm viewer generally displays alarms from the 'recent' past (last 2 hours) by default.

Central Alarm Server Configuration

The CAS does not collect alarms indiscriminately, but instead gathers only alarms from those servers deemed 'important'. Namely, it reads its information from a configuration database. The database follows the tradition of making use of flat .csv files and consists of a primary file called 'ServerList.csv', which lists the important servers and other processing instructions, and a cross-reference file called 'AlarmCodes.csv'. There might also be secondary files which contain action responses when specific alarms arise.

As an administrator, you can avoid editing these files by hand by making use of the Alarm Database Manager, a snapshot of which is shown below:

almManager.jpg

You can edit or add entries making use of the 'Update List' button, and when finished, clicking on 'Update DB' will send all information to the selected CAS and instruct it to re-read its database and restart.

Using the same management tool, you can change the layout which appears in the alarm viewer GUI. By selecting 'Options' -> 'Alarm Systems Manager' you call up another GUI component:

almManager2.jpg

Here, you can add and remove alarm systems or rearrange the viewer grid by simple drag and drop.

The CAS is capable of starting up without a database, in which case it will make one based on querying the equipment name server (ENS) for 'IMPORTANT' servers for its context. That is, if an administrator has marked (via the ENS administration tool) certain servers within a given context as having importance = IMPORTANT or higher, then the CAS will automatically begin monitoring them, using the associated subsystem information as the alarm subsystems. This in itself will provide a general overview of the relevant alarm situation but may not be the best 'view' for operators. Fine tuning should be done either by hand or using the CAS database manager. In particular, if one needs to establish 'actions' associated with a specific alarm, this must be done with the database manager or by hand if you know what you're doing. 'Actions' might be sending a post-mortem event trigger or sending an email to a responsible person.

For completeness, an example of the principal CAS database file, 'ServerList.csv' is shown below

Context,Server,Extension,SeverityLevel,ArchiveLevel,Retention,AlarmLevel,SubSystem,AlarmSystem
DORIS,DOAlarm,ALM,0,0,7200,1,SER,1600
DORIS,DOTAlarm,TALM,0,0,7200,1,SER,1600
DORIS,DoBeam,BEAM,0,0,7200,1,DIA,600
DORIS,DOBeamLineMon,BEAMLIN,0,0,7200,1,DIA,600
DORIS,DOBUNCHE,BUN,0,0,7200,1,DIA,1360
DORIS,DOIDC,IDC,0,0,7200,1,DIA,1350
DORIS,DoMstUpd,MST,0,0,7200,1,EXP,1400
DORIS,DONEG,NEG,0,0,7200,1,VAC,350
DORIS,DOPilSta,PILO,0,0,7200,1,SUB,1050
DORIS,DORCAV,ERF,0,0,7200,1,RF,150
DORIS,DORFB,ERFFB,0,0,7200,1,RF,150
DORIS,GLOBALS,GLOB,0,0,7200,1,SER,1600
DORIS,STATE,STATE,0,0,7200,1,SER,1600
DORIS,DORQ1,DORQ1,0,0,7200,1,RF,150
DORIS,DORQ4,DORQ4,0,0,7200,1,RF,150
DORIS,DOTRCRF,TRC,0,0,7200,1,RF,150
DORIS,DoTrigger,TRIG,0,0,7200,1,INJ,400
DORIS,DOVAC,VAC,0,0,7200,1,VAC,350
DORIS,Bunche_RWeg,BUNCHRWEG,0,0,7200,1,DIA,1360
DORIS,HasyAlarms1,HASY1,0,0,7200,1,SUB,1090
DORIS,HasyAlarms2,HASY2,0,0,7200,1,SUB,1090
DORIS,HasyAlarms3,HASY3,0,0,7200,1,SUB,1090
DORIS,HasyAlarms4,HASY4,0,0,7200,1,SUB,1090
DORIS,HasyAlarms5,HASY5,0,0,7200,1,SUB,1090
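A row of this form can be split into its nine fields with a single sscanf() call. The following is a sketch for illustration only and says nothing about how the CAS actually reads its database: ExServerEntry and ex_parse_server_row() are hypothetical, the field lengths are arbitrary, and the columns are simply stored without interpreting them.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory form of one 'ServerList.csv' row. */
typedef struct {
    char context[32];
    char server[32];
    char extension[32];
    int  severity_level;
    int  archive_level;
    int  retention;
    int  alarm_level;
    char subsystem[32];
    int  alarm_system;
} ExServerEntry;

/* Parses one data row. Returns 1 if all nine columns were read,
 * 0 otherwise (for instance if a column is missing). */
int ex_parse_server_row(const char *line, ExServerEntry *e)
{
    int n;
    assert(line != NULL && e != NULL);
    n = sscanf(line, "%31[^,],%31[^,],%31[^,],%d,%d,%d,%d,%31[^,],%d",
               e->context, e->server, e->extension,
               &e->severity_level, &e->archive_level, &e->retention,
               &e->alarm_level, e->subsystem, &e->alarm_system);
    return n == 9;
}
```

Checking the return value of sscanf() against the expected column count is a cheap way to catch rows with only eight fields before they are used.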

One sees the following csv columns are in play:

In addition, the CAS makes use of a CAS subsystem number and name cross-reference file called 'alarmCodes.csv', an example of which is shown below.

NR, COLUMN, ROW, TAG, SUBSYSTEM
100, 1, 1, Magnet, MAG
101, 1, 2, HCorr, MAG
102, 1, 3, VCorr, MAG
150, 1, 4, HF, RF
1120, 1, 5, Wasser-HF, SUB
730, 1, 6, HF-Dump, RF
171, 1, 7, Wiggler, RF
350, 1, 8, Vac, VAC
3001, 1, 9, BL.Interlock, SUB
3000, 1, 10, Per.Interlock, SUB
1090, 1, 11, HasyLab, SUB
400, 2, 1, SeKi, SEKI
940, 2, 2, Trigger Mod, TIM
950, 2, 3, Timing, TIM
500, 2, 4, Feedback, FB
520, 2, 5, Tune Cntl, DIA
600, 2, 6, Orbit, DIA
103, 2, 7, Lage Regl., MAG
1490, 2, 8, FotoMon, DIA
1500, 2, 9, Schirme, DIA
850, 2, 10, Profile, DIA
5000, 3, 1, System, SYS
5001, 3, 2, Hardware, HDW
10, 3, 3, Radio, SER
3010, 3, 4, DatenTV, SUB
1350, 3, 5, IDC, DIA
1360, 3, 6, Bunche, DIA
1361, 3, 7, Neb.Bunche, DIA
1351, 3, 8, I-Hist, SER
830, 3, 9, Scope, DIA
1400, 3, 10, Scraper, EXP
1010, 3, 11, Temp, DIA
#0, 3, 12, Test, TST 

Here the following csv columns play a role:

The CAS will both read information files and maintain its own information files in a parallel directory/subdirectory called 'CACHE/AlarmInfo'. Each monitored server will maintain files here, specifically within subdirectories given by the 'EXTENSION' information found in the 'ServerList.csv' file (shown above). The CAS itself will acquire the relevant alarm definition tables for the monitored servers and maintain a cache repository.

If a file named 'actions.csv' is found in the 'extension' subdirectory within this repository, the CAS will attempt to read it and, if successful, will scan incoming alarms from the device server in question to see if a specific alarm requires an action. An example 'actions.csv' file is shown below:

Context,Server,Device,Property,AlarmCode,ActionLevel,mailto
HERA,STORAGE,hevac,TRIGGER,515,1,
HERA,STORAGE,SLCAlarm,TRIGGER,521,1,
,,,,530,john.doe@myinsitute.com


The relevant csv columns are given by:

