The ESI Spectrograph Online Documentation

Dashboard Failure Modes and General Diagnostic Tips

Three kinds of error can occur when using KTL-based Dashboards:

  1. The Dashboard can fail to find needed configuration or startup files
  2. The backend (instrument hardware and/or software) can fail
  3. The Dashboard code can manifest a previously unknown bug

I will treat these possibilities in descending order of probability.


A code bug

The surfacing of a source code bug in Dashboard will probably result in a popup TkError box (a plain gray box containing an error message and a "Stack Trace" button). If this happens, the correct thing to do is:
  1. Use the Stack Trace button to see the offending code.
  2. Using X11 copy/paste, paste the code into a mail message to de@ucolick.org, with some description of the circumstances surrounding the error (i.e. what was the last action the user took before the error).
  3. Dismiss the popup and see if the Dashboard will continue to function after the error. If it is hung or dies, try restarting it. If there is time, try reproducing what the user did immediately prior to the error.
Code bugs should be very few in normal operation. They should not happen. Any error popup of this kind will be taken very seriously and I'll try to patch it within 24 hours.

Misconfiguration

The Dashboard code requires several inputs as it starts up. First, it must be started by the correct name and with the correct command line arguments. Next, it must load the right files to configure itself for the current version of the instrument or service you want to control. Lastly, if the instrument has removable elements such as filters and you want to see real filter names in the user menus, the "dynamic configuration" phase must succeed, which requires momentary access to the database server.

Of these possibilities, the most likely to go wrong is the dynamic configuration feature, so check that diagnostic path first.

Next most likely is that the .kwd file or .dbd file has been damaged, by installing from a bad copy of the source, or by well-intentioned manual meddling. In general, check the datestamps on these files and compare them with those of the last versions checked in to CVS.

After this in likelihood, it's possible the UI was started up wrongly. The Dashboard should always be started by a script or menu which runs the standard "esi start" scripts. The command line arguments to Dashboard are correctly specified in the "esi start dashboard" script. However, if it has been manually started or if the script has been broken, bad command line arguments can result in an odd appearance or in a failure. Bad command line arguments may result in failure to do dynamic configuration, so you should probably eliminate this possibility before delving further into any dynamic configuration issues.


Backend Hang or Failure

If the backend fails, the Dashboard should gracefully detect this condition and inform the user. We have tried to make the music, traffic, dispatcher, Galil, and client connections as robust as possible, so that any part of the control system can fail and be restarted without dire consequences for the rest. However, there may be some conditions which the Dashboard can't recover from.

If the backend goes away (i.e. the dispatcher dies or hangs, traffic dies or hangs, or the Galil controller goes catatonic) then the Dashboard should detect and report this condition. If traffic is still up but the dispatcher or hardware is no longer talking, there are a couple of keywords which stop updating. Dashboard watches those keywords, and if they stop updating for more than a certain number of seconds, it pops up an alert box informing the user that "CTRL0CLK (or CTRL1CLK) has stopped." On most production GUIs there will be a "Backend Status" button which will pop up a status panel for the dispatchers, controllers, etc.
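
The keyword-watching idea can be pictured with a small sketch. This is not the Dashboard's own code; the class name HeartbeatWatch and the 10-second timeout are assumptions made purely for illustration (the text above says only "a certain number of seconds"). A value counts as a heartbeat only when it differs from the previous one.

```python
import time


class HeartbeatWatch:
    """Track when each heartbeat keyword last changed value."""

    def __init__(self, keywords, timeout_s=10.0, now=time.monotonic):
        self.timeout_s = timeout_s  # assumed value; the real limit is not documented here
        self._now = now             # injectable clock, handy for testing
        start = now()
        self._last_value = {k: None for k in keywords}
        self._last_change = {k: start for k in keywords}

    def update(self, keyword, value):
        """Record a received value; only a changed value resets the timer."""
        if value != self._last_value[keyword]:
            self._last_value[keyword] = value
            self._last_change[keyword] = self._now()

    def stalled(self):
        """Return the keywords that have not changed within the timeout."""
        t = self._now()
        return sorted(k for k, seen in self._last_change.items()
                      if t - seen > self.timeout_s)
```

A caller would feed update() from whatever delivers keyword values and poll stalled() periodically, raising an alert for each keyword name returned.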

If the traffic connection is lost and cannot be regained, the Dashboard will shut itself down after a warning message to the user.

Shutting down and restarting the dashboard is relatively painless. It should never "just hang". But if it does, the first thing to do is note when and under what circumstances it hung, and either enter this in the nightlog or email it to de@ucolick.org. Then kill and restart the Dashboard. If it is too badly hung to respond to its own Quit button or a window manager Close or Destroy, then look in the process table for the string "obsr", check that it has the right arguments for the service you are using, and kill that process.
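
As a rough aid for the process-table search, the sketch below filters saved ps output for lines mentioning "obsr". It assumes the PID is the second whitespace-separated column, as in "ps -ef" output on many systems; other ps formats need a different column index. The function name is hypothetical.

```python
def find_dashboard_processes(ps_output, needle="obsr"):
    """Return (pid, line) pairs for ps output lines that mention the needle.

    Assumes a header line followed by rows whose second column is the PID.
    """
    matches = []
    for line in ps_output.splitlines()[1:]:  # skip the header line
        fields = line.split()
        if len(fields) < 2 or not fields[1].isdigit():
            continue  # not a row in the assumed format
        if needle in line:
            matches.append((int(fields[1]), line.strip()))
    return matches
```

Inspect the arguments on each matching line before killing anything: a search for "obsr" will also match unrelated commands that happen to contain the string, and a Dashboard for a different service.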

The Dashboard keeps a fairly verbose log as it runs. Look for a file like "ktui_931289126.log", where the large integer is a Unix clock value (seconds since 1 January 1970 UTC). You can turn this integer into a formatted date by various means. Perhaps the simplest is to start tcl (type "tcl") and at the prompt type "clock format 931289126". This will tell you the start time of the dashboard which wrote the log file. If something bizarre happens to the Dashboard and you have to kill it, you should mail the last 50 to 100 lines of the log file to de@ucolick.org.
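
If tcl is not at hand, the same conversion and the log excerpt can be had from a short Python sketch. The function names are illustrative, and the filename pattern is inferred from the single example above. Note that this version reports UTC, whereas "clock format" in tcl prints local time by default.

```python
import re
from collections import deque
from datetime import datetime, timezone


def log_start_time(filename):
    """Return the start time encoded in a name like ktui_931289126.log, in UTC."""
    match = re.search(r"ktui_(\d+)\.log$", filename)
    if match is None:
        raise ValueError(f"not a Dashboard log name: {filename!r}")
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


def last_lines(path, count=100):
    """Return the last `count` lines of a text file."""
    with open(path, "r", errors="replace") as handle:
        return list(deque(handle, maxlen=count))


print(log_start_time("ktui_931289126.log").isoformat())  # 1999-07-06T19:25:26+00:00
```

The output of last_lines() can be pasted straight into the mail message.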


Specific Failure Modes and Diagnostic Procedures


  • No Dynamic Configuration
  • Keywords Missing or Incorrect
  • Startup Errors: bogus appearance or behaviour
  • Dashboard Hangs
  • Dashboard Dies
  • Dashboard Won't Start

    No Dynamic Configuration

    Symptom: When the user requests a filter, grating, or other removable element change, nothing moves and the dispatcher gets an error "unable to map name to position".
    Procedure: We suggest that you read the Overview and Diagnostic writeups for inconfig, the tool used to configure the instrument for removable elements such as filters, gratings, masks, etc. This will give you a better understanding of how inconfig is used and how it works with Dashboard. However, you can also just start checking the following possibilities:
    1. Can a modify command use the names that the Dashboard is unable to use? If so, the dispatcher and the Dashboard are out of synch. This should never happen. If it happens, one of these procedural or coding errors has taken place:
      1. The dispatcher dynamic config file (mapRON.cfg, in kroot/data/esi/dyna directory) has been manually replaced or altered without using inconfig.
      2. Inconfig has failed without completing its "Commit" operation.
      3. The database has been manually altered without using inconfig.
      In all three cases, the next thing to do is re-run inconfig and attempt to correct the data, then Commit. This should work. However, in the unlikely event that inconfig fails, there is a fallback strategy.

      If inconfig fails the quickest strategy is to revert to "factory" settings:

      1. EITHER delete the mapRON.cfg file from the dyna subdirectory of kroot/data/esi, and restart the dispatcher (recommended) OR manually restart the dispatcher WITHOUT the -u flag.
      2. restart the UI with the -z flag (this suppresses dynamic mapping)
      3. provide the user with a written map of filter positions to names
      4. proceed with observing
      5. report the inconfig failure ASAP to de@ucolick.org
    2. Less likely, but possible, is that the Dashboard was not started up correctly. It may have been started manually, or someone may have meddled with the startup scripts for the service. If the -x command line flag was mistakenly used at startup, this disables the database connection entirely. Or the -z flag might have caused dynamic configuration data to be ignored. Also, if the correct envars are not set in the startup script, it may not be possible to find the database server. Check the startup scripts. If the process table (ps) shows you any command line arguments, check the arguments for the "obsr" process.
    3. Is the database server online? Try running another (preferably harmless) database application. If the server is down and cannot be restarted, then in an emergency revert to "factory" settings as above, and proceed.
    Explanation: See the write-up for inconfig for a full explanation of dynamic configuration.

    Keywords Missing or Incorrect

    Symptom: Error messages complaining about variables and keywords not existing appear on stderr during startup, and are highlighted in yellow in the session log widget. Widgets which usually appear may fail to appear, causing "missing pieces" in the UI.
    Procedure:
    1. Can a modify command use the keywords which are causing errors in the dashboard? If it can, then the dashboard .kwd file is out of rev with the keyword library and/or with the .dbd file. This can only happen if the .kwd file has been manually replaced or altered, or if the keyword library has been manually replaced or altered. A proper build/install procedure will always keep these files in synch.

      Check the date on the file. If it doesn't match the dates of other installed files in the same directory, move it to a new name for later examination; retrieve the latest version from CVS and reinstall it. Restart Dashboard.

    2. If a modify command cannot use the keywords which are causing errors in the dashboard, then the .kwd file and the keyword library are in synch, but the .dbd file is bad. Again, this should never happen if proper build/install procedures are followed. Check the .dbd file just as described for the .kwd file above.
    Explanation: The .kwd file tells the Dashboard application everything it needs to know about all the keywords for a KTL service. This file must match the keyword library used by all KTL clients, and the .dbd file which is used to paint the actual graphical dashboard on your screen. If there is disagreement between them about the spelling of keyword names, the existence of keywords, the datatypes of keywords, etc. then errors will occur. All these files are generated and installed simultaneously by a normal build/install procedure. Only a failure to follow procedure, or perhaps a partial recovery from disk loss, should result in mismatched configuration files.
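
The date check in the procedure can be roughed out as follows. This is only a heuristic sketch: the function name and the one-hour tolerance are assumptions, and comparing against CVS remains the authoritative test. It flags files whose modification time lies far from the median for the directory.

```python
import os


def files_with_odd_dates(directory, tolerance_s=3600):
    """List files whose mtime differs from the directory median by more than tolerance_s."""
    stamps = {}
    for entry in os.scandir(directory):
        if entry.is_file():
            stamps[entry.name] = entry.stat().st_mtime
    if not stamps:
        return []
    ordered = sorted(stamps.values())
    median = ordered[len(ordered) // 2]  # upper median for even counts
    return sorted(name for name, mtime in stamps.items()
                  if abs(mtime - median) > tolerance_s)
```

A .kwd or .dbd file appearing in the returned list is a candidate for the move-aside-and-reinstall step described above.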

    Startup Errors: bogus appearance or behaviour

    Symptoms: Different symptoms are possible based on the different startup flags and envars that may be incorrect if the UI is manually started.
    1. Developer Mode: the dashboard widgets all have big ugly bezels or frames around them. This means that the -d flag was used at startup.
    2. No Database: dynamic configuration of menus doesn't happen, which looks a lot like dynamic configuration errors described above. The -x flag suppresses use of the database server and therefore disables dynamic menu configuration. The -z flag suppresses dynamic menu configuration, but permits access to the database for other purposes. If either of these flags was used at startup, dynamic configuration is prevented.
    3. No KTL connection: the -f flag causes the Dashboard to run in "Fake" mode, without making any real KTL connection. The UI will look very weird, with zero and "<udf>" values and a lot of yellow highlighting.
    4. No KTL Writes: the -s "safe" mode flag will cause the Dashboard to ignore all KTL write requests. It will passively monitor the service but not move anything.
    5. No Envir Alarms: certain alarms, mostly temperature and voltage, are automatically implemented, but can be suppressed by the -l flag. If this flag is incorrectly specified at startup, then these voltages and temperatures can get out of range without any alarm popup or log entry.
    At startup time, Dashboard spits out some diagnostic text to stderr. This text verifies which Modes have been selected. Many environment variables are also needed at startup. The startup scripts provided by the instrument developer or UI designer should take care of all this, and you should use them.
    Procedure: Determine how this copy of Dashboard was started. If it was started manually, kill this copy and restart using the canonical (menu or script) method for this instrument. If that does not work, then someone has broken the canonical startup method. You may attempt to diagnose this, or if in a hurry, try a command line like this: "obsr .kwd .dbd", which may be close to correct. However, you will need to know a lot about the shell environment needed by Dashboard, so this is not recommended for the inexpert.
    Explanation: The Dashboard should always be started by a script or menu which runs the standard "esi start" scripts. The command line arguments to Dashboard are correctly specified in the "esi start dashboard" script. However, if it has been manually started or if the script has been broken, bad command line arguments can result in an odd appearance or in a failure.
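
When reading the arguments of a running "obsr" process, a lookup like the sketch below may save some page-flipping. The flag meanings are taken from the symptom list above; the function name is hypothetical, and the sketch assumes each flag appears as a separate argument (whether Dashboard accepts combined flags is not stated here).

```python
# Flag meanings as described in the symptom list for startup errors.
MODE_FLAGS = {
    "-d": "Developer Mode: widgets drawn with bezels or frames",
    "-x": "No Database: database server not used, no dynamic menu configuration",
    "-z": "dynamic menu configuration suppressed (database still available)",
    "-f": "Fake mode: no real KTL connection",
    "-s": "Safe mode: KTL write requests ignored",
    "-l": "environment alarms suppressed",
}


def describe_mode_flags(argv):
    """Return a description for each recognised mode flag found in argv, in order."""
    return [f"{arg}: {MODE_FLAGS[arg]}" for arg in argv if arg in MODE_FLAGS]
```

An empty result means none of these mode flags was given, which is what one would hope to see for a Dashboard in normal observing use unless the startup script says otherwise.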

    Dashboard Hangs

    Symptom: The UI stops responding. If a window is moved above it and then moved away, the obscured area does not repaint. It does not respond to X11 events such as mouse movement or clicks.
    Procedure: This should never happen, so if it does happen it should be taken seriously, carefully documented, and immediately reported.
    1. Check show and modify commands. Are they hanging?
    2. Try a cshow command of some continuously varying keyword like CTRLnCLK. Does it hang?
    3. Are other X11 clients on the same screen also hanging?
    • If other X11 clients are hanging, there may be something wrong with the workstation where the UI is running. If show and modify are also hanging, then there is something wrong with the backend and you should proceed to check the backend status.
    • If only Dashboard is hanging, then you have encountered a new code bug. You should send its last stderr output, plus the last 40 or so lines of the UI logfile, to de@ucolick.org, along with some description of what was happening around the time of the hang. You should kill the GUI (you'll probably have to do this using the process table and an explicit 'kill' command). Then try to restart it, and report (by email) how it behaves when starting up: send any stderr output to de@ucolick.org.
    Explanation: In 1999, UCO/Lick SPG spent a lot of time trying to make the music/traffic/KTL connections robust, so that failure or restart of one component doesn't cause other components to hang. We think we succeeded. Dashboards for recent Lick-built instruments should not hang, even if backend processes like traffic or the dispatcher go away. So a Dashboard hang indicates a serious problem.
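
The three checks in the hang procedure amount to a small decision table, sketched here. The function name is hypothetical, and treating a hanging cshow the same as a hanging show or modify is an assumption; the procedure itself only names show and modify when pointing at the backend.

```python
def triage_hang(show_modify_hang, cshow_hangs, other_x11_clients_hang):
    """Map the three observations from the hang procedure to likely culprits."""
    findings = []
    if other_x11_clients_hang:
        findings.append("workstation: other X11 clients are hanging too")
    if show_modify_hang or cshow_hangs:  # cshow handling is an assumption
        findings.append("backend: check the backend status")
    if not findings:
        findings.append("dashboard: probable new code bug; collect stderr and log, then report")
    return findings
```

More than one finding can come back, since a sick workstation and a sick backend are not mutually exclusive.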

    Dashboard Dies

    Symptom: Dashboard exits spontaneously.
    Procedure: Gather any stderr output, and the last few lines of the log file. Restart the Dashboard. Mail the gathered text to de@ucolick.org.
    Explanation: The Dashboard should only shut itself down in the event that it is disconnected from traffic and cannot reconnect. It should pop up a nasty-looking warning message, then shut down within a couple of minutes. If the user wandered away from the console, this warning message might be missed; but it would appear in the log file and on stderr.

    Dashboard Won't Start

    Symptom: Dashboard refuses to start up.
    Procedure: Is there another Dashboard for the same service already running on the same display? There is a built-in constraint which prevents duplication of UI for the same service on the same X server. If you really need to run two Dashboards for the same service on the same server, you need to set the environment variable SILLYWIZARDS (to any value you like) and run the 2nd dashboard manually. This is not recommended for the inexpert.
    Explanation: Multiple Dashboards can be very confusing for the user. Duplicate copies of alarms and other popups start to appear. We found it best to prevent this situation.

The Observer documents are hand-written. The Technical Documents are produced from plain text files in the CVS source tree by some Tcl scripts written at UCO/Lick Observatory. The Reference Documents are mostly generated by software from data in a relational database. Individual authors are responsible for the content of the Observer and Technical Documentation. The Lick SPG as a whole is responsible for the content of the Reference Documents. Send mail to de@ucolick.org to report inconsistencies or errors in the documentation.