Orca Frequently Asked Questions

This is a FAQ (Frequently Asked Questions) for Orca and the tools that
gather data for it.

Please email submissions to the FAQ to orca-users@orcaware.com.

# $HeadURL: http://svn.orcaware.com:8000/repos/trunk/orca/FAQ $
# $LastChangedDate: 2003-10-07 10:49:43 -0700 (Tue, 07 Oct 2003) $
# $LastChangedBy: blair $
# $LastChangedRevision: 262 $

General
-------

  1.1) What is the m, k, u, or some other character following the
       numbers in the Y axis scale?
  1.2) Why is my Y axis scale have such large numbers?
  1.3) Why are there random characters at the end of my HTML and GIF
       or PNG images names, i.e.
       o_host3_disk_runp_c0t6d0...disk_runp_c-4QyP2ziXlrwXj8eG_n_A.html?
  1.4) What should I use, NFS or rsync, to get my data from my clients
       to the Orca server?  Should I push my data to the server from
       the clients or have my server pull my data?
  1.5) How should I set up ssh access securely without entering a
       password everytime a process needs to contact a remote system?

Warning Messages
----------------

  2.1) Number of columns in line '1,2,3.....' of
      ../orcallator/.../percol-2000-09-26 does not match column
      description.
  2.2) Warning: file `../orcallator/.../temp-percol-2001-02-22' was
       current and now is not.
  2.3) Warning: cannot create Orca::HTMLFile object: cannot open
       `/home/orca_html/o_host1-monthly.html.htm' for writing: Too
       many open files.
  2.4) Warning: file `.../orcallator/host1/orcallator-2001-11-06-000'
       did exist and is now gone.

Solaris/Orcallator.se
---------------------

  3.1) What do I do about the error "/opt/RICHPse/bin/se: Unsupported
       platform: sparcv9 SunOS 5.8"?
  3.2) Why does orcallator.se die, in particular when I log out of the
       system that I started it on?
  3.3) Why is the page scan rate is always zero even though sar says
       it is not?
  3.4) Why are all the NFS server statistics zero?
  3.5) Why are my Ethernet bits/second measurements all zero?
  3.6) Why do I get the error "Fatal: member: txunderruns vanished!:
       Near line 201"?
  3.7) Why do I get the error "Fatal: member: txunderruns0 vanished!:
       Near line 255"?
  3.8) Why do I get the error "Fatal: member: framming vanished!: Near
       line 160"?
  3.9) Why do I get the error "Fatal: member: drop vanished!: Near
       line 285"?
 3.10) Why do I get the error "Fatal: member: XXXXX vanished!: Near
       line YYY"?
 3.11) What do I do when I get the message "Fatal: subscript: 2 out of
       range for: GLOBAL_net[2]: Near line 178"?
 3.12) Why don't I get plots for my Veritas filesystems?
 3.13) Why don't I get plots for my RSM filesystems?
 3.14) Why don't I get any Interface Bits Per Second data for my qe
       board?
 3.15) Orcallator.se core dumps.
 3.16) Why should I keep my compressed percol-* or orcallator-* files?
 3.17) Why do my Orca plots no longer contain any data after I change
       anything related to orcallator, such as the subsystems to
       measure, or when something changes on the system, such as mount
       points, ethernet devices, etc?

General
-------

  1.1) What is the m, k, u, or some other character following the
       numbers in the Y axis scale?

       This is SI magnitude symbol to scale the number by.  Here is a
       table of symbols and scaling factors.

         a  10e-18 Ato
         f  10e-15 Femto
         p  10e-12 Pico
         n  10e-9  Nano
         u  10e-6  Micro
         m  10e-3  Milli
         k  10e3   Kilo
         M  10e6   Mega
         G  10e9   Giga
         T  10e12  Terra
         P  10e15  Peta
         E  10e18  Exa

       So if you see "250 m" in the Y axis, this means 0.25.

       The page

         http://www.physlink.com/reference_dprefixes.cfm

       is also a good reference for this information.

  1.2) Why is my Y axis scale have such large numbers?

       See question 1) above.  Most likely there is a scaling letter
       after your number that means the true value is several orders
       of magnitude smaller than you think it is.

  1.3) Why are there random characters at the end of my HTML and GIF
       or PNG images names, i.e.
       o_host3_disk_runp_c0t6d0...disk_runp_c-4QyP2ziXlrwXj8eG_n_A.html?

       The way Orca generates HTML and image filenames uses all of the
       input data sources.  With plots that contain a large amount of
       different data, the filename can exceed the maximum filename
       length for the operating system and Orca will not be able to
       create the file.

       The solution is to limit the filename length to less than 255
       characters.  This could be performed by simply trimming the
       filename down to less than 255 characters, but then the
       filename may not be unique and two distinct HTML files and/or
       images may end up being written to the same file.  The solution
       used by Orca is to calculate the MD5 hash of the full length
       filename, trim the filename down and insert the MD5 into the
       short filename, which will guarantee uniqueness.

  1.4) What should I use, NFS or rsync, to get my data from my clients
       to the Orca server?  Should I push my data to the server from
       the clients or have my server pull my data?

       [Answer written by Sean O'Neill <sean@seanoneill.info>.]

       Yeah, NFS is a total pain for more reasons than just security.

       rsync is the way to go.  By default, it uses ssh as it's
       transport application vs. rsh.  Don't use rsh for the obvious
       reasons.

       But you need to really think about what this means in regards
       to security.  First, your security group is probably going to
       be nervous about anything that allows for unattended
       password-less access between servers.  But you also need to
       figure out if you want to PUSH or PULL your Orca data.

       If you have a Orca server that ssh's into the remote systems
       and rsync's down the data (e.g. PULL), this one machine would
       have ssh access to LOTS of other systems and would probably
       make any security group very nervous about that machine.

       If you have the remote systems rsync their data to the Orca
       server (e.g. PUSH), then you have lots of other machines with
       ssh access to ONE system.  This generally makes security a
       /little/ less nervous.

       Some folks on the list have multiple Orca servers because of
       the system resources required by Orca.  Its a CPU/memory hog at
       times.  Also, pushing data into a box is generally an
       asynchronous activity (from the Orca server's point of view) so
       it will take in as many as the box will support.

       If the Orca server is PULLING data, you need some script to
       keep track of what systems to pull data from, have logic to
       make it less serial to get the data down faster, etc etc -
       e.g. its more of a headache IMHO.

  1.5) How should I set up ssh access securely without entering a
       password everytime a process needs to contact a remote system?

       To get ssh working, use key authentication.  One easy way to
       use key authentication is to use the keychain tool at

       http://www.gentoo.org/proj/en/keychain.xml

       The first keychain article introduces the concepts behind
       RSA/DSA key authentication and shows you how to set up
       primitive (with passphrase) RSA/DSA authentication:

       http://www-106.ibm.com/developerworks/library/l-keyc.html

       The second article shows you how to use keychain to set up
       secure, password-less ssh access in an extremely convenient
       way.  keychain also provides a clean, secure way for cron jobs
       to take advantage of RSA/DSA keys without having to use
       insecure unencrypted private keys.

       http://www-106.ibm.com/developerworks/linux/library/l-keyc2/

       A third keychain article shows you how to use ssh-agent's
       authentication forwarding mechanism.

       http://www-106.ibm.com/developerworks/linux/library/l-keyc3/

       Even with these methods, when a system reboots, a person will
       need to manually log into the system, su into the account, run
       keychain and enter the passphrase to unlock the RSA/DSA keys.

Warning Messages
----------------

  2.1) Number of columns in line '1,2,3.....' of
       ../orcallator/...../percol-2000-09-26 does not match column
       description.

       When Orca sees a line in an input data file that does not have
       the same number of columns as defined at the top of the file
       when column_description is `first_line' or does not match the
       column_description, then Orca will complain and ignore the
       line.  Additionally, Orca will not record the data from this
       line in the RRD files and no data will be plotted.

       If this happens when using orcallator.se, then it will happen
       for orcallator.se versions 1.28b6 or older when hardware, mount
       points, network interfaces, etc. are added or removed from the
       system and orcallator.se outputs a different number of columns.

       To work around this problem, upgrade to version 1.32 or later
       of orcallator.se and orcallator.cfg at

         http://www.orcaware.com/orca/pub/

       which now create new output data file any time the number of
       columns or a name of a column changes so that Orca will not
       complain and all of your data will be plotted.

  2.2) Warning: file `../orcallator/.../temp-percol-2001-02-22' was
       current and now is not.

       First, Orca considers a file to be current if the file's last
       modified time is within `late_interval' seconds of the current
       time.  In other words, Orca checks if a process is modifying
       the file to keep it current.  The `late_interval' value is
       determined by the configuration file or set to the `interval'
       value if the configuration file does not set `late_interval'.

       Orca stat()s the file when it first looks for files using the
       `find_files' and determines that the file is current.  Any time
       after that Orca reads the file, it stat()s the file again and
       determines if it is current.  If there was a previous stat()
       and the file was current followed by another stat() and the
       file is not current, then the message is printed.

       The appearance of this message means that the process that has
       been updating the file has stopped updating it and this may be
       worth looking into.

       This message is also seen when the data gathering program,
       e.g. orcallator.se, opens a new log file at the end of a day
       and the old log file is no longer updated.  Orca tries to
       manage this situation when a file is no longer updated at the
       end of a day.

       If the actual measurement interval is not consistent with
       Orca's configuration file "interval" 300 seconds, then Orca's
       "interval" should be modified to match the actual measurement
       interval.  If you need to do this, then delete the RRD files
       because they will keep the old "interval" and the input data
       will need to be reloaded into new RRD files.

       Increasing the "late_interval" may also remove this error.

  2.3) Warning: cannot create Orca::HTMLFile object: cannot open
       `/home/orca_html/o_host1-monthly.html.htm' for writing: Too
       many open files.

       This obviously happens with Orca runs out of open file
       descriptors.  Orca opens many file descriptors to do its work
       and it doesn't like to close them unless it needs to.

       The first thing to check is the maximum number of file
       descriptors each process can have.  On some systems, the login
       shell scripts lower the maximum number of open file descriptors
       a process may have.

       To check this in a Csh shell variant (csh, tcsh), then type

         limit descriptors

       or for Bourne shell variant (sh, bash), then type

         ulimit -n

       On all operating systems Orca should be able to use 256 file
       descriptors.  On some, such as Linux, Orca can open 1024 files
       at once.  If the number you are getting is less than 256, then
       raise this limit.  Some operating systems let you raise the
       limit. such as Solaris, while others do not, such as Linux.  To
       try to raise the limit, do

         limit descriptors 1024

       or

         ulimit -n 1024

       If these commands do not work, ask your system administrator
       how to do this.

       There is a bug in Orca's older than 0.27b2 where Orca would not
       close a pipe file descriptor that is uncompressing a compressed
       percol-* file to Orca.  If your percol-* files are compressed,
       then try either upgrading to 0.27b2 or later or apply the patch

         http://www.orcaware.com/orca/pub/patches/orca-0.26-defunct-processes-patch.txt

       to Orca 0.26.  This should have Orca reduce its file descriptor
       count.

  2.4) Warning: file `.../orcallator/host1/orcallator-2001-11-06-000'
       did exist and is now gone.

       Orca prints this message when it found an input data file to
       read and when it goes to read it, which may be a while later,
       the file no longer exists.

       When Orca is being used with orcallator.se, this message may
       occur when orcallator.se compresses the previous day's percol-*
       or orcallator-* file that Orca found.

Solaris/Orcallator.se
---------------------

  3.1) What do I do about the error "/opt/RICHPse/bin/se: Unsupported
       platform: sparcv9 SunOS 5.8"?

       There are two solutions.  SE 3.2 is now available and you can
       upgrade to this version.  It is available at:

         http://www.setoolkit.com/

       If you are using an SE version less than 3.2, then do this:

         cd /opt/RICHPse/bin
         ln se.sparc.5.7   se.sparc.5.8
         ln se.sparcv9.5.7 se.sparcv9.5.8

  3.2) Why does my background orcallator.se process die, in particular
       when I log out of the system that I started it on?

       This sounds like orcallator.se is started under the Bourne
       shell, which kills background processes unless the process is
       started with nohup:

         nohup se orcallator.se &

  3.3) Why is the page scan rate is always zero even though sar says
       it is not?

       It has been observed that on Solaris 2.5.1 the page scan rate
       is always zero.  This occurs with orcallator.se version 1.23 or
       older.  The problem is that on older versions of SE the
       p_vmstat.scan variable is an integer and orcallator.se was
       assuming a double.  The fix is to upgrade to orcallator.se 1.24
       or newer.

  3.4) Why are all the NFS server statistics zero?

       On Solaris 2.6 there is a bug in the SE toolkit that prevents
       orcallator.se from getting the NFS server statistics.  To fix
       this, edit your RICHPse/include/kstat.se file and change #ifdef
       to #if near the top of the file where it says

         #ifdef MINOR_VERSION >= 70
         # define rfs_counter_t uint64_t
         #else

       to

         #if MINOR_VERSION >= 70
         # define rfs_counter_t uint64_t
         #else

  3.5) Why are my Ethernet bits/second measurements all zero?

       On 2.5.1 or older Solaris operating system releases, the kernel
       does not measure the bits/second going through a particular
       device.  I believe some later kernel and device driver patches
       may fix this, but Solaris 2.6 and greater definitely does
       measure this.

  3.6) Why do I get the error "Fatal: member: txunderruns vanished!:
       Near line 201"?

       This problem occurs with SE 2.5.0.2 and FDDI 5.0
       interfaces.  There are three possible fixes:
         1) Upgrade to SE 3.0.
         2) Visit the SE 2.5.0.2 download page and get the FDDI patch.
         3) Add the latest patch to FDDI which should reinstate the
            metric that went missing.

  3.7) Why do I get the error "Fatal: member: txunderruns0 vanished!:
       Near line 255"?

       This problem occurs with Solaris 2.5 and hme interfaces.  There
       are three possible fixes:

         1) Upgrade to a later Solaris release.
         2) Get the hme patch for Solaris 2.5.
         3) As a temporary work-around, change the member "defer" to
            "missing1" in the ks_hme_network structure in
            /opt/RICHPse/include/kstat.se like this:

            #ifdef MINOR_VERSION >= 51
            ulong defer;
            #else
            ulong missing1;
            #endif

            The next build of SE 3.0 will figure this out
            automatically.

  3.8) Why do I get the error "Fatal: member: framming vanished!: Near
       line 160"?

       This problem occurs with Solaris 2.5.1 and le interfaces.  The
       le patch for Solaris 2.5.1 corrects the spelling from framming
       to framing.  SE 3.0 tries to detect this patch, but if you
       don't have the patch directory in /var/sadm/patch it can't tell
       that the patch is installed.  There are three fixes.

         1) Upgrade to Solaris 2.6.
         2) Reinstall the le patch for Solaris 2.5.1.
         3) Create the directory /var/sadm/patch/103903-03 by hand.
         4) As a temporary work-around, run scripts using se
            -DLE_PATCH script.se to force the update or edit
            start_orcallator.se and enable the appropriate SE_PATCHES
            variable.

  3.9) Why do I get the error "Fatal: member: drop vanished!: Near
       line 285"?

       This has been observed on Solaris 2.6 using SE version 3.0.
       Upgrade to the latest version of SE.

 3.10) Why do I get the error "Fatal: member: XXXXX vanished!: Near
       line YYY"?

       This happens when Sun changes some of the kernel names for
       particular variables that the SE package is expecting to find.
       For example, one error was that

       SE has some conditional compile flags that tell it to look for
       the new name.  Try adding these to the command line options for
       SE or editing start_orcallator.sh to enable some of the
       defines.

         1) -DLE_PATCH=1
         2) -DHME_PATCH=1
         3) -DHME_PATCH_IFSPEED=1
         4) -DMULTICAST_PATCH=1
         5) -DROBUST_LISTENQ=1

       If a particular define does not fix the problem, then don't run
       SE with it.

 3.11) What do I do when I get the message "Fatal: subscript: 2 out of
       range for: GLOBAL_net[2]: Near line 178"?

       Try adding the following string "-DMAX_IF=XYZ" where XYZ is
       larger than the total number of interfaces you have on the
       system to se's command line.

       For example, if you had 9 separate interfaces on the system,
       then you could edit start_orcallator like this:

         --- start_orcallator.sh.in.FCS  Sat Oct 28 15:04:33 2000
         +++ start_orcallator.sh.in      Thu May 24 08:55:39 2001
         @@ -31,6 +31,7 @@
          #SE_PATCHES="$SE_PATCHES -DLE_PATCH"
          #SE_PATCHES="$SE_PATCHES -DHME_PATCH"
          #SE_PATCHES="$SE_PATCHES -DHME_PATCH_IFSPEED"
         +SE_PATCHES="$SE_PATCHES -DMAX_IF=10"

          # Check if the SE executable was found upon configure.
          if test -z "$SE"; then

 3.12) Why don't I get plots for my Veritas filesystems?

       Version 1.13 or later of orcallator.se should be able to
       generate filesystem statistics for Veritas filesystems.  The
       first thing to try is to upgrade to the latest SE and
       orcallator.se release.

 3.13) Why don't I get plots for my RSM filesystems?

       I don't think orcallator.se records data from RSM disks.

       However, to make sure, can you look through the first line of
       the output orcallator files and see if the RSM filesystems
       are listed there?  Look for the disk_runp_ string.  If you see
       the filesystems there, then the Orca configuration file needs
       to be updated to plot this data.  In this case, email me the
       names of the filesystems as listed in the output file and the
       Orca configuration file can be updated to plot the data.

       Otherwise, orcallator.se or the underlying SE header files will
       need to be modified to find RSM filesystems.  Since I don't
       have access to a system with RSM on it, this is something
       that somebody else will need to take on.  Hint, hint...

 3.14) Why don't I get any Interface Bits Per Second data for my qe
       board?

       Most likely you are either running Solaris 7 or older or
       Solaris 8 with SE version 3.1 or older.  You must run Solaris 8
       to get qe bits per second data.  Apply the following patches to
       include/kstat.se and include/netif.se:

         *** kstat.se.orig       Fri Feb  9 01:44:14 2001
         -- kstat.se    Fri Feb  9 14:31:46 2001
         ***************
         *** 646,651 ****
         --- 646,661 ----
               ulong_t no_tbufs;
               ulong_t no_rbufs;
               ulong_t rx_late_collisions;
         + #if MINOR_VERSION >= 70
         +     ulong_t  rbytes;
         +     ulong_t  obytes;
         +     ulong_t  multircv;
         +     ulong_t  multixmt;
         +     ulong_t  brdcstrcv;
         +     ulong_t  brdcstxmt;
         +     ulong_t  norcvbuf;
         +     ulong_t  noxmtbuf;
         + #endif

           };

         *** netif.se.orig       Fri Feb  9 01:45:06 2001
         --- netif.se    Fri Feb  9 14:31:10 2001
         ***************
         *** 229,236 ****
         --- 229,241 ----
                 nocanput       = if_qe.nocanput;
                 defer          = if_qe.excess_defer;
                 nocarrier      = if_qe.nocarrier;
         + #if MINOR_VERSION >= 70
         +       ooctets        = if_qe.obytes + if_qe.multixmt + if_qe.brdcstxmt;
         +       ioctets        = if_qe.rbytes + if_qe.multircv + if_qe.brdcstrcv;
         + #else
                 ooctets        = 0;
                 ioctets        = 0;
         + #endif
                 break;
               case NETIF_BF:
                 kstat$bf.number$ = number$ - (if_max[NETIF_QE] + 1);

 3.15) Orcallator.se core dumps.

       Some of the SE include files can cause core dumps if there are
       oddities on the system.  Apply the following patches:

         --- diskinfo.se.orig       Fri Jan 12 11:20:09 2001
         +++ diskinfo.se Tue Feb 27 19:37:28 2001
         @@ -197,7 +197,12 @@
              points_at[n] = '\0';

              // chop off the :a at the end
         -    strcpy(strrchr(points_at, ':'), "");
         +    while (--n >= 0) {
         +      if (points_at[n] == ':') {
         +        points_at[n] = '\0';
         +        break;
         +      }
         +    }

              // hack off ../../devices from the start
              sscanf(points_at, "../../devices%s", &points_at);

         --- mnt_class.se.orig   Fri Jan 12 11:20:09 2001
         +++ mnt_class.se        Tue Feb 27 19:53:34 2001
         @@ -96,7 +96,12 @@
                number$ = -1;
                return;
              }
         -    strcpy(strchr(buf, '\n'), "");
         +    for (i=0; i<sizeof(buf); ++i) {
         +      if (buf[i] == '\n') {
         +        buf[i] = '\0';
         +        break;
         +      }
         +    }
              i = 0;
              for(p=strtok(buf, "\t"); p != nil; p=strtok(nil, "\t")) {
                switch(i) {

 3.16) Why should I keep my compressed percol-* or orcallator-* files?

       There are several reasons to keep the data files:

       1) If the RRD files get screwed up, you'll be able to
          regenerate them.

       2) If Orca ever needs to change the internal format of the RRD
          files, it'll need to regenerate them.

       3) If you ever want to look at older data in the RRD files, the
          older data has less resolution.  For example, if you want to
          look at data, say 6 months old, in the RRD files, it is
          averaged over a whole day.  You won't be able to get the 5
          minute data generated by orcallator.se.

          Here's the resolution of the data in the RRD files (here
          RRA is Round Robin Archive and there can be many in one RRD
          file) as defined in lib/Orca/Constants.pm

          # The first RRA is every 5 minutes for 200 hours, the second
          # is every 30 minutes for 31 days, the third is every 2
          # hours for 100 days, and the last is every day for 3 years.

 3.17) Why do my Orca plots no longer contain any data after I change
       anything related to orcallator, such as the subsystems to
       measure, or when something changes on the system, such as mount
       points, ethernet devices, etc?

       Orca's input data files must contain the exact number of
       columns either specified in the orcallator.cfg with the
       column_description listing the actual column names or the
       column names as specified in the first line of the input data
       file when column_description is set to `first_line'.  If the
       number of columns do not match, then Orca ignores this data to
       protect the RRD files from incorrect data added to them.

       The columns in the Orca's input files commonly happen with
       orcallator.se when the administrator tells orcallator to
       measure a different set of subsystems or when the system itself
       adds or removed mount points, disk drives, ethernet devices,
       etc.

       To work around this problem, upgrade to the latest version of
       Orca, orcallator.se and orcallator.cfg which now create a new
       data file anytime the number of columns changes.  The file
       format of the output files has changed from
       orcallator-YYYY-MM-DD to orcallator-YYYY-MM-DD-XXX where XXX is
       a monotonically increasing number that resets to 0 at the
       beginning of the next day.