perfsonar-user - RE: [perfsonar-user] perfSONAR nodes are down due to Memory utilization issue

Subject: perfSONAR User Q&A and Other Discussion

From: Muhammad Tayyab <>
To: Valentin Vidic <>, "" <>
Subject: RE: [perfsonar-user] perfSONAR nodes are down due to Memory utilization issue
Date: Mon, 13 May 2019 11:50:38 +0000

Hi Valentin,

Please find the attached config, FYI.

Regards,
Tayyab

-----Original Message-----
From: [] On Behalf Of Valentin Vidic
Sent: Sunday, May 12, 2019 5:38 PM
To:
Subject: Re: [perfsonar-user] perfSONAR nodes are down due to Memory utilization issue

On Sun, May 12, 2019 at 02:24:31PM +0000, Muhammad Tayyab wrote:
> This is a Supermicro server and it has physical HD. Please find the
> attached outputs; FYI
>
> How can we normalize the HD utilization?

From what I can see, the run, task and archiving tables are quite big and are
probably causing all this disk IO. You can try cleaning them up to fix the
problem, but it would be good to understand why this happened in the first
place. If you post your mesh config, we can check whether anything there is
causing so many runs.
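To give a feel for how quickly a mesh adds up, here is a rough sketch in
Python. It assumes a full mesh with one task per ordered host pair, each
repeated at a fixed interval; the function name, host counts and interval are
illustrative only and are not taken from any perfSONAR tool or from your
setup.

```python
def mesh_runs_per_day(hosts: int, interval_s: int, exclude_self: bool = True) -> int:
    """Estimate scheduled runs per day for a full mesh.

    Simplified model (an assumption, not how pScheduler counts internally):
    one task per ordered (source, destination) pair, each repeated every
    interval_s seconds.
    """
    if hosts < 0 or interval_s <= 0:
        raise ValueError("hosts must be >= 0 and interval_s must be > 0")
    # Ordered pairs: n*(n-1) when a host does not test against itself.
    pairs = hosts * (hosts - 1) if exclude_self else hosts * hosts
    # Whole repetitions that fit into one day (86400 seconds).
    runs_per_task = 86400 // interval_s
    return pairs * runs_per_task


if __name__ == "__main__":
    # Illustrative mesh sizes with a 4-hour repeat interval.
    for n in (5, 9, 13):
        print(n, "hosts:", mesh_runs_per_day(n, interval_s=14400), "runs/day")
```

The pair count grows with the square of the number of hosts, so a short
repeat interval on a large mesh fills the run table much faster than the
same interval on a small one.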

--
Valentin Vidic
Computer Systems Engineer - Expert
Department of Computer Infrastructure and Services Croatian Academic and
Research Network - CARNET Josipa Marohnica 5, HR-10000 Zagreb, Croatia
tel: +385 1 6661 714, fax. +385 1 6661 635
gsm: +385 91 2480 919
www.CARNet.hr
14:45:31(167):Last login: Thu Apr 25 16:51:32 2019 from pc-kw-16055.kaust.edu.sa
14:45:31(170):Welcome to the perfSONAR Toolkit v4.1.6-1.el7
14:45:31(170):
14:45:31(177):You may create accounts to manage this host through the web
interface by running the following as root:
14:45:31(177):
14:45:31(179): /usr/lib/perfsonar/scripts/nptoolkit-configure.py
14:45:31(179):
14:45:31(181):The web interface should be available at:
14:45:31(181):
14:45:31(183):https://[host address]/toolkit
14:45:46(379):[nocuser@lthpsdashboard ~]$ cat /etc/maddash/maddash-server/maddash.yaml
14:45:46(387):##
14:45:46(391):# Set the directory where the database will be stored
14:45:46(392):database: /var/lib/maddash/
14:45:46(392):
14:45:46(393):##
14:45:46(397):# Set the number of jobs that can run in parallel. Default is
20.
14:45:46(398):#jobThreadPoolSize: 20
14:45:46(399):#
14:45:46(399):##
14:45:46(404):# Set number of jobs that can be in queue at one time. Default
is 250.
14:45:46(405):#jobBatchSize: 250
14:45:46(406):#
14:45:46(406):###
14:45:46(411):# Disable the job scheduler if you only want to run the REST
server
14:45:46(412):#disableScheduler: 0
14:45:46(413):#
14:45:46(413):###
14:45:46(419):# Skips table and index rebuild at start-up. It can speed up
start-up time if set to 1.
14:45:46(421):#skipTableBuild: 0
14:45:46(421):#
14:45:46(422):##
14:45:46(425):# Set the host where the REST server listens
14:45:46(426):serverHost: "localhost"
14:45:46(427):#
14:45:46(427):##
14:45:46(430):# Activate http and set the port where it listens
14:45:46(430):#http:
14:45:46(431):# port: 8881
14:45:46(431):
14:45:46(432):#notifications:
14:45:46(433):# -
14:45:46(434):# ##
14:45:46(436):# #This will be the email subject
14:45:46(437):#name: "My Email Report"
14:45:46(438):# ##
14:45:46(443):# #This indicates this is an email notification,
currently the only supported type
14:45:46(444):#type: "email"
14:45:46(445):# ##
14:45:46(449):# # A cron schedule in 'MIN HOUR DAY-OF-MONTH MONTH DAY-OF-WEEK' format.
14:45:46(454):# # the example runs every hour on the hour. Only sends
email if new problem found
14:45:46(455):#schedule: "0 * * * ?"
14:45:46(456):# ##
14:45:46(461):# #Frequency with which to report the same problem.
Example below will report a problem
14:45:46(464):# # again only if it still exists after 24 hours.
14:45:46(466):#problemReportFrequency: 400
14:45:46(466):# ##
14:45:46(471):# #The minimum severity a problem must be where 0=OK,
1=Warning, 2=Critical
14:45:46(472):#minimumSeverity: 1
14:45:46(473):# ##
14:45:46(474):# # parameters specific to email
14:45:46(475):# parameters:
14:45:46(476):# ##
14:45:46(480):# # The main address of your dashboard used to create
links in email.
14:45:46(483):# dashboardUrl: "http://ps-dashboard.kaust.edu.sa"
14:45:46(484):# ##
14:45:46(489):# # information about the mail server. If not set will
use 127.0.0.1 port 25 with
14:45:46(490):# # no auth or SSL
14:45:46(492):#mailServer:relay.kaust.edu.sa
14:45:46(495):# address: "127.0.0.1, 10.0.0.0/23"
14:45:46(496):# port: 25
14:45:46(497):# ##
14:45:46(499):# # Set from address in mailed reports
14:45:46(501):#from: ""
14:45:46(501):# ##
14:45:46(504):# # Set list of to addresses in mailed reports
14:45:46(505):#to:
14:45:46(507):# - ""
14:45:46(508):# - ""
14:45:46(509):#
14:45:46(509):##
14:45:46(512):# Activate https and set port and keystores
14:45:46(512):# https:443
14:45:46(513):# port: 8882
14:45:46(517):# keystore: "/usr/lib/maddash/maddash-webui/etc/maddash.jks"
14:45:46(518):# keystorePassword: "changeit"
14:45:46(520):# want, require or off
14:45:46(521):# clientAuth: "want"
14:45:46(522):http:
14:45:46(522): port: 8881
14:45:46(523):#
14:45:46(524):##
14:45:46(525):# Email notifications
14:45:46(526):notifications:
14:45:46(528): - name: "KAUST Performance Problems"
14:45:46(529): type: "email"
14:45:46(530): schedule: "0 * * * ?"
14:45:46(532): problemReportFrequency: 600
14:45:46(533): minimumSeverity: 1
14:45:46(534): filters:
14:45:46(535): - type: "category"
14:45:46(537): value: "PERFORMANCE"
14:45:46(537): parameters:
14:45:46(541): dashboardUrl: "http://ps-dashboard.kaust.edu.sa"
14:45:46(543): subjectPrefix: "[MaDDash]"
14:45:46(544): from: ""
14:45:46(545): to:
14:45:46(547): - ""
14:45:46(550): - name: "KAUST Configuration Problems"
14:45:46(551): type: "email"
14:45:46(552): schedule: "0 * * * ?"
14:45:46(553): problemReportFrequency: 700
14:45:46(555): minimumSeverity: 1
14:45:46(555): filters:
14:45:46(557): - type: "category"
14:45:46(558): value: "CONFIGURATION"
14:45:46(559): parameters:
14:45:46(562): dashboardUrl: "http://ps-dashboard.kaust.edu.sa"
14:45:46(564): subjectPrefix: "[MaDDash]"
14:45:46(566): from: ""
14:45:46(566): to:
14:45:46(568): - ""
14:45:46(568):
14:45:46(569):##
14:45:46(574):# 'groups' are where you define lists of hosts. You need to
provide the name of the group
14:45:46(580):# and the list of the hosts in that group. The default example
below defines two groups:
14:45:46(585):# "myOwampHosts" and "myBwctlHosts". You can define any number
of other groups and give them
14:45:46(588):# any alphanumeric name. A host may belong to multiple groups.
14:45:46(589):groups:
14:45:46(591): ResidentialOwampHosts :
14:45:46(593): - "ps-rac1-g1755.kaust.edu.sa"
14:45:46(595): - "ps-isc2-g2263.kaust.edu.sa"
14:45:46(597): - "ps-rac2-g1631.kaust.edu.sa"
14:45:46(599): - "ps-isc1-g1133.kaust.edu.sa"
14:45:46(601): - "ps-ec3-g3622.kaust.edu.sa"
14:45:46(603): - "ps-clc1-g3316.kaust.edu.sa"
14:45:46(605): - "ps-clc2-g3057c.kaust.edu.sa"
14:45:46(607): - "ps-sup-g3943d.kaust.edu.sa"
14:45:46(611): - "ps-irc-i5273.kaust.edu.sa"
14:45:46(614): - "ps-grm-i5060.kaust.edu.sa"
14:45:46(615): - "ps-ytc-aux.kaust.edu.sa"
14:45:46(618): - "ps-pul-h4309b.kaust.edu.sa"
14:45:46(626): - "ps-nsh-608.kaust.edu.sa"
14:45:46(626):
14:45:46(627): ResidentialBwctlHosts :
14:45:46(629): - "ps-rac1-g1755.kaust.edu.sa"
14:45:46(632): - "ps-isc2-g2263.kaust.edu.sa"
14:45:46(634): - "ps-rac2-g1631.kaust.edu.sa"
14:45:46(636): - "ps-isc1-g1133.kaust.edu.sa"
14:45:46(638): - "ps-ec3-g3622.kaust.edu.sa"
14:45:46(640): - "ps-clc1-g3316.kaust.edu.sa"
14:45:46(642): - "ps-clc2-g3057c.kaust.edu.sa"
14:45:46(644): - "ps-sup-g3943d.kaust.edu.sa"
14:45:46(646): - "ps-irc-i5273.kaust.edu.sa"
14:45:46(648): - "ps-grm-i5060.kaust.edu.sa"
14:45:46(649): - "ps-ytc-aux.kaust.edu.sa"
14:45:46(651): - "ps-pul-h4309b.kaust.edu.sa"
14:45:46(653): - "ps-nsh-608.kaust.edu.sa"
14:45:46(653):
14:45:46(655): ResearchOwampHosts :
14:45:46(656): - "ps-b2-ter9.kaust.edu.sa"
14:45:46(658): - "ps-b3-ter10.kaust.edu.sa"
14:45:46(660): - "ps-b5-ter8.kaust.edu.sa"
14:45:46(662): - "ps-b4-ter3.kaust.edu.sa"
14:45:46(664): - "ps-b1-ter6.kaust.edu.sa"
14:45:46(664):
14:45:46(665): ResearchBwctlHosts :
14:45:46(667): - "ps-b2-ter9.kaust.edu.sa"
14:45:46(669): - "ps-b3-ter10.kaust.edu.sa"
14:45:46(671): - "ps-b5-ter8.kaust.edu.sa"
14:45:46(673): - "ps-b4-ter3.kaust.edu.sa"
14:45:46(675): - "ps-b1-ter6.kaust.edu.sa"
14:45:46(675):
14:45:46(676): nonResearchOwampHosts :
14:45:46(678): - "ps-bdc-ter11.kaust.edu.sa"
14:45:46(680): - "ps-sr2-ter9.kaust.edu.sa"
14:45:46(682): - "ps-uab-ter27.kaust.edu.sa"
14:45:46(684): - "ps-esb-ter9.kaust.edu.sa"
14:45:46(685): - "ps-unc-ter10.kaust.edu.sa"
14:45:46(687): - "ps-com-ter15.kaust.edu.sa"
14:45:46(689): - "ps-shq-ter9.kaust.edu.sa"
14:45:46(692): - "ps-adb-ter10.kaust.edu.sa"
14:45:46(694): - "ps-shq-ter15.kaust.edu.sa"
14:45:46(694):
14:45:46(695): nonResearchBwctlHosts :
14:45:46(697): - "ps-bdc-ter11.kaust.edu.sa"
14:45:46(698): - "ps-sr2-ter9.kaust.edu.sa"
14:45:46(700): - "ps-uab-ter27.kaust.edu.sa"
14:45:46(702): - "ps-esb-ter9.kaust.edu.sa"
14:45:46(704): - "ps-unc-ter10.kaust.edu.sa"
14:45:46(706): - "ps-com-ter15.kaust.edu.sa"
14:45:46(708): - "ps-shq-ter9.kaust.edu.sa"
14:45:46(710): - "ps-adb-ter10.kaust.edu.sa"
14:45:46(712): - "ps-shq-ter15.kaust.edu.sa"
14:45:46(712):
14:45:46(733):##
14:45:46(738):# 'groupMembers' allow you to assign special properties to
items in 'groups':
14:45:46(744):# - Set the "id" to the value used in the "group" list.
This is required.
14:45:46(747):# - Set the "label" to the name you want displayed. This
block
14:45:46(752):# is useful in cases where you want a value such as
an IP address or port
14:45:46(757):# number passed to a check but on the GUI you want a
human-readable value displayed
14:45:46(761):# - Set the "pstoolkiturl" to the URL of a perfSONAR
Toolkit web page
14:45:46(767):# - Map to create special mapped template variables
dependent on the opposing row or column
14:45:46(773):# - Any custom value accessible in template variables
%row.<prop> and %col.<prop> respectively
14:45:46(774):groupMembers:
14:45:46(779):# - added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:46(781): - id: "ps-rac1-g1755.kaust.edu.sa"
14:45:46(782): label: "Racquet Club 01"
14:45:46(784): - id: "ps-isc2-g2263.kaust.edu.sa"
14:45:46(787): label: "International School Center 02"
14:45:46(789): - id: "ps-rac2-g1631.kaust.edu.sa"
14:45:46(790): label: "Racquet Club 02"
14:45:46(792): - id: "ps-isc1-g1133.kaust.edu.sa"
14:45:46(795): label: "International School Center 01"
14:45:46(797): - id: "ps-ec3-g3622.kaust.edu.sa"
14:45:46(798): label: "Early Childhood Center 03"
14:45:46(800): - id: "ps-clc1-g3316.kaust.edu.sa"
14:45:46(802): label: "Clinic 01"
14:45:46(804): - id: "ps-clc2-g3057c.kaust.edu.sa"
14:45:46(805): label: "Clinic 02"
14:45:46(807): - id: "ps-sup-g3943d.kaust.edu.sa"
14:45:46(809): label: "Tamimi Super Market"
14:45:46(811): - id: "ps-irc-i5273.kaust.edu.sa"
14:45:46(813): label: "Island Recreation Center"
14:45:46(815): - id: "ps-grm-i5060.kaust.edu.sa"
14:45:46(816): label: "Grand Mosque"
14:45:46(818): - id: "ps-ytc-aux.kaust.edu.sa"
14:45:46(820): label: "Yacht Club"
14:45:46(822): - id: "ps-pul-h4309b.kaust.edu.sa"
14:45:46(824): label: "Public Library"
14:45:46(826): - id: "ps-nsh-608.kaust.edu.sa"
14:45:46(827): label: "Oasis"
14:45:46(827):
14:45:46(831):# added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:46(833):# label: "LBL Toolkit"
14:45:46(836):# pstoolkiturl: "http://nettest.lbl.gov/toolkit"
14:45:46(844): # The special map property can be used to create template variables that change depending on opposing row or column
14:45:46(853): # For example we could use %row.map.ip in a command to tell nettest.lbl.gov to use ip 10.0.1.1 whenever the column is
14:45:46(878): # albq-pt1.es.net and 131.243.24.11 for anything else (default is a special key). The parameter name is freeform and "ip"
14:45:46(879): # is just an example
14:45:46(880): #map:
14:45:46(882): #"albq-pt1.es.net":
14:45:46(883): # ip: "10.0.1.1"
14:45:46(884): # "default":
14:45:46(886): # ip: "131.243.24.11"
14:45:46(886):##
14:45:46(886):
14:45:46(887):##
14:45:46(892):# 'checks' are where you define a template for a check to
execute. You'll provide a
14:45:46(893):# command to run,
14:45:46(893):checks:
14:45:46(898): # Below defines a check that alarms against the loss
between the row and column host.
14:45:46(903): # It looks at data for the last 30 minutes and runs every
30 minutes. It will go
14:45:46(906): # critical if there is any loss. There is no warning level.
14:45:46(908): owampLossCheck :
14:45:46(912): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:46(914): #A descriptive name of the check
14:45:46(916): name: "Loss"
14:45:46(917): #A description of the check
14:45:46(921): description: "Loss from %row to %col (according to
%row MA)"
14:45:46(923): #Example using mapped variables
14:45:46(928): #description: "Loss from %row.map.ip to %col.map.ip
(according to %row MA)"
14:45:46(933): #The type of check. Other valid values are
net.es.maddash.checks.NagiosCheck and
14:45:46(935): # net.es.maddash.checks.RandomCheck.
14:45:46(938): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:46(938): params:
14:45:46(945): #The URL of the measurement archive. You can
define templates on a per host
14:45:46(950): #basis here. If not defined explicitly the
'default' template will be used.
14:45:46(951): maUrl:
14:45:46(954): default: "http://%row/esmond/perfsonar/archive"
14:45:46(960): # The section below sets a different maURL for every column in the row albu-owamp.es.net
14:45:46(962): # albu-owamp.es.net:
14:45:46(965): # default: "http://%col/esmond/perfsonar/archive"
14:45:46(970): # The section below sets a different maURL for every column in the row
14:45:46(973): # bois-owamp.es.net EXCEPT the column bost-owamp.es.net
14:45:46(976): # bois-owamp.es.net:
14:45:46(979): # default: "http://%col/esmond/perfsonar/archive"
14:45:46(983): # bost-owamp.es.net: "http://%row/esmond/perfsonar/archive"
14:45:46(989): # The section below sets a different maURL for the check at row bost-owamp.es.net
14:45:46(991): # and column albu-owamp.es.net
14:45:46(993): # bost-owamp.es.net:
14:45:46(998): # albu-owamp.es.net: "http://perfsonar-archive.es.net/esmond/perfsonar/archive"
14:45:46(999): #
14:45:47(002): #DEPRECATED: It looks up all the 'keys' the
14:45:47(006): #node associates with a particular
source/destination regardless of
14:45:47(011): #whether IP or hostname is used. In general you
should just be able to change
14:45:47(018): #the hostname in the URL (example.mydomain.local),
to the name of your toolkit host
14:45:47(022): #and it will work. You can leave the rest of the
URL untouched.
14:45:47(037): #metaDataKeyLookup: "http://example.mydomain.local/perfsonar-graphs/metaKeyReq.cgi?ma_url=%maUrl&eventType=%event.delayBuckets&srcRaw=%row&dstRaw=%col&count=0&bucket_width=0"
14:45:47(042): #This is the URL to the graph script. You should
be able to change the
14:45:47(048): #hostname(example.mydomain.local) to your toolkit
hostname and leave the rest of the
14:45:47(049): #URL untouched.
14:45:47(056): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%col&source=%row"
14:45:47(061): #The Nagios command to execute. The -w and -c options define the thresholds.
14:45:47(065): #The -r option specifies the time range to query.
14:45:47(072): command: "/usr/lib64/nagios/plugins/check_owdelay.pl -u %maUrl -w .01 -c .1 -r 1800 -l -p -s %row -d %col"
14:45:47(075): #How often to run the check (in seconds)
14:45:47(076): checkInterval: 900
14:45:47(081): #How often to run the check if it detects a state
different than the previous
14:45:47(086): #state. For example, if a check has been OK for 3
days, but suddenly a critical
14:45:47(091): #is seen, it will try again in this number of seconds
rather than waiting the full
14:45:47(092): #interval
14:45:47(094): retryInterval: 300
14:45:47(099): #The number of consecutive times a new state must be
seen before it changes the
14:45:47(104): #color in a grid. For example, if a check has been OK
for 3 days, but suddenly a
14:45:47(109): #critical is seen, It must be seen 2 more times before
the color will change
14:45:47(110): retryAttempts: 3
14:45:47(114): #The maximum number of seconds a command will be
allowed to run
14:45:47(115): timeout: 60
14:45:47(115):
14:45:47(120): # Below defines a check that alarms against the loss
between the column and row host.
14:45:47(126): # It just swaps the source and destination of the other
OWAMP check to get data for
14:45:47(132): # the reverse direction. The parameters have the same
meaning as the previous example
14:45:47(134): owampLossRevCheck :
14:45:47(139): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:47(141): name: "Loss Reverse"
14:45:47(145): description: "Loss from %col to %row (according to
%row MA)"
14:45:47(147): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(148): params:
14:45:47(149): maUrl:
14:45:47(152): default: "http://%row/esmond/perfsonar/archive"
14:45:47(159): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%row&source=%col"
14:45:47(167): command: "/usr/lib64/nagios/plugins/check_owdelay.pl -u %maUrl -w .01 -c .1 -r 900 -l -p -s %col -d %row"
14:45:47(168): checkInterval: 900
14:45:47(169): retryInterval: 300
14:45:47(171): retryAttempts: 3
14:45:47(172): timeout: 60
14:45:47(173):
14:45:47(178): # Below defines a check that alarms on throughput reported
by BWCTL from the row host
14:45:47(183): # to the column host. It runs every 8 hours. It alarms on
average throughput for the
14:45:47(188): # last 24 hours. It will go to the warning (yellow) level
if throughput drops below
14:45:47(193): # 100Mbps. It goes to critical if it drops below 10Mbps.
Adjust the 'command'
14:45:47(198): # parameter's -w property to change the warning level and
-c parameter to change the
14:45:47(201): # critical level. All units are in Gbps (e.g. .1 =
100Mbps).
14:45:47(202): bwctlCheck :
14:45:47(208): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:47(210): name: "Throughput"
14:45:47(214): description: "Throughput from %row to %col (according
to %row MA)"
14:45:47(216): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(217): params:
14:45:47(218): maUrl:
14:45:47(221): default: "http://%row/esmond/perfsonar/archive"
14:45:47(228): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%col&source=%row"
14:45:47(235): #Adjust the -w and -c values to adjust the thresholds. The thresholds are specified in Gbps.
14:45:47(243): command: "/usr/lib64/nagios/plugins/check_throughput.pl -u %maUrl -w .08: -c .05: -r 84400 -s %row -d %col"
14:45:47(244): checkInterval: 14400
14:45:47(245): retryInterval: 600
14:45:47(247): retryAttempts: 3
14:45:47(248): timeout: 60
14:45:47(248):
14:45:47(253): # Same as the BWCTL check above but tests in the reverse
direction (from the column
14:45:47(255): # host to the row host).
14:45:47(256): bwctlRevCheck :
14:45:47(260): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:47(262): name: "Throughput Reverse"
14:45:47(266): description: "Throughput from %col to %row (according
to %row MA)"
14:45:47(269): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(270): params:
14:45:47(271): maUrl:
14:45:47(274): default: "http://%row/esmond/perfsonar/archive"
14:45:47(281): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%row&source=%col"
14:45:47(287): #Adjust the -w and -c values to adjust the thresholds. The thresholds are specified in Gbps.
14:45:47(295): command: "/usr/lib64/nagios/plugins/check_throughput.pl -u %maUrl -w .08: -c .05: -r 86400 -s %col -d %row"
14:45:47(297): checkInterval: 14400
14:45:47(298): retryInterval: 600
14:45:47(299): retryAttempts: 3
14:45:47(300): timeout: 60
14:45:47(301):#
14:45:47(302):# KAUST Campus Research Node Tests
14:45:47(303):#
14:45:47(303):
14:45:47(305): Research-owampLossCheck :
14:45:47(310): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:47(311): name: "Loss"
14:45:47(314): description: "Loss from %row to %col (according to
%row MA)"
14:45:47(317): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(318): params:
14:45:47(319): maUrl:
14:45:47(322): default: "http://%row/esmond/perfsonar/archive"
14:45:47(322):
14:45:47(329): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%col&source=%row"
14:45:47(337): command: "/usr/lib64/nagios/plugins/check_owdelay.pl -u %maUrl -w 0 -c .05 -r 1800 -l -p -s %row -d %col"
14:45:47(338): checkInterval: 900
14:45:47(340): retryInterval: 300
14:45:47(341): retryAttempts: 3
14:45:47(342): timeout: 60
14:45:47(343):
14:45:47(345): Research-owampLossRevCheck :
14:45:47(349): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:47(350): name: "Loss Reverse"
14:45:47(354): description: "Loss from %col to %row (according to
%row MA)"
14:45:47(357): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(357): params:
14:45:47(359): maUrl:
14:45:47(362): default: "http://%row/esmond/perfsonar/archive"
14:45:47(368): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%row&source=%col"
14:45:47(376): command: "/usr/lib64/nagios/plugins/check_owdelay.pl -u %maUrl -w 0 -c .05 -r 900 -l -p -s %col -d %row"
14:45:47(377): checkInterval: 900
14:45:47(379): retryInterval: 300
14:45:47(380): retryAttempts: 3
14:45:47(381): timeout: 60
14:45:47(381):
14:45:47(383): Research-bwctlCheck :
14:45:47(387): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:47(390): name: "Throughput"
14:45:47(394): description: "Throughput from %row to %col (according
to %row MA)"
14:45:47(397): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(398): params:
14:45:47(399): maUrl:
14:45:47(402): default: "http://%row/esmond/perfsonar/archive"
14:45:47(409): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%col&source=%row"
14:45:47(415): #Adjust the -w and -c values to adjust the thresholds. The thresholds are specified in Gbps.
14:45:47(423): command: "/usr/lib64/nagios/plugins/check_throughput.pl -u %maUrl -w .9: -c .8: -r 84400 -s %row -d %col"
14:45:47(424): checkInterval: 14400
14:45:47(426): retryInterval: 600
14:45:47(427): retryAttempts: 3
14:45:47(428): timeout: 60
14:45:47(428):
14:45:47(430): Research-bwctlRevCheck :
14:45:47(434): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:47(436): name: "Throughput Reverse"
14:45:47(440): description: "Throughput from %col to %row (according
to %row MA)"
14:45:47(443): type: "net.es.maddash.checks.PSNagiosCheck"
14:45:47(444): params:
14:45:47(445): maUrl:
14:45:47(448): default: "http://%row/esmond/perfsonar/archive"
14:45:47(455): graphUrl: "https://ps-dashboard.kaust.edu.sa/perfsonar-graphs/?url=%maUrl&dest=%row&source=%col"
14:45:47(461): #Adjust the -w and -c values to adjust the thresholds. The thresholds are specified in Gbps.
14:45:47(469): command: "/usr/lib64/nagios/plugins/check_throughput.pl -u %maUrl -w .9: -c .8: -r 86400 -s %col -d %row"
14:45:47(470): checkInterval: 14400
14:45:47(472): retryInterval: 600
14:45:47(473): retryAttempts: 3
14:45:47(474): timeout: 60
14:45:47(474):
14:45:47(479):#'grids' are where you define the two dimensional tables that
get displayed. You specify
14:45:47(485):# which host 'groups' will compose the rows and columns and
the 'checks' you want run. You also
14:45:47(488):# provide some descriptive information that will be displayed
14:45:47(489):grids:
14:45:47(494): #The first item in the list is a grid for the loss
checks between the OWAMP hosts
14:45:47(500): # defined. The first property "name" defines the title
that will display on the dashboard.
14:45:47(502): - name: "KAUST Residential - Reliability"
14:45:47(506): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:47(512): # Define the hosts that will be listed down the left
side of the grid. This MUST be a
14:45:47(516): # value defined in the 'groups' section of this config
file.
14:45:47(517): rows: "ResidentialOwampHosts"
14:45:47(523): # Define the hosts that will be listed across the top of
the grid. This MUST be a
14:45:47(528): # value defined in the 'groups' section of this config
file. It can be the same or
14:45:47(531): # a different group from the one defined in 'rows'.
14:45:47(533): columns: "ResidentialOwampHosts"
14:45:47(538): #Define the checks that will be shown in each cell in
the grid. You may define up
14:45:47(543): # to 2. If there are more than one the top half of the
cell will be the first check
14:45:47(549): # listed and the bottom half will be the second. This is
often useful for showing
14:45:47(551): # results in the different directions.
14:45:47(552): checks:
14:45:47(553): - "owampLossCheck"
14:45:47(554): - "owampLossRevCheck"
14:45:47(559): #Specify the order you want hosts assigned to the 'rows'
attribute to be displayed.
14:45:47(561): # Valid values are as follows:
14:45:47(566): # alphabetical - automatically sorts members of the
group alphabetically
14:45:47(572): # group - displays the members of the group exactly in
the order they are defined
14:45:47(573): rowOrder: "alphabetical"
14:45:47(578): #Specify the order you want hosts assigned to the 'cols'
attribute to be displayed.
14:45:47(580): # Valid values are as follows:
14:45:47(585): # alphabetical - automatically sorts members of the
group alphabetically
14:45:47(590): # group - displays the members of the group exactly in
the order they are defined
14:45:47(592): colOrder: "alphabetical"
14:45:47(597): # Set to 1 if you don't want results to be checked from a host to itself. Set to 0
14:45:47(598): # otherwise.
14:45:47(599): excludeSelf: 1
14:45:47(604): # The section below allows you to exclude individual
checks. The structure is a map
14:45:47(610): # where the key is the name of the row where you want to
exclude a check. It should match
14:45:47(616): # a member of the group assigned to the "rows" property
of this grid or it can be the special
14:45:47(622): # key 'default' that matches every row. The value is a
list of columns that should not appear
14:45:47(629): # in the grid. An item in the list must be a member of
the group assigned to the "columns"
14:45:47(636): # property of this grid or the special value "all" which
removes all columns for a row. The example
14:45:47(638): # below does the following:
14:45:47(641): # * Excludes the column albu-pt1.es.net from every
row
14:45:47(646): # * Excludes the column chic-pt1.es.net from the row
bois-owamp.es.net
14:45:47(649): # * Excludes every column in the row
bost-owamp.es.net
14:45:47(650): # excludeChecks:
14:45:47(651): # default:
14:45:47(653): # - "albu-pt1.es.net"
14:45:47(654): # bois-owamp.es.net:
14:45:47(656): # - "chic-pt1.es.net"
14:45:47(657): # bost-owamp.es.net:
14:45:47(658): # - "all"
14:45:47(659): #
14:45:47(663): # Determines which checks will be run. Valid values are
as follows:
14:45:47(668): # all - Run a check between every row and column. This
will be the most common.
14:45:47(674): # afterSelf - Run a check to every host that's defined
after the current row in the 'rows' group
14:45:47(681): # beforeSelf - Run a check to every host that's defined
before the current row in the 'rows' group
14:45:47(682): columnAlgorithm: "all"
14:45:47(687): # Reference a report (see further down in this file) to apply to this grid
14:45:47(689): report: "loss_mesh_report"
14:45:47(693): # 'statusLabels' is where you provide a human-readable
description of what each
14:45:47(698): # threshold means. These are the values that will be
displayed in the legend. If a
14:45:47(703): # threshold is not defined below then that will not be
displayed in the legend.
14:45:47(705): # Valid values are as follows:
14:45:47(708): # ok: corresponds to the green status.
14:45:47(710): # warning: the yellow status.
14:45:47(711): # critical: the red status.
14:45:47(713): # unknown: the orange status
14:45:47(720): # notrun: the gray status. means the check has not
run yet. should only happen first time check deployed.
14:45:47(721): statusLabels:
14:45:47(723): ok: "Life is Good"
14:45:47(724): critical: "Critical"
14:45:47(726): warning: "Warning"
14:45:47(728): unknown: "Unable to find test data"
14:45:47(730): notrun: "Check has not run yet"
14:45:47(734): #note: 'warning' not defined because no warning
threshold set
14:45:47(739): #Below is a second grid that defines the throughput checks
to the BWCTL hosts defined.
14:45:47(745): # The parameters have the same meaning as the OWAMP check
above so they are not repeated here.
14:45:47(747): - name: "KAUST Residential - Speed"
14:45:47(752): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:47(753): rows: "ResidentialBwctlHosts"
14:45:47(755): columns: "ResidentialBwctlHosts"
14:45:47(756): checks:
14:45:47(757): - "bwctlCheck"
14:45:47(759): - "bwctlRevCheck"
14:45:47(760): rowOrder: "alphabetical"
14:45:47(761): colOrder: "alphabetical"
14:45:47(763): excludeSelf: 1
14:45:47(764): columnAlgorithm: "all"
14:45:47(766): report: "throughput_mesh_report"
14:45:47(767): statusLabels:
14:45:47(768): ok: "All is Well"
14:45:47(770): warning: "Warning "
14:45:47(772): critical: "Critical"
14:45:47(774): unknown: "Unable to find test data"
14:45:47(776): notrun: "Check has not run yet"
14:45:47(785): #we can define additional states here. A common one is the "scheduled event state" supported by the admin UI.
14:45:47(786): extra:
14:45:47(787): - value: 5
14:45:47(789): shortName: EVENT
14:45:47(791): description: "Down for maintenance"
14:45:47(791):
14:45:47(793):# Research grids
14:45:47(798): #The first item in the list is a grid for the loss
checks between the OWAMP hosts
14:45:47(803): # defined. The first property "name" defines the title
that will display on the dashboard.
14:45:47(805): - name: "KAUST Research - Reliability"
14:45:47(810): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:47(815): # Define the hosts that will be listed down the left
side of the grid. This MUST be a
14:45:47(820): # value defined in the 'groups' section of this config
file.
14:45:47(822): rows: "ResearchOwampHosts"
14:45:47(827): # Define the hosts that will be listed across the top of
the grid. This MUST be a
14:45:47(832): # value defined in the 'groups' section of this config
file. It can be the same or
14:45:47(835): # a different group from the one defined in 'rows'.
14:45:47(837): columns: "ResearchOwampHosts"
14:45:47(842): #Define the checks that will be shown in each cell in
the grid. You may define up
14:45:47(847): # to 2. If there are more than one the top half of the
cell will be the first check
14:45:47(852): # listed and the bottom half will be the second. This is
often useful for showing
14:45:47(854): # results in the different directions.
14:45:47(855): checks:
14:45:47(857): - "Research-owampLossCheck"
14:45:47(859): - "Research-owampLossRevCheck"
14:45:47(864): #Specify the order you want hosts assigned to the 'rows'
attribute to be displayed.
14:45:47(866): # Valid values are as follows:
14:45:47(870): # alphabetical - automatically sorts members of the
group alphabetically
14:45:47(876): # group - displays the members of the group exactly in
the order they are defined
14:45:47(877): rowOrder: "alphabetical"
14:45:47(882): #Specify the order you want hosts assigned to the 'cols'
attribute to be displayed.
14:45:47(884): # Valid values are as follows:
14:45:47(889): # alphabetical - automatically sorts members of the
group alphabetically
14:45:47(894): # group - displays the members of the group exactly in
the order they are defined
14:45:47(896): colOrder: "alphabetical"
14:45:47(901): # Set to 1 if you don't want results to be checked from a
host to itself. Set to 0
14:45:47(902): # otherwise.
14:45:47(903): excludeSelf: 1
14:45:47(908): # The section below allows you to exclude individual
checks. The structure is a map
14:45:47(913): # where the key is the name of the row where you want to
exclude a check. It should match
14:45:47(919): # a member of the group assigned to the "rows" property
of this grid or it can be the special
14:45:47(926): # key 'default' that matches every row. The value is a
list of columns that should not appear
14:45:47(932): # in the grid. An item in the list must be a member of
the group assigned to the "columns"
14:45:47(938): # property of this grid or the special value "all" which
removes all columns for a row. The example
14:45:47(940): # below does the following:
14:45:47(943): # * Excludes the column albu-pt1.es.net from every
row
14:45:47(948): # * Excludes the column chic-pt1.es.net from the row
bois-owamp.es.net
14:45:47(951): # * Excludes every column in the row
bost-owamp.es.net
14:45:47(952): # excludeChecks:
14:45:47(953): # default:
14:45:47(954): # - "albu-pt1.es.net"
14:45:47(956): # bois-owamp.es.net:
14:45:47(957): # - "chic-pt1.es.net"
14:45:47(959): # bost-owamp.es.net:
14:45:47(960): # - "all"
14:45:47(961): #
14:45:47(964): # Determines which checks will be run. Valid values are
as follows:
14:45:47(969): # all - Run a check between every row and column. This
will be the most common.
14:45:47(976): # afterSelf - Run a check to every host that's defined
after the current row in the 'rows' group
14:45:47(982): # beforeSelf - Run a check to every host that's defined
before the current row in the 'rows' group
14:45:47(984): columnAlgorithm: "all"
14:45:47(989): # Reference a report (see further down in this file)
to apply to this grid
14:45:47(990): report: "loss_mesh_report"
14:45:47(995): # 'statusLabels' is where you provide a human-readable
description of what each
14:45:48(000): # threshold means. These are the values that will be
displayed in the legend. If a
14:45:48(005): # threshold is not defined below then that will not be
displayed in the legend.
14:45:48(007): # Valid values are as follows:
14:45:48(010): # ok: corresponds to the green status.
14:45:48(012): # warning: the yellow status.
14:45:48(013): # critical: the red status.
14:45:48(015): # unknown: the orange status
14:45:48(023): # notrun: the gray status. Means the check has not
run yet. Should only happen the first time a check is deployed.
14:45:48(024): statusLabels:
14:45:48(026): ok: "Reliability = All Good"
14:45:48(028): critical: "Reliability = Critical"
14:45:48(030): warning: "Reliability = Warning"
14:45:48(032): unknown: "Unable to find test data"
14:45:48(034): notrun: "Check has not run yet"
14:45:48(038): #note: 'warning' not defined because no warning
threshold set
14:45:48(044): #Below is a second grid that defines the throughput checks
to the BWCTL hosts defined.
14:45:48(050): # The parameters have the same meaning as the OWAMP check
above so they are not repeated here.
14:45:48(053): - name: "KAUST Research - Speed"
14:45:48(057): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:48(059): rows: "ResearchBwctlHosts"
14:45:48(061): columns: "ResearchBwctlHosts"
14:45:48(062): checks:
14:45:48(063): - "Research-bwctlCheck"
14:45:48(065): - "Research-bwctlRevCheck"
14:45:48(066): rowOrder: "alphabetical"
14:45:48(068): colOrder: "alphabetical"
14:45:48(069): excludeSelf: 1
14:45:48(070): columnAlgorithm: "all"
14:45:48(072): report: "throughput_mesh_report"
14:45:48(073): statusLabels:
14:45:48(075): ok: "Speed = All Good"
14:45:48(077): warning: "Speed = Warning "
14:45:48(079): critical: "Speed - Critical"
14:45:48(081): unknown: "Unable to find test data"
14:45:48(083): notrun: "Check has not run yet"
14:45:48(091): #we can define additional states here. A common
one is the "scheduled event state" supported by the admin UI.
14:45:48(093): extra:
14:45:48(094): - value: 5
14:45:48(096): shortName: EVENT
14:45:48(098): description: "Down for maintenance"
14:45:48(098):
14:45:48(098):
14:45:48(099):# Non-Research grids
14:45:48(104): #The first item in the list is a grid for the loss
checks between the OWAMP hosts
14:45:48(111): # defined. The first property "name" defines the title
that will display on the dashboard.
14:45:48(113): - name: "KAUST Non-Research - Reliability"
14:45:48(118): added_by_psconfig: 3 # remove this if you want to keep
after psconfig runs
14:45:48(123): # Define the hosts that will be listed down the left
side of the grid. This MUST be a
14:45:48(127): # value defined in the 'groups' section of this config
file.
14:45:48(128): rows: "nonResearchOwampHosts"
14:45:48(134): # Define the hosts that will be listed across the top of
the grid. This MUST be a
14:45:48(139): # value defined in the 'groups' section of this config
file. It can be the same or
14:45:48(142): # a different group from the one defined in 'rows'.
14:45:48(144): columns: "nonResearchOwampHosts"
14:45:48(149): #Define the checks that will be shown in each cell in
the grid. You may define up
14:45:48(154): # to 2. If there is more than one, the top half of the
cell will be the first check
14:45:48(159): # listed and the bottom half will be the second. This is
often useful for showing
14:45:48(161): # results in the different directions.
14:45:48(162): checks:
14:45:48(164): - "owampLossCheck"
14:45:48(165): - "owampLossRevCheck"
14:45:48(170): #Specify the order you want hosts assigned to the 'rows'
attribute to be displayed.
14:45:48(172): # Valid values are as follows:
14:45:48(177): # alphabetical - automatically sorts members of the
group alphabetically
14:45:48(182): # group - displays the members of the group exactly in
the order they are defined
14:45:48(184): rowOrder: "alphabetical"
14:45:48(189): #Specify the order you want hosts assigned to the 'columns'
attribute to be displayed.
14:45:48(191): # Valid values are as follows:
14:45:48(195): # alphabetical - automatically sorts members of the
group alphabetically
14:45:48(201): # group - displays the members of the group exactly in
the order they are defined
14:45:48(202): colOrder: "alphabetical"
14:45:48(207): # Set to 1 if you don't want results to be checked from a
host to itself. Set to 0
14:45:48(210): # otherwise.
14:45:48(211): excludeSelf: 1
14:45:48(216): # The section below allows you to exclude individual
checks. The structure is a map
14:45:48(222): # where the key is the name of the row where you want to
exclude a check. It should match
14:45:48(228): # a member of the group assigned to the "rows" property
of this grid or it can be the special
14:45:48(234): # key 'default' that matches every row. The value is a
list of columns that should not appear
14:45:48(240): # in the grid. An item in the list must be a member of
the group assigned to the "columns"
14:45:48(246): # property of this grid or the special value "all" which
removes all columns for a row. The example
14:45:48(248): # below does the following:
14:45:48(251): # * Excludes the column albu-pt1.es.net from every
row
14:45:48(256): # * Excludes the column chic-pt1.es.net from the row
bois-owamp.es.net
14:45:48(260): # * Excludes every column in the row
bost-owamp.es.net
14:45:48(261): # excludeChecks:
14:45:48(262): # default:
14:45:48(265): # - "albu-pt1.es.net"
14:45:48(266): # bois-owamp.es.net:
14:45:48(268): # - "chic-pt1.es.net"
14:45:48(269): # bost-owamp.es.net:
14:45:48(270): # - "all"
14:45:48(271): #
14:45:48(275): # Determines which checks will be run. Valid values are
as follows:
14:45:48(280): # all - Run a check between every row and column. This
will be the most common.
14:45:48(286): # afterSelf - Run a check to every host that's defined
after the current row in the 'rows' group
14:45:48(293): # beforeSelf - Run a check to every host that's defined
before the current row in the 'rows' group
14:45:48(294): columnAlgorithm: "all"
14:45:48(299): # Reference a report (see further down in this file)
to apply to this grid
14:45:48(301): report: "loss_mesh_report"
14:45:48(305): # 'statusLabels' is where you provide a human-readable
description of what each
14:45:48(311): # threshold means. These are the values that will be
displayed in the legend. If a
14:45:48(316): # threshold is not defined below then that will not be
displayed in the legend.
14:45:48(318): # Valid values are as follows:
14:45:48(320): # ok: corresponds to the green status.
14:45:48(322): # warning: the yellow status.
14:45:48(324): # critical: the red status.
14:45:48(326): # unknown: the orange status
14:45:48(333): # notrun: the gray status. Means the check has not
run yet. Should only happen the first time a check is deployed.
14:45:48(334): statusLabels:
14:45:48(336): ok: "Reliability = All Good"
14:45:48(338): critical: "Reliability = Critical"
14:45:48(340): warning: "Reliability = Warning"
14:45:48(343): unknown: "Unable to find test data"
14:45:48(345): notrun: "Check has not run yet"
14:45:48(349): #note: 'warning' not defined because no warning
threshold set
14:45:48(354): #Below is a second grid that defines the throughput checks
to the BWCTL hosts defined.
14:45:48(360): # The parameters have the same meaning as the OWAMP check
above so they are not repeated here.
14:45:48(363): - name: "KAUST Non-Research - Speed"
14:45:48(367): added_by_psconfig: 3 # remove this if you want to keep
after psconfig runs
14:45:48(369): rows: "nonResearchBwctlHosts"
14:45:48(370): columns: "nonResearchBwctlHosts"
14:45:48(371): checks:
14:45:48(372): - "bwctlCheck"
14:45:48(374): - "bwctlRevCheck"
14:45:48(375): rowOrder: "alphabetical"
14:45:48(377): colOrder: "alphabetical"
14:45:48(378): excludeSelf: 1
14:45:48(379): columnAlgorithm: "all"
14:45:48(381): report: "throughput_mesh_report"
14:45:48(382): statusLabels:
14:45:48(384): ok: "Speed = All Good"
14:45:48(386): warning: "Speed = Warning "
14:45:48(388): critical: "Speed - Critical"
14:45:48(391): unknown: "Unable to find test data"
14:45:48(393): notrun: "Check has not run yet"
14:45:48(401): #we can define additional states here. A common
one is the "scheduled event state" supported by the admin UI.
14:45:48(402): extra:
14:45:48(403): - value: 5
14:45:48(405): shortName: EVENT
14:45:48(408): description: "Down for maintenance"
14:45:48(408):
14:45:48(408):
14:45:48(408):
14:45:48(408):#DMZ Grids
14:45:48(408):
14:45:48(414): #The first item in the list is a grid for the loss
checks between the OWAMP hosts
14:45:48(420): # defined. The first property "name" defines the title
that will display on the dashboard.
14:45:48(422): - name: "KAUST ScienceDMZ - OWAMP Latency"
14:45:48(427): added_by_psconfig: 6 # remove this if you want to keep
after psconfig runs
14:45:48(432): # Define the hosts that will be listed down the left
side of the grid. This MUST be a
14:45:48(436): # value defined in the 'groups' section of this config
file.
14:45:48(437): rows: "DMZOwampHosts"
14:45:48(443): # Define the hosts that will be listed across the top of
the grid. This MUST be a
14:45:48(448): # value defined in the 'groups' section of this config
file. It can be the same or
14:45:48(451): # a different group from the one defined in 'rows'.
14:45:48(454): columns: "DMZOwampHosts"
14:45:48(459): #Define the checks that will be shown in each cell in
the grid. You may define up
14:45:48(464): # to 2. If there is more than one, the top half of the
cell will be the first check
14:45:48(469): # listed and the bottom half will be the second. This is
often useful for showing
14:45:48(471): # results in the different directions.
14:45:48(472): checks:
14:45:48(474): - "owampLossCheck"
14:45:48(475): - "owampLossRevCheck"
14:45:48(480): #Specify the order you want hosts assigned to the 'rows'
attribute to be displayed.
14:45:48(482): # Valid values are as follows:
14:45:48(487): # alphabetical - automatically sorts members of the
group alphabetically
14:45:48(492): # group - displays the members of the group exactly in
the order they are defined
14:45:48(494): rowOrder: "alphabetical"
14:45:48(499): #Specify the order you want hosts assigned to the 'columns'
attribute to be displayed.
14:45:48(500): # Valid values are as follows:
14:45:48(505): # alphabetical - automatically sorts members of the
group alphabetically
14:45:48(510): # group - displays the members of the group exactly in
the order they are defined
14:45:48(512): colOrder: "alphabetical"
14:45:48(517): # Set to 1 if you don't want results to be checked from a
host to itself. Set to 0
14:45:48(518): # otherwise.
14:45:48(519): excludeSelf: 1
14:45:48(524): # The section below allows you to exclude individual
checks. The structure is a map
14:45:48(530): # where the key is the name of the row where you want to
exclude a check. It should match
14:45:48(535): # a member of the group assigned to the "rows" property
of this grid or it can be the special
14:45:48(541): # key 'default' that matches every row. The value is a
list of columns that should not appear
14:45:48(548): # in the grid. An item in the list must be a member of
the group assigned to the "columns"
14:45:48(554): # property of this grid or the special value "all" which
removes all columns for a row. The example
14:45:48(556): # below does the following:
14:45:48(559): # * Excludes the column albu-pt1.es.net from every
row
14:45:48(564): # * Excludes the column chic-pt1.es.net from the row
bois-owamp.es.net
14:45:48(567): # * Excludes every column in the row
bost-owamp.es.net
14:45:48(568): # excludeChecks:
14:45:48(569): # default:
14:45:48(571): # - "albu-pt1.es.net"
14:45:48(572): # bois-owamp.es.net:
14:45:48(574): # - "chic-pt1.es.net"
14:45:48(575): # bost-owamp.es.net:
14:45:48(577): # - "all"
14:45:48(577): #
14:45:48(581): # Determines which checks will be run. Valid values are
as follows:
14:45:48(586): # all - Run a check between every row and column. This
will be the most common.
14:45:48(592): # afterSelf - Run a check to every host that's defined
after the current row in the 'rows' group
14:45:48(599): # beforeSelf - Run a check to every host that's defined
before the current row in the 'rows' group
14:45:48(601): columnAlgorithm: "all"
14:45:48(606): # Reference a report (see further down in this file)
to apply to this grid
14:45:48(608): report: "loss_mesh_report"
14:45:48(613): # 'statusLabels' is where you provide a human-readable
description of what each
14:45:48(619): # threshold means. These are the values that will be
displayed in the legend. If a
14:45:48(624): # threshold is not defined below then that will not be
displayed in the legend.
14:45:48(625): # Valid values are as follows:
14:45:48(628): # ok: corresponds to the green status.
14:45:48(630): # warning: the yellow status.
14:45:48(632): # critical: the red status.
14:45:48(634): # unknown: the orange status
14:45:48(642): # notrun: the gray status. Means the check has not
run yet. Should only happen the first time a check is deployed.
14:45:48(643): statusLabels:
14:45:48(644): ok: "Loss is 0"
14:45:48(647): critical: "Loss is greater than 0"
14:45:48(649): unknown: "Unable to retrieve data"
14:45:48(651): notrun: "Check has not yet run"
14:45:48(655): #note: 'warning' not defined because no warning
threshold set
14:45:48(660): #Below is a second grid that defines the throughput checks
to the BWCTL hosts defined.
14:45:48(667): # The parameters have the same meaning as the OWAMP check
above so they are not repeated here.
14:45:48(670): - name: "KAUST ScienceDMZ - TCP BWCTL Throughput"
14:45:48(675): added_by_psconfig: 6 # remove this if you want to keep
after psconfig runs
14:45:48(676): rows: "DMZBwctlHosts"
14:45:48(678): columns: "DMZBwctlHosts"
14:45:48(679): checks:
14:45:48(681): - "bwctlCheck"
14:45:48(684): - "bwctlRevCheck"
14:45:48(685): rowOrder: "alphabetical"
14:45:48(687): colOrder: "alphabetical"
14:45:48(688): excludeSelf: 1
14:45:48(690): columnAlgorithm: "all"
14:45:48(692): report: "throughput_mesh_report"
14:45:48(693): statusLabels:
14:45:48(695): ok: "Throughput >= 900Mbps"
14:45:48(697): ok: "Throughput >= 90Mbps"
14:45:48(699): warning: "Throughput < 90Mbps"
14:45:48(701): critical: "Throughput < 10Mbps"
14:45:48(704): unknown: "Unable to retrieve data"
14:45:48(706): notrun: "Check has not yet run"
14:45:48(714): #we can define additional states here. A common
one is the "scheduled event state" supported by the admin UI.
14:45:48(715): extra:
14:45:48(716): - value: 5
14:45:48(718): shortName: EVENT
14:45:48(720): description: "Down for maintenance"
14:45:48(720):
14:45:48(720):
14:45:48(720):
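The comments in the grid definitions above describe how excludeSelf, columnAlgorithm and excludeChecks decide which cells of a grid are checked. As a rough illustration only, this can be sketched in Python as below. The function name grid_cells, the host names and the handling of a column that is not in the 'rows' group are assumptions made for the sketch; this is not MaDDash's own code.

```python
from typing import Dict, List, Optional, Tuple


def grid_cells(
    rows: List[str],
    columns: List[str],
    exclude_self: bool = True,
    column_algorithm: str = "all",
    exclude_checks: Optional[Dict[str, List[str]]] = None,
) -> List[Tuple[str, str]]:
    """Return the (row, column) cells that would be checked (illustrative only)."""
    exclude_checks = exclude_checks or {}
    position = {host: i for i, host in enumerate(rows)}
    cells = []
    for row in rows:
        # 'default' applies to every row; a row-specific list adds to it.
        excluded = set(exclude_checks.get("default", []))
        excluded |= set(exclude_checks.get(row, []))
        if "all" in excluded:
            continue  # "all" removes every column for this row
        for col in columns:
            if col in excluded:
                continue
            if exclude_self and col == row:
                continue
            if column_algorithm in ("afterSelf", "beforeSelf"):
                # Assumption: a column that is not in the 'rows' group is skipped.
                if col not in position:
                    continue
                if column_algorithm == "afterSelf" and position[col] <= position[row]:
                    continue
                if column_algorithm == "beforeSelf" and position[col] >= position[row]:
                    continue
            cells.append((row, col))
    return cells


# Hypothetical four-host group used for both rows and columns.
hosts = ["host-a", "host-b", "host-c", "host-d"]
full_mesh = grid_cells(hosts, hosts)
half_mesh = grid_cells(hosts, hosts, column_algorithm="afterSelf")
print(len(full_mesh), len(half_mesh))
```

With four hosts, "all" plus excludeSelf gives 12 cells and "afterSelf" gives 6. Each grid above lists two checks per cell, so the number of checks per grid is twice the cell count, which grows quickly as a group gets larger.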
14:45:48(725):#'dashboards' provide a way to group grids together. Grids
grouped in this manner will
14:45:48(730):# be displayed on the same page together. This is a list, so
you can define as many
14:45:48(732):# dashboards as you want.
14:45:48(733):dashboards:
14:45:48(738): #The following defines a dashboard that shows the BWCTL
and OWAMP results on the same
14:45:48(744): # page. The 'name' parameter defines the title that will
be displayed on the dashboard
14:45:48(746): - name: "1: KAUST Residential"
14:45:48(750): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:48(755): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(760): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(761): grids:
14:45:48(763): - name: "KAUST Residential - Reliability"
14:45:48(765): - name: "KAUST Residential - Speed"
14:45:48(765):
14:45:48(765):
14:45:48(767): - name: "2: KAUST Research"
14:45:48(771): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:48(776): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(781): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(782): grids:
14:45:48(784): - name: "KAUST Research - Reliability"
14:45:48(786): - name: "KAUST Research - Speed"
14:45:48(786):
14:45:48(788): - name: "3: KAUST Non-Research"
14:45:48(792): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:48(797): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(802): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(803): grids:
14:45:48(805): - name: "KAUST Non-Research - Reliability"
14:45:48(808): - name: "KAUST Non-Research - Speed"
14:45:48(808):
14:45:48(809): - name: "4: KAUST Data Center"
14:45:48(814): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:48(819): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(824): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(825): grids:
14:45:48(827): - name: "KAUST Data Center - Reliability"
14:45:48(829): - name: "KAUST Data Center - Speed"
14:45:48(829):
14:45:48(832): - name: "5: KAUST International Internet Gateways"
14:45:48(836): added_by_psconfig: 2 # remove this if you want to keep
after psconfig runs
14:45:48(841): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(846): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(847): grids:
14:45:48(851): - name: "KAUST International Internet Gateways - Reliability"
14:45:48(854): - name: "KAUST International Internet Gateways - Speed"
14:45:48(854):
14:45:48(854):
14:45:48(857): - name: "6: KAUST ScienceDMZ"
14:45:48(862): added_by_psconfig: 6 # remove this if you want to keep
after psconfig runs
14:45:48(867): # Defines the list of grids that belong to this
dashboard. Each 'name' must
14:45:48(872): # correspond to the name defined under the 'grids'
sections of this config file.
14:45:48(872): grids:
14:45:48(875): - name: "KAUST ScienceDMZ - OWAMP Latency"
14:45:48(878): - name: "KAUST ScienceDMZ - TCP BWCTL Throughput"
14:45:48(878):
14:45:48(878):
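The comments in the 'dashboards' section say that each grid 'name' listed under a dashboard must correspond to a name defined under 'grids'. A minimal sketch of that cross-check is below, assuming the YAML has already been loaded into a Python dict. The helper name missing_grid_refs and the trimmed-down sample dict are hypothetical; the ScienceDMZ grid is left out of the sample on purpose to show what a miss looks like.

```python
from typing import Dict, List


def missing_grid_refs(config: Dict) -> Dict[str, List[str]]:
    """Map each dashboard name to the grid names it lists that are not defined."""
    defined = {grid["name"] for grid in config.get("grids", [])}
    missing = {}
    for dashboard in config.get("dashboards", []):
        unknown = [
            ref["name"]
            for ref in dashboard.get("grids", [])
            if ref["name"] not in defined
        ]
        if unknown:
            missing[dashboard["name"]] = unknown
    return missing


# Hypothetical, trimmed-down config as it might look after loading the YAML.
config = {
    "grids": [
        {"name": "KAUST Research - Reliability"},
        {"name": "KAUST Research - Speed"},
    ],
    "dashboards": [
        {
            "name": "2: KAUST Research",
            "grids": [
                {"name": "KAUST Research - Reliability"},
                {"name": "KAUST Research - Speed"},
            ],
        },
        {
            "name": "6: KAUST ScienceDMZ",
            "grids": [{"name": "KAUST ScienceDMZ - OWAMP Latency"}],
        },
    ],
}
print(missing_grid_refs(config))
```

An empty result would mean every dashboard entry points at a defined grid; any name that shows up is either a typo or a grid that was removed from the 'grids' section.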
14:45:48(884):#'reports' provide a way to define patterns that can be used
to alert on certain types of problems.
14:45:48(889):# Since every dashboard is different, reports provide a
flexible way to define patterns
14:45:48(891):#defaultReport: "grid_up_down_report"
14:45:48(891):reports:
14:45:48(892): -
14:45:48(894): id: "grid_up_down_report"
14:45:48(898): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:48(899): rule:
14:45:48(900): type: matchFirst
14:45:48(901): rules:
14:45:48(902): -
14:45:48(903): type: rule
14:45:48(905): selector:
14:45:48(906): type: grid
14:45:48(907): match:
14:45:48(909): type: status
14:45:48(911): status: 3
14:45:48(912): problem:
14:45:48(914): severity: 3
14:45:48(916): category: CONFIGURATION
14:45:48(918): message: "Grid is down"
14:45:48(919): solutions:
14:45:48(923): - "Check your maddash configuration"
14:45:48(924): -
14:45:48(925): type: rule
14:45:48(926): selector:
14:45:48(928): type: grid
14:45:48(929): match:
14:45:48(931): type: status
14:45:48(932): status: 0
14:45:48(934): problem:
14:45:48(935): severity: 0
14:45:48(938): category: CONFIGURATION
14:45:48(940): message: "No issues found"
14:45:48(941): -
14:45:48(943): id: "throughput_mesh_report"
14:45:48(947): added_by_psconfig: 1 # remove this if you want to keep
after psconfig runs
14:45:48(948): rule:
14:45:48(949): type: matchFirst
14:45:48(950): rules:
14:45:48(951): -
14:45:48(953): type: rule
14:45:48(954): selector:
14:45:48(956): type: grid
14:45:48(957): match:
14:45:48(958): type: status
14:45:48(960): status: 3
14:45:48(961): problem:
14:45:48(963): severity: 3
14:45:48(965): category: CONFIGURATION
14:45:48(967): message: "Grid is down"
14:45:48(969): solutions:
14:45:48(985): - "If you just configured this
grid in the mesh, you may just need to wait as it takes several hours for
throughput data to populate (depending on the interval between tests)"
14:45:49(019): - "Verify maddash is configured
properly. Look in the files under /var/log/maddash/ for any errors. Things to
look for are incorrect paths to checks or connection errors."
14:45:49(047): - "Verify that
/usr/lib/perfsonar/bin/generate_gui_configuration has run recently and you
are looking at an accurate test mesh"
14:45:49(051): - "Verify that your measurement archive(s) are running"
14:45:49(058): - "Verify no firewall is blocking
maddash from reaching your measurement archive(s)"
14:45:49(070): - "Verify your hosts are
downloading the mesh configuration file and that there are tests defined in
/etc/perfsonar/regulartesting.conf"
14:45:49(078): - "Verify that regular testing is
running (/etc/init.d/perfsonar-regulartesting status)"
14:45:49(095): - "Verify your hosts are able to
reach their configured measurement archive and that there are no errors in
/var/log/perfsonar/regulartesting.log"
14:45:49(104): -
14:45:49(105): type: rule
14:45:49(107): selector:
14:45:49(108): type: grid
14:45:49(110): match:
14:45:49(112): type: status
14:45:49(113): status: 0
14:45:49(115): problem:
14:45:49(116): severity: 0
14:45:49(118): category: CONFIGURATION
14:45:49(121): message: "No issues found"
14:45:49(122): -
14:45:49(123): type: forEachSite
14:45:49(124): rule:
14:45:49(127): type: matchFirst
14:45:49(129): rules:
14:45:49(130): -
14:45:49(132): type: rule
14:45:49(134): selector:
14:45:49(136): type: site
14:45:49(138): match:
14:45:49(140): type: status
14:45:49(142): status: 3

14:45:49(144): problem:
14:45:49(146): severity: 3
14:45:49(149): category: CONFIGURATION
14:45:49(152): message: "Site is down"
14:45:49(155): solutions:
14:45:49(158): - "Verify the host is up"
14:45:49(175): - "If recently added
to the mesh, verify the mesh config file has been downloaded by the end-hosts
since the update. It may also take several hours for the first BWCTL test to
run on this host"
14:45:49(211): - "If recently removed
from the mesh, verify that /usr/lib/perfsonar/bin/generate_gui_configuration
has run recently and you are looking at an accurate test mesh"
14:45:49(216): - "Verify NTP is synced on this host"
14:45:49(232): - "Verify the local
and remote sites allow access to TCP port 4823, TCP/UDP ports 6001-6200, and
TCP/UDP ports 5001-5900"
14:45:49(233): -
14:45:49(235): type: rule
14:45:49(237): selector:
14:45:49(239): type: row
14:45:49(241): match:
14:45:49(243): type: status
14:45:49(245): status: 3

14:45:49(247): problem:
14:45:49(249): severity: 3
14:45:49(252): category: CONFIGURATION
14:45:49(258): message: "Unable to run
and/or query any outgoing throughput tests."
14:45:49(261): solutions:
14:45:49(269): - "Verify you are not
blocking any of the required outgoing BWCTL ports in your firewall"
14:45:49(279): - "Verify the remote
sites allow your host to access TCP/UDP ports 5001-5900"
14:45:49(292): - "Verify the limits
defined in /etc/bwctl-server/bwctl-server.limits are properly defined and not
being exceeded by the tests"
14:45:49(293): -
14:45:49(295): type: rule
14:45:49(297): selector:
14:45:49(299): type: column
14:45:49(301): match:
14:45:49(303): type: status
14:45:49(305): status: 3

14:45:49(307): problem:
14:45:49(309): severity: 3
14:45:49(312): category: CONFIGURATION
14:45:49(319): message: "Unable to run
and/or query any incoming throughput tests."
14:45:49(321): solutions:
14:45:49(330): - "Verify your host
and router firewalls are allowing TCP/UDP 5001-5900"
14:45:49(342): - "Verify the limits
defined in /etc/bwctl-server/bwctl-server.limits are properly defined and not
being exceeded by the tests"
14:45:49(344): -
14:45:49(346): type: matchAll
14:45:49(348): rules:
14:45:49(350): -
14:45:49(352): type: matchFirst
14:45:49(354): rules:
14:45:49(357): -
14:45:49(364): type: rule
14:45:49(366): selector:
14:45:49(369): type: check
14:45:49(372): rowIndex: 0
14:45:49(375): colIndex: 1
14:45:49(378): match:
14:45:49(381): type: status
14:45:49(384): status: 3
14:45:49(387): problem:
14:45:49(390): severity: 3
14:45:49(394): category: CONFIGURATION
14:45:49(403): message:
"Tests initiated at this site are failing in both incoming and outgoing
directions"
14:45:49(407): solutions:
14:45:49(414): -
"Verify that your measurement archive(s) are running"
14:45:49(423): -
"Verify no firewall is blocking maddash from reaching your measurement
archive(s)"
14:45:49(438): -
"Verify your hosts are downloading the mesh configuration file and that there
are tests defined in /etc/perfsonar/regulartesting.conf"
14:45:49(449): -
"Verify that regular testing is running (/etc/init.d/perfsonar-regulartesting
status)"
14:45:49(465): -
"Verify your hosts are able to reach their configured measurement archive and
that there are no errors in /var/log/perfsonar/regulartesting.log"
14:45:49(484): -
14:45:49(488): type: rule
14:45:49(491): selector:
14:45:49(494): type: check
14:45:49(497): rowIndex: 0
14:45:49(501): colIndex: 1
14:45:49(503): match:
14:45:49(507): type: statusThreshold
14:45:49(510): status: 3
14:45:49(513): threshold: .6
14:45:49(516): problem:
14:45:49(519): severity: 3
14:45:49(523): category: CONFIGURATION
14:45:49(535): message:
"A majority (but not all) of tests initiated by this site are failing in both
incoming and outgoing directions"
14:45:49(539): solutions:
14:45:49(547): -
"Check if the sites that are failing are blocking TCP port 4823."
14:45:49(557): -
"Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any
errors."
14:45:49(566): -
"Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:49(581): -
"Restart perfsonar-regulartesting, it may not have picked-up configuration
changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:49(582):
14:45:49(584): -
14:45:49(587): type: rule
14:45:49(590): selector:
14:45:49(593): type: check
14:45:49(597): rowIndex: 0
14:45:49(599): match:
14:45:49(603): type: statusThreshold
14:45:49(606): status: 3
14:45:49(609): threshold: .6
14:45:49(612): problem:
14:45:49(615): severity: 3
14:45:49(619): category: CONFIGURATION
14:45:49(627): message:
"Tests initiated at this site are failing in the outgoing direction"
14:45:49(630): solutions:
14:45:49(640): -
"Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any
errors."
14:45:49(650): -
"Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:49(665): -
"Restart perfsonar-regulartesting, it may not have picked-up configuration
changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:49(667): -
14:45:49(670): type: rule
14:45:49(673): selector:
14:45:49(676): type: check
14:45:49(679): colIndex: 1
14:45:49(681): match:
14:45:49(685): type: statusThreshold
14:45:49(688): status: 3
14:45:49(691): threshold: .6
14:45:49(694): problem:
14:45:49(697): severity: 3
14:45:49(701): category: CONFIGURATION
14:45:49(709): message:
"Tests initiated at this site are failing in the incoming direction"
14:45:49(712): solutions:
14:45:49(724): -
"Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any
errors."
14:45:49(733): -
"Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:49(749): -
"Restart perfsonar-regulartesting, it may not have picked-up configuration
changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:49(751): -
14:45:49(754): type: matchFirst
14:45:49(756): rules:
14:45:49(758): -
14:45:49(761): type: rule
14:45:49(763): selector:
14:45:49(767): type: check
14:45:49(770): rowIndex: 1
14:45:49(773): colIndex: 0
14:45:49(775): match:
14:45:49(778): type: status
14:45:49(781): status: 3
14:45:49(784): problem:
14:45:49(787): severity: 3
14:45:49(791): category: CONFIGURATION
14:45:49(801): message:
"Tests initiated by remote sites are failing in both incoming and outgoing
directions"
14:45:49(804): solutions:
14:45:49(815): -
"Verify that the local site has TCP port 4823 open on the host and router
firewalls"
14:45:49(825): -
"Verify that bwctl-server is running on the host with
'/etc/init.d/bwctl-server status'"
14:45:49(840): -
"Verify the limits defined in /etc/bwctl-server/bwctl-server.limits are
properly defined and not being exceeded by the tests"
14:45:49(842): -
14:45:49(845): type: rule
14:45:49(848): selector:
14:45:49(851): type: check
14:45:49(854): rowIndex: 1
14:45:49(857): colIndex: 0
14:45:49(860): match:
14:45:49(864): type: statusThreshold
14:45:49(867): status: 3
14:45:49(870): threshold: .6
14:45:49(873): problem:
14:45:49(876): severity: 3
14:45:49(880): category: CONFIGURATION
14:45:49(893): message:
"A majority (but not all) of tests initiated by remote sites are failing in
both incoming and outgoing directions"
14:45:49(896): solutions:
14:45:49(911): -
"Verify that the local site has TCP port 4823 open on the host and router
firewalls to all hosts in the mesh"
14:45:49(926): -
"Verify the limits defined in /etc/bwctl-server/bwctl-server.limits are
properly defined and not being exceeded by the tests"
14:45:49(928): -
14:45:49(931): type: rule
14:45:49(934): selector:
14:45:49(937): type: check
14:45:49(940): rowIndex: 1
14:45:49(943): match:
14:45:49(946): type: statusThreshold
14:45:49(949): status: 3
14:45:49(953): threshold: .6
14:45:49(955): problem:
14:45:49(958): severity: 3
14:45:49(963): category: CONFIGURATION
14:45:49(971): message:
"Tests initiated by remote sites are failing in the outgoing direction"
14:45:49(974): solutions:
14:45:49(984): -
"Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any
errors."
14:45:49(993): -
"Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:50(008): -
"Restart perfsonar-regulartesting, it may not have picked-up configuration
changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:50(010): -
14:45:50(013): type: rule
14:45:50(016): selector:
14:45:50(019): type: check
14:45:50(022): colIndex: 0
14:45:50(025): match:
14:45:50(029): type: statusThreshold
14:45:50(032): status: 3
14:45:50(035): threshold: .6
14:45:50(038): problem:
14:45:50(041): severity: 3
14:45:50(045): category: CONFIGURATION
14:45:50(054): message: "Tests initiated by remote sites are failing in the incoming direction"
14:45:50(057): solutions:
14:45:50(068): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:50(076): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:50(092): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:50(094): -
14:45:50(097): type: rule
14:45:50(099): selector:
14:45:50(101): type: row
14:45:50(104): match:
14:45:50(107): type: statusWeightedThreshold
14:45:50(110): statuses:
14:45:50(113): - 0.0
14:45:50(115): - .05
14:45:50(118): - 1.0
14:45:50(120): - -1.0
14:45:50(123): threshold: .6
14:45:50(125): problem:
14:45:50(128): severity: 2
14:45:50(132): category: PERFORMANCE
14:45:50(141): message: "Outgoing throughput is below warning or critical thresholds to a majority of sites"
14:45:50(143): -
14:45:50(145): type: rule
14:45:50(147): selector:
14:45:50(150): type: column
14:45:50(152): match:
14:45:50(156): type: statusWeightedThreshold
14:45:50(159): statuses:
14:45:50(161): - 0.0
14:45:50(164): - .05
14:45:50(167): - 1.0
14:45:50(169): - -1.0
14:45:50(172): threshold: .6
14:45:50(174): problem:
14:45:50(177): severity: 2
14:45:50(180): category: PERFORMANCE
14:45:50(190): message: "Incoming throughput is below warning or critical thresholds to a majority of sites"
14:45:50(190): -
14:45:50(192): id: "loss_mesh_report"
14:45:50(197): added_by_psconfig: 1 # remove this if you want to keep after psconfig runs
14:45:50(198): rule:
14:45:50(199): type: matchFirst
14:45:50(200): rules:
14:45:50(201): -
14:45:50(202): type: rule
14:45:50(203): selector:
14:45:50(205): type: grid
14:45:50(206): match:
14:45:50(208): type: status
14:45:50(210): status: 3
14:45:50(211): problem:
14:45:50(213): severity: 3
14:45:50(215): category: CONFIGURATION
14:45:50(217): message: "Grid is down"
14:45:50(219): solutions:
14:45:50(230): - "If you just configured this grid in the mesh, you may just need to wait as it takes a few minutes for loss data to populate"
14:45:50(247): - "Verify maddash is configured properly. Look in the files under /var/log/maddash/ for any errors. Things to look for are incorrect paths to checks or connection errors."
14:45:50(275): - "Verify that /usr/lib/perfsonar/bin/generate_gui_configuration has run recently and you are looking at an accurate test mesh"
14:45:50(279): - "Verify that your measurement archive(s) are running"
14:45:50(286): - "Verify no firewall is blocking maddash from reaching your measurement archive(s)"
14:45:50(299): - "Verify your hosts are downloading the mesh configuration file and that there are tests defined in /etc/perfsonar/regulartesting.conf"
14:45:50(306): - "Verify that regular testing is running (/etc/init.d/perfsonar-regulartesting status)"
14:45:50(319): - "Verify your hosts are able to reach their configured measurement archive and that there are no errors in /var/log/perfsonar/regulartesting.log"
14:45:50(321): -
14:45:50(322): type: rule
14:45:50(324): selector:
14:45:50(325): type: grid
14:45:50(326): match:
14:45:50(329): type: status
14:45:50(335): status: 0
14:45:50(337): problem:
14:45:50(338): severity: 0
14:45:50(340): category: CONFIGURATION
14:45:50(343): message: "No issues found"
14:45:50(344): -
14:45:50(346): type: forEachSite
14:45:50(347): rule:
14:45:50(349): type: matchFirst
14:45:50(350): rules:
14:45:50(351): -
14:45:50(353): type: rule
14:45:50(355): selector:
14:45:50(357): type: site
14:45:50(359): match:
14:45:50(361): type: status
14:45:50(364): status: 3

14:45:50(365): problem:
14:45:50(367): severity: 3
14:45:50(370): category: CONFIGURATION
14:45:50(373): message: "Site is down"
14:45:50(375): solutions:
14:45:50(379): - "Verify the host is up"
14:45:50(390): - "If recently added to the mesh, verify the mesh config file has been downloaded by the end-hosts since the update."
14:45:50(407): - "If recently removed from the mesh, verify that /usr/lib/perfsonar/bin/generate_gui_configuration has run recently and you are looking at an accurate test mesh"
14:45:50(423): - "Verify the local and remote sites allow access to TCP port 861 and UDP ports 8760-9960"

14:45:50(430): -
14:45:50(432): type: rule
14:45:50(434): selector:
14:45:50(436): type: row
14:45:50(438): match:
14:45:50(440): type: status
14:45:50(442): status: 3

14:45:50(444): problem:
14:45:50(447): severity: 3
14:45:50(449): category: CONFIGURATION
14:45:50(456): message: "Unable to run and/or query any outgoing one-way delay tests."
14:45:50(458): solutions:
14:45:50(466): - "Verify you are not blocking any of the required outgoing OWAMP ports in your firewall"
14:45:50(475): - "Verify the remote sites allow your host to access UDP ports 8760-9960"
14:45:50(476): -
14:45:50(479): type: rule
14:45:50(481): selector:
14:45:50(483): type: column
14:45:50(484): match:
14:45:50(487): type: status
14:45:50(489): status: 3

14:45:50(491): problem:
14:45:50(493): severity: 3
14:45:50(496): category: CONFIGURATION
14:45:50(502): message: "Unable to run and/or query any incoming one-way delay tests."
14:45:50(505): solutions:
14:45:50(514): - "Verify your host and router firewalls are allowing UDP ports 8760-9960"
14:45:50(515): -
14:45:50(517): type: matchAll
14:45:50(519): rules:
14:45:50(521): -
14:45:50(523): type: matchFirst
14:45:50(525): rules:
14:45:50(527): -
14:45:50(530): type: rule
14:45:50(533): selector:
14:45:50(536): type: check
14:45:50(539): rowIndex: 0
14:45:50(542): colIndex: 1
14:45:50(545): match:
14:45:50(548): type: status
14:45:50(551): status: 3
14:45:50(554): problem:
14:45:50(557): severity: 3
14:45:50(561): category: CONFIGURATION
14:45:50(570): message: "Tests initiated at this site are failing in both incoming and outgoing directions"
14:45:50(574): solutions:
14:45:50(580): - "Verify that your measurement archive(s) are running"
14:45:50(590): - "Verify no firewall is blocking maddash from reaching your measurement archive(s)"
14:45:50(605): - "Verify your hosts are downloading the mesh configuration file and that there are tests defined in /etc/perfsonar/regulartesting.conf"
14:45:50(615): - "Verify that regular testing is running (/etc/init.d/perfsonar-regulartesting status)"
14:45:50(632): - "Verify your hosts are able to reach their configured measurement archive and that there are no errors in /var/log/perfsonar/regulartesting.log"
14:45:50(651): -
14:45:50(653): type: rule
14:45:50(656): selector:
14:45:50(660): type: check
14:45:50(663): rowIndex: 0
14:45:50(666): colIndex: 1
14:45:50(668): match:
14:45:50(672): type: statusThreshold
14:45:50(675): status: 3
14:45:50(678): threshold: .6
14:45:50(681): problem:
14:45:50(684): severity: 3
14:45:50(688): category: CONFIGURATION
14:45:50(700): message: "A majority (but not all) of tests initiated by this site are failing in both incoming and outgoing directions"
14:45:50(703): solutions:
14:45:50(711): - "Check if the sites that are failing are blocking TCP port 861."
14:45:50(721): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:50(732): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:50(748): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:50(749):
14:45:50(751): -
14:45:50(753): type: rule
14:45:50(756): selector:
14:45:50(759): type: check
14:45:50(762): rowIndex: 0
14:45:50(765): match:
14:45:50(769): type: statusThreshold
14:45:50(771): status: 3
14:45:50(775): threshold: .6
14:45:50(778): problem:
14:45:50(781): severity: 3
14:45:50(785): category: CONFIGURATION
14:45:50(793): message: "Tests initiated at this site are failing in the outgoing direction"
14:45:50(796): solutions:
14:45:50(806): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:50(815): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:50(831): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:50(834): -
14:45:50(836): type: rule
14:45:50(839): selector:
14:45:50(842): type: check
14:45:50(846): colIndex: 1
14:45:50(849): match:
14:45:50(852): type: statusThreshold
14:45:50(855): status: 3
14:45:50(859): threshold: .6
14:45:50(861): problem:
14:45:50(865): severity: 3
14:45:50(868): category: CONFIGURATION
14:45:50(877): message: "Tests initiated at this site are failing in the incoming direction"
14:45:50(880): solutions:
14:45:50(890): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:50(899): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:50(914): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:50(916): -
14:45:50(919): type: matchFirst
14:45:50(921): rules:
14:45:50(923): -
14:45:50(925): type: rule
14:45:50(928): selector:
14:45:50(932): type: check
14:45:50(935): rowIndex: 1
14:45:50(938): colIndex: 0
14:45:50(940): match:
14:45:50(943): type: status
14:45:50(947): status: 3
14:45:50(949): problem:
14:45:50(952): severity: 3
14:45:50(956): category: CONFIGURATION
14:45:50(966): message: "Tests initiated by remote sites are failing in both incoming and outgoing directions"
14:45:50(969): solutions:
14:45:50(979): - "Verify that the local site has TCP port 861 open on the host and router firewalls"
14:45:50(989): - "Verify that owamp-server is running on the host with '/etc/init.d/owamp-server status'"
14:45:50(991): -
14:45:50(994): type: rule
14:45:50(997): selector:
14:45:51(000): type: check
14:45:51(003): rowIndex: 1
14:45:51(006): colIndex: 0
14:45:51(009): match:
14:45:51(013): type: statusThreshold
14:45:51(016): status: 3
14:45:51(019): threshold: .6
14:45:51(022): problem:
14:45:51(025): severity: 3
14:45:51(029): category: CONFIGURATION
14:45:51(042): message: "A majority (but not all) of tests initiated by remote sites are failing in both incoming and outgoing directions"
14:45:51(045): solutions:
14:45:51(058): - "Verify that the local site has TCP port 861 open on the host and router firewalls to all hosts in the mesh"
14:45:51(060): -
14:45:51(064): type: rule
14:45:51(067): selector:
14:45:51(070): type: check
14:45:51(073): rowIndex: 1
14:45:51(076): match:
14:45:51(079): type: statusThreshold
14:45:51(082): status: 3
14:45:51(086): threshold: .6
14:45:51(088): problem:
14:45:51(091): severity: 3
14:45:51(095): category: CONFIGURATION
14:45:51(104): message: "Tests initiated by remote sites are failing in the outgoing direction"
14:45:51(107): solutions:
14:45:51(117): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:51(126): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:51(142): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:51(144): -
14:45:51(147): type: rule
14:45:51(150): selector:
14:45:51(153): type: check
14:45:51(156): colIndex: 0
14:45:51(159): match:
14:45:51(162): type: statusThreshold
14:45:51(165): status: 3
14:45:51(169): threshold: .6
14:45:51(172): problem:
14:45:51(175): severity: 3
14:45:51(179): category: CONFIGURATION
14:45:51(187): message: "Tests initiated by remote sites are failing in the incoming direction"
14:45:51(191): solutions:
14:45:51(201): - "Verify that /usr/lib/perfsonar/bin/generate_configuration doesn't throw any errors."
14:45:51(210): - "Verify that /etc/perfsonar/regulartesting.conf contains the proper tests"
14:45:51(225): - "Restart perfsonar-regulartesting, it may not have picked-up configuration changes (/etc/init.d/perfsonar-regulartesting restart)"
14:45:51(227): -
14:45:51(229): type: rule
14:45:51(232): selector:
14:45:51(234): type: row
14:45:51(236): match:
14:45:51(240): type: statusWeightedThreshold
14:45:51(243): statuses:
14:45:51(245): - 0.0
14:45:51(248): - .05
14:45:51(250): - 1.0
14:45:51(253): - -1.0
14:45:51(256): threshold: .6
14:45:51(258): problem:
14:45:51(261): severity: 2
14:45:51(264): category: PERFORMANCE
14:45:51(273): message: "Outgoing loss is below warning or critical thresholds to a majority of sites"
14:45:51(275): -
14:45:51(277): type: rule
14:45:51(279): selector:
14:45:51(282): type: column
14:45:51(285): match:
14:45:51(288): type: statusWeightedThreshold
14:45:51(291): statuses:
14:45:51(293): - 0.0
14:45:51(296): - .05
14:45:51(299): - 1.0
14:45:51(301): - -1.0
14:45:51(304): threshold: .6
14:45:51(306): problem:
14:45:51(309): severity: 2
14:45:51(312): category: PERFORMANCE
14:45:51(321): message: "Incoming loss is below warning or critical thresholds to a majority of sites"
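
The attached config above is a terminal capture, so every line carries a logger prefix such as "14:45:50(197):" and some long lines may have been wrapped. For anyone who wants to read or diff it, here is a minimal Python sketch that strips the prefix and rejoins wrapped lines. The prefix pattern is inferred from this capture only, and the function name clean_log is made up for illustration. Indentation was already lost in the capture, so the result is a flat list of entries, not valid YAML.

```python
import re

# Logger prefix as seen in this capture, e.g. "14:45:50(197): ".
# Assumption: the format is HH:MM:SS(mmm): on every logged line.
PREFIX = re.compile(r"^\d{2}:\d{2}:\d{2}\(\d{3}\):\s?")


def clean_log(text):
    """Strip logger prefixes and rejoin continuation lines.

    A line without a prefix is treated as the wrapped tail of the
    previous logged line and is appended to it with a single space.
    """
    out = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue  # skip blank lines left by the archive
        m = PREFIX.match(line)
        if m:
            out.append(line[m.end():].strip())
        elif out:
            out[-1] = (out[-1] + " " + line).strip()
        else:
            out.append(line)
    # Drop entries that consisted of a prefix and nothing else.
    return [entry for entry in out if entry]


if __name__ == "__main__":
    sample = (
        "14:45:49(864): type:\n"
        "statusThreshold\n"
        "14:45:49(867): status: 3\n"
        "14:45:49(870): threshold:\n"
        ".6\n"
    )
    for entry in clean_log(sample):
        print(entry)
```

Run on the five sample lines, this prints "type: statusThreshold", "status: 3" and "threshold: .6". To get a loadable file, it is probably easier to copy /etc/maddash/maddash-server/maddash.yaml from the host directly than to rebuild the nesting from this capture.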

