OSG CE 0.4.1 Install Instructions
Introduction
This document is intended for administrators responsible for installing and configuring:
- OSG Compute Element (CE) version 0.4.1 onto OSG Production Resources
It is not meant as an all-inclusive guide to Grid computing, or even to all the options for configuring a CE.
Operating Systems
If you experience problems with the installation of VDT-supported software, or need the general system requirements, the VDT 1.3.10b System Requirements page may be useful.
Creation and setup of local user accounts for VOs for OSG
UNIX user accounts need to be created by the system administrator for the VOs.
You will need to create at least one local user account (with the appropriate configuration) for each VO to which you wish to provide resources. The uid and name of each account can be determined locally. You will be asked to provide a mapping from local accounts to VOs later in the installation; the following default accounts are assumed for the examples in this document.
The accounts are:
- cdf ( cdfosg 783715 )
- fermilab ( fermilab 825662 )
- grase ( grase 783716 )
- fmri ( fmriosg 783717 )
- gadu ( gadu 783718 )
- mis ( misosg 803968 )
- sdss ( sdss 751566 )
- ivdgl ( ivdgl 751565 )
- star ( starosg 789088 )
- usatlas1 ( usatlas 751564 )
- usatlas2 ( usatlasa 789089 )
- usatlas3 ( usatlasb 789090 )
- usatlas4 ( usatlasc 55620 )
- uscms01 ( uscmsa 751562 )
- uscms02 ( uscmsb 751563 )
- ligo ( ligo 825671 )
- sam ( sam 825663 )
- samgrid ( samgrid 825664 )
- dosar ( dosar 825665 )
- des ( des 825666 )
- glow ( glow 825668 )
- grow ( grow 825669 )
- gridex ( gridex 825670 )
- nanohub ( nanohub 825672 )
- geant4 ( geant4 825673 )
- i2u2 ( i2u2 825674 )
In addition, you need a globus user account ( globus 825675 ) for web services to run.
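How the accounts are created is up to the site. As a minimal sketch, assuming a local group named 'osg' and standard /home locations (both assumptions, not requirements), the globus account and one VO account from the list above could be created with:
groupadd osg
useradd -u 825675 -g osg -m -c "Globus web services" globus
useradd -u 751564 -g osg -m -c "US ATLAS job account" usatlas1
# ...repeat for the remaining VO accounts, using the uid listed next to each name above.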
Installing the OSG Worker Node Client Package
This must be installed in a location visible to jobs executing on the worker node. Refer to the Worker Node Client Guide for additional information.
A problem we encountered during the 'pacman -get OSG:ce' was that MySQL failed to start:
[gate02:OSG]# grep -i error vdt-install.log
log_level ERROR,WARN,INFO
error: 'Can't connect to local MySQL server through socket '/afs/atlas.umich.edu/OSG/vdt-app-data/mysql/var/mysql.sock' (2)'
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/afs/atlas.umich.edu/OSG/vdt-app-data/mysql/var/mysql.sock' (2)
ERROR 2002 (HY000): Can't connect to local MySQL server through socket '/afs/atlas.umich.edu/OSG/vdt-app-data/mysql/var/mysql.sock' (2)
The issue is that some of the files MySQL needs are located in AFS and the mysql user cannot open or write them there. The solution was to move the affected directory onto local disk and leave a symlink in its place, the same relocate-and-symlink pattern described under 'Problem with OSG/AFS setup' below; a sketch follows. This started mysql on gate02.
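A minimal sketch of that fix; the /opt target and the 'mysql' account name are assumptions, and the exact directory to relocate may differ on other installs:
cd $VDT_LOCATION
cp -arv vdt-app-data /opt/vdt-app-data     # copy the data area out of AFS onto local disk
mv vdt-app-data vdt-app-data.orig          # keep the AFS copy for reference
ln -s /opt/vdt-app-data ./vdt-app-data     # the AFS path now points at the local copy
chown -R mysql /opt/vdt-app-data/mysql     # let the mysql user write its subtree (account name assumed)
# ...then restart MySQL via its VDT init script.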
Problem with OSG:Globus-PBS-Setup
We needed to install the OSG Globus-PBS-Setup package on gate02 because gate02 uses PBS (Torque) as its scheduler. The gate02.grid.umich.edu node is a PBS client of the North Campus Opteron cluster nyx.engin.umich.edu.
I tried the following:
- Get AFS tokens as 'admin': kinit admin; aklog
- export PATH=$PATH:/usr/local/bin:/usr/local/sbin (to ensure the PBS utilities are in my path)
- pacman -get OSG:Globus-PBS-Setup
This failed with:
[gate02:OSG]# pacman -get OSG:Globus-PBS-Setup
Package [/afs/atlas.umich.edu/OSG:OSG:Globus-PBS-Setup] not [installed]:
Package [/afs/atlas.umich.edu/OSG:http://vdt.cs.wisc.edu/vdt_1310_cache:Globus-WS-PBS-Setup] not [installed]:
ERROR: Command failed
Looking in the vdt-install.log I find:
...
Re-run this setup package with PBS_HOME environment variable
pointing to the directory containing the PBS server-logs subdirectory
For Torque, PBS_HOME seems to be /var/spool/torque. Set this and retry:
- Get AFS tokens as 'admin': kinit admin; aklog
- export PATH=$PATH:/usr/local/bin:/usr/local/sbin (to ensure the PBS utilities are in my path)
- export PBS_HOME=/var/spool/torque
- pacman -get OSG:Globus-PBS-Setup
This also failed because it couldn't find a server_logs directory in PBS_HOME:
Re-run this setup package with PBS_HOME environment variable
pointing to the directory containing the PBS server-logs subdirectory
So I did 'mkdir $PBS_HOME/server_logs' and reran and things worked:
running /afs/atlas.umich.edu/OSG/globus/setup/globus/setup-seg-pbs.pl..[ Changing to /afs/atlas.umich.edu/OSG/globus/setup/globus ]
..Done
running /afs/atlas.umich.edu/OSG/globus/setup/globus/setup-globus-scheduler-provider-pbs..[ Changing to /afs/atlas.umich.edu/OSG/globus/setup/globus ]
checking for pbsnodes... /usr/local/bin/pbsnodes
checking for qstat... /usr/local/bin/qstat
find-pbs-provider-tools: creating ./config.status
config.status: creating /afs/atlas.umich.edu/OSG/globus/libexec/globus-scheduler-provider-pbs
..Done
running /afs/atlas.umich.edu/OSG/globus/setup/globus/setup-gram-service-pbs..[ Changing to /afs/atlas.umich.edu/OSG/globus/setup/globus ]
Running /afs/atlas.umich.edu/OSG/globus/setup/globus/setup-gram-service-pbs
..Done
########## [Globus-WS-PBS-Setup] installed successfully
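For reference, the full working sequence on gate02 was therefore roughly the following (adjust PBS_HOME for your Torque install):
kinit admin; aklog                                  # get AFS tokens as 'admin'
export PATH=$PATH:/usr/local/bin:/usr/local/sbin    # make sure the PBS utilities are found
export PBS_HOME=/var/spool/torque                   # Torque's server directory
mkdir -p $PBS_HOME/server_logs                      # the setup package expects this subdirectory
pacman -get OSG:Globus-PBS-Setup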
Normal post-install README steps
The setup said to follow the $VDT_LOCATION/post-install/README steps. The 'edg-crl-update' daemon was already running.
I ran setup-cert-request from the .../OSG/vdt/setup/ directory without a problem. I already had a DOE host cert for gate02.
I needed to create an LDAP cert for gate02:
- grid-cert-request -force -dir /etc/grid-security/ldap/ -service ldap -host gate02.grid.umich.edu
This worked. I used the DOEGrids CA GridAdmin service to get the cert.
I then set up secure MDS:
vdt/setup/configure_mds -secure
I then set up '/etc/sudoers' to allow globus-ws to work:
Runas_Alias GLOBUSUSERS = ALL, !root
globus ALL=(GLOBUSUSERS) \
    NOPASSWD: /afs/atlas.umich.edu/OSG/globus/libexec/globus-gridmap-and-execute \
    -g /etc/grid-security/grid-mapfile \
    /afs/atlas.umich.edu/OSG/globus/libexec/globus-job-manager-script.pl *
globus ALL=(GLOBUSUSERS) \
    NOPASSWD: /afs/atlas.umich.edu/OSG/globus/libexec/globus-gridmap-and-execute \
    -g /etc/grid-security/grid-mapfile \
    /afs/atlas.umich.edu/OSG/globus/libexec/globus-gram-local-proxy-tool *
I then set up PRIMA by copying gsi-authz.conf and prima-authz.conf to /etc/grid-security; a sketch of the copy follows.
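A minimal sketch of that copy, assuming the template files were left in $VDT_LOCATION/post-install by the install (the source location is an assumption; adjust to wherever your install placed them):
cp $VDT_LOCATION/post-install/gsi-authz.conf   /etc/grid-security/    # source path assumed
cp $VDT_LOCATION/post-install/prima-authz.conf /etc/grid-security/    # source path assumed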
Next is the Generic Information Provider (GIP) configuration, done by running vdt/setup/configure_gip:
[gate02:OSG]# vdt/setup/configure_gip --vdt-install $VDT_LOCATION --batch pbs
We have set up the Generic Information Providers (GIP). If you were running
the GRIS (MDS), you must stop and restart it for these changes to take
affect. You can do this with the following commands:
/afs/atlas.umich.edu/OSG/post-install/gris stop
/afs/atlas.umich.edu/OSG/post-install/gris start
Note that this information is also in
/afs/atlas.umich.edu/OSG/post-install/README
Last is to reconfigure MonALISA:
[gate02:OSG]# vdt/setup/configure_monalisa --prompt
To configure MonaLisa you need to specify a few parameters.
Please answer the following questions:
Please specify user account to run MonaLisa daemons as: [monalisa]
This is the name you will be seen by the world, so please choose
a name that represents you. Make sure this name is unique in the
MonaLisa environment.
Please specify the farm name: [gate02.grid.umich.edu] UMATLAS
Your Monitor Group name is important to group your site correctly
in the global site list. OSG users should enter "OSG"
Please enter your monitor group name: [Test] OSG
Please enter your contact name (your full name): [root] Shawn McKee
Contact email (your email): [root@gate02.grid.umich.edu] smckee@umich.edu
City (server's location): [] Ann Arbor, MI
Country: [] US
You can find some approximate values for your geographic location from:
http://geotags.com/
or you can search your location on Google
For USA: LAT is about 29 (South) ... 48 (North)
LONG is about -123 (West coast) ... -71 (East coast)
Location latitude ( -90 (S) .. 90 (N) ): [0] 42.277
Location longitude ( -180 (W) .. 180 (E) ): [0] -83.736
Will you connect to a Ganglia instance (y/n): [n]
Do you want to run OSG_VO_Modules (y/n): [y] y
Please specify GLOBUS location: [/afs/atlas.umich.edu/OSG/globus]
Please specify CONDOR location: []
Please specify the path to condor_config: []
Please specify PBS location: [] /usr/local/
Please specify LSF location: []
Please specify FBS location: []
Please specify SGE location: []
Do you want to enable the MonALISA auto-update feature (y/n): [n] y
This finished the post-install/README instructions.
After this I tried to start MonALISA but it seemed to "hang" on unpacking an update. The problem is again AFS: the 'monalisa' user didn't have the correct access for /afs. I fixed this by moving the whole OSG/MonaLisa/Service/VDTFarm directory into /opt, as I did above to fix MySQL:
- cp -arv VDTFarm /opt/
- mv VDTFarm VDTFarm.orig
- ln -s /opt/VDTFarm ./VDTFarm
- chown -R monalisa.osg /opt/VDTFarm
Then 'service MLD start' works correctly.
In this section I ran the OSG/monitoring/configure-osg.sh script to set up the OSG attributes and MonALISA.
I found that, since I started anew, I needed to restore some of the missing AFS mount points for APP, DATA and HOME under /afs/atlas.umich.edu/OSG.
Here is the run:
[gate02:OSG]# ./monitoring/configure-osg.sh
***********************************************************************
################# Configuration for the OSG CE Node ###################
***********************************************************************
This script collects the necessary information required by the various
monitoring and discovery systems for operating for the OSG.
A definition of the attributes that you will have to enter below is in:
http://osg.ivdgl.org/twiki/bin/view/Integration/LocalStorageRequirements
Intructions on how to use this script are in:
http://osg.ivdgl.org/twiki/bin/view/Integration/LocalStorageConfiguration
Your CE may not provide some of the CE-Storages (DATA, SITE_READ, SITE_WRITE,
DEFAULT_SE). In those instances, the value to enter is UNAVAILABLE
At any time, you can out of the script and no updates will be applied.
Preset information you are not prompted for
--------------------------------------------
These variables are preset at installation and cannot be changed:
OSG location
Globus location
User-VO map file
gridftp.log location
Information about your site in general
--------------------------------------
Group: The monitoring group your site is participating in.
- for production, use OSG.
Site name: The name by which the monitoring infrastructure
will refer to this resource.
Sponsors: The VO sponsors for your site.
For example: usatlas, ivdgl, ligo, uscms, sdss...
You must express the percentage of sponsorship using
the following notation.
myvo:50 yourvo:10 anothervo:20 local:20
Policy URL: This is the URL for the document describing the usage policy /
agreement for this resource
Specify your OSG GROUP [OSG]:
Specify your OSG SITE NAME [UMATLAS]:
Specify your VO sponsors [usatlas:80 local:20]:
Specify your policy url [http://gate02.grid.umich.edu/policy]:
Information about your site admininistrator
-------------------------------------------
Contact name: The site administrator's full name.
Contact email: The site adminstrator's email address.
Specify a contact for your server (full name) [Shawn McKee]:
Specify the contact's email address [smckee@umich.edu]:
Information about your servers location
----------------------------------------
City: The city your server is located in or near.
Country: The country your server is located in.
Logitude/Latitude: For your city. This will determine your placement on any
world maps used for monitoring. You can find some approximate values
for your geographic location from:
http://geotags.com/
or you can search your location on Google
For USA: LAT is about 29 (South) ... 48 (North)
LONG is about -123 (West coast) ... -71 (East coast)
Specify your server's city [Ann Arbor, Michigan]:
Specify your server's country [US]:
Specify your server's longitude [42.277]:
Specify your server's latitude [-83.736]:
Information about the available storage on your server
------------------------------------------------------
GRID: Location where the OSG WN Client (wn-client.pacman) has
been installed.
APP: Typically used to store the applications which will run on
this gatekeeper. As a rule of thumb, the OSG APP should be on
- dedicated partition
- size: at least 10 GB.
DATA: Typically used to hold output from jobs while it is staged out to a
Storage Element.
- dedicated partition
- size: at least 2 GB times the maximum number of simultaneously
running jobs that your cluster's batch system can support.
WN_TMP: Used to hold input and output from jobs on a worker node where the
application is executing.
- local partition
- size: at least 2 GB
SITE_READ: Used to stage-in input for jobs using a Storage Element or for
persistent storage between jobs. It may be the mount point of a
dCache SE accessed read-only using dcap.
SITE_WRITE: Used to store to a Storage Element output from jobs or for
persistent storage between jobs. It may be the mount point of a
dCache SE accessed write-only using dcap.
Specify your OSG GRID path [/afs/atlas.umich.edu/OSG]:
Specify your OSG APP path [/afs/atlas.umich.edu/OSG/APP]:
Specify your OSG DATA path [/afs/atlas.umich.edu/OSG/DATA]:
Specify your OSG WN_TMP path [/tmp]:
Specify your OSG SITE_READ path [UNAVAILABLE]:
Specify your OSG SITE_WRITE path [UNAVAILABLE]:
Information about the Storage Element available from your server
----------------------------------------------------------------
A storage element exists for this node.
This is the Storage Element (SE) that is visible from all the nodes of this
server (CE). It may be a SE local or close to the CE that is preferred as
destination SE if the job does not have other preferences.
Is a storage element (SE) available [y] (y/n):
Specify your default SE [umfs02.grid.umich.edu]:
Information needed for the MonALISA monitoring.
-----------------------------------------------
MonALISA services are being used.
If you do not intend to run MonALISA for monitoring purposes, you can
skip this section.
Ganglia host: The host machine ganglia is running on.
Ganglia port: The host machine's port ganglia is using.
VO Modules: (y or n) If 'y', this will activate the VO Modules module
in the MonALISA configuration file.
Are you running the MonALISA monitoring services [y] (y/n):
Are you using Ganglia [n] (y/n):
Do you want to run the OSG VO Modules [y] (y/n):
Information about the batch queue manager used on your server
-------------------------------------------------------------
The supported batch managers are:
condor pbs fbs lsf sge
For condor: The CONDOR_CONFIG variable value is needed.
For sge: The SGE_ROOT variable value is needed
Specify your batch queue manager OSG_JOB_MANAGER [pbs]:
Specify installation directory for pbs [/usr/local]:
##### ##### ##### ##### ##### ##### ##### #####
Please review the information below:
***********************************************************************
################# Configuration for the OSG CE Node ###################
***********************************************************************
Preset information you are not prompted for
--------------------------------------------
OSG location: /afs/atlas.umich.edu/OSG
Globus location: /afs/atlas.umich.edu/OSG/globus
User-VO map file: /afs/atlas.umich.edu/OSG/monitoring/grid3-user-vo-map.txt
gridftp.log file: /afs/atlas.umich.edu/OSG/globus/var/gridftp.log
Information about your site in general
--------------------------------------
Group: OSG
Site name: UMATLAS
Sponsors: usatlas:80 local:20
Policy URL: http://gate02.grid.umich.edu/policy
Information about your site admininistrator
-------------------------------------------
Contact name: Shawn McKee
Contact email: smckee@umich.edu
Information about your servers location
----------------------------------------
City: Ann Arbor, Michigan
Country: US
Longitude: 42.277
Latitude: -83.736
Information about the available storage on your server
------------------------------------------------------
WN client: /afs/atlas.umich.edu/OSG
Directories:
Application: /afs/atlas.umich.edu/OSG/APP
Data: /afs/atlas.umich.edu/OSG/DATA
WN tmp: /tmp
Site read: UNAVAILABLE
Site write: UNAVAILABLE
Information about the Storage Element available from your server
----------------------------------------------------------------
A storage element exists for this node.
Storage Element: umfs02.grid.umich.edu
Information needed for the MonALISA monitoring.
-----------------------------------------------
MonALISA services are being used.
Ganglia host: UNAVAILABLE
Ganglia port: UNAVAILABLE
VO Modules: y
Information about the batch queue manager used on your server
-------------------------------------------------------------
Batch queue: pbs
Job queue: gate02.grid.umich.edu/jobmanager-pbs
Utility queue: gate02.grid.umich.edu/jobmanager
Condor location:
Condor config:
PBS location: /usr/local
FBS location:
SGE location:
SGE_ROOT:
LSF location:
##################################################
##################################################
Is the above information correct (y/n)?: y
##-----------------------------------------##
Updating /afs/atlas.umich.edu/OSG/monitoring/osg-attributes.conf file now.
... creating new /afs/atlas.umich.edu/OSG/monitoring/osg-attributes.conf
... previous file saved as /afs/atlas.umich.edu/OSG/monitoring/osg-attributes.conf.osgsave.1
DONE
##-----------------------------------------##
Checking for grid3-locations.txt file now.
... already exists
-rw-rw-rw- 1 bin root 383 Sep 18 21:04 /afs/atlas.umich.edu/OSG/APP/etc/grid3-locations.txt
... no need to copy it again
DONE
##-----------------------------------------##
Configuring MonALISA now.
... MonALISA service are being used.
... executing configure_monalisa script as
/afs/atlas.umich.edu/OSG/vdt/setup/configure_monalisa --server y --ganglia-used n --vdt-install /afs/atlas.umich.edu/OSG --user daemon --farm "UMATLAS" --monitor-group "OSG" --contact-name "Shawn McKee" --contact-email "smckee@umich.edu" --city "Ann Arbor, Michigan" --country "US" --latitude "-83.736" --longitude "42.277" --vo-modules "y" --globus-location "/afs/atlas.umich.edu/OSG/globus" --condor-location "" --condor-config "" --pbs-location "/usr/local" --lsf-location "" --fbs-location "/usr/local" --sge-location "" --auto-update n
Command failed:
/afs/atlas.umich.edu/OSG/vdt/sbin/vdt-install-service --service MLD --rc /afs/atlas.umich.edu/OSG/post-install/MLD --start --log /afs/atlas.umich.edu/OSG/vdt-install.log
Exited with value 1
ERROR: configure_monalisa failed.
The problem is that MonALISA was already running (as 'monalisa', not 'daemon'). I edited ml_env and fixed this. I also needed to 'chown -R monalisa.osg /afs/atlas.umich.edu/OSG/MonaLisa'.
This mapping into AFS also didn't work. We needed to change OSG_APP, OSG_DATA and OSG_TMP to use the NFS-mounted area. We used the /data08 RAID6 array (11 TB) on umfs02.grid.umich.edu. Since the north campus machines using Torque mount this disk as /atlas/data08, we need to specify the following:
- OSG_DATA as /atlas/data08/OSG/DATA
- OSG_APP as /atlas/data08/OSG/APP
- OSG_TMP as /atlas/data08/OSG/DATA
Testing gate02
The first problem I ran into was an authentication failure because of a proxy problem. The issue is that gssklog was not working correctly so my 'smckee' account could not put its grid proxy in the AFS file system.
The problem seemed to be an expired certificate:
[linat04] /afs/atlas.umich.edu/home/smckee > gssklog -port 751 -server linat04.grid.umich.edu
GSS-error init_sec_context failed: major:000a0000 minor:00000006
GSS Major Status: Authentication Failed
GSS Minor Status Error Chain:
init_sec_context.c:169: gss_init_sec_context: SSLv3 handshake problems
globus_i_gsi_gss_utils.c:886: globus_i_gsi_gss_handshake: Unable to verify remote side's credentials
globus_i_gsi_gss_utils.c:851: globus_i_gsi_gss_handshake: SSLv3 handshake problems: Couldn't do ssl handshake
OpenSSL Error: s3_pkt.c:1046: in library: SSL routines, function SSL3_READ_BYTES: sslv3 alert certificate expired
Failed code = 2
The real issue was that the /etc/grid-security/certificates directories on linat02, linat03 and linat04 were out of date, and the actual thing expiring was the CRL! I proceeded to move the certificates directory to certificates.orig and soft-link /afs/atlas.umich.edu/OSG/globus/TRUSTED_CA to /etc/grid-security/certificates on each of linat02-04.
A separate problem was that iptables on linat02-04 was not set up to allow port 751. I added the following to /etc/sysconfig/iptables on linat02-04:
# +SPM September 19, 2006 to allow gssklogd to work
-A RH-Firewall-1-INPUT -p tcp --dport 751 -j ACCEPT
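The new rule only takes effect once the firewall is reloaded; a minimal sketch, assuming the stock Red Hat iptables init script is in use on linat02-04:
service iptables restart    # re-reads /etc/sysconfig/iptables, including the new port 751 rule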
I then needed to restart the gssklogd service on linat02-04:
bos stop linat02 gssklogd -localauth
bos start linat02 gssklogd -localauth
I was then able to get tokens via gssklog.
I still had a problem getting a submitted job to work. The problem was that the gssklog-client was not installed on gate02.grid.umich.edu. I got the RPM file from linat02 and installed it. I was then able to submit jobs as 'smckee' using globus-job-run.
I copied the convert_mapfile.sh script from gate01, which transforms /etc/grid-security/grid-mapfile into an /etc/grid-security/globus-kmapfile for use with gssklog, as well as copying the grid-mapfile to linat02-04 as /etc/grid-security/afsgrid-mapfile. This should likely be put into cron to make sure it regularly creates the globus-kmapfile and updates linat02-04; a sketch of such a cron entry follows.
I will wait to see how the prima/voms/gums config works to determine how best to do this.
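If the cron route is taken, a minimal sketch of a root crontab entry on gate02 might look like the following; the script location and hourly schedule are assumptions, and this assumes convert_mapfile.sh itself performs the copy to linat02-04 (otherwise an scp step would be added):
# min hour dom mon dow  command
0 * * * *  /root/bin/convert_mapfile.sh >> /var/log/convert_mapfile.log 2>&1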
Monitoring Setup
The Monitoring and Information Services Core Infrastructure (MIS-CI) provides
information on the site environment and computing resources. The OSG-CE package includes MIS-CI.
This section describes how to configure MIS-CI if you wish to enable it.
The $VDT_LOCATION/MIS-CI/configure-misci.sh script performs the configuration. It creates or adds a crontab entry for the MIS-CI information collectors. The Unix account for the MIS-CI scripts should be mis. By default, the script assumes the GridCat DN is mapped to the ivdgl user; you will need to use the --choose_user option to change this to mis.
$ cd $VDT_LOCATION
$ source ./setup.sh
$ $VDT_LOCATION/MIS-CI/configure-misci.sh --choose_user
Editing site configuration...
Creating MIS-CI.db
:
( a lot of information on the tables it is creating will appear before any questions are asked)
:
Would you like to set up MIS-CI cron now? (y/n) y
At what frequency (in minutes) would you like to run MIS-CI ? [10] 10
Under which account the cron should run ? [ivdgl] mis
Frequency 10
User mis
Would you like to create MIS-CI crontab ? (y/n) y
Updating crontab
Configuring MIS jobmanager
/storage/local/data1/osg/MIS-CI/share/misci/globus/jobmanager-mis is created
Your site configuration :
sitename ITB_INSTALL_TEST
dollarapp /storage/local/data1/osg/OSG.DIRS/app
dollardat /storage/local/data1/osg/OSG.DIRS/data
dollartmp /storage/local/data1/osg/OSG.DIRS/data
dollarwnt /storage/local/data1/osg/OSG.DIRS/wn_tmp
dollargrd /storage/local/data1/osg
batcheS condor
vouserS uscms01 ivdgl sdss usatlas1 cdf grase fmri gadu
End of your site configuration
If you would like to add more vo users,
you should edit /storage/local/data1/osg/MIS-CI/etc/misci/mis-ci-site-info.cfg.
You have additional batch managers : condor .
If you would like to add these,
you should edit /storage/local/data1/osg/MIS-CI/etc/misci/mis-ci-site-info.cfg.
configure--misci Done
Please read /storage/local/data1/osg/MIS-CI/README
I ran the above script on gate02 and got:
[gate02:OSG]# MIS-CI/configure-misci.sh --choose_user
Editing site configuration...
Doing configuration again from the original template file
Creating MIS-CI.db
seq name file
--- --------------- ----------------------------------------------------------
0 main /afs/atlas.umich.edu/OSG/MIS-CI/share/sqlite/MIS-CI.db
MIS-CI.db created
INFO: table GlueCEAccessControlBaseRule exists
CREATE TABLE GlueCEAccessControlBaseRule (
GlueCEUniqueID VARCHAR(128),
Value VARCHAR(128)
);
INFO: table GlueCluster exists
CREATE TABLE GlueCluster (
UniqueID VARCHAR(100),
Name VARCHAR(255),
InformationServiceURL VARCHAR(128)
);
INFO: table GlueHostRemoteFileSystem exists
CREATE TABLE GlueHostRemoteFileSystem (
GlueSubClusterUniqueID VARCHAR(245),
Name VARCHAR(245),
Root VARCHAR(255),
Size INT,
AvailableSpace INT,
ReadOnly VARCHAR(5),
Type VARCHAR(128)
);
INFO: table GlueSubClusterSoftwareRunTimeEnvironment exists
CREATE TABLE GlueSubClusterSoftwareRunTimeEnvironment (
Value VARCHAR(255),
GlueSubClusterUniqueID VARCHAR(100)
);
INFO: table GlueSubCluster exists
CREATE TABLE GlueSubCluster (
UniqueID VARCHAR(100),
Name VARCHAR(255),
GlueClusterUniqueID VARCHAR(255),
RAMSize INT,
RAMAvailable INT,
VirtualSize INT,
VirtualAvailable INT,
PlatformType VARCHAR(128),
SMPSize INT,
OSName VARCHAR(255),
OSRelease VARCHAR(255),
OSVersion VARCHAR(255),
Vendor VARCHAR(255),
Model VARCHAR(255),
Version VARCHAR(255),
ClockSpeed INT,
InstructionSet VARCHAR(255),
OtherProcessorDescription VARCHAR(255),
CacheL1 INT,
CacheL1I INT,
CacheL1D INT,
CacheL2 INT,
BenchmarkSF00 INT,
BenchmarkSI00 INT,
InboundIP VARCHAR(1),
OutboundIP VARCHAR(1),
InformationServiceURL VARCHAR(128)
);
INFO: table diskinfo exists
CREATE TABLE diskinfo (
id int unsigned NOT NULL,
ymdt varchar(128) NOT NULL default '',
sitename varchar(128) NOT NULL default '',
hostname varchar(128) NOT NULL default '',
appavail varchar(20) NOT NULL default '',
appused varchar(20) NOT NULL default '',
appmount varchar(255) NOT NULL default '',
dataavail varchar(20) NOT NULL default '',
dataused varchar(20) NOT NULL default '',
datamount varchar(255) NOT NULL default '',
wntmpavail varchar(20) NOT NULL default '',
wntmpused varchar(20) NOT NULL default '',
wntmpmount varchar(255) NOT NULL default '',
tmpavail varchar(20) NOT NULL default '',
tmpused varchar(20) NOT NULL default '',
tmpmount varchar(255) NOT NULL default '',
griddavail varchar(20) NOT NULL default '',
griddused varchar(20) NOT NULL default '',
griddmount varchar(255) NOT NULL default '',
PRIMARY KEY (id)
);
INFO: table diskinfo_user exists
CREATE TABLE diskinfo_user (
username varchar(128) NOT NULL default '',
dirname varchar(128) NOT NULL default '',
diskused int unsigned NOT NULL,
timestamp int unsigned NOT NULL
);
INFO: table GlueCE exists
CREATE TABLE GlueCE (
UniqueID VARCHAR(128),
Name VARCHAR(255),
GlueClusterUniqueID VARCHAR(100),
TotalCPUs INT,
LRMSType VARCHAR(255),
LRMSVersion VARCHAR(255),
GRAMVersion VARCHAR(255),
HostName VARCHAR(128),
GatekeeperPort VARCHAR(128),
RunningJobs INT,
WaitingJobs INT,
TotalJobs INT,
Status VARCHAR(255),
WorstResponseTime INT,
EstimatedResponseTime INT,
FreeCpus INT,
Priority INT,
MaxRunningJobs INT,
MaxTotalJobs INT,
MaxCPUTime INT,
MaxWallClockTime INT,
InformationServiceURL VARCHAR(128)
);
INFO: table site_cluster exists
CREATE TABLE site_cluster (
id int unsigned NOT NULL,
ymdt varchar(128) NOT NULL default '',
sitename varchar(128) NOT NULL default '',
hostname varchar(128) NOT NULL default '',
batchname varchar(10) NOT NULL default '',
qname varchar(32) NOT NULL default '',
acputime varchar(128) NOT NULL default '',
awalltime varchar(128) NOT NULL default '',
njobs varchar(128) NOT NULL default '',
jobid varchar(28) NOT NULL default '',
username varchar(32) NOT NULL default '',
jobstat varchar(10) NOT NULL default '',
execmd varchar(128) NOT NULL default '',
subtime varchar(64) NOT NULL default '',
runtime varchar(64) NOT NULL default '',
jobsize varchar(64) NOT NULL default '',
PRIMARY KEY (id)
);
INFO: table siteinfo exists
CREATE TABLE siteinfo (
id int unsigned NOT NULL,
ymdt varchar(128) NOT NULL default '',
sitename varchar(128) NOT NULL default '',
hostname varchar(128) NOT NULL default '',
VOname varchar(12) NOT NULL default '',
appdir varchar(255) NOT NULL default '',
datadir varchar(255) NOT NULL default '',
tmpdir varchar(255) NOT NULL default '',
wntmpdir varchar(255) NOT NULL default '',
grid3dir varchar(255) NOT NULL default '',
jobcon varchar(255) NOT NULL default '',
utilcon varchar(255) NOT NULL default '',
locpname1 varchar(255) NOT NULL default '',
locpname2 varchar(255) NOT NULL default '',
ncpurunning varchar(255) default NULL,
ncpus varchar(255) default NULL,
PRIMARY KEY (id)
);
Cronizing...
Would you like to set up MIS-CI cron now ? (y/n) y
At what frequency (in minutes) would you like to run MIS-CI ? [10] 10
Under which account the cron should run ? [ivdgl] mis
Frequency 10
User mis
Would you like to create MIS-CI crontab ? (y/n) y
Updating crontab
Configuring MIS jobmanager
/afs/atlas.umich.edu/OSG/MIS-CI/share/misci/globus/jobmanager-mis is created
Your site configuration :
sitename UMATLAS
dollarapp /afs/atlas.umich.edu/OSG/APP
dollardat /afs/atlas.umich.edu/OSG/DATA
dollartmp /afs/atlas.umich.edu/OSG/DATA
dollarwnt /tmp
dollargrd /afs/atlas.umich.edu/OSG
batcheS pbs
vouserS
End of your site configuration
If you would like to add more vo users,
you should edit /afs/atlas.umich.edu/OSG/MIS-CI/etc/misci/mis-ci-site-info.cfg.
You have additional batch managers : pbs .
If you would like to add these,
you should edit /afs/atlas.umich.edu/OSG/MIS-CI/etc/misci/mis-ci-site-info.cfg.
configure--misci Done
Please read /afs/atlas.umich.edu/OSG/MIS-CI/README
MIS-CI collects information via crontab as the user mis (or ivdgl if you left it as the default). Therefore, in order to stop the MIS-CI processes, the crontab should be removed. The script $VDT_LOCATION/MIS-CI/uninstall-misci.sh is provided for this purpose:
> cd $VDT_LOCATION
> source setup.(c)sh
> cd MIS-CI
> ./uninstall-misci.sh
After finishing configuring the MIS-CI, a few checks might be necessary:
- Verify the crontab was created for the mis user.
> crontab -u mis -l
- If you want to force an MIS-CI table update (due to fresh install or update), then as the MIS-CI user (mis), execute:
> $VDT_LOCATION/MIS-CI/sbin/run-mis-ci.sh
- As a non-root user, verify that at least one table is filled.
If you chose not to force an update, it might take 10 minutes or so before the tables are filled with current information.
> source $VDT_LOCATION/setup.(c)sh
> grid-proxy-init
(enter your password)
> globus-job-run <hostname>/jobmanager-mis /bin/sh siteinfo
(Here <hostname> is the CE hostname.)
...... sample output ....
id 1
ymdt Wed Jan 11 19:00:01 UTC 2006
sitename ITB_INSTALL_TEST
hostname localhost
VOname local:100
appdir /storage/local/data1/osg/OSG.DIRS/app
datadir /storage/local/data1/osg/OSG.DIRS/data
tmpdir /storage/local/data1/osg/OSG.DIRS/data
wntmpdir /storage/local/data1/osg/OSG.DIRS/wn_tmp
grid3dir /storage/local/data1/osg
jobcon condor
utilcon fork
locpname1
locpname2
ncpurunning 0
ncpus 4
This check was run as 'smckee' on gate02 and returned:
[gate02] /afs/atlas.umich.edu/home/smckee > grid-proxy-init
Your identity: /DC=org/DC=doegrids/OU=People/CN=Shawn McKee 83467
Enter GRID pass phrase for this identity:
Creating proxy ...................................................... Done
Your proxy is valid until: Wed Sep 20 06:26:32 2006
[gate02] /afs/atlas.umich.edu/home/smckee > globus-job-run gate02/jobmanager-mis
/bin/sh siteinfo
id 1
ymdt Tue Sep 19 22:25:12 UTC 2006
sitename UMATLAS
hostname localhost
VOname usatlas:80 local:20
appdir /afs/atlas.umich.edu/OSG/APP
datadir /afs/atlas.umich.edu/OSG/DATA
tmpdir /afs/atlas.umich.edu/OSG/DATA
wntmpdir /tmp
grid3dir /afs/atlas.umich.edu/OSG
jobcon pbs
utilcon fork
locpname1
locpname2
ncpurunning 145
ncpus 771
The Globus information system is called MDS and is pre-configured to read the osg-attributes.conf information file. The configuration script ($VDT_LOCATION/vdt/setup/configure_mds) is executed automatically during the initial download with default values. It also installs the gris daemon as an /etc/rc.d service.
The gris daemon should have been started as part of the initial installation. To verify:
> ps -efwww |grep ldap
daemon 7584 1 0 15:25 ? 00:00:00 /bin/sh /storage/local/data1/osg/globus/sbin/grid-info-soft-register
-log /storage/local/data1/osg/globus/var/grid-info-system.log
-f /storage/local/data1/osg/globus/etc/grid-info-resource-register.conf
-- /storage/local/data1/osg/globus/libexec/grid-info-slapd
-h ldap://0.0.0.0:2135 -d 0
-f /storage/local/data1/osg/globus/etc/grid-info-slapd.conf
daemon 7627 7584 1 15:25 ? 00:00:00 /storage/local/data1/osg/globus/libexec/slapd
-h ldap://0.0.0.0:2135 -d 0 -f /storage/local/data1/osg/globus/etc/grid-info-slapd.conf
daemon 7639 1 0 15:25 ? 00:00:00 /bin/sh /storage/local/data1/osg/globus/sbin/grid-info-soft-register
-log /storage/local/data1/osg/globus/var/grid-info-system.log -register -t mdsreg2
-h cmssrv09.fnal.gov -p 2135 -period 600
-dn Mds-Vo-Op-name=register, Mds-Vo-name=ITB_INSTALL_TEST, o=grid -daemon -t ldap
-h cmssrv09.fnal.gov -p 2135 -ttl 1200 -r Mds-Vo-name=local, o=grid -T 20 -b ANONYM-ONLY
-z 0 -m cachedump -period 30
If it is not running, you will need to restart it:
Usage:
> /etc/init.d/gris [start | stop ]
MDS should be configured for anonymous bind. You can send a test query to your local host, which performs no authentication on the user submitting the request. First, verify you have no proxy certificate (/tmp/x509up_u<your UID>); if one exists, remove it. Then:
> source $VDT_LOCATION/setup.sh
> grid-info-search -anonymous
... your screen should scroll for a while showing a lot of data...
....you can redirect the output to validate
Activate Your Site
The GridCat system maintains status and other information on all OSG sites, which can be viewed here.
Sites added to GridCat are presumed inactive unless the site state bit is set to 1 (see below).
- Inactive sites will show a grey site status dot.
- Once the site becomes active, the site status dot will become either green or red, depending on the GridCat test results.
Since the default site state is presumed to be inactive, the CE site administrator has to proactively switch the site state to active. The activation is done by modifying the file $VDT_LOCATION/MIS-CI/etc/grid-site-state-info and setting the variable below to 1:
export grid_site_state_bit=1
NOTE: It might take up to 2 hours for changes to registered sites to take effect in the GridCat display. If your site is not registered with the OSG-GOC, see the instructions in the OSG Registration section of this document. Until your site is registered, it will not appear in GridCat.
If your site needs to become inactive for various reasons, e.g. site maintenance, the site administrator should set the value of grid_site_state_bit to something other than 1.
An example grid-site-state-info file is sketched below.
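A minimal sketch of what grid-site-state-info might contain for an active site (assumed content; the file shipped with MIS-CI may define additional variables, so edit the installed copy rather than replacing it):
# grid-site-state-info (sketch): 1 = active, anything else = inactive
export grid_site_state_bit=1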
Optional Components
To configure the optional MonALISA component, see the MonALISA document in this guide.
To configure the optional Generic Information Providers component, see the Generic Information Providers document in this guide.
Site Verification
site-verify
Now you're ready to run the full CE site verification test suite. All elements of this test should now pass.
Note: To run the site verify script you should not be root.
> cd $VDT_LOCATION
> source ./setup.sh
> grid-proxy-init
....enter your passphrase
> cd verify
> ./site_verify.pl
The output lists the various tests that are performed, with any problems indicated by FAILED, UNTESTED, NOT WORKING, NONE, or NO conditions.
Results of site_verify.pl for gate02
I ran site_verify.pl as user 'smckee' and got:
[gate02] /afs/atlas.umich.edu/OSG/verify > ./site_verify.pl
===============================================================================
Info: Site verification initiated at Tue Sep 19 23:17:12 2006 GMT.
===============================================================================
-------------------------------------------------------------------------------
--------- Begin gate02.grid.umich.edu at Tue Sep 19 23:17:12 2006 GMT ---------
-------------------------------------------------------------------------------
Checking prerequisites needed for testing: PASS
Checking for a valid proxy for smckee@gate02.grid.umich.edu: PASS
Checking if remote host is reachable: PASS
Checking for a running gatekeeper: YES; port 2119
Checking authentication: PASS
Checking 'Hello, World' application: PASS
Checking remote host uptime: PASS
19:17:18 up 4 days, 9:42, 3 users, load average: 0.08, 0.12, 0.04
Checking remote Internet network services list: PASS
Checking remote Internet servers database configuration: PASS
Checking for GLOBUS_LOCATION: /afs/atlas.umich.edu/OSG/globus
Checking expiration date of remote host certificate: Sep 8 19:53:14 2007 GMT
Checking for gatekeeper configuration file: YES
/afs/atlas.umich.edu/OSG/globus/etc/globus-gatekeeper.conf
Checking for a running gsiftp server: YES; port 2811
Checking gsiftp (local client, local host -> remote host): PASS
Checking gsiftp (local client, remote host -> local host): PASS
Checking that no differences exist between gsiftp'd files: PASS
Checking users in grid-mapfile, if none must be using Prima: smckee
Checking for remote globus-sh-tools-vars.sh: YES
Checking configured grid services: PASS
jobmanager,jobmanager-fork,jobmanager-mis,jobmanager-pbs
Checking scheduler types associated with remote jobmanagers: PASS
jobmanager is of type fork
jobmanager-fork is of type fork
jobmanager-mis is of type mis
jobmanager-pbs is of type pbs
Checking for paths to binaries of remote schedulers: PASS
Path to mis binaries is /afs/atlas.umich.edu/OSG/MIS-CI/bin
Path to pbs binaries is /usr/local/bin
Checking remote scheduler status: PASS
pbs : 144 jobs running, 89 jobs idle/pending
Checking for a running MDS service: YES; port 2135
Checking for configured Generic Information Provider service: YES; GLUE attributes present. Detailed validation information at http://grow.its.uiowa.edu/osg-gip/
Checking if Globus is deployed from the VDT: YES; version 1.3.10b
Checking for OSG osg-attributes.conf: YES
Checking for OSG grid3-user-vo-map.txt: YES
ops users: ops
ivdgl users: ivdgl
i2u2 users: i2u2
geant4 users: geant4
grow users: grow
osgedu users: osgedu
nanohub users: nanohub
gridex users: gridex
fmri users: fmri
cdf users: cdf
nwicg users: nwicg
osg users: osg
usatlas users: usatlas1
mariachi users: mariachi
star users: star
dosar users: dosar
uscms users: uscms01
grase users: grase
ligo users: ligo
glow users: glow
fermilab users: fermilab
dzero users: sam,samgrid
mis users: mis
des users: des
sdss users: sdss
gadu users: gadu
Checking for OSG site name: UMATLAS
Checking for OSG $GRID3 definition: /afs/atlas.umich.edu/OSG
Checking for OSG $APP definition: /afs/atlas.umich.edu/OSG/APP
Checking for OSG $DATA definition: /afs/atlas.umich.edu/OSG/DATA
Checking for OSG $TMP definition: /afs/atlas.umich.edu/OSG/DATA
Checking for OSG $WNTMP definition: /tmp
Checking for OSG $APP existence: PASS
Checking for OSG $DATA existence: PASS
Checking for OSG $TMP existence: PASS
Checking for OSG $APP writability: PASS
Checking for OSG $DATA writability: PASS
Checking for OSG $TMP writability: PASS
Checking for OSG $APP available space: GRAM Job submission failed because the gatekeeper contact cannot be parsed (error code 96)
FAIL
Checking for OSG $DATA available space: GRAM Job submission failed because the gatekeeper contact cannot be parsed (error code 96)
FAIL
Checking for OSG $TMP available space: GRAM Job submission failed because the gatekeeper contact cannot be parsed (error code 96)
FAIL
Checking for OSG additional site-specific variable definitions: YES
Checking for OSG execution jobmanager(s): gate02.grid.umich.edu/jobmanager-pbs
Checking for OSG utility jobmanager(s): gate02.grid.umich.edu/jobmanager
Checking for OSG sponsoring VO: usatlas:80 local:20
Checking for OSG policy expression: NONE
Checking for OSG setup.sh: YES
Checking for OSG $Monalisa_HOME definition: /afs/atlas.umich.edu/OSG/MonaLisa
Checking for MonALISA configuration: PASS
key ml_env vars:
FARM_NAME = UMATLAS
FARM_HOME = /afs/atlas.umich.edu/OSG/MonaLisa/Service/VDTFarm
FARM_CONF_FILE = /afs/atlas.umich.edu/OSG/MonaLisa/Service/VDTFarm/vdtFarm.conf
SHOULD_UPDATE = true
URL_LIST_UPDATE = http://monalisa.cacr.caltech.edu/FARM_ML,http://monalisa.cern.ch/MONALISA/FARM_ML
key ml_properties vars:
lia.Monitor.group = OSG
lia.Monitor.useIPaddress = undef
MonaLisa.ContactEmail = smckee@umich.edu
Checking for a running MonALISA: PASS
MonALISA is ALIVE (pid 31915)
MonALISA_Version = 1.6.2-200608161738
MonALISA_VDate = 2006-08-16
VoModulesDir = VoModules-v0.32
tcpServer_Port = 9002
storeType = epgsqldb
Checking for a running GANGLIA gmond daemon: NO
gmond does not appear to be running
Checking for a running GANGLIA gmetad daemon: NO
gmetad does not appear to be running
-------------------------------------------------------------------------------
---------- End gate02.grid.umich.edu at Tue Sep 19 23:20:28 2006 GMT ----------
-------------------------------------------------------------------------------
===============================================================================
Info: Site verification completed at Tue Sep 19 23:20:28 2006 GMT.
After this I found that globus-url-copy doesn't do the gssklog call-out, and therefore gridftp/gsiftp jobs don't have the appropriate authentication for accessing the OSG_APP area. I then moved OSG_APP to /data08/OSG/APP (just like OSG_DATA). I set
chmod a+w /data08/OSG/APP
chmod +t /data08/OSG/APP
to allow all VO accounts to create APP areas, and changed osg-attributes.conf correspondingly. I found I needed to edit $VDT_LOCATION/MIS-CI/etc/misci/mis-ci-site-info.cfg as well.
Problem with OSG/AFS setup
A number of daemons need to be able to write into OSG-installed areas for tmp files, log files and such. I found I had to move a number of directories onto a local disk system on gate02 so that things would work. When I identified a directory tree which had to be writable, I did the following:
- cp -arv dir /opt/dirtree/dir
- mv dir dir.orig
- ln -s /opt/dirtree/dir ./dir
This softlink in AFS allows the appropriate service to run. The list of affected services is:
- The $VDT_INSTALL/vdt-app-data directory had to be moved (Globus, GUMS, MonALISA and MySQL affected)
- Globus (need $VDT_INSTALL/globus/var relocated)
- MonaLisa (need $VDT_INSTALL/MonaLisa/Service/VDTFarm relocated)
- Tomcat (need $VDT_INSTALL/tomcat/v5/logs relocated)
- MIS-CI (need $VDT_INSTALL/MIS-CI/share and $VDT_INSTALL/MIS-CI/tmp relocated)
All these were relocated to /opt on gate02.grid.umich.edu.
Remote pilot submission issue
As of roughly July 2007, pilots no longer properly submit to gate01.aglt2.org, and we began to use local pilots. In pondering the setup of analysis queues, we can either run a second set of remote pilots or a second set of local pilots. If remote, then ALL remote pilot submission must work. We therefore revisited this issue.
While researching the problem, I ran across this Twiki entry posted in July by Anand Padmanabhan indicating that the XML::Parser module is not available. I found the module at the CPAN.org site. Unfortunately, installing it on gate01 did not resolve the problem for our Condor installation.
-- BobBall - 20 Aug 2007
Authorizing Users: Operational Configuration
The earlier test case only authorized yourself as a local user. You should now go to the Osg CE Authorization document and authorize other users before performing the OSG Registration of your service (otherwise no one but you will be able to access it!).
OSG Registration
To register the site with the OSG Grid Operations Center (GOC) and into the Grid Catalog, please use the web form located under the OSG Administrator Support page. If you are registering into the OSG, be sure to check the appropriate box for which grid catalog you are registering with. You should receive an automatic email response from the GOC to the operations contact you supplied. If this response doesn't arrive within a reasonable delay, please resubmit your registration.
A minimal amount of information is needed for the OSG Grid Operations Center (GOC) to publish a site to the monitoring and operational infrastructure. This includes organization name, organization manager or designated representative name and email, security contact name and email, resource URL, support center, and form submitter.
While this minimal information will allow the GOC to publish your information to the monitoring tools, more information is requested to make site and support center communication easier. Please take time to fill out the form completely.
Troubleshooting Guide
As you install, monitor the $VDT_LOCATION/vdt-install.log.
- If pacman tries to retrieve something from a website that's having problems, you'll get an error message that's unrelated to the real problem because pacman can't recognize 404 errors when downloading tarballs. For example, when the PRIMA download site was down, it told us the file wasn't in the correct format:
vdt-untar is untarring prima-0.3.x86_rh_9.tar.gz
gzip: stdin: not in gzip format
Shutdown Guide
Please see the OSG Shutdown Guide.
Major updates:
-- RobQ - 01 May 2006
-- ShawnMcKee - 19 Sep 2006