Overview
| DISK | COUNT | SIZE (TB) | SPEED (RPM) | GROSS TOTAL (TB) | USABLE TOTAL (TB) |
| XRD-4 INTERNAL | 12 | 2 | 7.2k | 24 | 20 |
| XRD-4 TRAY A | 12 | 3 | 7.2k | 36 | 30 |
| XRD-4 TRAY B | 12 | 3 | 7.2k | 36 | 30 |
| XRD3 INTERNAL | 12 | 2 | 7.2k | 24 | 20 |
| XRD2 INTERNAL | 12 | 2 | 7.2k | 24 | 20 |
| XRD1 INTERNAL | 12 | 2 | 7.2k | 24 | 20 |
| MSU4.MSULOCAL | 15 | 0.75 | 7.2k | 11.25 | 9 |
| GYTHEIO | 8 | 4 | 7.2k | 32 | 25.6 |
| AGOGE | 6 | 4 | 7.2k | 24 | 19.2 |
| | 2 | 2 | 7.2k | 4 | 2 |
| CYNISCA | 6 | 4 | 7.2k | 24 | 19.2 |
| | 2 | 2 | 7.2k | 4 | 2 |
| MSUT3-DS FAST A | 12 | 0.6 | 15k | 7.2 | 6 |
| MSUT3-DS FAST B | 12 | 0.6 | 15k | 7.2 | 6 |
| MSUT3-DS FAST C | 12 | 0.6 | 15k | 7.2 | 6 |
| MSUT3-DS FAST D | 12 | 0.6 | 15k | 7.2 | 6 |
| TOTAL | | | | 288.85 | 241 |
Home Area
Users' home areas are on a relatively small disk, and space usage is protected by quotas. Quotas are set with a soft limit of 10 GB and a hard limit of 20 GB.
Note that this area is not backed up (no T3 user area is currently backed up). Keep copies of important files (documents, source code, etc.) on other systems.
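To see your current usage against these limits, the standard Linux quota tool should work from a login node (this is a generic command, not something specific to this page, and assumes quotas are reported for your user):
quota -s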
NFS "work" Areas
These Network File System areas are available from all nodes:
| path | size | comment |
| /msu/data/t3work1 | 12 TB | |
| /msu/data/t3work2 | 12 TB | |
| /msu/data/t3work3 | 19 TB | |
| /msu/data/t3work4 | 19 TB | |
| /msu/data/t3work5 | 19 TB | |
| /msu/data/t3work6 | 19 TB | |
| /msu/data/t3work7 | 28 TB | |
| /msu/data/t3work8 | 28 TB | |
| /msu/data/t3fast1 | 5.5 TB | |
| /msu/data/t3fast2 | 5.5 TB | |
| /msu/data/t3fast3 | 5.5 TB | |
| /msu/data/t3fast4 | 5.5 TB | |
coming...
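To check how much space remains on one of these areas before writing large outputs, plain df works from any node (the path below is just one entry from the table above):
df -h /msu/data/t3work1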
Condor Usage
The cluster can execute about 500 batch jobs at once. Jobs that process data files stored on the NFS storage areas can easily overwhelm the ability of the storage systems to service data access requests. This makes the jobs run inefficiently, lengthening the time for your jobs to complete, and can also impact other users' use of the cluster. The number of jobs that a disk system can efficiently service varies greatly depending on the activity of the jobs.
Using the Condor batch system, we can limit the total number of running jobs that need a given storage area. To do this, add the option "concurrency_limits" to your job description file (submit file). Each of the NFS storage areas has a total limit of 10000 set; in your job, specify a limit of M = 10000/N, where N is the total number of jobs that can run efficiently at once. A good starting point is N = 50, i.e. M = 200. If the job's storage requirements are lower, you can reduce M (Condor will then allow more of the jobs to run at once).
For example, if the job reads from the /msu/data/t3work1 area, add this to the submit script:
concurrency_limits = DISK_T3WORK1:200
In the above example, T3WORK1 can be replaced by T3WORK2, T3FAST1, T3FAST2, T3FAST3, or T3FAST4. To increase the number of jobs run at once, reduce 200 to a smaller integer. Note that this setup is voluntary; if you don't use this job option, Condor can't help control the disk utilization.
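For reference, here is a minimal submit-file sketch built around this option. Only the concurrency_limits line and the DISK_* limit names come from this page; the executable, arguments, and output file names are placeholders you would replace with your own:
universe = vanilla
# Placeholder executable and arguments; substitute your own script and inputs
executable = my_analysis.sh
arguments = input_$(Process).root
output = job_$(Process).out
error = job_$(Process).err
log = job.log
# Claim 200 of the 10000 DISK_T3WORK1 units, so at most 10000/200 = 50 of these jobs run at once
concurrency_limits = DISK_T3WORK1:200
queue 100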
Check Jobs Waiting for Resource
To see if a job is waiting to run because of a concurrency limit, the command
condor_q -better-analyze [jobid]
will report this.
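For example, for a (hypothetical) job id 12345.0:
condor_q -better-analyze 12345.0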
Debugging I/O
Run `top` and check the I/O wait percentage (the "wa" field in the CPU line). If it is more than a couple of percent, then a disk somewhere is being over-utilized. Check
http://msurxx.msulocal/ganglia/?c=MSU%20Server&m=load_one&r=week&s=descending&hc=4&mc=2 and see whether any of the boxes are red. Click on a box and look at the third plot on the right, which shows CPU utilization; orange means the CPU is waiting on disk I/O. If there is a visible amount of orange, that disk is being read and/or written too heavily. Identify which disk area is hosted on that server (see the table below), check which Condor jobs are dominating the queue, and ask the user(s) whether they are using the affected disk. If they are, they should resubmit with appropriate concurrency limits. A quick command-line check of I/O wait is sketched after the table.
| path | server |
| /msu/data/t3work1 | msu3 |
| /msu/data/t3work2 | msu3 |
| /msu/data/t3work3 | msut3-xrd-1 |
| /msu/data/t3work4 | msut3-xrd-2 |
| /msu/data/t3work5 | msut3-xrd-3 |
| /msu/data/t3work6 | msut3-xrd-4 |
| /msu/data/t3work7 | msut3-xrd-3 |
| /msu/data/t3work8 | msut3-xrd-4 |
| /msu/data/t3fast1 | msut3-d2 |
| /msu/data/t3fast2 | msut3-d2 |
| /msu/data/t3fast3 | msut3-d2 |
| /msu/data/t3fast4 | msut3-d2 |
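As mentioned above, a quick non-interactive way to check I/O wait uses standard Linux tools (nothing here is site-specific):
# one-shot top in batch mode; the "wa" value in the Cpu(s) line is the I/O wait percentage
top -bn1 | grep "Cpu(s)"
# or sample five times, two seconds apart; the "wa" column is the percent of CPU time spent waiting on I/O
vmstat 2 5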
On previously queued jobs
Add requirement to queued jobs?
--
TomRockwell - 31 May 2011