Recovering from a Lost Pool
When we lose a pool we need to do a number of things to recover. Once we determine the pool is really lost, we need to find the list of files that were stored there. This can be done using the chimera database on head02.aglt2.org.
Log in as root to head02.aglt2.org; you can then use the following commands to dump file lists by PNFSID or path. Assuming we lost pool msufs03_2, we can find the list of PNFSIDs with this PostgreSQL command:
psql --no-align --tuples-only --username pnfsserver --dbname chimera --command "SELECT ipnfsid FROM t_locationinfo WHERE ilocation='msufs03_2';" --output msufs03_2_pnfsid_list.txt
If you instead want filenames, you can use the inode2path function:
psql --no-align --tuples-only --username pnfsserver --dbname chimera --command "SELECT inode2path(ipnfsid) FROM t_locationinfo WHERE ilocation='msufs03_2';" --output msufs03_2_file_list.txt
We now have lists of potentially lost files that we need to deal with.
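Before going further it is worth a quick sanity check: both queries return one row per replica on the lost pool, so the PNFSID dump and the path dump should have the same line count. A minimal sketch, using hypothetical file contents standing in for the real psql output:

```shell
# Hypothetical sample dumps standing in for the real psql output files
printf '0000AAAA\n0000BBBB\n0000CCCC\n' > /tmp/msufs03_2_pnfsid_list.txt
printf '/pnfs/aglt2.org/a\n/pnfs/aglt2.org/b\n/pnfs/aglt2.org/c\n' > /tmp/msufs03_2_file_list.txt

# The two dumps should have the same number of rows (one per replica)
wc -l < /tmp/msufs03_2_pnfsid_list.txt
wc -l < /tmp/msufs03_2_file_list.txt
```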
Identifying and Saving Cached Copies of Lost Files
One challenge is identifying files that are stored in more than one pool at our site: dCache can keep any file in one or more pools, so some files from the missing pool may still be present elsewhere at our site. The t_locationinfo table can be used to identify all such files. The following command shows every file from the lost pool that is stored in more than one pool:
psql --no-align --tuples-only --username pnfsserver --dbname chimera --command "select ipnfsid,count(ipnfsid) from t_locationinfo where ipnfsid in (select ipnfsid from t_locationinfo where ilocation='msufs03_2') group by ipnfsid having count(ilocation)>1;" --output msufs03_2_duplicates.txt
To get the list of locations for each PNFSID you can do:
psql --no-align --tuples-only --username pnfsserver --dbname chimera --command "select ipnfsid,ilocation from t_locationinfo where ipnfsid in (select ipnfsid from t_locationinfo where ilocation='msufs03_2');" --output msufs03_2_dup-locations.txt
Reduce this to the locations not on the lost pool via:
cp msufs03_2_dup-locations.txt msufs03_2_dup-locations-off-msufs03.txt
sed -i '/msufs03_2/d' msufs03_2_dup-locations-off-msufs03.txt
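As an illustration of this filtering step, here is the same cp/sed pair run on a few hypothetical rows (the PNFSIDs and the other pool names are invented for the example):

```shell
# Hypothetical ipnfsid|ilocation rows as dumped by the query above
cat > /tmp/dup-locations.txt <<'EOF'
0000AAAA|msufs03_2
0000AAAA|msufs05_1
0000BBBB|msufs03_2
0000BBBB|umfs07_3
EOF

# Copy, then delete every row that points at the lost pool
cp /tmp/dup-locations.txt /tmp/dup-locations-off-msufs03.txt
sed -i '/msufs03_2/d' /tmp/dup-locations-off-msufs03.txt
cat /tmp/dup-locations-off-msufs03.txt
# 0000AAAA|msufs05_1
# 0000BBBB|umfs07_3
```

Only the replicas on surviving pools remain, which is exactly the set we can still rescue.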
Once we know where the surviving cached copies are, we need to change their state in dCache. We can use the file above to create the set of admin commands that mark the cached copies precious:
cat msufs03_2_dup-locations-off-msufs03.txt | awk -F\| '{print "cd "$2"\nrep set precious "$1"\n.."}' > mark-precious.txt
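To see what the awk step produces, here it is on a single hypothetical row (PNFSID and pool name invented): each replica becomes a cd into its pool cell, a rep set precious for the file, and a .. to step back out.

```shell
# One hypothetical ipnfsid|ilocation row through the awk command above
echo '0000AAAA|msufs05_1' | awk -F\| '{print "cd "$2"\nrep set precious "$1"\n.."}'
# cd msufs05_1
# rep set precious 0000AAAA
# ..
```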
To run this we can follow the dCache manual (see the bottom of the page at https://www.dcache.org/manuals/Book-2.10/start/intouch-admin.shtml).
[root@head02 dCache]# cat run_dcache-admin-commands.sh-template
#!/bin/bash
#
# This is a template...include your dCache commands before the 'logoff'
#######################
outfile=/tmp/$(basename $0).$$.out
ssh -p 22224 -l admin head01.aglt2.org > $outfile <<EOF
logoff
EOF
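One way (a sketch, not from the original procedure; the file contents below are hypothetical stand-ins) to splice mark-precious.txt into a copy of the template is to print the admin commands immediately before the 'logoff' line with awk:

```shell
# Hypothetical stand-ins for the real template and command file
cat > /tmp/template.sh <<'OUTER'
#!/bin/bash
outfile=/tmp/$(basename $0).$$.out
ssh -p 22224 -l admin head01.aglt2.org > $outfile <<EOF
logoff
EOF
OUTER

cat > /tmp/mark-precious.txt <<'OUTER'
cd msufs05_1
rep set precious 0000AAAA
..
OUTER

# Copy every template line, inserting the admin commands just before 'logoff'
awk '/^logoff$/ { while ((getline l < "/tmp/mark-precious.txt") > 0) print l } { print }' \
    /tmp/template.sh > /tmp/run_dcache-admin-commands.sh
```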
This template can have the commands from mark-precious.txt inserted and can then be run against head01.aglt2.org (this assumes you have the right ssh key associated with your account):
[smckee@umt3int01 public]$ chmod a+x run_dcache-admin-commands.sh
[smckee@umt3int01 public]$ ./run_dcache-admin-commands.sh
Pseudo-terminal will not be allocated because stdin is not a terminal.
Now all the existing cached copies of the lost files are converted to precious and are no longer "lost".
Addressing Lost Files
The remaining files are presumed lost from our site and we must deal with this. There are two different components to repair: 1) dCache and 2) ATLAS.
dCache
dCache will need to be updated to remove the lost files. If the lost pool will also be decommissioned we will need to remove it from the dCache configuration as well.
To remove the files from dCache, we should just need to run the following in the chimera DB:
delete from t_locationinfo where ilocation='msufs03_2'
This removes all location entries associated with that pool:
chimera=> delete from t_locationinfo where ilocation='msufs03_2';
DELETE 9057
If the pool is no longer going to be used you need to remove it from the poolmanager.conf file. For AGLT2 this is stored in CFEngine and you need to update the file appropriately.
[smckee@umt3int01 masterfiles]$ pwd
/afs/atlas.umich.edu/home/smckee/AGLT2-CFE/masterfiles
[smckee@umt3int01 masterfiles]$ grep msufs03_2 stash/dcache/*
stash/dcache/poolmanager.conf:psu create pool msufs03_2
stash/dcache/poolmanager.conf:psu addto pgroup aglt2Pools msufs03_2
stash/dcache/poolmanager.conf:psu addto pgroup aglt2MSUPools msufs03_2
So edit this and check it back in. Then update the T2 copies:
[smckee@umt3int01 masterfiles]$ svn status
M stash/dcache/poolmanager.conf
[smckee@umt3int01 masterfiles]$ svn ci -m 'Removed msufs03_2 from poolmanager.conf'
Sending        masterfiles/stash/dcache/poolmanager.conf
Transmitting file data .
Committed revision 2042.
Then issue the commands to update the main config:
../tools/sync-policy.sh -d umcfe:T2 -R -y
../tools/sync-policy.sh -d msucfe:T2 -R -y
ATLAS
ATLAS maintains a web page describing what needs to be done when files are lost:
https://twiki.cern.ch/twiki/bin/view/AtlasComputing/DDMDarkDataAndLostFiles#Recovery_of_lost_or_corrupted_fi
For a loss of fewer than 50,000 files we can handle the "repair" ourselves. The challenge is to construct the needed SRM list of lost files. For AGLT2 the front part of the SRM specification is:
srm://head01.aglt2.org:8443/srm/managerv2?SFN=
So we just need to construct the list of lost files (starting with /pnfs/aglt2.org...) and prepend the SRM prefix above. An SQL query similar to those above gives the paths to the lost files:
psql --no-align --tuples-only --username pnfsserver --dbname chimera --command "select inode2path(ipnfsid) from t_locationinfo where ipnfsid in (select ipnfsid from t_locationinfo where ilocation='msufs03_2') group by ipnfsid having count(ilocation)=1;" --output msufs03_2-lost-files.txt
This command finds the path to all files that were on msufs03_2 (and ONLY on msufs03_2) according to chimera. We can convert this into the needed ATLAS SRM list via:
cat msufs03_2-lost-files.txt | awk '{print "srm://head01.aglt2.org:8443/srm/managerv2?SFN="$1}' > msufs03_2-SRM-lost-files.txt
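As a quick check of the list format, here is the same awk step applied to one hypothetical path (the file name is invented for the example):

```shell
# One hypothetical /pnfs path through the SRM-prefix awk command above
echo '/pnfs/aglt2.org/atlasdatadisk/rucio/example/file.root' \
  | awk '{print "srm://head01.aglt2.org:8443/srm/managerv2?SFN="$1}'
# srm://head01.aglt2.org:8443/srm/managerv2?SFN=/pnfs/aglt2.org/atlasdatadisk/rucio/example/file.root
```

A simple wc -l on msufs03_2-SRM-lost-files.txt is also a quick way to confirm the loss is under the 50,000-file threshold mentioned above.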
Now this can be submitted to Rucio according to the instructions at the URL above. I will use my account as an example. At CERN my account is 'mckee'. First we set up the ATLAS environment:
atlasSetup
Next, define your account for Rucio:
export RUCIO_ACCOUNT=mckee
Now set up Rucio:
lsetup rucio
Verify you have the needed cloud admin privileges:
[smckee@umt3int01 masterfiles]$ rucio-admin account list-attributes --account mckee
+------------+---------+
| Key | Value |
|------------+---------|
| cloud-us | admin |
| country-us | admin |
+------------+---------+
OK. Now we can issue the command:
[smckee@umt3int01 public]$ rucio-admin replicas declare-bad --reason "Lost dCache pool msufs03_2 with 3-disk failure on RAID-6" --inputfile msufs03_2-SRM-lost-files.txt
There were many "cannot be declared" files (1273 out of 12348 files). See attached file showing those.
You can watch the deletion at https://rucio-ui.cern.ch/bad_replicas/summary (click "Retrieve results", then filter in the "Search" box).
--
ShawnMcKee - 05 Jan 2016