Issues with Production Grids
Tony Hey
Director of the UK e-Science Core Programme
NGS Today
Projects
e-Minerals
e-Materials
Orbital Dynamics of Galaxies
Bioinformatics (using BLAST)
GEODISE project
UKQCD Singlet meson project
Census data analysis
MIAKT project
e-HTPX project
RealityGrid (chemistry)
Users
Leeds
Oxford
UCL
Cardiff
Southampton
Imperial
Liverpool
Sheffield
Cambridge
Edinburgh
QUB
BBSRC
CCLRC
Interfaces
OGSI::Lite
NGS Hardware
Compute Cluster
64 dual CPU Intel 3.06 GHz (1MB cache) nodes
2GB memory per node
2x 120GB IDE disks (1 boot, 1 data)
Gigabit network
Myrinet M3F-PCIXD-2
Front end (as node)
Disk server (as node) with 2x Infortrend 2.1TB U16U SCSI Arrays (UltraStar 146Z10 disks)
PGI compilers
Intel Compilers, MKL
PBSPro (job-submission sketch below)
TotalView Debugger
RedHat ES 3.0
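A minimal sketch of how a parallel MD job might be submitted to this cluster through PBSPro; the queue name, node request, executable and input file below are illustrative assumptions, not details taken from the slides.

    import subprocess
    import textwrap

    # Compose a PBSPro script for a parallel MD run on the compute cluster
    # described above (dual-CPU nodes, Myrinet interconnect). Queue name,
    # executable and input file are hypothetical placeholders.
    pbs_script = textwrap.dedent("""\
        #!/bin/bash
        #PBS -N md_window_01
        #PBS -l nodes=8:ppn=2
        #PBS -l walltime=12:00:00
        #PBS -q workq
        cd $PBS_O_WORKDIR
        mpirun -np 16 ./md_engine input.conf > md_window_01.log
        """)

    with open("md_window_01.pbs", "w") as f:
        f.write(pbs_script)

    # Submit the job and print the PBS job identifier.
    result = subprocess.run(["qsub", "md_window_01.pbs"],
                            capture_output=True, text=True, check=True)
    print("submitted", result.stdout.strip())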
Data Cluster
20 dual CPU Intel 3.06 GHz nodes
4GB memory per node
2x 120GB IDE disks (1 boot, 1 data)
Gigabit network
Myrinet M3F-PCIXD-2
Front end (as node)
18TB Fibre SAN (Infortrend F16F 4.1TB Fibre Arrays, UltraStar 146Z10 disks)
PGI compilers
Intel Compilers, MKL
PBSPro
TotalView Debugger
Oracle 9i RAC
Oracle Application Server
RedHat ES 3.0
NGS Software
RealityGrid AHM Experiment
Measuring protein-peptide binding energies (ΔG_bind) is vital for, e.g., understanding fundamental physical processes at play at the molecular level and for designing new drugs.
Computing a peptide-protein binding energy traditionally takes weeks to months.
We have developed a grid-based method to accelerate this process: we computed ΔG_bind during the UK AHM, i.e. in less than 48 hours.
[Figure: the ligand bound to the Src SH2 domain]
Experiment Details
A grid-based approach using the RealityGrid steering library enables us to launch, monitor, checkpoint and spawn multiple simulations (a sketch of such an orchestration loop follows this list)
Each simulation is a parallel molecular dynamics simulation running on a supercomputer-class machine
At any given instant, we had up to nine simulations in progress (over 140 processors) on machines at 5 different sites: e.g. 1x TG-SDSC, 3x TG-NCSA, 3x NGS-Oxford, 1x NGS-Leeds, 1x NGS-RAL
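A minimal sketch of the kind of orchestration loop described above, written with hypothetical launch/monitor/checkpoint functions; these names are illustrative placeholders and are not the actual RealityGrid steering-library API.

    import time

    SITES = ["TG-SDSC", "TG-NCSA", "NGS-Oxford", "NGS-Leeds", "NGS-RAL"]
    POLL_INTERVAL = 300  # seconds between status sweeps

    def launch_simulation(site, window):
        """Submit one parallel MD job (e.g. one lambda window) at `site`."""
        raise NotImplementedError  # stand-in for the real launcher

    def get_status(job):
        """Return 'running', 'finished' or 'failed' for a job handle."""
        raise NotImplementedError

    def checkpoint(job):
        """Ask a running simulation to write a checkpoint file."""
        raise NotImplementedError

    def steer(windows):
        """Keep up to one simulation per site in flight until all windows finish."""
        jobs = {}                        # site -> (job handle, window)
        pending = list(windows)
        while pending or jobs:
            # Launch new work on any idle site.
            for site in SITES:
                if site not in jobs and pending:
                    window = pending.pop(0)
                    jobs[site] = (launch_simulation(site, window), window)
            # Monitor running jobs, checkpoint them, and re-queue failures.
            for site, (job, window) in list(jobs.items()):
                status = get_status(job)
                if status == "running":
                    checkpoint(job)              # periodic safety checkpoint
                else:
                    del jobs[site]
                    if status == "failed":
                        pending.append(window)   # retry window on another site
            time.sleep(POLL_INTERVAL)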
Experiment Details (2)
In all, 26 simulations were run over 48 hours
We simulated over 6.8 ns of classical molecular dynamics in this time
Real-time visualization and off-line analysis required bringing back data from simulations in progress (see the staging sketch below)
We used UKLight between UCL and the TeraGrid machines (SDSC, NCSA)
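The slides do not name the transfer mechanism; one plausible sketch for staging files back from simulations in progress, assuming GridFTP endpoints at each site and the standard globus-url-copy client (the hostnames and paths below are invented for illustration).

    import subprocess

    # Hypothetical GridFTP URLs of in-progress trajectory files; hosts and
    # paths are illustrative only.
    REMOTE_FILES = [
        "gsiftp://gridftp.sdsc.example.org/scratch/run01/traj.dcd",
        "gsiftp://gridftp.ngs.example.ac.uk/work/run02/traj.dcd",
    ]
    LOCAL_DIR = "/data/ahm2004"  # assumed staging area at UCL

    def fetch(url):
        """Copy one remote file to the local staging area with globus-url-copy."""
        host = url.split("/")[2]
        name = url.rsplit("/", 1)[-1]
        dest = "file://{}/{}-{}".format(LOCAL_DIR, host, name)
        subprocess.run(["globus-url-copy", url, dest], check=True)

    for url in REMOTE_FILES:
        fetch(url)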
The e-Infrastructure
[Diagram: the AHM 2004 e-infrastructure. Computation on UK NGS sites (Leeds, Manchester, Oxford, RAL, UCL) and US TeraGrid sites (SDSC, NCSA, PSC), linked via UKLight through StarLight (Chicago) and NetherLight (Amsterdam); steering clients on local laptops and a Manchester vncserver; service registry and network PoPs; all sites connected by the production network (not all shown).]
The scientific results …
Thermodynamic Integrations
[Plot: dE/dλ against λ (λ from 0 to 1), for the two series labelled dp and po]
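For reference, the quantity plotted is the thermodynamic-integration integrand; under the usual assumptions the free-energy change of each leg follows by integrating it over the coupling parameter λ, in practice by quadrature over the sampled windows:

    \Delta G \;=\; \int_{0}^{1}
        \left\langle \frac{\partial E(\lambda)}{\partial \lambda} \right\rangle_{\lambda} d\lambda
    \;\approx\; \sum_{k} w_{k}
        \left\langle \frac{\partial E}{\partial \lambda} \right\rangle_{\lambda_{k}}

where the λ_k are the sampled windows (here λ = 0, 0.2, …, 1), the w_k are quadrature weights, and ΔG_bind is obtained from the difference between such integrals for the two legs.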
Some simulations require extending, and more sophisticated analysis needs to be performed
… and the problems
Restarted the GridService container on Wednesday evening
Numerous quota and permission issues, especially at TG-SDSC
NGS-Oxford was unreachable from Wednesday evening to Thursday morning
The steerer and launcher occasionally failed
We were unable to checkpoint two simulations
The batch queuing systems occasionally did not like our simulations
5 simulations died of natural causes
Overall, up to six people were working on this calculation to solve these problems