Database for Data Analysis
Developer: Ying Chen (JLab)
Computing 3 (or N)-pt functions
Many correlation functions (quantum numbers), at many momenta, for a fixed configuration
Data analysis requires a single quantum number over many configurations (called an Ensemble quantity)
Can be 10K to over 100K quantum numbers
Inversion problem:
Time to retrieve 1 quantum number can be long
Analysis jobs can take hours (or days) to run. Once cached, time can be considerably reduced
Development:
Require better storage technique and better analysis code drivers
Database
Requirements:
For each config worth of data, will pay a one-time insertion cost
Config data may be inserted out of order
Need to insert or delete
Solution:
Requirements basically imply a balanced tree
Try DB using Berkeley DB (Sleepycat)
Preliminary Tests:
300 directories of binary files holding correlators (~7K files in each directory)
A single key of quantum number + config number hashed to a string
About 9 GB DB; retrieval on local disk takes about 1 sec, over NFS about 4 sec
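The requirements above (out-of-order inserts, deletes, retrieval of one quantum number over many configs) can be illustrated with an in-memory ordered map. This is only a sketch: `std::map` stands in for the on-disk database, it is not the Berkeley DB API, and the names `CorrelatorStore`, `Key`, and `Correlator` are hypothetical.

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical types: (quantum-number string, config number) -> correlator data.
using Key = std::pair<std::string, int>;
using Correlator = std::vector<double>;

// In-memory stand-in for the database. std::map is an ordered, balanced
// tree in common implementations, so inserts and erases in any order
// cost O(log n).
class CorrelatorStore {
public:
    // Insert (or overwrite) one correlator; configs may arrive in any order.
    void insert(const std::string& qn, int config, Correlator c) {
        assert(!qn.empty());
        tree_[Key(qn, config)] = std::move(c);
    }

    // Remove one entry; returns true if something was removed.
    bool erase(const std::string& qn, int config) {
        return tree_.erase(Key(qn, config)) > 0;
    }

    // Gather a single quantum number over all stored configs, in config order.
    std::vector<Correlator> ensemble(const std::string& qn) const {
        std::vector<Correlator> out;
        auto it = tree_.lower_bound(Key(qn, std::numeric_limits<int>::min()));
        for (; it != tree_.end() && it->first.first == qn; ++it) {
            out.push_back(it->second);
        }
        return out;
    }

private:
    std::map<Key, Correlator> tree_;
};
```

Because the keys are ordered, all configs for one quantum number sit next to each other in the tree, so the ensemble lookup is one search followed by a short scan.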
Database key:
String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
Not intending (at the moment) any relational capabilities among sub-keys
Interface function:
Array< Array<double> > read_correlator(const string& key);
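A minimal sketch of how a key of the form source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath might be assembled. The function name `make_key` and the choice to encode momenta and Gamma as plain integers are assumptions for illustration; the slides do not specify the field encodings.

```cpp
#include <array>
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper: joins the sub-keys with underscores in the order
// source, sink, final momentum p_f, momentum transfer q, Gamma, link path.
std::string make_key(const std::string& source, const std::string& sink,
                     const std::array<int, 3>& pf, const std::array<int, 3>& q,
                     int gamma, const std::string& linkpath) {
    // The underscore is the separator, so source and sink must not contain it.
    assert(source.find('_') == std::string::npos);
    assert(sink.find('_') == std::string::npos);

    std::ostringstream os;
    os << source << '_' << sink;
    for (int c : pf) os << '_' << c;   // pfx, pfy, pfz
    for (int c : q)  os << '_' << c;   // qx, qy, qz
    os << '_' << gamma << '_' << linkpath;
    return os.str();
}
```

Since the whole key is one flat string, it supports exact-match lookups only, which is consistent with not intending relational capabilities among the sub-keys.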
struct Arg {Array<int> p_i; Array<int> p_f; int gamma;};
Getter: Ensemble<Array<Real>> operator[](const Arg&);
or Array<Array<double>> operator[](const Arg&);
Here, ensemble objects have jackknife support, namely operator*(Ensemble<T>, Ensemble<T>);
CVS package adat
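To make "jackknife support" concrete, here is a minimal sketch of a leave-one-out jackknife estimate for the mean of one measured quantity over an ensemble of configurations. It does not reproduce the adat `Ensemble` class; `JackknifeResult` and `jackknife_mean` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct JackknifeResult {
    double mean;   // average of the leave-one-out means
    double error;  // jackknife standard error
};

// Leave-one-out jackknife for the mean of x, one entry per configuration.
JackknifeResult jackknife_mean(const std::vector<double>& x) {
    const std::size_t n = x.size();
    assert(n > 1);  // at least two configurations are needed

    double total = 0.0;
    for (double v : x) total += v;

    // loo[i] is the mean with configuration i left out.
    std::vector<double> loo(n);
    for (std::size_t i = 0; i < n; ++i) {
        loo[i] = (total - x[i]) / static_cast<double>(n - 1);
    }

    double center = 0.0;
    for (double v : loo) center += v;
    center /= static_cast<double>(n);

    // Jackknife variance: (n-1)/n times the sum of squared deviations.
    double var = 0.0;
    for (double v : loo) var += (v - center) * (v - center);
    var *= static_cast<double>(n - 1) / static_cast<double>(n);

    return JackknifeResult{center, std::sqrt(var)};
}
```

For a plain mean this reproduces the usual standard error; the value of the resampling is that the same leave-one-out samples can be pushed through nonlinear combinations, such as a product of two ensemble quantities.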
Consider Dirac op $\det(D) = \det(D_t + D_s/\xi)$
Temporal precondition: $\det(D) = \det(D_t)\,\det(1 + D_t^{-1} D_s/\xi)$
Strategy:
Temporal preconditioning
3D even-odd preconditioning
Expectations:
Improvement can increase with increasing $\xi$
According to Mike Peardon, typically factors of 3 improvement in CG iterations
Improving condition number lowers fermionic force
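The temporal preconditioning step rests on factoring the temporal part out of the determinant. The sketch below checks that identity numerically on small 2x2 real matrices; the matrices are arbitrary stand-ins, not lattice operators, and the scalar dividing the spatial part is called `xi` in the code as an assumption about the notation. `Mat2` and the helper names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat2 = std::array<std::array<double, 2>, 2>;

double det(const Mat2& m) {
    return m[0][0] * m[1][1] - m[0][1] * m[1][0];
}

Mat2 inverse(const Mat2& m) {
    const double d = det(m);
    assert(d != 0.0);  // the temporal part must be invertible
    Mat2 r{};
    r[0][0] =  m[1][1] / d;  r[0][1] = -m[0][1] / d;
    r[1][0] = -m[1][0] / d;  r[1][1] =  m[0][0] / d;
    return r;
}

Mat2 multiply(const Mat2& a, const Mat2& b) {
    Mat2 r{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            r[i][j] = a[i][0] * b[0][j] + a[i][1] * b[1][j];
    return r;
}

// Left-hand side: det(Dt + Ds/xi).
double det_unpreconditioned(const Mat2& Dt, const Mat2& Ds, double xi) {
    Mat2 sum{};
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            sum[i][j] = Dt[i][j] + Ds[i][j] / xi;
    return det(sum);
}

// Right-hand side: det(Dt) * det(1 + Dt^{-1} Ds/xi).
double det_preconditioned(const Mat2& Dt, const Mat2& Ds, double xi) {
    Mat2 m = multiply(inverse(Dt), Ds);
    for (int i = 0; i < 2; ++i)
        for (int j = 0; j < 2; ++j)
            m[i][j] = m[i][j] / xi + (i == j ? 1.0 : 0.0);
    return det(Dt) * det(m);
}
```

The two functions agree up to rounding because Dt + Ds/xi = Dt (1 + Dt^{-1} Ds/xi) and the determinant of a product is the product of determinants.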
Motivation
Multi-threading
Test Environment
Multi-Core Architecture
[Figure: block diagram of the test systems. Labeled components: Core 1, Core 2, Memory Controller, FB DDR2, DDR2, ESB2 I/O, PCI-E Bridge, PCI-E Expansion HUB, PCI-X Bridge, PCI Express.]
Multi-Core Architecture
L1 Cache
FB-DDR2 system: 32 KB Data, 32 KB Instruction
AMD Opteron: 64 KB Data, 64 KB Instruction
L2 Cache
FB-DDR2 system: 4 MB shared among 2 cores; 256 bit width; 10.6 GB/s bandwidth to cores
AMD Opteron: 1 MB dedicated; 128 bit width; 6.4 GB/s bandwidth to cores
Memory
FB-DDR2 system: FB-DDR2; increased latency
AMD Opteron: NUMA (DDR2); increased latency to access the other memory; memory affinity is important
Executions
FB-DDR2 system: Pipeline length 14; 24 bytes fetch width; 96 reorder buffers; 3 128-bit SSE units; one SSE instruction/cycle
AMD Opteron: Pipeline length 12; 16 bytes fetch width; 72 reorder buffers; 2 128-bit SSE units; one SSE instruction = two 64-bit instructions
Parallel Programming
[Figure: Machine 1 and Machine 2, each labeled OpenMP/Pthread, connected by Messages.]
Performance improvement on multi-core/SMP machines
All threads share address space
Efficient inter-thread communication (no memory copies)
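The shared address space point can be sketched as follows: every worker reads the same input array in place and writes its partial result into a shared vector, with no copies or messages. This sketch uses C++ `std::thread` rather than raw Pthreads or OpenMP, purely to keep the example short; `parallel_sum` is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Sum a vector with several threads that all read the same memory.
double parallel_sum(const std::vector<double>& data, unsigned num_threads) {
    assert(num_threads > 0);

    // One slot per thread, so no two threads write the same element.
    std::vector<double> partial(num_threads, 0.0);
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_threads - 1) / num_threads;

    for (unsigned t = 0; t < num_threads; ++t) {
        // The lambda captures data and partial by reference: nothing is copied.
        workers.emplace_back([&data, &partial, chunk, t] {
            const std::size_t begin = std::min(data.size(), t * chunk);
            const std::size_t end = std::min(data.size(), begin + chunk);
            partial[t] = std::accumulate(data.begin() + begin,
                                         data.begin() + end, 0.0);
        });
    }
    for (std::thread& w : workers) w.join();  // wait for all workers

    return std::accumulate(partial.begin(), partial.end(), 0.0);
}
```

Adjacent slots of `partial` may share a cache line, so a tuned version would pad them; the sketch leaves that out for brevity.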
OpenMP
[Figure label: Master]
Runtime functions: omp_set_num_threads, omp_get_thread_num
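A minimal sketch of the two OpenMP runtime calls named above, used in a parallel loop. The function `tag_by_thread` is hypothetical. The OpenMP parts are guarded by `_OPENMP`, so the code still builds and runs serially when the compiler is not invoked with OpenMP enabled (for example without `-fopenmp` on GCC).

```cpp
#include <cassert>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Returns a vector where entry i holds the id of the thread that
// processed loop index i (always 0 when built without OpenMP).
std::vector<int> tag_by_thread(int n, int num_threads) {
    assert(n >= 0 && num_threads > 0);
    std::vector<int> out(n, -1);

#ifdef _OPENMP
    // Request a team size for subsequent parallel regions.
    omp_set_num_threads(num_threads);
    #pragma omp parallel for
#endif
    for (int i = 0; i < n; ++i) {
#ifdef _OPENMP
        out[i] = omp_get_thread_num();  // id within the current team
#else
        out[i] = 0;
#endif
    }
    return out;
}
```

Each iteration writes a distinct element of `out`, so the loop needs no locking; the runtime may use fewer threads than requested.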
Posix Thread
Complex
Conclusions
A hand-written QMT library can beat OpenMP compiler-generated code.