Database for Data-Analysis — Developer: Ying Chen (JLab), Computing 3(or N)-pt Functions

The document discusses developing a database for storing and retrieving large amounts of correlation-function data from lattice QCD simulations more efficiently. It describes using a Berkeley DB to store over 100,000 quantum numbers across configurations, with retrieval times of 1–4 seconds. Preliminary tests show that inserting configuration data into 300 directories of roughly 7,000 binary files each works well. The interface provides functions to read correlators by key, plus a wrapper for analysis code.


Database for Data-Analysis

Developer: Ying Chen (JLab)

Computing 3(or N)-pt functions


Inversion problem:

- Many correlation functions (quantum numbers), at many momenta, for a fixed configuration
- Data analysis requires a single quantum number over many configurations (called an ensemble quantity)
- Can be 10K to over 100K quantum numbers
- Time to retrieve one quantum number can be long; analysis jobs can take hours (or days) to run
- Once cached, time can be considerably reduced

Development:

- Require a better storage technique and better analysis-code drivers


Database

Requirements:

- For each config's worth of data, pay a one-time insertion cost
- Config data may insert out of order
- Need to insert or delete
- These requirements basically imply a balanced tree

Solution:

- Try a DB using Berkeley DB (Sleepycat)

Preliminary Tests:

- 300 directories of binary files holding correlators (~7K files per directory)
- A single key of quantum number + config number, hashed to a string
- About 9 GB DB; retrieval about 1 sec on local disk, about 4 sec over NFS

Database and Interface

Database key:

- String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
- Not intending (at the moment) any relational capabilities among sub-keys

Interface function:

- Array< Array<double> > read_correlator(const string& key);

Analysis code interface (wrapper):

- struct Arg {Array<int> p_i; Array<int> p_f; int gamma;};
- Getter: Ensemble<Array<Real>> operator[](const Arg&); or Array<Array<double>> operator[](const Arg&);
- Ensemble objects have jackknife support, e.g. operator*(Ensemble<T>, Ensemble<T>);
- CVS package: adat

(Clover) Temporal Preconditioning

Consider the Dirac operator: det(D) = det(Dt + Ds/)

Temporal precondition: det(D) = det(Dt) det(1 + Dt^-1 Ds/)

Strategy:

- Temporal preconditioning
- 3D even-odd preconditioning

Expectations:

- Improvement can increase with increasing …
- According to Mike Peardon, typically factors of 3 improvement in CG iterations
- Improving the condition number lowers the fermionic force
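The preconditioned form follows from the standard determinant factorization, assuming the temporal part Dt is invertible (the spatial term is written plainly as Ds here, since the slashed factor in the slide did not survive conversion):

```latex
\det(D) = \det(D_t + D_s)
        = \det\!\left( D_t \left( 1 + D_t^{-1} D_s \right) \right)
        = \det(D_t)\,\det\!\left( 1 + D_t^{-1} D_s \right)
```

Since det(Dt) is cheap to handle, the conditioning burden moves to the operator 1 + Dt^-1 Ds, which is closer to the identity than D itself — this is where the CG iteration-count improvement comes from.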

Multi-Threading on Multi-Core Processors


Jie Chen, Ying Chen, Balint Joo, and Chip Watson
Scientific Computing Group, IT Division, Jefferson Lab

Motivation

Next LQCD Cluster

What type of machine is going to be used for the cluster?

Intel Dual Core or AMD Dual Core?

Software Performance Improvement

Multi-threading

Test Environment

Two dual-core Intel Xeon 5150s (Woodcrest):

- 2.66 GHz, 4 GB memory (FB-DDR2, 667 MHz)

Two dual-core AMD Opteron 2220 SEs (Socket F):

- 2.8 GHz, 4 GB memory (DDR2, 667 MHz)

Common setup:

- i386 and x86_64 builds
- 2.6.15-smp kernel (Fedora Core 5)
- Intel C/C++ compiler (9.1), gcc 4.1

Multi-Core Architecture
[Block diagrams (figures lost): Intel Woodcrest (Xeon 5100) — two cores behind an off-chip memory controller with FB-DDR2 memory, ESB2 I/O, and a PCI-E bridge; AMD Opteron (Socket F) — two cores with an on-die DDR2 memory controller, a PCI-E expansion hub, and a PCI-X bridge.]

Multi-Core Architecture

Intel Woodcrest Xeon:

- L1 cache: 32 KB data, 32 KB instruction
- L2 cache: 4 MB shared between the 2 cores; 256-bit width; 10.6 GB/s bandwidth to the cores
- Memory: FB-DDR2 (increased latency)
- Execution: memory disambiguation allows loads ahead of store instructions; pipeline length 14; 24-byte fetch width; 96 reorder buffers; three 128-bit SSE units, one SSE instruction/cycle

AMD Opteron (Socket F):

- L1 cache: 64 KB data, 64 KB instruction
- L2 cache: 1 MB dedicated per core; 128-bit width; 6.4 GB/s bandwidth to the cores
- Memory: NUMA (DDR2); increased latency to access the other socket's memory, so memory affinity is important
- Execution: pipeline length 12; 16-byte fetch width; 72 reorder buffers; two 128-bit SSE units, one SSE instruction = two 64-bit instructions

Memory System Performance

Memory access latency (nanoseconds):

              Intel     AMD
  L1          1.1290    1.0720
  L2          5.2930    4.3050
  Mem         118.7     71.4
  Rand Mem    150.3     173.8

Performance of Applications NPB-3.2 (gcc-4.1 x86-64)

LQCD Application (DWF) Performance

Parallel Programming

[Diagram (figure lost): message passing — processes on Machine 1 and Machine 2 exchange messages.]

OpenMP/Pthread

- Performance improvement on multi-core/SMP machines
- All threads share one address space
- Efficient inter-thread communication (no memory copies)
- Multiple threads provide higher memory bandwidth to a process
- Different machines provide different scalability for threaded applications

OpenMP

Portable, shared-memory multi-processing API:

- Compiler directives and runtime library
- C/C++, Fortran 77/90
- Unix/Linux, Windows
- Intel C/C++, gcc-4.x
- Implemented on top of native threads

Fork-join parallel programming model:

[Diagram (figure lost): over time, the master thread forks a team of threads, which later join back into the master.]

OpenMP

Compiler Directives (C/C++):

#pragma omp parallel
{
  thread_exec (); /* all threads execute the code */
} /* all threads join master thread */

#pragma omp critical
#pragma omp section
#pragma omp barrier
#pragma omp parallel reduction(+:result)

Run-time library:

omp_set_num_threads, omp_get_thread_num

Posix Thread

IEEE POSIX 1003.1c standard (1995)

NPTL (Native POSIX Thread Library):

- Available on Linux since kernel 2.6.x

Parallel patterns: Barrier, Pipeline, Master-slave, Reduction

- Suited to fine-grain parallel algorithms
- Complex; not for the general public

QCD Multi-Threading (QMT)

Provides simple APIs for the fork-join parallel paradigm:

typedef void (*qmt_user_func_t)(void * arg);
qmt_pexec (qmt_user_func_t func, void* arg);

The user func will be executed on multiple threads.

Offers efficient mutex lock, barrier, and reduction:

qmt_sync (int tid);
qmt_spin_lock (&lock);

Performs better than OpenMP-generated code?

OpenMP Performance from Different Compilers (i386)

Synchronization Overhead for OMP and QMT on Intel Platform (i386)

Synchronization Overhead for OMP and QMT on AMD Platform (i386)

QMT Performance on Intel and AMD (x86_64 and gcc 4.1)

Conclusions

Intel Woodcrest beats AMD Opteron at this stage of the game:

- Intel has the better dual-core micro-architecture
- AMD has the better system architecture

A hand-written QMT library can beat OpenMP compiler-generated code.
