Database for Data-Analysis — Developer: Ying Chen (JLab), Computing 3(or N)-pt Functions

The document discusses developing a database for storing and retrieving large amounts of correlation-function data from lattice QCD simulations more efficiently. It describes using a Berkeley DB to store over 100,000 quantum numbers across configurations, with retrieval times of 1–4 seconds. Preliminary tests show that inserting configuration data into 300 directories of roughly 7,000 binary files each works well. The interface provides functions to read correlators by key, plus a wrapper for analysis code.


Database for Data-Analysis

Developer: Ying Chen (JLab)

Computing 3(or N)-pt functions


Inversion problem:

- Many correlation functions (quantum numbers), at many momenta, for a fixed configuration
- Data analysis requires a single quantum number over many configurations (called an ensemble quantity)
- Can be 10K to over 100K quantum numbers
- Time to retrieve one quantum number can be long; analysis jobs can take hours (or days) to run
- Once cached, time can be considerably reduced

Development:

- Require a better storage technique and better analysis-code drivers


Database

Requirements:

- For each config's worth of data, pay a one-time insertion cost
- Config data may insert out of order
- Need to insert or delete
- These requirements basically imply a balanced tree

Solution:

- Try a DB using Berkeley DB (Sleepycat)

Preliminary Tests:

- 300 directories of binary files holding correlators (~7K files per directory)
- A single key of quantum number + config number, hashed to a string
- About 9 GB DB; retrieval about 1 sec on local disk, about 4 sec over NFS

Database and Interface

Database key:

- String = source_sink_pfx_pfy_pfz_qx_qy_qz_Gamma_linkpath
- Not intending (at the moment) any relational capabilities among sub-keys

Interface function:

- Array< Array<double> > read_correlator(const string& key);

Analysis code interface (wrapper):

- struct Arg {Array<int> p_i; Array<int> p_f; int gamma;};
- Getter: Ensemble<Array<Real>> operator[](const Arg&); or Array<Array<double>> operator[](const Arg&);
- Ensemble objects have jackknife support, e.g. operator*(Ensemble<T>, Ensemble<T>);
- CVS package: adat

(Clover) Temporal Preconditioning

Consider the Dirac operator: det(D) = det(Dt + Ds/)

Temporal precondition: det(D) = det(Dt) det(1 + Dt^-1 Ds/)

Strategy:

- Temporal preconditioning
- 3D even-odd preconditioning

Expectations:

- Improvement can increase with increasing …
- According to Mike Peardon, typically factors of 3 improvement in CG iterations
- Improving the condition number lowers the fermionic force
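The preconditioned form follows from the standard determinant factorization, assuming the temporal part Dt is invertible (the spatial term is written plainly as Ds here, since the slashed factor in the slide did not survive conversion):

```latex
\det(D) = \det(D_t + D_s)
        = \det\!\left( D_t \left( 1 + D_t^{-1} D_s \right) \right)
        = \det(D_t)\,\det\!\left( 1 + D_t^{-1} D_s \right)
```

Since det(Dt) is cheap to handle, the conditioning burden moves to the operator 1 + Dt^-1 Ds, which is closer to the identity than D itself — this is where the CG iteration-count improvement comes from.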

Multi-Threading on Multi-Core Processors


Jie Chen, Ying Chen, Balint Joo, and Chip Watson
Scientific Computing Group, IT Division, Jefferson Lab

Motivation

Next LQCD Cluster

What type of machine is going to be used for the cluster?

Intel Dual Core or AMD Dual Core?

Software Performance Improvement

Multi-threading

Test Environment

Two dual-core Intel Xeon 5150s (Woodcrest):

- 2.66 GHz, 4 GB memory (FB-DDR2, 667 MHz)

Two dual-core AMD Opteron 2220 SEs (Socket F):

- 2.8 GHz, 4 GB memory (DDR2, 667 MHz)

Common setup:

- i386 and x86_64 builds
- 2.6.15-smp kernel (Fedora Core 5)
- Intel C/C++ compiler (9.1), gcc 4.1

Multi-Core Architecture
[Block diagrams (figures lost): Intel Woodcrest (Xeon 5100) — two cores behind an off-chip memory controller with FB-DDR2 memory, ESB2 I/O, and a PCI-E bridge; AMD Opteron (Socket F) — two cores with an on-die DDR2 memory controller, a PCI-E expansion hub, and a PCI-X bridge.]

Multi-Core Architecture

Intel Woodcrest Xeon:

- L1 cache: 32 KB data, 32 KB instruction
- L2 cache: 4 MB shared between the 2 cores; 256-bit width; 10.6 GB/s bandwidth to the cores
- Memory: FB-DDR2 (increased latency)
- Execution: memory disambiguation allows loads ahead of store instructions; pipeline length 14; 24-byte fetch width; 96 reorder buffers; three 128-bit SSE units, one SSE instruction/cycle

AMD Opteron (Socket F):

- L1 cache: 64 KB data, 64 KB instruction
- L2 cache: 1 MB dedicated per core; 128-bit width; 6.4 GB/s bandwidth to the cores
- Memory: NUMA (DDR2); increased latency to access the other socket's memory, so memory affinity is important
- Execution: pipeline length 12; 16-byte fetch width; 72 reorder buffers; two 128-bit SSE units, one SSE instruction = two 64-bit instructions

Memory System Performance

Memory access latency (nanoseconds):

              Intel     AMD
  L1          1.1290    1.0720
  L2          5.2930    4.3050
  Mem         118.7     71.4
  Rand Mem    150.3     173.8

Performance of Applications NPB-3.2 (gcc-4.1 x86-64)

LQCD Application (DWF) Performance

Parallel Programming

[Diagram (figure lost): message passing — processes on Machine 1 and Machine 2 exchange messages.]

OpenMP/Pthread

- Performance improvement on multi-core/SMP machines
- All threads share one address space
- Efficient inter-thread communication (no memory copies)
- Multiple threads provide higher memory bandwidth to a process
- Different machines provide different scalability for threaded applications

OpenMP

Portable, shared-memory multi-processing API:

- Compiler directives and runtime library
- C/C++, Fortran 77/90
- Unix/Linux, Windows
- Intel C/C++, gcc-4.x
- Implemented on top of native threads

Fork-join parallel programming model:

[Diagram (figure lost): over time, the master thread forks a team of threads, which later join back into the master.]

OpenMP

Compiler Directives (C/C++):

#pragma omp parallel
{
  thread_exec (); /* all threads execute the code */
} /* all threads join master thread */

#pragma omp critical
#pragma omp section
#pragma omp barrier
#pragma omp parallel reduction(+:result)

Run-time library:

omp_set_num_threads, omp_get_thread_num

Posix Thread

IEEE POSIX 1003.1c standard (1995)

NPTL (Native POSIX Thread Library):

- Available on Linux since kernel 2.6.x

Parallel patterns: Barrier, Pipeline, Master-slave, Reduction

- Suited to fine-grain parallel algorithms
- Complex; not for the general public

QCD Multi-Threading (QMT)

Provides simple APIs for the fork-join parallel paradigm:

typedef void (*qmt_user_func_t)(void * arg);
qmt_pexec (qmt_user_func_t func, void* arg);

The user func will be executed on multiple threads.

Offers efficient mutex lock, barrier, and reduction:

qmt_sync (int tid);
qmt_spin_lock (&lock);

Performs better than OpenMP-generated code?

OpenMP Performance from Different Compilers (i386)

Synchronization Overhead for OMP and QMT on Intel Platform (i386)

Synchronization Overhead for OMP and QMT on AMD Platform (i386)

QMT Performance on Intel and AMD (x86_64 and gcc 4.1)

Conclusions

Intel Woodcrest beats AMD Opteron at this stage of the game:

- Intel has the better dual-core micro-architecture
- AMD has the better system architecture

A hand-written QMT library can beat OpenMP compiler-generated code.
