Datastage Answers
What is a config file, what does it consist of, and what is the difference between a static config file and
a dynamic config file?
Config file
- The DataStage configuration file is a master control file (a text file that sits on the server side) for
jobs; it describes the resources and architecture of the parallel system. The configuration file describes the hardware
configuration for architectures such as SMP (a single machine with multiple CPUs, shared memory, and
disk), Grid, Cluster, or MPP (multiple CPUs, multiple nodes, and dedicated memory per node). DataStage
understands the architecture of the system through this file.
This is one of the biggest strengths of DataStage. If you change your processing
configuration, servers, or platform, the job designs themselves do not need to change,
because all jobs depend on this configuration file for execution. DataStage jobs determine which node to run
each process on, where to store temporary data, and where to store data set data based on the entries
provided in the configuration file. A default configuration file is available whenever the server is installed.
Configuration files have the extension ".apt". The main benefit of the configuration file is that it
separates software and hardware configuration from job design, so hardware and software
resources can change without changing a job design. DataStage jobs can point to different configuration files by using job
parameters, which means that a job can use different hardware architectures without being recompiled.
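To make the idea of describing nodes and their disk resources concrete, here is a minimal Python sketch that renders a small configuration file. It is an illustration only: `render_config` is a hypothetical helper (not a DataStage tool), and the host name and directory paths are made-up values. The node / fastname / pools / resource layout follows the usual shape of a parallel configuration file.

```python
def render_config(hostname, node_count, disk_dir, scratch_dir):
    """Render a minimal parallel configuration file with node_count logical nodes.

    All arguments are illustrative; real files usually vary the paths per node.
    """
    lines = ["{"]
    for i in range(1, node_count + 1):
        lines += [
            f'  node "node{i}"',
            "  {",
            f'    fastname "{hostname}"',                         # host that runs this node
            '    pools ""',                                       # default node pool
            f'    resource disk "{disk_dir}" {{pools ""}}',        # data set storage
            f'    resource scratchdisk "{scratch_dir}" {{pools ""}}',  # temporary data
            "  }",
        ]
    lines.append("}")
    return "\n".join(lines)


# Hypothetical host and paths, two logical nodes on one SMP machine.
config_text = render_config("etl-host", 2, "/data/datasets", "/data/scratch")
print(config_text)
```

Changing `node_count` from 2 to 4 changes the degree of parallelism described by the file without touching any job design, which is the separation the text describes.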
2.
INSTALL_DIR/etc, where INSTALL_DIR ($APT_ORCHHOME) is the top-level directory of the DataStage installation.
segments, which are then processed independently by each node in parallel. This takes advantage of parallel
architectures such as SMP, MPP, Grid computing, and Clusters.
1. Keyless partitioning
Keyless partitioning methods distribute rows without examining the contents of the data.
2. Keyed partitioning
Keyed partitioning examines the data values in one or more key columns, ensuring that records with the same
values in those key columns are assigned to the same partition. Keyed partitioning is used when business rules
(for example, Remove Duplicates) or stage requirements (for example, Join) require processing on groups of
related records.
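The difference between the two families can be sketched in a few lines of Python. This is a simplified model, not DataStage's internal algorithm: round robin stands in for keyless partitioning, a CRC32-based hash stands in for keyed partitioning, and the column names are hypothetical.

```python
import zlib


def round_robin_partition(rows, num_partitions):
    """Keyless: deal rows out in turn without looking at their contents."""
    partitions = [[] for _ in range(num_partitions)]
    for index, row in enumerate(rows):
        partitions[index % num_partitions].append(row)
    return partitions


def hash_partition(rows, key_columns, num_partitions):
    """Keyed: rows with equal key values always land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        key = "|".join(str(row[column]) for column in key_columns)
        target = zlib.crc32(key.encode("utf-8")) % num_partitions
        partitions[target].append(row)
    return partitions


rows = [
    {"cust_id": 1, "amount": 10},
    {"cust_id": 1, "amount": 30},
    {"cust_id": 2, "amount": 20},
    {"cust_id": 3, "amount": 40},
]

keyless = round_robin_partition(rows, 2)   # the two cust_id 1 rows are split up
keyed = hash_partition(rows, ["cust_id"], 2)  # the two cust_id 1 rows stay together
```

With round robin, the two rows for `cust_id` 1 end up in different partitions, so a per-customer operation such as Remove Duplicates would miss them. With the hash method they are guaranteed to share a partition.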
Data Partitioning Methods:
DataStage supports several data partitioning methods that can be used in parallel stages:
Auto, DB2, Entire, Same, Hash, Modulus, Range, Round robin, and Random.
B. Collecting is the opposite of partitioning and can be defined as a process of bringing back data partitions
into a single sequential stream (one data partition).
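As a rough illustration of collecting, the Python sketch below brings three partitions back into one stream in two ways. The function names are hypothetical, and the two strategies (reading partitions one after another, and merging partitions that are already sorted) are modeled in plain Python rather than taken from DataStage itself.

```python
import heapq
from itertools import chain


def ordered_collect(partitions):
    """Read all rows of the first partition, then the second, and so on."""
    return list(chain.from_iterable(partitions))


def sort_merge_collect(partitions, key=None):
    """Merge partitions that are each already sorted, keeping the sort order."""
    return list(heapq.merge(*partitions, key=key))


# Each partition is sorted internally, as it would be after a parallel sort.
partitions = [[1, 4, 7], [2, 5], [3, 6]]

in_partition_order = ordered_collect(partitions)
globally_sorted = sort_merge_collect(partitions)
```

The first result preserves partition order but not the overall sort order; the second produces a single fully sorted stream because every input partition was sorted to begin with.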
Data collecting methods
3.
Join Stage:
1.) It has n input links (one primary and the rest secondary), one output link,
and no reject link
2.) It supports 4 join operations: inner join, left outer join,
right outer join, and full outer join
3.) Join occupies less memory, hence performance is high in
the Join stage
4.) The default partitioning technique here would be Hash
partitioning
5.) Prerequisite: the input data should be sorted on the join keys
before the join operation is performed.
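The sorting prerequisite is what lets a join work with little memory: it can walk both inputs once, in key order, holding only the current key group. The Python sketch below shows that idea for inner and left outer joins. It is a simplified model under stated assumptions (both inputs are lists of dicts already sorted on the key; `merge_join` and the column names are hypothetical), not the Join stage's actual implementation.

```python
from itertools import groupby
from operator import itemgetter


def merge_join(left, right, key, how="inner"):
    """Join two inputs that are already sorted on key; how is 'inner' or 'left'."""
    get_key = itemgetter(key)
    right_groups = groupby(right, key=get_key)
    right_key, right_rows, exhausted = None, [], False
    output = []
    for left_key, left_rows in groupby(left, key=get_key):
        # Advance the right input until its key catches up with the left key.
        while not exhausted and (right_key is None or right_key < left_key):
            try:
                right_key, group = next(right_groups)
                right_rows = list(group)
            except StopIteration:
                exhausted = True
        matches = right_rows if right_key == left_key else []
        for left_row in left_rows:
            if matches:
                output.extend({**left_row, **right_row} for right_row in matches)
            elif how == "left":
                output.append(dict(left_row))  # keep unmatched left rows
    return output


left = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}, {"id": 4, "name": "d"}]
right = [{"id": 2, "city": "x"}, {"id": 3, "city": "y"}, {"id": 4, "city": "z"}]

inner_rows = merge_join(left, right, "id", how="inner")
left_outer_rows = merge_join(left, right, "id", how="left")
```

If either input were unsorted, the single forward pass would skip matching rows, which is why sorting comes first.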
Lookup Stage:
1.) It has n input links, one output link, and 1 reject link
2.) It can perform only 2 join operations: inner join and
left outer join
3.) Lookup occupies more memory, hence performance is reduced
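A lookup can be modeled as loading the reference data into memory and probing it row by row, which is where the extra memory cost comes from. The Python sketch below is an assumption-laden illustration: `lookup` is a hypothetical function, and the `on_failure` values "drop", "continue", and "reject" are used here to model inner join behavior, left outer join behavior, and the reject link respectively.

```python
def lookup(stream_rows, reference_rows, key, on_failure="reject"):
    """Probe an in-memory reference table for every stream row.

    on_failure: 'drop' (inner join), 'continue' (left outer join),
    or 'reject' (unmatched rows go to a separate reject list).
    """
    # The whole reference input is held in memory, unlike a sorted join.
    table = {row[key]: row for row in reference_rows}
    output, rejects = [], []
    for row in stream_rows:
        match = table.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "continue":
            output.append(dict(row))
        elif on_failure == "reject":
            rejects.append(row)
        # 'drop': the unmatched row is silently discarded
    return output, rejects


orders = [{"cust_id": 1, "amount": 10}, {"cust_id": 9, "amount": 20}]
customers = [{"cust_id": 1, "name": "Ann"}]

matched, rejected = lookup(orders, customers, "cust_id", on_failure="reject")
kept_all, _ = lookup(orders, customers, "cust_id", on_failure="continue")
```

No sorting is needed here, but the reference table must fit in memory, so this approach suits small reference data.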
The UNIX rm utility cannot be used to delete data sets. The orchadmin delete or rm command should be
used to delete one or more persistent data sets.
The -f option forces the delete. If some nodes are not accessible, -f deletes the data set partitions
on the accessible nodes and leaves the partitions on the inaccessible nodes as orphans.
The -x option forces the current config file to be used for the delete rather than the one stored in the data set.
Constraints - Can be treated as a filter condition that limits the number of rows/records coming
from the input based on the business rules we define. Stage variables can be used in constraints.
Column derivations - Used to get or modify input values, e.g. concatenating two input values
or setting a column to a constant value.
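The roles of a stage variable, a constraint, and column derivations can be sketched in Python as follows. This only models the concepts; the column names, the "A" status code, and the constant "CRM" are hypothetical, and the Transformer stage expresses these in its own expression language rather than in Python.

```python
def transform(rows):
    """Apply one constraint and two column derivations to each input row."""
    output = []
    for row in rows:
        # Stage variable: an intermediate value computed once per row.
        is_active = row["status"] == "A"

        # Constraint: a filter that decides whether the row is written out.
        if not is_active:
            continue

        output.append({
            # Derivation: concatenate two input values.
            "full_name": row["first_name"] + " " + row["last_name"],
            # Derivation: set a column to a constant value.
            "source": "CRM",
        })
    return output


rows = [
    {"first_name": "Ann", "last_name": "Lee", "status": "A"},
    {"first_name": "Bob", "last_name": "Ray", "status": "I"},
]
result = transform(rows)
```

Only the first row passes the constraint, and its output columns come entirely from the derivations.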