CS614 Final Term Solved Subjectives With References by Moaaz
Solved Subjective
From Midterm Papers
MC100401285
Feb 05,2013
PSMD01
4; What is a dirty bit? (2)
Answer:- (Page 438)
It can be a Boolean-type column.
This column helps us keep a record of rows with errors during data profiling.
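A minimal sketch of the idea, assuming hypothetical column names (`sid`, `gender`, `dirty`) and a simple validity rule (a gender code must be '0' or '1'). The table layout and the helper `mark_dirty` are illustrative, not taken from the handouts:

```python
# Hypothetical rows from a student table; field names are illustrative only.
rows = [
    {"sid": 1, "gender": "1"},
    {"sid": 2, "gender": "x"},  # invalid gender code
    {"sid": 3, "gender": "0"},
]

def mark_dirty(rows, valid_genders=("0", "1")):
    """Add a Boolean 'dirty' column: True when the row fails the validity check."""
    for row in rows:
        row["dirty"] = row["gender"] not in valid_genders
    return rows

flagged = mark_dirty(rows)
# Collect the ids of rows that were flagged during profiling.
dirty_ids = [r["sid"] for r in flagged if r["dirty"]]
print(dirty_ids)
```

Rows are flagged rather than deleted, so the erroneous records stay available for later inspection.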
5; What problems did the industry face as the usage of master table files grew? (3)
Answer:- (Page 12)
Data coherency i.e. the need to synchronize data upon update.
Program maintenance complexity.
Program development complexity.
Requirement of additional hardware to support many tapes.
6; Why is indexing needed, given the I/O bottleneck? (3)
Answer:- (Page 221)
Throwing more hardware at the problem doesn't really help, either. Expensive and multiprocessing servers can
certainly accelerate the CPU-intensive parts of the process, but the bottom line of database access is disk
access, so the process is I/O bound and I/O doesn't scale as fast as CPU power. You can get around this by
putting the entire database into main memory, but the cost of RAM for a multi-gigabyte database is likely to be
higher than the server itself! Therefore we index.
7; What is meant by the classification process? How do we measure the accuracy of classifiers? (3)
Answer:- (Page 259)
Classification means that, based on the properties of existing data, we have made groups, i.e. we have made a
classification.
Accuracy is the measure of correctness of your model. In classification we have two data sets, the training set
and the test set. A classification model is built based on the data properties and relationships in the training
data. Once built, the model is tested for accuracy in terms of % correct results, as the classification of the test
data is already known. So we specify the correctness or confidence level of the technique in terms of % accuracy.
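The accuracy measure described above can be sketched as follows. The function name `accuracy` and the sample labels are assumptions made for illustration only:

```python
def accuracy(predicted, actual):
    """Return the percentage of test records whose predicted class matches the known class."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty label lists of equal length")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)

# Known classes of the test set and the classes the model assigned to it.
known = ["yes", "no", "yes", "yes"]
model_output = ["yes", "no", "no", "yes"]
print(accuracy(model_output, known))
```

Here three of the four test records are classified correctly, so the model would be reported as 75% accurate.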
8; What are the advantages of SQL Server Meta Data Services? (3)
Answer:- (Page 385)
We may maintain metadata information about the databases involved in the packages, and we may keep version
information for each package. Furthermore, a package can be stored in a structured file and in a Microsoft Visual
Basic file.
9; Why is a pilot strategy recommended for construction of a DWH? (5)
Answer:- (Page 334)
A pilot project approach is adopted because:
A full-blown DWH requires extensive investment.
It shows users the value of DSS.
It establishes a blueprint for the full-blown system.
Identify problem areas.
Reveal true data demographics.
Pilot projects are supposed to work with limited data.
Write a query to find the total number of female students registered in BS Telecom. 3 Marks
Answer:- (Page 425)
SELECT COUNT(DISTINCT r.SID) AS Expr1
FROM Registration r INNER JOIN
Student s ON r.SID = s.SID AND
s.[Last Degree] IN ('F.Sc.', 'FSc',
'HSSC', 'A-Level', 'A level') AND
r.Discipline = 'TC' AND s.Gender = '1'
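To try a query of this shape without a SQL Server instance, here is a hedged sketch that uses Python's built-in `sqlite3` module. The table contents are invented, the `[Last Degree]` filter is left out for brevity, and gender code '1' is assumed to mean female, as in the query above:

```python
import sqlite3

# In-memory database with hypothetical sample rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Student (SID INTEGER, Gender TEXT);
CREATE TABLE Registration (SID INTEGER, Discipline TEXT);
INSERT INTO Student VALUES (1, '1'), (2, '0'), (3, '1'), (4, '1');
INSERT INTO Registration VALUES (1, 'TC'), (1, 'TC'), (2, 'TC'), (3, 'TC'), (4, 'CS');
""")

query = """
SELECT COUNT(DISTINCT r.SID)
FROM Registration r INNER JOIN Student s ON r.SID = s.SID
WHERE r.Discipline = 'TC' AND s.Gender = '1'
"""
# Student 1 is registered twice in TC but is counted once because of DISTINCT.
female_tc = conn.execute(query).fetchone()[0]
print(female_tc)
```

COUNT(DISTINCT r.SID) matters here because a student may appear in Registration more than once.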
BOOK (ISBN, TITLE, PUBLISHER, ADDRESS)
Is this table in first and second normal form? If yes, then what about third normal form? 5 Marks
What issues may occur during data acquisition and cleansing in agriculture case study? 5 Marks
Answer:- (Page 341)
The pest scouting sheets are larger than A4 size (8.5 x 11), hence the right end was cropped when
scanned on a flat-bed A4 size scanner.
The right part of the scouting sheet is also the most troublesome, because of pesticide names for a single
record typed on multiple lines i.e. for multiple farmers.
As a first step, OCR (Optical Character Reader) based image to text transformation of the pest scouting
sheets was attempted. But it did not work even for relatively clean sheets with very high scanning
resolutions.
Subsequently DEOs (Data Entry Operators) were employed to digitize the scouting sheets by typing.
The pest scouting sheets are larger than A4 size (8.5 x 11), hence the right end was cropped when scanned on
a flat-bed A4 size scanner. The right part of the scouting sheet is also the most troublesome, because of
pesticide names for a single record typed on multiple lines i.e. for multiple farmers.
As a first step, OCR (Optical Character Reader) based image to text transformation of the pest scouting sheets
was attempted. But it did not work even for relatively clean sheets with very high scanning resolutions, such as
600 dpi. Subsequently DEOs (Data Entry Operators) were employed to digitize the scouting sheets by typing.
To reduce spelling errors in pesticide names and addresses, drop down menu or combo boxes with standard and
correct names were created and used.
Qno.3 What is meant by clickstream? How can it be useful in a web DWH environment?
Answer:- (Page 363)
Web-intensive businesses have access to a new kind of data, in some cases literally consisting of the gestures of
every Web site visitor. This is called the clickstream. In its most elemental form, the clickstream is every
page event recorded by the web server. The clickstream contains a number of new dimensions, such as page,
session, and referrer, that were previously unknown in the conventional DWH environment.
Qno.4 Differentiate between DDS, Data mining and Data Warehouse DWH
Answer:- http://dwbi1.wordpress.com/2012/07/14/what-is-big-data-data-warehouse-data-mining/
A DDS is a database that stores the data warehouse data in a different format than OLTP.
Data mining is the process of exploring data to find the patterns and relationships that describe the data and to
predict the unknown or future values of the data. The key value of data mining is the ability to understand why
some things happened in the past and the ability to predict what will happen in the future.
A data warehouse is a system that retrieves and consolidates data periodically from the source systems into a
dimensional or normalized data store.
Qno.5 Suppose that we have collected agriculture data from the Punjab. What processing of this underutilized
data is required for it to be successful, and what is your suggestion after decoding the data?
Qno.6 What is the effect on a data warehouse if an erroneous record is encountered in the data, and what
measures should be taken to support decision making?
Answer:- (Page 159)
Erroneous data leads to unnecessary costs and probably bad reputation when used to support business
processes.
Move all such records to exception table.
Qno.7 Differentiate between range partitioning and expression partitioning.
Answer:- (Page 66)
The most common use of range partitioning is on date. This is especially true in data warehouse deployments
where large amounts of historical data are often retained.
Expression partitioning is usually deployed when expressions can be used to group data together in such a way
that access can be targeted to a small set of partitions for a significant portion of the DW workload.
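A rough sketch of the two schemes, under stated assumptions: the range scheme uses one partition per calendar year, and the expression scheme uses a modulo expression over a hypothetical `customer_id` column. Both function names are invented for illustration:

```python
from datetime import date

def range_partition(sale_date):
    """Range partitioning on date: one partition per calendar year."""
    return f"y{sale_date.year}"

def expression_partition(customer_id, n_partitions=4):
    """Expression partitioning: an expression over a column picks the partition."""
    return customer_id % n_partitions

print(range_partition(date(2001, 11, 5)))
print(expression_partition(10))
```

A query restricted to one year, or to one value of the expression, only needs to touch the matching partition.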
Qno.8 What are the two extremes for technical architecture design?
Answer:- Rep
Qno.9 What should be done when dates are missing and the golden copy is unavailable?
Answer:- (Page 457)
If dates are missing, we must consult the golden copy. If gender is missing, we are not required to
consult the golden copy, because the name can help us identify the gender of the person.
When the golden copy is unavailable, replace the missing date with the global value 1/1/50.
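The rule for missing dates can be sketched as a small lookup with a fallback. The function `fill_missing_date`, the record ids and the dictionary standing in for the golden copy are hypothetical. Only the default value 1/1/50 comes from the answer above:

```python
GLOBAL_DEFAULT_DATE = "1/1/50"  # global value named in the answer above

def fill_missing_date(record_id, value, golden_copy=None):
    """Return the date itself, else the golden-copy date, else the global default."""
    if value:
        return value
    if golden_copy is not None and golden_copy.get(record_id):
        return golden_copy[record_id]
    return GLOBAL_DEFAULT_DATE

golden = {1: "3/4/85"}  # hypothetical golden copy keyed by record id
print(fill_missing_date(1, "", golden))   # taken from the golden copy
print(fill_missing_date(2, None, golden)) # falls back to the global value
```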
8. Write SQL to fetch total number of female students registered in BS telcom (5 marks)
Answer:- Rep
9. Is it possible to have erroneous data in a DWH? How will this impact business processes? (5 marks)
Answer:- (Page )
Any record having value other than 0 or 1 is erroneous.
Erroneous data leads to unnecessary costs and probably bad reputation when used to support business
processes. Consider a company using a list of consumer addresses and buying habits and preferences to
advertise a new product by direct mailing. Invalid addresses cause the letters to be returned as undeliverable.
People being duplicated in the mailing list account for multiple letters sent to the same person, leading to
unnecessary expenses and frustration. Inaccurate information about consumer buying habits and preferences
contaminate and falsify the target group, resulting in advertisement of products that do not correspond to
consumers needs. Companies trading such data face the possibility of an additional loss of reputation in case of
erroneous data.
10. How is hardware utilization different in DWH? (2 marks)
Answer:- (Page 24)
There are peaks and valleys in operational processing, but ultimately there is a relatively static
pattern of utilization. There is an essentially different pattern of hardware utilization in the data warehouse
environment i.e. a binary pattern of utilization, either the hardware is utilized fully or not at all. Calculating a
mean utilization for a DWH is not a meaningful activity. Therefore, trying to mix the two environments is a
recipe for disaster. You can optimize the machine for the performance of one type of application, not for both.
11. In case of non-uniform distribution, what will be the impact on performance? (5 marks)
Answer:- http://help.sap.com/saphelp_470/helpdata/en/06/8a983b8e847725e10000000a114084/content.htm
Changes in the B* tree structure can lead to a non-uniform distribution of the data pages. If certain branches of
the B* tree have more pages than others, data page distribution is not uniform. This can affect performance,
because more accesses are generally required to find data.
Non-uniform distributions are detected by SAP DB when INSERT, UPDATE, or DELETE statements are
performed, and the tree is rebalanced when the current operation is carried out. This means that the tree is
constantly maintained for optimum operation. During this procedure, page entries are moved to new locations
and page pointers are redirected. As a result, data pages are used more efficiently.
Uniform distribution of data prevents individual data regions from overflowing. The only restriction on the size
of tables is the storage space available in the database system.
12. Why is the pilot project methodology highly recommended in DWH?
Answer:- Rep
(5)
A similarity or dissimilarity matrix holds the measure of similarity or dissimilarity obtained by pairwise
comparison of rows. First you measure the similarity of row 1 of the data matrix with itself, which will be 1, so 1
is placed at index (1, 1) of the similarity matrix. Then you compare row 1 with row 2, and the similarity value
goes at index (1, 2) of the similarity matrix, and so on. In this way the similarity matrix is filled. Note that the
similarity between row 1 and row 2 is the same as between row 2 and row 1. The similarity matrix is therefore
square and symmetric, and all values along the diagonal are the same (here 1). So if your data matrix has n rows
and m columns, your similarity matrix will have n rows and n columns. What is the time complexity of computing
the similarity/dissimilarity matrix? It is O(n^2 * m), where m accounts for the vector or header size of the data.
How do we measure or quantify similarity or dissimilarity? Different techniques are available, such as Pearson
correlation and Euclidean distance, but in this lecture we have used Pearson correlation, which you might have
studied in your statistics course.
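The construction described above can be sketched in plain Python. The sample data matrix is invented, and the sketch assumes no row is constant (a constant row has zero variance, so its Pearson correlation is undefined):

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length numeric rows (assumes non-constant rows)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def similarity_matrix(data):
    """Build the n x n matrix by pairwise comparison of the n rows."""
    n = len(data)
    sim = [[1.0] * n for _ in range(n)]  # a row compared with itself gives 1
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetric: the value for (i, j) is reused for (j, i).
            sim[i][j] = sim[j][i] = pearson(data[i], data[j])
    return sim

data = [[1, 2, 3], [2, 4, 6], [3, 2, 1]]  # n = 3 rows, m = 3 columns
sim = similarity_matrix(data)
for row in sim:
    print(row)
```

The two nested loops over rows give the n^2 factor, and each call to `pearson` walks the m columns, which matches the O(n^2 * m) cost stated above.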
Give a simple example of a company name that is using the DWH in each of the following:
i) Financial Services/Insurance
ii) Telecommunication
iii) Transportation
iv) Government
Answer:- Rep
Q3: Cleansing can be broken down into how many steps? Write their names.
Answer:- (Page 168)
Cleansing can be broken down into six steps:
elementizing, standardizing, verifying, matching, householding, and documenting.
2 marks
Q4: What do you mean by "keep competition hot" in the context of product selection and transformation
while designing a data warehouse? 2 marks
Answer:- (Page 305)
Make private, not public, commitments.
Don't let the vendor know you are completely sold.
During the trial period, put the product to real use.
Near the end of the trial, negotiate.
Q5: How are the merge columns selected in case of sort-merge? 2 marks
Answer:- (Page 243)
There may be multiple equalities in the WHERE clause. In such a case, the merge columns are taken from only
some of the given equality clauses.
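For context, here is a minimal sketch of a sort-merge join on a single merge column. The tables, the column name `sid` and the function `sort_merge_join` are assumptions made for illustration, and a real optimizer would choose the merge column(s) from the equality clauses itself:

```python
def sort_merge_join(left, right, key):
    """Join two lists of dicts on `key` by sorting both inputs and merging them."""
    left = sorted(left, key=lambda r: r[key])
    right = sorted(right, key=lambda r: r[key])
    out, i, j = [], 0, 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][key], right[j][key]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Emit every right row that shares this key with the current left row.
            k = j
            while k < len(right) and right[k][key] == lk:
                out.append({**left[i], **right[k]})
                k += 1
            i += 1  # j stays put so a duplicate left key re-reads the same group
    return out

students = [{"sid": 2, "name": "B"}, {"sid": 1, "name": "A"}, {"sid": 3, "name": "C"}]
regs = [{"sid": 1, "course": "CS614"}, {"sid": 3, "course": "CS401"}, {"sid": 1, "course": "CS502"}]
joined = sort_merge_join(students, regs, "sid")
print([(r["sid"], r["course"]) for r in joined])
```

Both inputs must be ordered on the merge column before the merge pass, which is why the choice of merge column matters.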
Q6: What are the purposes of data profiling?
3 marks
Answer:- (Page 264)
Data profiling, i.e. gathering information about columns, fulfils the following two purposes:
It identifies the type and extent of the transformation required.
It gives us a detailed view of data quality.
Q7: What issues may occur during data acquisition and cleansing in the agriculture case
study? 3 marks
Answer:- Rep
Q8: What is meant by the classification process? How do we measure the accuracy of classification?
Answer:- Rep
3marks
5 marks
Predictive modeling
Segmentation (Clustering)
Dependency Modeling
Summarization
Change and deviation detection
Q3: Which scripting language is used to perform complex transformations in a DTS package? 2
Answer:- Rep
Q4: A person wanted to visit an organization and understand the data warehouse implementation strategies
adopted there, but the organization refused to allow it. What may be the cause of this refusal?
Q4: How do the applications of parallelism differ for OLTP and DSS environments? 2
Answer:- (Page 205)
There is a big difference.
In DSS: parallelization of a SINGLE query.
In OLTP: parallelization of MULTIPLE queries, or batch updates in parallel.
Q5: Keeping in view uniform distribution in hash-based partitioning, what happens if the partitions are not
uniformly distributed across the processors? 3
Answer:- (Page 218)
There can be two types of skew, i.e. non-uniform distribution when the data is distributed across the processors.
One type of skew depends on the properties of the data. Consider the example of data about cancellation of
reservations. It is obvious that most cancellations in the history of airline travel occurred during the last quarter
of 2001. Therefore, whenever the data for year 2001 is distributed based on date, it will always be skewed. This
can also be looked at from the perspective of partition skew, as date is typically seen to result in non-uniform
distribution of data.
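The effect can be illustrated with a small sketch. The 100 synthetic cancellation records, the helper names and the modulo-on-id assignment (standing in for a hash function) are all assumptions made for illustration:

```python
from collections import Counter

def partition_counts(keys, assign, n_partitions):
    """Count how many records land in each of the n partitions."""
    counts = Counter(assign(k) for k in keys)
    return [counts.get(p, 0) for p in range(n_partitions)]

def skew_ratio(counts):
    """Largest partition size divided by the mean size; 1.0 means perfectly uniform."""
    return max(counts) / (sum(counts) / len(counts))

# Hypothetical cancellation records as (record_id, quarter): most fall in quarter 4.
records = [(i, 4 if i < 70 else (i % 3) + 1) for i in range(100)]

by_quarter = partition_counts(records, lambda r: r[1] - 1, 4)  # partition on date
by_id = partition_counts(records, lambda r: r[0] % 4, 4)       # partition on record id
print(by_quarter, skew_ratio(by_quarter))
print(by_id, skew_ratio(by_id))
```

With date-based partitioning one processor receives 70 of the 100 records, so it finishes long after the others, while partitioning on the record id spreads the same data evenly.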
Q6: What tasks are performed through the Import/Export Data wizard to load data? 3
Answer:- Rep
Q7: What is meant by clickstream? How can it be useful in a web DWH environment? 3
Answer:- Rep
Q8: What is meant by the classification process? How do we measure the accuracy of classifiers? 3
Answer:- Rep
Q9: Discuss the need for indexing with reference to I/O speed. 3
Answer:- (Page 220)
Consider the Find operation in Windows; a user search is initiated and a search starts through each file on the
hard disk. When a directory is encountered, the search continues through each directory. With only a few
thousand files on a typical laptop, a typical find operation takes a minute or longer.
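The difference between scanning everything and using an index can be sketched in memory. This is only an analogy for disk access, and the row layout and helper names are hypothetical:

```python
def full_scan(rows, sid):
    """Linear search: touches rows one by one until a match is found."""
    touched = 0
    for row in rows:
        touched += 1
        if row["sid"] == sid:
            return row, touched
    return None, touched

def build_index(rows):
    """Map each key to its row position so a lookup is a single probe."""
    return {row["sid"]: pos for pos, row in enumerate(rows)}

rows = [{"sid": i, "name": f"student{i}"} for i in range(10_000)]
index = build_index(rows)

row_scan, touched = full_scan(rows, 9_999)  # worst case: every row is read
row_indexed = rows[index[9_999]]            # one index probe, one row read
print(touched, row_indexed["name"])
```

On disk each row touched may cost an I/O operation, which is why reducing the number of rows read matters more than adding CPU power.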
Q #6: Differentiate between dense index and sparse index. What are the advantages and disadvantages of each?
Answer:- Rep
Q #7: What is the classification process? How do we measure the accuracy of classifiers?
Answer:- Rep
Q #8: For what reason do trivial queries give wrong results in the Agri-DWH? Explain with an example.
Q #9: What are the purposes of data profiling?
Answer:- Rep
Q #10: What are the benefits of TQM? Why do organizations prefer this technique over other techniques?
Q #11: How should mistakes be avoided in the data warehouse process?
Q #12: How are time-contiguous log entries and HTTP's secure sockets layer used for user session
identification? What are the limitations of each?
Answer:- (Page 365)
Using Time-contiguous Log Entries
A session can be consolidated by collecting time-contiguous log entries from the same host (Internet
Protocol, or IP, address).
Limitations
The method breaks down for visitors from large ISPs
Different IP addresses
Browsers that are behind some firewalls.
In many cases, the individual hits comprising a session can be consolidated by collating time-contiguous log
entries from the same host (Internet Protocol, or IP, address). If the log contains a number of entries with the
same host ID in a short period of time (for example, one hour), one can reasonably assume that the entries are
for the same session.
Limitations
This method breaks down for visitors from large ISPs because different visitors may reuse dynamically
assigned IP addresses over a brief time period.
Different IP addresses may be used within the same session for the same visitor.
This approach also presents problems when dealing with browsers that are behind some firewalls.
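A minimal sketch of session consolidation from time-contiguous log entries. It assumes each entry is a (host, timestamp-in-seconds) pair and treats a gap of more than one hour from the same host as the start of a new session. The function name and the gap rule are illustrative assumptions, not the handout's exact method:

```python
def sessionize(entries, gap_seconds=3600):
    """Group (host, timestamp) log entries into sessions per host.

    A new session starts when the same host is silent for longer than gap_seconds.
    """
    sessions = {}  # host -> list of sessions, each a list of timestamps
    for host, ts in sorted(entries, key=lambda e: (e[0], e[1])):
        host_sessions = sessions.setdefault(host, [])
        if host_sessions and ts - host_sessions[-1][-1] <= gap_seconds:
            host_sessions[-1].append(ts)  # still the same session
        else:
            host_sessions.append([ts])    # first hit, or gap too long
    return sessions

log = [("10.0.0.1", 0), ("10.0.0.1", 600), ("10.0.0.2", 100), ("10.0.0.1", 10_000)]
result = sessionize(log)
print(result)
```

Because the grouping key is the IP address alone, two visitors who share a dynamically assigned address would be merged into one session, which is the limitation noted above.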
Using HTTP's secure sockets layer (SSL)
Offers an opportunity to track a visitor session
Limitations
To track the session, the entire information exchange needs to be in high overhead SSL
Each host server must have its own unique security certificate.
Visitors are put-off by pop-up certificate boxes.
This offers an opportunity to track a visitor session because it may include a login action by the visitor and the
exchange of encryption keys.
Limitations
The downside to using this method is that to track the session, the entire information exchange needs to be in
high overhead SSL, and the visitor may be put off by security advisories that can pop up when certain browsers
are used.
Each host server must have its own unique security certificate.
6- Name the activities to be performed in the planning and design phase, as discussed in the Agri-DWH case
study.
Answer:- (Page 335)
1. Determine Users' Needs
2. Determine DBMS Server Platform
3. Determine Hardware Platform
4. Information & Data Modeling
5. Construct Metadata Repository
7- What are the three methods of creating a DTS package?
Answer:- (Page 380)
A package can be created by one of the following three methods:
Import/Export wizard
DTS Designer
Programming DTS applications