DM Unit 5

data mining spectrum r18 jntuh

Uploaded by saisudhir1728
UNIT-5: ADVANCED CONCEPTS

Basic Concepts in Mining Data Streams - Mining Time-series Data - Mining Sequence Patterns in Transactional Databases - Mining Object, Spatial, Multimedia, Text and Web Data - Spatial Data Mining - Multimedia Data Mining - Text Mining - Mining the World Wide Web.

LEARNING OBJECTIVES

✓ Basic methodologies for stream data processing and querying
✓ Lossy counting algorithm
✓ Methods for classifying stream data
✓ Techniques for mining time-series data
✓ Methods of sequential pattern mining
✓ Concept of spatial and multimedia data mining
✓ Various information retrieval methods
✓ Concept of text and web mining.

INTRODUCTION

The basic methodologies for stream data processing and querying are random sampling, sliding windows, histograms, multi-resolution methods, sketches, randomized algorithms, and the Data Stream Management System (DSMS) with stream query processing, where a DSMS uses multiple data streams which are potentially infinite and are temporally ordered. A time-series is a series of data that changes with time. Whereas, a spatial database is a database consisting of a large amount of space-related data such as maps, medical imaging data and VLSI chip layout data. Moreover, a multimedia database is defined as the database which contains all these kinds of data, which is widely applicable in digital cameras, audio-video equipment, CD-ROMs, etc.

Sequential pattern mining refers to the process of extracting the ordered events or subsequences that often occur as patterns. Whereas, text mining is defined as a process of extracting high-quality information from text (document) databases. Moreover, the World Wide Web (WWW) serves in the distribution of information services such as news, advertisements, education, entertainment, consumer information, financial services and electronic commerce.

SPECTRUM ALL-IN-ONE JOURNAL FOR ENGINEERING STUDENTS

Q1.
What are the components of a stream data query processing architecture?

Answer :

The three components of a stream data query processing architecture are,

(a) End User

It is responsible for issuing a query to the DSMS.

(b) Query Processor

It takes the query and then processes it using the appropriate information present in the scratch space, and provides the results to the user.

(c) Scratch Space

It consists of the appropriate information which is used by the query processor. Queries are of two types,
(i) One-time queries
(ii) Continuous queries.

Q2. Write about tilted time frame.

Answer :

Tilted time frame is a time dimension model used for recording time at various levels of granularity. The most recent time and the more distant time are recorded at the finest and coarser granularities, respectively. Various analysis tasks can be performed with the help of this model. This model also ensures that the amount of data which is to be stored in the memory space is less.

Few of the possible ways of designing a tilted time frame are as follows,
1. Natural tilted time frame model
2. Logarithmic tilted time frame model
3. Progressive logarithmic tilted time frame model.

Q3. What are the advantages of Hoeffding tree algorithm?

Answer :                                                Model Paper

The advantages of the Hoeffding tree algorithm are as follows,
1. It provides more accuracy with a smaller number of samples.
2. It avoids multiple scanning of the same data, which results in the wastage of memory.
3. It is incremental and can be used for classification of data during data construction.

Q4. Define the following terms,
(a) Trend movements
(b) Cyclic movements.

Answer :

(a) Trend Movements

Trend movements are also known as long-term movements. They specify the direction in which a time series moves over a certain period of time.

The trend movement can be represented with the help of a trend curve or a trend line,
drawn using the two methods, the weighted moving average method and the least squares method.

(b) Cyclic Movements

These movements are also known as cyclic variations. They typically refer to the long-term periodic oscillations of a trend line or curve.

In other words, it refers to the cycles that may (or may not) follow the same sequence after equal time intervals.

Q5. What is meant by generalization of structured data?

Answer :

Object-relational and object-oriented databases have the ability to store, access and model complex structure-valued data, such as set-valued data. Generalization of structured data summarizes such complex values at higher levels of abstraction.

Q6. What are the applications of spatial data mining?

Answer :

Spatial data mining has wide applications in the following areas,
1. Geographic information systems
2. Geomarketing
3. Remote sensing
4. Medical imaging
5. Traffic control
6. Environmental studies etc.

Q7. Discuss about document choosing method.

Answer :                                                Model Paper

In the document choosing method, the query represents constraints in order to choose useful documents from the cluster. The "boolean retrieval model" is one such method, wherein the user provides a boolean expression of keywords like "audio or video", and such sets of keywords are used to represent a document. The information retrieval system accepts a boolean query and provides documents based on the boolean expression. This method can operate well when the user possesses a good knowledge of the document collection and has the ability of formulating a good query.

Q8. What is meant by stop words?

Answer :                                                Dec.-19(R16), Q10

Stop words are commonly occurring words which carry little useful information for retrieval, and are therefore removed during text pre-processing. The selection of features involves eradication of case sensitivity, stop words, punctuation and uncommon words. Examples of stop words are a, about, also, among, are, around, at, by, etc.

Q9. What are the two measures of text retrieval?

Answer :
The two basic measures for text retrieval are as follows,
(i) Precision
(ii) Recall.

(i) Precision

Precision is defined as the ratio of the number of documents which are both relevant and retrieved to the number of documents that are retrieved. Typically, it can be formulated as,

Precision = |{Relevant_documents} ∩ {Retrieved_documents}| / |{Retrieved_documents}|

(ii) Recall

Recall is defined as the ratio of the number of documents which are both relevant and retrieved to the number of documents that are relevant. Typically, it can be formulated as,

Recall = |{Relevant_documents} ∩ {Retrieved_documents}| / |{Relevant_documents}|

Q10. Define Text Mining.

Answer :

Text mining is defined as a process of extracting high-quality information from text (document) databases. The purpose of this mining is to process unstructured information, to extract meaningful numeric indices from the text, and to make the information accessible to different data mining algorithms. Text mining is an important part of data mining, because such mining enables the user to make a comparison among several documents and to determine the importance of several documents.

Q11. Give the taxonomy of web mining.

Answer :

The different types of web mining include the following,

(i) Web Usage Mining

This type of mining refers to a process of extracting textual data/multimedia data contained within the server logs, based on the user's requirements.

(ii) Web Content Mining

This type of mining, also called "web text mining", refers to the process of discovering some useful information from text, audio, video, images, etc., contained within the web servers.

(iii) Web Structure Mining

This type of mining refers to the process of analyzing the nodes and connection structure of a particular website. It is typically done by means of graph theory.

Q12. Define web content mining.

Answer :                                                Model Paper

Web content mining is a process of extracting relevant information from web contents. Basically, web content includes textual information, real-time information and hyperlinks.
The textual information is unstructured, due to which text mining techniques can be used for performing web content mining.

BASIC CONCEPTS IN MINING DATA STREAMS

Q13. Explain in detail the basic methodologies for stream data processing and querying.

Answer :                                                Model Paper

The basic methodologies for stream data processing and querying are as follows,
1. Random sampling
2. Sliding windows
3. Histograms
4. Multi-resolution methods
5. Sketches
6. Randomized algorithms
7. Data stream management system and stream query processing.

1. Random Sampling

In order to answer stream queries, it is beneficial to maintain a candidate set of elements called a "reservoir", using which a true random sample of the elements can be formed. New elements keep entering as the data stream flows in, and these new elements have a probability of replacing old elements that are selected at random from the reservoir. It can be said that if there are 'N' elements in the entire data stream and a candidate sample set of 'n' elements is maintained from this data stream, then the probability of replacing an old element (selected randomly) with a new element is 'n/N'. Although reservoir sampling is simple, it is highly expensive when the size of the reservoir is extremely large.

2. Sliding Windows

An alternative for performing analysis of a data stream is the "sliding window" model. In this model, the computation is performed on the recent data instead of on the entire past data. In simple words, if an element arrives at time 't', then t + w is considered as the time when this element expires, where 'w' is the window size. This model is applicable to stocks or sensor networks, since these applications concentrate on recent events rather than on the entire history. Apart from this, the sliding window model also uses memory efficiently without wasting it, since it requires memory only for storing a small window of data.
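The sliding-window behaviour described above can be sketched as follows; the helper name and the list-of-snapshots return value are illustrative choices, not part of any standard API.

```python
from collections import deque

def sliding_window(events, w):
    """Sliding-window model: an element arriving at time t expires at
    time t + w, so only the last w arrivals are kept for computation."""
    window = deque(maxlen=w)   # old elements fall off automatically
    snapshots = []
    for event in events:
        window.append(event)
        snapshots.append(list(window))
    return snapshots
```

In practice a windowed statistic (e.g., an average over the most recent w sensor readings) would be recomputed from each snapshot rather than from the whole history.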
3. Histograms

A histogram is a synopsis data structure, responsible for approximating the frequency distribution of the values of an element present in a data stream. Basically, it partitions the data into a set of contiguous buckets. The buckets may vary depending upon the type of partitioning rule applied. A rule that divides the value range into buckets of equal size is called "equal-width" partitioning. This rule is easy to apply in the construction of histograms. Another rule, used in V-optimal histograms, defines the sizes of the buckets in such a manner that the variance of the frequencies within each bucket is minimized, which results in a better approximation of the data distribution. Histograms can act as a replacement for sampling the probability distribution function, and can also be used for computing approximate answers to queries.

4. Multi-resolution Methods

In order to get multiple resolutions, a clustering method is used which organizes a single data stream into a hierarchical structure of trees, using which a hierarchy of microclusters can be created. The statistics of the data stream are updated periodically, in an incremental fashion, in this hierarchy of microclusters whenever the data stream flows dynamically. In order to get the general data statistics at multiple resolutions, the information present in several microclusters can be combined to form a single larger macrocluster.

The construction of a multi-resolution hierarchy structure for a data stream can also be done using a technique called "wavelets". This technique takes an input signal (i.e., the data stream) and breaks it into a set of orthogonal basis functions. The most simple basis function among these functions is the "Haar wavelet". This wavelet computes averages and differences at multiple resolution levels recursively. Haar wavelets are easy to understand and implement, and they are exceptionally good while dealing with spatial and multimedia data.
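A minimal sketch of the Haar averaging-and-differencing scheme mentioned above, assuming the input length is a power of two (the function name is illustrative):

```python
def haar_decompose(signal):
    """One full Haar wavelet decomposition: repeatedly replace the
    signal by pairwise averages, keeping the pairwise difference
    (detail) coefficients produced at each resolution level."""
    coeffs = []
    data = list(signal)
    while len(data) > 1:
        averages = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
        details  = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
        coeffs.append(details)   # detail coefficients at this level
        data = averages          # recurse on the coarser averages
    return data[0], coeffs       # overall average + details per level
```

Keeping only the largest coefficients (and the overall average) gives the compressed, histogram-like approximation of the stream that the text refers to.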
Wavelets are a famous multi-resolution method used for compressing data streams. This is because they act as approximations to histograms for optimizing a query, and also due to the fact that histograms based on wavelets can be maintained over a longer time.

5. Sketches

Sketches require a single pass to operate on the data, whereas methods like histograms and wavelets need multiple passes. Consider a data stream containing elements drawn from a universe for which a complete histogram is to be maintained.

The universe 'U' can be represented as,

U = {1, 2, 3, ..., n}

The data stream 'D' of length N can be represented as,

D = {d1, d2, ..., dN}

For every value 'i' in the universe, the number of occurrences of 'i' in 'D', denoted m_i, must be maintained. The size of this structure increases with the size of the universe, and if the size of the structure increases then the representation becomes quite complex. Thus, a smaller representation is necessary.

Let us define the frequency moments of 'D', which are the numbers,

F_k = Σ (i = 1 to n) m_i^k

In the above expression, F_k is the k-th frequency moment, n is the size of the universe and m_i is the frequency of the value i in the sequence 'D'. F_0, F_1 and F_2 correspond to the number of distinct elements, the length of the sequence and the self-join size (repeat rate), respectively.

Frequency moments are capable of providing essential information about the data for database applications. They can even provide the degree of symmetry in the data. This degree of symmetry helps in finding a suitable partitioning algorithm for the data.

If the size of the universe is larger than the size of the memory, then an alternative must be implemented. This alternative is referred to as "sketches", which are responsible for calculating the frequency moments. These sketches use randomized linear projections in order to establish a small-space summary for a distribution vector. They can even guarantee the quality of the answers, although the answers are approximate. Consider N elements and a universe containing n values.
These types of sketches can approximate F_0, F_1 and F_2 in O(log n + log N) space. Every element must be hashed at random to either −1 or +1. Maintain a random variable Z = Σ_i m_i x_i. The square of Z (i.e., Z²) results in a good estimate for F_2. For obtaining a clearer and more efficient estimate, one has to maintain several independent random variables Z_i; each value must be squared and the median of these squares must be selected, which gives an estimate that lies close to F_2.

Sketch partitioning was introduced for increasing the performance of sketching on a data stream. It makes use of statistical information for partitioning the domain of the attribute such that the error rate gets reduced.

6. Randomized Algorithms

Randomized algorithms are generally used to deal with large data streams with high dimensions. The use of randomization techniques results in simpler algorithms when compared to deterministic algorithms.

Two types of randomized algorithms are as follows,
(i) Las Vegas algorithm
(ii) Monte Carlo algorithm.

(i) Las Vegas Algorithm

The answer returned by this randomized algorithm is always correct, but there is a variation in the running time.

(ii) Monte Carlo Algorithm

This randomized algorithm has limits on the running time, but the output produced may or may not be the right answer.

Among the two algorithms, Monte Carlo is more preferred. The result of a randomized algorithm is a random variable. Let us apply limits on the tail probability of this random variable. This implies that the probability of the random variable deviating from its expected value is comparatively low. A tool called "Chebyshev's inequality" can be used. It states that,

P(|X − μ| > n) ≤ σ² / n²

In the above expression, 'X' refers to a random variable, 'μ' refers to the mean and 'σ' refers to the standard deviation, whereas 'n' is a positive real number. Using this inequality, limits can be applied to the variance of the random variable.

WARNING: Xerox/Photocopying of this book is a CRIMINAL act. Anyone found guilty is LIABLE to face LEGAL proceedings.
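The ±1-hash sketch just described can be put in code. This is a simplified illustration: a production AMS sketch uses 4-wise independent hash functions and a median-of-means combination, whereas here plain seeded random signs are used for brevity, and the exact F_2 is computed alongside for comparison.

```python
import random
import statistics
from collections import Counter

def exact_f2(stream):
    """Exact second frequency moment F2 = sum of m_i**2 (self-join size)."""
    return sum(m * m for m in Counter(stream).values())

def ams_f2_estimate(stream, copies=9):
    """AMS-style sketch: hash every universe element at random to -1/+1,
    maintain Z = sum(m_i * x_i) in one pass; Z**2 estimates F2.
    The median over independent copies improves the estimate."""
    rngs = [random.Random(seed) for seed in range(copies)]
    signs = [{} for _ in range(copies)]  # lazily chosen +/-1 per element
    z = [0] * copies
    for item in stream:
        for k in range(copies):
            if item not in signs[k]:
                signs[k][item] = rngs[k].choice((-1, 1))
            z[k] += signs[k][item]
    return statistics.median(v * v for v in z)
```

Note that for a stream over a single distinct element the estimate is exact, since every copy computes (±m)² = m².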
A more stringent tool is the "Chernoff bounds" technique. Let X1, ..., Xn be independent Poisson trials where, for each trial, the probability of success is p_i. If X denotes the sum of X1 to Xn and μ is its expected value, then the Chernoff bound states that, for 0 < δ ≤ 1,

P[X < (1 − δ)μ] < e^(−μδ²/2)

From the above expression, it can be observed that the probability decreases exponentially as the variable moves away from the mean, which decreases the rate of poor estimates.

7. Data Stream Management System and Stream Queries

A Data Stream Management System (DSMS) uses multiple data streams. These data streams are potentially infinite and are temporally ordered. The elements present in the streams are discarded once they are processed, and can be retrieved back only if they have been stored in the memory explicitly.

The three components of a stream data query processing architecture are,
(i) End user
(ii) Query processor
(iii) Scratch space.

(i) End User

It is responsible for issuing a query to the DSMS.

(ii) Query Processor

It takes the query and then processes it using the appropriate information present in the scratch space, and provides the results to the user.

(iii) Scratch Space

It consists of the appropriate information which is used by the query processor. Queries are of two types,
(i) One-time query
(ii) Continuous query.

(i) One-time Query

This type of query is evaluated only once for every point-in-time snapshot of a data set, by returning the answer to the user.

(ii) Continuous Queries

This type of query evaluates the data streams continuously as they arrive. These queries can be predefined or ad hoc. A predefined query is a type of continuous query, whereas an ad hoc query is similar to a one-time query.

There are many challenges in stream query processing. An increase in the size of the data stream may cause the memory space to become scarce, while a finite main memory is required to generate exact answers. Hence, it is beneficial to produce approximate answers rather than exact ones, using techniques of approximation. Moreover, queries like ad hoc queries require the history of the data in order to return a particular answer, which demands a compression technique.

Q14. Explain about the tilted time frame and its design models.

Answer :

Tilted Time Frame

Tilted time frame is a time dimension model used for recording time at various levels of granularity. The most recent time and the more distant time are recorded at the finest and coarser granularities, respectively. Various analysis tasks can be performed with the help of this model. This model also ensures that the amount of data which is to be stored in the memory space is less.

Few of the possible ways of designing a tilted time frame are as follows,
1. Natural tilted time frame model
2. Logarithmic tilted time frame model
3. Progressive logarithmic tilted time frame model.

1. Natural Tilted Time Frame Model

This model helps in computing the frequent item sets over,
(i) The last hour, with the precision of a quarter of an hour
(ii) The last day, with the precision of an hour
(iii) The whole year, with the precision of a month.

Figure (1): Natural Tilted Time Frame Model

In the above figure, the time frame is structured in multiple granularities of natural time units, such as months, days, hours and quarters of an hour.
The most Fecent time and the more distant time ae recorded at finest and coarser granularities, respectively various analysis tasks can be performed with the help of this model. This model also ensures that the amount of data which is to be stored in the memory space is less. : Few of the possible ways of designing tilted time frame are as follows, = 1: Natural tilted time frame model 2. Logarithmic tilted time frame model Progressive logarithmic tilted time frame model. 3. Natural Tilted Time Frame Model ‘This model helps in computing the following frequent item sets, (i) Last hour with pr y with the precision of an-hour with the precision of a month. yn of a quarter of an hour (ii) Last day iii) Whole year e days bers dae bs Model 51 Tilted Time Frame 5 jqure (1): Natural 7 in multiple a figure, time frame is st ule In the 200 Ve eno natural vim ours and quartets 188 DATA MINING [UNTUCHYDERAB gy r 2, Logarithmic Titled Time Frame Model i jthmic scale. In this model, the time frame is divided into multiple granularities with respect to a logarith 5124, 2560, 12 of, 3, at, ‘Time Figure 2k: Logarithmie Tilted Time Frame Model gad ha, the recent sot is holding the transactions ofthe curent quarter and the remaining slots ae holding ty quarter, the next two quarters, 4 quarters, 8 quarters, 16 quarters and so on. (.e., quarters are growing 310. Ty umber of time units required by this model are, log, (365 * 24 x 4) +1 = 16.1 units This model with one year of data and the finest precision at a quarter required only 17 time frames to store the compresg, information, 3. Progressive Logarithmic Tilted Time Frame Model ‘This model stores the snapshots at various levels of granularity with respect to the recency. Snapshots - . 1 59 $7 55 2) 56 52 58 3 51 53 50 4 4231 5 28. ‘Table: Progressive Logarithmic Tilted Time Frame Model (Consider 7, as the time which passed since the sream has been started. 
Each snapshot is assigned a frame number. These frame numbers may lie in the range of 0 to max_frame, where max_frame depends on log2(T0) and on max_capacity, and max_capacity indicates the maximum number of snapshots which a frame can hold. A time stamp t is used for representing every snapshot. Following are the rules for inserting a snapshot t into the snapshot frame table.

Rule 1

When (t mod 2^n) = 0 and (t mod 2^(n+1)) ≠ 0, t can be inserted into the frame_number 'n' only when 'n' is less than max_frame. If the value of 'n' is greater than max_frame, then t must be inserted into the max_frame itself.

Rule 2

Every slot has got its own max_capacity. If the slot has reached its max_capacity while t is being inserted into the frame_number, then the snapshot which is the oldest one must be replaced with the new snapshot.

These tilted time frame models have given a way for inserting new values when the values existing in the frame become old.

Q15. Describe in detail about the critical layer technique.

Answer :

Critical Layer

Critical layer can be referred to as a strategy which can be used with the tilted time frame model for performing computations in less amount of time. In computing a materialized cube, a single tilted time frame model is used, and it results in a high cost, since the cube consists of various levels wherein each level includes various (distinct) values. The critical layer technique was introduced to overcome this limitation.

Critical layers can be broadly classified into two layers, depending upon their computational importance in stream analysis,
1. Minimal interest layer
2. Observation layer.
ion Layer is by got good capacity to attract the users as 1s erestng 0 fend the dat availabe inthis aes ust like making decision with respect taception, finding cells which possess exceptions ae ne performed By 8 ser piefor crea Layer xen dimensions at lowest evel of granularity as soot umber and minute, These three dimensions te io multiple_user, house_number, hour when they are mal interest layer. They transform into group_user, a mand ay WHEN they are at the observation layer, won ile building a design, one can reach the minimal yer butts better nottomiove beyond it, because itmay re edistraction ofthe user. Therefore, while constructing fous cuboid, the aggregate cells of a group-by must be sedand stored in order to represent the minimal interest yer The dimensions are rolled up to aggregate them. The e_user rolls upto multiple_usér, door_number eoesion sn i ito a house, number Whereas minute rolls upto a hour nthe case of observation layer, the cuboids are computed tycnsieing the ted time frame model In this layer, the 2esions roll upto group_user town and day respectively ats. What if a user needs a layer that would between the two critical layers? ps Answer # Inacttical layer, the cubes are materialized only at the ‘woccal layers (minimal interest and observation). When a user reedsthe layer which lies between the two critical layers, then ‘becompexty arses. The cuboids which lie in this layer may be fly computed, partially computed and completely uncomputed: Inorderto overcome this problem, anew approach called “yubrpath cubing” is introduced. In this approach, the cuboids ‘erled upon a single popular drilling path from the minimal ‘Sere layer to the observation layer. The popular drilling path [ates the ayes which come aeros it the other layers are (€szdae computed, whenever is essential compute them. esonable trade-off s achieved by this approach between the me spac, computation time and flexibility. 
This approach eas ee and ha quick aggregation and drilling time. feng ta Structure is used to reduce the space required Computation of aggregation: This data structure not an efficient computation but also helps in storing the Tat path that is introduced by the stream cube. ‘This “ahypetignca be maintained in the memory swith the help ve nd tee structure refered to as “pLtree”. This tree Pretineme ion oF ‘multi-dimensional and cane Aicienly H-tre follows the path of popular 2° cree ils ranches In order fo form voted euboids lon ye POPU dling path then the agarezate ls ‘Non-leaf nodes, SPECTRUM ALL-IN-ONE JOURNAL FOR ENG! This H-tree Approach minim ° aa minimizes the computatc Hae = because the cells which have tbe compl coat opal drilling path and the cuboids which have to are the non-leaf nodes present inthe tree. Hatree use efficient techni i cells present inthe cn iques for materi g the eam cube ether by drill au le tea y drilling on-line at per a single dimension or by dling wih combination of ranean, The architecture designed withthe help of an form it Hee wane like incremental updating iz. Explain in detail about the lossy counting \gorithm used for performing frequent pattern mining in data streams. Answer : Lossy Counting Algorithm () Finding Frequent Items Lossy counting algorithm was introduced for finding ‘the frequent items. The algorithm works as follows, Initially, a user must provide the following parameters as input, 1. Minimum-support threshold (0) 2. Erfor bound (e) ‘Now, the stream received whose length a is partitioned info buckets of widh ‘w', where B= 7. A frequeney list data structure is used by this algorithm for every item whose is more than 0. This list maintains the approximate frequency frequency count (/) and maximum possible eror (A) for every item present in it. “The algorithm processes the buckets of items as follows. 
When a new bucket arrives, its items are added to the frequency list. If a given item is found in the list, then the frequency count (f) of that item is increased. If the given item is not found in the list, then it is inserted into the list with a frequency count of 1. Consider that, if the item belongs to the b-th bucket, then the maximum possible error (Δ) of that item is set to b − 1 when its frequency count is set to 1. In the very beginning, the frequency counts are actual frequencies, not approximate ones; only after deletions take place do they become approximate.

The frequency list must be examined whenever a bucket reaches its boundary. If 'b' is the current bucket number, then an item entry can be deleted from the list only when f + Δ ≤ b. These deletions keep the frequency list small, so that it can be efficiently stored in the main memory.

(ii) Under-estimation of Frequency Count

The approximation ratio plays an important role in approximation algorithms. Since an item entry is deleted only when f + Δ ≤ b, the wastage of memory is avoided, and the true frequency of an item is under-estimated by at most εN.

Q18. Explain about the Very Fast Decision Tree (VFDT) algorithm.

Answer :

Very Fast Decision Tree Algorithm (VFDT)

The Very Fast Decision Tree (VFDT) algorithm was developed in order to overcome the limitations of the Hoeffding tree algorithm by making certain modifications in it. The modifications made in the Hoeffding algorithm result in an increase in speed and in the utilization of memory. The following are the modifications made in the Hoeffding algorithm,
(i) Breaking near-ties while selecting the attributes.
(ii) Computing the evaluation function only after a series of training examples.
(iii) Skipping the working of inefficient leaves in the case of less memory.
(iv) Rejecting the splitting attributes which are poor.
(v) Developing the initialization method.
The VFDT algorithm is highly compatible with data streams. It efficiently competes with the traditional classifiers in terms of speed and accuracy.

Concept-adapting Very Fast Decision Tree Algorithm (CVFDT)

The CVFDT algorithm can be called an improved version of VFDT. It uses the sliding-window process, but a new model is not created from scratch every time. Instead, the counts of a new example are incremented and the counts of an old example are decremented. When a drift occurs, a few nodes might no longer pass the Hoeffding bound. This in turn leads to the creation of an alternate subtree whose root is the new best splitting attribute. The alternate subtree keeps growing with new examples, so the tree continues to grow, and the old subtree will be replaced once the alternate subtree becomes more efficient than the current one. Because the memory required is less, the response times are quite quick. CVFDT also provides better accuracy when compared to VFDT, since it includes the latest examples.

Q19. Explain about the classifier ensemble approach to stream data classification.

Answer :

Classifier Ensemble Approach to Stream Data Classification

This approach uses a classifier ensemble in order to classify stream data. The aim of this approach is to train a set of classifiers from the series of chunks present in the data stream. In other words, a new classifier is formed on the arrival of every new chunk. Weighing of a single classifier is done depending upon its expected classification accuracy in an environment which is time-changing. The top-n classifiers are considered and then, depending upon their votes, decisions are made.

Applications

1. This approach is helpful in the case where an attribute lying close to the root of a tree in CVFDT cannot pass the Hoeffding bound, which results in the regrowth of the tree. Since this approach builds a single classifier per chunk, the process of regrowth will be quick and simple as well.
2. It deletes the examples which are least accurate and only keeps the examples which are up-to-date.

Q20. Discuss in brief the new methodologies for performing effective clustering of stream data.

Answer :                                                Model Paper

New Methodologies for Performing Effective Clustering of Stream Data

The following are the new methodologies for performing effective clustering of stream data.

1. Calculating and Saving the Summaries of the Previous Data

Summaries of the previous data must be calculated and their respective results must be saved, since they can be used while calculating some essential statistics.
2. Using the Divide-and-Conquer Approach

Initially, the data streams must be partitioned depending upon the order in which they arrive. Then, the summaries of these chunks must be calculated. Finally, the summaries have to be combined. This is the way to construct big models from small blocks.

3. Performing Incremental Clustering on Receiving of Data Streams

The data arriving in the stream should be incrementally clustered, so that the incoming data can be utilized for the purpose of analysis.
‘artitioning Stream Clustering into both On-line and Off-line Procedures ‘The streams data while being received has to be calculated, saved and upgraded incrementally. In order to maintain such clusters which keep changing, an on-line process is used. On the other hand, when the queries are: made by the user inquiring about the previous, current and changing clusters, their analysis can be done with the help of off-line process Q21. Explain about the following clustering algorithm, (a) STREAM (b) CLU stream. Answer : ‘ (2) STREAM Itis a one-pass, constant factor approximation algorithm “which was introduced in order to solve the n-medians problem. Inan n-median problem, kdata points are grouped into m clusters ‘wherein the sum squared error (SSQ) between the data points and center of the cluster to which those points are assigned is reduced. The aim here is to assign same points to the same cluster such that points present in all the clusters differ from each other. Consider a data stream model containing points P,, Pod? With time stamps S,, Sy Syne Sy In this case, stream clustering aims at providing a continuous clustering series using less memory and in short period of time. The stream data is divided into buckets containing “7” points wherein every bucket, fits into the main memory. The points present in these buckets ‘are grouped into n-clusters, The information present in these buckets is suminarized and only the essential information of k-centers is saved. These cluster centers are weighted depending ‘upon the number of points associated with that cluster, These points are then deleted and only the information regarding center is saved, Afier gathering sufficient cluster centers, the weighted clusters are then grouped into a set of O(n) cluster centers. 
At every level, this process is repeated so that a maximum of m points are gathered. The result obtained from this technique is a single-pass, constant-factor approximation algorithm that runs in O(kn) time for a data stream of n points with k medians, while requiring only a small amount of space.

Advantages
The advantages of STREAM are as follows,
- A consistent clustering is obtained. The k-medians problem is solved by assigning similar points to the same cluster, such that the points present in different clusters differ from each other.

Disadvantages
The disadvantages of STREAM are as follows,
- It neglects both the evolution of the data and time granularity.
- It provides less flexibility to calculate clusters over a number of user-specified time periods.

(b) CluStream
This algorithm groups the data streams which constantly change, depending upon user-defined, on-line clustering queries. The clustering process is divided into the following components,
(i) On-line component
(ii) Off-line component.

On-line Component
The on-line component uses microclusters in order to calculate and save the summary statistics regarding the data stream. It is also capable of maintaining microclusters and performing incremental on-line calculations for them.

Off-line Component
The off-line component performs macroclustering and provides solutions to different user queries with the help of the stored summary statistics. These statistics are dependent on the tilted time frame model. This model is used for grouping the data streams which constantly change, depending upon both historical and current stream data. The tilted time frame model also saves the snapshots of a group of microclusters, based on their recency, at various levels of granularity. The information saved can be helpful while processing user-defined clustering queries which are related to the history.

In CluStream, a microcluster is considered as a clustering feature. Consider a group of x-dimensional points P1, P2, ..., Pn with time stamps S1, S2, ..., Sn.
The microcluster for these points can be given as the (2x + 3)-tuple (Mx², Mx¹, Mt², Mt¹, K). Here, Mx² and Mx¹ indicate x-dimensional vectors, while Mt², Mt¹ and K indicate scalars.
- The task of Mx² is to keep the record of the sum of the squares of the data values for every dimension.
- The task of Mx¹ is to keep the record of the sum of the data values for every dimension.
- The task of Mt² is to keep the record of the sum of the squares of the time stamps.
- The task of Mt¹ is to keep the record of the sum of the time stamps.
- The task of K is to keep the record of the number of data points.

The clustering features possess some additive properties with which clusters can be added and deleted. These properties are quite essential for analyzing data stream clusters.

WARNING: Xerox/Photocopying of this book is a CRIMINAL act. Anyone found guilty is LIABLE to face LEGAL proceedings.

On-line Microclustering Process
The microclustering process takes place in the following way. Initially, a number of microclusters are gathered; this number is larger than the number of natural clusters, but small enough to be maintained with the help of the available memory. As new points arrive, the microclusters are upgraded, that is, each new point is either included in a current cluster or leads to the creation of a new cluster. For this decision, a maximum boundary must be specified for every cluster. If the new data point lies within the region of the boundary, then it can be included in that microcluster. If the new data point does not lie in the interior of the boundary, then it can be considered as the initial point of a new cluster. The data points get absorbed by the microclusters, since the microclusters have the additivity property. When a new cluster is created, memory space for it is obtained either by merging the current clusters or by deleting the cluster which was not used recently.

Off-line Macroclustering Process
An off-line component is used for analyzing cluster evolution or performing user-specified macroclustering. Clusters which are too old cannot be added. Microclusters lying in such clusters can be referred to as pseudo points and are used in identifying clusters of a higher level.
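The clustering feature just described can be maintained incrementally. Below is a minimal sketch (the class and method names are illustrative, assuming numeric d-dimensional points) showing why only the (2d + 3) summary values, and never the raw points, need to be stored:

```python
class MicroCluster:
    """CluStream-style clustering feature for d-dimensional points:
    (sum_sq_x, sum_x, sum_sq_t, sum_t, count), a (2d + 3)-tuple."""

    def __init__(self, dim):
        self.sum_sq_x = [0.0] * dim   # per-dimension sum of squares
        self.sum_x = [0.0] * dim      # per-dimension sum
        self.sum_sq_t = 0.0           # sum of squared time stamps
        self.sum_t = 0.0              # sum of time stamps
        self.count = 0                # number of absorbed points

    def absorb(self, point, t):
        """Additivity: statistics are updated in place, so the raw
        point can be discarded immediately after absorption."""
        for i, v in enumerate(point):
            self.sum_sq_x[i] += v * v
            self.sum_x[i] += v
        self.sum_sq_t += t * t
        self.sum_t += t
        self.count += 1

    def centroid(self):
        return [s / self.count for s in self.sum_x]

    def merge(self, other):
        """Two microclusters merge by component-wise addition, which is
        how memory is reclaimed when a new cluster must be created."""
        for i in range(len(self.sum_x)):
            self.sum_sq_x[i] += other.sum_sq_x[i]
            self.sum_x[i] += other.sum_x[i]
        self.sum_sq_t += other.sum_sq_t
        self.sum_t += other.sum_t
        self.count += other.count
```

Subtracting the snapshot of an older microcluster from a newer one works the same way with component-wise subtraction, which is what the tilted time frame snapshots rely on.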
Evolution of Clusters
In order to understand cluster evolution, let us consider the following example.

Example
Let h be the time horizon, and p1 and p2 be two clock times. Now, cluster evolution analysis is used in order to review the changing nature of the data which is being received between (p1 − h, p1) and (p2 − h, p2). This also includes answering various queries, such as whether the clusters present at one particular point of time (p1) are still present at the other time (p2), or whether a few of the original clusters are lost in the process. For this, the net snapshots of the microclusters and the snapshot changes over time can be calculated.

Advantages
Following are the advantages of CluStream,
- It can obtain clusters of high quality, specifically in the cases where the changes are extreme.
- It provides the user with rich functionality, as it maintains the historical information associated with the evolution of clusters.
- It uses the tilted time frame model and the microclustering structure for being accurate and efficient while working on real data.
- It keeps the record of scalability with respect to the size of the stream, its dimensionality and the number of clusters.

5.2 MINING TIME-SERIES DATA

Q22. What is a time-series database? How is it different from a sequence database?
Answer :
Time-series Database
A time-series database is defined as the database which contains a series of data that changes with respect to time, i.e., every data value is associated with a time stamp. Here, time remains unaffected even if changes are made to the data.

Examples of time-series databases include weather conditions, chemical reactions, etc. In a chemical reaction, the chemicals react with each other depending upon the time.

These databases consist of events or value sequences which are acquired by repeatedly measuring at regular time intervals (like quarterly or half-yearly), as in inventory control and stock exchange data.
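The distinction asked about in Q22 can be made concrete with a toy illustration (the data below is invented for the example): a time-series database ties every value to a time stamp, while a sequence database stores ordered events that need not carry time stamps at all.

```python
# Time-series data: values measured at (regular) time intervals.
stock_prices = [
    ("2024-01-01", 101.5),
    ("2024-01-02", 102.1),
    ("2024-01-03", 100.9),
]

# Sequence data: ordered events; actual time stamps may be absent.
customer_sessions = [
    ["login", "search", "add_to_cart", "checkout"],
    ["login", "browse", "logout"],
]

def values(series):
    """Strip the time stamps when only the ordered values matter."""
    return [v for _, v in series]
```

Trend analysis works on numeric series like `stock_prices`, whereas sequential pattern mining works on event sequences like `customer_sessions`.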
Data mining methods can be applied to these databases to extract the behaviour of objects or to perform trend analysis, which helps a decision support system in planning long-term activities. Trend analysis requires different levels of time abstraction.

Sequence Databases
These databases consist of event sequences which may or may not include actual time concepts. For example, biological sequences and customer shopping sequences.

Q23. Explain the four major components of trend analysis for characterizing time-series data.
Answer :
The four major components of trend analysis for characterizing time-series data are as follows,

1. Trend Movements
The trend movements are also known as long-term movements. They specify the direction of a time-series graph over a certain period of time. The trend movement can be represented with the help of a trend curve or a trend line. The trend curve can be determined using two methods, the weighted moving average method and the least squares method.

2. Cyclic Movements
These movements are also known as cyclic variations. They typically refer to the long-term periodic (or non-periodic) oscillations about a trend line or curve. In other words, they refer to the cycles that may (or may not) follow the same sequence after equal time intervals.

3. Seasonal Movements
The seasonal movements are also known as seasonal variations. They refer to the events that occur every year. For example, the increase in the sales of crackers before the Deepawali festival. From the above example, the seasonal movement (increase in the sales of crackers) that occurs during a certain time period (the month of Deepawali) can be identified.

4. Random Movements
These movements are also known as irregular movements. They refer to the changes that occur randomly because of some unplanned events. Example: damages caused by natural disasters and calamities.

Q24. Explain in brief time-series database mining.
Answer : (Model Paper-I, Q10(b))
Time-series Database Mining
For answer refer Unit-V, Q22, Topic: Time-series Database.
The time-series data mining can be performed by the following two approaches,
1. Trend analysis
2. Similarity search.

1. Trend Analysis
For answer refer Unit-V, Q23.

Example
Let the above four movements as well as the time-series be represented by using the variables L, Y, E, R and T respectively. Thus, the modeling of the time-series T can be done either by multiplying or by adding the four variables, i.e.,

T = L × Y × E × R   (or)   T = L + Y + E + R

Such modeling can also be termed as the decomposition of the time series in terms of the four basic movements.

Consider a given set of time-series data values. In order to adjust the given data values with one of the components of trend analysis, i.e., seasonal variations, the influences that occur seasonally must be identified and removed from the time-series. For instance, the seasonal or expected regular fluctuations refer to the higher sales volumes at the time of the Ramadan season. These fluctuations can hide both the seasonal and the nonseasonal characteristics of the data. Thus, the seasonal variations must be identified first and the time series then deseasonalized.

This is done by introducing a concept called the "seasonal index". This index refers to a collection of numbers that represent the values of the variable during the seasons of occurrence. Suppose 20%, 60% and 110% are the average monthly sales for the entire year during the months April, August and September; then the numbers 20, 60 and 110 refer to the seasonal index numbers for the year. In order to adjust the data with the seasonal variations, or to deseasonalize the data, the actual monthly data must be divided by the respective seasonal index numbers.

Seasonal patterns can also be detected by means of correlation analysis, for example by measuring the correlation between the values recorded every third month (i.e., by measuring the correlation between the two corresponding series). Here, the correlation between two series can be determined by computing the correlation coefficient as,

r = [(p1 − P̄)(q1 − Q̄) + (p2 − P̄)(q2 − Q̄) + ... + (pn − P̄)(qn − Q̄)] / (n σP σQ)

The two variables P and Q refer to the random variables that represent the two time-series, (p1, p2, ..., pn) and (q1, q2, ..., qn); P̄ and Q̄ denote their means and σP, σQ their standard deviations.

Case (i): If the resultant value of the correlation coefficient is zero (0), then no correlation relationship between P and Q exists.
Case (ii): If the resultant value of the correlation coefficient is positive (+ve), then a positive correlation relationship between P and Q exists, i.e., both the values of P and Q increase together.
Case (iii): If the resultant value of the correlation coefficient is negative (−ve), then a negative correlation relationship between P and Q exists, i.e., when one variable increases, the other variable decreases.

Thus, the value of the correlation coefficient directly reflects the correlation relationship, i.e., if the value of the correlation coefficient is high (low) and positive (negative), then the correlation relationship between the variables is also high (low) and positive (negative).

In order to determine the trend of the data, consider a method in which a moving average of order m is evaluated as a sequence of arithmetic means,

(y1 + y2 + ... + ym)/m,  (y2 + y3 + ... + ym+1)/m,  (y3 + y4 + ... + ym+2)/m, ...

Thus, by calculating the moving average and replacing the original series with it, the amount of variation existing in the data set can be reduced. This process of removing the unwanted fluctuations or variations is called smoothing of the time series.

Note
Suppose the moving average is calculated as a sequence of weighted arithmetic means; then the resulting sequence is referred to as a weighted moving average of order m.

2. Similarity Search
For answer refer Unit-V, Q25.

Q25. What is similarity search? Explain similarity search in time-series analysis.
Answer :
Similarity search refers to a search that identifies the data sequences that vary only slightly from a given query sequence. For a given collection of time-series sequences X, the similarity search is of two types,

1. Subsequence Matching
Subsequence matching refers to a search that identifies all the data sequences from X that contain subsequences similar to the given query sequence.

2. Whole Sequence Matching
Whole sequence matching refers to a search that identifies all the sequences from the given collection X that, taken as wholes, are similar to one another.

Similarity search in time-series analysis is often used in various applications including medical diagnosis, financial market analysis, engineering databases, etc.

Similarity Search in Time-series Analysis
1. Data Reduction and Transformation Techniques
The time-series data are usually of larger size, and hence time-series analysis involves data reduction as its initial step, to save storage space and to improve the processing speed. The various approaches to data reduction are as follows,
(i) Attribute Subset Selection
It removes those dimensions that are redundant or irrelevant.
(ii) Dimensionality Reduction
It uses signal processing techniques to reduce the data.
(iii) Numerosity Reduction
It replaces the original data by their alternative and smaller representations.

The time-series data are commonly considered as high-dimensional data; hence, dimensionality reduction is crucial. The different techniques for dimensionality reduction of time-series data are as follows,
(a) DFT (Discrete Fourier Transform)
(b) DWT (Discrete Wavelet Transform)
(c) SVD (Singular Value Decomposition)
(d) Sketch techniques based on random projection.

2. Indexing Methods for Similarity Search
After transforming the data, a multidimensional index can be constructed to enable an efficient time-series data search. The index can be generated using the initial coefficients of the transform, and is then used for retrieving all the data sequences that are similar to the query sequence.
In order to perform subsequence matching, the data sequences to be searched are divided into subsequences, each extracted using a sliding window of length w. The indexing method uses the Euclidean distance as the similarity measure. Many different indexing methods are available which speed up the performance of the similarity search. They are R-trees, R*-trees, ε-kdB-trees and suffix trees.

3. Similarity Search Methods
The Euclidean distance is used as a similarity measure in the similarity analysis of time-series data. It involves the computation of the distance between any two time-series data sets. Moreover, the distance and the similarity are inversely proportional to each other (that is, the more the distance between the data sets, the less similar they are). The Euclidean distance should be applied only after thoroughly examining the variation between the baseline and scale values of the two data set sequences.

A normalization transformation can be implemented in order to solve the baseline and scale problem. For any given sequence S = (s1, s2, ..., sn), the normalized sequence S′ = (s′1, s′2, ..., s′n) can be obtained by using the formula,

s′i = (si − μ(S)) / σ(S)

where μ(S) is the mean of S and σ(S) is the standard deviation of S.

The two data sequences must be compared only after transforming both the sequences using the above formula.

The similarity search that allows gaps (or differences in the baseline and scale values) can be implemented by following certain steps,

Step 1: Atomic Matching
It involves normalization of the data and then identifying the gap-free similar windows which are of smaller size.

Step 2: Window Stitching
This step involves stitching (or grouping) of all the similar windows in order to create larger subsequences of similar windows, thereby permitting gaps.

Step 3: Subsequence Ordering
This step involves the linear arrangement of subsequence matches in order to identify all the similar subsequences.
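The normalize-then-compare procedure above can be sketched as follows. This is a simplified illustration with invented function names; a real system would index the transformed windows rather than scan them linearly.

```python
import math

def normalize(seq):
    """Remove baseline (mean) and scale (standard deviation)
    before comparing two sequences."""
    n = len(seq)
    mean = sum(seq) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in seq) / n)
    if std == 0:
        return [0.0] * n
    return [(x - mean) / std for x in seq]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def best_matches(series, query, top=3):
    """Subsequence matching: slide a window of len(query) over the
    series and rank windows by Euclidean distance after normalization."""
    w = len(query)
    q = normalize(query)
    scored = []
    for i in range(len(series) - w + 1):
        window = normalize(series[i:i + w])
        scored.append((euclidean(window, q), i))
    scored.sort()
    return [i for _, i in scored[:top]]
```

Because of the normalization step, a query with a completely different baseline and scale (say values near 100) still matches a window with the same shape at values near 0.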
Query Languages for Time-series
The query languages used in time-series analysis must be capable of performing simple searches (like finding all the sequences similar to a given query) and more complex searches (like identifying all the sequences that are similar to one class but dissimilar to another class). They must also support other kinds of queries, like range queries and nearest-neighbour queries. One of the most important types of time-sequence query languages is the shape definition language, which enables the users to make use of human-readable series of sequence transitions in order to define and query the overall shape of time sequences.

Q26. Explain sequential pattern mining. Describe briefly about scalable methods for mining sequential patterns.
Answer :
For answer refer Unit-II, Q53.

5.3 MINING SEQUENCE PATTERNS IN TRANSACTIONAL DATABASES

Q27. What are the conditions and parameters for sequential pattern mining? Explain in brief the CloSpan sequential pattern mining method.
Answer :
Conditions and Parameters for Sequential Pattern Mining
Some of the conditions and parameters of sequential pattern mining which drastically affect the mining process are as follows,
1. The first parameter describes the duration of the time series; it restricts the mining to a specified amount of time duration.
2. The second parameter describes the event folding window w. That is, when a group of events occurs within a specified time period w, those events can be folded, i.e., treated as occurring together.
3. The third parameter describes the time interval between the events in the discovered pattern.

CloSpan Sequential Pattern Mining Method
CloSpan is a method for effectively mining closed sequential patterns. This mining method is based on a property of sequence databases called the equivalence of projected databases.
The two projected sequence databases S|α and S|β are defined as the sequence database S projected with respect to the sequences α and β, respectively. If the numbers of items in S|α and S|β are equal, then the two projected sequence databases are said to be equivalent. Using this property, CloSpan performs pruning on nonclosed sequences so that they need not be considered while performing frequent subsequence mining. This implies that whenever there exist two exactly similar prefix-based projected databases, there is no need of further mining either of them.

After the complete mining, some nonclosed sequential patterns may still remain, and these have to be pruned by performing an additional checking step. However, this sort of additional checking is not required when implementing the BIDE approach, which avoids it while mining. In contrast to the basic (PrefixSpan-style) approach, CloSpan generates a smaller set of sequential patterns more quickly.

Q28. What is constraint-based sequence mining? In which situations can constraint-based sequence mining be used?
Answer :
The constraint-based mining was introduced in order to overcome the limitations associated with frequent pattern mining. In this mining technique, user-defined constraints are considered while mining sequential patterns. Thus, only the patterns desired by the users are retrieved, thereby reducing the search space. There are different ways in which the constraints can be expressed. The constraints may contain either attribute values, relationships among the attributes, or certain patterns that are to be mined.

Regular expressions can sometimes be used as constraints. These expressions, however, will be in the form of pattern templates, which define the type of patterns that are to be mined. The primary intention behind using constraint-based mining is to enhance the mining efficiency and to improve the interestingness of the patterns (i.e., only the desired patterns will be searched).
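The effect of such constraints can be illustrated with a small sketch. The helpers below are hypothetical (single-item events, with `max_gap` counting the events allowed between consecutive matched events): one checks whether a sequence contains a pattern under an optional gap constraint, and the other computes the pattern's support in a sequence database.

```python
def contains_pattern(sequence, pattern, max_gap=None, start=0):
    """True if `pattern` occurs as a subsequence of `sequence`.
    With max_gap set, at most `max_gap` events may lie between
    consecutive matched events (no constraint before the first match).
    `start` is an internal parameter used during recursion."""
    if not pattern:
        return True
    for j in range(start, len(sequence)):
        if max_gap is not None and start > 0 and j - start > max_gap:
            break  # any later match would only widen the gap
        if sequence[j] == pattern[0]:
            if contains_pattern(sequence, pattern[1:], max_gap, j + 1):
                return True
    return False

def support(database, pattern, max_gap=None):
    """Fraction of sequences in the database that contain the pattern."""
    hits = sum(contains_pattern(s, pattern, max_gap) for s in database)
    return hits / len(database)
```

Note the backtracking: a greedy scan would be wrong under a gap constraint, since a later occurrence of an event may satisfy the constraint where the earliest one does not.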
Some of the typical constraints associated with the mining of sequential patterns are given below,
1. Constraints related to the duration of a sequence
2. Constraints related to the event folding window
3. Constraints related to the gap (in time) between events within the sequential patterns
4. Constraints related to the type of sequential patterns.

1. Constraints Related to the Duration of a Sequence
There are three types of constraints within this category. They are,
(i) Antimonotonic constraints
(ii) Monotonic constraints
(iii) Succinct constraints.

Antimonotonic constraints are the constraints wherein the failure of a particular sequence to satisfy a constraint also has an impact on its super-sequences. It implies that if a sequence cannot satisfy a constraint, then its super-sequences also cannot satisfy that constraint.

Monotonic constraints are just the opposite of antimonotonic constraints. They are the constraints wherein, if a sequence satisfies a constraint, all its super-sequences also satisfy that constraint.

Succinct constraints are the constraints that are related to specific attribute values, such as a particular year. These constraints improve the mining efficiency because the data satisfying them can be identified before the mining starts, thereby eliminating unnecessary candidates.

2. Constraints Related to the Event Folding Window
A group of events occurring within a specified time period (the folding window) can be treated as occurring together for the purpose of sequential pattern analysis.

3. Constraints Related to the Gap between Events
The gap (in terms of time) between the events within a pattern can also act as a constraint. The gap can be set to 0 (i.e., gap = 0), to a fixed value, or to a range of values. In the first case, all the sequential patterns that do not have any gap in between their events are found. In the second case, all the sequential patterns with the exact gap g between their events are found. In the third case, where the gap is defined as a range of values, all the sequential patterns whose events are separated by at least the minimum gap but at most the maximum gap are found.

4. Constraints Related to the Type of Sequential Patterns
In this type of sequential pattern mining, regular expressions are used for providing pattern templates in the form of serial and parallel episodes. A group of events that occurs in a total order is said to be a serial episode. A parallel episode, on the other hand, refers to a group of events whose ordering is trivial.

Q29. Sequential patterns can be mined in a way similar to the mining of association rules. Describe how multilevel and multidimensional sequential patterns can be mined from a transaction database.
Answer :
An example of a multilevel sequential pattern is the following: "A customer who buys a PC will buy Microsoft software within three months". On such a pattern, one can drill down to find a more refined pattern, such as "A customer who buys a Pentium PC will buy Microsoft Office within three months". A drilled-down pattern gives more detailed information than the original sequential pattern.

A multidimensional sequential pattern is a sequence of ordered events together with additional dimensional information (such as customer group, age or location) associated with the patterns. For the given example, a multidimensional interpretation of the pattern gives details such as "A customer of a particular group who buys a PC will buy Microsoft Office within three months". Such patterns give more detailed information than plain sequential patterns. Both multilevel and multidimensional sequential patterns can be mined by extending the PrefixSpan mining algorithm to search for multilevel and multidimensional sequential patterns.

Q30. Explain briefly about periodicity analysis.
Answer :
Periodicity Analysis
It is the mining of periodic patterns in time-series databases. It is performed for searching repetitive patterns that describe the occurrence of events over regular intervals of time. Due to this property, it is applicable in various fields like weather forecasting, daily traffic analysis, everyday power consumption analysis, occurrence of tides and planet trajectory analysis. The time-related data that is used for periodicity analysis can be events or sequences of values which have occurred at regular intervals or at varying intervals of time. The data can be related to the frequency of events or to the usage of an item. Event data is also called categorical data, like the occurrence of a storm or the buying of a computer. Item data is usually numerical data, such as electricity consumption or daily traffic.

A periodic pattern is classified as a full or a partial periodic pattern, based on the coverage of the pattern.
1. Full Periodic Pattern
This is a pattern that takes into account every point in time to be a part of the cyclic behaviour of a time-related sequence. An example of a full periodic pattern is that the growth of a child in a year can be attributed to every day of that year.
2. Partial Periodic Pattern
This is a pattern where only a few points in time contribute to the cyclic behaviour of a time-related sequence. Such a pattern is the daily routine of an individual who does a few specific things at a fixed time every day but has no regularity in his other activities. These patterns can be related to many cyclic behaviours in the real world.

Periodicity patterns can also be classified based on the strictness of time. They are called synchronous and asynchronous patterns. A synchronous periodicity pattern requires an event to occur only at a fixed time in each stable period, while in an asynchronous periodicity pattern the time of the events can be relaxed to some extent. For example, waking up in the morning at 3:00 AM every day is an example of a synchronous periodic pattern.

A pattern can also be classified as precise or approximate; waking up in the morning at 5:00 AM on some days but at 5:15 AM on some other days is an approximate periodic pattern.

Full periodicity analysis on numerical data requires a Fourier transformation of the data, so that the data can be transformed from the time domain to the frequency domain. On the other hand, the periodicity analysis of partial periodic patterns with the same technique is a time-consuming process, because it requires the details of the fixed events to be specified in advance. Therefore, partial periodicity analysis results in the discovery of periodic association rules that associate a set of periodic events, enabling efficient pattern search.

5.4 MINING OBJECT, SPATIAL, MULTIMEDIA, TEXT AND WEB DATA

Q31. Explain about multidimensional analysis and descriptive mining of complex data objects.
Answer :
Scientific research and engineering design are advanced data-intensive applications which require storing, indexing, accessing and manipulating complex data objects. It is a very difficult task to represent these complex objects as simple, consistently structured records. To serve such application requirements, i.e., efficient storage of and access to large amounts of disk-resident, complex structured data objects, database systems have been designed and developed. There are two kinds of such database systems,
1. Object-relational database systems
2. Object-oriented database systems.

In the field of databases, rigorous research is conducted on object-relational and object-oriented database systems to know how efficiently complex objects can be indexed, stored, accessed and manipulated. In these systems, a large group of different data objects is organized into classes. These classes are categorized, in turn, into hierarchies of classes and subclasses.

A class containing objects is associated with,
(i) An identifier for each object (the object identifier).
(ii) A set of attributes that may include sophisticated structures such as set-valued and list-valued data, class composition hierarchies and multimedia data.
(iii) A set of methods which is used to specify the computational routines (or rules) associated with the object class.

A systematic analysis and mining of large, complex structured data objects is referred to as the mining of complex structured data. It consists of two important tasks,
(a) Construction of multidimensional data warehouses for complex object data and performing OLAP operations on them.
(b) Development of efficient methods for performing knowledge discovery, by extracting relevant knowledge from the data warehouses. This is accomplished by extracting particular types of data, including multimedia, spatial and temporal data.

Limitations of Data Warehouse and OLAP Tools
Following are the limitations of data warehouse and OLAP tools,
1. Multidimensional data analysis is confined to using a limited number of data types associated with dimensions and measures.
2. Data cube implementations restrict the dimensions and measures to the following,
- Categorical data
- Measures which are simple aggregated values like count( ) and sum( ).

Q32. What is structured data? How can generalization be performed on structured data?
Answer :
Structured Data
Structured data is a type of data that is stored in a structured format, in accordance with specific data models. This data needs to be restructured in several ways, so as to extract implicit patterns that may not be apparent otherwise.

Generalization of Structured Data
The essential characteristic of object-relational and object-oriented databases is that they have the ability to access and model complex structure-valued data. Structured data can be one of the following,
1. Set-valued data
2. List-valued data
3. Complex structure-valued data.
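Generalization of a set-valued attribute can be illustrated with a small sketch (the concept hierarchy below is an invented example): each low-level value is replaced by its higher-level concept and duplicate concepts collapse, so a hobby set like {tennis, hockey, piano} generalizes to {sports, music}.

```python
# Invented toy concept hierarchy mapping low-level values to concepts.
HIERARCHY = {
    "tennis": "sports", "hockey": "sports", "soccer": "sports",
    "violin": "music", "piano": "music",
}

def generalize_set(values, hierarchy):
    """Generalize a set-valued attribute by climbing the concept
    hierarchy one level and collapsing duplicate concepts.
    Values with no known parent are kept as they are."""
    return sorted({hierarchy.get(v, v) for v in values})

hobbies = {"tennis", "hockey", "piano"}
generalized = generalize_set(hobbies, HIERARCHY)
```

The same idea extends to list-valued data (generalize each element while preserving order) and, recursively, to complex structure-valued data.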
