# Data Mining and Data Warehousing


Q1. Explain why Clustering is called "Unsupervised Learning" while Classification is called "Supervised Learning". Give three applications of Cluster Analysis and give an example of each.
Clustering is called unsupervised learning because the class labels of the input examples are not given. Clustering is the task of grouping a set of data objects so that objects in the same group are more similar to one another than to objects in other groups; the algorithm must discover the classes within the data on its own. For example, an unsupervised learner can take a set of handwritten digit images as input and group them without being told which digit each image shows.
Classification is called supervised learning because the learning is supervised by labeled examples in a training data set. For instance, in postal-code recognition, a set of handwritten postal-code images together with their corresponding machine-readable translations is used as the training examples.
Three applications of cluster analysis:
i. Discretization of a numeric attribute: a clustering algorithm can partition the values of the attribute into groups. Because clustering takes the distribution and closeness of the data points into consideration, it produces high-quality discretization results; for example, customer ages can be grouped into intervals such as young, middle-aged, and senior.
ii. Concept hierarchy generation: clustering can build a concept hierarchy for a set of data by either a top-down splitting strategy or a bottom-up merging strategy; for example, products can be merged into categories and categories into departments.
iii. Customer segmentation: clustering groups customers with similar behavior; for example, a retailer can cluster customers by purchase history and target each segment with different promotions.
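As a concrete illustration of the discretization application, the sketch below clusters a one-dimensional numeric attribute with a minimal k-means. The age values, the choice of k = 3, and the helper name `kmeans_1d` are illustrative assumptions, not anything prescribed by the question.

```python
# Minimal sketch: discretizing a numeric attribute by clustering its values.

def kmeans_1d(values, k, iters=20):
    """Cluster 1-D values into k groups; returns (centroids, value groups)."""
    vals = sorted(values)
    # Spread the initial centroids evenly across the sorted values (assumes k >= 2).
    centroids = [vals[i * (len(vals) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        groups = [[] for _ in range(k)]
        for v in vals:
            idx = min(range(k), key=lambda i: abs(v - centroids[i]))
            groups[idx].append(v)
        # Update step: move each centroid to its group mean (keep it if empty).
        centroids = [sum(g) / len(g) if g else centroids[i]
                     for i, g in enumerate(groups)]
    return sorted(centroids), [sorted(g) for g in groups if g]

# Example: ages fall naturally into "young", "middle-aged", "senior" intervals.
ages = [19, 21, 22, 24, 40, 42, 43, 45, 65, 66, 70, 72]
centroids, intervals = kmeans_1d(ages, k=3)
print(intervals)  # → [[19, 21, 22, 24], [40, 42, 43, 45], [65, 66, 70, 72]]
```

Each resulting value group becomes one discrete interval of the attribute, with cut points placed between adjacent groups.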

Q2. (a) What are the strength and weakness of the k-Means Clustering Partitioning Method?
Strengths of the k-means clustering partitioning method: it is simple and easy to implement; it is relatively efficient, with time complexity roughly O(nkt) for n objects, k clusters, and t iterations, so it scales to large data sets; and it usually terminates at a local optimum that is good enough in practice. Weaknesses: the number of clusters k must be specified in advance; it applies only when a mean is defined, so it cannot handle categorical data directly; it is sensitive to the initial choice of centroids and to noise and outliers; and it is not suited to discovering clusters with non-convex shapes or very different sizes. In high-dimensional data, finding meaningful clusters is especially difficult and tedious, which makes the basic approach less realistic for such applications.
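One of the weaknesses above, sensitivity to the initial centroids, can be demonstrated with a tiny sketch: k-means minimizes the sum of squared errors (SSE) only locally, so two different starting points on the same data can settle in very different solutions. The four points and both initializations below are illustrative assumptions.

```python
# Sketch: k-means converges to a local, not necessarily global, minimum
# of the SSE objective; the result depends on the initial centroids.

def kmeans(points, centroids, iters=10):
    """Lloyd's algorithm with fixed initial centroids; returns (centroids, sse)."""
    def d2(p, c):  # squared Euclidean distance
        return sum((a - b) ** 2 for a, b in zip(p, c))
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            clusters[min(range(len(centroids)),
                         key=lambda i: d2(p, centroids[i]))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        centroids = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    sse = sum(min(d2(p, c) for c in centroids) for p in points)
    return centroids, sse

points = [(0, 0), (0, 1), (10, 0), (10, 1)]          # two obvious pairs
_, sse_good = kmeans(points, [(0, 0.5), (10, 0.5)])  # one centroid per pair
_, sse_bad = kmeans(points, [(0, 0), (0, 1)])        # both start in one pair
print(sse_good, sse_bad)  # → 1.0 100.0
```

The bad initialization converges to centroids that each straddle both pairs, a local minimum with 100 times the error of the good one, which is why practical implementations use multiple restarts or careful seeding.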
(b) What are the clustering methods that can be used with Numerical, categorical and mix data?
i. Fuzzy clustering, which can handle numerical, categorical, and mixed data through an appropriate membership function.
ii. Partitional clustering, e.g., k-means for numerical data, k-modes for categorical data, and k-prototypes for mixed data.

---
Q3. What is the difference between Single level Partition based clustering method vs. Hierarchical Clustering in terms of basic concept, strength and weakness?
A single-level partition-based method divides the data into one level of non-overlapping clusters, so each object belongs to exactly one cluster, whereas a hierarchical method produces a set of nested clusters organized as a tree. Partitional methods such as k-means are efficient, but the number of clusters must be fixed in advance; hierarchical methods do not require the number of clusters beforehand and the tree can be cut at any level, but they are more expensive (typically at least quadratic in the number of objects), and a merge or split, once made, cannot be undone. Both are good for clustering collections of objects, but neither by itself identifies meaningful subsets automatically.
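The nesting produced by a hierarchical method can be made concrete with a small bottom-up (agglomerative) sketch using single-link distance, where the distance between two clusters is the distance between their closest members. The 1-D points and the helper name `single_link_merges` are illustrative assumptions.

```python
# Sketch: agglomerative (bottom-up) hierarchical clustering with single link.
# Each level of the returned history is a valid partition; together the
# levels form the nested tree (dendrogram) that partitional methods lack.

def single_link_merges(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [[p] for p in points]
    history = [[list(c) for c in clusters]]
    while len(clusters) > 1:
        # Single link: cluster distance = distance of the closest member pair.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        merged = sorted(clusters[i] + clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
        history.append([sorted(c) for c in clusters])
    return history

for level in single_link_merges([1, 2, 9, 10, 25]):
    print(level)  # partitions from 5 singletons down to one all-inclusive cluster
```

Cutting the history at any level yields a flat partition, which is how a hierarchical result can be reduced to a single-level one when needed.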
---
Q4. (a) What do we aim for to have a good quality clustering in terms of Cohesiveness, and Distinctiveness?
A good-quality clustering aims for high cohesiveness, meaning the objects within each cluster are as similar (close) to one another as possible, and high distinctiveness, meaning the clusters themselves are as dissimilar (well separated) from one another as possible. Cohesiveness is typically measured by intra-cluster distances, and distinctiveness by inter-cluster distances.

(b) List and briefly describe the three types of clustering measures of quality.
Minkowski metric, which calculates the distance between two items x and y by comparing the values of their features. It can be applied to frequencies, binary values, and probabilities.
Kullback–Leibler divergence, an information-theoretic measure that determines the inefficiency of assuming a model distribution given the true distribution.

Cosine similarity, which measures the similarity of two objects x and y by taking the cosine of the angle between their feature vectors. The degree of similarity ranges from -1 for the highest degree of dissimilarity (vector angle of 180°), through 0 for orthogonal vectors (angle of 90°), to 1 for the highest degree of similarity (angle of 0°).
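The three measures described above (Minkowski distance, KL divergence, cosine similarity) can be sketched in a few lines; the example vectors and distributions are illustrative assumptions.

```python
# Minimal sketches of the three quality/similarity measures.
import math

def minkowski(x, y, p):
    """Minkowski distance; p=1 gives Manhattan, p=2 gives Euclidean."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    """Cosine of the angle between feature vectors x and y, in [-1, 1]."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def kl_divergence(p, q):
    """KL divergence D(p || q) of discrete distributions, in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

print(minkowski((0, 0), (3, 4), 2))           # → 5.0 (Euclidean)
print(cosine_similarity((1, 0), (0, 1)))      # → 0.0 (orthogonal, 90°)
print(kl_divergence((0.5, 0.5), (0.9, 0.1)))  # positive: the model is inefficient
```

Note that KL divergence is zero only when the two distributions are identical, matching its reading as the inefficiency of an assumed model.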

i. Fuzzy clustering: each object belongs to every cluster with a membership weight between 0 (absolutely does not belong) and 1 (absolutely belongs).
ii. Hierarchical clustering: a set of nested clusters organized as a tree.
iii. Overlapping clustering: reflects the fact that an object can simultaneously belong to more than one group.

---

Q5. Many partitional clustering algorithms that automatically determine the number of clusters claim that this is an advantage. List two situations in which this is not the case.
i. Where numerical values are involved
ii. In cases where multiclass values are involved
Q6. Suppose we find K clusters using Ward’s method, bisecting K-means, and ordinary K-means. Which of these solutions represents a local or global minimum? Explain.
Bisecting K-means. It is a straightforward extension of ordinary K-means based on a simple idea: by repeatedly bisecting a cluster and keeping the best split, it is less susceptible to initialization problems, so its solution tends to be closer to the global minimum, although like the other methods it still only guarantees a local minimum of the SSE objective.

Q7. (a) Define the following terms:
i. Geodesic distance: the distance between two vertices in a graph, measured as the number of edges in the shortest path linking them.
ii. Eccentricity: the eccentricity e(v) of a vertex v is the maximum geodesic distance from v to any other vertex in the graph.
iii. Radius: the minimum eccentricity over all vertices of the graph.
iv. Diameter: the maximum eccentricity over all vertices, i.e., the length of the longest shortest path between any two vertices.
v. Peripheral vertex: a vertex v of G such that e(v) equals the diameter of G.
[Figure: graph G with vertices A, B, C, D, E, F; the drawing is not recoverable from the source.]
(b) Based on geodesic distance, consider graph G in the figure above and calculate the following:

i. Eccentricity = B
iii. Diameter = A-D
iv. peripheral vertex = E-D
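Since the figure for graph G did not survive, the sketch below computes the same geodesic measures with breadth-first search on a hypothetical six-vertex path graph (a stand-in, not the graph from the question); the adjacency list and helper name `geodesic_distances` are assumptions.

```python
# Sketch: eccentricity, radius, diameter, and peripheral vertices via BFS.
from collections import deque

graph = {  # hypothetical path graph A-B-C-D-E-F, NOT the figure's graph G
    "A": ["B"], "B": ["A", "C"], "C": ["B", "D"],
    "D": ["C", "E"], "E": ["D", "F"], "F": ["E"],
}

def geodesic_distances(g, src):
    """BFS from src; returns the edge-count shortest distance to every vertex."""
    dist = {src: 0}
    q = deque([src])
    while q:
        v = q.popleft()
        for w in g[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                q.append(w)
    return dist

# e(v): maximum geodesic distance from v to any other vertex.
ecc = {v: max(geodesic_distances(graph, v).values()) for v in graph}
radius = min(ecc.values())     # minimum eccentricity
diameter = max(ecc.values())   # maximum eccentricity
peripheral = [v for v, e in ecc.items() if e == diameter]
print(ecc, radius, diameter, peripheral)
# → ecc A:5 B:4 C:3 D:3 E:4 F:5, radius 3, diameter 5, peripheral ['A', 'F']
```

The endpoints of the path are the peripheral vertices because their eccentricity equals the diameter, matching the definitions in part (a).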

Q8. What are the challenges in Graph Clustering?
The challenge of this method is that the objects involved must be connected in the graph; objects that lie a long distance apart in the graph cannot be placed in the same cluster under this method.

---