Conversation with Merlin [email protected] · Tue Nov 21 2023

It works by grouping data objects into a tree of clusters.

It creates a hierarchical decomposition of the given set of data objects.

Instead, a series of partitions takes place, which may run from a single cluster containing all objects to n clusters each containing a single object.

explain in simple words

In simpler terms, this process involves organizing data groups in a tree-like structure. The goal is to break down the data into smaller and smaller groups, creating a sort of family tree of clusters. The process starts with all the data in one big cluster and then gradually separates it into smaller clusters, eventually ending up with as many clusters as there are individual data points.

Agglomerative or bottom-up strategy:

Starts with the points as individual clusters.

At each step, merge the closest pair of clusters until only one cluster remains.

i.e., begins with as many clusters as objects. Clusters are successively merged until only one cluster remains.

This strategy, often referred to as agglomerative or bottom-up, begins by treating each data point as its own cluster. At each step, the two closest clusters are merged, until only one cluster remains. In other words, the process starts with as many clusters as there are individual objects and progressively merges them into a single cluster.
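
To make the merging concrete, here is a minimal bottom-up sketch, assuming SciPy is available; the sample points are made up for illustration. Each row of the output records one merge step: the two clusters joined, the distance between them, and the size of the new cluster.

```python
# A minimal agglomerative (bottom-up) sketch using SciPy;
# the sample points are illustrative only.
import numpy as np
from scipy.cluster.hierarchy import linkage

points = np.array([[0.0, 0.0], [0.1, 0.2],
                   [5.0, 5.0], [5.1, 4.9],
                   [9.0, 0.0]])

# "single" linkage merges the closest pair of clusters at each step.
Z = linkage(points, method="single")
print(Z)  # one row per merge: cluster i, cluster j, distance, new size
```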

Divisive or top-down strategy:

Start with one, all-inclusive cluster.

At each step, split a cluster until each cluster contains a single point.

i.e., begins with all objects in one cluster. Groups are continually divided until there are as many clusters as objects.

Agglomerative – AGNES -> AGglomerative NESting

Divisive – DIANA -> DIvisive ANAlysis

The divisive or top-down strategy operates by initially creating one large, all-encompassing cluster that includes all the data points. Then, at each stage, a cluster is split into smaller clusters until each cluster contains only one data point. Essentially, it starts with all objects in a single cluster and continually divides the groups until there are as many clusters as there are objects. The acronyms AGNES and DIANA stand for AGglomerative NESting and DIvisive ANAlysis, respectively, representing the agglomerative and divisive clustering algorithms.
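
As an illustration, here is a simplified top-down sketch that repeatedly bisects a cluster with 2-means, assuming scikit-learn; note that the actual DIANA algorithm chooses splits based on average dissimilarities rather than k-means, so this is a stand-in, not the real algorithm.

```python
# A simplified divisive (top-down) sketch: repeatedly bisect with 2-means.
# Assumes scikit-learn; real DIANA splits by average dissimilarity instead.
import numpy as np
from sklearn.cluster import KMeans

def divisive(points, depth=0):
    # Stop splitting once a cluster contains a single point.
    if len(points) <= 1:
        print("  " * depth, points.tolist())
        return
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)
    for k in (0, 1):
        divisive(points[labels == k], depth + 1)

divisive(np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]]))
```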

BIRCH: Balanced Iterative Reducing & Clustering using Hierarchies.

BIRCH is designed for clustering large amounts of numerical data by integrating hierarchical clustering with other clustering methods such as iterative partitioning.

BIRCH overcomes two difficulties of agglomerative clustering:

  1. Scalability

  2. Inability to undo what was done in the previous step.

BIRCH's two key concepts:

Clustering Feature (CF): a 3-dimensional vector summarizing information about a cluster of objects.

CF = (n, LS, SS)

where n = number of points in the cluster

LS = linear sum of the n points

SS = square sum of the data points
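
To see why this summary is useful, here is a minimal sketch of building and merging CF vectors, assuming NumPy; the helper names make_cf and merge_cf are illustrative, not from any library. The key property is additivity: merging two sub-clusters just adds their CFs, and statistics like the centroid can be derived from the CF alone.

```python
# A minimal Clustering Feature sketch, assuming NumPy.
# make_cf / merge_cf are illustrative names, not a library API.
import numpy as np

def make_cf(points):
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    ls = pts.sum(axis=0)      # linear sum of the n points
    ss = (pts ** 2).sum()     # square sum of the data points
    return n, ls, ss

def merge_cf(cf1, cf2):
    # CFs are additive: merging two sub-clusters adds their CFs.
    return cf1[0] + cf2[0], cf1[1] + cf2[1], cf1[2] + cf2[2]

n, ls, ss = make_cf([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(n, ls, ss, ls / n)      # the centroid (LS / n) comes from the CF alone
```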

CF Tree:

It is a height-balanced tree that stores the clustering features for a hierarchical clustering.

  • A non-leaf node in the tree has descendants or "children".

  • The non-leaf nodes store the sums of the CFs of their children.

The CF tree has two parameters:

  1. Branching factor: specifies the maximum number of children per non-leaf node.

  2. Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes of the tree.
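
For a concrete feel of these two parameters, here is a minimal usage sketch, assuming scikit-learn's Birch implementation, where threshold and branching_factor map directly to the two CF-tree parameters above; the data are made up.

```python
# A minimal BIRCH usage sketch via scikit-learn; the data are illustrative.
import numpy as np
from sklearn.cluster import Birch

X = np.array([[0.0, 0.0], [0.1, 0.2],
              [5.0, 5.0], [5.1, 4.9],
              [9.0, 0.0], [9.1, 0.2]])

model = Birch(threshold=0.5,        # max diameter of a leaf sub-cluster
              branching_factor=50,  # max children per non-leaf node
              n_clusters=3)         # final global clustering step
print(model.fit_predict(X))
```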

Drawback: handles only numeric data and is sensitive to the order of the data records.

Advantage: BIRCH efficiently clusters even dynamic data sets, and the CF tree is very useful for this.

BIRCH, which stands for Balanced Iterative Reducing & Clustering using Hierarchies, is a method designed to cluster large amounts of numerical data. It achieves this by integrating hierarchical clustering with other methods like iterative partitioning. BIRCH addresses two challenges of agglomerative clustering: scalability and the inability to revert previous steps. It introduces two main concepts: the Clustering Feature (CF), a 3-dimensional vector summarizing information about a cluster of objects, and the CF tree, a height-balanced tree that stores the clustering features for hierarchical clustering. This tree has two parameters, the branching factor and the threshold, which specify the maximum number of children per non-leaf node and the maximum diameter of the sub-clusters stored at the leaf nodes, respectively.

However, it's important to note that BIRCH has limitations: it can only handle numeric data and is sensitive to the order of the data records. On the other hand, it efficiently clusters dynamic datasets, and the CF tree is particularly useful for this.
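
As a final illustration of the dynamic-data point, here is a sketch of incremental clustering, assuming scikit-learn's Birch; its partial_fit method updates the CF tree in place as new batches arrive, and the random batches below stand in for streaming data.

```python
# Incremental BIRCH sketch: the CF tree grows as batches arrive.
# Assumes scikit-learn; random batches stand in for streaming data.
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=None)

for _ in range(3):                       # three arriving batches
    model.partial_fit(rng.random((100, 2)))

print(len(model.subcluster_centers_))    # leaf sub-clusters found so far
```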