Chamomile AI

Towards Balanced Dendrograms in Hierarchical Document Clustering

Background

Given a large corpus of documents, we can create a hierarchical organisation in which documents are arranged by topical relatedness into a tree, or dendrogram. When we tried to construct such a dendrogram for use as a topic routing decision tree, we encountered a problem.

Dendrograms
A dendrogram is a tree-like visual representation of the clusters produced throughout the Hierarchical Agglomerative Clustering (HAC) process. Each node represents a cluster, with its children being elements of that cluster. The leaves are the raw data points themselves, acting as their own single-element clusters. See the right-hand side of Figures 1 and 2.
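As a concrete illustration, the merge history behind a dendrogram can be inspected directly. In SciPy's `scipy.cluster.hierarchy` module, `linkage` returns a matrix with one row per merge; below is a minimal sketch on toy data (the points and seed are our own, purely illustrative):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
points = rng.normal(size=(6, 2))  # six toy "documents" embedded in 2-D

# Each row of Z records one merge:
# [cluster index a, cluster index b, merge distance, size of the new cluster].
Z = linkage(points, method="ward")

# n leaves always yield n - 1 merges; the final row is the root of the tree.
assert Z.shape == (5, 4)
assert Z[-1, 3] == 6.0  # the root cluster contains all six points
```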

Despite working with relatively large datasets (3,000+ articles), we found that HAC typically produced highly unbalanced dendrograms, and hence unbalanced decision trees. In this article, we cover our investigation into this issue and the solution we implemented. We also discuss some recent research relevant to the problem.


The Problem: Unbalanced Dendrograms

When clustering a number of highly varying noisy items (such as article text embeddings) using HAC, we typically expect clusters to be globular — i.e. not exhibiting unique shapes such as rings or lines. We might also expect, if the corpus’s spread of concepts is even enough, for most clusters to be of similar size.

Hierarchical Agglomerative Clustering (HAC)
HAC is a common bottom-up clustering method. It begins by treating each data point as a separate cluster and then repeatedly merges the closest pairs into one cluster until only one cluster remains. We may then “cut” through the dendrogram with a horizontal line and take the nodes directly below the cut to produce the desired number of clusters.
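The "cut" step can be sketched with SciPy's `fcluster`, which flattens a dendrogram into a desired number of clusters (toy data again, our own illustration):

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
# Two well-separated blobs of five points each.
points = np.vstack([
    rng.normal(0.0, 0.1, size=(5, 2)),
    rng.normal(5.0, 0.1, size=(5, 2)),
])

Z = linkage(points, method="ward")
# Cut the dendrogram so that exactly two flat clusters remain.
labels = fcluster(Z, t=2, criterion="maxclust")

# Each blob ends up in its own cluster.
assert len(set(labels[:5])) == 1
assert len(set(labels[5:])) == 1
assert set(labels) == {1, 2}
```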

A natural requirement is for the resulting dendrogram to be balanced. This mitigates the risk of the dendrogram taking on a degenerate tree structure, such as when only one article is connected to the root (Figure 1).
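Balance can also be quantified rather than eyeballed. The diagnostic below is our own, not a standard metric: it compares each leaf's depth to log₂ n, the depth every leaf would have in a perfectly balanced dendrogram:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, to_tree

def leaf_depths(Z):
    """Depth of every leaf in the dendrogram encoded by linkage matrix Z."""
    depths = {}

    def walk(node, depth):
        if node.is_leaf():
            depths[node.id] = depth
        else:
            walk(node.left, depth + 1)
            walk(node.right, depth + 1)

    walk(to_tree(Z), 0)
    return depths

rng = np.random.default_rng(2)
Z = linkage(rng.normal(size=(16, 3)), method="ward")
depths = leaf_depths(Z)

# A perfectly balanced dendrogram over 16 leaves puts every leaf at depth
# log2(16) = 4; large deviations from that indicate imbalance.
assert len(depths) == 16
assert max(depths.values()) >= 4
```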

Figure 1. Unbalanced Dendrogram

We want a roughly balanced dendrogram, with each cluster representing the “concepts” or “super-concepts” of its children (Figure 2). This is especially important when splitting a collection of articles into Mutually Exclusive and Collectively Exhaustive (MECE) sets — for example, deriving n categories for a given set of scholarly readings.

Figure 2. Balanced Dendrogram

Linkage Methods

Before describing our approach, it is worth reviewing the common linkage methods used in HAC:

  • Single Linkage — the minimum distance between any pair of points, one from each cluster.
  • Complete Linkage — the maximum distance between any pair of points, one from each cluster.
  • Centroid Linkage — the distance between the cluster centroids.
  • Ward Linkage — the increase in within-cluster variance after merging.

All of these can be implemented in O(n²) or O(n² log n) time. The distance function itself can also vary; common choices include Euclidean, Manhattan, and Cosine distance (one minus Cosine Similarity).
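In SciPy, the choice of linkage method and distance function looks like the following sketch. Single and Complete Linkage accept any metric via a precomputed condensed distance matrix, while Centroid and Ward Linkage assume Euclidean geometry and so take the raw observations (the embedding matrix here is random stand-in data):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

rng = np.random.default_rng(3)
embeddings = rng.normal(size=(20, 8))  # stand-in for article embeddings

# Single Linkage over cosine distances, supplied as a condensed matrix.
Z_single = linkage(pdist(embeddings, metric="cosine"), method="single")

# Ward Linkage assumes Euclidean geometry, so it takes raw observations.
Z_ward = linkage(embeddings, method="ward")

# Both encode 19 merges over the same 20 leaves.
assert Z_single.shape == Z_ward.shape == (19, 4)
```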

Our Investigation

We first attempted Single Linkage with Cosine Similarity, merging the two clusters with the highest cosine similarity between any two points, one from each cluster. However, Single Linkage has a known tendency towards chaining, producing non-globular clusters, and is sensitive to outliers.

In our experiments, Single Linkage produced highly unbalanced dendrograms (similar to Figure 1), despite no obvious cause upon closer inspection of the data.

We then switched to Ward Linkage (which uses Euclidean distance). Ward Linkage evaluates, for every pair of clusters A and B, the increase in within-cluster variance that merging them would cause, and merges the pair with the minimum increase.
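The Ward merge cost has a well-known closed form: merging clusters A and B increases the within-cluster sum of squares by |A||B| / (|A| + |B|) · ‖μ_A − μ_B‖². A small numerical check of this identity (our own illustration, not code from our pipeline):

```python
import numpy as np

def ess(X):
    """Within-cluster error sum of squares around the cluster mean."""
    return float(((X - X.mean(axis=0)) ** 2).sum())

def ward_increase(A, B):
    """Closed-form increase in ESS caused by merging clusters A and B."""
    n_a, n_b = len(A), len(B)
    diff = A.mean(axis=0) - B.mean(axis=0)
    return n_a * n_b / (n_a + n_b) * float(diff @ diff)

rng = np.random.default_rng(4)
A = rng.normal(size=(5, 3))
B = rng.normal(size=(7, 3))

# Direct computation and closed form agree.
direct = ess(np.vstack([A, B])) - ess(A) - ess(B)
assert np.isclose(direct, ward_increase(A, B))
```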

We found that Ward Linkage drastically reduced the number of catastrophic outliers in our dendrograms and generally produced far more evenly-sized clusters across our dataset.


Recent Research: Chamfer Linkage

While researching even better approaches, we came across a recent paper by Gowda et al. (2026), published 11 February 2026. The paper is motivated by many of the same concerns we cover here, namely the need for balanced dendrograms and accurate conceptual representation.

The paper introduces a novel linkage function called Chamfer Linkage, taking inspiration from the Chamfer Distance used in computer vision. It is defined as the sum of distances from all points in one cluster to the corresponding closest points in the other cluster.
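Based on that description, the Chamfer distance between two clusters can be sketched as follows. We follow the symmetric computer-vision convention here, summing the nearest-neighbour distances in both directions; the paper's exact definition and normalisation may differ:

```python
import numpy as np
from scipy.spatial.distance import cdist

def chamfer(A, B):
    """Sum of each point's distance to its nearest neighbour in the other set."""
    D = cdist(A, B)  # pairwise Euclidean distances, shape (|A|, |B|)
    return float(D.min(axis=1).sum() + D.min(axis=0).sum())

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[0.0, 1.0]])

# A -> B contributes 1 + sqrt(2); B -> A contributes 1.
assert np.isclose(chamfer(A, B), 2.0 + np.sqrt(2.0))
```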

Figure 3. Chamfer Distance definition from Gowda et al. (2026)

Although we have not yet applied Chamfer Linkage in our own work, the paper’s tests suggest it outperforms all other common linkage methods on average in terms of:

  • Adjusted Rand Index (ARI)
  • Normalized Mutual Information (NMI)

Notably, Chamfer Linkage can be implemented in the same asymptotic time complexity, O(n²), as other linkage functions.


Summary

| Linkage Method | Distance Function | Balanced? | Notes |
| --- | --- | --- | --- |
| Single | Cosine Similarity | ✗ | Prone to chaining; sensitive to outliers |
| Ward | Euclidean | ✓ | Our current approach; works well in practice |
| Chamfer | Flexible | ✓ (reported) | Promising; outperforms others in ARI/NMI per Gowda et al. (2026) |

Our recommendation for practitioners dealing with unbalanced dendrograms on noisy text-embedding corpora is to start with Ward Linkage as a robust baseline, and to consider Chamfer Linkage as the research matures.

References

  • Gowda, K. N., Fletcher, W., Bateni, M., Dhulipala, L., Hershkowitz, D. E., Jayaram, R., & Łącki, J. (2026). Chamfer-linkage for hierarchical agglomerative clustering. arXiv.
  • Murtagh, F., & Contreras, P. (2012). Algorithms for hierarchical clustering: an overview. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 2(1), 86–97.