Mapping the Landscape of Deep Learning: Topic Discovery from Book Embeddings

For my Case Studies in Machine Learning course in the Master of Science in Artificial Intelligence program at UT Austin, I mapped hierarchical concept relationships among deep learning terms across a corpus of 158 books.

Link to Full Paper

The project comprised:

  • Creating a dataset of deep learning concept embeddings
  • Extracting a hierarchical topic structure
  • Creating indexes for terms and topics across more than 50,000 book pages

Abstract

This project maps hierarchical concept relationships among terms relevant to deep learning within a corpus of 158 books on the subject. Mistral-7B-Instruct-v0.2 is used to identify relevant terminology and elicit topic names, and an index is built from those topics to the matching terminology in the book pages. The all-mpnet-base-v2 sentence-transformers model generates embeddings of the titles, pages, and topics. These embeddings are reduced in dimension with UMAP and clustered with HDBSCAN to extract a hierarchical topic structure of deep learning concepts. The resulting topic hierarchy provides a structured view of the deep learning landscape. The embedding space and the topic index provide complementary ways to retrieve topic-relevant book content.

Dataset Creation

The dataset comprises 158 books whose titles include 'deep learning', totaling 50,794 pages.

Using a combination of classical NLP and modern language models, I extracted:

  • 6,217 deep learning related topics with embeddings
  • An index mapping topics to 1,902,953 instances within the book pages
  • Embeddings for all book titles, pages, and topics in a shared semantic space

Technical Pipeline

  • Prompting Mistral-7B-Instruct to identify relevant terminology and elicit topic names
  • Generating embeddings for book titles, pages and terms, using the all-mpnet-base-v2 sentence-transformers model
  • Creating an index that links each topic name, and the terms and phrases associated with it, to the specific book pages where they occur
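
To make the indexing step concrete, here is a minimal sketch of a topic index in plain Python. It is not the project's actual implementation: the function name `build_topic_index`, the sample pages, and the topic-to-term mapping are all hypothetical, and simple case-insensitive substring matching stands in for whatever matching the real pipeline used.

```python
from collections import defaultdict


def build_topic_index(pages, topic_terms):
    """Map each topic to the (book_id, page_no, term) places where its terms occur.

    pages: iterable of (book_id, page_no, text) tuples
    topic_terms: dict of topic name -> list of terms or phrases for that topic
    Matching is naive case-insensitive substring search (an assumption).
    """
    index = defaultdict(list)
    for book_id, page_no, text in pages:
        lowered = text.lower()
        for topic, terms in topic_terms.items():
            for term in terms:
                if term.lower() in lowered:
                    index[topic].append((book_id, page_no, term))
    return dict(index)


# Hypothetical sample data, for illustration only.
pages = [
    ("book-a", 12, "Backpropagation applies the chain rule layer by layer."),
    ("book-a", 13, "Dropout is a common regularization technique."),
    ("book-b", 4, "Weight decay and backpropagation are covered here."),
]
topic_terms = {
    "Backpropagation": ["backpropagation", "chain rule"],
    "Regularization": ["dropout", "weight decay"],
}

topic_index = build_topic_index(pages, topic_terms)
for topic, hits in topic_index.items():
    print(topic, hits)
```

Each entry records one term occurrence on one page, so a topic that appears through several of its terms on the same page is listed once per term.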

Topic Hierarchy Extraction

The analysis combines several techniques to extract meaningful topic structure:

Dimensionality Reduction

  • Applying UMAP on the embeddings of titles, pages and topics in a combined space
  • Combining the three types of embeddings into one space yielded cleaner clustering than running UMAP on each type separately
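
The combined-space idea can be sketched as follows: stack the title, page, and topic embeddings into one matrix, remember which row came from which source, and fit a single UMAP model on everything. The helper names, the toy vectors, and the UMAP parameters (`n_components=2`, `metric="cosine"`, `random_state=42`) are assumptions for illustration, not the settings used in the project. The UMAP call needs the `umap-learn` and `numpy` packages, so it is kept inside a function that is only run when those are installed.

```python
def stack_embeddings(titles, pages, topics):
    """Concatenate the three embedding sets and record each row's origin."""
    rows, kinds = [], []
    for kind, vectors in (("title", titles), ("page", pages), ("topic", topics)):
        for vec in vectors:
            rows.append(list(vec))
            kinds.append(kind)
    return rows, kinds


def reduce_combined(rows, n_components=2, seed=42):
    """Fit one UMAP model on all rows so every item lands in the same space.

    Requires umap-learn and numpy; the parameter values are illustrative.
    """
    import numpy as np
    import umap

    reducer = umap.UMAP(n_components=n_components, metric="cosine", random_state=seed)
    return reducer.fit_transform(np.asarray(rows, dtype=float))


# Toy 3-dimensional vectors standing in for real sentence embeddings.
titles = [[0.1, 0.9, 0.0]]
pages = [[0.2, 0.8, 0.1], [0.9, 0.1, 0.0]]
topics = [[0.0, 1.0, 0.0]]

rows, kinds = stack_embeddings(titles, pages, topics)
print(len(rows), kinds)
```

With a realistically sized dataset, `reduce_combined(rows)` would return one low-dimensional coordinate per row, and `kinds` lets titles, pages, and topics be separated again for plotting or clustering.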

Figure: Combined UMAP embedding space

Hierarchical Clustering

  • Applying HDBSCAN at multiple granularity levels to identify topic clusters
  • Extracting parent-child relationships between clusters at different scales
  • Naming clusters based on their most central embeddings
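
The last two steps can be sketched in plain Python. The write-up does not say how parents were assigned or how centrality was measured, so both choices here are assumptions: a fine cluster's parent is the coarse cluster that contains most of its points, and the most central item is the one closest (Euclidean distance) to the cluster centroid. The function names and toy data are hypothetical. The label -1 is treated as noise, following HDBSCAN's convention.

```python
import math
from collections import Counter, defaultdict


def parent_of_each_cluster(fine_labels, coarse_labels):
    """For each fine cluster, pick the coarse cluster holding most of its points."""
    votes = defaultdict(Counter)
    for fine, coarse in zip(fine_labels, coarse_labels):
        if fine == -1 or coarse == -1:  # skip points labeled as noise
            continue
        votes[fine][coarse] += 1
    return {fine: counts.most_common(1)[0][0] for fine, counts in votes.items()}


def most_central_name(vectors, names):
    """Return the name whose vector lies closest to the cluster centroid."""
    dim = len(vectors[0])
    centroid = [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]
    distances = [math.dist(v, centroid) for v in vectors]
    return names[distances.index(min(distances))]


# Toy labels for six points clustered at two granularity levels.
fine_labels = [0, 0, 1, 1, 2, -1]
coarse_labels = [0, 0, 0, 0, 1, 1]
parents = parent_of_each_cluster(fine_labels, coarse_labels)
print(parents)

# Toy 2-dimensional vectors for the members of one cluster.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.1]]
names = ["attention", "transformers", "self-attention"]
cluster_name = most_central_name(vectors, names)
print(cluster_name)
```

Majority overlap is easy to reason about but can misassign a fine cluster that straddles two coarse ones; comparing the winning vote share against a threshold would be one way to flag such cases.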

Figure: HDBSCAN clustering at three granularity levels (400, 69, and 13 clusters)

Key Findings

The project yielded several findings:

  • Using a combined embedding space across titles, pages, and topics led to clearer concept separation
  • UMAP captured more meaningful global structure than t-SNE while preserving local relationships
  • HDBSCAN produced more coherent clusters than DBSCAN or k-means

Applications

This hierarchical topic mapping enables:

  • Enhanced library search and recommendation systems
  • Exploration of concept relationships in deep learning
  • Improved retrieval-augmented generation for language models
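
As an illustration of the retrieval use case, the sketch below ranks pages by cosine similarity to a query embedding. In the project, the vectors would come from the all-mpnet-base-v2 model; here they are made-up 2-dimensional values, and the function names and page keys are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_pages(query_vec, page_vecs, k=2):
    """Return the k page keys whose embeddings are most similar to the query.

    page_vecs: dict of (book_id, page_no) -> embedding vector
    """
    ranked = sorted(
        page_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [key for key, _ in ranked[:k]]


# Toy embeddings, for illustration only.
page_vecs = {
    ("book-a", 12): [0.9, 0.1],
    ("book-a", 13): [0.1, 0.9],
    ("book-b", 4): [0.7, 0.7],
}
query_vec = [1.0, 0.0]

results = top_pages(query_vec, page_vecs, k=2)
print(results)
```

A topic index lookup and an embedding search like this one can be combined: the index narrows candidates to pages that mention a topic's terms, and similarity ranking orders them.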