Mapping the Landscape of Deep Learning: Topic Discovery from Book Embeddings
As a project for my Case Studies in Machine Learning course in the Master of Science in Artificial Intelligence program at UT Austin, I mapped hierarchical concept relationships among deep learning terms found in a corpus of 158 books.
The project comprised:
- Creating a dataset of deep learning concept embeddings
- Extracting a hierarchical topic structure
- Creating indexes for terms and topics across more than 50,000 book pages
Abstract
This project maps hierarchical concept relationships among terms relevant to deep learning within a corpus of 158 deep learning books. Mistral-7B-Instruct-v0.2 is used to judge the relevance of terminology and to elicit topic names, and an index is built from those topics to the matching terminology in the book pages. The all-mpnet-base-v2 sentence-transformers model is used to generate embeddings of the titles, pages, and topics. The embeddings are then reduced in dimension with UMAP and clustered with HDBSCAN to extract a hierarchical topic structure of deep learning concepts. The resulting topic hierarchy provides a structured view of the deep learning landscape, and the embedding space and topic index offer complementary ways to retrieve topic-relevant book content.
Dataset Creation
The dataset comprises 158 books whose titles include 'deep learning', totaling 50,794 pages.
Using a combination of classical NLP and modern language models, I extracted:
- 6,217 deep learning related topics with embeddings
- An index mapping topics to 1,902,953 instances within the book pages
- Embeddings for all book titles, pages, and topics in a shared semantic space
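To make the "shared semantic space" idea concrete, here is a minimal sketch of encoding titles, pages, and topics with one model. It is illustrative rather than the project's actual code: the `Item` class and the `collect_items` and `embed_items` helpers are hypothetical names, and it assumes the `sentence-transformers` package is installed. Long pages may be truncated by the model's maximum sequence length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    kind: str  # "title", "page", or "topic"
    text: str


def collect_items(titles, pages, topics):
    """Flatten the three text sources into one list, remembering each item's kind."""
    items = [Item("title", t) for t in titles]
    items += [Item("page", p) for p in pages]
    items += [Item("topic", t) for t in topics]
    return items


def embed_items(items, model_name="sentence-transformers/all-mpnet-base-v2"):
    """Encode every item with the same model so all vectors share one space."""
    # Imported here so collect_items can be used without the heavy dependency.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)
    # normalize_embeddings=True makes dot products equal cosine similarities.
    return model.encode([item.text for item in items], normalize_embeddings=True)
```

Because every text goes through the same encoder, a topic vector can be compared directly with a page or title vector.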
Technical Pipeline
- Prompting Mistral-7B-Instruct to identify relevant terminology and elicit topic names
- Generating embeddings for book titles, pages and terms, using the all-mpnet-base-v2 sentence-transformers model
- Creating a comprehensive index linking topic names and the specific terms and phrases to specific book pages
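The indexing step can be sketched as follows. This is an assumption-laden illustration, not the project's implementation: the `build_topic_index` function, the `(book_id, page_no, text)` page layout, and the topic-to-terms dictionary are all hypothetical, and matching is done with simple case-insensitive whole-word search.

```python
import re
from collections import defaultdict


def build_topic_index(pages, topic_terms):
    """Map each topic to the (book_id, page_no, term) instances where its terms occur.

    pages: iterable of (book_id, page_no, text) tuples (hypothetical layout)
    topic_terms: dict of topic name -> list of terms or phrases
    """
    # Compile one case-insensitive, whole-word pattern per (topic, term) pair.
    patterns = {
        (topic, term): re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
        for topic, terms in topic_terms.items()
        for term in terms
    }
    index = defaultdict(list)
    for book_id, page_no, text in pages:
        for (topic, term), pattern in patterns.items():
            # Record one instance per occurrence of the term on the page.
            for _ in pattern.finditer(text):
                index[topic].append((book_id, page_no, term))
    return dict(index)
```

Looking up a topic then returns every page location where one of its terms appears, which is what makes the index usable for retrieval alongside the embeddings.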
Topic Hierarchy Extraction
The analysis combines several techniques to extract meaningful topic structure:
Dimensionality Reduction
- Applying UMAP on the embeddings of titles, pages and topics in a combined space
- Combining the three types of embeddings into one space yielded cleaner clustering than running UMAP on each type separately
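A minimal sketch of the combined-space reduction, assuming the `umap-learn` and `numpy` packages. The helper names, the `n_components=5` setting, the cosine metric, and the seed are assumptions for illustration, not values reported by the project.

```python
def group_slices(n_titles, n_pages, n_topics):
    """Row ranges of each embedding type inside the stacked matrix."""
    bounds = {}
    start = 0
    for name, count in (("titles", n_titles), ("pages", n_pages), ("topics", n_topics)):
        bounds[name] = slice(start, start + count)
        start += count
    return bounds


def reduce_combined(title_emb, page_emb, topic_emb, n_components=5, seed=42):
    """Fit a single UMAP model on all three embedding types stacked together."""
    # Imported lazily so group_slices works without numpy or umap-learn installed.
    import numpy as np
    import umap

    stacked = np.vstack([title_emb, page_emb, topic_emb])
    reducer = umap.UMAP(
        n_components=n_components,  # assumed value, not taken from the project
        n_neighbors=15,             # umap-learn's default
        metric="cosine",            # assumed; suits sentence embeddings
        random_state=seed,
    )
    reduced = reducer.fit_transform(stacked)
    parts = group_slices(len(title_emb), len(page_emb), len(topic_emb))
    # Split the reduced matrix back into its three groups.
    return {name: reduced[rows] for name, rows in parts.items()}
```

Fitting one reducer on the stacked matrix means titles, pages, and topics are laid out relative to each other, rather than in three unrelated low-dimensional spaces.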
Hierarchical Clustering
- Applying HDBSCAN at multiple granularity levels to identify topic clusters
- Extracting parent-child relationships between clusters at different scales
- Naming clusters based on their most central embeddings
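The three steps above can be sketched as follows. The write-up does not say how parent-child links were derived, so the majority-vote rule in `link_parents` is an assumption, as are the function names and the `min_cluster_size` values. The clustering call assumes the `hdbscan` package.

```python
import math
from collections import Counter, defaultdict


def cluster_levels(points, min_cluster_sizes=(50, 10)):
    """Run HDBSCAN once per min_cluster_size, from coarse to fine."""
    # Imported lazily so the helpers below run without the hdbscan package.
    import hdbscan

    return [
        hdbscan.HDBSCAN(min_cluster_size=size).fit_predict(points)
        for size in min_cluster_sizes
    ]


def link_parents(coarse_labels, fine_labels):
    """Assign each fine cluster the coarse cluster holding most of its points."""
    votes = defaultdict(Counter)
    for coarse, fine in zip(coarse_labels, fine_labels):
        if coarse != -1 and fine != -1:  # -1 is HDBSCAN's noise label
            votes[fine][coarse] += 1
    return {fine: counts.most_common(1)[0][0] for fine, counts in votes.items()}


def name_cluster(labels, points, names, cluster_id):
    """Name a cluster after the member nearest to the cluster centroid."""
    members = [i for i, label in enumerate(labels) if label == cluster_id]
    dims = len(points[members[0]])
    centroid = [sum(points[i][d] for i in members) / len(members) for d in range(dims)]
    nearest = min(members, key=lambda i: math.dist(points[i], centroid))
    return names[nearest]
```

Here `names` would hold the text attached to each point, for example a topic name, so the cluster takes the name of its most central member.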
Key Findings
The project demonstrated several important insights:
- Using a combined embedding space across titles, pages, and topics led to clearer concept separation
- UMAP captured more meaningful global structure than t-SNE while preserving local relationships
- HDBSCAN produced more coherent clusters than DBSCAN or k-means
Applications
This hierarchical topic mapping enables:
- Enhanced library search and recommendation systems
- Exploration of concept relationships in deep learning
- Improved retrieval-augmented generation for language models
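As a rough illustration of embedding-based retrieval, the sketch below ranks pages by cosine similarity to a query vector. It uses plain Python for clarity; the `cosine` and `top_pages` helpers and the page-id dictionary are hypothetical, and a real system over roughly 50,000 pages would more likely use a vectorized or approximate nearest-neighbor search.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_pages(query_vec, page_vecs, k=3):
    """Return the k page ids whose embeddings are most similar to the query."""
    ranked = sorted(
        page_vecs,
        key=lambda page_id: cosine(query_vec, page_vecs[page_id]),
        reverse=True,
    )
    return ranked[:k]
```

The query vector would come from encoding a question or topic name with the same model used for the pages, so that similarities are meaningful.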