Free SERP-Based Keyword Clustering Tool for Large Keyword Datasets
I built this keyword clustering tool for my own needs. It groups keywords by SERP similarity using MinHash and LSH (Locality-Sensitive Hashing). URLs do not need to be vectorized: treating each SERP as a set of URLs is enough to apply set-based similarity measures. My first clusterer compared sets directly with the Jaccard index; MinHash and LSH approximate the same measure but scale much better.
Why MinHash and LSH?
MinHash and LSH make it practical to cluster keywords by SERP similarity at scale. Compared with computing the exact Jaccard index for every pair of keywords, they offer several advantages:
MinHash and LSH vs. Jaccard Index
Efficiency and Scalability
- MinHash: Efficiently approximates the Jaccard similarity between sets, making it suitable for large datasets. Instead of comparing full sets, MinHash reduces each set to a compact, fixed-length signature of hash values, so every comparison costs the same regardless of set size.
- LSH (Locality-Sensitive Hashing): Allows for quick identification of similar items by grouping similar hashes into buckets. This enables fast querying of similar sets without exhaustive comparisons.
- Jaccard Index: Calculates the exact similarity by directly comparing sets, which becomes computationally expensive as the dataset grows.
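To make the relationship between the two measures concrete, here is a minimal pure-Python sketch that computes the exact Jaccard index for two SERPs and then estimates it from MinHash signatures. It is illustrative only and is not the tool's code (the tool relies on the datasketch library); the SHA-1 based hash family, the signature length of 256, and the example URLs are all assumptions made for this example.

```python
import hashlib


def jaccard(a: set, b: set) -> float:
    """Exact Jaccard index: size of the intersection divided by size of the union."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def _hash(seed: int, item: str) -> int:
    """One member of a seeded hash family, built from SHA-1 (illustrative choice)."""
    digest = hashlib.sha1(f"{seed}:{item}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def minhash_signature(items: set, num_perm: int = 256) -> list:
    """Keep the minimum hash value per seed. `items` must be non-empty."""
    return [min(_hash(seed, item) for item in items) for seed in range(num_perm)]


def estimate_jaccard(sig_a: list, sig_b: list) -> float:
    """The fraction of matching signature positions approximates the Jaccard index."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


# Hypothetical SERPs: the top URLs returned for two keywords.
serp_a = {f"https://example.com/page-{i}" for i in range(0, 10)}
serp_b = {f"https://example.com/page-{i}" for i in range(5, 15)}

exact = jaccard(serp_a, serp_b)  # 5 shared URLs out of 15 distinct ones
approx = estimate_jaccard(minhash_signature(serp_a), minhash_signature(serp_b))
print(f"exact={exact:.3f} approx={approx:.3f}")
```

The estimate gets closer to the exact value as the signature grows, at the cost of more hashing per keyword.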
Approximate Similarity
- MinHash and LSH: Provide approximate similarity measures that are highly efficient, making them ideal for large-scale clustering tasks where exact similarity is less critical than speed and resource usage.
- Jaccard Index: Offers exact similarity but at the cost of increased computational resources, limiting its practicality for very large datasets.
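What "approximate" means for LSH can be sketched with the standard banding formula. MinHash LSH is commonly implemented by splitting each signature into bands of a few rows each; two sets become a candidate pair if at least one band matches exactly. The function below shows the resulting probability for a given Jaccard similarity. The values 32 bands of 4 rows are assumptions chosen for illustration, not the tool's settings.

```python
def candidate_probability(similarity: float, bands: int, rows: int) -> float:
    """Probability that two sets with the given Jaccard similarity share at
    least one LSH bucket when signatures are split into `bands` bands of
    `rows` hash values each."""
    return 1.0 - (1.0 - similarity ** rows) ** bands


# Hypothetical configuration: 32 bands x 4 rows = 128 hash values per signature.
for s in (0.2, 0.4, 0.6, 0.8):
    p = candidate_probability(s, bands=32, rows=4)
    print(f"similarity={s:.1f} -> P(candidate)={p:.3f}")
```

The curve is S-shaped: pairs well above the threshold are almost always found, pairs well below it are almost always skipped, and only pairs near the threshold are occasionally misclassified.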
High Performance with Extensive Datasets
- MinHash and LSH: Handle large volumes of data without significant performance degradation, making them suitable for datasets exceeding 500,000 keywords.
- Jaccard Index: Struggles with performance as the dataset size increases due to the need for direct comparison of all pairs.
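The cost of comparing all pairs is easy to quantify: n keywords produce n(n-1)/2 pairs. A short calculation shows how quickly this grows.

```python
def pair_count(n: int) -> int:
    """Number of unordered keyword pairs an exhaustive comparison must check."""
    return n * (n - 1) // 2


for n in (1_000, 100_000, 500_000):
    print(f"{n:>7,} keywords -> {pair_count(n):>15,} pairwise comparisons")
```

At 500,000 keywords this is roughly 125 billion comparisons, which is why an exhaustive Jaccard pass stops being practical.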
Resource Efficiency
- MinHash and LSH: Utilize memory and computational resources efficiently, allowing for clustering of large datasets with minimal overhead.
- Jaccard Index: Requires substantial computational power for large datasets due to direct pairwise comparisons.
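The bucketing idea behind LSH can be illustrated with a toy example. Each signature is cut into bands, every band is used as a bucket key, and only keywords that land in the same bucket are ever compared. The signatures below are hand-written placeholder values rather than real MinHash output, and the function name is hypothetical; the sketch assumes all signatures have the same length and that the length is divisible by the number of bands.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(signatures: dict, bands: int) -> set:
    """Return keyword pairs that share at least one identical band."""
    buckets = defaultdict(list)
    for key, sig in signatures.items():
        rows = len(sig) // bands
        for band in range(bands):
            chunk = tuple(sig[band * rows:(band + 1) * rows])
            # The band index is part of the key so that bands only match
            # against the same position in other signatures.
            buckets[(band, chunk)].append(key)

    pairs = set()
    for keys in buckets.values():
        for a, b in combinations(sorted(keys), 2):
            pairs.add((a, b))
    return pairs


# Toy signatures (placeholder values, 3 bands of 2 rows each).
signatures = {
    "buy shoes": [3, 7, 1, 9, 4, 2],
    "shoes online": [3, 7, 5, 9, 8, 2],
    "weather today": [6, 0, 8, 1, 5, 7],
}
print(candidate_pairs(signatures, bands=3))
```

Only "buy shoes" and "shoes online" share a band, so they are the only pair that would go on to a similarity check; "weather today" is never compared with anything.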
Key Features
- Platform Independence: No OS constraints; runs anywhere Python is available.
- High Performance: Efficiently handles extremely large datasets without hanging.
- Language Agnostic: Clusters keywords in any language, because only the SERP URLs are compared.
- Efficient Resource Use: Low memory and CPU requirements.
- Approximate Similarity: Uses MinHash and LSH for fast and scalable clustering.
Download the tool here: SERP-Based Keyword Clustering Tool (Python)
Project on Github: Github Repository
Instructions
Important! This tool only clusters keywords; it does not collect search engine results. For fetching search results, you can use other services such as A-Parser.
Setup Instructions
1. Install Required Libraries
pip install pandas
pip install tqdm
pip install datasketch
2. Command-Line Usage Instructions
Required Arguments:
- `input_file`: Path to the input CSV file.
- `output_file`: Path to save the output file with clustered keywords.
Optional Arguments:
- `-s`, `--separator`: Separator used in the input file (default: `,`).
- `-k`, `--keyword_col`: Name of the keyword column in the input file (default: `Keyword`).
- `-u`, `--url_col`: Name of the URL column in the input file (default: `URL`).
- `-t`, `--similarity_threshold`: Similarity threshold (default: `0.6`).
Example Command in Terminal:
python minhash-cluster-cli.py for-clustering.csv clustered_keywords.csv -s ';' -k 'keyword' -u 'url' -t 0.6
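For readers who want to see how the arguments above map onto a parser, here is a minimal argparse sketch that mirrors the documented names and defaults. It is a reconstruction from the argument list, not the tool's actual source; the help strings and the `build_parser` name are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser mirroring the documented CLI (reconstructed, not the tool's code)."""
    parser = argparse.ArgumentParser(description="Cluster keywords by SERP similarity.")
    parser.add_argument("input_file", help="Path to the input CSV file.")
    parser.add_argument("output_file", help="Path to save the clustered keywords.")
    parser.add_argument("-s", "--separator", default=",",
                        help="Separator used in the input file.")
    parser.add_argument("-k", "--keyword_col", default="Keyword",
                        help="Name of the keyword column.")
    parser.add_argument("-u", "--url_col", default="URL",
                        help="Name of the URL column.")
    parser.add_argument("-t", "--similarity_threshold", type=float, default=0.6,
                        help="Similarity threshold.")
    return parser


# Same values as the example command, passed as a list instead of via the shell.
args = build_parser().parse_args([
    "for-clustering.csv", "clustered_keywords.csv",
    "-s", ";", "-k", "keyword", "-u", "url", "-t", "0.6",
])
print(args)
```

Note that the column names are matched exactly, so `-k 'keyword'` is needed whenever the header in the CSV is not the default `Keyword`.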
Output File
- Group Column: Each group in the Group column is numbered starting from 0.
- Keyword Clustering: Keywords grouped together will have the same group number.
- Unclustered Keywords: A keyword whose SERP is not similar enough to any other keyword's SERP is placed in a group of its own.
- URLs Not Collected: If keywords have no associated URLs, they will all be grouped as -1.
- Browser Compatibility: This code does not work correctly in Safari. Please use Chrome.
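The group numbering described above can be illustrated with a small end-to-end sketch. Several things here are assumptions rather than facts about the tool: the input layout (one row per keyword and URL pair, with an empty URL meaning no results were collected), the function names, and the grouping logic itself. To keep the example short it uses exact Jaccard comparison with a union-find merge instead of MinHash and LSH, so the tool's real grouping may differ.

```python
import csv
import io
from collections import defaultdict

# Hypothetical input file contents, using ';' as the separator.
RAW = """keyword;url
buy shoes;https://a.example/1
buy shoes;https://b.example/2
buy shoes;https://c.example/3
shoes online;https://a.example/1
shoes online;https://b.example/2
shoes online;https://c.example/3
shoes online;https://d.example/4
weather today;https://w.example/1
no results kw;
"""


def load_serps(text, separator=";", keyword_col="keyword", url_col="url"):
    """Collect the set of URLs per keyword; keywords without URLs get an empty set."""
    serps = defaultdict(set)
    for row in csv.DictReader(io.StringIO(text), delimiter=separator):
        urls = serps[row[keyword_col]]
        if row[url_col]:
            urls.add(row[url_col])
    return dict(serps)


def assign_groups(serps, threshold=0.6):
    """Number groups from 0; keywords with no URLs are labelled -1."""
    keywords = [k for k, urls in serps.items() if urls]
    parent = {k: k for k in keywords}

    def find(k):
        # Follow parent links to the group representative, compressing the path.
        while parent[k] != k:
            parent[k] = parent[parent[k]]
            k = parent[k]
        return k

    # Exact pairwise Jaccard: fine for a toy example, too slow for large inputs.
    for i, a in enumerate(keywords):
        for b in keywords[i + 1:]:
            similarity = len(serps[a] & serps[b]) / len(serps[a] | serps[b])
            if similarity >= threshold:
                parent[find(a)] = find(b)

    labels, groups = {}, {}
    for k in keywords:
        root = find(k)
        labels.setdefault(root, len(labels))
        groups[k] = labels[root]
    for k, urls in serps.items():
        if not urls:
            groups[k] = -1
    return groups


groups = assign_groups(load_serps(RAW))
for keyword, group in groups.items():
    print(f"{group:>2}  {keyword}")
```

In this example "buy shoes" and "shoes online" share three of four distinct URLs (similarity 0.75), so they receive the same group number, "weather today" gets a group of its own, and "no results kw" is labelled -1.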