This project identifies semantically similar words using large-scale co-occurrence patterns extracted from fixed 5-gram context windows. The goal is to approximate synonym relationships without embeddings or labeled data, relying instead on distributional frequency statistics.
A key challenge is that high-frequency phrase structures (e.g., “once upon a time”) can dominate similarity scores, making it difficult to distinguish true semantic similarity from syntactic co-occurrence.
The system is implemented in Apache Spark (PySpark) and constructs a distributed inverted index to compute word-pair co-occurrence statistics at scale.
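The similarity routine below consumes a `stripesRDD` of `(word, set of co-occurring words)` pairs. As a minimal sketch of how such an RDD might be built: the snippet assumes one whitespace-delimited 5-gram per input line and an illustrative input path, not the project's actual ingestion code.

    from pyspark import SparkContext

    sc = SparkContext(appName="ngram-similarity")

    def build_stripes(line):
        # One fixed 5-gram context window per line (assumed input format).
        tokens = line.strip().split()
        # Each word's stripe is the set of other words in its window.
        return [(w, {c for c in tokens if c != w}) for w in tokens]

    # "ngrams/5grams.txt" is an illustrative path, not the project's dataset.
    stripesRDD = (sc.textFile("ngrams/5grams.txt")
                    .flatMap(build_stripes)
                    .reduceByKey(lambda s1, s2: s1 | s2))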
    def similarityAnalysis(stripesRDD, n):
        # stripesRDD: (word, set of co-occurring words) pairs built from the
        # fixed 5-gram context windows.
        # Pair each stripe with its size so the size travels with the postings.
        stripe_lengths = stripesRDD.mapValues(lambda co_words: len(co_words))
        stripes_with_length = stripesRDD.join(stripe_lengths)

        # Invert the stripes: emit one posting (co_word, (word, stripe_size))
        # for every co-occurring word.
        postings = stripes_with_length.flatMap(
            lambda x: [(co_word, (x[0], x[1][1])) for co_word in x[1][0]]
        )

        # Distributed inverted index: co_word -> set of (word, stripe_size).
        inv_index = postings.aggregateByKey(
            set(),
            lambda s, v: s | {v},
            lambda s1, s2: s1 | s2
        )

        # Every pair of words listed under the same co_word shares one context
        # word; the t1[0] < t2[0] guard both orders the pair and emits each
        # unordered pair exactly once.
        cooccurr = inv_index.flatMap(lambda x: [
            ((t1, t2), 1)
            for t1 in x[1] for t2 in x[1] if t1[0] < t2[0]
        ]).reduceByKey(lambda a, b: a + b)

        # The summed counts are exactly the intersection sizes of the two
        # stripes; derive the four set-similarity measures from them.
        simScores = cooccurr.map(lambda x: (
            f"{x[0][0][0]}, {x[0][1][0]}",
            {
                'intersection': x[1],
                'length1': x[0][0][1],
                'length2': x[0][1][1]
            }
        )).mapValues(lambda d: {
            'cosine': d['intersection'] / ((d['length1'] * d['length2']) ** 0.5),
            'jaccard': d['intersection'] / (d['length1'] + d['length2'] - d['intersection']),
            'dice': 2 * d['intersection'] / (d['length1'] + d['length2']),
            'overlap': d['intersection'] / min(d['length1'], d['length2'])
        }).cache()  # cached: reused by both takeOrdered passes below

        # Most and least similar pairs, ranked by cosine similarity.
        top_n = simScores.takeOrdered(n, key=lambda x: -x[1]['cosine'])
        bottom_n = simScores.takeOrdered(n, key=lambda x: x[1]['cosine'])
        return simScores, top_n, bottom_n
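A usage sketch, assuming a `stripesRDD` built as in the earlier snippet and `n = 20` to match the table sizes reported below:

    simScores, top_20, bottom_20 = similarityAnalysis(stripesRDD, 20)

    for pair, scores in top_20:
        print(f"{pair}: cosine={scores['cosine']:.4f}, "
              f"jaccard={scores['jaccard']:.4f}, "
              f"overlap={scores['overlap']:.4f}, "
              f"dice={scores['dice']:.4f}")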
**Most Similar**

| Pair | Cosine | Jaccard | Overlap | Dice |
|---|---|---|---|---|
| first, time | 0.89 | 0.8012 | 0.9149 | 0.8897 |
| time, well | 0.8895 | 0.801 | 0.892 | 0.8895 |
| great, time | 0.875 | 0.7757 | 0.925 | 0.8737 |
| part, well | 0.874 | 0.7755 | 0.9018 | 0.8735 |
| first, well | 0.8717 | 0.7722 | 0.8936 | 0.8715 |
| part, time | 0.8715 | 0.7715 | 0.9018 | 0.871 |
| time, upon | 0.8668 | 0.763 | 0.9152 | 0.8656 |
| made, time | 0.866 | 0.7619 | 0.9109 | 0.8649 |
| made, well | 0.8601 | 0.7531 | 0.9022 | 0.8592 |
| time, way | 0.8587 | 0.7487 | 0.9259 | 0.8563 |
| great, well | 0.8526 | 0.7412 | 0.8988 | 0.8514 |
| time, two | 0.8517 | 0.7389 | 0.9094 | 0.8498 |
| first, great | 0.8497 | 0.7381 | 0.8738 | 0.8493 |
| first, part | 0.8471 | 0.7348 | 0.8527 | 0.8471 |
| great, upon | 0.8464 | 0.7338 | 0.8475 | 0.8464 |
| upon, well | 0.8444 | 0.729 | 0.889 | 0.8433 |
| new, time | 0.8426 | 0.724 | 0.9133 | 0.8399 |
| first, two | 0.8411 | 0.7249 | 0.8737 | 0.8405 |
| way, well | 0.8357 | 0.7146 | 0.8986 | 0.8335 |
| time, us | 0.8357 | 0.7105 | 0.9318 | 0.8308 |
**Least Similar**

| Pair | Cosine | Jaccard | Overlap | Dice |
|---|---|---|---|---|
| region, write | 0.0067 | 0.0032 | 0.0085 | 0.0065 |
| relation, snow | 0.0067 | 0.0026 | 0.0141 | 0.0052 |
| cardiac, took | 0.0074 | 0.0023 | 0.0217 | 0.0045 |
| ever, tumor | 0.0076 | 0.002 | 0.0263 | 0.004 |
| came, tumor | 0.0076 | 0.002 | 0.0263 | 0.004 |
| let, therapy | 0.0076 | 0.003 | 0.0161 | 0.0059 |
| related, stay | 0.0078 | 0.0036 | 0.0116 | 0.0072 |
| factors, hear | 0.0078 | 0.0039 | 0.0094 | 0.0077 |
| implications, round | 0.0078 | 0.0033 | 0.0145 | 0.0066 |
| came, proteins | 0.0079 | 0.002 | 0.0286 | 0.0041 |
| population, window | 0.0079 | 0.0039 | 0.01 | 0.0077 |
| love, proportional | 0.008 | 0.0029 | 0.0185 | 0.0058 |
| got, multiple | 0.008 | 0.0034 | 0.0149 | 0.0067 |
| changes, fort | 0.0081 | 0.0032 | 0.0161 | 0.0065 |
| layer, wife | 0.0081 | 0.0038 | 0.0119 | 0.0075 |
| five, sympathy | 0.0081 | 0.0034 | 0.0149 | 0.0068 |
| arrival, essential | 0.0081 | 0.004 | 0.0093 | 0.008 |
| desert, function | 0.0081 | 0.0031 | 0.0175 | 0.0062 |
| fundamental, stood | 0.0081 | 0.0038 | 0.0115 | 0.0077 |
| patients, plain | 0.0081 | 0.004 | 0.0103 | 0.0079 |
High similarity scores are dominated by frequent phrase patterns rather than true semantic similarity. Words like “time,” “well,” and “first” repeatedly appear in top pairs because of their structural role in common 5-gram templates.
This demonstrates both the scalability of Spark for large-scale text mining and the limitation of raw co-occurrence methods for semantic understanding. Frequency bias significantly distorts similarity rankings.
Future improvements include TF-IDF weighting, PMI scoring, and embedding-based methods to better capture true synonym relationships.
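As one illustration, PMI scoring reweights a pair by how much more often it co-occurs than chance would predict, which directly counteracts the frequency bias seen above. The sketch below is a minimal standalone example with made-up counts; the function name and numbers are assumptions, not part of the current implementation.

    import math

    def pmi(pair_count, count_w, count_c, total_windows):
        # PMI = log( P(w,c) / (P(w) * P(c)) ): near zero when co-occurrence is
        # explained by raw frequency alone, large when the pair is genuinely
        # associated.
        return math.log((pair_count * total_windows) / (count_w * count_c))

    # Two ubiquitous words co-occurring exactly as often as chance predicts:
    print(pmi(pair_count=900, count_w=10_000, count_c=9_000, total_windows=100_000))  # 0.0
    # A rarer but strongly associated pair:
    print(pmi(pair_count=80, count_w=100, count_c=120, total_windows=100_000))        # ~6.5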