Synonym Analysis with Spark (5-gram Co-occurrence)

Problem

This project identifies semantically similar words using large-scale co-occurrence patterns extracted from fixed 5-gram context windows. The goal is to approximate synonym relationships without using embeddings or labeled data, relying instead on distributed frequency statistics.

A key challenge is that high-frequency phrase structures (e.g., “once upon a time”) can dominate similarity scores, making it difficult to distinguish true semantic similarity from syntactic co-occurrence.

Data

Methods

The system is implemented in Apache Spark. Each word's 5-gram contexts are first collapsed into a "stripe" — the set of words it co-occurs with inside a window — and a distributed inverted index over these stripes yields pairwise intersection counts at scale.
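As a toy illustration of the stripes representation (plain Python, no Spark; the two 5-grams below are invented for the example), each word maps to the set of words appearing with it in the same window:

```python
from collections import defaultdict

# Hypothetical toy 5-grams; the real pipeline extracts these from the corpus.
five_grams = [
    ("once", "upon", "a", "time", "there"),
    ("a", "long", "time", "ago", "there"),
]

# Stripe: word -> set of all words seen in the same 5-gram window.
stripes = defaultdict(set)
for gram in five_grams:
    for w in gram:
        stripes[w] |= set(gram) - {w}

# stripes["time"] now holds every word that shared a window with "time".
```

An RDD of these (word, stripe) pairs is what the Spark function below consumes.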

Spark Implementation


def similarityAnalysis(stripesRDD, n):
    """Compute pairwise set-similarity scores from a stripes RDD of
    (word, set_of_cooccurring_words) pairs."""
    # Stripe length = number of distinct co-occurring words per word.
    stripe_lengths = stripesRDD.mapValues(len)
    stripes_with_length = stripesRDD.join(stripe_lengths)

    # Postings: for each co-occurring word, emit (co_word, (word, stripe_length)).
    postings = stripes_with_length.flatMap(
        lambda x: [(co_word, (x[0], x[1][1])) for co_word in x[1][0]]
    )

    # Inverted index: co_word -> set of (word, stripe_length) entries.
    inv_index = postings.aggregateByKey(
        set(),
        lambda s, v: s | {v},
        lambda s1, s2: s1 | s2
    )

    # Count shared postings for every word pair; the count is the size of the
    # stripe intersection. The guard t1[0] < t2[0] keeps each unordered pair
    # exactly once, already in sorted order.
    cooccurr = inv_index.flatMap(lambda x: [
        ((t1, t2), 1)
        for t1 in x[1] for t2 in x[1] if t1[0] < t2[0]
    ]).reduceByKey(lambda a, b: a + b)

    # Derive four set-overlap measures from the intersection and stripe lengths.
    simScores = cooccurr.map(lambda x: (
        f"{x[0][0][0]}, {x[0][1][0]}",
        {
            'intersection': x[1],
            'length1': x[0][0][1],
            'length2': x[0][1][1]
        }
    )).mapValues(lambda d: {
        'cosine': d['intersection'] / ((d['length1'] * d['length2']) ** 0.5),
        'jaccard': d['intersection'] / (d['length1'] + d['length2'] - d['intersection']),
        'dice': 2 * d['intersection'] / (d['length1'] + d['length2']),
        'overlap': d['intersection'] / min(d['length1'], d['length2'])
    })

    # Cache before the two takeOrdered passes so the pipeline runs only once.
    simScores.cache()
    top_n = simScores.takeOrdered(n, key=lambda x: -x[1]['cosine'])
    bottom_n = simScores.takeOrdered(n, key=lambda x: x[1]['cosine'])

    return simScores, top_n, bottom_n
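The four measures can be sanity-checked without a cluster. This standalone sketch (the two stripe sets are invented for illustration) applies the same formulas used in the mapValues step above, where the intersection size equals the number of postings two words share:

```python
import math

# Two hypothetical stripes (sets of co-occurring words).
stripe_a = {"upon", "a", "there", "long", "ago"}
stripe_b = {"upon", "a", "there", "was"}

inter = len(stripe_a & stripe_b)       # shared co-occurring words
l1, l2 = len(stripe_a), len(stripe_b)  # stripe lengths

scores = {
    'cosine':  inter / math.sqrt(l1 * l2),
    'jaccard': inter / (l1 + l2 - inter),
    'dice':    2 * inter / (l1 + l2),
    'overlap': inter / min(l1, l2),
}
```

Note that overlap is always the largest of the four (it divides by the smaller stripe only), which is consistent with the overlap column dominating in the results tables.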
  

Results

Most Similar
  Pair                     |Cosine, Jaccard, Overlap, Dice 
  first, time              |  0.89,  0.8012,  0.9149, 0.8897
  time, well               |0.8895,   0.801,   0.892, 0.8895
  great, time              | 0.875,  0.7757,   0.925, 0.8737
  part, well               | 0.874,  0.7755,  0.9018, 0.8735
  first, well              |0.8717,  0.7722,  0.8936, 0.8715
  part, time               |0.8715,  0.7715,  0.9018, 0.871
  time, upon               |0.8668,   0.763,  0.9152, 0.8656
  made, time               | 0.866,  0.7619,  0.9109, 0.8649
  made, well               |0.8601,  0.7531,  0.9022, 0.8592
  time, way                |0.8587,  0.7487,  0.9259, 0.8563
  great, well              |0.8526,  0.7412,  0.8988, 0.8514
  time, two                |0.8517,  0.7389,  0.9094, 0.8498
  first, great             |0.8497,  0.7381,  0.8738, 0.8493
  first, part              |0.8471,  0.7348,  0.8527, 0.8471
  great, upon              |0.8464,  0.7338,  0.8475, 0.8464
  upon, well               |0.8444,   0.729,   0.889, 0.8433
  new, time                |0.8426,   0.724,  0.9133, 0.8399
  first, two               |0.8411,  0.7249,  0.8737, 0.8405
  way, well                |0.8357,  0.7146,  0.8986, 0.8335
  time, us                 |0.8357,  0.7105,  0.9318, 0.8308
  
Least Similar
  Pair                     |Cosine, Jaccard, Overlap, Dice 
  region, write            |0.0067,  0.0032,  0.0085, 0.0065
  relation, snow           |0.0067,  0.0026,  0.0141, 0.0052
  cardiac, took            |0.0074,  0.0023,  0.0217, 0.0045
  ever, tumor              |0.0076,   0.002,  0.0263, 0.004
  came, tumor              |0.0076,   0.002,  0.0263, 0.004
  let, therapy             |0.0076,   0.003,  0.0161, 0.0059
  related, stay            |0.0078,  0.0036,  0.0116, 0.0072
  factors, hear            |0.0078,  0.0039,  0.0094, 0.0077
  implications, round      |0.0078,  0.0033,  0.0145, 0.0066
  came, proteins           |0.0079,   0.002,  0.0286, 0.0041
  population, window       |0.0079,  0.0039,    0.01, 0.0077
  love, proportional       | 0.008,  0.0029,  0.0185, 0.0058
  got, multiple            | 0.008,  0.0034,  0.0149, 0.0067
  changes, fort            |0.0081,  0.0032,  0.0161, 0.0065
  layer, wife              |0.0081,  0.0038,  0.0119, 0.0075
  five, sympathy           |0.0081,  0.0034,  0.0149, 0.0068
  arrival, essential       |0.0081,   0.004,  0.0093, 0.008
  desert, function         |0.0081,  0.0031,  0.0175, 0.0062
  fundamental, stood       |0.0081,  0.0038,  0.0115, 0.0077
  patients, plain          |0.0081,   0.004,  0.0103, 0.0079
  

Key Finding

High similarity scores are dominated by frequent phrase patterns rather than true semantic similarity. Words like “time,” “well,” and “first” repeatedly appear in top pairs due to structural usage in 5-gram templates.

Impact

This demonstrates both the scalability of Spark for large-scale text mining and the limitation of raw co-occurrence methods for semantic understanding. Frequency bias significantly distorts similarity rankings.

Future improvements include TF-IDF weighting, PMI scoring, and embedding-based methods to better capture true synonym relationships.
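As a sketch of one such fix, PMI rescales a pair by how much more often it co-occurs than chance would predict, which penalizes pairs that are frequent merely because both words are frequent. All counts below are invented for illustration:

```python
import math

# Hypothetical total number of co-occurrence events in the corpus.
N = 1_000_000

def pmi(count_pair, count_w1, count_w2, n=N):
    """Pointwise mutual information: log2 of observed vs. chance co-occurrence."""
    return math.log2((count_pair / n) / ((count_w1 / n) * (count_w2 / n)))

# A frequent pair (in the spirit of "time"/"well"): many raw co-occurrences,
# but barely above chance, so PMI stays low.
pmi_frequent = pmi(count_pair=2_500, count_w1=50_000, count_w2=40_000)

# A rare but strongly associated pair: far fewer co-occurrences, much higher PMI.
pmi_rare = pmi(count_pair=50, count_w1=500, count_w2=400)
```

Under these toy counts the frequent pair scores near zero while the rare pair scores far higher, which is exactly the correction the raw co-occurrence rankings above are missing.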