This project identifies semantically similar words using large-scale co-occurrence patterns extracted from fixed 5-gram context windows. The goal is to approximate synonym relationships without embeddings or labeled data, relying instead on distributional frequency statistics.
A key challenge is that high-frequency phrase structures (e.g., “once upon a time”) can dominate similarity scores, making it difficult to distinguish true semantic similarity from syntactic co-occurrence.
The system is implemented in Apache Spark (PySpark) and constructs a distributed inverted index to compute word-pair co-occurrence statistics at scale.
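The similarity routine below consumes a `stripesRDD` of `(word, set of co-occurring words)` pairs. As a minimal sketch of how such an RDD might be built: the snippet assumes one whitespace-delimited 5-gram per input line and an illustrative input path, not the project's actual ingestion code.

    from pyspark import SparkContext

    sc = SparkContext(appName="ngram-similarity")

    def build_stripes(line):
        # One fixed 5-gram context window per line (assumed input format).
        tokens = line.strip().split()
        # Each word's stripe is the set of other words in its window.
        return [(w, {c for c in tokens if c != w}) for w in tokens]

    # "ngrams/5grams.txt" is an illustrative path, not the project's dataset.
    stripesRDD = (sc.textFile("ngrams/5grams.txt")
                    .flatMap(build_stripes)
                    .reduceByKey(lambda s1, s2: s1 | s2))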
    def similarityAnalysis(stripesRDD, n):
        # stripesRDD: (word, set of co-occurring words) pairs built from the
        # fixed 5-gram context windows.
        # Pair each stripe with its size so the size travels with the postings.
        stripe_lengths = stripesRDD.mapValues(lambda co_words: len(co_words))
        stripes_with_length = stripesRDD.join(stripe_lengths)

        # Invert the stripes: emit one posting (co_word, (word, stripe_size))
        # for every co-occurring word.
        postings = stripes_with_length.flatMap(
            lambda x: [(co_word, (x[0], x[1][1])) for co_word in x[1][0]]
        )

        # Distributed inverted index: co_word -> set of (word, stripe_size).
        inv_index = postings.aggregateByKey(
            set(),
            lambda s, v: s | {v},
            lambda s1, s2: s1 | s2
        )

        # Every pair of words listed under the same co_word shares one context
        # word; the t1[0] < t2[0] guard both orders the pair and emits each
        # unordered pair exactly once.
        cooccurr = inv_index.flatMap(lambda x: [
            ((t1, t2), 1)
            for t1 in x[1] for t2 in x[1] if t1[0] < t2[0]
        ]).reduceByKey(lambda a, b: a + b)

        # The summed counts are exactly the intersection sizes of the two
        # stripes; derive the four set-similarity measures from them.
        simScores = cooccurr.map(lambda x: (
            f"{x[0][0][0]}, {x[0][1][0]}",
            {
                'intersection': x[1],
                'length1': x[0][0][1],
                'length2': x[0][1][1]
            }
        )).mapValues(lambda d: {
            'cosine': d['intersection'] / ((d['length1'] * d['length2']) ** 0.5),
            'jaccard': d['intersection'] / (d['length1'] + d['length2'] - d['intersection']),
            'dice': 2 * d['intersection'] / (d['length1'] + d['length2']),
            'overlap': d['intersection'] / min(d['length1'], d['length2'])
        }).cache()  # cached: reused by both takeOrdered passes below

        # Most and least similar pairs, ranked by cosine similarity.
        top_n = simScores.takeOrdered(n, key=lambda x: -x[1]['cosine'])
        bottom_n = simScores.takeOrdered(n, key=lambda x: x[1]['cosine'])
        return simScores, top_n, bottom_n
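A usage sketch, assuming a `stripesRDD` built as in the earlier snippet and `n = 20` to match the table sizes reported below:

    simScores, top_20, bottom_20 = similarityAnalysis(stripesRDD, 20)

    for pair, scores in top_20:
        print(f"{pair}: cosine={scores['cosine']:.4f}, "
              f"jaccard={scores['jaccard']:.4f}, "
              f"overlap={scores['overlap']:.4f}, "
              f"dice={scores['dice']:.4f}")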
**Most Similar**

| Pair | Cosine | Jaccard | Overlap | Dice |
|---|---|---|---|---|
| first, time | 0.89 | 0.8012 | 0.9149 | 0.8897 |
| time, well | 0.8895 | 0.801 | 0.892 | 0.8895 |
| great, time | 0.875 | 0.7757 | 0.925 | 0.8737 |
| part, well | 0.874 | 0.7755 | 0.9018 | 0.8735 |
| first, well | 0.8717 | 0.7722 | 0.8936 | 0.8715 |
| part, time | 0.8715 | 0.7715 | 0.9018 | 0.871 |
| time, upon | 0.8668 | 0.763 | 0.9152 | 0.8656 |
| made, time | 0.866 | 0.7619 | 0.9109 | 0.8649 |
| made, well | 0.8601 | 0.7531 | 0.9022 | 0.8592 |
| time, way | 0.8587 | 0.7487 | 0.9259 | 0.8563 |
| great, well | 0.8526 | 0.7412 | 0.8988 | 0.8514 |
| time, two | 0.8517 | 0.7389 | 0.9094 | 0.8498 |
| first, great | 0.8497 | 0.7381 | 0.8738 | 0.8493 |
| first, part | 0.8471 | 0.7348 | 0.8527 | 0.8471 |
| great, upon | 0.8464 | 0.7338 | 0.8475 | 0.8464 |
| upon, well | 0.8444 | 0.729 | 0.889 | 0.8433 |
| new, time | 0.8426 | 0.724 | 0.9133 | 0.8399 |
| first, two | 0.8411 | 0.7249 | 0.8737 | 0.8405 |
| way, well | 0.8357 | 0.7146 | 0.8986 | 0.8335 |
| time, us | 0.8357 | 0.7105 | 0.9318 | 0.8308 |
**Least Similar**

| Pair | Cosine | Jaccard | Overlap | Dice |
|---|---|---|---|---|
| region, write | 0.0067 | 0.0032 | 0.0085 | 0.0065 |
| relation, snow | 0.0067 | 0.0026 | 0.0141 | 0.0052 |
| cardiac, took | 0.0074 | 0.0023 | 0.0217 | 0.0045 |
| ever, tumor | 0.0076 | 0.002 | 0.0263 | 0.004 |
| came, tumor | 0.0076 | 0.002 | 0.0263 | 0.004 |
| let, therapy | 0.0076 | 0.003 | 0.0161 | 0.0059 |
| related, stay | 0.0078 | 0.0036 | 0.0116 | 0.0072 |
| factors, hear | 0.0078 | 0.0039 | 0.0094 | 0.0077 |
| implications, round | 0.0078 | 0.0033 | 0.0145 | 0.0066 |
| came, proteins | 0.0079 | 0.002 | 0.0286 | 0.0041 |
| population, window | 0.0079 | 0.0039 | 0.01 | 0.0077 |
| love, proportional | 0.008 | 0.0029 | 0.0185 | 0.0058 |
| got, multiple | 0.008 | 0.0034 | 0.0149 | 0.0067 |
| changes, fort | 0.0081 | 0.0032 | 0.0161 | 0.0065 |
| layer, wife | 0.0081 | 0.0038 | 0.0119 | 0.0075 |
| five, sympathy | 0.0081 | 0.0034 | 0.0149 | 0.0068 |
| arrival, essential | 0.0081 | 0.004 | 0.0093 | 0.008 |
| desert, function | 0.0081 | 0.0031 | 0.0175 | 0.0062 |
| fundamental, stood | 0.0081 | 0.0038 | 0.0115 | 0.0077 |
| patients, plain | 0.0081 | 0.004 | 0.0103 | 0.0079 |
High similarity scores are dominated by frequent phrase patterns rather than true semantic similarity. Words like “time,” “well,” and “first” repeatedly appear in top pairs because of their structural role in common 5-gram templates.
This demonstrates both the scalability of Spark for large-scale text mining and the limitation of raw co-occurrence methods for semantic understanding. Frequency bias significantly distorts similarity rankings.
Future improvements include TF-IDF weighting, PMI scoring, and embedding-based methods to better capture true synonym relationships.
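As one illustration, PMI scoring reweights a pair by how much more often it co-occurs than chance would predict, which directly counteracts the frequency bias seen above. The sketch below is a minimal standalone example with made-up counts; the function name and numbers are assumptions, not part of the current implementation.

    import math

    def pmi(pair_count, count_w, count_c, total_windows):
        # PMI = log( P(w,c) / (P(w) * P(c)) ): near zero when co-occurrence is
        # explained by raw frequency alone, large when the pair is genuinely
        # associated.
        return math.log((pair_count * total_windows) / (count_w * count_c))

    # Two ubiquitous words co-occurring exactly as often as chance predicts:
    print(pmi(pair_count=900, count_w=10_000, count_c=9_000, total_windows=100_000))  # 0.0
    # A rarer but strongly associated pair:
    print(pmi(pair_count=80, count_w=100, count_c=120, total_windows=100_000))        # ~6.5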