Part 1: Count-Based Word Vectors

大多数词向量模型都是从以下想法开始的：

你应当通过一个词的同伴来认识这个词（Firth, J. R. 1957:11）

许多词向量实现是基于这样一个理念，即相似的词（即近义词）会在相似的上下文中使用。因此，相似的词通常会与一组共同的词（即上下文）一起被说出或写出。通过检查这些上下文，我们可以尝试为我们的词开发嵌入。基于这一直觉，许多“老派”构建词向量的方法依赖于词频统计。这里我们将详细说明其中的一种策略，即共现矩阵。

Co-Occurrence

共现矩阵计算在某种环境中事物共同出现的频率。给定文档中某个词 $w_i$ ，我们考虑围绕 $w_i$ 的上下文窗口。假设我们的固定窗口大小是n，那么这个窗口包括该文档中 $w_i$ 的前n个词和后n个词，即词 $w_{i-n}$ , … , $w_{i-1}$ 和 $w_{i+1}$ , … , $w_{i+n}$ 。我们构建一个共现矩阵M，这是一个对称的词与词之间的矩阵，其中 $M_{ij}$ 是所有文档中词 $w_j$ 出现在词 $w_i$ 的窗口中的次数。

示例：固定窗口大小为 n=1的共现：

文档 1: “all that glitters is not gold”

文档 2: “all is well that ends well”

在自然语言处理（NLP）中，我们通常使用 <START> 和 <END> 标记来表示句子、段落或文档的开头和结尾。这些标记被包括在共现计数中，封装每个文档，例如：“<START> All that glitters is not gold <END>”。

矩阵的行（或列）提供基于词与词共现的词向量，但它们可能会很大。为了减少维度，我们使用奇异值分解（SVD），类似于主成分分析（PCA），选择前 ( k ) 个主成分。SVD 过程将共现矩阵 ( A ) 分解为对角矩阵 ( S ) 中的奇异值和新的、更短的词向量在 ( U_k ) 中。

这种降维保持了语义关系；例如，doctor（医生）和 hospital（医院）将比 doctor（医生）和 dog（狗）更接近。

对于不熟悉特征值和 SVD 的人，可以在这里找到一个适合初学者的 SVD 介绍。更多深入理解的资源包括 CS168 课程的第 7、8 和 9 讲，这些高层次地讲解了这些算法。对于实际操作，建议使用 numpy、scipy 或 sklearn 等 Python 包中的预编程函数。虽然对大语料库应用完整的 SVD 可能会消耗大量内存，但存在一些可扩展技术，例如截断 SVD，可以高效地提取前 ( k ) 个向量分量。

Question 1.1: Implement `distinct_words`

编写一个方法来找出在语料库中出现的不同单词（词类型）。

你可以使用 for 循环来处理输入语料库（一个包含字符串列表的列表），但尝试使用 Python 列表推导式（通常更快）。特别是，这可能对扁平化列表的列表有用。如果你不熟悉 Python 列表推导式，这里有更多信息。

你返回的 corpus_words 应该是排序的。你可以使用 Python 的 sorted 函数来进行排序。

你可能会发现使用Python 集合来去除重复单词很有用。

def distinct_words(corpus):
    """ Determine a list of distinct words for the corpus.
        Params:
            corpus (list of list of strings): corpus of documents
        Return:
            corpus_words (list of strings): sorted list of distinct words across the corpus
            n_corpus_words (integer): number of distinct words across the corpus
    """
    corpus_words = []
    n_corpus_words = -1
    
    # ------------------
    # Write your implementation here.
    
    corpus_words=list(sorted({y for x in corpus for y in x}))
    n_corpus_words=len(corpus_words)
    # ------------------

    return corpus_words, n_corpus_words

Question 1.2: Implement `compute_co_occurrence_matrix`

编写一个方法来构建一个指定窗口大小 ( n )（默认值为4）的共现矩阵，考虑词前 ( n ) 和词后的 ( n ) 个单词。这里，我们开始使用 NumPy（np）来表示向量、矩阵和张量。如果你不熟悉 NumPy，可以参考这个 cs231n Python NumPy 教程的后半部分中的NumPy 教程。

def compute_co_occurrence_matrix(corpus, window_size=4):
    """ Compute co-occurrence matrix for the given corpus and window_size (default of 4).
    
        Note: Each word in a document should be at the center of a window. Words near edges will have a smaller
              number of co-occurring words.
              
              For example, if we take the document "<START> All that glitters is not gold <END>" with window size of 4,
              "All" will co-occur with "<START>", "that", "glitters", "is", and "not".
    
        Params:
            corpus (list of list of strings): corpus of documents
            window_size (int): size of context window
        Return:
            M (a symmetric numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): 
                Co-occurence matrix of word counts. 
                The ordering of the words in the rows/columns should be the same as the ordering of the words given by the distinct_words function.
            word2ind (dict): dictionary that maps word to index (i.e. row/column number) for matrix M.
    """
    words, n_words = distinct_words(corpus)
    M = None
    word2ind = {}
    # ------------------
    # Write your implementation here.
    M=np.zeros((n_words,n_words))
    word2ind={word:ix for ix,word in enumerate(words)}
    for sentence in corpus:
        for i,word in enumerate(sentence):
            for context in sentence[max(0,i-window_size):i]:
                M[word2ind[word]][word2ind[context]]+=1
            for context in sentence[i+1:min(len(sentence),i+window_size+1)]:
                M[word2ind[word]][word2ind[context]]+=1
    
    # ------------------

    return M, word2ind

Question 1.3: Implement `reduce_to_k_dim`

编写一个方法，对矩阵进行降维以生成 k 维嵌入。使用 SVD 提取前 k 个成分，并生成一个新的 k 维嵌入矩阵。

注意：numpy、scipy 和 scikit-learn（sklearn）都提供 SVD 的一些实现，但只有 scipy 和 sklearn 提供截断 SVD 的实现，并且只有 sklearn 提供高效的随机算法来计算大规模的截断 SVD。因此，请使用 sklearn.decomposition.TruncatedSVD。

def reduce_to_k_dim(M, k=2):
    """ Reduce a co-occurence count matrix of dimensionality (num_corpus_words, num_corpus_words)
        to a matrix of dimensionality (num_corpus_words, k) using the following SVD function from Scikit-Learn:
            - http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.TruncatedSVD.html
    
        Params:
            M (numpy matrix of shape (number of unique words in the corpus , number of unique words in the corpus)): co-occurence matrix of word counts
            k (int): embedding size of each word after dimension reduction
        Return:
            M_reduced (numpy matrix of shape (number of corpus words, k)): matrix of k-dimensioal word embeddings.
                    In terms of the SVD from math class, this actually returns U * S
    """    
    n_iters = 10    # Use this parameter in your call to `TruncatedSVD`
    M_reduced = None
    print("Running Truncated SVD over %i words..." % (M.shape[0]))
    
    # ------------------
    # Write your implementation here.
    svd = TruncatedSVD(n_components=k, n_iter=n_iters)
    M_reduced=svd.fit_transform(M)
    
    # ------------------

    print("Done.")
    return M_reduced

Question 1.4: Implement `plot_embeddings`

在这里，你将编写一个函数，在二维空间中绘制一组二维向量。对于图形绘制，我们将使用 Matplotlib（plt）。

对于这个例子，你可能会发现改编这段代码很有用。将来，要制作图表的一个好方法是查看 Matplotlib gallery，找到一个看起来与您想要的图表相似的图表，然后改编他们提供的代码。

def plot_embeddings(M_reduced, word2ind, words):
    """ Plot in a scatterplot the embeddings of the words specified in the list "words".
        NOTE: do not plot all the words listed in M_reduced / word2ind.
        Include a label next to each point.
        
        Params:
            M_reduced (numpy matrix of shape (number of unique words in the corpus , 2)): matrix of 2-dimensioal word embeddings
            word2ind (dict): dictionary that maps word to indices for matrix M
            words (list of strings): words whose embeddings we want to visualize
    """

    # ------------------
    # Write your implementation here.
    
    index=[ word2ind[word] for word in words]
    print(index)
    print(M_reduced.shape)
    X=M_reduced[index]
    plt.scatter(X[:,0],X[:,1])
    for i,word in enumerate(words):
        plt.text(X[i,0],X[i,1],word)
    plt.title("word embeddings")
    plt.show()
    # ------------------

Question 1.5: Co-Occurrence Plot Analysis

现在我们将把你编写的所有部分组合在一起！我们将计算一个固定窗口大小为 4（默认窗口大小）的共现矩阵，应用于大型电影评论语料库。然后我们将使用 TruncatedSVD 来计算每个词的二维嵌入。TruncatedSVD 返回的是 U*S，所以我们需要对返回的向量进行归一化，以使所有的向量都位于单位圆附近（因此，接近度是方向上的接近度）

运行下面的代码单元以生成图表。这可能需要几分钟时间。

# -----------------------------
# Run This Cell to Produce Your Plot
# ------------------------------
imdb_corpus = read_corpus()
M_co_occurrence, word2ind_co_occurrence = compute_co_occurrence_matrix(imdb_corpus)
M_reduced_co_occurrence = reduce_to_k_dim(M_co_occurrence, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced_co_occurrence, axis=1)
M_normalized = M_reduced_co_occurrence / M_lengths[:, np.newaxis] # broadcasting

words = ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge']

plot_embeddings(M_normalized, word2ind_co_occurrence, words)

Part 2: Prediction-Based Word Vectors

如课堂上讨论的，最近基于预测的词向量（如 word2vec 和 GloVe）展示了更好的性能，其中 GloVe 还利用了词频的优势。在这里，我们将探索 GloVe 生成的嵌入。有关 word2vec 和 GloVe 算法的更多细节，请回顾课堂笔记和讲座幻灯片。如果你感到冒险，可以挑战自己，尝试阅读 GloVe 的原始论文。

Reducing dimensionality of Word Embeddings

让我们直接比较 GloVe 嵌入与共现矩阵的嵌入。为了避免内存不足，我们将使用 40000 个 GloVe 向量的样本。运行以下单元以：

将 40000 个 GloVe 向量放入矩阵 ( M ) 中。
运行 reduce_to_k_dim（你的 Truncated SVD 函数）将向量从 200 维降到 2 维。

def get_matrix_of_vectors(wv_from_bin, required_words):
    """ Put the GloVe vectors into a matrix M.
        Param:
            wv_from_bin: KeyedVectors object; the 400000 GloVe vectors loaded from file
        Return:
            M: numpy matrix shape (num words, 200) containing the vectors
            word2ind: dictionary mapping each word to its row number in M
    """
    import random
    words = list(wv_from_bin.index_to_key)
    print("Shuffling words ...")
    random.seed(225)
    random.shuffle(words)
    print("Putting %i words into word2ind and matrix M..." % len(words))
    word2ind = {}
    M = []
    curInd = 0
    for w in words:
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    for w in required_words:
        if w in words:
            continue
        try:
            M.append(wv_from_bin.get_vector(w))
            word2ind[w] = curInd
            curInd += 1
        except KeyError:
            continue
    M = np.stack(M)
    print("Done.")
    return M, word2ind

# -----------------------------------------------------------------
# Run Cell to Reduce 200-Dimensional Word Embeddings to k Dimensions
# Note: This should be quick to run
# -----------------------------------------------------------------
M, word2ind = get_matrix_of_vectors(wv_from_bin, words)
M_reduced = reduce_to_k_dim(M, k=2)

# Rescale (normalize) the rows to make them each of unit-length
M_lengths = np.linalg.norm(M_reduced, axis=1)
M_reduced_normalized = M_reduced / M_lengths[:, np.newaxis] # broadcasting

Question 2.1: GloVe Plot Analysis

Run the cell below to plot the 2D GloVe embeddings for ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge'].

1
2
3

words = ['movie', 'book', 'mysterious', 'story', 'fascinating', 'good', 'interesting', 'large', 'massive', 'huge']

plot_embeddings(M_reduced_normalized, word2ind, words)

Cosine Similarity

现在我们已经有了词向量，我们需要一种方法来量化这些向量之间的词的相似性。一个这样的度量是余弦相似度。我们将使用这个度量来找到彼此“接近”和“远离”的词。

我们可以将 ( n ) 维向量视为 ( n ) 维空间中的点。从这个角度来看，L1 和 L2 距离有助于量化在这两个点之间“我们必须移动”多少空间。另一种方法是检查两个向量之间的夹角。从三角学中我们知道：

我们可以不计算实际的角度，而是将相似度保留为 similarity=cos⁡( $\theta$ )。正式地，两个向量p和q之间的余弦相似度s定义为：

$s = \frac{p \cdot q}{||p||||q||},where\,\, s\, \in [-1,1]$

Question 2.2: Words with Multiple Meanings

多义词和同音异义词是具有多个意义的词（请参见这个维基页面了解多义词和同音异义词的区别）。找到一个至少有两种不同意义的词，使得根据余弦相似度，前 10 个最相似的词包含与这两种意义相关的词。例如，“leaves”在前 10 个相似词中同时包含“go_away”和“a_structure_of_a_plant”的意思，“scoop”在前 10 个相似词中同时包含“handed_waffle_cone”和“lowdown”的意思。你可能需要尝试几个多义词或同音异义词才能找到一个合适的词。

请说明你发现的词以及在前 10 个相似词中出现的多重含义。为什么你认为你尝试的许多多义词或同音异义词不起作用（即前 10 个最相似的词只包含该词的一个意义）？

注意：你应该使用 wv_from_bin.most_similar(word) 函数来获取前 10 个最相似的词。这个函数根据与给定词的余弦相似度对词汇表中的所有其他词进行排名。如需进一步帮助，请查看 GenSim 文档。

# ------------------
# Write your implementation here.

wv_from_bin.most_similar("wear")
# ------------------

[('wearing', 0.8140338063240051),
 ('wore', 0.7708778381347656),
 ('worn', 0.7581356763839722),
 ('dress', 0.747020423412323),
 ('wears', 0.712210476398468),
 ('clothes', 0.6622017621994019),
 ('shoes', 0.6472660899162292),
 ('pants', 0.641486644744873),
 ('shirts', 0.6383503079414368),
 ('attire', 0.6368460655212402)]

Question 2.3: Synonyms & Antonyms

在考虑余弦相似度时，通常更方便使用余弦距离，余弦距离等于1减去余弦相似度。

找到三个词（𝑤1、𝑤2、𝑤3），其中𝑤1和𝑤2是同义词，𝑤1和𝑤3是反义词，但余弦距离（𝑤1、𝑤3）小于余弦距离（𝑤1、𝑤2）。

例如，𝑤1=“happy”与𝑤3=“sad”的距离比与𝑤2=“cheerful”的距离更近。请找出一个符合上述条件的不同示例，并提供可能的解释，说明为何会出现这种反直觉的结果。

你可以使用 wv_from_bin.distance(w1, w2) 函数来计算两个词之间的余弦距离。有关更多帮助，请参考GenSim文档。

# ------------------
# Write your implementation here.
w1='yes'
w2='correct'
w3='no'
print(f"the distance between {w1} and {w2} is {wv_from_bin.distance(w1,w2)}")
print(f"the distance between {w1} and {w3} is {wv_from_bin.distance(w1,w3)}")

# ------------------

1 2	the distance between yes and correct is 0.6043116450309753 the distance between yes and no is 0.4196847677230835

Question 2.4: Analogies with Word Vectors

词向量有时被证明可以解决类比问题。

例如，对于类比”man : grandfather :: woman : x”（读作：man 与 grandfather 的关系如同 woman 与 x 的关系），x 是什么？

在下面的单元格中，我们展示了如何使用 most_similar 函数从 GenSim 文档中找到 x。该函数找到最类似于正面词列表中的词，并且最不类似于负面词列表中的词（同时省略输入词，因为它们通常是最相似的；请参阅这篇论文）。类比问题的答案将具有最高的余弦相似度（返回的数值最大）。

1 2	# Run this cell to answer the analogy -- man : grandfather :: woman : x pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'grandfather'], negative=['man']))

[('grandmother', 0.7608445286750793),
 ('granddaughter', 0.7200807332992554),
 ('daughter', 0.7168302536010742),
 ('mother', 0.7151536345481873),
 ('niece', 0.7005681395530701),
 ('father', 0.6659888029098511),
 ('aunt', 0.6623408794403076),
 ('grandson', 0.6618767380714417),
 ('grandparents', 0.6446609497070312),
 ('wife', 0.6445354223251343)]

设𝑚、𝑔、𝑤和𝑥分别表示“man”、“grandfather”、“woman”和答案的词向量。仅使用向量𝑚、𝑔、𝑤和向量运算符+和−来回答，最大化与𝑥的余弦相似度的表达式是什么？

提示：回忆一下，词向量只是表示一个词的多维向量。使用任意位置的每个向量绘制一个二维示例可能会有所帮助。在坐标平面上，“man”和“woman”相对于“grandfather”和答案会位于什么位置？

Question 2.5: Finding Analogies

a. 对于前面的例子，很明显“grandmother”完成了类比。但是，请给出一个直观的解释，说明为什么 most_similar 函数会给我们像“granddaughter”、“daughter”或“mother”这样的词？

b. 根据这些向量，找到一个符合类比关系的例子（即预期的词排在首位）。在你的解答中，请用 x:y :: a:b 的形式表述完整的类比。如果你认为类比关系比较复杂，请用一两句话解释为什么这个类比成立。

# For example: x, y, a, b = ("", "", "", "")
# ------------------
# Write your implementation here.

a = 'woman'
x = 'man'
y = 'actor'
b = 'actress'
# ------------------

# Test the solution
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] == b

Question 2.6: Incorrect Analogy

a. 以下，我们希望看到预期的类比 “手：手套::脚：袜子”，但我们却看到了意外的结果。请给出一个可能的原因，解释为什么这个类比会出现这种结果。

1	pprint.pprint(wv_from_bin.most_similar(positive=['foot', 'glove'], negative=['hand']))

[('45,000-square', 0.4922031760215759),
 ('15,000-square', 0.4649604558944702),
 ('10,000-square', 0.45447564125061035),
 ('6,000-square', 0.44975775480270386),
 ('3,500-square', 0.4441334009170532),
 ('700-square', 0.44257497787475586),
 ('50,000-square', 0.4356396794319153),
 ('3,000-square', 0.4348651170730591),
 ('30,000-square', 0.4330596923828125),
 ('footed', 0.43236875534057617)]

b. 找到一个不符合以下向量的类比例子。在你的解答中，按照 x:y :: a:b 的格式写出预期的类比，并根据词向量指出 b 的错误值（在之前的例子中，这将是 ‘45,000-square’）。

# For example: x, y, a, b = ("", "", "", "")
# ------------------
# Write your implementation here.
x='birds'
y='nests'
a='dog'
b='kennel'
# ------------------
pprint.pprint(wv_from_bin.most_similar(positive=[a, y], negative=[x]))
assert wv_from_bin.most_similar(positive=[a, y], negative=[x])[0][0] != b

[('puppy', 0.5015850067138672),
 ('dogs', 0.47369998693466187),
 ('sled', 0.44785985350608826),
 ('nest', 0.44290176033973694),
 ('rottweiler', 0.44138768315315247),
 ('litter', 0.42728888988494873),
 ('junkyard', 0.425918310880661),
 ('terrier', 0.40305447578430176),
 ('raccoon', 0.39537933468818665),
 ('puppies', 0.3936391770839691)]

Question 2.7: Guided Analysis of Bias in Word Vectors

我们需要意识到我们的词嵌入中隐含的偏见（性别、种族、性取向等）。这种偏见可能是危险的，因为它可以通过应用这些模型的程序来强化刻板印象。

运行下面的代码单元格，来检查以下内容：（a）哪些词语与“男人”和“职业”最相似且与“女人”最不相似；（b）哪些词语与“女人”和“职业”最相似且与“男人”最不相似。指出女性相关词汇列表和男性相关词汇列表之间的差异，并解释它们是如何反映性别偏见的。

# Run this cell
# Here `positive` indicates the list of words to be similar to and `negative` indicates the list of words to be
# most dissimilar from.

pprint.pprint(wv_from_bin.most_similar(positive=['man', 'profession'], negative=['woman']))
print()
pprint.pprint(wv_from_bin.most_similar(positive=['woman', 'profession'], negative=['man']))

[('reputation', 0.5250176191329956),
 ('professions', 0.5178037881851196),
 ('skill', 0.49046966433525085),
 ('skills', 0.49005505442619324),
 ('ethic', 0.4897659420967102),
 ('business', 0.487585186958313),
 ('respected', 0.4859202802181244),
 ('practice', 0.48210465908050537),
 ('regarded', 0.4778572916984558),
 ('life', 0.4760662019252777)]

[('professions', 0.5957457423210144),
 ('practitioner', 0.4988412857055664),
 ('teaching', 0.48292139172554016),
 ('nursing', 0.48211798071861267),
 ('vocation', 0.4788966476917267),
 ('teacher', 0.47160348296165466),
 ('practicing', 0.4693780839443207),
 ('educator', 0.4652431905269623),
 ('physicians', 0.46289944648742676),
 ('professionals', 0.4601393938064575)]

Question 2.8: Independent Analysis of Bias in Word Vectors

使用 most_similar 函数来找到另一组类比对，这些类比对展示了向量中存在的某些偏见。请简要解释你发现的偏见示例。

# ------------------
# Write your implementation here.

pprint.pprint(wv_from_bin.most_similar(positive=['black','math'], negative=['asian']))
# ------------------

[('graders', 0.4807853102684021),
 ('elementary', 0.4512704312801361),
 ('teacher', 0.4432070851325989),
 ('students', 0.43790796399116516),
 ('reading', 0.4332512319087982),
 ('teachers', 0.42953428626060486),
 ('mathematics', 0.4293864667415619),
 ('classroom', 0.42626458406448364),
 ('exam', 0.4233267903327942),
 ('school', 0.42323949933052063)]

Question 2.9: Thinking About Bias

a. 解释一个词向量中产生偏见的途径。简要描述一个实际例子，展示这种偏见的来源。你的实际例子应专注于词向量，而不是其他人工智能系统中的偏见（例如，ChatGPT）。

b. 你可以使用一种方法来减轻词向量所展示的偏见。这种方法的一个真实世界的例子是什么？简要描述一下这个例子。