
Chapter 7 Solutions


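Chapter 7: Word Vectors

Write programs that perform the following tasks on word vectors (word embeddings), which represent the meaning of a word as a real-valued vector.

60. Loading and displaying word vectors

Download the word vectors pre-trained on the Google News dataset (about 100 billion words; 3 million words and phrases, 300 dimensions) and display the word vector for "United States". Note that "United States" is represented internally as "United_States".
In the end, it seems the usual approach is to load the vectors with a library called gensim.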
python
from gensim.models import KeyedVectors

path = "/content/drive/MyDrive/nlp100/GoogleNews-vectors-negative300.bin"
model = KeyedVectors.load_word2vec_format(path, binary=True)
python
us_vec = model["United_States"]
us_vec
array([-3.61328125e-02, -4.83398438e-02, 2.35351562e-01, 1.74804688e-01, -1.46484375e-01, -7.42187500e-02, -1.01562500e-01, -7.71484375e-02, 1.09375000e-01, -5.71289062e-02, -1.48437500e-01, -6.00585938e-02, 1.74804688e-01, -7.71484375e-02, 2.58789062e-02, -7.66601562e-02, -3.80859375e-02, 1.35742188e-01, 3.75976562e-02, -4.19921875e-02, -3.56445312e-02, 5.34667969e-02, 3.68118286e-04, -1.66992188e-01, -1.17187500e-01, 1.41601562e-01, -1.69921875e-01, -6.49414062e-02, -1.66992188e-01, 1.00585938e-01, 1.15722656e-01, -2.18750000e-01, -9.86328125e-02, -2.56347656e-02, 1.23046875e-01, -3.54003906e-02, -1.58203125e-01, -1.60156250e-01, 2.94189453e-02, 8.15429688e-02, 6.88476562e-02, 1.87500000e-01, 6.49414062e-02, 1.15234375e-01, -2.27050781e-02, 3.32031250e-01, -3.27148438e-02, 1.77734375e-01, -2.08007812e-01, 4.54101562e-02, -1.23901367e-02, 1.19628906e-01, 7.44628906e-03, -9.03320312e-03, 1.14257812e-01, 1.69921875e-01, -2.38281250e-01, -2.79541016e-02, -1.21093750e-01, 2.47802734e-02, 7.71484375e-02, -2.81982422e-02, -4.71191406e-02, 1.78222656e-02, -1.23046875e-01, -5.32226562e-02, 2.68554688e-02, -3.11279297e-02, -5.59082031e-02, -5.00488281e-02, -3.73535156e-02, 1.25976562e-01, 5.61523438e-02, 1.51367188e-01, 4.29687500e-02, -2.08007812e-01, -4.78515625e-02, 2.78320312e-02, 1.81640625e-01, 2.20703125e-01, -3.61328125e-02, -8.39843750e-02, -3.69548798e-05, -9.52148438e-02, -1.25000000e-01, -1.95312500e-01, -1.50390625e-01, -4.15039062e-02, 1.31835938e-01, 1.17675781e-01, 1.91650391e-02, 5.51757812e-02, -9.42382812e-02, -1.08886719e-01, 7.32421875e-02, -1.15234375e-01, 8.93554688e-02, -1.40625000e-01, 1.45507812e-01, 4.49218750e-02, -1.10473633e-02, -1.62353516e-02, 4.05883789e-03, 3.75976562e-02, -6.98242188e-02, -5.46875000e-02, 2.17285156e-02, -9.47265625e-02, 4.24804688e-02, 1.81884766e-02, -1.73339844e-02, 4.63867188e-02, -1.42578125e-01, 1.99218750e-01, 1.10839844e-01, 2.58789062e-02, -7.08007812e-02, -5.54199219e-02, 3.45703125e-01, 1.61132812e-01, 
-2.44140625e-01, -2.59765625e-01, -9.71679688e-02, 8.00781250e-02, -8.78906250e-02, -7.22656250e-02, 1.42578125e-01, -8.54492188e-02, -3.18359375e-01, 8.30078125e-02, 6.34765625e-02, 1.64062500e-01, -1.92382812e-01, -1.17675781e-01, -5.41992188e-02, -1.56250000e-01, -1.21582031e-01, -4.95605469e-02, 1.20117188e-01, -3.83300781e-02, 5.51757812e-02, -8.97216797e-03, 4.32128906e-02, 6.93359375e-02, 8.93554688e-02, 2.53906250e-01, 1.65039062e-01, 1.64062500e-01, -1.41601562e-01, 4.58984375e-02, 1.97265625e-01, -8.98437500e-02, 3.90625000e-02, -1.51367188e-01, -8.60595703e-03, -1.17675781e-01, -1.97265625e-01, -1.12792969e-01, 1.29882812e-01, 1.96289062e-01, 1.56402588e-03, 3.93066406e-02, 2.17773438e-01, -1.43554688e-01, 6.03027344e-02, -1.35742188e-01, 1.16210938e-01, -1.59912109e-02, 2.79296875e-01, 1.46484375e-01, -1.19628906e-01, 1.76757812e-01, 1.28906250e-01, -1.49414062e-01, 6.93359375e-02, -1.72851562e-01, 9.22851562e-02, 1.33056641e-02, -2.00195312e-01, -9.76562500e-02, -1.65039062e-01, -2.46093750e-01, -2.35595703e-02, -2.11914062e-01, 1.84570312e-01, -1.85546875e-02, 2.16796875e-01, 5.05371094e-02, 2.02636719e-02, 4.25781250e-01, 1.28906250e-01, -2.77099609e-02, 1.29882812e-01, -1.15722656e-01, -2.05078125e-02, 1.49414062e-01, 7.81250000e-03, -2.05078125e-01, -8.05664062e-02, -2.67578125e-01, -2.29492188e-02, -8.20312500e-02, 8.64257812e-02, 7.61718750e-02, -3.66210938e-02, 5.22460938e-02, -1.22070312e-01, -1.44042969e-02, -2.69531250e-01, 8.44726562e-02, -2.52685547e-02, -2.96630859e-02, -1.68945312e-01, 1.93359375e-01, -1.08398438e-01, 1.94091797e-02, -1.80664062e-01, 1.93359375e-01, -7.08007812e-02, 5.85937500e-02, -1.01562500e-01, -1.31835938e-01, 7.51953125e-02, -7.66601562e-02, 3.37219238e-03, -8.59375000e-02, 1.25000000e-01, 2.92968750e-02, 1.70898438e-01, -9.37500000e-02, -1.09375000e-01, -2.50244141e-02, 2.11914062e-01, -4.44335938e-02, 6.12792969e-02, 2.62451172e-02, -1.77734375e-01, 1.23046875e-01, -7.42187500e-02, -1.67968750e-01, 
-1.08886719e-01, -9.04083252e-04, -7.37304688e-02, 5.49316406e-02, 6.03027344e-02, 8.39843750e-02, 9.17968750e-02, -1.32812500e-01, 1.22070312e-01, -8.78906250e-03, 1.19140625e-01, -1.94335938e-01, -6.64062500e-02, -2.07031250e-01, 7.37304688e-02, 8.93554688e-02, 1.81884766e-02, -1.20605469e-01, -2.61230469e-02, 2.67333984e-02, 7.76367188e-02, -8.30078125e-02, 6.78710938e-02, -3.54003906e-02, 3.10546875e-01, -2.42919922e-02, -1.41601562e-01, -2.08007812e-01, -4.57763672e-03, -6.54296875e-02, -4.95605469e-02, 2.22656250e-01, 1.53320312e-01, -1.38671875e-01, -5.24902344e-02, 4.24804688e-02, -2.38281250e-01, 1.56250000e-01, 5.83648682e-04, -1.20605469e-01, -9.22851562e-02, -4.44335938e-02, 3.61328125e-02, -1.86767578e-02, -8.25195312e-02, -8.25195312e-02, -4.05273438e-02, 1.19018555e-02, 1.69921875e-01, -2.80761719e-02, 3.03649902e-03, 9.32617188e-02, -8.49609375e-02, 1.57470703e-02, 7.03125000e-02, 1.62353516e-02, -2.27050781e-02, 3.51562500e-02, 2.47070312e-01, -2.67333984e-02], dtype=float32)

61. Word similarity

Compute the cosine similarity between "United States" and "U.S.".
python
# `model` is already a KeyedVectors object, so call similarity() on it directly;
# going through the deprecated `.wv` attribute only triggers a DeprecationWarning.
us_similarity = model.similarity("United_States", "U.S.")
us_similarity
0.73107743
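For reference, the cosine similarity of two vectors a and b is a·b / (‖a‖‖b‖). Below is a minimal NumPy sketch of that formula using made-up toy vectors. The helper name `cosine_similarity` is introduced here for illustration and is not part of gensim.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy vectors (not taken from the Google News model)
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]    # same direction as u, so the cosine is 1
w = [-3.0, 0.0, 1.0]   # orthogonal to u, so the cosine is 0

print(cosine_similarity(u, v))
print(cosine_similarity(u, w))
```

With the model loaded, `cosine_similarity(model["United_States"], model["U.S."])` should agree with the value above up to floating-point rounding.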

62. Top 10 most similar words

Output the 10 words with the highest cosine similarity to "United States", together with their similarities.
python
# most_similar() returns the top 10 (word, similarity) pairs by default (topn=10)
us_most_similar = model.most_similar("United_States")
us_most_similar
[('Unites_States', 0.7877248525619507),
 ('Untied_States', 0.7541370391845703),
 ('United_Sates', 0.74007248878479),
 ('U.S.', 0.7310774326324463),
 ('theUnited_States', 0.6404393911361694),
 ('America', 0.6178410053253174),
 ('UnitedStates', 0.6167312264442444),
 ('Europe', 0.6132988929748535),
 ('countries', 0.6044804453849792),
 ('Canada', 0.6019070148468018)]
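Conceptually, this kind of lookup is a brute-force nearest-neighbour search by cosine similarity. Here is a rough sketch of the idea on a tiny made-up vocabulary. `top_k_similar`, `toy_words` and `toy_vectors` are hypothetical names for illustration, not gensim's implementation.

```python
import numpy as np

def top_k_similar(query, words, vectors, k=10):
    """Return the k (word, cosine) pairs most similar to `query`, excluding the query itself."""
    vectors = np.asarray(vectors, dtype=np.float64)
    # Normalize rows to unit length so a dot product equals the cosine similarity
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    q = unit[words.index(query)]
    sims = unit @ q                    # one cosine per vocabulary entry
    order = np.argsort(-sims)          # indices sorted by descending similarity
    return [(words[i], float(sims[i])) for i in order if words[i] != query][:k]

# Toy vocabulary (invented values)
toy_words = ["usa", "america", "canada", "banana"]
toy_vectors = [
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.6, 0.6, 0.0],
    [0.0, 0.0, 1.0],
]
result = top_k_similar("usa", toy_words, toy_vectors, k=2)
print(result)
```

The misspellings at the top of the real output ("Unites_States", "Untied_States") fit this picture: words used in nearly identical contexts end up with nearly parallel vectors.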

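63. Analogy via additive compositionality

Subtract the vector of "Madrid" from the word vector of "Spain", add the vector of "Athens", and output the 10 words most similar to the resulting vector, together with their similarities.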
python
# The output recorded below came from a run that passed "Athena" (typo for "Athens"), hence the Greek deities.
europe_similar = model.most_similar(positive=["Spain", "Athens"], negative=["Madrid"])
europe_similar
[('Zeus', 0.4913957417011261),
 ('Samos', 0.46127381920814514),
 ('Hestia', 0.4443296790122986),
 ('Minerva', 0.4405948519706726),
 ('Aphrodite', 0.4389914274215698),
 ('Hera', 0.43806713819503784),
 ('Leda', 0.42030245065689087),
 ('Artemis', 0.41747787594795227),
 ('Greece', 0.4161592125892639),
 ('Althena', 0.4142044186592102)]
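To make the vector arithmetic in this task concrete, here is a small sketch on hand-made 3-dimensional vectors. It follows what I understand `most_similar(positive=..., negative=...)` to do (unit-normalize the vectors, add the positives, subtract the negatives, rank by cosine, skip the input words), but treat that as an assumption rather than a description of gensim's exact code. All names and numbers are invented.

```python
import numpy as np

def analogy(positive, negative, words, vectors, topn=1):
    """Rank words by cosine similarity to sum(positive) - sum(negative)."""
    vectors = np.asarray(vectors, dtype=np.float64)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    index = {w: i for i, w in enumerate(words)}
    query = sum(unit[index[w]] for w in positive) - sum(unit[index[w]] for w in negative)
    query = query / np.linalg.norm(query)
    sims = unit @ query
    exclude = set(positive) | set(negative)   # never return one of the input words
    ranked = [(words[i], float(sims[i])) for i in np.argsort(-sims) if words[i] not in exclude]
    return ranked[:topn]

# Toy embedding: axis 0 = "Spain-ness", axis 1 = "Greece-ness", axis 2 = capital (+) vs country (-)
toy_words = ["Madrid", "Spain", "Athens", "Greece", "banana"]
toy_vectors = [
    [1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, -1.0],
    [0.2, 0.2, 0.1],
]
answer = analogy(positive=["Spain", "Athens"], negative=["Madrid"],
                 words=toy_words, vectors=toy_vectors)
print(answer)
```

In this toy space Spain - Madrid + Athens lands exactly on Greece. Real embeddings are much noisier, which is why the task asks for the top 10 rather than a single answer.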

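64. Experiments on the analogy data

Download the word analogy evaluation data, compute vec(word in column 2) - vec(word in column 1) + vec(word in column 3), and find the word most similar to that vector along with its similarity. Append the word and its similarity to the end of each example.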
python
import requests

response = requests.get("http://download.tensorflow.org/data/questions-words.txt")
data = response.text
python
print(data[:500])
: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
Athens Greece Beijing China
Athens Greece Berlin Germany
Athens Greece Bern Switzerland
Athens Greece Cairo Egypt
Athens Greece Canberra Australia
Athens Greece Hanoi Vietnam
Athens Greece Havana Cuba
Athens Greece Helsinki Finland
Athens Greece Islamabad Pakistan
Athens Greece Kabul Afghanistan
Athens Greece London England
Athens Greece Madrid Spain
Athens Greece Moscow Russia
Athens Greece Oslo Norway
Athens G
python
rows = data.split("\n")
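As the preview shows, the file mixes section headers (lines starting with ":") with four-word example lines. Also, splitting on "\n" leaves an empty string at the end whenever the text ends with a newline, which is easy to trip over when indexing `row[0]`. A small sketch of a parser that handles both cases follows. `parse_analogies` and `sample` are hypothetical names, and the sample is just the first lines of the preview.

```python
def parse_analogies(text):
    """Yield (category, a, b, c, d) tuples from questions-words style text."""
    category = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # skip blank lines
        if line.startswith(":"):
            category = line[1:].strip()   # section header such as ": capital-common-countries"
            continue
        a, b, c, d = line.split()
        yield category, a, b, c, d

sample = """: capital-common-countries
Athens Greece Baghdad Iraq
Athens Greece Bangkok Thailand
"""
examples = list(parse_analogies(sample))
print(examples)
```

Keeping the category alongside each example also makes it easy to group the results by section later.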
python
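from tqdm import tqdm

new_rows = []
for row in tqdm(rows):
    if not row or row.startswith(":"):
        # Section headers (and the empty string left by the trailing newline) are kept as-is
        new_rows.append(row)
    else:
        cols = row.split(" ")
        v1, v2, v3 = cols[0], cols[1], cols[2]
        # vec(column 2) - vec(column 1) + vec(column 3)
        # The output recorded below came from an earlier run that called
        # most_similar(positive=[v1, v3], negative=[v1]), so its appended words do not reflect this formula.
        word, similarity = model.most_similar(positive=[v2, v3], negative=[v1], topn=1)[0]
        new_rows.append(row + " " + word + " " + str(similarity))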
0%| | 0/19559 [00:00<?, ?it/s]
/usr/local/lib/python3.7/dist-packages/ipykernel_launcher.py:11: DeprecationWarning: Call to deprecated wv (Attribute will be removed in 4.0.0, use self instead).
  # This is added back by InteractiveShellApp.init_path()
2%|▏ | 434/19559 [02:44<2:00:53, 2.64it/s]
---------------------------------------------------------------------------

KeyboardInterrupt                         Traceback (most recent call last)

<ipython-input-24-86397de650e4> in <module>()
      9         v2 = cols[1]
     10         v3 = cols[2]
---> 11         word,similarity = model.wv.most_similar(positive=[v1,v3],negative=[v1],topn=1)[0]
     12         new_rows.append(row + " " + word + " " + str(similarity))


/usr/local/lib/python3.7/dist-packages/gensim/models/keyedvectors.py in most_similar(self, positive, negative, topn, restrict_vocab, indexer)
    540 
    541         limited = self.vectors_norm if restrict_vocab is None else self.vectors_norm[:restrict_vocab]
--> 542         dists = dot(limited, mean)
    543         if not topn:
    544             return dists


<__array_function__ internals> in dot(*args, **kwargs)


KeyboardInterrupt:
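The run was interrupted inside `dots = dot(limited, mean)`, that is, one matrix-vector product over the whole vocabulary for every row, with about two hours estimated for the full file. One possible way to reduce that (an idea I have not benchmarked on the real model) is to answer many queries at once with a single matrix-matrix product. The sketch below shows the idea on toy data. `batched_analogies` and all values are invented for illustration.

```python
import numpy as np

def batched_analogies(unit, words, queries, batch_size=64):
    """For each (a, b, c) return the best word for vec(b) - vec(a) + vec(c).

    `unit` must hold unit-length row vectors aligned with `words`.
    """
    index = {w: i for i, w in enumerate(words)}
    results = []
    for start in range(0, len(queries), batch_size):
        batch = queries[start:start + batch_size]
        ids = np.array([[index[a], index[b], index[c]] for a, b, c in batch])
        q = unit[ids[:, 1]] - unit[ids[:, 0]] + unit[ids[:, 2]]
        q /= np.linalg.norm(q, axis=1, keepdims=True)
        sims = q @ unit.T                       # shape: (batch, vocabulary)
        rows = np.arange(len(batch))[:, None]
        sims[rows, ids] = -np.inf               # never return one of the three input words
        best = sims.argmax(axis=1)
        results.extend((words[j], float(sims[i, j])) for i, j in enumerate(best))
    return results

toy_words = ["Madrid", "Spain", "Athens", "Greece", "banana"]
toy_vectors = np.array([
    [1.0, 0.0, 1.0],
    [1.0, 0.0, -1.0],
    [0.0, 1.0, 1.0],
    [0.0, 1.0, -1.0],
    [0.2, 0.2, 0.1],
])
unit = toy_vectors / np.linalg.norm(toy_vectors, axis=1, keepdims=True)
queries = [("Madrid", "Spain", "Athens"), ("Athens", "Greece", "Madrid")]
results = batched_analogies(unit, toy_words, queries)
print(results)
```

With a 3-million-word vocabulary every query in a batch needs its own vocabulary-sized row of similarities, so `batch_size` has to be chosen with available memory in mind. The `restrict_vocab` parameter visible in the traceback's `most_similar` signature is another option for shrinking the search.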
python
new_rows
[': capital-common-countries', 'Athens Greece Baghdad Iraq Mosul 0.838557779788971', 'Athens Greece Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Athens Greece Beijing China China 0.76484614610672', 'Athens Greece Berlin Germany Munich 0.6743212938308716', 'Athens Greece Bern Switzerland Hanover 0.5566277503967285', 'Athens Greece Cairo Egypt Tunis 0.7160316705703735', 'Athens Greece Canberra Australia Brisbane 0.7812012434005737', 'Athens Greece Hanoi Vietnam Ha_Noi 0.84051913022995', 'Athens Greece Havana Cuba Havana_Cuba 0.7055565118789673', 'Athens Greece Helsinki Finland Stockholm 0.7469482421875', 'Athens Greece Islamabad Pakistan Lahore 0.7515299320220947', 'Athens Greece Kabul Afghanistan Kandahar 0.7571498155593872', 'Athens Greece London England EURASIAN_NATURAL_RESOURCES_CORP. 0.6739696860313416', 'Athens Greece Madrid Spain Barcelona 0.7255446910858154', 'Athens Greece Moscow Russia Kiev 0.7665926814079285', 'Athens Greece Oslo Norway Oslo_Norway 0.7500494718551636', 'Athens Greece Ottawa Canada Toronto 0.7206088304519653', 'Athens Greece Paris France Parisian 0.6789354085922241', 'Athens Greece Rome Italy based_Isae_Institute 0.6246561408042908', 'Athens Greece Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Athens Greece Tehran Iran Teheran 0.8511031270027161', 'Athens Greece Tokyo Japan Osaka 0.7306584715843201', 'Baghdad Iraq Bangkok Thailand Chiang_Mai 0.7290725708007812', 'Baghdad Iraq Beijing China China 0.7648462057113647', 'Baghdad Iraq Berlin Germany Munich 0.6743212938308716', 'Baghdad Iraq Bern Switzerland Hanover 0.5566277503967285', 'Baghdad Iraq Cairo Egypt Tunis 0.7160316705703735', 'Baghdad Iraq Canberra Australia Brisbane 0.7812012434005737', 'Baghdad Iraq Hanoi Vietnam Ha_Noi 0.84051913022995', 'Baghdad Iraq Havana Cuba Havana_Cuba 0.7055564522743225', 'Baghdad Iraq Helsinki Finland Stockholm 0.7469482421875', 'Baghdad Iraq Islamabad Pakistan Lahore 0.7515299916267395', 'Baghdad Iraq Kabul Afghanistan Kandahar 
0.7571498155593872', 'Baghdad Iraq London England EURASIAN_NATURAL_RESOURCES_CORP. 0.6739696860313416', 'Baghdad Iraq Madrid Spain Barcelona 0.7255446910858154', 'Baghdad Iraq Moscow Russia Kiev 0.7665926814079285', 'Baghdad Iraq Oslo Norway Oslo_Norway 0.7500494718551636', 'Baghdad Iraq Ottawa Canada Toronto 0.7206088304519653', 'Baghdad Iraq Paris France Parisian 0.6789354085922241', 'Baghdad Iraq Rome Italy based_Isae_Institute 0.624656081199646', 'Baghdad Iraq Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Baghdad Iraq Tehran Iran Teheran 0.8511031270027161', 'Baghdad Iraq Tokyo Japan Osaka 0.7306584119796753', 'Baghdad Iraq Athens Greece Athens_Greece 0.6553778648376465', 'Bangkok Thailand Beijing China China 0.76484614610672', 'Bangkok Thailand Berlin Germany Munich 0.6743212938308716', 'Bangkok Thailand Bern Switzerland Hanover 0.5566277503967285', 'Bangkok Thailand Cairo Egypt Tunis 0.7160316705703735', 'Bangkok Thailand Canberra Australia Brisbane 0.7812013030052185', 'Bangkok Thailand Hanoi Vietnam Ha_Noi 0.84051913022995', 'Bangkok Thailand Havana Cuba Havana_Cuba 0.7055565118789673', 'Bangkok Thailand Helsinki Finland Stockholm 0.7469482421875', 'Bangkok Thailand Islamabad Pakistan Lahore 0.7515299916267395', 'Bangkok Thailand Kabul Afghanistan Kandahar 0.7571498155593872', 'Bangkok Thailand London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Bangkok Thailand Madrid Spain Barcelona 0.7255446910858154', 'Bangkok Thailand Moscow Russia Kiev 0.7665926814079285', 'Bangkok Thailand Oslo Norway Oslo_Norway 0.7500494718551636', 'Bangkok Thailand Ottawa Canada Toronto 0.7206088304519653', 'Bangkok Thailand Paris France Parisian 0.6789354085922241', 'Bangkok Thailand Rome Italy based_Isae_Institute 0.6246561408042908', 'Bangkok Thailand Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Bangkok Thailand Tehran Iran Teheran 0.8511031270027161', 'Bangkok Thailand Tokyo Japan Osaka 0.7306584119796753', 'Bangkok Thailand Athens Greece Athens_Greece 0.6553778648376465', 'Bangkok Thailand Baghdad Iraq Mosul 0.8385578393936157', 'Beijing China Berlin Germany Munich 0.6743212938308716', 'Beijing China Bern Switzerland Hanover 0.5566277503967285', 'Beijing China Cairo Egypt Tunis 0.7160316705703735', 'Beijing China Canberra Australia Brisbane 0.7812012434005737', 'Beijing China Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Beijing China Havana Cuba Havana_Cuba 0.7055565118789673', 'Beijing China Helsinki Finland Stockholm 0.7469482421875', 'Beijing China Islamabad Pakistan Lahore 0.7515299320220947', 'Beijing China Kabul Afghanistan Kandahar 0.7571498155593872', 'Beijing China London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Beijing China Madrid Spain Barcelona 0.7255446910858154', 'Beijing China Moscow Russia Kiev 0.7665926814079285', 'Beijing China Oslo Norway Oslo_Norway 0.7500494718551636', 'Beijing China Ottawa Canada Toronto 0.7206088304519653', 'Beijing China Paris France Parisian 0.6789354681968689', 'Beijing China Rome Italy based_Isae_Institute 0.624656081199646', 'Beijing China Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Beijing China Tehran Iran Teheran 0.8511031270027161', 'Beijing China Tokyo Japan Osaka 0.7306584119796753', 'Beijing China Athens Greece Athens_Greece 0.6553779244422913', 'Beijing China Baghdad Iraq Mosul 0.8385578393936157', 'Beijing China Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Berlin Germany Bern Switzerland Hanover 0.5566276907920837', 'Berlin Germany Cairo Egypt Tunis 0.7160316705703735', 'Berlin Germany Canberra Australia Brisbane 0.7812012434005737', 'Berlin Germany Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Berlin Germany Havana Cuba Havana_Cuba 0.7055565118789673', 'Berlin Germany Helsinki Finland Stockholm 0.7469482421875', 'Berlin Germany Islamabad Pakistan Lahore 0.7515299916267395', 'Berlin Germany Kabul Afghanistan Kandahar 0.7571498155593872', 'Berlin Germany London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Berlin Germany Madrid Spain Barcelona 0.7255446910858154', 'Berlin Germany Moscow Russia Kiev 0.7665926814079285', 'Berlin Germany Oslo Norway Oslo_Norway 0.7500494718551636', 'Berlin Germany Ottawa Canada Toronto 0.7206088304519653', 'Berlin Germany Paris France Parisian 0.6789354085922241', 'Berlin Germany Rome Italy based_Isae_Institute 0.6246561408042908', 'Berlin Germany Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Berlin Germany Tehran Iran Teheran 0.8511031270027161', 'Berlin Germany Tokyo Japan Osaka 0.7306584119796753', 'Berlin Germany Athens Greece Athens_Greece 0.6553778648376465', 'Berlin Germany Baghdad Iraq Mosul 0.8385578393936157', 'Berlin Germany Bangkok Thailand Chiang_Mai 0.729072630405426', 'Berlin Germany Beijing China China 0.76484614610672', 'Bern Switzerland Cairo Egypt Tunis 0.7160316705703735', 'Bern Switzerland Canberra Australia Brisbane 0.7812012434005737', 'Bern Switzerland Hanoi Vietnam Ha_Noi 0.84051913022995', 'Bern Switzerland Havana Cuba Havana_Cuba 0.7055564522743225', 'Bern Switzerland Helsinki Finland Stockholm 0.7469482421875', 'Bern Switzerland Islamabad Pakistan Lahore 0.7515299320220947', 'Bern Switzerland Kabul Afghanistan Kandahar 0.7571498155593872', 'Bern Switzerland London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Bern Switzerland Madrid Spain Barcelona 0.7255446910858154', 'Bern Switzerland Moscow Russia Kiev 0.7665926814079285', 'Bern Switzerland Oslo Norway Oslo_Norway 0.7500494718551636', 'Bern Switzerland Ottawa Canada Toronto 0.7206088304519653', 'Bern Switzerland Paris France Parisian 0.6789354085922241', 'Bern Switzerland Rome Italy based_Isae_Institute 0.6246561408042908', 'Bern Switzerland Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Bern Switzerland Tehran Iran Teheran 0.8511031270027161', 'Bern Switzerland Tokyo Japan Osaka 0.7306584119796753', 'Bern Switzerland Athens Greece Athens_Greece 0.655377984046936', 'Bern Switzerland Baghdad Iraq Mosul 0.838557779788971', 'Bern Switzerland Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Bern Switzerland Beijing China China 0.76484614610672', 'Bern Switzerland Berlin Germany Munich 0.6743212938308716', 'Cairo Egypt Canberra Australia Brisbane 0.7812012434005737', 'Cairo Egypt Hanoi Vietnam Ha_Noi 0.84051913022995', 'Cairo Egypt Havana Cuba Havana_Cuba 0.7055565118789673', 'Cairo Egypt Helsinki Finland Stockholm 0.7469482421875', 'Cairo Egypt Islamabad Pakistan Lahore 0.7515299320220947', 'Cairo Egypt Kabul Afghanistan Kandahar 0.7571498155593872', 'Cairo Egypt London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Cairo Egypt Madrid Spain Barcelona 0.7255446910858154', 'Cairo Egypt Moscow Russia Kiev 0.7665926814079285', 'Cairo Egypt Oslo Norway Oslo_Norway 0.7500494718551636', 'Cairo Egypt Ottawa Canada Toronto 0.7206088304519653', 'Cairo Egypt Paris France Parisian 0.6789354085922241', 'Cairo Egypt Rome Italy based_Isae_Institute 0.624656081199646', 'Cairo Egypt Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Cairo Egypt Tehran Iran Teheran 0.8511031270027161', 'Cairo Egypt Tokyo Japan Osaka 0.7306584119796753', 'Cairo Egypt Athens Greece Athens_Greece 0.655377984046936', 'Cairo Egypt Baghdad Iraq Mosul 0.8385578393936157', 'Cairo Egypt Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Cairo Egypt Beijing China China 0.76484614610672', 'Cairo Egypt Berlin Germany Munich 0.6743212938308716', 'Cairo Egypt Bern Switzerland Hanover 0.5566277503967285', 'Canberra Australia Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Canberra Australia Havana Cuba Havana_Cuba 0.7055565118789673', 'Canberra Australia Helsinki Finland Stockholm 0.7469482421875', 'Canberra Australia Islamabad Pakistan Lahore 0.7515299320220947', 'Canberra Australia Kabul Afghanistan Kandahar 0.7571498155593872', 'Canberra Australia London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Canberra Australia Madrid Spain Barcelona 0.7255446910858154', 'Canberra Australia Moscow Russia Kiev 0.7665926814079285', 'Canberra Australia Oslo Norway Oslo_Norway 0.7500494718551636', 'Canberra Australia Ottawa Canada Toronto 0.7206088304519653', 'Canberra Australia Paris France Parisian 0.6789354681968689', 'Canberra Australia Rome Italy based_Isae_Institute 0.6246561408042908', 'Canberra Australia Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Canberra Australia Tehran Iran Teheran 0.8511031866073608', 'Canberra Australia Tokyo Japan Osaka 0.7306584119796753', 'Canberra Australia Athens Greece Athens_Greece 0.6553779244422913', 'Canberra Australia Baghdad Iraq Mosul 0.8385578393936157', 'Canberra Australia Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Canberra Australia Beijing China China 0.76484614610672', 'Canberra Australia Berlin Germany Munich 0.6743212938308716', 'Canberra Australia Bern Switzerland Hanover 0.5566277503967285', 'Canberra Australia Cairo Egypt Tunis 0.7160317301750183', 'Hanoi Vietnam Havana Cuba Havana_Cuba 0.7055563926696777', 'Hanoi Vietnam Helsinki Finland Stockholm 0.7469482421875', 'Hanoi Vietnam Islamabad Pakistan Lahore 0.7515299320220947', 'Hanoi Vietnam Kabul Afghanistan Kandahar 0.7571498155593872', 'Hanoi Vietnam London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Hanoi Vietnam Madrid Spain Barcelona 0.7255446910858154', 'Hanoi Vietnam Moscow Russia Kiev 0.7665926814079285', 'Hanoi Vietnam Oslo Norway Oslo_Norway 0.7500494718551636', 'Hanoi Vietnam Ottawa Canada Toronto 0.7206088304519653', 'Hanoi Vietnam Paris France Parisian 0.6789354085922241', 'Hanoi Vietnam Rome Italy based_Isae_Institute 0.624656081199646', 'Hanoi Vietnam Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Hanoi Vietnam Tehran Iran Teheran 0.8511031270027161', 'Hanoi Vietnam Tokyo Japan Osaka 0.7306584119796753', 'Hanoi Vietnam Athens Greece Athens_Greece 0.6553778648376465', 'Hanoi Vietnam Baghdad Iraq Mosul 0.8385578393936157', 'Hanoi Vietnam Bangkok Thailand Chiang_Mai 0.729072630405426', 'Hanoi Vietnam Beijing China China 0.76484614610672', 'Hanoi Vietnam Berlin Germany Munich 0.6743212938308716', 'Hanoi Vietnam Bern Switzerland Hanover 0.5566277503967285', 'Hanoi Vietnam Cairo Egypt Tunis 0.7160316705703735', 'Hanoi Vietnam Canberra Australia Brisbane 0.7812012434005737', 'Havana Cuba Helsinki Finland Stockholm 0.7469482421875', 'Havana Cuba Islamabad Pakistan Lahore 0.7515299320220947', 'Havana Cuba Kabul Afghanistan Kandahar 0.7571498155593872', 'Havana Cuba London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Havana Cuba Madrid Spain Barcelona 0.7255446910858154', 'Havana Cuba Moscow Russia Kiev 0.7665926814079285', 'Havana Cuba Oslo Norway Oslo_Norway 0.7500494718551636', 'Havana Cuba Ottawa Canada Toronto 0.7206088304519653', 'Havana Cuba Paris France Parisian 0.6789354085922241', 'Havana Cuba Rome Italy based_Isae_Institute 0.624656081199646', 'Havana Cuba Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Havana Cuba Tehran Iran Teheran 0.8511031270027161', 'Havana Cuba Tokyo Japan Osaka 0.7306584119796753', 'Havana Cuba Athens Greece Athens_Greece 0.6553778648376465', 'Havana Cuba Baghdad Iraq Mosul 0.8385578393936157', 'Havana Cuba Bangkok Thailand Chiang_Mai 0.7290725708007812', 'Havana Cuba Beijing China China 0.76484614610672', 'Havana Cuba Berlin Germany Munich 0.6743212938308716', 'Havana Cuba Bern Switzerland Hanover 0.5566277503967285', 'Havana Cuba Cairo Egypt Tunis 0.7160316705703735', 'Havana Cuba Canberra Australia Brisbane 0.7812012434005737', 'Havana Cuba Hanoi Vietnam Ha_Noi 0.84051913022995', 'Helsinki Finland Islamabad Pakistan Lahore 0.7515299320220947', 'Helsinki Finland Kabul Afghanistan Kandahar 0.7571498155593872', 'Helsinki Finland London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Helsinki Finland Madrid Spain Barcelona 0.7255446910858154', 'Helsinki Finland Moscow Russia Kiev 0.7665927410125732', 'Helsinki Finland Oslo Norway Oslo_Norway 0.7500494718551636', 'Helsinki Finland Ottawa Canada Toronto 0.7206088304519653', 'Helsinki Finland Paris France Parisian 0.6789354085922241', 'Helsinki Finland Rome Italy based_Isae_Institute 0.6246561408042908', 'Helsinki Finland Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Helsinki Finland Tehran Iran Teheran 0.8511031270027161', 'Helsinki Finland Tokyo Japan Osaka 0.7306584715843201', 'Helsinki Finland Athens Greece Athens_Greece 0.6553778648376465', 'Helsinki Finland Baghdad Iraq Mosul 0.8385578393936157', 'Helsinki Finland Bangkok Thailand Chiang_Mai 0.729072630405426', 'Helsinki Finland Beijing China China 0.7648462057113647', 'Helsinki Finland Berlin Germany Munich 0.6743212938308716', 'Helsinki Finland Bern Switzerland Hanover 0.5566277503967285', 'Helsinki Finland Cairo Egypt Tunis 0.7160316705703735', 'Helsinki Finland Canberra Australia Brisbane 0.7812012434005737', 'Helsinki Finland Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Helsinki Finland Havana Cuba Havana_Cuba 0.7055565118789673', 'Islamabad Pakistan Kabul Afghanistan Kandahar 0.7571498155593872', 'Islamabad Pakistan London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Islamabad Pakistan Madrid Spain Barcelona 0.7255446910858154', 'Islamabad Pakistan Moscow Russia Kiev 0.7665926814079285', 'Islamabad Pakistan Oslo Norway Oslo_Norway 0.7500494718551636', 'Islamabad Pakistan Ottawa Canada Toronto 0.7206088304519653', 'Islamabad Pakistan Paris France Parisian 0.6789354085922241', 'Islamabad Pakistan Rome Italy based_Isae_Institute 0.624656081199646', 'Islamabad Pakistan Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Islamabad Pakistan Tehran Iran Teheran 0.8511031866073608', 'Islamabad Pakistan Tokyo Japan Osaka 0.7306584715843201', 'Islamabad Pakistan Athens Greece Athens_Greece 0.6553778648376465', 'Islamabad Pakistan Baghdad Iraq Mosul 0.8385578393936157', 'Islamabad Pakistan Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Islamabad Pakistan Beijing China China 0.7648462057113647', 'Islamabad Pakistan Berlin Germany Munich 0.6743212938308716', 'Islamabad Pakistan Bern Switzerland Hanover 0.5566277503967285', 'Islamabad Pakistan Cairo Egypt Tunis 0.7160316705703735', 'Islamabad Pakistan Canberra Australia Brisbane 0.7812012434005737', 'Islamabad Pakistan Hanoi Vietnam Ha_Noi 0.84051913022995', 'Islamabad Pakistan Havana Cuba Havana_Cuba 0.7055565118789673', 'Islamabad Pakistan Helsinki Finland Stockholm 0.7469482421875', 'Kabul Afghanistan London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Kabul Afghanistan Madrid Spain Barcelona 0.7255446910858154', 'Kabul Afghanistan Moscow Russia Kiev 0.7665926814079285', 'Kabul Afghanistan Oslo Norway Oslo_Norway 0.7500494718551636', 'Kabul Afghanistan Ottawa Canada Toronto 0.7206088304519653', 'Kabul Afghanistan Paris France Parisian 0.6789354085922241', 'Kabul Afghanistan Rome Italy based_Isae_Institute 0.6246561408042908', 'Kabul Afghanistan Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Kabul Afghanistan Tehran Iran Teheran 0.8511031866073608', 'Kabul Afghanistan Tokyo Japan Osaka 0.7306584119796753', 'Kabul Afghanistan Athens Greece Athens_Greece 0.6553778648376465', 'Kabul Afghanistan Baghdad Iraq Mosul 0.8385578393936157', 'Kabul Afghanistan Bangkok Thailand Chiang_Mai 0.729072630405426', 'Kabul Afghanistan Beijing China China 0.76484614610672', 'Kabul Afghanistan Berlin Germany Munich 0.6743212938308716', 'Kabul Afghanistan Bern Switzerland Hanover 0.5566277503967285', 'Kabul Afghanistan Cairo Egypt Tunis 0.7160317301750183', 'Kabul Afghanistan Canberra Australia Brisbane 0.7812013030052185', 'Kabul Afghanistan Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Kabul Afghanistan Havana Cuba Havana_Cuba 0.7055565118789673', 'Kabul Afghanistan Helsinki Finland Stockholm 0.7469482421875', 'Kabul Afghanistan Islamabad Pakistan Lahore 0.7515299916267395', 'London England Madrid Spain Barcelona 0.7255446910858154', 'London England Moscow Russia Kiev 0.7665926814079285', 'London England Oslo Norway Oslo_Norway 0.7500494718551636', 'London England Ottawa Canada Toronto 0.7206088304519653', 'London England Paris France Parisian 0.6789354085922241', 'London England Rome Italy based_Isae_Institute 0.6246561408042908', 'London England Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'London England Tehran Iran Teheran 0.8511031270027161', 'London England Tokyo Japan Osaka 0.7306584715843201', 'London England Athens Greece Athens_Greece 0.6553779244422913', 'London England Baghdad Iraq 
Mosul 0.8385578393936157', 'London England Bangkok Thailand Chiang_Mai 0.729072630405426', 'London England Beijing China China 0.7648462057113647', 'London England Berlin Germany Munich 0.6743212938308716', 'London England Bern Switzerland Hanover 0.5566277503967285', 'London England Cairo Egypt Tunis 0.7160317301750183', 'London England Canberra Australia Brisbane 0.7812012434005737', 'London England Hanoi Vietnam Ha_Noi 0.8405190706253052', 'London England Havana Cuba Havana_Cuba 0.7055565118789673', 'London England Helsinki Finland Stockholm 0.7469482421875', 'London England Islamabad Pakistan Lahore 0.7515299320220947', 'London England Kabul Afghanistan Kandahar 0.7571498155593872', 'Madrid Spain Moscow Russia Kiev 0.7665926814079285', 'Madrid Spain Oslo Norway Oslo_Norway 0.7500494718551636', 'Madrid Spain Ottawa Canada Toronto 0.7206088304519653', 'Madrid Spain Paris France Parisian 0.6789354681968689', 'Madrid Spain Rome Italy based_Isae_Institute 0.624656081199646', 'Madrid Spain Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Madrid Spain Tehran Iran Teheran 0.8511031270027161', 'Madrid Spain Tokyo Japan Osaka 0.7306584119796753', 'Madrid Spain Athens Greece Athens_Greece 0.6553779244422913', 'Madrid Spain Baghdad Iraq Mosul 0.8385578393936157', 'Madrid Spain Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Madrid Spain Beijing China China 0.76484614610672', 'Madrid Spain Berlin Germany Munich 0.6743212938308716', 'Madrid Spain Bern Switzerland Hanover 0.5566277503967285', 'Madrid Spain Cairo Egypt Tunis 0.7160317301750183', 'Madrid Spain Canberra Australia Brisbane 0.7812012434005737', 'Madrid Spain Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Madrid Spain Havana Cuba Havana_Cuba 0.7055565118789673', 'Madrid Spain Helsinki Finland Stockholm 0.7469483017921448', 'Madrid Spain Islamabad Pakistan Lahore 0.7515299320220947', 'Madrid Spain Kabul Afghanistan Kandahar 0.7571498155593872', 'Madrid Spain London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Moscow Russia Oslo Norway Oslo_Norway 0.7500494718551636', 'Moscow Russia Ottawa Canada Toronto 0.7206088304519653', 'Moscow Russia Paris France Parisian 0.6789354085922241', 'Moscow Russia Rome Italy based_Isae_Institute 0.624656081199646', 'Moscow Russia Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Moscow Russia Tehran Iran Teheran 0.8511031270027161', 'Moscow Russia Tokyo Japan Osaka 0.7306584715843201', 'Moscow Russia Athens Greece Athens_Greece 0.6553778648376465', 'Moscow Russia Baghdad Iraq Mosul 0.8385578393936157', 'Moscow Russia Bangkok Thailand Chiang_Mai 0.729072630405426', 'Moscow Russia Beijing China China 0.76484614610672', 'Moscow Russia Berlin Germany Munich 0.6743212938308716', 'Moscow Russia Bern Switzerland Hanover 0.5566277503967285', 'Moscow Russia Cairo Egypt Tunis 0.7160316705703735', 'Moscow Russia Canberra Australia Brisbane 0.7812012434005737', 'Moscow Russia Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Moscow Russia Havana Cuba Havana_Cuba 0.7055565118789673', 'Moscow Russia Helsinki Finland Stockholm 0.7469483017921448', 'Moscow Russia Islamabad Pakistan Lahore 0.7515299320220947', 'Moscow Russia Kabul Afghanistan Kandahar 0.7571498155593872', 'Moscow Russia London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Moscow Russia Madrid Spain Barcelona 0.7255446910858154', 'Oslo Norway Ottawa Canada Toronto 0.7206088304519653', 'Oslo Norway Paris France Parisian 0.6789354085922241', 'Oslo Norway Rome Italy based_Isae_Institute 0.624656081199646', 'Oslo Norway Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Oslo Norway Tehran Iran Teheran 0.8511031270027161', 'Oslo Norway Tokyo Japan Osaka 0.7306584715843201', 'Oslo Norway Athens Greece Athens_Greece 0.6553779244422913', 'Oslo Norway Baghdad Iraq Mosul 0.8385578393936157', 'Oslo Norway Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Oslo Norway Beijing China China 0.76484614610672', 'Oslo Norway Berlin Germany Munich 0.6743212938308716', 'Oslo Norway Bern Switzerland Hanover 0.5566277503967285', 'Oslo Norway Cairo Egypt Tunis 0.7160316705703735', 'Oslo Norway Canberra Australia Brisbane 0.7812012434005737', 'Oslo Norway Hanoi Vietnam Ha_Noi 0.84051913022995', 'Oslo Norway Havana Cuba Havana_Cuba 0.7055565118789673', 'Oslo Norway Helsinki Finland Stockholm 0.7469482421875', 'Oslo Norway Islamabad Pakistan Lahore 0.7515299320220947', 'Oslo Norway Kabul Afghanistan Kandahar 0.7571498155593872', 'Oslo Norway London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Oslo Norway Madrid Spain Barcelona 0.7255446910858154', 'Oslo Norway Moscow Russia Kiev 0.7665926814079285', 'Ottawa Canada Paris France Parisian 0.6789354085922241', 'Ottawa Canada Rome Italy based_Isae_Institute 0.624656081199646', 'Ottawa Canada Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Ottawa Canada Tehran Iran Teheran 0.8511031270027161', 'Ottawa Canada Tokyo Japan Osaka 0.7306584715843201', 'Ottawa Canada Athens Greece Athens_Greece 0.6553778648376465', 'Ottawa Canada Baghdad Iraq Mosul 0.8385578393936157', 'Ottawa Canada Bangkok Thailand Chiang_Mai 0.7290726900100708', 'Ottawa Canada Beijing China China 0.76484614610672', 'Ottawa Canada Berlin Germany Munich 0.6743212938308716', 'Ottawa Canada Bern Switzerland Hanover 0.5566277503967285', 'Ottawa Canada Cairo Egypt Tunis 0.7160316705703735', 'Ottawa Canada Canberra Australia Brisbane 0.7812012434005737', 'Ottawa Canada Hanoi Vietnam Ha_Noi 0.84051913022995', 'Ottawa Canada Havana Cuba Havana_Cuba 0.7055565118789673', 'Ottawa Canada Helsinki Finland Stockholm 0.7469482421875', 'Ottawa Canada Islamabad Pakistan Lahore 0.7515299320220947', 'Ottawa Canada Kabul Afghanistan Kandahar 0.7571498155593872', 'Ottawa Canada London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Ottawa Canada Madrid Spain Barcelona 0.7255446910858154', 'Ottawa Canada Moscow Russia Kiev 0.7665926814079285', 'Ottawa Canada Oslo Norway Oslo_Norway 0.7500494718551636', 'Paris France Rome Italy based_Isae_Institute 0.624656081199646', 'Paris France Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Paris France Tehran Iran Teheran 0.8511031866073608', 'Paris France Tokyo Japan Osaka 0.7306584715843201', 'Paris France Athens Greece Athens_Greece 0.6553779244422913', 'Paris France Baghdad Iraq Mosul 0.8385578393936157', 'Paris France Bangkok Thailand Chiang_Mai 0.729072630405426', 'Paris France Beijing China China 0.76484614610672', 'Paris France Berlin Germany Munich 0.6743212938308716', 'Paris France Bern Switzerland Hanover 0.5566277503967285', 'Paris France Cairo Egypt Tunis 0.7160316705703735', 'Paris France Canberra Australia Brisbane 0.7812012434005737', 'Paris France Hanoi Vietnam Ha_Noi 0.84051913022995', 'Paris France Havana Cuba Havana_Cuba 0.7055565118789673', 'Paris France Helsinki Finland Stockholm 0.7469482421875', 'Paris France Islamabad Pakistan Lahore 0.7515299916267395', 'Paris France Kabul Afghanistan Kandahar 0.7571498155593872', 'Paris France London England EURASIAN_NATURAL_RESOURCES_CORP. 
0.6739696860313416', 'Paris France Madrid Spain Barcelona 0.7255446910858154', 'Paris France Moscow Russia Kiev 0.7665926814079285', 'Paris France Oslo Norway Oslo_Norway 0.7500494718551636', 'Paris France Ottawa Canada Toronto 0.7206088304519653', 'Rome Italy Stockholm Sweden Stockholm_Sweden 0.7780462503433228', 'Rome Italy Tehran Iran Teheran 0.8511031270027161', 'Rome Italy Tokyo Japan Osaka 0.7306584119796753', 'Rome Italy Athens Greece Athens_Greece 0.6553778648376465', 'Rome Italy Baghdad Iraq Mosul 0.8385578393936157', 'Rome Italy Bangkok Thailand Chiang_Mai 0.729072630405426', 'Rome Italy Beijing China China 0.76484614610672', 'Rome Italy Berlin Germany Munich 0.6743212938308716', 'Rome Italy Bern Switzerland Hanover 0.5566277503967285', 'Rome Italy Cairo Egypt Tunis 0.7160317301750183', 'Rome Italy Canberra Australia Brisbane 0.7812012434005737', 'Rome Italy Hanoi Vietnam Ha_Noi 0.8405190706253052', 'Rome Italy Havana Cuba Havana_Cuba 0.7055565118789673', 'Rome Italy Helsinki Finland Stockholm 0.7469482421875', 'Rome Italy Islamabad Pakistan Lahore 0.7515299320220947']
This was slow, so I stopped it partway through. A full run takes about two hours.

65. Accuracy on the analogy task

Using the results of problem 64, measure the accuracy on the semantic analogies and on the syntactic analogies.
python
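# A row is correct when its 4th and 5th columns are equal
# There are semantic analogy categories and syntactic analogy categories; compute the accuracy for each
# Not run here because the previous problem takes too long to execute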
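The comments above only describe the idea, so here is a minimal sketch of it. It assumes the problem 64 output keeps the `: category` header lines and that each result row is `w1 w2 w3 expected predicted similarity`. It also assumes that categories whose names start with `gram` are the syntactic ones. The function name `analogy_accuracy` and the sample rows are made up for illustration.

```python
from collections import defaultdict


def analogy_accuracy(lines):
    """Return the accuracy per analogy type from problem-64 style lines.

    Assumed (hypothetical) layout:
      ": category"                              -> category header
      "w1 w2 w3 expected predicted similarity"  -> one analogy result
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    kind = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            # Assumption: "gram*" categories are syntactic, the rest semantic
            name = line[1:].strip()
            kind = "syntactic" if name.startswith("gram") else "semantic"
            continue
        cols = line.split()
        if kind is None or len(cols) < 5:
            continue  # skip rows that do not match the assumed layout
        total[kind] += 1
        if cols[3] == cols[4]:  # 4th column (expected) vs 5th (predicted)
            correct[kind] += 1
    return {k: correct[k] / total[k] for k in total}


# Toy rows for illustration only (not real model output)
sample = [
    ": capital-common-countries",
    "Athens Greece Baghdad Iraq Iraq 0.7",
    "Athens Greece Bangkok Thailand Chiang_Mai 0.7",
    ": gram3-comparative",
    "bad worse big bigger bigger 0.8",
]
print(analogy_accuracy(sample))
```

With the real results, the same function could be fed the saved output lines once the long run from problem 64 has finished.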

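66. Evaluation on WordSimilarity-353

Download the evaluation data of The WordSimilarity-353 Test Collection, and compute the Spearman correlation coefficient between the ranking of similarities computed from the word vectors and the ranking of human similarity judgments.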
python
!wget http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.zip
--2021-09-07 12:36:29-- http://www.gabrilovich.com/resources/data/wordsim353/wordsim353.zip
Resolving www.gabrilovich.com (www.gabrilovich.com)... 208.97.177.37
Connecting to www.gabrilovich.com (www.gabrilovich.com)|208.97.177.37|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 23257 (23K) [application/zip]
Saving to: ‘wordsim353.zip.1’

wordsim353.zip.1    100%[===================>]  22.71K  --.-KB/s    in 0.04s

2021-09-07 12:36:30 (631 KB/s) - ‘wordsim353.zip.1’ saved [23257/23257]
python
!unzip ./wordsim353.zip
Archive: ./wordsim353.zip
replace combined.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: combined.csv
replace set1.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: set1.csv
replace set2.csv? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: set2.csv
replace combined.tab? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: combined.tab
replace set1.tab? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: set1.tab
replace set2.tab? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: set2.tab
replace instructions.txt? [y]es, [n]o, [A]ll, [N]one, [r]ename: y
  inflating: instructions.txt
python
import pandas as pd

df_combined = pd.read_csv("./combined.csv")
python
df_combined.head()
python
import numpy as np

rank_human = np.argsort(np.argsort(df_combined.loc[:,"Human (mean)"].values))
rank_human
array([209, 243, 352, 255, 271, 267, 139, 177, 259, 210, 252, 216, 163, 149, 227, 197, 213, 97, 140, 232, 301, 13, 12, 6, 17, 202, 70, 8, 254, 305, 281, 347, 1, 0, 329, 148, 201, 325, 275, 14, 351, 331, 343, 212, 198, 264, 206, 274, 33, 320, 246, 164, 207, 290, 306, 244, 105, 128, 324, 308, 49, 11, 153, 219, 286, 37, 339, 340, 349, 336, 346, 337, 342, 350, 334, 261, 233, 247, 187, 174, 39, 90, 141, 110, 24, 89, 88, 20, 50, 9, 51, 7, 2, 23, 3, 4, 321, 345, 344, 312, 266, 241, 326, 280, 221, 136, 56, 294, 296, 231, 217, 225, 103, 131, 143, 302, 226, 218, 251, 195, 184, 311, 276, 238, 204, 130, 258, 323, 299, 303, 297, 304, 291, 292, 142, 73, 41, 200, 32, 16, 160, 196, 215, 31, 43, 65, 25, 239, 112, 21, 145, 250, 240, 19, 114, 115, 211, 161, 223, 314, 96, 45, 135, 310, 262, 95, 179, 193, 29, 26, 332, 93, 124, 68, 47, 307, 176, 168, 191, 78, 34, 48, 132, 268, 52, 10, 57, 199, 102, 69, 84, 194, 86, 144, 150, 265, 40, 230, 127, 189, 289, 108, 111, 117, 228, 60, 28, 147, 248, 309, 98, 118, 113, 203, 287, 94, 101, 92, 54, 134, 67, 44, 120, 109, 186, 83, 100, 30, 107, 300, 123, 298, 151, 158, 125, 18, 341, 156, 205, 295, 104, 76, 116, 27, 185, 338, 220, 106, 35, 182, 283, 15, 85, 74, 42, 273, 190, 322, 260, 330, 180, 59, 157, 75, 272, 282, 348, 245, 162, 327, 315, 235, 146, 188, 58, 64, 234, 288, 155, 229, 278, 257, 166, 224, 62, 256, 317, 333, 284, 137, 169, 181, 81, 91, 154, 270, 319, 249, 38, 79, 236, 133, 263, 316, 335, 171, 313, 119, 328, 293, 222, 152, 82, 170, 279, 165, 46, 22, 71, 55, 66, 253, 183, 126, 172, 36, 5, 53, 99, 208, 121, 242, 138, 80, 61, 63, 129, 285, 77, 122, 214, 269, 237, 87, 192, 167, 277, 175, 159, 318, 173, 178, 72])
python
# Cosine similarity from the word vectors for each word pair
sim_model = [
    model.similarity(w1, w2)
    for w1, w2 in zip(df_combined["Word 1"], df_combined["Word 2"])
]
# argsort of argsort turns the similarities into 0-based ranks
rank_model = np.argsort(np.argsort(sim_model))
rank_model
array([189, 315, 352, 258, 272, 278, 264, 244, 236, 331, 267, 165, 339, 322, 335, 149, 283, 28, 300, 245, 298, 72, 27, 13, 56, 181, 17, 31, 229, 186, 281, 332, 21, 64, 340, 145, 285, 342, 337, 52, 153, 206, 349, 345, 310, 268, 222, 227, 171, 249, 190, 291, 118, 301, 260, 290, 34, 41, 327, 347, 168, 44, 116, 312, 325, 105, 324, 334, 346, 326, 313, 178, 304, 319, 330, 263, 257, 214, 148, 152, 7, 254, 50, 215, 265, 65, 109, 155, 63, 135, 163, 232, 99, 14, 6, 25, 221, 333, 107, 230, 193, 19, 187, 185, 20, 177, 38, 320, 284, 286, 307, 314, 26, 231, 299, 321, 195, 113, 220, 70, 132, 239, 174, 121, 303, 183, 273, 138, 270, 309, 225, 336, 201, 218, 262, 40, 1, 251, 10, 33, 202, 253, 139, 36, 11, 57, 12, 252, 45, 74, 176, 276, 84, 43, 158, 59, 205, 182, 241, 280, 161, 88, 127, 219, 166, 80, 15, 271, 18, 184, 348, 55, 92, 194, 30, 302, 151, 243, 279, 83, 134, 61, 228, 240, 208, 8, 75, 114, 67, 29, 24, 210, 199, 277, 91, 297, 157, 162, 197, 120, 173, 110, 108, 23, 293, 94, 32, 269, 128, 351, 98, 104, 198, 141, 255, 154, 47, 226, 66, 69, 85, 49, 103, 79, 122, 133, 169, 54, 96, 192, 62, 213, 78, 68, 156, 76, 344, 106, 288, 282, 129, 146, 142, 46, 123, 266, 191, 102, 73, 58, 204, 4, 305, 77, 53, 200, 100, 329, 242, 317, 150, 35, 170, 89, 246, 323, 180, 179, 9, 318, 209, 90, 144, 167, 275, 51, 2, 259, 188, 235, 308, 296, 207, 238, 233, 223, 316, 341, 216, 224, 125, 117, 131, 172, 196, 124, 343, 248, 37, 292, 111, 71, 234, 274, 287, 60, 350, 101, 328, 140, 115, 137, 93, 126, 250, 159, 160, 22, 119, 143, 217, 306, 311, 175, 136, 16, 3, 48, 130, 295, 86, 147, 39, 42, 112, 5, 261, 338, 164, 212, 203, 294, 237, 87, 82, 0, 289, 247, 81, 256, 97, 211, 95])
python
from scipy.stats import spearmanr

cor, pv = spearmanr(rank_model, rank_human)
print(cor,pv)
0.6997112576768793 3.324508417374433e-53
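One thing to keep in mind about the rank step: `np.argsort(np.argsort(...))` gives tied values different ranks depending on their position, whereas the usual definition of Spearman correlation gives ties their average rank. The sketch below writes that definition out in plain Python to show the difference. The helper names `average_ranks` and `spearman` are made up here, and this is not a replacement for `scipy.stats.spearmanr`.

```python
def average_ranks(values):
    """1-based ranks where tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values equal to values[order[i]]
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Pearson correlation of the average ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


# Toy values for illustration only
print(average_ranks([10, 20, 20, 30]))
print(spearman([1, 2, 3], [10, 20, 30]))
```

If the human scores contain ties, passing the raw scores and raw similarities (rather than argsort ranks) to a tie-aware routine may give a slightly different coefficient than the one printed above.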

67. k-means clustering

Extract the word vectors for country names and run k-means clustering with the number of clusters k=5.
python
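# `data` is assumed to hold the analogy file text loaded in an earlier cell
countries = []
category = ""
lines = data.split("\n")
for line in lines:
    if len(line)>0:
        if line[0] == ":":
            # Category header line such as ": capital-world"
            category = line[2:]
            print(category)
        else:
            words = line.split()
            if category in ['capital-common-countries', 'capital-world']:
                # takes the third token of the row
                countries.append(words[2])
            elif category in ['currency', 'gram6-nationality-adjective']:
                # takes the second token of the row
                countries.append(words[1])
# drop duplicates
countries = list(set(countries))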
capital-common-countries capital-world currency city-in-state family gram1-adjective-to-adverb gram2-opposite gram3-comparative gram4-superlative gram5-present-participle gram6-nationality-adjective gram7-past-tense gram8-plural gram9-plural-verbs
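As far as I can tell, rows in the capital categories of the analogy file read `capital1 country1 capital2 country2`, and rows in `currency` and `gram6-nationality-adjective` start with the country. Under that assumption, index 2 of a capital row is a capital city and index 1 of a currency or nationality row is a currency name or an adjective. Here is a sketch that picks the country columns instead. The function name `extract_countries` and the sample text are for illustration only.

```python
def extract_countries(text):
    """Collect country names from questions-words.txt style text.

    Assumed row layouts (4 tokens per row):
      capital categories:     capital1 country1 capital2 country2
      currency / nationality: country1 x1 country2 x2
    """
    capital_cats = {"capital-common-countries", "capital-world"}
    country_first_cats = {"currency", "gram6-nationality-adjective"}
    found = set()
    category = ""
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(":"):
            category = line[1:].strip()
            continue
        words = line.split()
        if len(words) != 4:
            continue  # skip rows that do not match the assumed layout
        if category in capital_cats:
            found.update([words[1], words[3]])
        elif category in country_first_cats:
            found.update([words[0], words[2]])
    return sorted(found)


# Small sample in the assumed file format
sample = """: capital-common-countries
Athens Greece Baghdad Iraq
: currency
Algeria dinar Angola kwanza
: gram6-nationality-adjective
Albania Albanian Argentina Argentinean
: family
boy girl brother sister
"""
print(extract_countries(sample))
```

Entries that are not actually countries may still slip in from some categories. Names missing from the word-vector vocabulary would also need to be filtered out before looking up vectors.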
python
contries_vec = [model[country] for country in countries]
python
from sklearn.cluster import KMeans
import numpy as np
kmeans = KMeans(n_clusters=5, random_state=0).fit(contries_vec)
python
pred = kmeans.predict(contries_vec)
python
countries
['hryvnia', 'Tehran', 'Bamako', 'Bucharest', 'Dublin', 'Ashgabat', 'Irish', 'euro', 'Podgorica', 'Stockholm', 'ruble', 'Rabat', 'Accra', 'Canberra', 'Greek', 'Swedish', 'Madrid', 'German', 'Khartoum', 'Abuja', 'Polish', 'rial', 'Macedonian', 'Icelandic', 'Kampala', 'zloty', 'peso', 'Kathmandu', 'leu', 'Valletta', 'Nuuk', 'French', 'Budapest', 'Portuguese', 'Sofia', 'Athens', 'kwanza', 'Albanian', 'Chinese', 'Minsk', 'Libreville', 'Astana', 'Oslo', 'Rome', 'Thimphu', 'Islamabad', 'Muscat', 'Nairobi', 'Austrian', 'Tbilisi', 'Harare', 'Tunis', 'Windhoek', 'Luanda', 'Tallinn', 'Suva', 'Bujumbura', 'Berlin', 'Antananarivo', 'Conakry', 'Norwegian', 'Italian', 'Managua', 'Yerevan', 'ringgit', 'Brazilian', 'English', 'Warsaw', 'Mexican', 'Maltese', 'Bern', 'Damascus', 'Belmopan', 'Kigali', 'Chilean', 'Monrovia', 'Tashkent', 'Roseau', 'Kiev', 'litas', 'Vienna', 'Croatian', 'Copenhagen', 'Beirut', 'Caracas', 'dram', 'naira', 'Tirana', 'Egyptian', 'lev', 'Dhaka', 'Dutch', 'Vilnius', 'dollar', 'Australian', 'Lilongwe', 'Ukrainian', 'Ottawa', 'Tegucigalpa', 'Lima', 'Vaduz', 'London', 'riel', 'Skopje', 'Chisinau', 'Belorussian', 'Paramaribo', 'Bratislava', 'baht', 'forint', 'Peruvian', 'krona', 'Mogadishu', 'Tripoli', 'Algiers', 'lats', 'Swiss', 'Bangkok', 'Danish', 'Ankara', 'Russian', 'Nassau', 'krone', 'Dushanbe', 'Manama', 'Amman', 'Israeli', 'Moscow', 'kuna', 'Argentinean', 'dong', 'Baghdad', 'Niamey', 'Thai', 'Hanoi', 'Bishkek', 'Riga', 'Zagreb', 'Apia', 'Paris', 'Tokyo', 'Ljubljana', 'Cambodian', 'Jakarta', 'Slovakian', 'Lisbon', 'Moldovan', 'yen', 'won', 'Nicosia', 'real', 'Lusaka', 'Kabul', 'Taipei', 'Baku', 'Nouakchott', 'dinar', 'Kingston', 'Japanese', 'Funafuti', 'Brussels', 'Quito', 'Gaborone', 'Doha', 'Indian', 'Manila', 'Vientiane', 'Spanish', 'Helsinki', 'denar', 'Cairo', 'Korean', 'Maputo', 'Havana', 'Banjul', 'Dakar', 'Bulgarian', 'Beijing', 'Asmara', 'Georgetown', 'rupee', 'Santiago', 'Belgrade', 'Montevideo', 'Colombian']
python
pred
array([3, 2, 0, 1, 2, 1, 4, 3, 1, 2, 3, 2, 0, 2, 4, 4, 2, 4, 0, 0, 4, 3, 4, 4, 0, 3, 3, 2, 3, 2, 2, 4, 1, 4, 1, 2, 3, 4, 4, 1, 0, 2, 2, 2, 2, 2, 2, 0, 4, 1, 0, 2, 0, 0, 1, 0, 0, 2, 0, 0, 4, 4, 2, 1, 3, 4, 4, 1, 4, 4, 2, 2, 0, 0, 4, 0, 1, 2, 1, 3, 2, 4, 2, 2, 2, 3, 3, 1, 4, 3, 2, 4, 1, 3, 4, 0, 4, 2, 0, 2, 1, 2, 3, 1, 1, 4, 0, 1, 3, 3, 4, 3, 0, 2, 2, 3, 4, 2, 4, 1, 4, 2, 3, 1, 2, 2, 4, 1, 3, 4, 3, 2, 0, 4, 2, 1, 1, 1, 0, 2, 2, 1, 4, 2, 4, 2, 4, 3, 4, 1, 4, 0, 2, 2, 1, 0, 3, 2, 4, 2, 2, 2, 0, 2, 4, 2, 2, 4, 1, 4, 2, 4, 0, 2, 0, 0, 4, 2, 0, 2, 3, 2, 1, 2, 4], dtype=int32)
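Two long parallel arrays are hard to read, so it may help to group the words by their predicted cluster. The sketch below does that with a plain dictionary. The name `group_by_cluster` is made up, and the toy input is just the first five entries of the word list and of `pred` printed above.

```python
from collections import defaultdict


def group_by_cluster(words, labels):
    """Map each cluster label to the sorted list of words assigned to it."""
    clusters = defaultdict(list)
    for word, label in zip(words, labels):
        clusters[int(label)].append(word)
    return {label: sorted(members) for label, members in sorted(clusters.items())}


# First five words and labels from the output above
toy_words = ["hryvnia", "Tehran", "Bamako", "Bucharest", "Dublin"]
toy_labels = [3, 2, 0, 1, 2]
for label, members in group_by_cluster(toy_words, toy_labels).items():
    print(label, members)
```

In the notebook, this would presumably be called as `group_by_cluster(countries, pred)`.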

68. Clustering with Ward's method

Run hierarchical clustering with Ward's method on the word vectors for country names. Then visualize the clustering result as a dendrogram.
python
from sklearn.cluster import AgglomerativeClustering
python
clustering = AgglomerativeClustering()
python
import numpy as np

from matplotlib import pyplot as plt
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering


def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)


# setting distance_threshold=0 ensures we compute the full tree.
# A separate name keeps the word-vector `model` from being overwritten.
ward_model = AgglomerativeClustering(distance_threshold=0, n_clusters=None) # default = ward

ward_model = ward_model.fit(contries_vec)
plt.title('Hierarchical Clustering Dendrogram')
# plot the top three levels of the dendrogram
plot_dendrogram(ward_model, truncate_mode='level', p=3, labels=pred)
plt.xlabel("Number of points in node (or index of point if no parenthesis).")
plt.show()
[Figure: Hierarchical Clustering Dendrogram]
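As an alternative to building the linkage matrix by hand from scikit-learn's output, SciPy can compute a Ward linkage directly. The following is a sketch on toy vectors, not on the actual word vectors. `ward_linkage` is a made-up helper name, and `no_plot=True` is used so that only the leaf order is computed.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, linkage


def ward_linkage(vectors):
    """Return the SciPy linkage matrix for Ward's method."""
    return linkage(np.asarray(vectors, dtype=float), method="ward")


# Toy data: two well-separated groups of three 4-dimensional points
rng = np.random.default_rng(0)
toy_vectors = np.vstack([
    rng.normal(0.0, 0.1, size=(3, 4)),
    rng.normal(5.0, 0.1, size=(3, 4)),
])
toy_labels = ["a1", "a2", "a3", "b1", "b2", "b3"]

Z = ward_linkage(toy_vectors)
# no_plot=True returns the dendrogram layout without drawing anything
tree = dendrogram(Z, labels=toy_labels, no_plot=True)
print(tree["ivl"])  # leaf labels in dendrogram order
```

Dropping `no_plot=True` and passing the word list as `labels` would draw the tree with one word per leaf, which may be easier to read than cluster ids.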

69. Visualization with t-SNE

Visualize the country-name word vectors in the vector space with t-SNE.
python
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
cdict = {0:"red",1:"blue",2:"green",3:"orange",4:"yellow"}
cs = [cdict[i] for i in pred]
X_reduced = TSNE(n_components=2, random_state=0).fit_transform(contries_vec)

# The colors encode categorical cluster labels, so no colorbar is drawn
plt.scatter(X_reduced[:, 0], X_reduced[:, 1],c=cs)
plt.show()
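Since the colors stand for k-means clusters, a legend is probably more useful than a color scale. Here is a sketch that draws one scatter call per cluster so each gets a legend entry. `scatter_by_cluster` is a made-up helper, and the toy points stand in for the t-SNE output.

```python
import numpy as np
import matplotlib.pyplot as plt


def scatter_by_cluster(points, labels, ax=None):
    """Scatter 2-D points, one color and one legend entry per cluster label."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    if ax is None:
        _, ax = plt.subplots()
    for label in np.unique(labels):
        mask = labels == label
        ax.scatter(points[mask, 0], points[mask, 1], label=f"cluster {label}")
    ax.legend()
    return ax


# Toy 2-D points standing in for the t-SNE result
toy_points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9]]
toy_labels = [0, 0, 1, 1]
ax = scatter_by_cluster(toy_points, toy_labels)
```

In the notebook, this would presumably be called as `scatter_by_cluster(X_reduced, pred)` followed by `plt.show()`.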
