Chatbot: How Google Search Works 1/3, Python, TF-IDF, Cosine Similarity


I. TF-IDF and Cosine Similarity

 

TF-IDF and Cosine Similarity are measures for computing the similarity between texts, and they are widely used in text mining and chatbots.

They may look difficult, but once you grasp the idea behind them, you get a good sense of how everyday language can be handled in a program.

Google is said to have increased its number of users dramatically by using TF-IDF in its search engine.

The following article explains TF-IDF and Cosine Similarity in an easy-to-follow way, using Google Search as an example.

 

Tf-Idf and Cosine similarity

 

You can understand it just by reading the article above, but I will add some supplementary explanations for readers who, like me, are weak at math and new to natural language programming.
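
In short, before going into the details: tf-idf weighs a term by how often it appears in a document (term frequency) and by how rare it is across the whole document set (inverse document frequency), and cosine similarity compares two documents as vectors of those weights. In the common textbook form (the linked article uses the same idea):

tf-idf(t, d) = tf(t, d) × log(N / df(t))
cosine(A, B) = (A · B) / (‖A‖ × ‖B‖)

Here N is the total number of documents and df(t) is the number of documents that contain the term t.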

 

II. The Growth of Google Users

 

Compared with 1998, the average number of searches per day in 2000 was 6,122 times larger, and it has kept growing rapidly ever since.

I plotted it as a simple bar chart in Python.

 

# import library for dataframe and plot
import pandas as pd
import matplotlib.pyplot as plt
# read comma separated data
rawtext = """Year,SearchPerDay
1998,9800
2000,60000000
2007,1200000000
2008,1745000000
2009,2610000000
2010,3627000000
2011,4717000000
2012,5134000000"""
# split lines
rawtext_splitlines = rawtext.splitlines()
# split string with ","
# convert to a dataframe
df1 = pd.DataFrame([sub.split(",") for sub in rawtext_splitlines])
# set the column labels to the values in the 1st row
df1.columns = df1.iloc[0]
# drop the first row
df2 = df1.reindex(df1.index.drop(0))
# convert object to numeric
df2.SearchPerDay = pd.to_numeric(df2.loc[:,'SearchPerDay'], errors='coerce')
print(df2)
# show bar plot to confirm the numeric
# in Jupyter Notebook, uncomment the following magic:
# %matplotlib inline
df2.plot('Year', 'SearchPerDay', title = "Google search per day", kind='bar')
plt.show()
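
One caveat about the chart: the counts span six orders of magnitude (9,800 in 1998 versus over 5 billion in 2012), so on a linear axis the early bars are practically invisible. If you also want to see the growth rate, a logarithmic y-axis helps (logy is a standard pandas plotting option):

# replot with a logarithmic y-axis to make the early years visible
df2.plot('Year', 'SearchPerDay', title="Google search per day (log scale)", kind='bar', logy=True)
plt.show()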

 

 

III. Text Pre-Processing

 

So that the text can be processed, we remove unnecessary punctuation and split it into words.

 

# data
text1 = "The game of life is a game of everlasting learning."
text2 = "The unexamined life is not worth living."
text3 = "Never stop learning."

# cleaning text and lower casing all words
#  text1
for char in '-.,\n':
    text1=text1.replace(char,' ')
text1 = text1.lower()
# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document1 = text1.split()

#  text2
for char in '-.,\n':
    text2=text2.replace(char,' ')
text2 = text2.lower()
# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document2 = text2.split()

#  text3
for char in '-.,\n':
    text3=text3.replace(char,' ')
text3 = text3.lower()

# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document3 = text3.split()

# show cleaned text
print(document1)
print(document2)
print(document3)

from collections import Counter
# tally word frequencies in each document
print(Counter(document1).most_common())
print(Counter(document2).most_common())
print(Counter(document3).most_common())

 

 

Once the text is pre-processed, the collections library makes it easy to tally the word counts, as above.
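
As a preview of the next parts, here is a minimal sketch of my own (not code from the article above) that treats the Counter objects as raw term-count vectors and computes their cosine similarity. Part 2 will replace the raw counts with tf-idf weights.

# minimal sketch: cosine similarity of raw term counts
# (my own shorthand; tf-idf weighting comes in the next part)
import math
from collections import Counter

def cosine_similarity(c1, c2):
    # dot product over the words the two documents share
    dot = sum(c1[w] * c2[w] for w in set(c1) & set(c2))
    # Euclidean lengths of the two count vectors
    norm1 = math.sqrt(sum(v * v for v in c1.values()))
    norm2 = math.sqrt(sum(v * v for v in c2.values()))
    return dot / (norm1 * norm2)

print(cosine_similarity(Counter(document1), Counter(document3)))
# the only shared word is "learning", so the similarity is small but not zero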
