Chatbot: Understanding Google Search 1/3, python, tf-idf, cosine_similarity ver. 4

Last Updated on November 28, 2021 by shibatau

I’m rewriting the post now.

0. Rewriting Now

This post is one I wrote a few years ago. I'm rewriting it now and have recalculated some parts in different ways. The scripts are here:

I. TF-IDF and Cosine Similarity

TF-IDF and cosine similarity are measures for computing how similar texts are, and they are used in text mining and chatbots.


The following article explains TF-IDF and cosine similarity clearly, using Google search as an example.

Tf-Idf and Cosine similarity
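As a minimal sketch of the two ideas in plain Python, using the three sentences that appear later in this post. The helper names (`tf`, `idf`, `cosine`) are mine, and the `1 + ln(N/df)` IDF form is one common convention, not necessarily the exact one the linked article uses:

```python
import math

# the three example sentences used later in this post, already lower-cased
docs = [
    "the game of life is a game of everlasting learning",
    "the unexamined life is not worth living",
    "never stop learning",
]
tokenized = [d.split() for d in docs]
vocab = sorted({w for doc in tokenized for w in doc})

def tf(word, doc):
    # term frequency: occurrences of the word divided by document length
    return doc.count(word) / len(doc)

def idf(word, corpus):
    # inverse document frequency, one common form: 1 + ln(N / df)
    df = sum(1 for doc in corpus if word in doc)
    return 1 + math.log(len(corpus) / df)

# one TF-IDF vector per document, over the shared vocabulary
vectors = [[tf(w, doc) * idf(w, tokenized) for w in vocab] for doc in tokenized]

def cosine(u, v):
    # cosine similarity: dot product divided by the product of the norms
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for i in range(3):
    for j in range(i + 1, 3):
        print(f"doc{i+1} vs doc{j+1}: {cosine(vectors[i], vectors[j]):.3f}")
```

Documents that share rare words score higher than documents that share only common words, which is exactly what a search engine needs when ranking pages against a query.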


II. The Growth of Google Searches


# import libraries for dataframe and plot
import pandas as pd
import matplotlib.pyplot as plt

# comma-separated data (the original data rows were lost when the post
# was extracted; these figures are approximate public estimates, used
# here only so the code runs)
rawtext = """Year,SearchPerDay
1998,10000
2007,1200000000
2012,3500000000"""

# split into lines
rawtext_splitlines = rawtext.splitlines()
# split each line on "," and convert to a dataframe
df1 = pd.DataFrame([sub.split(",") for sub in rawtext_splitlines])
# set the column labels to the values in the first row
df1.columns = df1.iloc[0]
# drop the header row
df2 = df1.drop(0)
# convert the object column to numeric
df2.SearchPerDay = pd.to_numeric(df2.loc[:, 'SearchPerDay'], errors='coerce')

# show a bar plot to confirm the conversion
# (the magic below is only needed in Jupyter Notebook)
%matplotlib inline
df2.plot('Year', 'SearchPerDay', title="Google search per day", kind='bar')
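The manual line and comma splitting above can also be delegated to pandas' own CSV parser by wrapping the string in `io.StringIO`; it handles the header row and numeric conversion in one step. The data values here are the same approximate, illustrative figures, not the post's original numbers:

```python
import io
import pandas as pd

# illustrative, approximate figures (the original rows were lost)
rawtext = """Year,SearchPerDay
1998,10000
2007,1200000000
2012,3500000000"""

# read_csv parses the header and infers numeric dtypes automatically
df = pd.read_csv(io.StringIO(rawtext))
print(df.dtypes)
```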

III. Text Pre-Processing


# data
text1 = "The game of life is a game of everlasting learning."
text2 = "The unexamined life is not worth living."
text3 = "Never stop learning."

# cleaning text and lower casing all words
#  text1
for char in '-.,\n':
    text1=text1.replace(char,' ')
text1 = text1.lower()
# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document1 = text1.split()

#  text2
for char in '-.,\n':
    text2=text2.replace(char,' ')
text2 = text2.lower()
# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document2 = text2.split()

#  text3
for char in '-.,\n':
    text3=text3.replace(char,' ')
text3 = text3.lower()

# split returns a list of words delimited by sequences of whitespace (including tabs, newlines, etc, like re's \s) 
document3 = text3.split()

# show the cleaned texts
print(document1)
print(document2)
print(document3)

# Counter will be used next to count word frequencies per document
from collections import Counter

About shibatau

I was born and grew up in Kyoto. I studied Western philosophy at university and specialized in analytic philosophy, especially Ludwig Wittgenstein, at postgraduate school. I'm interested in new technology, especially machine learning; I have been learning the R language for two years and began learning Python last summer. Listening to Paramore, Sia, Amazarashi and Miyuki Nakajima. Favorite movie I've recently seen: "FREEHELD". Favorite actors and actresses: Anthony Hopkins, Denzel Washington, Ellen Page, Meryl Streep, Mia Wasikowska and Robert De Niro. Favorite books: Fyodor Mikhailovich Dostoyevsky, "The Karamazov Brothers"; Shinran, "Lamentations of Divergences". Favorite phrase: Salvation by Faith. Twitter: @shibatau
