First experiments with NLTK

In [1]:
import nltk
# Run this once to download the NLTK data packages (e.g. 'punkt' for the
# tokenizers and 'stopwords' for the stop word corpus)
# nltk.download()
In [2]:
from nltk.tokenize import sent_tokenize, word_tokenize

EXAMPLE_TEXT = "Hello Mr. Smith, how are you doing today? The weather is great, and Python is awesome. The sky is pinkish-blue. You shouldn't eat cardboard."

print(sent_tokenize(EXAMPLE_TEXT))
print(word_tokenize(EXAMPLE_TEXT))
['Hello Mr. Smith, how are you doing today?', 'The weather is great, and Python is awesome.', 'The sky is pinkish-blue.', "You shouldn't eat cardboard."]
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']

Stop words

Stop words are very common words ("the", "is", "and", ...) that carry little meaning for frequency analysis, so we remove them from the list of tokens.

In [3]:
from nltk.corpus import stopwords
In [4]:
stop_words = set(stopwords.words('english'))
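
To get a feel for what will be removed, we can peek at the list (a quick sketch):

In [ ]:
# Inspect the size of the stop word set and a few entries
print(len(stop_words), sorted(stop_words)[:10])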
In [5]:
word_tokens = word_tokenize(EXAMPLE_TEXT)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
print(word_tokens)
print(filtered_sentence)
['Hello', 'Mr.', 'Smith', ',', 'how', 'are', 'you', 'doing', 'today', '?', 'The', 'weather', 'is', 'great', ',', 'and', 'Python', 'is', 'awesome', '.', 'The', 'sky', 'is', 'pinkish-blue', '.', 'You', 'should', "n't", 'eat', 'cardboard', '.']
['Hello', 'Mr.', 'Smith', ',', 'today', '?', 'weather', 'great', ',', 'Python', 'awesome', '.', 'sky', 'pinkish-blue', '.', "n't", 'eat', 'cardboard', '.']
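
Note that punctuation and the contraction fragment "n't" survive the filter, since they are not stop words. One quick way to drop them (a minimal sketch; the regular expression tokenizer used later is another) is to keep only alphabetic tokens:

In [ ]:
# Keep only purely alphabetic tokens, dropping punctuation and "n't"
print([w for w in filtered_sentence if w.isalpha()])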

Example one - the lyrics of Bob Dylan's "Hurricane"

Let's analyse the lyrics of Bob Dylan's "Hurricane". First we fetch them from his website:

In [6]:
import lxml.html
import requests

# Fetch the song page and pull the text out of the lyrics <div>
# (note: .cssselect() needs the separate cssselect package installed)
response = requests.get('https://bobdylan.com/songs/hurricane/')
etree = lxml.html.fromstring(response.text)
lyrics = etree.cssselect("div[class*='lyrics']")[0].text_content()
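
Before tokenizing, it can be worth a quick sanity check that the scrape worked (a small sketch):

In [ ]:
# Peek at the start of the scraped text
print(lyrics[:200])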

Tokenize the lyrics into words:

In [7]:
word_tokens = word_tokenize(lyrics)

Remove the stop words from the word tokens:

In [8]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]

Create a word frequency graph:

In [9]:
from nltk import FreqDist

# Count how often each token occurs and plot the 30 most frequent
fd = FreqDist(filtered_sentence)
fd.plot(30, cumulative=False)

What we see in the graph above is that punctuation tokens disturb the picture. We now use a regular expression tokenizer to drop those characters and keep only words.

In [10]:
from nltk.tokenize import RegexpTokenizer

# \w+ matches runs of word characters only, so punctuation tokens disappear
tokenizer = RegexpTokenizer(r'\w+')
word_tokens = tokenizer.tokenize(lyrics)
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
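
One trade-off worth knowing: \w+ also splits hyphenated words and contractions, e.g.:

In [ ]:
# Anything that is not a word character acts as a separator
print(tokenizer.tokenize("pinkish-blue shouldn't"))
# ['pinkish', 'blue', 'shouldn', 't']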

Let's create the graph again:

In [11]:
fd = FreqDist(filtered_sentence)
fd.plot(30, cumulative=False)
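
If you prefer numbers to a plot, FreqDist also behaves like a Counter (a minimal sketch):

In [ ]:
# Print the ten most frequent tokens with their counts
print(fd.most_common(10))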

Example two - the Wikipedia article about Nintendo

In [12]:
import wikipedia

# Fetch the plain-text content of the article (needs the wikipedia package)
nintendo = wikipedia.page("Nintendo")
word_tokens = word_tokenize(nintendo.content)
In [13]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
In [14]:
fd = FreqDist(filtered_sentence)
fd.plot(10, cumulative=False)
In [15]:
# As before, punctuation disturbs the graph, so tokenize on word characters only
tokenizer = RegexpTokenizer(r'\w+')
word_tokens = tokenizer.tokenize(nintendo.content)
In [16]:
filtered_sentence = [w for w in word_tokens if w.lower() not in stop_words]
In [17]:
fd = FreqDist(filtered_sentence)
fd.plot(10, cumulative=False)
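
We have now run the same pipeline twice, so it may be worth bundling it into a small helper. A sketch reusing the imports above (the function name is ours):

In [ ]:
def plot_word_frequencies(text, top_n=10):
    """Tokenize on word characters, drop stop words, plot the top_n tokens."""
    tokens = RegexpTokenizer(r'\w+').tokenize(text)
    filtered = [w for w in tokens if w.lower() not in stop_words]
    FreqDist(filtered).plot(top_n, cumulative=False)

plot_word_frequencies(nintendo.content)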