Natural language processing is advancing rapidly alongside the broader shift toward AI. In this article we will go over the Python Natural Language Toolkit (NLTK) and how it can be useful.
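Before running the examples, NLTK's corpora and models have to be downloaded once; a minimal setup sketch (the exact packages you need depend on which examples you run and your NLTK version):
import nltk
# One-time downloads for the resources used in this article
nltk.download('punkt')                       # tokenizers
nltk.download('averaged_perceptron_tagger')  # part-of-speech tagger
nltk.download('wordnet')                     # WordNet data for the lemmatizer
nltk.download('stopwords')                   # stop word lists
nltk.download('maxent_ne_chunker')           # named entity chunker
nltk.download('words')                       # word list used by the chunker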
import nltk
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
print(words)
# Tokenize sentence into sentences
sentences = nltk.sent_tokenize(sentence)
print(sentences)
Output:
['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog', '.']
['The quick brown fox jumps over the lazy dog.']
import nltk
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
# Tag parts of speech
tagged_words = nltk.pos_tag(words)
print(tagged_words)
Output:
[('The', 'DT'), ('quick', 'JJ'), ('brown', 'NN'), ('fox', 'NN'), ('jumps', 'VBZ'), ('over', 'IN'), ('the', 'DT'), ('lazy', 'JJ'), ('dog', 'NN'), ('.', '.')]
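The tags come from the Penn Treebank tag set (DT = determiner, JJ = adjective, NN = noun, VBZ = verb in third person singular, IN = preposition). If a tag is unfamiliar, NLTK can describe it for you; a quick sketch, assuming the tagsets data package has been downloaded:
import nltk
# Look up an unfamiliar Penn Treebank tag
# (requires: nltk.download('tagsets'))
nltk.help.upenn_tagset('JJ')  # prints the description and examples for JJ (adjective)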
If you want a more thorough, step-by-step guide that goes from tagging all the way to detecting entities such as names, you can find a detailed article below; I used it in one of my past projects and I think it breaks the process down very clearly:
How To Extract Human Names Using Python NLTK
Stemming and Lemmatization
import nltk
stemmer = nltk.PorterStemmer()
lemmatizer = nltk.stem.WordNetLemmatizer()
word = "running"
stem_word = stemmer.stem(word)
lemma_word = lemmatizer.lemmatize(word)
print(stem_word)
print(lemma_word)
Output:
run
running
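Note that the lemmatizer treats every word as a noun unless told otherwise, which is why "running" comes back unchanged above; passing the part of speech returns the verb's base form. A quick sketch:
import nltk
lemmatizer = nltk.stem.WordNetLemmatizer()
# By default lemmatize() assumes the word is a noun
print(lemmatizer.lemmatize("running"))           # running
# Treating it as a verb returns the base form
print(lemmatizer.lemmatize("running", pos="v"))  # run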
import nltk
sentence = "The quick brown fox jumps over the lazy dog."
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
# Remove stop words
stop_words = set(nltk.corpus.stopwords.words('english'))
filtered_words = [word for word in words if word.casefold() not in stop_words]
print(filtered_words)
Output:
['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog', '.']
import nltk
sentence = "Barack Obama was born in Hawaii on August 4, 1961."
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
# Tag parts of speech
tagged_words = nltk.pos_tag(words)
# Recognize named entities
named_entities = nltk.ne_chunk(tagged_words)
print(named_entities)
Output:
(S
  (PERSON Barack/NNP)
  (PERSON Obama/NNP)
  was/VBD
  born/VBN
  in/IN
  (GPE Hawaii/NNP)
  on/IN
  (DATE August/NNP 4/CD ,/, 1961/CD ./.))
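Building on the chunked tree above, you can walk its subtrees to pull out just the PERSON entities, in the spirit of the name-extraction article linked earlier. A minimal sketch using the same sentence:
import nltk
sentence = "Barack Obama was born in Hawaii on August 4, 1961."
tree = nltk.ne_chunk(nltk.pos_tag(nltk.word_tokenize(sentence)))
# Collect the tokens inside every subtree labelled PERSON
names = [" ".join(token for token, tag in subtree.leaves())
         for subtree in tree.subtrees()
         if subtree.label() == "PERSON"]
print(names)  # with the default (non-binary) chunker this yields ['Barack', 'Obama']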
import nltk
sentence = "I really enjoyed watching that movie!"
# Tokenize sentence into words
words = nltk.word_tokenize(sentence)
# Build a bag-of-words feature dictionary
features = {word: True for word in words}
# Train a NaiveBayes classifier on a single labelled example
# (a toy illustration; a real model would be trained on a labelled corpus such as movie_reviews)
classifier = nltk.classify.NaiveBayesClassifier.train([(features, 'pos')])
# Classify the sentence's sentiment
print(classifier.classify(features))
Output:
pos
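Because the toy classifier above is trained only on the sentence it then classifies, it will always answer 'pos'. For a more practical option, NLTK ships the rule-based VADER analyzer; a minimal sketch, assuming the vader_lexicon package has been downloaded:
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer
# VADER is a lexicon- and rule-based sentiment model bundled with NLTK
# (requires: nltk.download('vader_lexicon'))
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("I really enjoyed watching that movie!"))
# Returns a dict with 'neg', 'neu', 'pos' and an overall 'compound' score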
import nltk
sentence = "Don't forget to bring your sister's book!"
# Normalize sentence
words = nltk.word_tokenize(sentence.lower())
# Keep alphabetic tokens, drop stop words, and stem what is left
stop_words = set(nltk.corpus.stopwords.words('english'))
stemmer = nltk.PorterStemmer()
words = [stemmer.stem(w) for w in words if w.isalpha() and w not in stop_words]
print(words)
Output:
['forget', 'bring', 'sister', 'book']
import nltk
import matplotlib.pyplot as plt
from wordcloud import WordCloud
text = "The quick brown fox jumps over the lazy dog. The lazy dog barks at the quick brown fox."
# Tokenize text into words
words = nltk.word_tokenize(text)
# Count frequency of each word
freq_dist = nltk.FreqDist(words)
# Generate histogram of top 10 words
print(freq_dist.most_common(10))
freq_dist.plot(10)
plt.show()
# Generate word cloud of top 10 words
wc = WordCloud(width=800, height=400, background_color='white').generate_from_frequencies(freq_dist)
plt.imshow(wc, interpolation='bilinear')
plt.axis('off')
plt.show()
Output:
[('The', 2), ('quick', 2), ('brown', 2), ('fox', 2), ('the', 2), ('lazy', 2), ('dog', 2), ('.', 2), ('jumps', 1), ('over', 1)]
# NLTK does not include a ready-made translation service, so this example uses
# the third-party googletrans package instead (its API can vary between versions)
from googletrans import Translator
text = "The quick brown fox jumps over the lazy dog."
# Translate text from English to Spanish
translator = Translator()
translated_text = translator.translate(text, src='en', dest='es').text
print(translated_text)
Output:
El rápido zorro marrón salta sobre el perro perezoso.
import nltk
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
documents = ["The quick brown fox jumps over the lazy dog.",
             "A fast brown dog runs across the road."]
# Compute TF-IDF vectors
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(documents)
# Compute pairwise cosine similarities
similarities = cosine_similarity(tfidf_matrix)
print(similarities)
Output:
[[1. 0.05427685]
[0.05427685 1. ]]
The output is a symmetric matrix: the diagonal entries are each document's similarity with itself (always 1), and the off-diagonal entries give the cosine similarity between the two documents; the low score indicates they are not very similar.
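The same setup scales beyond two documents. As an illustration, here is a minimal sketch (the third document and the query string are made up for this example) of finding which document is most similar to a query by reusing the fitted vectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

documents = ["The quick brown fox jumps over the lazy dog.",
             "A fast brown dog runs across the road.",
             "Stock prices rose sharply after the announcement."]  # hypothetical third document
query = "A brown fox ran across the road."                         # hypothetical query

vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(documents)
# Vectorize the query with the vocabulary learned from the documents
query_vector = vectorizer.transform([query])
# Rank the documents by cosine similarity to the query
scores = cosine_similarity(query_vector, doc_matrix)[0]
best = scores.argmax()
print(best, documents[best])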