URL: https://en.wikipedia.org/wiki/Natural_language_processing
Task 1: Text Summarization with Word Frequencies
1.1 Use the web scraping technique with BeautifulSoup, as shown in class, to get the text data from the specified location on the Wikipedia webpage. Hint: see the web scraping code snippets in the lecture slides.
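For example, a minimal sketch of the scraping step (assuming the requests library is used to fetch the page and that the article body sits in the page's <p> tags) could look like:

import requests
from bs4 import BeautifulSoup

# Fetch the Wikipedia page and pull the raw text out of the paragraph tags.
url = 'https://en.wikipedia.org/wiki/Natural_language_processing'
html = requests.get(url).text
soup = BeautifulSoup(html, 'html.parser')
article_text = ' '.join(p.get_text() for p in soup.find_all('p'))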
1.2 Preprocess the text data, including word tokenization, stop-word and punctuation removal, etc.
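One possible preprocessing sketch (assuming the NLTK punkt and stopwords resources are downloaded; the helper name preprocess is illustrative):

import string
import nltk
from nltk.corpus import stopwords

nltk.download('punkt')
nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

def preprocess(text):
    # Lowercase, tokenize, and drop stop words and punctuation tokens.
    tokens = nltk.word_tokenize(text.lower())
    return [t for t in tokens if t not in stop_words and t not in string.punctuation]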
1.3 Calculate word frequencies or weighted word frequencies. The NLTK FreqDist() function can be used to get original word frequencies.
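For instance, reusing article_text and the preprocess() helper sketched above, FreqDist() gives the raw counts, which can optionally be normalized into weighted frequencies:

from nltk import FreqDist

word_freqs = FreqDist(preprocess(article_text))
print(word_freqs.most_common(10))  # the 10 most frequent tokens

# One common weighting (an assumption, not a requirement): divide by the largest count.
max_count = word_freqs.most_common(1)[0][1]
weighted_freqs = {word: count / max_count for word, count in word_freqs.items()}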
1.4 Calculate the sentence scores by summing up the word (term) frequencies for each sentence after preprocessing. You can use different approaches.
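One way to score sentences (a sketch only; the function name and the choice to sum weighted frequencies are assumptions, and other approaches are acceptable):

import nltk

def calculate_sentence_scores(sent_tokens, word_freqs):
    # Sum the (weighted) frequency of every preprocessed token in each sentence.
    scores = {}
    for sent in sent_tokens:
        for token in preprocess(sent):
            if token in word_freqs:
                scores[sent] = scores.get(sent, 0) + word_freqs[token]
    return scores

sent_tokens = nltk.sent_tokenize(article_text)
sentence_scores = calculate_sentence_scores(sent_tokens, weighted_freqs)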
1.5 Rank the sentences based on their sentence scores.
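The ranking itself can be as simple as sorting the score dictionary, e.g. (N = 7 is just an example value):

import heapq

# Take the N highest-scoring sentences.
top_sentences = heapq.nlargest(7, sentence_scores, key=sentence_scores.get)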
1.6 Build a webpage summary based on the N top-scoring sentences. Then create a new summary by restricting the vocabulary of considered tokens, either by: 1) only including the K most frequent tokens within the document, or 2) only including tokens that occur in at least K sentences.
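A sketch of this step using restriction option 1) (the helper name, K = 50, and N = 7 are illustrative assumptions; it reuses the scoring function from 1.4):

import heapq

def summarize(sent_tokens, word_freqs, n_top):
    scores = calculate_sentence_scores(sent_tokens, word_freqs)
    best = heapq.nlargest(n_top, scores, key=scores.get)
    # Preserve the original sentence order in the final summary.
    return ' '.join(s for s in sent_tokens if s in best)

full_summary = summarize(sent_tokens, weighted_freqs, 7)

# Restricted vocabulary: keep only the K most frequent tokens (option 1).
K = 50
restricted_freqs = dict(word_freqs.most_common(K))
restricted_summary = summarize(sent_tokens, restricted_freqs, 7)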
Task 2: Text Summarization with N-grams
2.1 To generate N-grams, use NLTK:

import nltk
from nltk.util import ngrams

def generate_ngrams(text, n):
    n_grams = ngrams(nltk.word_tokenize(text.lower()), n)
    return [' '.join(grams) for grams in n_grams]
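For example (output shown as a comment):

print(generate_ngrams('Natural language processing is fun', 2))
# ['natural language', 'language processing', 'processing is', 'is fun']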
2.2 Write the code for text summarization with any N-grams. Note that we will check your program using at least two different values of n, e.g., n = 2, 3, or 4. Hint (0.5 points): NLTK can be used to get the N-grams and FreqDist() to calculate the n-gram frequencies.
2.3 Find the weighted n-gram frequencies. You can use a function similar to the one from Task 1.
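For example, a sketch of this step reusing generate_ngrams from 2.1 and article_text from Task 1 (normalizing by the maximum count mirrors the Task 1 weighting and is an assumption):

from nltk import FreqDist

ngram_freqs = FreqDist(generate_ngrams(article_text, 3))

# Weighted frequencies: divide each count by the largest count.
max_count = ngram_freqs.most_common(1)[0][1]
weighted_ngram_freqs = {gram: count / max_count for gram, count in ngram_freqs.items()}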
2.4 Define a function such as calculate_sentence_scores_ngram(sent_tokens, ngram_freqs, n_grams) to calculate the sentence scores for any N-grams. This function is similar to the one in Task 1.
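A possible sketch of this function, following the suggested signature (summing the per-sentence n-gram frequencies is one of several reasonable choices; here n_grams is the value of n):

def calculate_sentence_scores_ngram(sent_tokens, ngram_freqs, n_grams):
    # Score each sentence by the total frequency of the n-grams it contains.
    scores = {}
    for sent in sent_tokens:
        for gram in generate_ngrams(sent, n_grams):
            if gram in ngram_freqs:
                scores[sent] = scores.get(sent, 0) + ngram_freqs[gram]
    return scores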
2.5 Generate new document summaries, similar to Task 1.6, using the new n-gram scoring function with n = 3 (i.e., trigrams). As in Task 1.6, experiment with using the full vocabulary and a restricted vocabulary.
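For example, reusing the helpers sketched earlier (all names and the values K = 50 and N = 7 are illustrative):

import heapq

# Full-vocabulary trigram summary.
trigram_scores = calculate_sentence_scores_ngram(sent_tokens, weighted_ngram_freqs, 3)
trigram_summary = ' '.join(heapq.nlargest(7, trigram_scores, key=trigram_scores.get))

# Restricted vocabulary: keep only the K most frequent trigrams.
K = 50
restricted_ngram_freqs = dict(ngram_freqs.most_common(K))
restricted_scores = calculate_sentence_scores_ngram(sent_tokens, restricted_ngram_freqs, 3)
restricted_trigram_summary = ' '.join(heapq.nlargest(7, restricted_scores, key=restricted_scores.get))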