Posts & Comments on Euro 2020 by Redditors

NLP: Extracting the main topics using Latent Dirichlet Allocation (LDA)

Jim Meng Kok
Towards Data Science

--

It’s been almost two months since the Euro 2020 final between Italy and England ended at Wembley Stadium, London. The football fever is slowly declining, but the recent 2022 FIFA World Cup qualification matches drew my attention and interest back to what people have been saying about Euro 2020. Hence, I leveraged a Kaggle dataset containing posts and comments in which Redditors discussed Euro 2020.

The resources for this mini exercise can be found on my GitHub, which includes the dataset and the Python notebook file.

What’s LDA?

LDA is a topic modelling technique that assumes each document is produced from a mixture of topics, and that each topic generates words according to its probability distribution. The model therefore learns both a topic distribution per document and a word distribution per topic. For example, a Reddit post might be modelled as a mix of a “final match” topic and a “tickets” topic, with each of its words drawn from one of those topics.

Data Preprocessing

The data were pre-processed with the following steps, in order (a minimal code sketch of these steps follows below):

  • Handling empty data: The title and body columns are the only columns containing text, but many entries in the body column are empty. These empty entries were filled with the placeholder “NaN”, to be removed later. A new column, text, was then formed by concatenating the title and body columns.
  • Lowercase conversion: All text in the text column was converted to lowercase, which is beneficial for vectorisation.
  • Punctuation removal: Punctuation was removed using regular expressions (regex).
  • Number removal: Numbers were removed using regex.
  • Stop-word removal: Stop-words were removed using the NLTK stop-word list, extended with common irrelevant words such as “comment” and “nan” in a custom-built list.
  • Lemmatisation: Words were normalised to their base forms, which helps preserve the meaning of the text for vectorisation.
  • Tokenisation: The text was split into sentences and the sentences into words.
NLTK’s stop-words, Image by GeeksforGeeks
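Below is a minimal sketch of these pre-processing steps. The column and helper names (docs, preprocess, words_in_docs) are hypothetical stand-ins that mirror the description above, not the exact notebook code:

import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# Required NLTK data (download once)
nltk.download('stopwords'); nltk.download('punkt'); nltk.download('wordnet')

# NLTK's English stop-words plus custom irrelevant words
stop_words = set(stopwords.words('english')) | {'comment', 'nan'}
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    text = text.lower()                       # lowercase conversion
    text = re.sub(r'[^\w\s]', ' ', text)      # punctuation removal
    text = re.sub(r'\d+', ' ', text)          # number removal
    tokens = word_tokenize(text)              # tokenisation
    return [lemmatizer.lemmatize(t) for t in tokens if t not in stop_words]

# Hypothetical example documents standing in for the concatenated title + body text
docs = ["Italy won the Euro 2020 final at Wembley!", "England fans queued for tickets."]
words_in_docs = [preprocess(doc) for doc in docs]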

Before moving on to topic modelling with LDA, vectorisation is applied to the tokenised and lemmatised words to build both the dictionary and the corpus (bag of words).

import gensim
# Build the dictionary (token -> id) and the bag-of-words corpus (token id -> count per document)
dictionary = gensim.corpora.Dictionary(words_in_docs)
bow = [dictionary.doc2bow(doc) for doc in words_in_docs]

In the bag of words (bow), Gensim assigns a unique identifier to each word and records how often it appears in each document.
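For instance, a quick check (not part of the original notebook) shows the (token id, count) pairs of the first document and the token behind a given id:

# Peek at the bag-of-words encoding of the first document
print(bow[0][:5])      # e.g. [(0, 1), (1, 1), (2, 2), ...]
print(dictionary[0])   # the word that id 0 refers to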

Let the LDA Begin!

To kick-start the LDA process, the number of topics in the dataset must be specified. I therefore searched between a minimum of 4 topics and a maximum of 24 topics, comparing the coherence score of each candidate model.
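The search code itself is not shown here, but a minimal sketch of it, reusing the words_in_docs, bow, and dictionary objects from above (and assuming the same passes and workers settings as the final model), could look like this:

from gensim.models import LdaMulticore, CoherenceModel

topic_counts = list(range(4, 25, 2))
coherence_scores = []

# Train one candidate model per topic count and record its c_v coherence score
for num_topics in topic_counts:
    candidate = LdaMulticore(bow, num_topics=num_topics, id2word=dictionary,
                             passes=9, workers=3)
    cm = CoherenceModel(model=candidate, texts=words_in_docs,
                        dictionary=dictionary, coherence='c_v')
    score = cm.get_coherence()
    coherence_scores.append(score)
    print('# Topics:', num_topics, 'Score:', score)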

# Topics: 4 Score: 0.637187704946018
# Topics: 6 Score: 0.5656505764313163
# Topics: 8 Score: 0.5608577918583089
# Topics: 10 Score: 0.5285639454916335
# Topics: 12 Score: 0.6549002572803391
# Topics: 14 Score: 0.5805171708843707
# Topics: 16 Score: 0.619577703739399
# Topics: 18 Score: 0.5787737269759226
# Topics: 20 Score: 0.5799681660889682
# Topics: 22 Score: 0.6062730130523755
# Topics: 24 Score: 0.5941403131806395
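The coherence score chart below can be reproduced with a short matplotlib sketch, reusing the topic_counts and coherence_scores lists from the loop above:

import matplotlib.pyplot as plt

# Plot coherence score against the number of topics
plt.plot(topic_counts, coherence_scores, marker='o')
plt.xlabel('Number of topics')
plt.ylabel('Coherence score (c_v)')
plt.title('Coherence Score Chart')
plt.show()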
Coherence Score Chart, Image by Author

Based on the above results and the visualisation of the coherence scores, the optimal number of topics is 12, as it yields the highest coherence score.

Starting with 12 unique topics for the LDA to process, the number of passes over the corpus is set to 9 (with 75% of the dataset used for training).

# Train the final LDA model on the bag-of-words corpus
lda_model = gensim.models.LdaMulticore(bow,
                                       num_topics=12,
                                       id2word=dictionary,
                                       passes=9,
                                       workers=3)

Interpretation of the Results

A good model has low perplexity and high topic coherence. The results below show that the built model fulfils both criteria.
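A brief sketch of how these two metrics can be computed with Gensim, reusing the objects defined above:

from gensim.models import CoherenceModel

# Log perplexity of the corpus under the trained model (lower is better)
print('Perplexity: ', lda_model.log_perplexity(bow))

# c_v topic coherence (higher is better)
coherence_model = CoherenceModel(model=lda_model, texts=words_in_docs,
                                 dictionary=dictionary, coherence='c_v')
print('Coherence Score: ', coherence_model.get_coherence())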

Perplexity:  -7.724412623387592
Coherence Score:  0.6073296144040131

The output from the model shows the 12 unique topics, each characterised by its top words. As LDA does not provide a theme for each topic, inferring one is a subjective exercise. The following is the output together with my inference of the theme that each topic represents.
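The raw topic-word output (before adding my inferred themes) can be obtained along the lines of this sketch:

# List the ten most probable words for each of the 12 topics
for idx, topic in lda_model.print_topics(num_topics=12, num_words=10):
    print('Topic {}: {}'.format(idx, topic))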

Topic 0: Italy vs Spain
Words: "italy", "team", "group", "think", "spain", "game", "match", "final", "played", "goal"

Topic 1: Italy would score goals against England in the final
Words: "team", "euro", "goal", "football", "player", "game", "im", "italy", "would", "england"

Topic 2: England's controversial penalty against Denmark
Words: "england", "right", "penalty", "sure", "see", "well", "im", "football", "game", "player"

Topic 3: Don't underestimate England's performance
Words: "fan", "people", "english", "player", "dont", "country", "team", "one", "know", "football"

Topic 4: England would win against Italy in the final
Words: "england", "game", "like", "win", "team", "italy", "dont", "home", "final", "english"

Topic 5: World questioning England fans' behaviour
Words: "please", "fan", "england", "use", "cup", "question", "world", "contact", "im", "action"

Topic 6: Spain's and England's performances during the semi-finals
Words: "substitution", "match", "shot", "card", "yellow", "scored", "goal", "spain", "england", "thread"

Topic 7: Belgium's performance against its opponents during the tournament
Words: "denmark", "belgium", "v", "finland", "goal", "game", "hazard", "player", "portugal", "bruyne"

Topic 8: Italians don't like the idea of "Football's Coming Home"
Words: "fan", "england", "team", "home", "coming", "english", "dont", "like", "italian", "match"

Topic 9: No red cards for England's fouls throughout the tournament
Words: "england", "red", "card", "didnt", "foul", "would", "even", "im", "ref", "power"

Topic 10: Group A's final matches before Round of 16 started
Words: "switzerland", "italy", "substitution", "match", "card", "shot", "yellow", "turkey", "wale", "goal"

Topic 11: Getting the UEFA Euro 2020 Final tickets
Words: "ticket", "game", "england", "euro", "get", "final", "would", "uefa", "time", "player"

Conclusion

Based on the above analysis, we can conclude that Reddit users were interested in the following aspects of Euro 2020:

  • The final between Italy and England
  • The England team and its performance
  • Controversial news that revolves around the England team and its fans
  • Matches that involve England, Italy, Spain, and Belgium
  • Belgian players — Kevin de Bruyne and Eden Hazard (or maybe even Thorgan Hazard who’s the latter’s brother!)

References

Bansal, S. (2016, August 24). Beginners Guide to Topic Modeling in Python. Analytics Vidhya. Retrieved from https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/

Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet Allocation. Journal of Machine Learning Research, 3, 993–1022. Retrieved from http://www.jmlr.org/papers/volume3/blei03a/blei03a.pdf

GeeksforGeeks. (2020, November 24). Removing stop words with NLTK in Python. Retrieved from https://www.geeksforgeeks.org/removing-stop-words-nltk-python/

--

Diligent Learner, Data Analyst, Analytics Enthusiast, Movie Buff, Content Creator. Please reach out to me on LinkedIn: https://www.linkedin.com/in/jimmengkok/