Blog

Re-Imagining Topic Modeling in NLP: A Break from Conventional Approach

I recently spoke about Contextual Topic Modeling in NLP at Google’s La Kopi event for developers. The feedback I received made the talk a special one. Many folks reached out to say that they found the topic, content and technique quite intriguing, and that it helped them approach Topic Modeling in NLP from a different angle. So I decided to post the talk here, and I have also added the transcript below.

Hi everyone, I’ll start the talk with a general theory. To tackle problems in our professional space, I’m sure we are all used to adopting the approach that has been in use for a long time: an industry standard or a known formula. Even in data science there are ample predefined models and algorithms for so many use cases that it’s not surprising if somebody has already done what you want to implement. Still, the results sometimes don’t make sense to you, or you’re just not happy about the quality of the insights. This happens a lot when you’re building data-driven solutions, and a similar thing happened to me. The problem statement was that the huge text corpus I had with me required me to extract themes or topics out of it. What eventually happened is what this talk is all about: I took a completely uncharted path for a problem statement such as topic modeling, which, by the way, is pretty standard and has established models all across the internet. The result not only impressed me (although I’m probably building a lot of hype right on the first slide), it also impressed the stakeholders, because they got a whole new level of dimensions in the insights, which regular topic modeling fails to deliver. So let’s dig in.

A little bit about me: I work in Singapore as a machine learning engineer. I previously held data science roles in the ad tech domain. I did my master’s in data science, with a specialization in consumer analytics, at the National University of Singapore. I spent about three years prior to my master’s working for early- to mid-stage startups and wore multiple hats in product development and analytics. As a part of giving back to the community, I mentor and guide folks who are aspiring to and applying for master’s studies; I mentor them with their essays, interview prep and overall application. One fun fact about me: I am part of an imaginary tea cult. I don’t have blood, only tea. Let me know if you want to join my tea cult.

Okay, so this is how the outline of the talk is going to look. We will start with what topic modeling is, then discuss the conventional approach. Next, I’ll introduce you to the new approach that addresses the flaws of the conventional method. Towards the end, I’ll give you a few pointers to further enhance the capabilities of the new approach.

Topic modeling is a technique in the field of text mining. As the name suggests, it’s a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus, thus assisting in better decision making. I’m sure many of you are aware of the relevance of topic modeling, but some of the use cases are tagging customer support tickets according to topic, routing specific customer complaints to the relevant teams, and so much more. We’ll also talk about one particular use case in the subsequent slides.

Okay, so first we’ll talk very briefly about the conventional approach. There are many approaches for obtaining topics from a text, but LDA, Latent Dirichlet Allocation, is one of the most popular topic modeling techniques. A very simple, layman’s explanation: “Latent” refers to everything that we don’t know a priori and is hidden in the data. Here, the themes or topics that a document consists of are unknown. “Dirichlet” is a distribution of distributions. In the context of topic modeling, the Dirichlet is the distribution of topics in documents and the distribution of words in topics. “Allocation” means that once we have the Dirichlet, we allocate topics to the documents and the words of the document to topics. That’s it; that’s what LDA is in a nutshell.

I won’t bore you with a lot of technical terminology, but obviously there are some flaws in LDA, and that’s why I’m presenting, right? First, you have to decide the number of topics you want from the text before training the model. That’s quite weird, right? How do you know how many topics will be able to do justice to all the content present in the text? Then, LDA can’t capture correlations. By correlation I mean: let’s say I’m searching for a particular topic and I come across a topic that makes me want to read that article; I would obviously be interested in another highly correlated topic that I may not know about. Next, there’s little to no context in the topics; you have to infer the context all on your own. And LDA uses a bag-of-words approach, which hampers capturing context between the words. You will learn a lot more about these flaws later, when I show you a couple of examples.

Okay, so now we have gone through what topic modeling is, what LDA is, and what some of its flaws are. One flaw that we discussed was the lack of context in the topics turned out by LDA. So let’s look at contextual topic modeling. First, I’ll help you visualize what a contextual topic is. Imagine a scenario where you know the prominent themes present in the text that you are analyzing, such as “Saturday” or “kids” in, let’s say, restaurant reviews. Now visualize what people are writing about or around those themes, such as “Saturday brunch”, where the actual review could be “I want to have brunch on Saturday with my family”. Similarly, “great kids menu”, where the actual review could be “the restaurant had a great kids menu”. That’s your topic. So the point is: the overarching themes, which in this example are “Saturday” and “kids”, plus the most probable phrase that occurs in the context of the theme, in this case “Saturday brunch” or “great kids menu”, together constitute the contextual topics.

The themes “Saturday” and “kids” that you saw in the last slide were possible because of something called named entity recognition (NER). It locates and classifies an entity into predefined categories such as person, product or organization. Then come the phrases like “Saturday brunch” and “kids menu available”, which are called collocations. These are phrases or expressions that are highly likely to co-occur. These two components constitute the essence of a new way to approach topic modeling.

I’m going to quickly run through the data pipeline that I used. For the example, I used TripAdvisor reviews. For EDA, I just used general off-the-shelf libraries. For the models, I used spaCy for its language model and DeepPavlov, a deep learning library, for performing named entity recognition. It’s a library in which you can do a lot of NLP tasks, but I used it for NER. Then I used NLTK for collocations. The data set was Sentosa attraction reviews on TripAdvisor; there were about 900,000 words in about 27,000 reviews. I performed NER on the data, which led to the discovery of the entities within the text. The table that you see in the right column of the slide is the NER table; the entities extracted by NER are represented in that table, like person, date, day and organization. For example, if I perform NER on the sentence “ordered cake for my wife on our anniversary at Sentosa”, NER will be able to identify “cake” as a product entity, “wife” as a person, “Sentosa” as a location and “anniversary” as an event. Cool, isn’t it? The next step was to implement collocations and extract the two to three words surrounding each entity. That’s how we got topics or phrases like “kids menu available”, “celebrate girlfriend birthdays”, etc. We’ll look at those examples in the upcoming slides.

So what you see here is a tree map visualization. One thing to note about this kind of visualization is that the darker the color, the more prominent you can expect the topic in that box to be in the text. This particular tree map is made only out of the themes present in the day entity and the collocations present around the day word.

Okay, now I’ll explain the power of this technique and how contextual topic modeling helped my stakeholders make better decisions. We see that there’s a lot more conversation around Mother’s Day in the reviews, even more than around Valentine’s Day. This allows digital marketers to target and run promotions and campaigns on Mother’s Day as well: an insight that might not have been known to the marketing team, but is now a valuable one, all because contextual topic modeling could dig out what people are writing around the day theme. It’s highly likely that for Valentine’s Day only a couple might visit the restaurant, but for a Mother’s Day celebration a whole family might visit, thus leading to more revenue per marketing dollar. So you see the power of the kind of insight that contextual topic modeling brings to the table.

Another example: there’s a lot of chatter around weekend mornings, Sunday morning being the most talked about. Ironically, “avoid mornings” also stood out. This implies that people who visit the place on weekend mornings and find a lot of rush also write about avoiding morning hours. This kind of contextual topic modeling also helps you understand when your audience visits your place and avails of your services. It’s a great insight to have for your campaigns and for tapping into the behavior of your customers and audience.

Another great insight is knowing who your customers are. Birthday celebrations by a girlfriend or boyfriend indicate a younger demographic. This can help you frame your target audience and plan promotional campaigns accordingly.

For the same data set, five topics were extracted using LDA, as shown. As you go through the topics, you will see that it’s up to you how you want to make sense of them and stitch your story together when you use LDA. That’s not the case with the technique that I demonstrated. We saw a lot of evident differences between the topics found via LDA and the tree map visualization showing the contextual topics. A few things are pretty clear. The contextual topics, even though they have fewer words than the ones given by LDA, are much more meaningful and interpretable. The variety of topics is far greater in the case of NER and collocations. The example you saw in the previous slides was just the day theme; we could pull out several themes and find topics around those. You would have observed that I mentioned interpretable topics. Interpretability is the thing that your stakeholders and your business would love.

But not everything is rosy. There’s one problem, though it’s not so much a problem as a tedious task: one has to cherry-pick the entities of interest from the NER exercise. For example, the person entity may have several themes, such as wife, husband, boyfriend and girlfriend, and the day entity might have all the days of the week. The way to tackle this tedious task is to keep discussing with your stakeholders: what do they want to know, who is their customer base, do they want to know whether people come to their place on a weekday afternoon or evening? Once you know this kind of information from the problem statement, it becomes a lot more convenient to comb through the entities.

One possible enhancement to this technique is performing sentiment analysis. For example, knowing whether people express negative or positive sentiment around a topic like “watch divers feeding marine life” can change the narrative of the marketing campaign. So, from behavioral insights like who your customers are, how many of them visit or avail of your services, what they are interested in, what they buy and what activities they engage with, it can help you understand both your offline and online audiences.

This wraps up my presentation. I hope you took away something meaningful, and something to explore beyond the traditional approaches to topic modeling. You can find actual code samples on my blog, and you can reach out to me over LinkedIn or Twitter; DM me, and feel free to shoot questions, connect, or share your thoughts. I would be glad to know your takeaways from this presentation. Thank you.
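To make the idea from the talk concrete, here is a minimal sketch of how theme words and the phrases around them combine into contextual topics. It is only an illustration under loud assumptions: the reviews are made up, the themes are hard-coded instead of coming from an NER model, and a plain frequency count stands in for the collocation scoring done with NLTK. The function name `contextual_topics` is hypothetical.

```python
from collections import Counter

# Made-up mini-corpus standing in for the TripAdvisor reviews.
reviews = [
    "we had saturday brunch with the whole family",
    "saturday brunch was lovely but crowded",
    "great kids menu and friendly staff",
    "the kids menu had plenty of choice",
    "avoid saturday evening if you dislike queues",
]

# Assumption: in the talk these themes come from NER; here they are hard-coded.
themes = {"saturday", "kids"}


def contextual_topics(docs, themes, window=1, min_count=2):
    """Count phrases made of a theme word plus up to `window` following words."""
    counts = {theme: Counter() for theme in themes}
    for doc in docs:
        tokens = doc.lower().split()
        for i, token in enumerate(tokens):
            if token not in themes:
                continue
            for width in range(1, window + 1):
                phrase = tokens[i : i + 1 + width]
                # Skip phrases cut short by the end of the review.
                if len(phrase) == width + 1:
                    counts[token][" ".join(phrase)] += 1
    # Keep only phrases that occur at least `min_count` times.
    return {
        theme: [(p, c) for p, c in counter.most_common() if c >= min_count]
        for theme, counter in counts.items()
    }


topics = contextual_topics(reviews, themes)
for theme, phrases in sorted(topics.items()):
    print(theme, phrases)
```

Each theme ends up with its own short, readable phrases, which is the interpretability argument made in the talk. A real pipeline would swap the hard-coded themes for NER output and the raw counts for an association measure.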

Do check out the other posts that I have written related to software engineering and data science. Please consider subscribing to my blog and feel absolutely free to reach out. I am also mentoring in the areas of Data Science, Career and Life in Singapore. You can book an appointment to talk to me.

Book an appointment with Shubhanshu Gupta using SetMore

Become a Story Telling Ninja: Present Data Science Models to Stakeholders

A Data Science Ninja

Over the last few years, I have presented a lot of projects involving Data Science models, Natural Language Processing to be more precise, to various stakeholders and leadership teams. While it’s super important to convey your technicalities, results and all the hard work you have put in building the Data Science models, visualizations, etc., what’s more important is how you convey those things! In this blog, we will talk about the art of story telling!

Continue reading “Become a Story Telling Ninja: Present Data Science Models to Stakeholders”

Evaluating Effectiveness of Mentorship: Open Sourcing My Framework

A while ago, I wrote about opening up my calendar for mentorship. Soon, quite a few people talked to me about career switches, interviews for university admissions, life and career opportunities in Singapore, etc. I eventually asked all of them to score my mentoring. While the conversations are definitely very subjective, I decided that the scoring could be objective. I have decided to open source the evaluation criteria that I use for gauging how effective I am as a mentor. I will also publish, and be very transparent about, the scores that I receive: where I am lagging and where I am doing well.

Disclaimer: Even though I realize that feedback is very important, it’s a voluntary and optional exercise for the mentees, and I do not nudge anyone repeatedly to fill in my evaluation. Thus, I will keep updating the chart as and when I get more responses.

You will find the evaluation criteria structured in the following way:

  1. Whether I answered all the questions satisfactorily, to-the-point, and in a structured way?
  2. Questions around:
    • Ease of approach
    • Content Expertise
    • Clear and comprehensive speaking
    • Any bias or prejudice while mentoring
    • Professional integrity
  3. Did I refer to any 3rd party material or a subject-matter expert?
  4. Whether I am worthy of being referred?
  5. Overall score and my ROTI (KPI)
  6. Subjective Feedback
Continue reading “Evaluating Effectiveness of Mentorship: Open Sourcing My Framework”

Become a Web Analytics Ninja: Analyze Bounce Rate Across Different Visitor Segments

Web Analytics

This post is a digression from the other data science blogs that I have written in the past and, more so, from the work that I do in my day-to-day job. I don’t mean digression with a negative connotation: I enjoyed it and learnt so much that I implemented many of the strategies on my own website. In this post, I will discuss how I did a deep dive on a one-line problem statement from my client: “The bounce rate has gone up since last few months from what it was before, Why?” That may seem trivial to investigate and analyze, but the lack of detail and granularity made the problem statement very broad and open-ended. A lack of clarity also makes it easy to hit a roadblock very early in the process, especially when you don’t know where to start. फ़िक्र न करें (Fear not)! You will see a structured way to approach this kind of problem statement.

Continue reading “Become a Web Analytics Ninja: Analyze Bounce Rate Across Different Visitor Segments”

Speed Up Pandas Dataframe Apply Function to Create a New Column

Pandas Library

Data cleaning is an essential step in preparing your data for analysis. While cleaning the data, every now and then there’s a need to create a new column in the Pandas dataframe, usually derived by a function that manipulates an existing column. One way to achieve that is by using the apply function. I want to address a couple of bottlenecks here:

  • Pandas: The Pandas library runs on a single thread and it doesn’t parallelize the task. Thus, if you are doing lots of computation or data manipulation on your Pandas dataframe, it can be pretty slow and can quickly become a bottleneck.
  • Apply(): The Pandas apply() function is slow! It does not take advantage of vectorization and acts as just another loop. It returns a new Series or dataframe object, which carries significant overhead.

So now you may ask: what to do and what to use? I am going to share 4 techniques that are alternatives to the apply function and that improve the performance of operations on a Pandas dataframe.
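As a rough illustration of the general idea (not necessarily one of the four techniques covered in the full post), the sketch below builds the same new column twice: once with apply() and once with a vectorized NumPy expression. The column names, the threshold of 100 and the `label_price` helper are all made up for this example.

```python
import numpy as np
import pandas as pd

# Made-up data; the column name and threshold are assumptions for illustration.
df = pd.DataFrame({"price": [120.0, 80.0, 45.5, 300.0]})


def label_price(price):
    """Row-wise rule that apply() calls once per value."""
    return "high" if price >= 100 else "low"


# apply(): Python-level loop over every value in the column.
df["label_apply"] = df["price"].apply(label_price)

# Vectorized: one comparison over the whole column, then np.where picks labels.
df["label_vectorized"] = np.where(df["price"] >= 100, "high", "low")

print(df)
```

Both columns hold the same labels; the difference is that the vectorized version evaluates the condition on the whole column at once instead of calling a Python function per row, which tends to matter as the dataframe grows.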

Continue reading “Speed Up Pandas Dataframe Apply Function to Create a New Column”

Collocations in NLP using NLTK Library

Collocation in NLTK

Collocations are phrases or expressions containing multiple words, that are highly likely to co-occur. For example – ‘social media’, ‘school holiday’, ‘machine learning’, ‘Universal Studios Singapore’, etc.
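A small sketch of how such pairs can be surfaced with NLTK’s bigram collocation finder is shown below. The three sentences are invented for the example, tokenization is a plain whitespace split, and the frequency cutoff of 2 is an arbitrary assumption; the full post may well use different settings.

```python
from nltk.collocations import BigramCollocationFinder
from nltk.metrics import BigramAssocMeasures

# Invented sentences; real input would be a much larger corpus.
reviews = [
    "social media drives school holiday bookings",
    "machine learning ranks social media posts",
    "school holiday crowds love machine learning demos",
]

# Flatten into one token stream (this also creates a few cross-sentence pairs,
# which the frequency filter below discards in this toy example).
tokens = [word for review in reviews for word in review.lower().split()]

finder = BigramCollocationFinder.from_words(tokens)
finder.apply_freq_filter(2)  # drop bigrams seen fewer than 2 times

bigram_measures = BigramAssocMeasures()
top_bigrams = finder.nbest(bigram_measures.pmi, 5)  # rank survivors by PMI

print(top_bigrams)
```

On a corpus this tiny the frequency filter does most of the work; on real data the association measure (PMI here) is what separates genuine collocations from pairs that are merely frequent.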

Continue reading “Collocations in NLP using NLTK Library”

Saturday Kids: Code in the Community Experience

Kids of 8-10 years of age are incredibly smart and are riding high on the curve of curiosity and learning. Thus, it’s equally challenging to teach them. Did I just write challenging? Did I not mention that I feel a strange pull towards anything challenging? Jokes apart, in June I came across an opportunity to teach Python/Scratch to kids in Singapore: a 10-week Code in the Community program run by Saturday Kids in collaboration with Google. This post is an account of my experience and learnings throughout these 10 weeks with Saturday Kids.

Continue reading “Saturday Kids: Code in the Community Experience”

Time Series Analysis using Pandas

A time series is a series of data points ordered in time. Pretty intuitive, isn’t it? Time series analysis helps businesses analyze past data, predict trends, detect seasonality, and handle numerous other use cases. Some examples of time series analysis in our day-to-day lives include:

  • Measuring weather
  • Measuring number of taxi rides
  • Stock prediction

In this blog, we will be dealing with stock market data and will be using Python 3, Pandas and Matplotlib.
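To give a flavour of what working with such data in Pandas looks like, here is a minimal sketch on a made-up price series. The prices, the 3-day window and the weekly frequency are assumptions for illustration only, not data or settings from the full post, and plotting with Matplotlib is left out.

```python
import pandas as pd

# Made-up closing prices indexed by calendar day (2024-01-01 is a Monday).
dates = pd.date_range("2024-01-01", periods=10, freq="D")
prices = pd.Series(
    [100, 102, 101, 105, 107, 106, 110, 108, 111, 115],
    index=dates,
    dtype="float64",
)

# 3-day rolling mean smooths out day-to-day noise (first two values are NaN).
rolling_mean = prices.rolling(window=3).mean()

# Day-over-day percentage change.
daily_return = prices.pct_change()

# Last price of each week; "W" bins end on Sunday.
weekly_last = prices.resample("W").last()

print(rolling_mean.tail(3))
print(weekly_last)
```

The datetime index is what makes `rolling`, `pct_change` and `resample` line up with calendar time, so it is usually worth parsing dates into the index before doing anything else.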

Continue reading “Time Series Analysis using Pandas”

Handling Imbalanced Dataset with SMOTE in Python

Close your eyes and imagine that you live in a utopian world of perfect data. What do you see? What do you wish to see? Wait! Are you imagining a flawless balanced dataset? A collection of data whose labels form a magnificent 1:1 ratio: 50% of this, 50% of that; not a bit to the left, nor a bit to the right. Just perfectly balanced, as all things should be. Now open your eyes, and come back to the real world. Well, this blog is all about how to handle imbalanced datasets.

Continue reading “Handling Imbalanced Dataset with SMOTE in Python”

A Survey of API Management Platforms

In my previous blog, I discussed how I ended up interning at Dentsu. I also mentioned that I worked on scouting and building a POC for a cloud-agnostic, open source API management tool/platform that could help in setting up API design, gateway, store, and analytics. In this blog, I will jot down my work in much more detail.

We will be exploring four API Management platforms, namely:

Continue reading “A Survey of API Management Platforms”