Re-Imagining Topic Modeling in NLP: A Break from the Conventional Approach

I recently spoke about Contextual Topic Modeling in NLP at Google's La Kopi event for developers. The feedback I received made the talk a special one: many folks reached out to say that they found the topic, content, and technique intriguing, and that it helped them approach topic modeling in NLP from a different angle. So I decided to post the talk here, along with the transcript below.

Hi everyone, I'll start the talk with a general observation. To tackle problems in our professional space, I'm sure we are all used to adopting the approach that has been in use for a long time: an industry standard, a known formula. Even in data science there are so many predefined models and algorithms for so many use cases that it's not surprising if somebody has already done what you want to implement. And still, the results sometimes don't make sense to you, or you're just not happy with the quality of the insights. This happens a lot when you're building data-driven solutions.

Something similar happened to me too. The problem statement: the huge text corpus I had with me required me to extract themes or topics out of it. What eventually happened is what this talk is all about. I took a completely uncharted path for a problem statement such as topic modeling, which, by the way, is pretty standard and has established models all across the internet. The result that came out not only impressed me (although I'm probably building a lot of hype right on the first slide), it also impressed the stakeholders, because they got a whole new level of dimension to the insights, which regular topic modeling fails to deliver.

So let's dig in. A little bit about me: I work as a machine learning engineer in Singapore. I previously held data science roles in the ad-tech domain. I did my master's in data science with a specialization in consumer analytics from the National University of Singapore, and before my master's I spent about three years working for early- to mid-stage startups, where I wore multiple hats across product development and analytics. As a part of giving back to the community, I mentor and guide folks who are aspiring to and applying for master's studies, helping them with their essays, interview prep, and overall application. One fun fact about me: I am part of an imaginary tea cult. I don't have blood, only tea. Let me know if you want to join my tea cult.

Okay, so this is what the outline of the talk is going to look like. We will start with what topic modeling is and discuss the conventional approach. Next, I'll introduce you to the new approach that addresses the flaws of the conventional method. Towards the end, I'll give you a few pointers to further enhance the capabilities of the new approach that we are going to discuss.

Topic modeling is a technique in the field of text mining. As the name suggests, it's a process to automatically identify topics present in a text object and to derive hidden patterns exhibited by a text corpus, thus assisting in better decision making. I'm sure many of you are aware of the relevance of topic modeling, but some of the use cases are tagging customer support tickets according to their topic, routing specific customer complaints to the relevant teams, and so much more. We'll also talk about one particular use case in the subsequent slides.

Okay, so first we'll talk very briefly about the conventional approach. Although there are many approaches for obtaining topics from text, LDA, or Latent Dirichlet Allocation, is one of the most popular topic modeling techniques. A very simple, layman's explanation: latent refers to everything that we don't know a priori and that is hidden in the data; here, the themes or topics that a document consists of are unknown. Then Dirichlet: it's a distribution of distributions; in the context of topic modeling, the Dirichlet is the distribution of topics in documents and the distribution of words in topics. Allocation means that once we have the Dirichlet, we allocate topics to the documents and the words of a document to topics. That's it, that's LDA in a nutshell. I won't bore you with a lot of technical terminology.
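Before getting to its flaws, here is a minimal sketch of what training an LDA model typically looks like in Python using gensim. This is not the code from the talk; the library choice, the toy corpus, and the parameters are illustrative assumptions.

```python
# Minimal, illustrative LDA sketch with gensim. Not the code from the talk:
# the toy corpus, num_topics and other parameters are assumptions for demo only.
from gensim import corpora
from gensim.models import LdaModel

reviews = [
    "ordered cake for my wife on our anniversary at sentosa",
    "great kids menu available for saturday brunch",
    "celebrated my girlfriend birthday on sunday morning",
]

# Bag-of-words representation: this is the step that throws away word order.
texts = [review.split() for review in reviews]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# The number of topics has to be fixed before training.
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               passes=10, random_state=42)

for topic_id, words in lda.print_topics(num_words=5):
    print(topic_id, words)
```

Notice that the documents are reduced to bags of words and that the number of topics has to be chosen up front; both of these show up in the flaws below.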
But obviously there are some flaws in LDA, and that's why I'm presenting this. You have to decide the number of topics you want from the text before training the model. That's quite weird, right? How do you know how many topics will be able to justify all the content present in the text? Then, LDA can't capture correlations. By correlation I mean: let's say I'm searching for a particular topic and I come across an article that makes me want to read it; I would obviously be interested in another, highly correlated topic that I may not know about. Next, there is no to very little context in the topics; you have to infer the context all on your own. And LDA uses a bag-of-words approach, which hampers capturing context between the words. You will learn a lot more about these flaws later when I show you a couple of examples.

Okay, so now we have gone through what topic modeling is, what LDA is, and what some of its flaws are. One flaw we discussed was the lack of context in the topics turned out by LDA, so let's look at contextual topic modeling. First, I'll help you visualize what a contextual topic is. Imagine a scenario where you know the prominent theme present in the text you are analyzing, such as "Saturday" or "kids" in, let's say, restaurant reviews. Now visualize what people are writing about or around those themes, such as "Saturday brunch", where the actual review could be "I want to have brunch on Saturday with my family"; similarly "great kids menu", where the actual review could be "the restaurant had a great kids menu". That's your topic. So the point is: the overarching themes, which in this example are "Saturday" and "kids", plus the most probable phrase that occurs in the context of each theme, in this case "Saturday brunch" or "great kids menu", together constitute the contextual topics.

The themes "Saturday" and "kids" that you saw on the last slide were possible because of something called named entity recognition (NER). It locates and classifies an entity into predefined categories such as person, product, or organization. Then come the phrases like "Saturday brunch" or "kids menu available", which are called collocations: phrases or expressions whose words are highly likely to occur together. These two components constitute the essence of a new way to approach topic modeling.

I'm going to quickly run through the data pipeline that I used. For this example I used Tripadvisor reviews. For EDA I just used general off-the-shelf libraries, but for the models I used spaCy for its language model, DeepPavlov, a deep learning library in which you can do a lot of NLP tasks, for named entity recognition, and NLTK for collocations. The data set that I used was Sentosa attraction reviews on Tripadvisor: about 900,000 words across about 27,000 reviews.

I performed NER on the data, which led to the discovery of the entities within the text. The table that you see in the right column of the slide shows the entities extracted by NER, like person, date, day, and organization. For example, if I perform NER on the sentence "ordered cake for my wife on our anniversary at Sentosa", NER will be able to identify "cake" as a product entity, "wife" as a person, "Sentosa" as a location, and "anniversary" as an event. Cool, isn't it? The next step was to implement collocations: I extracted the two to three words surrounding each entity, and that's how we got topic phrases like "kids menu available" or "celebrate girlfriend birthday". We'll look at those examples in the upcoming slides.
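To make these two building blocks concrete, here is a small sketch using spaCy for NER plus a simple token window around each entity. The talk's pipeline used DeepPavlov for NER and NLTK for collocations; spaCy, the window size, and the sample sentence below are my stand-ins, not the exact implementation.

```python
# Illustrative NER + context-window sketch. The talk used DeepPavlov for NER;
# spaCy is a stand-in here, and the window size of 2 tokens is an assumption.
import spacy

# Requires: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

review = "Ordered cake for my wife on our anniversary at Sentosa."
doc = nlp(review)

# Step 1: named entity recognition. A small general-purpose model may only tag
# some of these; the talk's model surfaced product, person, location and event.
for ent in doc.ents:
    print(ent.text, ent.label_)

# Step 2: grab the few words surrounding each entity. These windows are the raw
# material for phrases like "Saturday brunch" or "kids menu available".
WINDOW = 2
tokens = [t.text.lower() for t in doc]
for ent in doc.ents:
    start = max(ent.start - WINDOW, 0)
    end = min(ent.end + WINDOW, len(doc))
    print(ent.text, "->", " ".join(tokens[start:end]))
```

Across the full corpus, NLTK's collocation finders (for example, BigramCollocationFinder with a PMI or likelihood-ratio measure) can then be used to keep only the surrounding phrases that genuinely co-occur with a theme; that ranking step is omitted here for brevity.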
What you see here is a tree map visualization. One thing to note about this kind of visualization is that the darker the color, the more prominent you can expect that particular topic and its box to be in the text. This particular tree map is built just from the themes present in the "day" entity and the collocations found around the day words.

Okay, now I'll explain the power of this technique and how contextual topic modeling helped my stakeholders make better decisions. We see that there's a lot more conversation around Mother's Day in the reviews, even more than around Valentine's Day. This allows digital marketers to target and run promotions and campaigns on Mother's Day as well, an insight that might not have been known to the marketing team but is now a valuable one, all because contextual topic modeling could dig into what people are writing around the "day" theme. It's highly likely that for Valentine's Day only a couple might visit the restaurant, but for a Mother's Day celebration a whole family might visit, thus leading to more revenue per marketing dollar. So you see the power of the kind of insight that contextual topic modeling brings to the table.

Another example: there's a lot of chatter around weekend mornings, with Sunday morning being the most talked about. Ironically, "avoid mornings" also stood out, which implies that people who visit the place on weekend mornings and find a big rush also write about avoiding the morning hours. This kind of contextual topic modeling helps you understand when your audience visits your place and avails your services, which is a great insight to have for your campaigns and for tapping into the behavior of your customers and audience. Another great insight is knowing who your customers are: birthday celebrations by girlfriends and boyfriends indicate a younger demographic. This can help you frame your target audience and plan promotional campaigns accordingly.
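For a rough idea of how such a tree map can be produced from theme counts, here is a short sketch using the squarify package with matplotlib. The counts below are made-up numbers purely for illustration, and this plotting stack is my assumption, not necessarily what was used for the slide.

```python
# Illustrative tree map of themes around the "day" entity. The counts are
# invented for demonstration; squarify is a third-party package (pip install squarify).
import matplotlib.pyplot as plt
import squarify

theme_counts = {
    "mother's day": 120,
    "sunday morning": 90,
    "saturday brunch": 75,
    "valentine's day": 45,
    "avoid mornings": 30,
}

labels = [f"{theme}\n{count}" for theme, count in theme_counts.items()]
sizes = list(theme_counts.values())

# Darker boxes correspond to more prominent themes, mirroring the talk's visual.
colors = plt.cm.Blues([count / max(sizes) for count in sizes])
squarify.plot(sizes=sizes, label=labels, color=colors, alpha=0.9)
plt.axis("off")
plt.show()
```

With real entity and collocation counts in place of the toy numbers, the same few lines reproduce the kind of chart the insights above were read from.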
For the same data set, five topics were extracted using LDA, as shown on the slide. As you go through those topics, you will see that it's up to you how you want to make sense of them and stitch your story when you use LDA; that's not the case with the technique that I demonstrated. So we saw a lot of evident differences between the topics found via LDA and the tree map visualization showing the contextual topics. A few things are pretty clear: the contextual topics, even though they have fewer words than the ones given by LDA, are much more meaningful and interpretable, and the variety of topics is far greater in the case of NER and collocations. The example you saw on the previous slides was just of the "day" theme; we could pull out several themes and find topics around those. You would have observed that I mentioned interpretable topics: interpretability is the thing that your stakeholders and your business would love to have.

But everything is not rosy. There is one problem, or not really a problem, more of a tedious task: one has to cherry-pick the entities of interest from the NER exercise. For example, the person entity may have several themes such as wife, husband, boyfriend, and girlfriend, and the day entity might have all the days of the week. The way to tackle this tedious task is to keep discussing with your stakeholders: what do they want to know, who is their customer base, do they want to know whether people come to their place on a weekday afternoon or in the evening? Once you know this kind of information from the problem statement, it becomes a lot more convenient to comb through the entities.

One possible enhancement to this technique is performing sentiment analysis: for example, knowing whether people express negative or positive sentiment around a review phrase like "watch divers feeding marine life" can change the narrative of a marketing campaign. From behavioral insights like who your customers are, how many of them visit or avail your services, what they are interested in, what they buy, and what activities they engage with, this technique can help you understand both your offline and online audiences.

So this wraps up my presentation. I hope you took away something meaningful and something to explore beyond the traditional approaches to topic modeling. You can find actual code samples on my blog, and you can reach out to me over LinkedIn or Twitter; DM me, feel free to shoot questions, connect, or even share your thoughts. I would be glad to know your takeaways from this presentation. Thank you.

Do check out the other posts I have written related to software engineering and data science. Please consider subscribing to my blog, and feel absolutely free to reach out. I also mentor in the areas of data science, career, and life in Singapore; you can book an appointment to talk to me.

