Python Archives - Shubhanshu Gupta

Speed Up Pandas Dataframe Apply Function to Create a New Column

Data cleaning is an essential step to prepare your data for the analysis. While cleaning the data, every now and then, there’s a need to create a new column in the Pandas dataframe. It’s usually conditioned on a function which manipulates an existing column. A strategic way to achieve that is by using Apply function. I want to address a couple of bottlenecks here:

Pandas: The Pandas library runs on a single thread and it doesn’t parallelize the task. Thus, if you are doing lots of computation or data manipulation on your Pandas dataframe, it can be pretty slow and can quickly become a bottleneck.
Apply(): The Pandas apply() function is slow! It does not take the advantage of vectorization and it acts as just another loop. It returns a new Series or dataframe object, which carries significant overhead.

So now, you may ask, what to do and what to use? I am going to share 4 techniques that are alternative to Apply function and are going to improve the performance of operation in Pandas dataframe.

Collocations in NLP using NLTK Library

Collocations are phrases or expressions containing multiple words, that are highly likely to co-occur. For example – ‘social media’, ‘school holiday’, ‘machine learning’, ‘Universal Studios Singapore’, etc.

Handling Imbalanced Dataset with SMOTE in Python

Close your eyes and imagine that you live in a utopian world of perfect data. What do you see? What do you wish to see? Wait! are you imagining a flawless balanced dataset? A collection of data whose labels form a magnificent 1:1 ratio: 50% of this, 50% of that; not a bit to the left, nor a bit to the right. Just perfectly balanced, as all things should be. Now open your eyes, and come back to the real world. Well, this blog is all about how to handle imbalanced datasets.