“Nothing in life is to be feared. It is only to be understood.” - Marie Curie

If you’re among the thousands of people looking to dip your toes into the field of data science and are having a hard time knowing where to start, welcome. I was in the same place when I began my journey towards becoming a data scientist, and I know exactly how it feels to be bombarded with information and not know what to do with it. There are tons of resources out there that explain the basics of data science in different ways, often using jargon that can confuse a beginner. It took me a long while to get the basics straight, but once I did, my road to becoming a data scientist became that much easier. Understanding the concepts is the key to moving forward, and simplifying them with real-world analogies helped me grasp things faster.

In this article, I’m going to try and explain the basics of solving a data science problem with one such simple analogy. So if you’re up for it, and you like Diwali (who doesn’t?), let’s begin!

  1. Defining the problem

Description: The first step in solving any problem is defining exactly what the problem is. You need to understand what is required, and what the constraints are, to be able to present an appropriate solution. You do this by asking questions and doing research until you understand the issue fully, and then creating a set of objectives to tackle it.

Analogy: Imagine that Diwali is next week and you have a lot of preparation to do before then. To get started, you have to note down exactly what your goals are. For example, you decide that you need to clean the entire house before Diwali, and also plan a party for your loved ones. So you create a list of the items you will need for the decorations, food etc., and of all the chores you will have to complete before the big day.

This is called understanding and defining the problem.

  2. Data Acquisition

Description: After understanding the problem, the next step is gathering all the data required to solve it. You need to collect information from multiple sources, e.g., logs, spreadsheets, websites, databases etc.

Analogy: After creating a list of the chores and items you need, you go shopping. You go to a hardware store to get lights, a grocery store to get ingredients, and a bakery to get sweets for the party. You also put all the clothes, shoes and other items in your house that need to be sorted into one big pile called Pile 1. Now you have everything you need to move ahead with your plans.

This is called data acquisition.
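To make the shopping trip a little more concrete, here is a minimal sketch of gathering records from two sources into one list. The two "stores" (a CSV string and a JSON string), their contents, and the function names are all made up for illustration; a real project would read from actual files, databases or websites.

```python
import csv
import io
import json

# Hypothetical sources: one "store" hands us CSV text, the other JSON text.
HARDWARE_CSV = "item,quantity\nstring lights,4\noil lamps,20\n"
BAKERY_JSON = '[{"item": "laddoo", "quantity": 30}, {"item": "barfi", "quantity": 25}]'


def load_csv(text, source):
    """Parse CSV text into a list of dicts, tagging each row with its source."""
    rows = []
    for row in csv.DictReader(io.StringIO(text)):
        rows.append({"item": row["item"], "quantity": int(row["quantity"]), "source": source})
    return rows


def load_json(text, source):
    """Parse a JSON array into a list of dicts, tagging each row with its source."""
    rows = []
    for row in json.loads(text):
        rows.append({"item": row["item"], "quantity": int(row["quantity"]), "source": source})
    return rows


def acquire():
    """Combine every source into one list, ready for the next step."""
    return load_csv(HARDWARE_CSV, "hardware store") + load_json(BAKERY_JSON, "bakery")


if __name__ == "__main__":
    for record in acquire():
        print(record)
```

Tagging each record with its source is a small design choice that makes it easier to trace a bad value back to where it came from later.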

  3. Data Preparation

This is divided into two parts: Data Cleaning and Data Transformation.

  • Data Cleaning

Description: Data cleaning means getting rid of null or wrong data, dealing with missing and/or duplicate values, correcting misspelled attributes and inconsistent data types etc. The cleaner your data, the more reliable your predictions will be.

Analogy: Before putting all the items from Pile 1 into their proper places, you first need to go through them. You may throw away torn clothes, give away any extra or duplicate items you have etc. You may also have to search for missing values, e.g., missing socks from a pair.

This is data cleaning.
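Here is a rough sketch of what going through Pile 1 might look like in code. The records and the `clean` function are hypothetical, and dropping incomplete rows is just one of several ways to handle missing values (you could also fill them in).

```python
# Hypothetical Pile 1: messy records with duplicates, gaps and wrong types.
PILE_1 = [
    {"item": "Kurta", "size": "40"},
    {"item": "kurta ", "size": 40},       # duplicate once the text is normalised
    {"item": "sock", "size": None},       # missing value
    {"item": "", "size": 8},              # missing name
    {"item": "Sandal", "size": "eight"},  # wrong data type
    {"item": "sandal", "size": "8"},
]


def clean(records):
    """Drop incomplete or invalid rows, normalise text, and remove duplicates."""
    seen = set()
    cleaned = []
    for record in records:
        name = (record.get("item") or "").strip().lower()
        size = record.get("size")
        if not name or size is None:
            continue  # missing value: skip the row
        try:
            size = int(size)  # make the data type consistent
        except (TypeError, ValueError):
            continue  # not a usable number: skip the row
        key = (name, size)
        if key in seen:
            continue  # duplicate: keep only the first copy
        seen.add(key)
        cleaned.append({"item": name, "size": size})
    return cleaned


if __name__ == "__main__":
    print(clean(PILE_1))
```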

  • Data Transformation 

Description: Data transformation means restructuring and modifying the cleaned data so that it fits the layout you have planned for reaching a solution.

Analogy: After cleaning up Pile 1 by throwing away what you don't need, you start to sort what is left into different piles. From Pile 1 you create smaller, specific piles: one for clothes, one for shoes, one for makeup, one for books etc. You do this so that it is easy to put everything in its proper place, like a wardrobe or a bookshelf, and complete one of your goals: cleaning the house.

This is data transformation. 
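Sorting one big pile into smaller ones is essentially a group-by operation. The sketch below assumes a made-up lookup table, `CATEGORY_OF`, that says which pile each item belongs to; anything it doesn't recognise goes into an "other" pile.

```python
from collections import defaultdict

# Hypothetical lookup table: which pile does each kind of item belong to?
CATEGORY_OF = {
    "kurta": "clothes",
    "saree": "clothes",
    "sandal": "shoes",
    "novel": "books",
}


def sort_into_piles(items):
    """Group a flat list of items into one list per category."""
    piles = defaultdict(list)
    for item in items:
        piles[CATEGORY_OF.get(item, "other")].append(item)
    return dict(piles)


if __name__ == "__main__":
    print(sort_into_piles(["kurta", "novel", "sandal", "saree", "lipstick"]))
```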

Now you have prepared the data for further analysis. 

  4. Exploratory Data Analysis

Description: Exploratory data analysis (EDA) consists of all the initial investigations you run to map out patterns and anomalies, check hypotheses, and understand what exactly you can do with your data. It is one of the most important steps in solving a data science problem.

Analogy: After you’ve cleaned your house and put everything in place, it is time to focus on the Diwali party. You decide to have two parties: one on Diwali eve, and a bigger one on the actual day of the festival. You make a list of everyone you’re going to invite. Before the actual party, you study that list to work out the kind of menu you can make, for example, remembering to keep sugar-free treats for diabetic guests. The list also tells you what kind of decoration you can do, and whether or not you can keep firecrackers and flammable items around (if animals or kids are coming). Because you have the invitees’ information, you have an idea of what kind of party will be a success with them.

This is exploratory data analysis. 
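A first pass at EDA is often just a handful of summary numbers. The sketch below uses an invented guest list and a hypothetical `summarise` function to answer the kind of questions the analogy raises: how many guests, how old, how many are diabetic, and whether any kids are coming.

```python
from statistics import mean, median

# Hypothetical guest list; every name and value here is invented.
GUESTS = [
    {"name": "Asha", "age": 34, "diabetic": False, "brings_kids": True},
    {"name": "Ravi", "age": 62, "diabetic": True, "brings_kids": False},
    {"name": "Meera", "age": 28, "diabetic": False, "brings_kids": False},
    {"name": "Kabir", "age": 40, "diabetic": False, "brings_kids": True},
]


def summarise(guests):
    """Return a few summary numbers that help plan the menu and decor."""
    ages = [guest["age"] for guest in guests]
    return {
        "guest_count": len(guests),
        "mean_age": mean(ages),
        "median_age": median(ages),
        "diabetic_guests": sum(guest["diabetic"] for guest in guests),
        "any_kids": any(guest["brings_kids"] for guest in guests),
    }


if __name__ == "__main__":
    for key, value in summarise(GUESTS).items():
        print(f"{key}: {value}")
```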

  5. Data Modeling

Description: Data modeling means applying different techniques to the data to identify the model best suited to the given problem. There are many techniques to choose from, e.g., k-nearest neighbours (KNN), decision trees, regression etc. They are first applied to a training dataset (to create a model) and then evaluated on a test dataset (to check the accuracy of the model).

Analogy: Before the party, you call a few of your friends to help you with the menu and the decor. You have them taste various dishes, from which they select the ones they like best, and they give you tips on how to improve the taste of some of them. You also show them different curtain designs, string lights etc. for decorating the house, and you choose the layout that the majority likes. So you end up with the menu and decor that most of your friends like.

The day of the Diwali eve party is when actual success is measured. If a majority of your guests love the menu and decor, it means your choices matched their preferences accurately. If not, the menu you created with your friends (the training dataset) did not generalize to the actual guests (the test dataset).

This is data modeling. 
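To show the train-then-test idea in code, here is a small sketch of KNN written with only the Python standard library. The tasting scores are invented and deliberately easy to separate, and the function names are my own; in practice you would more likely reach for a machine learning library than write this by hand.

```python
import math
import random
from collections import Counter

# Hypothetical tasting data: (sweetness, spiciness) on a 0-10 scale, plus a verdict.
DISHES = [
    ((8, 1), "liked"), ((9, 2), "liked"), ((7, 1), "liked"),
    ((8, 3), "liked"), ((9, 1), "liked"), ((7, 2), "liked"),
    ((2, 9), "disliked"), ((1, 8), "disliked"), ((3, 9), "disliked"),
    ((2, 7), "disliked"), ((1, 9), "disliked"), ((3, 8), "disliked"),
]


def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and hold back a fraction of them for testing."""
    rows = rows[:]
    random.Random(seed).shuffle(rows)
    n_test = max(1, int(len(rows) * test_fraction))
    return rows[n_test:], rows[:n_test]


def knn_predict(train, point, k=3):
    """Predict a label by majority vote among the k closest training rows."""
    nearest = sorted(train, key=lambda row: math.dist(row[0], point))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


def accuracy(train, test, k=3):
    """Fraction of held-back rows whose label the model predicts correctly."""
    correct = sum(knn_predict(train, features, k) == label for features, label in test)
    return correct / len(test)


if __name__ == "__main__":
    train, test = train_test_split(DISHES)
    print("accuracy on held-back dishes:", accuracy(train, test))
    print("prediction for a sweet, mild dish:", knn_predict(train, (8, 2)))
```

The friends who taste the dishes play the role of `train`, and the guests at the party play the role of `test`: the model never sees the test rows until it is time to measure how well it does.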

  6. Communication

Description: The key findings are communicated to all the stakeholders. Their feedback on the model's results helps you determine whether the project is a success or a failure.

Analogy: You talk to people at the Diwali eve party, explain your choices, and ask them whether they liked the food, the drinks, the decor etc., and what could have been improved.

This is data communication. 

  7. Deployment and Maintenance

Description: This means deploying the model in the real production environment. From here on, you monitor how the model performs on real-time inputs, and continuously update and improve it based on what you see.

Analogy: After gaining insights from the small Diwali eve party, it is time for you to throw the main Diwali party. During the party you keep a check on the food and drinks available, and keep replenishing them. You check that everyone is comfortable and enjoying themselves by listening to their requests and fulfilling them in real time. Voilà! Your party was an absolute hit!

This is deployment and maintenance.
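Keeping a check on a deployed model can be sketched as tracking how often its recent predictions were right. The `ModelMonitor` class below is a hypothetical illustration, with a made-up window size and threshold; real monitoring setups are usually more involved.

```python
from collections import deque


class ModelMonitor:
    """Track recent prediction outcomes and flag when accuracy drops too low."""

    def __init__(self, window=5, threshold=0.6):
        self.outcomes = deque(maxlen=window)  # only the latest `window` results are kept
        self.threshold = threshold

    def record(self, predicted, actual):
        """Store whether one live prediction turned out to be correct."""
        self.outcomes.append(predicted == actual)

    def rolling_accuracy(self):
        """Accuracy over the recent window, or None if nothing is recorded yet."""
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_retraining(self):
        """True once the window is full and accuracy is below the threshold."""
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.rolling_accuracy() < self.threshold


if __name__ == "__main__":
    monitor = ModelMonitor(window=4, threshold=0.75)
    feedback = [("liked", "liked"), ("liked", "liked"), ("liked", "disliked"),
                ("disliked", "liked"), ("liked", "disliked")]
    for predicted, actual in feedback:
        monitor.record(predicted, actual)
        print(monitor.rolling_accuracy(), monitor.needs_retraining())
```

Just like replenishing the snacks, the point is to notice a problem while the party is still going, not after everyone has left.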

Thanks for sticking with me till the end. These were just some basic data science concepts that I wanted to explain, and I truly hope you learnt something today. The best way to master data science is through understanding, and the best way to understand the concepts is by creating your own examples and analogies. Try writing down complicated concepts and explaining them with the help of the things you love the most; you are far less likely to forget them that way. All the very best for your journey of learning data science!

By Aeiknor