By Mukunda Varma | 16th October, 2020 | 8 min Read
Data Science is simply the science of data.
Data science is defined as an interdisciplinary field that uses scientific methods, processes, algorithms, and systems to extract knowledge and insights from structured and unstructured data.
Did that seem confusing? Let’s break it down for a better understanding:
Before going deeper into data science, let us have a look at why data science is important.
“The world’s most valuable resource is no longer oil, but data.” — The Economist
The quote shows how important data is in the present time. Let us get straight into a few stats. According to a 2019 Forbes report, about 2.5 quintillion bytes of data are created on the internet every day. Every activity happening in the world is being transformed into data in one way or another.
The concept of Big Data is getting bigger and bigger. With the availability of such vast amounts of data, more and more companies are looking forward to using this data to gain valuable insights and improve their businesses.
Since most of this data is unstructured, data science tools and methods are becoming more important by the day. Data Scientist was named the number one job in the U.S. by Glassdoor for four years in a row, and LinkedIn listed data scientist as one of the most promising jobs in 2017 and 2018. The U.S. Bureau of Labor Statistics estimates that around 11.5 million new data science jobs will be created by 2026.
Let’s dive into the applications -
From getting relevant results in your Google search to predicting the growth of cancers, the applications of data science are endless. With such a vast amount of data on the internet, isn’t it amazing how Google still shows you such accurate results? Imagine how much value you could bring to your company if you knew the likes and behaviours of your customers, or how much money you would save if you could detect fraudulent bank transactions well in advance and abort them. How many hundreds of lives could be saved if various cancers and dreadful diseases were detected in their early stages? All of these are important applications of data science.
Truly amazing, isn’t it? Now that you know the importance of data science and some of its real-life applications, aren’t you eager to dig deeper into this field?
Let us get into the details of what data science involves and the various steps in the process.
At a higher level, data science involves 6 main steps: data collection, data cleaning, data analysis, model building, model evaluation, and data visualization.
To develop a better understanding of these 6 steps let us take a real-life example and see how each of the 6 steps plays a part in solving the problem.
Let us assume we want to predict heart failure in a person based on his/her various health parameters (The dataset was collected from here) :
This step involves extracting and collecting all the relevant data and putting it in one place. Data is generally collected from sensors (IoT devices), historical records, surveys, web scraping, and other raw sources.
But don’t worry, you need not travel to different places or access IoT devices to obtain your desired datasets. A lot of data is already available online and can be downloaded for free; Google even has a search engine made specifically for finding datasets. Another easy way of collecting data is web scraping: select a website that seems authentic and meets your needs, and start scraping the data you want.
Although any data that is publicly available can be scraped, be sure to check the website policies before you start scraping the data.
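As a rough illustration of scraping tabular data, here is a minimal, dependency-free sketch using Python's standard-library HTML parser. The HTML snippet, the `TableScraper` class, and the column names are all invented for this example; in practice you would fetch a real page first (for instance with the `requests` library) and would more likely use a dedicated tool such as BeautifulSoup:

```python
from html.parser import HTMLParser

class TableScraper(HTMLParser):
    """Collects cell text from <td>/<th> elements, one list per table row."""
    def __init__(self):
        super().__init__()
        self.rows = []        # completed rows
        self._row = []        # cells of the row being parsed
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag == "tr" and self._row:
            self.rows.append(self._row)
        elif tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self._row.append(data.strip())

# A toy page standing in for a site we are allowed to scrape
html = "<table><tr><th>age</th><th>sex</th></tr><tr><td>60</td><td>1</td></tr></table>"
scraper = TableScraper()
scraper.feed(html)
print(scraper.rows)  # [['age', 'sex'], ['60', '1']]
```

The same pattern scales to real pages: feed the fetched HTML into the parser and collect the rows you need.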
Data Collection can be tricky. Raw data is generally unstructured and needs to be brought into proper shape before it can be used for analysis.
The data can have a lot of discrepancies caused due to errors in the method of extraction or due to human errors. For example, the collected data might contain duplicate records, missing values of particular data, or some extreme values which might impact our results. All such discrepancies have to be resolved before analysis, to make sure our results are accurate.
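As a sketch of how such discrepancies might be resolved with pandas, here is a toy example; the records and column names are invented, not the actual dataset:

```python
import pandas as pd

# Toy "raw" records: a duplicate row, a missing age, and an extreme value
raw = pd.DataFrame({
    "age": [60, 60, 45, None, 200],
    "sex": [1, 1, 0, 1, 0],
})

clean = raw.drop_duplicates()                 # remove duplicate records
clean = clean.dropna(subset=["age"])          # drop rows with missing age
clean = clean[clean["age"].between(0, 120)]   # filter implausible ages
print(len(clean))  # 2
```

Each of the three discrepancies mentioned above (duplicates, missing values, extreme values) is handled by a single pandas call here; real cleaning usually also involves fixing types, units, and inconsistent labels.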
Let’s see this in our case: This is how a sample of our data looks after data extraction
The data shown in the figure is used to analyze whether a person’s heart is prone to failure based on several factors, including age, sex, smoking habits, and various other health conditions. As mentioned above, the raw data is unstructured and needs to be processed before analysis.
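A minimal sketch of loading and inspecting such data with pandas; the rows and column names below are invented stand-ins for the real dataset's schema:

```python
import io
import pandas as pd

# A few illustrative rows in the spirit of the heart-failure data
csv_text = """age,sex,smoking,heart_failure
60,1,0,1
45,0,1,0
70,1,1,1
"""

# In practice you would call pd.read_csv("heart_failure.csv") on the file
df = pd.read_csv(io.StringIO(csv_text))
print(df.head())    # first rows, to eyeball the data
print(df.dtypes)    # check that columns were parsed as numbers
```

A quick `head()` and `dtypes` check like this is usually the first thing done after extraction, before any cleaning or analysis.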
Data analysis can be considered the heart of data science: using various methods and tools to analyze your data and draw meaningful insights from it.
Before starting with the various analysis methods, it is important to list out the kind of results you expect. The kind of insights you get from your data depends on the type of analysis performed. Data analysis can be of four types: descriptive, diagnostic, predictive, and prescriptive.
While all these types share a lot of similarities, each of these serves a different purpose and provides varied insights. In our case, we will be doing the first three types:
We’ll first see the mean, median, and other general statistics:
From the above diagram, we can infer that the median (50%) age is 60.0, which suggests that people around the age of 60 are more prone to heart failure than others.
Similarly, the mean of sex (1 indicating male and 0 indicating female) is 0.64, which means that about 64 in 100 patients are male. We can draw similar inferences from the other factors as well.
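A sketch of how `describe()` surfaces these statistics in pandas; the numbers below are toy values chosen for illustration, not the real dataset:

```python
import pandas as pd

# Toy sample in the spirit of the heart-failure data
df = pd.DataFrame({
    "age": [40, 55, 60, 60, 65, 80],
    "sex": [1, 0, 1, 1, 0, 1],   # 1 = male, 0 = female
})

stats = df.describe()              # count, mean, std, min, quartiles, max
print(stats.loc["50%", "age"])     # median age -> 60.0
print(round(df["sex"].mean(), 2))  # fraction of males -> 0.67
```

The "50%" row of `describe()` is exactly the median discussed above, and the mean of a 0/1 column directly gives the proportion of the class coded as 1.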
General descriptive analysis:
Diagnostic analysis:
These are a few examples of the analysis we could do; many more comparisons and patterns can be derived from the data.
This is the most fascinating part of a data science problem. Data analysis, which gives us valuable insights into the data, is indeed very important, but in most cases it is not enough on its own.
For example, suppose you collected historical data about the heart failures that happened in a set of patients over the past 10 years. You did a very detailed analysis of all the causes of failure and described, in detail, the impact of each factor on heart failure. But now what? How is your analysis worth the effort if it cannot predict and prevent the next heart failure?
That is how important prediction is. You take all the patterns and trends in the data, consider every factor that negatively impacts the functioning of the heart and the level of influence each factor might have on its failure, and form a mathematical/statistical model from all of the above. We then use this model to predict whether there is a chance of failure or not.
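As an illustration of this step, here is a minimal sketch that fits a logistic regression classifier with scikit-learn. The article does not specify which model was used, and the handful of records below (features `[age, smoking]`, label 1 for heart failure) are invented for the example:

```python
from sklearn.linear_model import LogisticRegression

# Tiny illustrative sample: [age, smoking] -> heart failure (1) or not (0).
# Real work would use the full dataset and all health parameters.
X = [[40, 0], [45, 1], [50, 0], [55, 0],
     [60, 1], [65, 1], [70, 1], [75, 1]]
y = [0, 0, 0, 0, 1, 1, 1, 1]

model = LogisticRegression()
model.fit(X, y)

# Predict for a 68-year-old smoker and a 42-year-old non-smoker
preds = model.predict([[68, 1], [42, 0]])
print(preds)
```

The fitted model encodes the influence of each factor as a coefficient, which is exactly the "mathematical/statistical model" the paragraph above describes.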
Well, to be very frank, no one gets the model right the very first time. Only through multiple rounds of evaluation and improvement can one refine the model. Evaluation generally involves two steps: testing the model on data it has not seen, and measuring its performance with an appropriate metric.
Two methods are used for evaluation: the hold-out method (a train/test split) and cross-validation.
The first method is generally used when the dataset is large enough, while the second is used when only limited data is available.
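A sketch of both evaluation methods with scikit-learn, using synthetic data as a stand-in for the heart-failure dataset (the sample sizes and model choice are assumptions for the example):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Synthetic stand-in for the real dataset
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
model = LogisticRegression()

# Method 1: hold out a test set (suits larger datasets)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
holdout_acc = model.fit(X_tr, y_tr).score(X_te, y_te)

# Method 2: 5-fold cross-validation (suits limited data,
# since every record is used for both training and testing)
cv_scores = cross_val_score(model, X, y, cv=5)

print(holdout_acc)
print(cv_scores.mean())
```

Cross-validation trades extra computation (the model is fit once per fold) for a more stable estimate, which is why it is preferred when data is scarce.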
While accuracy is the most commonly used metric for evaluation, it is not the right metric in all cases.
For example, the datasets used for credit card fraud detection are very skewed: you generally have around 2,000 or 3,000 fraudulent transactions in a dataset of 10 lakh (one million) records. So even if your model predicts every fraudulent transaction as genuine, you still end up with an accuracy of 99.7% (997000/1000000). This result is dangerously misleading. In such cases you need to consider metrics such as precision, recall, and the F-measure, which take into account the fraudulent transactions your model misses.
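The arithmetic behind that example, in plain Python: a "model" that labels every transaction as genuine scores 99.7% accuracy while catching zero fraud.

```python
# 1,000,000 transactions, 3,000 of them fraudulent (the positive class)
n_total, n_fraud = 1_000_000, 3_000

# A useless model that labels every transaction as genuine:
true_positives = 0
false_negatives = n_fraud           # every fraud case is missed
true_negatives = n_total - n_fraud  # every genuine case is "correct"

accuracy = (true_positives + true_negatives) / n_total
recall = true_positives / (true_positives + false_negatives)

print(accuracy)  # 0.997
print(recall)    # 0.0 -- the model catches no fraud at all
```

Recall (and the F-measure built from it) drops to zero here, which is exactly why these metrics expose a model that accuracy makes look excellent.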
In our case, the sample is not so skewed, so we can go with the accuracy metric. Let’s see how accurate our model is on the first try (the k-means algorithm was used in the first try):
As mentioned above, the first try is never enough. On changing the algorithm and other inputs, the results were as follows:
We see a slight increase in accuracy. Similarly, after several other optimizations, we arrived at a final accuracy of 79%, a significant improvement:
We all know how important visualization is; no project is complete without it. A visual (say, an image or a graph) has a different level of impact on your audience: it helps you communicate with audiences and users more effectively and makes understanding much easier.
Although we have listed this step as the last in the process, this can be used at any step in the process.
Some of the visualizations we did in our analysis:
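As one example of this step, here is a minimal matplotlib sketch; the counts below are invented for illustration, while the article's actual charts came from the real dataset:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen, no display needed
import matplotlib.pyplot as plt

# Illustrative (made-up) counts: heart failures by age group
age_groups = ["40-49", "50-59", "60-69", "70-79"]
failures = [4, 9, 15, 11]

fig, ax = plt.subplots()
ax.bar(age_groups, failures)
ax.set_xlabel("Age group")
ax.set_ylabel("Heart failures")
ax.set_title("Heart failures by age group (illustrative)")
fig.savefig("failures_by_age.png")
```

A simple bar chart like this often communicates a pattern (here, a peak in the 60-69 group) far faster than the table of numbers it was drawn from.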