Data Science is simply the science of data.
Before going deeper into data science, let us have a look at why
data science is important.
Data has never been more important than it is today. Let us get straight into a few stats. According to a 2019 Forbes report, about 2.5 quintillion bytes of data are created on the internet every day. Almost every activity happening in the world is being transformed into data in one way or another.
The concept of Big Data is getting bigger and bigger. With such vast amounts of data available, more and more companies are looking to use it to gain valuable insights and improve their businesses. Since most of this data is unstructured, data science tools and methods are becoming more important by the day.
Data Scientist has been named the number one job in the U.S. by Glassdoor for four years in a row, and LinkedIn listed data scientist as one of the most promising jobs in 2017 and 2018. The U.S. Bureau of Labor Statistics estimates that around 11.5 million new data science jobs will be created by 2026.
From getting relevant results in your Google search to predicting the growth of cancers, the applications of data science are endless. With such a vast amount of data on the internet, isn't it amazing how Google still shows you such accurate results? Imagine how much value you could bring to your company if you knew the likes and behaviours of your customers, or how much money you would save if you could detect fraudulent bank transactions well in advance and abort them. How many hundreds of lives could be saved if various cancers and dreadful diseases were detected in their early stages? All of these are among the most important applications of data science. Truly amazing, isn't it? Now that you know the importance of data science and some of its real-life applications, aren't you eager to dig deeper into this field?
Let us get into the details of what a data science project involves and the various steps in it. A typical project has six steps: data collection, data cleaning, data analysis, prediction, evaluation and fine tuning, and data visualization. To develop a better understanding of these 6 steps, let us take a real-life example and see how each of them plays a part in solving the problem.
Let us assume we want to predict heart failure in a person based on his/her various health parameters.
Data Collection
This step involves extracting and collecting all the relevant data and putting it in one place. Data is generally collected from sensors (IoT devices), historical records, surveys, web scraping, and other raw sources.
But don’t worry, you need not travel to different places or access IoT devices to obtain your desired datasets. A lot of data is already available online and can be downloaded for free. Google even has a search engine, Google Dataset Search, made specifically to find datasets. Another easy way of collecting data is through web scraping: pick a website that you feel is authentic and satisfies your needs, and scrape the data you want from it.
Although any data that is publicly available can be scraped,
be sure to check the website policies before you start
scraping the data.
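As a minimal sketch of what scraping can look like, assuming the `requests` and `BeautifulSoup` libraries and a purely hypothetical page of tabular health records:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL -- replace it with a site whose policies allow scraping.
url = "https://example.com/health-records"

response = requests.get(url, timeout=10)
response.raise_for_status()  # stop early on HTTP errors

soup = BeautifulSoup(response.text, "html.parser")

table = soup.find("table")
if table is None:
    raise ValueError("no table found on the page")

# Pull every row out of the first table on the page.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text(strip=True) for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

print(rows[:5])  # header plus the first few records
```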
Data Cleaning
Data collection can be tricky. Raw data is generally unstructured and needs to be brought into proper shape before it can be used for analysis.
The data can have a lot of discrepancies caused by errors in the extraction method or by human error. For example, the collected data might contain duplicate records, missing values, or some extreme values that might skew our results. All such discrepancies have to be resolved before analysis to make sure our results are accurate.
Let's see this in our case: this is how a sample of our data looks right after extraction.
The data shown in the figure is used to analyze whether a person's heart is prone to failure based on several factors, including age, sex, smoking habits, and various other health conditions. As mentioned above, the raw data is unstructured and needs to be processed before it can be analyzed: missing values have to be filled in or removed, impossible values corrected, and duplicate records dropped. After performing all these operations, we will have clean data that can be used for analysis. In our example, the final data looks similar to the image below.
- From the above image, we can find null values (NaN, aka missing values) in the 2nd and 3rd rows. These null values are generally replaced with the mean or median value of that column, or sometimes the entire row is removed.
- Similarly, if you look at the age of the third person, it says 245, which is not possible. This means the data in that row is wrong and must be removed.
- Also, the last two rows contain the same data and are duplicates, so we can exclude one of the two (a short pandas sketch of all three fixes follows below).
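Here is a minimal pandas sketch of those three cleanup steps; the file name and the exact thresholds are illustrative assumptions:

```python
import pandas as pd

df = pd.read_csv("heart_failure_records.csv")  # hypothetical file name

# 1. Fill missing numeric values with the median of their column
#    (dropping the whole row via df.dropna() is the alternative).
numeric = df.select_dtypes(include="number").columns
df[numeric] = df[numeric].fillna(df[numeric].median())

# 2. Remove rows with impossible values, e.g. an age of 245.
df = df[df["age"].between(0, 120)]

# 3. Drop duplicate rows, keeping only the first occurrence.
df = df.drop_duplicates()
```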
Data Analysis
Data analysis can be considered the heart of data science. It means using various methods and tools to analyze your data and gain meaningful insights from it.
Before starting with the various analysis methods, it is important to list out the kind of results you are looking for, because the insights you get from your data depend on the type of analysis performed. Data analysis can be of four types:
- Descriptive Analysis -- knowing what happened.
- Diagnostic Analysis -- finding out why something happened.
- Predictive Analysis -- figuring out what is likely to happen.
- Prescriptive Analysis -- deciding what we should do for something to happen.
While all these types share a lot of similarities, each serves a different purpose and provides varied insights. In our case, we will be doing the first three types. We'll first look at the mean, median, and other general statistics of our data.
In the above diagram, we can see that the median value (50%) of age is 60.000, which indicates that people around the age of 60 are more prone to heart failure than others. Similarly, the mean of sex (1 indicating male and 0 indicating female) is 0.64, which means that 64 out of 100 people in the dataset are male. We can draw similar inferences from the other factors as well.
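The table referenced above looks like the output of pandas' `describe()`; here is a minimal sketch of how such a summary is produced, assuming the same hypothetical CSV as in the cleaning sketch:

```python
import pandas as pd

df = pd.read_csv("heart_failure_records.csv")  # hypothetical file name

# count, mean, std, min, quartiles (25%/50%/75%), and max for every column
print(df.describe())

# Individual statistics can also be pulled out directly.
print(df["age"].median())  # 60.0 in our sample
print(df["sex"].mean())    # 0.64 -> roughly 64% of patients are male
```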
These are just a few examples of general descriptive analysis; many more comparisons and patterns can be derived from the data.
Prediction
This is the most fascinating part of a data science problem.
The data analysis part, which gives us valuable insights regarding the data, is indeed very important. But in most cases, it is not enough. For example, suppose you collected historical data about the heart failures that happened in a set of patients over the past 10 years. You did a very detailed analysis of all the causes of failure and described in detail the impact of each factor on the failure of the heart. But now what? How is your analysis worth the effort if it cannot predict and prevent the next heart failure?
That is how important a prediction is. You look at all the patterns and trends in the data, take into consideration all the factors that have a negative impact on the functioning of the heart and the level of influence each factor might have on its failure, and form a mathematical/statistical model from all of the above. We then use this model to predict whether there is a chance of failure or not.
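As an illustration of what "forming a model" can look like in code, here is a minimal scikit-learn sketch. The logistic-regression classifier and the column name `heart_failure` are our assumptions for this example (the first attempt described below actually used k-means):

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.read_csv("heart_failure_records.csv")   # hypothetical file name

X = df.drop(columns=["heart_failure"])          # the health parameters
y = df["heart_failure"]                         # 1 = failure, 0 = no failure

model = LogisticRegression(max_iter=1000)
model.fit(X, y)

# Predict the risk for a patient (here, simply the first row of our data).
patient = X.iloc[[0]]
print(model.predict(patient))        # predicted class: 0 or 1
print(model.predict_proba(patient))  # predicted probability of each class
```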
Evaluation and Fine Tuning
Well, to be very frank, no one gets the model right the very first time. Only through multiple evaluations and improvements can one get the model right. Evaluation generally involves testing the model on data it has not seen before and measuring how well it performs. Two methods are commonly used for evaluation:
- The first one is the holdout method, in which you separate the data into three sets:
  - Training data - data used to train the model.
  - Validation data - data used to provide an unbiased evaluation of the model while fine tuning its parameters.
  - Testing data - data used to evaluate the final model.
- The second method is called cross-validation. In this method, the data is randomly split into n subsets of equal size. Then:
  - In the first round, the first subset is used as the validation set and the remaining (n-1) subsets are used as training data.
  - In the second round, the second subset is used as the validation set and the remaining subsets as training data.
  - We repeat this process n times and take the mean of the accuracies.
The first method is generally used when the dataset is large enough, while the second is used when we have limited data available (a sketch of both methods follows below).
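A minimal scikit-learn sketch of both methods, with the dataset and column names assumed as in the earlier sketches:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score

df = pd.read_csv("heart_failure_records.csv")  # hypothetical file name
X, y = df.drop(columns=["heart_failure"]), df["heart_failure"]

# Holdout: carve out a 20% test set, then split the rest into train/validation.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_rest, y_rest, test_size=0.25, random_state=42)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
print("test accuracy:", model.score(X_test, y_test))

# Cross-validation: split into n = 5 equal folds, each used once for validation.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("mean CV accuracy:", scores.mean())
```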
While accuracy is the most commonly used evaluation metric, it is not the right metric in all cases. For example, the datasets used for credit card fraud detection are very skewed: you generally have around 2,000 or 3,000 fraudulent transactions in a dataset of 10 lakh (one million) records. So even if your model predicts all the fraudulent transactions as genuine, you still end up with an accuracy of 99.7% (997,000/1,000,000). This result is an absolute blunder. In such cases you need to consider metrics such as recall, the false positive rate, and the F-measure, which also take into account the fraudulent transactions the model misses.
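To make this concrete, here is a small sketch that scores a model which naively labels every transaction genuine; the counts mirror the example above:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, recall_score

# 1,000,000 transactions, 3,000 of them fraudulent (label 1).
y_true = np.zeros(1_000_000, dtype=int)
y_true[:3000] = 1

# A useless model that calls every transaction genuine (label 0).
y_pred = np.zeros(1_000_000, dtype=int)

print(accuracy_score(y_true, y_pred))                 # 0.997 -- looks great
print(recall_score(y_true, y_pred))                   # 0.0  -- catches no fraud
print(f1_score(y_true, y_pred, zero_division=0))      # 0.0  -- the honest verdict
```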
In our case, the sample is not so skewed, so we can go with the accuracy metric. Let's see how accurate our model is on the first try (the k-means algorithm was used in the first attempt).
As mentioned above, the first try is never enough. On changing the algorithm and other inputs, we see a slight increase in accuracy. Similarly, after several other optimizations, we arrived at a final accuracy of 79%, which is significantly better than the first attempt.
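Trying a different algorithm is usually just a one-line change. Here is a sketch comparing two common classifiers by cross-validated accuracy; both choices are illustrative, not necessarily the ones used in the original experiment:

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

df = pd.read_csv("heart_failure_records.csv")  # hypothetical file name
X, y = df.drop(columns=["heart_failure"]), df["heart_failure"]

# Swap in any candidate algorithm and compare mean cross-validated accuracy.
for model in [LogisticRegression(max_iter=1000),
              RandomForestClassifier(n_estimators=200, random_state=42)]:
    scores = cross_val_score(model, X, y, cv=5)
    print(type(model).__name__, "mean accuracy:", round(scores.mean(), 3))
```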
Data Visualization
We all know how important visualization is; no project is complete without data visualization. A visual (say, an image or a graph) has a different level of impact on your audience. It helps you communicate with your audience and users more effectively and makes understanding much easier. Although we have listed this step last in the process, it can be used at any step.
Some of the visualizations we did in our analysis:
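As a closing sketch, assuming matplotlib and the same hypothetical columns as before, here is one simple visual we could produce, comparing age distributions across the two outcomes:

```python
import matplotlib.pyplot as plt
import pandas as pd

df = pd.read_csv("heart_failure_records.csv")  # hypothetical file name

# Overlay the age distributions of patients with and without heart failure.
for label, group in df.groupby("heart_failure"):
    plt.hist(group["age"], bins=20, alpha=0.5,
             label="failure" if label == 1 else "no failure")

plt.xlabel("Age")
plt.ylabel("Number of patients")
plt.legend()
plt.show()
```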