Saturday, March 19, 2016

Getting started with Data Science : Python

There is a lot of buzz about data science being a super cool field, so I decided to do a write-up on how to get started with it. Without wasting much time, here are the details:

Step 1: Setting up your machine
Download Anaconda and install it. It bundles Python with the scientific libraries used in the steps below, and the installation is pretty simple on any of the major platforms (Windows, macOS or Linux).

Step 2: Learn the basics of Python language

If you want to get started with data science in Python, you need to know at least the basics of Python. Being able to write a hello world program is enough to get started; you can get the hang of the language later.

And trust me, Python is one of the easiest languages to learn. Python 2.7 will no longer be maintained after 2020, so it is better to start with Python 3.
Some references for Python 3:
Free e-book on python: http://python.swaroopch.com/index.html
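
As a rough idea of what "hello world level" Python 3 looks like, here is a minimal sketch. The function name greet and the sample list are made up for illustration only:

```python
# A first Python 3 program: define a function, call it, and print the result.
def greet(name):
    """Return a greeting for the given name."""
    return "Hello, " + name + "!"

print(greet("world"))

# Lists and comprehensions show up constantly in data work.
numbers = [1, 2, 3, 4]
squares = [n * n for n in numbers]  # square every element
print(squares)
```

If you can read and modify a snippet like this, you know enough to move on to the next step.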

Step 3: Learn Scientific libraries in Python – NumPy, Matplotlib and Pandas


Basics for exploratory data analysis and data handling:
NumPy - Array structure for scientific computing.
pandas - Data structures for easily manipulating data.
Matplotlib - One of the best-known plotting libraries in the SciPy stack.
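
To give a feel for how the three libraries fit together, here is a small sketch. The data and the column names (name, height_m, group) are invented for illustration, so treat it as a starting point and not as a recipe:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# NumPy: a typed array with element-wise (vectorised) arithmetic.
heights_cm = np.array([150.0, 160.0, 170.0, 180.0])
heights_m = heights_cm / 100      # divides every element, no explicit loop
mean_height = heights_m.mean()

# pandas: a labelled table (DataFrame) built on top of NumPy arrays.
# The column names here are hypothetical.
df = pd.DataFrame({
    "name": ["a", "b", "c", "d"],
    "height_m": heights_m,
    "group": ["x", "y", "x", "y"],
})
tall = df[df["height_m"] > 1.6]                          # filter rows
mean_by_group = df.groupby("group")["height_m"].mean()   # aggregate per group

# Matplotlib: bar chart of the per-group means.
fig, ax = plt.subplots()
ax.bar(mean_by_group.index, mean_by_group.values)
ax.set_xlabel("group")
ax.set_ylabel("mean height (m)")
# Call plt.show() in an interactive session, or fig.savefig(...) in a script.
```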


Step 4: Learn Scikit-learn and Machine Learning


Here the fun part begins:
The scikit-learn library provides most of the common machine learning algorithms for Python.
Web site: http://scikit-learn.org/stable/
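
Most scikit-learn models follow the same fit / predict pattern. The sketch below is one possible first experiment, assuming the bundled iris dataset and a logistic regression classifier; both choices are mine and any other dataset or estimator would work the same way:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Load a small built-in dataset: 150 flowers, 4 numeric features, 3 classes.
X, y = load_iris(return_X_y=True)

# Hold out a quarter of the rows so the model is scored on unseen data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Fit on the training rows, then predict the held-out rows.
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

accuracy = accuracy_score(y_test, predictions)
print("accuracy:", round(accuracy, 3))
```

Swapping LogisticRegression for another estimator usually changes only the import and the line that creates the model.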


Step 5: Practice, practice and practice

I would say this is the most important step, the one that can distinguish you from the rest.

Conclusion:
You have to accept the fact that you won't turn into a data scientist overnight; it will take time and dedication to become one.

A frequent question is: how much time will it take?
The answer really depends on how much time you devote to this subject and how quickly you understand the concepts.

Fortunately, there are many sites that can serve as a testing ground for your machine learning skills, like https://www.kaggle.com/competitions or http://datahack.analyticsvidhya.com/contest/all, which regularly run competitions in which you can participate and test your knowledge.

Friday, March 4, 2016

Cheat sheet : Exploratory data analysis


Here is a short version of exploratory data analysis:

1. Variable Identification (categorical, continuous, etc)
2. Univariate Analysis
    a. Categorical variable: Frequency of occurrence (count). Bar chart for visualization.
    b. Continuous variable: Mean, median, mode, min and max. Histogram for visualization.

Ref: https://www.youtube.com/watch?v=wFabyCP54YA
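
The univariate summaries from item 2 can be computed with pandas. This is a minimal sketch on a made-up table; the column names city and age are hypothetical:

```python
import pandas as pd

# Toy data: one categorical column and one continuous column.
df = pd.DataFrame({
    "city": ["A", "B", "A", "C", "A", "B"],
    "age": [23, 35, 31, 23, 40, 28],
})

# Categorical variable: frequency of each category.
city_counts = df["city"].value_counts()

# Continuous variable: mean, median, mode, min and max.
age_summary = {
    "mean": df["age"].mean(),
    "median": df["age"].median(),
    "mode": df["age"].mode().tolist(),  # mode() can return several values
    "min": df["age"].min(),
    "max": df["age"].max(),
}

print(city_counts)
print(age_summary)

# With Matplotlib installed, the matching plots would be:
#   city_counts.plot(kind="bar")   # bar chart of category counts
#   df["age"].plot(kind="hist")    # histogram of the continuous column
```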

3. Bi-variate Analysis
    a. Continuous & Continuous: Scatter plot to find out the correlation.
       Correlation varies between -1 and +1:
       -1: perfect negative linear correlation
       +1: perfect positive linear correlation
        0: no linear correlation

    b. Categorical & Categorical:
        a. Two-way table: Use count and count % as metrics.
        b. Stacked Column Chart:
        c. Chi-Square Test: I need to read more on this, but it returns a probability (p-value):
           Probability close to 0: It indicates that the two categorical variables are dependent.
           Probability close to 1: The data are consistent with the two variables being independent.
    c. Categorical & Continuous:
        a. Z-Test / T-Test: It assesses whether the averages of two groups are statistically different.
        b. ANOVA: It assesses whether the averages of more than two groups are statistically different.
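
The bi-variate checks in item 3 can be tried out with NumPy and SciPy. The sketch below uses invented numbers, and the variable names (hours, score, group_a and so on) are placeholders, so read it as an illustration of the calls and not as a full analysis:

```python
import numpy as np
from scipy import stats

# Continuous & continuous: Pearson correlation, always between -1 and +1.
hours = np.array([1, 2, 3, 4, 5])
score = np.array([52, 55, 61, 64, 70])
r = np.corrcoef(hours, score)[0, 1]

# Categorical & categorical: chi-square test on a two-way table of counts.
# Rows and columns are the categories of the two variables.
table = np.array([[30, 10],
                  [10, 30]])
chi2, p_chi2, dof, expected = stats.chi2_contingency(table)

# Categorical & continuous, two groups: independent two-sample t-test.
group_a = [5.1, 4.9, 5.0, 5.2]
group_b = [5.9, 6.1, 6.0, 5.8]
t_stat, p_ttest = stats.ttest_ind(group_a, group_b)

# Categorical & continuous, more than two groups: one-way ANOVA.
group_c = [7.0, 6.9, 7.1, 7.2]
f_stat, p_anova = stats.f_oneway(group_a, group_b, group_c)

print("correlation:", round(r, 3))
print("chi-square p-value:", p_chi2)
print("t-test p-value:", p_ttest)
print("ANOVA p-value:", p_anova)
```

A small p-value (commonly below 0.05) is the usual signal that the variables are related or that the group averages differ.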

Ref: https://www.youtube.com/watch?v=IA0unflfvQE
https://www.youtube.com/watch?v=zdU8C8QEHH0

To be continued...

Ref: http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/