Friday, July 15, 2016

Interview questions on SQL

Here are 5 interview questions on SQL, please try and share the answer in comments section

1. STudentDetails: StudentID, courseID, InstructorID, EnrollmentDate
Query to get the list of students who have enrolled to ALL COURSES available in 2016.


select StudentID from STudentDetails where EnrollmentDate>='2016-01-01 00:00:00' and EnrollmentDate<'2017-01-01 00:00:00'
and courseID

select distinct courseID from STudentDetails

2. Avg no. of students per course in 2016.
Eg : If 2 Courses and 10 students in C1 and 20 students in C2, then I would like to get avg number as 15.

select avg(count(StudentID))
from STudentDetails
where courseID in ('C1','C2')
group by courseID

3. Product : ID, Name, Unitcost
SalesRep: ID, Name
SAles : ProductID, salesrepId, custID, qty, sales_date

Query to get the list of productid, salesrepId where the Salesrep has sold a particular product in consecutive days.

S1 P1 01 Jan 2016
S1 P1 02 Jan 2016
S1 P2 03 Jan 2016
S1 p1 04 Jan 2016
S1 P1 05 jan 2016

Resuit set :
S1 | P1 01 Jan 2016 02 jan 2016
S1 | P1 04 Jan 2016 05 Jan 2016


select salesrepId,ProductID
    case(when sales_date<) as start_date,
    case(when ) as end_date
from SAles
group by salesrepId,ProductID,start_date,end_date

4. Design employee dimension as per SCD Type 2.

Employee :
    ID (auto-increment)
    EmpId
    Name
    City
    HQ,
    DeptID
    MangerID
    Bank Account
    StartDate

Dept:
    DeptID
    Name
   
ID,EmpId, Name, City, Startdate
1, 1, Mr. A, Hyd, 2016-01-01
2, 1, Mr. A, Pune, 2016-07-01

5. What is a role playing dimension?

Saturday, March 19, 2016

Getting started with Data Science : Python

There is lot of buzz about data science being a super cool stream. So, i decided to do a write up on how to get started with it. Without wasting much time, here you go with the details:

Step 1: Setting up your machine
Download Anaconda here and install it. I am sure this will be a pretty simple installation on any of the platform.

Step 2: Learn the basics of Python language

If you want to get started with data science in Python, you need to know atleast the basics of python. Atleast an hello world in python is sufficient to get started. You can get hang of the language later.

And trust me, Python is the most easiest langauge to learn. Python 2.7 will not be used post 2020 and hence its better you take things with Python 3.
Some referance to Python 3:
Free e-book on python: http://python.swaroopch.com/index.html

Step 3: Learn Scientific libraries in Python – NumPy, Matplotlib and Pandas


Basics for exploratory data analysis and data handling:
Numpy - Array like structure for scientific computing
Pandas - Data structure to easily manipulate data.
Matplotlib - One of the famous Plotting library for SciPy stack.


Step 4: Learn Scikit-learn and Machine Learning


Here the fun part begins:
Scikit learn library provide most of the machine learning algorithms  for Python.
Web site: http://scikit-learn.org/stable/


Step 5: Practice, practice and Practice

I would rather say this is the most important step which can distinguish you from the rest.

Conclusion:
You have to accept the fact that you wont turn into data scientist overnight, it will take time and dedication to become one.

There are frequent question like how much time it will take? 
The answer really depends on how much time you devote to this subject and how quickly you understand the concepts.

Fortunately there are many sites which can be used as testing ground for your machine learning skills like https://www.kaggle.com/competitions or http://datahack.analyticsvidhya.com/contest/all which keep doing this awesome competitions, in which you can participate and test your knowledge

Friday, March 4, 2016

Cheat sheet : Exploratory data analysis


Here is short version of exploratory data analysis

1. Variable Identification (categorical, continuous, etc)
2. Univariate Analysis
    a. categorical variable : Frequency of occurance (count). Bar chart for visualization
    b. continuous variable: Mean, media, mode, min and max. Histogram for visualization

Ref: https://www.youtube.com/watch?v=wFabyCP54YA

3. Bi-variate Analysis
    a. Continuous & Continuous: Scatter plot to find out Correlation
Correlation varies between -1 and +1.

-1: perfect negative linear correlation
+1:perfect positive linear correlation and
0: No correlation

    b. Categorical & Categorical:
a. Two-way table: Have count and count% as metric
b. Stacked Column Chart:
c. Chi-Square Test: Need to read more on this but
Probability of 0: It indicates that both categorical variable are dependent
Probability of 1: It shows that both variables are independent.
c. Categorical & Continuous:
a. Z-Test/ T-Test:
b. ANOVA:  It assesses whether the average of more than two groups is statistically different.

Ref: https://www.youtube.com/watch?v=IA0unflfvQE
https://www.youtube.com/watch?v=zdU8C8QEHH0

..To be continued...

Ref: http://www.analyticsvidhya.com/blog/2016/01/guide-data-exploration/

Tuesday, February 23, 2016

Starting with scikit-learn

If you are an aspiring Data analyst / Data scientist, this is the correct place to be.

Here is an awesome step-by-step documentation scikit-learn community has created.
http://scikit-learn.org/stable/tutorial/basic/tutorial.html

I will be providing some videos in the blog post later on how to get started.

Some of the existing videos which you can follow:
http://scikit-learn.org/stable/presentations.html#videos

Here are some FAQs:
Using my data with scikit-learn?
http://stackoverflow.com/questions/21492726/using-prepared-data-for-sci-kit-classification

Importing a CSV data to scikit-learn
http://stackoverflow.com/questions/11023411/how-to-import-csv-data-file-into-scikit-learn/11035992#11035992


Sunday, February 21, 2016

How to become data analyst / data scientist

If you are starting from scratch then this post is for you and interested in Python:

Step 1: Setting up your machine
Download anaconda and use spyder or Ipython for trying out small things.
Python is installed into this package itself.

Step 2: Basic of Python as a programming lang
 learn: Lists, Tuples, Dictionaries, List comprehensions, Dictionary comprehensions

Step 3:
Learn these libraries in Python – Pandas, NumPy, SciPy and Matplotlib

Step 4:
Start Machine learning with Scikit-learn

Step 5:
The thing which will differentiate you from other data scientist is practice.


Some nice blogs to follow:
 1. http://www.datasciencecentral.com/
 2. http://www.kdnuggets.com/
 3. http://www.analyticsvidhya.com/ [Indian community which came up very fast]

To start practising models, participate in competitions from:
1. https://www.kaggle.com/
2. http://www.analyticsvidhya.com/

Friday, February 19, 2016

Installing Jenkins

Jenkins has two ways to install
1. Installing a war file
2. Installing jenkins as a service

In my case, it is redhat distribution and there are simple steps to install Jenkins as a service

Thats it!!!
http://<IP address>:8080 is the place where you will get the default installation screen