Saturday, July 22, 2017

Data modeling - Normalization


This time we will see what a basic database model looks like and how much more robust third normal form (3NF) is compared to an unnormalized one.
Anything less than 3NF can cost you in the form of the data anomalies that 3NF is designed to prevent.


To keep things simple and concise, we will understand this by creating the unnormalized entities and then the normalized form of "Hotstar", a digital and mobile entertainment platform.

Hotstar is an online media streaming platform where users can watch movies, TV series and so on after paying a monthly, biyearly or yearly subscription fee.
Once subscribed, a user can watch any sort of media available on Hotstar.

Here we have tried to present a basic data model of Hotstar to understand the difference between the unnormalized form and the normalized form.

                                            Hotstar - Online streaming platform

Let's first identify the primary key. We have Email ID as a unique identifier, but does it work on its own? The same user could use a free plan with that ID and later, say after a while, switch to a premium plan or hold multiple plans. So our primary key will be "Email ID + subscription date-time", because the content depends on the subscription plan.


After identifying the primary key, let's bring the model into 1NF. 1NF requirement: no repeating groups; split them out into separate rows, as sketched below.
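As a rough illustration (the attribute and title names here are made up for the example, not taken from the original diagrams), a minimal Python sketch of what removing a repeating group looks like for the Hotstar model:

# Unnormalized: one record per subscription, with a repeating group of watched titles.
unnormalized = [
    {"email": "a@x.com", "sub_datetime": "2017-07-01 10:00", "plan": "premium",
     "watched": ["Movie-1", "Series-1", "Movie-2"]},    # repeating group
]

# 1NF: no repeating groups -- one row per (email, sub_datetime, title).
first_nf = [
    {"email": r["email"], "sub_datetime": r["sub_datetime"],
     "plan": r["plan"], "title": title}
    for r in unnormalized
    for title in r["watched"]
]
print(first_nf)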

1NF to 2NF

2NF requirement:

No partial key dependency: every non-key attribute should depend on the primary key as a whole, not on just a part of it. A sketch of the split follows.
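As a rough illustration with made-up attributes, a minimal Python sketch of removing a partial key dependency: the subscriber's name depends only on the email (part of the composite key), so it moves into its own table:

# 1NF rows where "name" depends only on "email", i.e. on part of the
# composite key (email, sub_datetime) -- a partial key dependency.
first_nf = [
    {"email": "a@x.com", "name": "Asha", "sub_datetime": "2017-07-01 10:00", "plan": "premium"},
    {"email": "a@x.com", "name": "Asha", "sub_datetime": "2018-01-01 09:00", "plan": "free"},
]

# 2NF: split out the attributes that depend only on email.
subscribers = {r["email"]: {"name": r["name"]} for r in first_nf}
subscriptions = [
    {"email": r["email"], "sub_datetime": r["sub_datetime"], "plan": r["plan"]}
    for r in first_nf
]
print(subscribers)
print(subscriptions)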


3NF violations and solutions


3NF requirement:

No interdependencies among non-key attributes: if some non-key attributes depend on another non-key attribute, move them out into a separate table, as sketched below.
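Again as a rough illustration with made-up attributes and prices, a minimal Python sketch of removing a transitive dependency: the price depends on the plan (a non-key attribute), so the plan details move into their own table:

# 2NF rows where "price" depends on "plan", a non-key attribute
# (a transitive dependency).
subscriptions = [
    {"email": "a@x.com", "sub_datetime": "2017-07-01 10:00", "plan": "premium", "price": 199},
    {"email": "b@y.com", "sub_datetime": "2017-07-03 12:00", "plan": "premium", "price": 199},
]

# 3NF: move plan-dependent attributes into a Plan table keyed by plan.
plans = {r["plan"]: {"price": r["price"]} for r in subscriptions}
subscriptions_3nf = [
    {"email": r["email"], "sub_datetime": r["sub_datetime"], "plan": r["plan"]}
    for r in subscriptions
]
print(plans)
print(subscriptions_3nf)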


(Figure: the 2 Normalized form tables shown alongside the 3 Normalized form tables.)


Finally, here is the full journey of the data model from the unnormalized form to the normalized form!




                   


In the above model, if you look at the unnormalized form, there are some hidden issues:
1. Redundancy.
2. If a customer does not watch any TV series, we have no TV series data at all, since there is no entry to hold it, which is a confounding issue.
3. The more the database grows, the more complexity there will be because of the lack of a well-defined model, and so on.

Here, 3NF comes to rescue us from these problems:
1. Well-defined entities.
2. Clear and concise relationships between entities.
3. No data is lost.

A good database model always calls for well-defined entities and, within each entity, a clear dependency of the non-key attributes on the key attribute.


Happy learning!!!!!











Friday, July 14, 2017

Titanic survival prediction using R and caret



I found the code below while I was looking into extreme gradient boosting and found it very elegantly written and simple.

Generally, feature engineering is applied to the data by hand to bring out the best insights; in this script, the code does that part itself. Let's go through it to get a basic understanding of the caret package.

Author - David Langer.



#=======================================================================================
#
# File:        IntroToMachineLearning.R
# Author:      Dave Langer
# Description: This code illustrates the usage of the caret package for the "An 
#              Introduction to Machine Learning with R and Caret" Meetup dated 
#              06/07/2017. More details on the Meetup are available at:
#
#                 https://www.meetup.com/data-science-dojo/events/239730653/
#
# NOTE - This file is provided "As-Is" and no warranty regardings its contents are
#        offered nor implied. USE AT YOUR OWN RISK!
#
#=======================================================================================

#install.packages(c("e1071", "caret", "doSNOW", "ipred", "xgboost"))
library(caret)
library(doSNOW)



#=================================================================
# Load Data
#=================================================================

train <- read.csv("train.csv", stringsAsFactors = FALSE)
View(train)




#=================================================================
# Data Wrangling
#=================================================================

# Replace missing embarked values with mode.
table(train$Embarked)
train$Embarked[train$Embarked == ""] <- "S"


# Add a feature for tracking missing ages.
summary(train$Age)
train$MissingAge <- ifelse(is.na(train$Age),
                           "Y", "N")


# Add a feature for family size.
train$FamilySize <- 1 + train$SibSp + train$Parch


# Set up factors.
train$Survived <- as.factor(train$Survived)
train$Pclass <- as.factor(train$Pclass)
train$Sex <- as.factor(train$Sex)
train$Embarked <- as.factor(train$Embarked)
train$MissingAge <- as.factor(train$MissingAge)


# Subset data to features we wish to keep/use.
features <- c("Survived", "Pclass", "Sex", "Age", "SibSp",
              "Parch", "Fare", "Embarked", "MissingAge",
              "FamilySize")
train <- train[, features]
str(train)




#=================================================================
# Impute Missing Ages
#=================================================================

# Caret supports a number of mechanisms for imputing (i.e., 
# predicting) missing values. Leverage bagged decision trees
# to impute missing values for the Age feature.

# First, transform all features to dummy variables.
dummy.vars <- dummyVars(~ ., data = train[, -1])
train.dummy <- predict(dummy.vars, train[, -1])
View(train.dummy)

# Now, impute!
pre.process <- preProcess(train.dummy, method = "bagImpute")
imputed.data <- predict(pre.process, train.dummy)
View(imputed.data)

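# Age sits in column 6 of the dummy-coded matrix (the three Pclass and
# two Sex dummy columns come before it).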
train$Age <- imputed.data[, 6]
View(train)



#=================================================================
# Split Data
#=================================================================

# Use caret to create a 70/30% split of the training data,
# keeping the proportions of the Survived class label the
# same across splits.
set.seed(54321)
indexes <- createDataPartition(train$Survived,
                               times = 1,
                               p = 0.7,
                               list = FALSE)
titanic.train <- train[indexes,]
titanic.test <- train[-indexes,]


# Examine the proportions of the Survived class label across
# the datasets.
prop.table(table(train$Survived))
prop.table(table(titanic.train$Survived))
prop.table(table(titanic.test$Survived))




#=================================================================
# Train Model
#=================================================================

# Set up caret to perform 10-fold cross validation repeated 3 
# times and to use a grid search for optimal model hyperparameter
# values.
train.control <- trainControl(method = "repeatedcv",
                              number = 10,
                              repeats = 3,
                              search = "grid")


# Leverage a grid search of hyperparameters for xgboost. See 
# the following presentation for more information:
# https://www.slideshare.net/odsc/owen-zhangopen-sourcetoolsanddscompetitions1
tune.grid <- expand.grid(eta = c(0.05, 0.075, 0.1),
                         nrounds = c(50, 75, 100),
                         max_depth = 6:8,
                         min_child_weight = c(2.0, 2.25, 2.5),
                         colsample_bytree = c(0.3, 0.4, 0.5),
                         gamma = 0,
                         subsample = 1)
View(tune.grid)


# Use the doSNOW package to enable caret to train in parallel.
# While there are many package options in this space, doSNOW
# has the advantage of working on both Windows and Mac OS X.
#
# Create a socket cluster using 10 processes. 
#
# NOTE - Tune this number based on the number of cores/threads 
# available on your machine!!!
#
cl <- makeCluster(10, type = "SOCK")

# Register cluster so that caret will know to train in parallel.
registerDoSNOW(cl)

# Train the xgboost model using 10-fold CV repeated 3 times 
# and a hyperparameter grid search to train the optimal model.
caret.cv <- train(Survived ~ ., 
                  data = titanic.train,
                  method = "xgbTree",
                  tuneGrid = tune.grid,
                  trControl = train.control)
stopCluster(cl)


# Examine caret's processing results
caret.cv


# Make predictions on the test set using an xgboost model 
# trained on all 625 rows of the training set using the 
# found optimal hyperparameter values.
preds <- predict(caret.cv, titanic.test)


# Use caret's confusionMatrix() function to estimate the 
# effectiveness of this model on unseen, new data.
confusionMatrix(preds, titanic.test$Survived)


Happy learning!!


Tuesday, July 11, 2017

An introduction to the data pipeline - III(fetching data from web using python)




            How to extract data from web using python



I was trying to fetch data from the web using the Scraper extension, but there was some issue while doing so, so I decided to tinker with the HTML page using Python.


So, let's see how to do it.


We want to fetch all movies of 2016, so let's go to IMDb.


Below is the basic code to fetch the movie names: it starts with the very first page of results and then follows the "Next" link to automate this across pages.

# Import libraries
import re
import urllib.request
from urllib.parse import urljoin

from bs4 import BeautifulSoup

imdb = "http://www.imdb.com/search/title?year=2016,2016&title_type=feature&sort=moviemeter,asc"
page = urllib.request.urlopen(imdb)
soup = BeautifulSoup(page, "html.parser")

# Use prettify() to look at the nested structure of the HTML page.
print(soup.prettify())

# Inspecting the elements shows that the movie names lie in <h3> tags
# with the class "lister-item-header".
A = []                                   # collected movie names
for _ in range(10):                      # walk through the first 10 result pages
    for h3 in soup.findAll("h3", {"class": "lister-item-header"}):
        for a in h3.findAll("a"):
            A.append(a.get_text(strip=True))

    # Follow the "Next" link to the next page of results.
    nxt = soup.find("a", href=True, text=re.compile("Next"),
                    class_="lister-page-next next-page")
    if nxt is None:
        break
    page = urllib.request.urlopen(urljoin(imdb, nxt["href"]))   # urljoin handles relative links
    soup = BeautifulSoup(page, "html.parser")

print("total movies are: " + str(len(A)))


#output


Moana
Trolls
Suicide Squad
Split
A Cure for Wellness
Sing
Hacksaw Ridge
Captain America: Civil War
The Belko Experiment
Star Trek Beyond
Fantastic Beasts and Where to Find Them
Free Fire
The Deep End
Doctor Strange
The Bad Batch
La La Land
Batman v Superman: Dawn of Justice
X-Men: Apocalypse
Independence Day: Resurgence
Lion
Rogue One
Deadpool
The Magnificent Seven
Hidden Figures
Miss Peregrine's Home for Peculiar Children
Passengers
Arrival
The Great Wall
The Accountant
Bad Moms
Moonlight
Morgan
Loving
The Secret Life of Pets
Sausage Party
Nocturnal Animals
Snowden
Contratiempo
Me Before You
Inferno
Manchester by the Sea
The Promise
Nerve
Masterminds
War Dogs
The Girl on the Train
Dangal
Zootopia
Lady Macbeth
............





An Introduction to the data pipeline - II






                              How to fetch data from web

Let's start,

We want to fetch all the movies of some actress, say Lindsay Lohan, into Excel or any other format you want. Let's go to IMDb and search for Lindsay Lohan; you will get a page. Scroll to the movie section and right-click on any movie name, as below.




Here we can see "Scrape similar" (make sure you have installed the Scraper extension in Chrome); click on that.



Above, you can see a drop-down option; choose XPath.

On the very right of that, there is a textbox in which we have to give the path of that element (in our case, the movie names) in the HTML. How do we find the path?

Right-click on a movie name and click on "Inspect element"; there you can see a hierarchy of elements such as 
          <body>...
             <div>...
                <div>.....
                   .
                   .
                   .
                   <b>...
                      <a>
Under this <a> tag lies our movie name, so the path becomes
//body/div/div/div/div/div/div/div/div/div/b/a

Type the above path in that textbox and press "Enter".
You will see the movie names in the side panel (see the picture above).
You can also create another tab for some other detail (e.g. date of release).
This data can be exported to Google Docs.
This is how we can fetch any sort of data through a scraper available on the net; the same idea can also be applied in code, as sketched below.
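If you prefer to do the same thing programmatically, here is a minimal Python sketch using lxml and an XPath expression. The URL is a placeholder and the simplified XPath is an assumption, so adjust both to the actual page and path you found via "Inspect element":

# Minimal sketch: apply an XPath expression from Python with lxml.
# The URL below is a placeholder and the XPath is a simplified guess;
# replace both with the page and path you actually inspected.
import urllib.request
from lxml import html

url = "http://www.imdb.com/name/nm0000000/"   # placeholder: the actress's IMDb page
tree = html.fromstring(urllib.request.urlopen(url).read())

# Instead of the long /div/div/... chain, pick every link inside a <b> tag.
for a in tree.xpath("//body//b/a"):
    print(a.text_content().strip())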


continue at :




Will see something interesting in the next post.
Till then, happy learning!!!



Friday, July 7, 2017

Time series - part I


                                                               Time Series

Overview of Time Series Characteristics



Definition:
A univariate time series is a sequence of measurements of the same variable collected over a period of time.  Most often, the measurements are made at regular time intervals.


Basic Objectives of the Analysis
The basic objective usually is to determine a model that describes the pattern of the time series.  Uses are:
  1. To understand the important features of the time series pattern.
  2. To detail out how the past affects the future or how two time series can “interact”.
  3. To forecast future values of the series.
  4. To possibly serve as a control standard for a variable that measures the quality of product in some manufacturing situations.


Types of Models
There are two basic types of “time domain” models.
  1. Models that relate the present value of a series to past values and past prediction errors - these are called ARIMA models (for Autoregressive Integrated Moving Average).

  2. Ordinary regression models that use time indices as x-variables.  These can be helpful for an initial description of the data and form the basis of several simple forecasting methods.

General characteristics: 




  • Is there a trend, meaning that, on average, the measurements tend to increase (or decrease) over time?

  • Is there seasonality, meaning that there is a regularly repeating pattern of highs and lows related to calendar time such as seasons, quarters, months, days of the week, and so on?
  • Is there a long-run cycle or period unrelated to seasonality factors?

  • Is there constant variance over time, or is the variance non-constant?


One of the simplest ARIMA type models is a model in which we use a linear model to predict the value at the present time using the value at the previous time.  This is called an AR(1) model, standing for autoregressive model of order 1.  The order of the model indicates how many previous times we use to predict the present time.

A start in evaluating whether an AR(1) might work is to plot values of the series against lag 1 values of the series.  Let xt denote the value of the series at any particular time t, so xt-1 denotes the value of the series one time before time t.  That is, xt-1 is the lag 1 value of xt.  As a short example, here are the first five values in the earthquake series along with their lag 1 values:
  t    xt    xt-1 (lag 1 value)
  1    13     *
  2    14    13
  3     8    14
  4    10     8
  5    16    10

If we plot the lag 1 values (X-axis) against xt (Y-axis), we will see a positive linear association.

The AR(1) model
Theoretically, the AR(1) model is written as

    xt = constant + A * xt-1 + wt

where wt is an error term, assumed to be normally distributed and independent at each time t.

For the earthquake series, the equation fitted by the AR(1) regression comes out to be

    quakes = 9.19 + 0.543 * lag1

The p-value is less than 0.05, so the lag is a helpful predictor, although the R-squared value is weak, so the model won't give us great predictions.


Residual Analysis
In traditional regression, a plot of residuals versus fits is a useful diagnostic tool.  The ideal for this plot is a horizontal band of points.  Following is a plot of residuals versus predicted values for our estimated model.  It doesn’t show any serious problems.

Example 2
A rough plot below shows the time series pattern of coffee production.
Some important features are:
  • There is an upward trend, possibly a curved one.
  • There is seasonality – a regularly repeating pattern of highs and lows related to quarters of the year.
  • There are no obvious outliers.
  • There might be increasing variation as we move across time, although that’s uncertain.

(Figure: time series plot of coffee production, from picq.png)


There are ARIMA methods for dealing with series that exhibit both trend and seasonality, which will be discussed in the next post.

Part II continues below:


5 States data in geoChart