Friday, July 7, 2017

An introduction to the data pipeline



A general data pipeline has the following steps:

1. Acquisition 
2. Extraction 
3. Cleaning and transforming 
4. Analysis of data
5. Presentation of data
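
To make the five steps concrete, here is a minimal sketch in Python that runs each step as one small function. Everything in it is an assumption made for illustration: the function names, the inline sample text and its made-up numbers are hypothetical, and a real pipeline would acquire its data from a file or the web.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a downloaded file (made-up values).
RAW = "city,population\nAlpha, 100 \nBeta,n/a\nGamma,300\n"

def acquire():
    # 1. Acquisition: obtain the raw data (here, a hard-coded string).
    return RAW

def extract(raw):
    # 2. Extraction: turn the raw text into structured records.
    return list(csv.DictReader(io.StringIO(raw)))

def clean(rows):
    # 3. Cleaning and transforming: strip whitespace, drop non-numeric rows.
    cleaned = []
    for row in rows:
        value = row["population"].strip()
        if value.isdigit():
            cleaned.append({"city": row["city"].strip(), "population": int(value)})
    return cleaned

def analyse(rows):
    # 4. Analysis: compute a simple summary statistic.
    return mean(row["population"] for row in rows)

def present(average):
    # 5. Presentation: format the result for a reader.
    return f"Average population: {average:,.0f}"

report = present(analyse(clean(extract(acquire()))))
print(report)
```

Each function only depends on the output of the previous one, so any single step can be swapped out (for example, a different acquisition method) without touching the rest.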



Let's start with the very first step, i.e. acquisition.

Acquisition:

Acquisition is the process of obtaining or gaining access to data that is already available, or of generating new data.

How do we get data from a web page into a spreadsheet?

a. Let's understand it with an example.

We have data on urban areas in the UK at the address below:
https://en.wikipedia.org/wiki/List_of_urban_areas_in_the_United_Kingdom

We want to fetch a table out of a Wikipedia page like this one. Entering the formula below in a Google Sheets cell fetches the first table ("table", 1) from the page given as its first argument.

=importHTML("http://en.wikipedia.org/wiki/List_of_largest_United_Kingdom_settlements_by_population","table",1)
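
For readers who want to see roughly what such a formula does, here is a minimal Python sketch that pulls the n-th table out of an HTML string using only the standard library. It is not how Google Sheets implements importHTML; the names `TableExtractor` and `import_html_table` and the sample HTML are hypothetical, and the sketch assumes simple, non-nested tables.

```python
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the rows of every <table> in an HTML document."""

    def __init__(self):
        super().__init__()
        self.tables = []    # list of tables, each a list of rows
        self._row = None    # cells of the row being read, if any
        self._cell = None   # text fragments of the cell being read, if any

    def handle_starttag(self, tag, attrs):
        if tag == "table":
            self.tables.append([])
        elif tag == "tr" and self.tables:
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Only keep text that sits inside a table cell.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.tables[-1].append(self._row)
            self._row = None

def import_html_table(html, index):
    # index is 1-based, mirroring the last argument of the formula above.
    parser = TableExtractor()
    parser.feed(html)
    return parser.tables[index - 1]

# Hypothetical sample page with placeholder values.
SAMPLE = """
<table><tr><th>Rank</th><th>Area</th></tr>
<tr><td>1</td><td>Example A</td></tr>
<tr><td>2</td><td>Example B</td></tr></table>
"""
rows = import_html_table(SAMPLE, 1)
print(rows)
```

For a live page, the HTML string could be downloaded first (for instance with `urllib.request.urlopen`) and then passed to `import_html_table`.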

b. Another way is with the help of Scraper.

Scraper is a tool that helps us scrape similar kinds of data from the web.
The easiest way to use it is to install the Scraper Chrome extension.

Suppose we want to fetch the list of all MPs in the UK Parliament:
i). Go to http://www.parliament.uk/mps-lords-and-offices/mps/

ii). Right-click on any MP's name and click 'Scrape similar'.

iii). A window will pop up containing the similar items, which can be exported to Google Docs etc.
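
The idea behind 'Scrape similar' is to pick every element that sits at the same kind of position in the page as the one you clicked. Here is a minimal Python sketch of that idea, not the extension's actual code. The class name `SimilarScraper`, the tag path and the sample markup are all hypothetical, and the sketch assumes well-formed HTML in which every tag is closed.

```python
from html.parser import HTMLParser

class SimilarScraper(HTMLParser):
    """Collect the text of every element whose tag path ends with `path`."""

    def __init__(self, path):
        super().__init__()
        self.path = list(path)   # e.g. ["td", "a"]: a link inside a table cell
        self.stack = []          # currently open tags, outermost first
        self.matches = []        # text of each matching element
        self._depth = None       # stack depth of the match being read, if any

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)
        if self._depth is None and self.stack[-len(self.path):] == self.path:
            self._depth = len(self.stack)
            self.matches.append("")

    def handle_data(self, data):
        if self._depth is not None:
            self.matches[-1] += data

    def handle_endtag(self, tag):
        # Assumes well-formed markup: each end tag closes the innermost open tag.
        if self._depth is not None and len(self.stack) == self._depth:
            self.matches[-1] = self.matches[-1].strip()
            self._depth = None
        if self.stack:
            self.stack.pop()

# Hypothetical markup; the real page's structure may differ.
SAMPLE = """
<table>
<tr><td><a href="/mp/1">Member One</a></td><td>Party X</td></tr>
<tr><td><a href="/mp/2">Member Two</a></td><td>Party Y</td></tr>
</table>
<p><a href="/about">About</a></p>
"""
scraper = SimilarScraper(["td", "a"])
scraper.feed(SAMPLE)
names = scraper.matches
print(names)
```

The link in the final paragraph is skipped because it is not inside a table cell, which is the same reason 'Scrape similar' returns only the names and not every link on the page.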


There are many recipes available for data extraction at:
https://schoolofdata.org/handbook/


I will keep updating this post with other methods available for acquisition and the later steps!

Till then, happy learning!

Part II continues at:
https://www.blogger.com/blogger.g?blogID=7185820339193969011#editor/target=post;postID=3489600288863769443

Courtesy: schoolofdata.org

