Tuesday, July 11, 2017

An introduction to the data pipeline - III (fetching data from the web using Python)




How to extract data from the web using Python



I was trying to fetch data from the web using Scraper, but I ran into some issues while doing so,
so I decided to tinker with the HTML page directly using Python.


So, let's see how to do that.


We want to fetch all the movies of 2016, so let's go to IMDB:


Below is the basic code to fetch the movie names from the search results. I first did this for the very first page only,
but we need to automate it for all pages.

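# Import libraries:
import urllib.request
from bs4 import BeautifulSoup

# Fetch the first page of the 2016 feature-film search results
imdb = "http://www.imdb.com/search/title?year=2016,2016&title_type=feature&sort=moviemeter,asc"
page = urllib.request.urlopen(imdb)

# Use the "prettify" function to look at the nested structure of the HTML page
soup = BeautifulSoup(page, "html.parser")

print(soup.prettify())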

# I inspected the element and found that the movie names lie in "h3" tags with the class name "lister-item-header"
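To see what that selector picks out without hitting the network, here is a minimal sketch that runs on a tiny hand-written HTML snippet. It uses only the standard library's html.parser instead of BeautifulSoup, so it is a stand-in for the idea and not the code used in this post. The sample markup, the movie names in it, and the TitleParser class are all made up for illustration; the real IMDB page has more nesting.

```python
from html.parser import HTMLParser

# Hypothetical markup that imitates the structure found with "inspect element"
SAMPLE_HTML = """
<div class="lister-item">
  <h3 class="lister-item-header">
    <span class="lister-item-index">1.</span>
    <a href="/title/tt0000001/">First Movie</a>
    <span class="lister-item-year">(2016)</span>
  </h3>
</div>
<div class="lister-item">
  <h3 class="lister-item-header">
    <a href="/title/tt0000002/">Second Movie</a>
  </h3>
</div>
<h3 class="other"><a href="/x">Not a movie</a></h3>
"""


class TitleParser(HTMLParser):
    """Collects the text of <a> tags inside <h3 class="lister-item-header">."""

    def __init__(self):
        super().__init__()
        self.in_header = False  # currently inside a matching <h3>
        self.in_link = False    # currently inside an <a> within that <h3>
        self.titles = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "h3" and "lister-item-header" in classes:
            self.in_header = True
        elif tag == "a" and self.in_header:
            self.in_link = True

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_header = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        # Keep only non-empty text that sits inside a matching link
        if self.in_link and data.strip():
            self.titles.append(data.strip())


parser = TitleParser()
parser.feed(SAMPLE_HTML)
titles = parser.titles
print(titles)  # ['First Movie', 'Second Movie']
```

The third h3 is skipped because its class does not match, which is the same filtering the class-based lookup in BeautifulSoup performs.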

import re
import urllib.request
from bs4 import BeautifulSoup

imdb = "http://www.imdb.com/search/title?year=2016,2016&title_type=feature&sort=moviemeter,asc"
A = []  # collected movie names
page = urllib.request.urlopen(imdb)
# Walk through the first 9 result pages
for page_no in range(1, 10):
    soup = BeautifulSoup(page, "html.parser")
    all_links = soup.find_all("h3", {"class": "lister-item-header"})
    for header in all_links:
        for j in header.find_all("a"):
            A.append(j.find(text=True))

    # Follow the "Next" link to the following page; stop if there is none
    a = soup.find("a", href=True, text=re.compile("Next"), class_="lister-page-next next-page")
    if a is None:
        break
    page = urllib.request.urlopen("http://www.imdb.com/search/title" + a["href"])

# Print every movie name, then the total count
for name in A:
    print(name)
print("total movies are: " + str(len(A)))


#output


Moana
Trolls
Suicide Squad
Split
A Cure for Wellness
Sing
Hacksaw Ridge
Captain America: Civil War
The Belko Experiment
Star Trek Beyond
Fantastic Beasts and Where to Find Them
Free Fire
The Deep End
Doctor Strange
The Bad Batch
La La Land
Batman v Superman: Dawn of Justice
X-Men: Apocalypse
Independence Day: Resurgence
Lion
Rogue One
Deadpool
The Magnificent Seven
Hidden Figures
Miss Peregrine's Home for Peculiar Children
Passengers
Arrival
The Great Wall
The Accountant
Bad Moms
Moonlight
Morgan
Loving
The Secret Life of Pets
Sausage Party
Nocturnal Animals
Snowden
Contratiempo
Me Before You
Inferno
Manchester by the Sea
The Promise
Nerve
Masterminds
War Dogs
The Girl on the Train
Dangal
Zootopia
Lady Macbeth
............
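The loop above builds the next page's address by gluing the "Next" link's href onto the search URL by hand. A slightly more forgiving way to do that step, sketched here under the assumption that the href may be query-only, root-relative, or absolute, is urllib.parse.urljoin from the standard library. The next_page_url helper and the example href values are hypothetical, chosen only to show the three cases.

```python
from urllib.parse import urljoin

# Base search URL used in the post (without the query string)
BASE_URL = "http://www.imdb.com/search/title"


def next_page_url(current_url, href):
    """Resolve the href of a 'Next' link against the page it was found on."""
    return urljoin(current_url, href)


# Hypothetical href values, one per form a "Next" link might take
query_only = next_page_url(BASE_URL, "?year=2016,2016&page=2")
root_relative = next_page_url(BASE_URL, "/search/title?page=3")
absolute = next_page_url(BASE_URL, "http://www.imdb.com/search/title?page=4")

print(query_only)
print(root_relative)
print(absolute)
```

With string concatenation only the first form would produce a valid address; urljoin handles all three the same way.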




