Pulling Data From the Internet

In this tutorial, we are going to learn how to pull data from websites with pandas. Before you start, you can read the pandas beginner's article for preparation.

Import Module

We need two modules for this project: json and pandas. Let's add them to our project. The first is the pandas module, which helps us pull data from a website.

import pandas as pd

The second is the json module, which provides an API for converting Python objects in memory into a serialized representation called JavaScript Object Notation (JSON), and back again.

import json as js

Read Data With Pandas

We use read_html to pull the data. If you don't know the pandas library, you should read this article. First, let's look at the syntax of this function.

pd.read_html("Link")

Now we are ready to use this function. First, we store the link in a string variable; after that, we can pass this variable to the read_html function.

Link = "https://en.wikipedia.org/w/index.php?title=Fortune_Global_500&oldid=855890446"
Data = pd.read_html(Link , header = 0)[0]
Data

The code block above pulls a table from Wikipedia. read_html returns a list of the tables it finds on the page, so you can access the other tables by changing the index number at the end ([0]).
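To make the role of the index number concrete, here is a minimal sketch that runs without a network connection. The HTML string and its two tables are made up for illustration, and the sketch assumes an HTML parser such as lxml is installed, which read_html needs.

```python
from io import StringIO

import pandas as pd

# Hypothetical page with two tables, standing in for a real website.
html = """
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><td>Alpha</td><td>USA</td></tr>
  <tr><td>Beta</td><td>China</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2017</td><td>100</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html), header=0)

first = tables[0]   # the Company/Country table
second = tables[1]  # the Year/Revenue table

print(len(tables))
print(list(first.columns))
print(list(second.columns))
```

Changing the index from 0 to 1 is all it takes to switch from the first table to the second one.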

Pulling Data Examples

In this section, we will pull different data from different sites and organize it. We will use the json module in these projects. First, let's apply what we have learned in a small project.

import pandas as pd
Data = pd.read_html(Link)

Sometimes things are not that easy. We use the json module to turn the table into plain Python structures that are easier to work with. Let's create a variable named fortune.

import pandas as pd
import json as js

# read_html returns a list of tables, so select one before converting it
Data = pd.read_html(Link)[0]
fortune = js.loads(Data.to_json(orient="records"))
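To see what this conversion produces, here is a small sketch on a hand-built table. The sample rows and column names are hypothetical; in the tutorial the table comes from read_html.

```python
import json as js

import pandas as pd

# Hypothetical sample rows standing in for a table pulled with read_html.
Data = pd.DataFrame({"Name": ["Alpha", "Beta"], "Country": ["USA", "China"]})

# orient="records" serializes each row as one JSON object;
# js.loads then turns the JSON text into a list of Python dicts.
fortune = js.loads(Data.to_json(orient="records"))

print(fortune)
print(fortune[0]["Country"])
```

Each row becomes one dictionary keyed by column name, so a single value can be reached with a row index followed by a column name.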

Now that we know the general information about data extraction, let’s use the newly learned information in a real project.

Using Pulled Data

The data will be taken from Wikipedia's table of the world's largest companies. Using a function we will write, we will list the countries that have the highest number of companies in that table.

import pandas as pd
import json as js

We start by adding libraries, then we will pull the data.

Link = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
Data = pd.read_html(Link)[0]
Data

You can write the variable name on its own line to review the data. After reviewing, delete that last line so the table is no longer printed on the screen.

fortune = js.loads(Data.to_json(orient="records"))

Now that we have finished extracting and storing the data, we can use it in our function. Read this article first: you need to understand the index logic before sorting in the function.

from collections import Counter

def Sort():
   # count how many times each country appears in the table
   country = Data["Country[note 1]"]
   occurrence_count = Counter(country)
   return occurrence_count

print(Sort())

Counter counts how many times each value occurs. To use it, you must import Counter from the collections module.
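As a minimal sketch of how Counter behaves, the example below counts a short, made-up list of country names instead of the Wikipedia column. The most_common call is an addition that is not part of the function above; it is one way to list only the countries with the highest counts.

```python
from collections import Counter

# Hypothetical country values, standing in for the table's country column.
countries = ["USA", "China", "USA", "Japan", "China", "USA"]

# Counter maps each distinct value to the number of times it occurs.
counts = Counter(countries)

# most_common(n) returns the n values with the highest counts, largest first.
top = counts.most_common(2)

print(counts["Japan"])
print(top)
```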
