In this tutorial, we are going to learn how to pull data from websites with pandas. Before starting, you can read the pandas beginner’s article for preparation.
We need two modules for this project: json and pandas. Let’s add them to our project. The pandas module is the one that helps us pull data from the website.
import pandas as pd
import json as js
Read Data With Pandas
We use read_html to read the data we pull. If you don’t know the pandas library, you should read this article. First, we’ll look at how this command is called.
Now we are ready to use this command. First, we store the link in a string variable; after that, we pass this variable to the read_html function.
Link = "https://en.wikipedia.org/w/index.php?title=Fortune_Global_500&oldid=855890446"
Data = pd.read_html(Link, header=0)
Data
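You can pull the tables on a Wikipedia page with the code block above. read_html returns a list of DataFrames, one per table found on the page, so you can access the other tables by changing the index number, for example Data[0] or Data[1].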
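As a minimal sketch of that indexing, the example below parses a small, made-up HTML snippet instead of a live page, so the table contents and the names html, tables, companies and revenue are assumptions for illustration only. It also assumes an HTML parser that read_html can use (such as lxml) is installed.

```python
from io import StringIO

import pandas as pd

# Hypothetical page content with two tables, used instead of a live URL.
html = """
<table>
  <tr><th>Company</th><th>Country</th></tr>
  <tr><td>Alpha</td><td>Japan</td></tr>
  <tr><td>Beta</td><td>Germany</td></tr>
</table>
<table>
  <tr><th>Year</th><th>Revenue</th></tr>
  <tr><td>2017</td><td>100</td></tr>
</table>
"""

# read_html returns a list with one DataFrame per table it finds.
tables = pd.read_html(StringIO(html), header=0)

print(len(tables))      # number of tables found

companies = tables[0]   # first table on the "page"
revenue = tables[1]     # second table on the "page"

print(companies)
print(revenue)
```

With a real page, the same pattern applies: check how many tables came back, then try the indexes until you find the table you want.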
Pulling Data Examples
In this section, we will pull different data from different sites and organize it. We will use the json module in these projects. First, let’s review what we have learned in a small project.
import pandas as pd
Data = pd.read_html(Link)
Sometimes things are not that easy. We use the json module to convert a pulled table into records that are easy to work with in Python. Let’s create a variable named fortune.
import pandas as pd
import json as js
Data = pd.read_html(Link)
# read_html returns a list of tables; the first one is assumed here
fortune = js.loads(Data[0].to_json(orient="records"))
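To see what this conversion produces without a network request, here is a small sketch that builds a table by hand. The DataFrame contents and the names table and json_text are made up for illustration; only to_json and js.loads come from the code above.

```python
import json as js

import pandas as pd

# Hypothetical stand-in for one table returned by pd.read_html.
table = pd.DataFrame(
    {"Company": ["Alpha", "Beta"], "Country": ["Japan", "Germany"]}
)

# orient="records" produces one JSON object per row.
json_text = table.to_json(orient="records")

# js.loads turns the JSON string into a list of Python dicts.
fortune = js.loads(json_text)

# Each row can now be read like a normal dictionary.
for row in fortune:
    print(row["Company"], row["Country"])
```

After the conversion, fortune is a plain Python list, so it can be looped over, filtered or sorted without any pandas-specific knowledge.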
Now that we know the basics of data extraction, let’s use what we have learned in a real project.
Using Pulled Data
The data will come from Wikipedia’s table of the world’s largest companies. Using the function we write, we will list the countries with the highest number of companies in that table.
import pandas as pd
import json as js
We start by importing the libraries; then we pull the data.
Link = "https://en.wikipedia.org/wiki/List_of_largest_companies_by_revenue"
Data = pd.read_html(Link)
Data
You can write Data on its own line to review the pulled tables. After reviewing, delete that last line so it is no longer printed on the screen.
# read_html returns a list of tables; index 0 is assumed to be the companies table
fortune = js.loads(Data[0].to_json(orient="records"))
Now that we have finished extracting and storing the data, we can use it in our function. Read this article first: you need to understand the index logic before sorting in the function.
from collections import Counter
def Sort():
    # Data is a list; table 0 is assumed here
    country = Data[0]["Country[note 1]"]
    occurrence_count = Counter(country)
    return occurrence_count
print(Sort())
Counter counts how many times each value occurs. To use it, you must import Counter from the collections module.
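The counting step can be tried on its own with a small sketch. The country values and the names countries and top_two are made up for illustration and stand in for the country column of the pulled table.

```python
from collections import Counter

import pandas as pd

# Hypothetical stand-in for the country column of the companies table.
countries = pd.Series(
    ["United States", "China", "United States", "Japan", "China", "United States"]
)

# Counter tallies how many times each value appears in the column.
occurrence_count = Counter(countries)

# most_common(n) returns the n most frequent values with their counts.
top_two = occurrence_count.most_common(2)
print(top_two)
```

Calling most_common without an argument returns every value ordered from most to least frequent, which gives the list of countries with the highest number of companies. The pandas method Series.value_counts() produces a similar tally if you prefer to stay inside pandas.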