This post is a hands-on demo of how to fetch data from the World Bank's well-known World Development Indicators using the handy wbdata package for Python. Along the way, you will also see some regular expression usage in Python and how the seaborn package can improve the look of matplotlib charts.

First, let us install the library wbdata.

!pip install wbdata

Then import the library as usual.

import wbdata

This package can connect to many data sources maintained by the World Bank. We can list them as follows.

wbdata.get_source()
11	Africa Development Indicators
36	Statistical Capacity Indicators
31	Country Policy and Institutional Assessment
41	Country Partnership Strategy for India (FY2013 - 17)
1 	Doing Business
30	Exporter Dynamics Database – Indicators at Country-Year Level
12	Education Statistics
13	Enterprise Surveys
28	Global Financial Inclusion
33	G20 Financial Inclusion Indicators
14	Gender Statistics
15	Global Economic Monitor
27	Global Economic Prospects
32	Global Financial Development
21	Global Economic Monitor Commodities
55	Commodity Prices- History and Projections
34	Global Partnership for Education
29	The Atlas of Social Protection: Indicators of Resilience and Equity
16	Health Nutrition and Population Statistics
39	Health Nutrition and Population Statistics by Wealth Quintile
40	Population estimates and projections
18	IDA Results Measurement System
45	Indonesia Database for Policy and Economic Research
6 	International Debt Statistics
54	Joint External Debt Hub
25	Jobs
37	LAC Equity Lab
19	Millennium Development Goals
24	Poverty and Equity
20	Quarterly Public Sector Debt
23	Quarterly External Debt Statistics GDDS
22	Quarterly External Debt Statistics SDDS
44	Readiness for Investment in Sustainable Energy
46	Sustainable Development Goals 
35	Sustainable Energy for All
5 	Subnational Malnutrition Database
38	Subnational Poverty
50	Subnational Population
43	Wealth accounting
57	WDI Database Archives
2 	World Development Indicators
3 	Worldwide Governance Indicators

We are interested in source no. 2. It contains thousands of indicators, so how can we search them by keyword? Here is the way.

wbdata.search_indicators('emission', source=2)
EN.CO2.TRAN.ZS      	CO2 emissions from transport (% of total fuel combustion)
EN.CO2.OTHX.ZS      	CO2 emissions from other sectors, excluding residential buildings and commercial and public services (% of total fuel combustion)
EN.CO2.MANF.ZS      	CO2 emissions from manufacturing industries and construction (% of total fuel combustion)
EN.CO2.ETOT.ZS      	CO2 emissions from electricity and heat production, total (% of total fuel combustion)
EN.CO2.BLDG.ZS      	CO2 emissions from residential buildings and commercial and public services (% of total fuel combustion)
EN.CLC.GHGR.MT.CE   	GHG net emissions/removals by LUCF (Mt of CO2 equivalent)
EN.ATM.SF6G.KT.CE   	SF6 gas emissions (thousand metric tons of CO2 equivalent)
EN.ATM.PFCG.KT.CE   	PFC gas emissions (thousand metric tons of CO2 equivalent)
EN.ATM.NOXE.ZG      	Nitrous oxide emissions (% change from 1990)
EN.ATM.NOXE.KT.CE   	Nitrous oxide emissions (thousand metric tons of CO2 equivalent)
EN.ATM.NOXE.EG.ZS   	Nitrous oxide emissions in energy sector (% of total)
EN.ATM.NOXE.EG.KT.CE	Nitrous oxide emissions in energy sector (thousand metric tons of CO2 equivalent)
EN.ATM.NOXE.AG.ZS   	Agricultural nitrous oxide emissions (% of total)
EN.ATM.NOXE.AG.KT.CE	Agricultural nitrous oxide emissions (thousand metric tons of CO2 equivalent)
EN.ATM.METH.ZG      	Methane emissions (% change from 1990)
EN.ATM.METH.KT.CE   	Methane emissions (kt of CO2 equivalent)
EN.ATM.METH.EG.ZS   	Energy related methane emissions (% of total)
EN.ATM.METH.EG.KT.CE	Methane emissions in energy sector (thousand metric tons of CO2 equivalent)
EN.ATM.METH.AG.ZS   	Agricultural methane emissions (% of total)
EN.ATM.METH.AG.KT.CE	Agricultural methane emissions (thousand metric tons of CO2 equivalent)
EN.ATM.HFCG.KT.CE   	HFC gas emissions (thousand metric tons of CO2 equivalent)
EN.ATM.GHGT.ZG      	Total greenhouse gas emissions (% change from 1990)
EN.ATM.GHGT.KT.CE   	Total greenhouse gas emissions (kt of CO2 equivalent)
EN.ATM.GHGO.ZG      	Other greenhouse gas emissions (% change from 1990)
EN.ATM.GHGO.KT.CE   	Other greenhouse gas emissions, HFC, PFC and SF6 (thousand metric tons of CO2 equivalent)
EN.ATM.CO2E.SF.ZS   	CO2 emissions from solid fuel consumption (% of total)
EN.ATM.CO2E.SF.KT   	CO2 emissions from solid fuel consumption (kt) 
EN.ATM.CO2E.PP.GD.KD	CO2 emissions (kg per 2011 PPP $ of GDP)
EN.ATM.CO2E.PP.GD   	CO2 emissions (kg per PPP $ of GDP)
EN.ATM.CO2E.PC      	CO2 emissions (metric tons per capita)
EN.ATM.CO2E.LF.ZS   	CO2 emissions from liquid fuel consumption (% of total) 
EN.ATM.CO2E.LF.KT   	CO2 emissions from liquid fuel consumption (kt) 
EN.ATM.CO2E.KT      	CO2 emissions (kt)
EN.ATM.CO2E.KD.GD   	CO2 emissions (kg per 2010 US$ of GDP)
EN.ATM.CO2E.GF.ZS   	CO2 emissions from gaseous fuel consumption (% of total) 
EN.ATM.CO2E.GF.KT   	CO2 emissions from gaseous fuel consumption (kt) 
NY.ADJ.SVNX.GN.ZS   	Adjusted net savings, excluding particulate emission damage (% of GNI)
NY.ADJ.SVNX.CD      	Adjusted net savings, excluding particulate emission damage (current US$)
NY.ADJ.SVNG.GN.ZS   	Adjusted net savings, including particulate emission damage (% of GNI)
NY.ADJ.SVNG.CD      	Adjusted net savings, including particulate emission damage (current US$)
NY.ADJ.DPEM.GN.ZS   	Adjusted savings: particulate emission damage (% of GNI)
NY.ADJ.DPEM.CD      	Adjusted savings: particulate emission damage (current US$)

As you can see, the function lists indicator codes alongside their names. Let us choose the indicator with the title CO2 emissions (kg per 2011 PPP $ of GDP) and the code EN.ATM.CO2E.PP.GD.KD.

We can get the details of a particular indicator from its code as follows.

ind = wbdata.get_indicator('EN.ATM.CO2E.PP.GD.KD', display=False)

Specifying display=False makes the function return its result instead of printing it, so we can keep it in a variable (ind in our case) and display it as below.

ind
[{'id': 'EN.ATM.CO2E.PP.GD.KD',
  'name': 'CO2 emissions (kg per 2011 PPP $ of GDP)',
  'source': {'id': '2', 'value': 'World Development Indicators'},
  'sourceNote': 'Carbon dioxide emissions are those stemming from the burning of fossil fuels and the manufacture of cement. They include carbon dioxide produced during consumption of solid, liquid, and gas fuels and gas flaring.',
  'sourceOrganization': 'Carbon Dioxide Information Analysis Center, Environmental Sciences Division, Oak Ridge National Laboratory, Tennessee, United States.',
  'topics': [{'id': '19', 'value': 'Climate Change'},
   {'id': '6', 'value': 'Environment '}]}]

As you can see, this function returns a list holding one dictionary, from which we can extract the desired information. Let us say we want to know the name of the indicator.

ind[0]['name']
'CO2 emissions (kg per 2011 PPP $ of GDP)'
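The same indexing pattern reaches the nested fields as well. The sketch below works on a shortened copy of the record shown above, typed in by hand so that it runs without calling the API; the variable names record, source_name and topic_names are introduced here for illustration only.

```python
# A shortened, hand-copied version of the record returned above, so the
# example runs without network access.
ind = [{
    'id': 'EN.ATM.CO2E.PP.GD.KD',
    'name': 'CO2 emissions (kg per 2011 PPP $ of GDP)',
    'source': {'id': '2', 'value': 'World Development Indicators'},
    'topics': [{'id': '19', 'value': 'Climate Change'},
               {'id': '6', 'value': 'Environment '}],
}]

record = ind[0]                           # the list holds a single dictionary
source_name = record['source']['value']   # nested dictionary lookup
# The raw topic names can carry stray whitespace, so strip each one.
topic_names = [t['value'].strip() for t in record['topics']]

print(source_name)
print(topic_names)
```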

Having found the indicator, let us choose a few countries to investigate. For that, let us search for their country codes.

wbdata.search_countries("Bangladesh")
BGD	Bangladesh
wbdata.search_countries("Saudi")
SAU	Saudi Arabia

We now have the indicator code and the country codes; all we need is two dates to narrow our search. Let us say that we want to search the database from 2001 to 2015. Here is how we store the two dates.

import datetime
data_dates = (datetime.datetime(2001,1,1), datetime.datetime(2015,1,1))

We now have all the ingredients. All that remains is to call the most important function, get_dataframe, to fetch the data and keep it in a dataframe.

import pandas as pd
data = wbdata.get_dataframe({'EN.ATM.CO2E.PP.GD.KD':'values'}, 
                            country=('BGD', 'SAU'), 
                            data_date=data_dates, 
                            convert_date=False, keep_levels=True)
data.head()
                   values
country    date
Bangladesh 2015       NaN
           2014       NaN
           2013  0.154309
           2012  0.160099
           2011  0.160661

The dataframe has a two-level index: by country and by date. I am not very comfortable dealing with multi-indexed dataframes, so I prefer to call reset_index so that I can apply filters more conveniently.

data = data.reset_index()

Now, let us look at the first few lines of the dataframe after resetting the index.

data.head()
      country  date    values
0  Bangladesh  2015       NaN
1  Bangladesh  2014       NaN
2  Bangladesh  2013  0.154309
3  Bangladesh  2012  0.160099
4  Bangladesh  2011  0.160661
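To see what reset_index does without calling the API, here is a small sketch that rebuilds a frame of the same shape by hand. The three Bangladesh values are the ones shown above; the names toy, flat, recent and chronological are hypothetical, and the dates are kept as strings on the assumption that convert_date=False leaves them that way.

```python
import pandas as pd

# Hand-built stand-in for the API result: a (country, date) MultiIndex
# with one 'values' column, newest year first as in the output above.
index = pd.MultiIndex.from_tuples(
    [('Bangladesh', '2013'), ('Bangladesh', '2012'), ('Bangladesh', '2011')],
    names=['country', 'date'])
toy = pd.DataFrame({'values': [0.154309, 0.160099, 0.160661]}, index=index)

flat = toy.reset_index()          # the index levels become ordinary columns
# With plain columns, filtering is a simple boolean mask.
recent = flat[flat['date'] >= '2012']
# Sorting by date puts the rows in chronological order.
chronological = flat.sort_values('date').reset_index(drop=True)

print(list(flat.columns))
print(recent)
print(chronological)
```

Sorting by date is worth keeping in mind for plotting, because the rows arrive with the newest year first.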

After this proof of concept, let us wrap these bits and pieces into a function that takes a country, an indicator ID, and start and end years, and returns a single-column dataframe of values that we can plot later.

def country_data(country_code, indicator, start=2000, end=2015):
    import datetime
    import wbdata
    data_dates = (datetime.datetime(start,1,1), datetime.datetime(end,1,1))
    #call the api
    data = wbdata.get_dataframe({indicator:'indicator'}, 
                                country=country_code, 
                                data_date=data_dates, 
                                convert_date=True, 
                                keep_levels=False)
    
    data = data.reset_index()
    #data = data.dropna() #if I want I can drop the na's
    return data[['indicator']]

And a small test to check things are running well.

country_data('IDN','EN.ATM.CO2E.PP.GD.KD')
indicator
0 NaN
1 NaN
2 0.197236
3 0.260391
4 0.264045
5 0.209629
6 0.231842
7 0.226354
8 0.216339
9 0.211427
10 0.221036
11 0.230642
12 0.227291
13 0.230597
14 0.231680
15 0.214482

Now it is time to write a grand function that takes a list of country codes, uses our country_data function above to get the indicator values for each country, and then plots a line chart.

def plot_indicators(country_list, indicator, start=2000, end=2015):
    import matplotlib.pyplot as plt
    import seaborn as sns
    import wbdata
    import re
    ind = wbdata.get_indicator(indicator, display=False)
    # the full name holds the title followed by the unit in brackets
    name = ind[0]['name']
    # take the text from the first letter up to the space before the opening bracket
    title = name[:name.find('(')-1]
    # this is the pattern to match anything between two brackets
    # (raw string, so the backslashes reach the regex engine unchanged)
    p = re.compile(r'\((.*?)\)')
    ylab = p.findall(name)[0]
    sns.set_style('white')
    fig, axis = plt.subplots()
    # the data comes back newest year first (see the output above), so count
    # the years down from end to start to pair each value with its own year
    years = range(end, start-1, -1)
    for c in country_list:
        axis.plot(years, country_data(c, indicator, start, end))
    plt.legend(country_list)
    plt.title(title)
    plt.ylabel(ylab)
    plt.show()

Let us test the function with the Gulf countries (Saudi Arabia, Qatar, Bahrain, Kuwait, Oman and the U.A.E.) since 1990.

plot_indicators(['SAU', 'QAT', 'BHR', 'KWT', 'OMN', 'ARE'],
	'EN.ATM.CO2E.PP.GD.KD',1990,2015)

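[Figure: line chart of CO2 emissions (kg per 2011 PPP $ of GDP) for Saudi Arabia, Qatar, Bahrain, Kuwait, Oman and the U.A.E., 1990 to 2015]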

There are a few points to note in the function above.

First, to render a better chart, I used the seaborn data visualization package, which sits on top of matplotlib. I only call set_style to apply one of seaborn's styles for better output; the matplotlib commands remain the same.

Second, I had to extract the unit, which sits between parentheses, from the name of the indicator. To do that I used a regular expression. Let me explain this point further.

You saw above how I retrieved the name of the indicator, which looks something like GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent).

Note that most World Bank indicator names start with the actual title of the indicator, followed by the unit in parentheses. So how can we extract these two parts?

Let us start with the first part, which is easier. We want to take the substring from position 0 up to the first opening bracket; that is our title part. To do that, I used the string find() method as follows.

test_string = 'GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent)'
test_string[:test_string.find('(')-1]
'GDP per unit of energy use'

find() returns the position of '(', so I take the string from position 0 to the position of the opening bracket minus one (because the last character before the bracket is a space, which I want to ignore).
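One caveat: find() returns -1 when the string contains no opening bracket, and the slice above would then silently drop the last two characters of the name. A minimal sketch of a guarded version follows; the helper name indicator_title is made up for this post and is not part of wbdata.

```python
def indicator_title(name):
    """Return the part of an indicator name before the first '('."""
    pos = name.find('(')
    if pos == -1:
        # No bracket at all: keep the whole name rather than slicing it.
        return name.strip()
    # rstrip() removes the space before the bracket, if there is one.
    return name[:pos].rstrip()

print(indicator_title('GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent)'))
print(indicator_title('Population, total'))
```

Using rstrip() instead of subtracting one also covers names where the bracket is not preceded by a space.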

Now, the more difficult part is the unit, which sits between the brackets. I could have used the same find() to search for the index of the closing bracket, but let me use a regular expression to practice this powerful tool.

First we need to import the regular expression package re. Then we design the pattern we are interested in, which is anything between an opening and a closing parenthesis: '\((.*?)\)'. The escaped \( and \) match the literal brackets, and the non-greedy group (.*?) captures as little text as possible between them. See this Stack Overflow post for more information. I use the findall() function to return a list of all matches, and then take the first one (at position 0).

import re
p = re.compile(r'\((.*?)\)')  # raw string keeps the backslashes intact
p.findall('GDP per unit of energy use (constant 2011 PPP $ per kg of oil equivalent)')[0]
'constant 2011 PPP $ per kg of oil equivalent'
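Some names may contain more than one bracketed part, or none at all, in which case taking element 0 of findall() either picks the wrong piece or raises an IndexError. The sketch below is one possible variation, not something wbdata provides: it anchors the pattern at the end of the string so that only a trailing bracket is treated as the unit. The helper name split_name and the second sample string are invented for illustration.

```python
import re

# A trailing "(...)" with no nested brackets, optionally followed by spaces.
UNIT_PATTERN = re.compile(r'\(([^()]*)\)\s*$')

def split_name(name):
    """Split an indicator name into (title, unit); unit is None if absent."""
    match = UNIT_PATTERN.search(name)
    if match is None:
        return name.strip(), None
    # Everything before the trailing bracket is the title.
    return name[:match.start()].rstrip(), match.group(1)

print(split_name('CO2 emissions (kg per 2011 PPP $ of GDP)'))
print(split_name('Some rate (modeled estimate) (% of total)'))
print(split_name('Population, total'))
```

Returning None for a missing unit lets the caller decide what to put on the y-axis instead of failing inside the plotting function.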

With this separation between title and unit, we can easily use them as the title and ylabel of our matplotlib charts, as done in our plot_indicators function above.

With this, I hope you can carry on by yourself with much more interesting queries and experiments.