Visa free travels - exploratory analysis with Kaggle dataset
29 Dec 2017This December I am celebrating 3 months anniversary with Python. On the last weekend of September I participated in PyLove workshop which was my first ever encounter with programming. Many lessons, few books and an online course later I dared to download my first dataset found on Kaggle and play around with it. I chose Visa Free Travel by Ctizenship 2016, Inequality in world citizenship and added more information using Countries REST API.
The problem I wanted to tackle was to measure the persisting inequality and see whether countries with similar rank share some characteristics with each other.
#1 DATA PREPARATION
Input:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests
%matplotlib inline
sns.set()
# Loading Kaggle data set
visa = pd.read_csv('VisaFreeScore.csv')
visa.head()
Output:
country | visarank | visafree | visaonarrive | visawelc | |
---|---|---|---|---|---|
0 | Germany | 1.0 | 157.0 | 117.0 | 84.0 |
1 | Sweden | 1.0 | 157.0 | 117.0 | 84.0 |
2 | Finland | 2.0 | 156.0 | 117.0 | 85.0 |
3 | France | 2.0 | 156.0 | 116.0 | 84.0 |
4 | Italy | 2.0 | 156.0 | 117.0 | 84.0 |
Input:
visa.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
country 199 non-null object
visarank 199 non-null float64
visafree 199 non-null float64
visaonarrive 199 non-null float64
visawelc 199 non-null float64
dtypes: float64(4), object(1)
memory usage: 7.9+ KB
Input:
# Loading country data from open API to a json file
countries = requests.get('https://restcountries.eu/rest/v2/all')
countries = countries.json()
# Creating empty lists of information I chose and filling them with data from API
name = []
borders = []
region = []
latlng = []
area = []
population = []
gini = []
x = 0
for country in countries:
name.append(countries[x]['name'])
borders.append(countries[x]['borders'])
region.append(countries[x]['region'])
latlng.append(countries[x]['latlng'])
area.append(countries[x]['area'])
population.append(countries[x]['population'])
gini.append(countries[x]['gini'])
x += 1
# Change from list of borders into number of shared boarders for each country
y = 0
for border in borders:
borders[y] = len(border)
y += 1
# I want to separate latlng column, but couldn't simply loop through it as it has some empty lists.
# Finding where latlng doesn't contain 2 values
for value in latlng:
if len(value) == 0:
print(latlng.index(value))
elif len(value) == 1:
print(latlng.index(value))
Output: 33
Input:
latlng[33] = ['', '']
lat = []
lng = []
z = 0
for value in latlng:
lat.append(latlng[z][0])
lng.append(latlng[z][1])
z += 1
# Creating Pandas DataFrame from API data
df_countries = pd.DataFrame(
{'country': name,
'borders': borders,
'region': region,
'lat': lat,
'lng': lng,
'area': area,
'population': population,
'gini':gini
})
df_countries.info()
Output:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 8 columns):
area 240 non-null float64
borders 250 non-null int64
country 250 non-null object
gini 153 non-null float64
lat 250 non-null object
lng 250 non-null object
population 250 non-null int64
region 250 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 15.7+ KB
Input:
# Merging two datasets into one table on country row
vc = pd.merge(visa, df_countries, how='inner', on='country')
vc = vc.set_index('country')
vc.info()
Output:
<class 'pandas.core.frame.DataFrame'>
Index: 178 entries, Germany to Afghanistan
Data columns (total 11 columns):
visarank 178 non-null float64
visafree 178 non-null float64
visaonarrive 178 non-null float64
visawelc 178 non-null float64
area 178 non-null float64
borders 178 non-null int64
gini 140 non-null float64
lat 178 non-null object
lng 178 non-null object
population 178 non-null int64
region 178 non-null object
dtypes: float64(6), int64(2), object(3)
memory usage: 16.7+ KB
#2 EXPLORATORY ANALYSIS
vc[['visarank', 'visafree', 'visaonarrive', 'visawelc', 'region']].head(10)
visarank | visafree | visaonarrive | visawelc | region | |
---|---|---|---|---|---|
country | |||||
Germany | 1.0 | 157.0 | 117.0 | 84.0 | Europe |
Sweden | 1.0 | 157.0 | 117.0 | 84.0 | Europe |
Finland | 2.0 | 156.0 | 117.0 | 85.0 | Europe |
France | 2.0 | 156.0 | 116.0 | 84.0 | Europe |
Italy | 2.0 | 156.0 | 117.0 | 84.0 | Europe |
Spain | 2.0 | 156.0 | 115.0 | 84.0 | Europe |
Switzerland | 2.0 | 156.0 | 116.0 | 84.0 | Europe |
Belgium | 3.0 | 155.0 | 115.0 | 84.0 | Europe |
Denmark | 3.0 | 155.0 | 116.0 | 83.0 | Europe |
Netherlands | 3.0 | 155.0 | 115.0 | 84.0 | Europe |
Input:
vc[['visarank', 'visafree', 'visaonarrive', 'visawelc', 'region']].tail(10)
Output:
visarank | visafree | visaonarrive | visawelc | region | |
---|---|---|---|---|---|
country | |||||
Sri Lanka | 85.0 | 38.0 | 14.0 | 181.0 | Asia |
Bangladesh | 86.0 | 37.0 | 16.0 | 174.0 | Asia |
Ethiopia | 86.0 | 37.0 | 6.0 | 41.0 | Africa |
Libya | 86.0 | 37.0 | 8.0 | 3.0 | Africa |
Sudan | 86.0 | 37.0 | 5.0 | 10.0 | Africa |
South Sudan | 87.0 | 36.0 | 7.0 | 5.0 | Africa |
Somalia | 89.0 | 32.0 | 5.0 | 0.0 | Africa |
Iraq | 90.0 | 31.0 | 5.0 | 1.0 | Asia |
Pakistan | 91.0 | 28.0 | 6.0 | 9.0 | Asia |
Afghanistan | 92.0 | 25.0 | 3.0 | 0.0 | Asia |
vc.info()
<class 'pandas.core.frame.DataFrame'>
Index: 178 entries, Germany to Afghanistan
Data columns (total 11 columns):
visarank 178 non-null float64
visafree 178 non-null float64
visaonarrive 178 non-null float64
visawelc 178 non-null float64
area 178 non-null float64
borders 178 non-null int64
gini 140 non-null float64
lat 178 non-null object
lng 178 non-null object
population 178 non-null int64
region 178 non-null object
dtypes: float64(6), int64(2), object(3)
memory usage: 16.7+ KB
-
First look at the data shows that top 10 countries are dominated by Europe, whereas the lowest rank countries represent Asia and Africa.
-
The lowest rank is 92 which means that some of 178 countries share the same rank.
-
Gini coefficient is the only variable with missing values.
vc[['visarank', 'visafree', 'visaonarrive', 'visawelc']].corr()
visarank | visafree | visaonarrive | visawelc | |
---|---|---|---|---|
visarank | 1.000000 | -0.996567 | -0.990366 | -0.125559 |
visafree | -0.996567 | 1.000000 | 0.995280 | 0.117318 |
visaonarrive | -0.990366 | 0.995280 | 1.000000 | 0.131792 |
visawelc | -0.125559 | 0.117318 | 0.131792 | 1.000000 |
-
Visa rank is calculated on the basis of number of visa free destinations and possibility of acquiring visa on arrive. Therefore correlation between those 3 columns is almost 1.
-
Visawelc - number of countries that are not required to have a visa while visiting shows almost no connection with the rank.
((vc['visarank'].groupby(by=vc['region']).count()/178)*100).plot(kind='bar', alpha=0.7)
plt.title('Percentage of countries by region');
plt.hist(vc['visarank'], bins=20, ec='white', alpha=0.7)
plt.title('Distribution of visa rank')
plt.xlim(vc['visarank'].min(), vc['visarank'].max());
- The lower the visa rank, the higher freedom of travel citizens of a country have. This histogram shows inequality in free travelling. There are two local max values - for countries with highest degree of freedom to travel, and for countries ranking just under 80, meaning many restrictions by visa requirements.
plt.hist(vc['visafree'], bins=20, ec='white', alpha=0.7)
plt.title('Distribution of visa free destinations per country')
plt.xlim(vc['visafree'].min(), vc['visafree'].max());
- Distribution of visa free destinations allows to put a measure on inequality shown on the previous chart. Highest degree of freedom to travel means being able to visit almost 160 countries all over the world with no formal requirements. Countries with most restricions can travel to only about one third of this number.
fig = plt.figure(figsize=(8,5))
fig.add_subplot(111)
vc['visafree'].groupby(by=vc['region']).mean().plot(kind='bar', color='#19436B', position=0, width=0.3, alpha=0.7, label = 'Average number of visa free destinations')
vc['visawelc'].groupby(by=vc['region']).mean().plot(kind='bar', position=1, width=0.3, alpha=0.7, label = 'Average number of countries accepted visa free')
plt.legend();
-
Africa and Asia are regions, where citizens of each country have lowest number of visa free destinations.
-
European citizens have strong advantage over other regions.
-
At the same time Europe is a region that on average accepts the lowest number of visitors without visa.
-
Americas’ citizens also can travel visa free more than they offer visa free entries, however the difference is visibly smaller.
-
All the other regions have an opposite ratio and the biggest difference between those values persists among African countries.
vc1 = pd.DataFrame()
vc1['welcome_count'] = vc['visawelc'].groupby(by=vc['region']).sum()
vc1['free_count'] = vc['visafree'].groupby(by=vc['region']).sum()
vc1['free_welcome_ratio'] = vc1['free_count'] / vc1['welcome_count']
vc1
welcome_count | free_count | free_welcome_ratio | |
---|---|---|---|
region | |||
Africa | 4106.0 | 2735.0 | 0.666098 |
Americas | 3226.0 | 3426.0 | 1.061996 |
Asia | 3583.0 | 2999.0 | 0.837008 |
Europe | 3296.0 | 5742.0 | 1.742112 |
Oceania | 1308.0 | 1191.0 | 0.910550 |
- Free_welcome_ratio columns shows how many countries can an average citizen of a region travel visa free for welcoming one country without such requirement.
vc[vc['visawelc']==0][['visarank', 'visawelc', 'region']]
visarank | visawelc | region | |
---|---|---|---|
country | |||
Turkmenistan | 74.0 | 0.0 | Asia |
Somalia | 89.0 | 0.0 | Africa |
Afghanistan | 92.0 | 0.0 | Asia |
vc[vc['visawelc']==vc['visawelc'].max()][['visarank', 'visawelc', 'region']]
visarank | visawelc | region | |
---|---|---|---|
country | |||
Seychelles | 24.0 | 198.0 | Africa |
Samoa | 35.0 | 198.0 | Oceania |
Timor-Leste | 47.0 | 198.0 | Asia |
Tuvalu | 51.0 | 198.0 | Oceania |
Uganda | 67.0 | 198.0 | Africa |
Mauritania | 71.0 | 198.0 | Africa |
Togo | 72.0 | 198.0 | Africa |
Cambodia | 75.0 | 198.0 | Asia |
Guinea-Bissau | 75.0 | 198.0 | Africa |
Madagascar | 76.0 | 198.0 | Africa |
Mozambique | 76.0 | 198.0 | Africa |
Comoros | 78.0 | 198.0 | Africa |
Burundi | 82.0 | 198.0 | Africa |
- 13 countries welcome all other without visa requirements, 3 countries allow no one to enter visa free.
plt.figure(figsize=(4,4))
vc['population'].groupby(np.where(vc['visafree']>vc['visawelc'], 'In advantage', 'In disadvantage')).sum().plot(kind='pie', label='')
plt.title('Share of people globally by advantage and disadvantage in ratio of visa free destinations to welcoming visa free visitors');
- Number of citizens of countries that are in discriminating travelling agreements is approaching a quarter of population globally.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,7), )
vc[['population', 'visafree']].plot(ax = axes[0,0], x='population', y='visafree', kind='scatter')
vc[['gini', 'visafree']].plot(ax = axes[0,1], x='gini', y='visafree', kind='scatter')
vc[['borders', 'visafree']].plot(ax = axes[1,0], x='borders', y='visafree', kind='scatter')
vc[['area', 'visafree']].plot(ax = axes[1,1], x='area', y='visafree', kind='scatter');
The above diagrams check for patterns between number of visa free destinations and country characteristics.
-
There seems to be little conection between population size and visa free destinations.
-
Gini coefficient values are scattered, however there are two visible groups: gini score between 25 and 40 is paired with high number of visa free destinations (between 140 and 160 countries. Gini score between 30 and 45 is paired with lower number of visa free destinations (between 40 and 70 countries).
-
For different numbers of borders, number of visa free destinations tends to have full spectrum from very low to very high values.
-
There seems to be some pattern between area and visa free destinations. Ignoring extreme area values, with bigger area number of visa free destinations tends to cummulate around 30 to 70 countries.
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12,7))
vc[['population', 'visawelc']].plot(ax = axes[0,0], x='population', y='visawelc', kind='scatter')
vc[['gini', 'visawelc']].plot(ax = axes[0,1], x='gini', y='visawelc', kind='scatter')
vc[['borders', 'visawelc']].plot(ax = axes[1,0], x='borders', y='visawelc', kind='scatter')
vc[['area', 'visawelc']].plot(ax = axes[1,1], x='area', y='visawelc', kind='scatter');
The above diagrams check for patterns between number of welcoming visa free visitors and country characteristics.
-
Population again seems not to show any strong pattern against welcoming visa free visitors.
-
Low Gini coefficient (up to 30) is connected with visibly lower number of accepting visitors with no visa (around 80 countries). Only with Gini score higher than 30 we can observe full spectrum of visawelc values.
-
It’s difficult to observe a strong pattern between visawelc and number of borders. However interestingly, high numbers of allowing visa free visitors is more common for up to 2 shared borders. For higher values there are only single countries with such result.
-
Bigger area countries tend to allow up to 100 countries’ citizents with no visa requirement.
Ipynb file with full tables can be found here.