Visa-free travel: exploratory analysis with a Kaggle dataset

This December I am celebrating my three-month anniversary with Python. On the last weekend of September I took part in a PyLove workshop, which was my first ever encounter with programming. Many lessons, a few books and an online course later, I dared to download my first dataset from Kaggle and play around with it. I chose Visa Free Travel by Citizenship 2016, Inequality in world citizenship, and added more information using the Countries REST API.

I wanted to measure how unequal visa-free travel is and to see whether countries with a similar rank share other characteristics.

#1 DATA PREPARATION

Input:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import requests

%matplotlib inline
sns.set()
# Loading Kaggle data set
visa = pd.read_csv('VisaFreeScore.csv')
visa.head()

Output:

country visarank visafree visaonarrive visawelc
0 Germany 1.0 157.0 117.0 84.0
1 Sweden 1.0 157.0 117.0 84.0
2 Finland 2.0 156.0 117.0 85.0
3 France 2.0 156.0 116.0 84.0
4 Italy 2.0 156.0 117.0 84.0

Input:

visa.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 5 columns):
country         199 non-null object
visarank        199 non-null float64
visafree        199 non-null float64
visaonarrive    199 non-null float64
visawelc        199 non-null float64
dtypes: float64(4), object(1)
memory usage: 7.9+ KB
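
The info() output above reports 200 rows but only 199 non-null values in every column, which suggests that the CSV contains one completely empty row. That is an assumption; the notebook does not check it. A minimal sketch of how such a row could be dropped and the float columns turned back into integers, using a small stand-in frame (`visa_demo` is a made-up name, the three filled rows are copied from the head() output):

```python
import numpy as np
import pandas as pd

# Stand-in for the Kaggle file: three rows from head() plus one empty row (assumed)
visa_demo = pd.DataFrame({
    'country': ['Germany', 'Sweden', np.nan, 'Finland'],
    'visarank': [1.0, 1.0, np.nan, 2.0],
    'visafree': [157.0, 157.0, np.nan, 156.0],
})

# Drop rows where every column is missing, then restore a clean 0..n-1 index
visa_clean = visa_demo.dropna(how='all').reset_index(drop=True)

# With no NaN left, the counts can be stored as integers again
visa_clean = visa_clean.astype({'visarank': int, 'visafree': int})

print(visa_clean)
```

If the missing values turned out to be spread over different rows, `dropna(how='all')` would remove nothing and the integer cast would fail, which would itself be a useful signal.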

Input:

# Loading country data from the open API and parsing the JSON response into a list of dicts
countries = requests.get('https://restcountries.eu/rest/v2/all')
countries = countries.json()
# Creating empty lists of information I chose and filling them with data from API
name = []
borders = []
region = []
latlng = []
area = []
population = []
gini = []

for country in countries:
    name.append(country['name'])
    borders.append(country['borders'])
    region.append(country['region'])
    latlng.append(country['latlng'])
    area.append(country['area'])
    population.append(country['population'])
    gini.append(country['gini'])
# Change from list of borders into number of shared borders for each country
borders = [len(border) for border in borders]
# I want to separate the latlng column, but couldn't simply loop through it as it has some empty lists.
# Finding where latlng doesn't contain 2 values
# (enumerate gives the real position; list.index() would repeat the first match for equal values)
for i, value in enumerate(latlng):
    if len(value) != 2:
        print(i)

Output: 33

Input:

# Placeholder for the one record without coordinates
# (the empty strings are why lat and lng end up as object columns)
latlng[33] = ['', '']
lat = [value[0] for value in latlng]
lng = [value[1] for value in latlng]
# Creating Pandas DataFrame from API data
df_countries = pd.DataFrame(
    {'country': name,
     'borders': borders,
     'region': region,
     'lat': lat,
     'lng': lng,
     'area': area,
     'population': population,
     'gini':gini
    })
df_countries.info()

Output:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 250 entries, 0 to 249
Data columns (total 8 columns):
area          240 non-null float64
borders       250 non-null int64
country       250 non-null object
gini          153 non-null float64
lat           250 non-null object
lng           250 non-null object
population    250 non-null int64
region        250 non-null object
dtypes: float64(2), int64(2), object(4)
memory usage: 15.7+ KB
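
The same table could also be built by handing the list of dictionaries to pandas directly. The sketch below is only an illustration under assumptions: the three records in `sample` are made up and merely shaped like the API fields used above, and `fields` and `df` are names introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical records shaped like the API fields used above (all values are made up)
sample = [
    {'name': 'Country A', 'borders': ['Country B', 'Country C'], 'region': 'Europe',
     'latlng': [51.0, 9.0], 'area': 1000.0, 'population': 500, 'gini': 28.3},
    {'name': 'Country B', 'borders': ['Country A'], 'region': 'Europe',
     'latlng': [], 'area': None, 'population': 80, 'gini': None},
    {'name': 'Country C', 'borders': [], 'region': 'Oceania',
     'latlng': [-8.0, 179.0], 'area': 26.0, 'population': 10, 'gini': None},
]

fields = ['name', 'borders', 'region', 'latlng', 'area', 'population', 'gini']
df = pd.DataFrame(sample)[fields].rename(columns={'name': 'country'})

# Number of shared borders instead of the list of neighbours
df['borders'] = df['borders'].map(len)

# Split latlng into two numeric columns; records without coordinates become NaN
df['lat'] = [pair[0] if len(pair) == 2 else np.nan for pair in df['latlng']]
df['lng'] = [pair[1] if len(pair) == 2 else np.nan for pair in df['latlng']]
df = df.drop(columns='latlng')

print(df.dtypes)
```

Using NaN instead of `['', '']` for the missing coordinates means lat and lng come out as float64 rather than object, unlike in the info() output above.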

Input:

# Merging the two datasets on the country column (the inner join keeps only names present in both)
vc = pd.merge(visa, df_countries, how='inner', on='country')
vc = vc.set_index('country')
vc.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 178 entries, Germany to Afghanistan
Data columns (total 11 columns):
visarank        178 non-null float64
visafree        178 non-null float64
visaonarrive    178 non-null float64
visawelc        178 non-null float64
area            178 non-null float64
borders         178 non-null int64
gini            140 non-null float64
lat             178 non-null object
lng             178 non-null object
population      178 non-null int64
region          178 non-null object
dtypes: float64(6), int64(2), object(3)
memory usage: 16.7+ KB
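
The inner merge keeps 178 of the 199 named countries in the visa table, so some names apparently have no exact match in the API data, for example because of different spellings (an assumption; the notebook does not check this). A small sketch of how the unmatched names could be listed with the `indicator` option of `pd.merge`. The frames `visa_demo` and `api_demo` and the mismatched name are made up for illustration.

```python
import pandas as pd

# Made-up stand-ins for `visa` and `df_countries`; only the spellings matter here
visa_demo = pd.DataFrame({'country': ['Germany', 'Sweden', 'Example Republic'],
                          'visarank': [1, 1, 50]})
api_demo = pd.DataFrame({'country': ['Germany', 'Sweden', 'Republic of Example'],
                         'region': ['Europe', 'Europe', 'Asia']})

# A left merge with indicator=True keeps every visa row and records
# in the `_merge` column whether a partner row was found in the API table
check = pd.merge(visa_demo, api_demo, how='left', on='country', indicator=True)
unmatched = check.loc[check['_merge'] == 'left_only', 'country'].tolist()

print(unmatched)  # ['Example Republic']
```

Names found this way could be mapped by hand (for instance with `Series.replace`) before repeating the inner merge.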

#2 EXPLORATORY ANALYSIS

Input:

vc[['visarank', 'visafree', 'visaonarrive', 'visawelc', 'region']].head(10)

Output:

visarank visafree visaonarrive visawelc region
country
Germany 1.0 157.0 117.0 84.0 Europe
Sweden 1.0 157.0 117.0 84.0 Europe
Finland 2.0 156.0 117.0 85.0 Europe
France 2.0 156.0 116.0 84.0 Europe
Italy 2.0 156.0 117.0 84.0 Europe
Spain 2.0 156.0 115.0 84.0 Europe
Switzerland 2.0 156.0 116.0 84.0 Europe
Belgium 3.0 155.0 115.0 84.0 Europe
Denmark 3.0 155.0 116.0 83.0 Europe
Netherlands 3.0 155.0 115.0 84.0 Europe

Input:

vc[['visarank', 'visafree', 'visaonarrive', 'visawelc', 'region']].tail(10)

Output:

visarank visafree visaonarrive visawelc region
country
Sri Lanka 85.0 38.0 14.0 181.0 Asia
Bangladesh 86.0 37.0 16.0 174.0 Asia
Ethiopia 86.0 37.0 6.0 41.0 Africa
Libya 86.0 37.0 8.0 3.0 Africa
Sudan 86.0 37.0 5.0 10.0 Africa
South Sudan 87.0 36.0 7.0 5.0 Africa
Somalia 89.0 32.0 5.0 0.0 Africa
Iraq 90.0 31.0 5.0 1.0 Asia
Pakistan 91.0 28.0 6.0 9.0 Asia
Afghanistan 92.0 25.0 3.0 0.0 Asia

Input:

vc.info()

Output:

<class 'pandas.core.frame.DataFrame'>
Index: 178 entries, Germany to Afghanistan
Data columns (total 11 columns):
visarank        178 non-null float64
visafree        178 non-null float64
visaonarrive    178 non-null float64
visawelc        178 non-null float64
area            178 non-null float64
borders         178 non-null int64
gini            140 non-null float64
lat             178 non-null object
lng             178 non-null object
population      178 non-null int64
region          178 non-null object
dtypes: float64(6), int64(2), object(3)
memory usage: 16.7+ KB

  • A first look at the data shows that the top 10 is entirely European, whereas the lowest-ranked countries are in Asia and Africa.

  • The lowest rank is 92, which means that many of the 178 countries share a rank.

  • The Gini coefficient is the only column with missing values (140 of 178 are present).
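
Shared ranks are what a dense ranking of the visa-free score would produce. Whether the dataset's visarank was computed exactly this way is an assumption, but for the five rows below (values copied from the head() output) a dense ranking reproduces the ranks shown there:

```python
import pandas as pd

# visafree values copied from the head() output above
visafree = pd.Series([157, 157, 156, 156, 155],
                     index=['Germany', 'Sweden', 'Finland', 'France', 'Belgium'])

# Dense ranking: ties share a rank and the next rank is not skipped
rank = visafree.rank(method='dense', ascending=False).astype(int)

print(rank.tolist())  # [1, 1, 2, 2, 3]
```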


Input:

vc[['visarank', 'visafree', 'visaonarrive', 'visawelc']].corr()

Output:

visarank visafree visaonarrive visawelc
visarank 1.000000 -0.996567 -0.990366 -0.125559
visafree -0.996567 1.000000 0.995280 0.117318
visaonarrive -0.990366 0.995280 1.000000 0.131792
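visawelc -0.125559 0.117318 0.131792 1.000000

  • Visa rank is calculated from the number of visa-free destinations and the possibility of acquiring a visa on arrival. The correlation between those 3 columns is therefore almost perfect: close to 1 between visafree and visaonarrive, and close to -1 between each of them and visarank, because a lower rank number means more destinations.

  • Visawelc, the number of countries whose citizens are not required to have a visa when visiting, shows almost no connection with the rank (correlation of about -0.13).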


# Share of countries per region; len(vc) replaces the hard-coded 178
(vc.groupby('region')['visarank'].count() / len(vc) * 100).plot(kind='bar', alpha=0.7)
plt.title('Percentage of countries by region');

[Figure: bar chart of the percentage of countries by region]


plt.hist(vc['visarank'], bins=20, ec='white', alpha=0.7)
plt.title('Distribution of visa rank')
plt.xlim(vc['visarank'].min(), vc['visarank'].max());

[Figure: histogram of visa rank]

  • The lower the visa rank number, the more freedom of travel a country's citizens have. The histogram shows how unequal that freedom is: there are two local maxima, one for countries with the highest degree of freedom to travel and one for countries ranking just under 80, whose citizens face many visa requirements.
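
To check the two peaks numerically rather than by eye, the bin counts behind a histogram can be inspected with `np.histogram`. The ranks below are made up to mimic a two-peaked shape; with the notebook's data the input would be `vc['visarank']` and the number of bins 20.

```python
import numpy as np

# Made-up ranks with one cluster at the top and one just under 80
ranks = np.array([1, 2, 2, 3, 3, 4, 5, 30, 45, 60, 74, 75, 76, 77, 78, 79, 85, 92])

counts, edges = np.histogram(ranks, bins=5)

# A bin is a local maximum if it is higher than both neighbours
# (zero padding lets the first and last bin qualify as well)
padded = np.concatenate(([0], counts, [0]))
peaks = [i for i in range(len(counts))
         if padded[i + 1] > padded[i] and padded[i + 1] > padded[i + 2]]

for i in peaks:
    print(f'peak in ranks {edges[i]:.0f}-{edges[i + 1]:.0f}: {counts[i]} countries')
```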

plt.hist(vc['visafree'], bins=20, ec='white', alpha=0.7)
plt.title('Distribution of visa free destinations per country')
plt.xlim(vc['visafree'].min(), vc['visafree'].max());

[Figure: histogram of visa free destinations per country]

  • The distribution of visa-free destinations puts a number on the inequality shown in the previous chart. The highest degree of freedom to travel means being able to visit almost 160 countries with no formal requirements. Citizens of the most restricted countries can visit far fewer: the ten lowest-ranked countries in the table above have between 25 and 38 visa-free destinations, a quarter of that number or less.

fig = plt.figure(figsize=(8,5))
fig.add_subplot(111)

vc['visafree'].groupby(by=vc['region']).mean().plot(kind='bar', color='#19436B', position=0, width=0.3, alpha=0.7, label = 'Average number of visa free destinations')
vc['visawelc'].groupby(by=vc['region']).mean().plot(kind='bar', position=1, width=0.3, alpha=0.7, label = 'Average number of countries accepted visa free')
plt.legend();

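[Figure: bar chart of the average number of visa free destinations and of countries accepted visa free, by region]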

  • Africa and Asia are the regions whose citizens have, on average, the fewest visa-free destinations.

  • European citizens have a strong advantage over those of other regions.

  • At the same time, Europe is the region that on average accepts the lowest number of countries visa-free.

  • Citizens of the Americas can also travel visa-free to more countries than their region welcomes visa-free, but the difference is visibly smaller.

  • All other regions show the opposite pattern, and the gap is largest among African countries.


Input:

vc1 = pd.DataFrame()
vc1['welcome_count'] = vc['visawelc'].groupby(by=vc['region']).sum()
vc1['free_count'] = vc['visafree'].groupby(by=vc['region']).sum()
vc1['free_welcome_ratio'] = vc1['free_count'] / vc1['welcome_count']
vc1

Output:

welcome_count free_count free_welcome_ratio
region
Africa 4106.0 2735.0 0.666098
Americas 3226.0 3426.0 1.061996
Asia 3583.0 2999.0 0.837008
Europe 3296.0 5742.0 1.742112
Oceania 1308.0 1191.0 0.910550

  • The free_welcome_ratio column shows how many countries an average citizen of a region can visit visa-free for each country that the region welcomes without a visa.
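
As a worked check of the ratio, the sketch below recomputes it from the regional totals printed in the table above. For Europe, 5742 / 3296 ≈ 1.74, so European citizens get about 1.74 visa-free destinations for every country their region welcomes visa-free. The frame name `totals` is introduced here only for illustration.

```python
import pandas as pd

# Regional totals copied from the table above
totals = pd.DataFrame(
    {'welcome_count': [4106, 3226, 3583, 3296, 1308],
     'free_count': [2735, 3426, 2999, 5742, 1191]},
    index=['Africa', 'Americas', 'Asia', 'Europe', 'Oceania'])

totals['free_welcome_ratio'] = totals['free_count'] / totals['welcome_count']

# Regions above 1 receive more visa-free access than they grant
above_one = totals.index[totals['free_welcome_ratio'] > 1].tolist()

print(totals['free_welcome_ratio'].round(2))
print(above_one)  # ['Americas', 'Europe']
```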

Input:

vc[vc['visawelc']==0][['visarank', 'visawelc', 'region']]

Output:

visarank visawelc region
country
Turkmenistan 74.0 0.0 Asia
Somalia 89.0 0.0 Africa
Afghanistan 92.0 0.0 Asia

Input:

vc[vc['visawelc']==vc['visawelc'].max()][['visarank', 'visawelc', 'region']]

Output:

visarank visawelc region
country
Seychelles 24.0 198.0 Africa
Samoa 35.0 198.0 Oceania
Timor-Leste 47.0 198.0 Asia
Tuvalu 51.0 198.0 Oceania
Uganda 67.0 198.0 Africa
Mauritania 71.0 198.0 Africa
Togo 72.0 198.0 Africa
Cambodia 75.0 198.0 Asia
Guinea-Bissau 75.0 198.0 Africa
Madagascar 76.0 198.0 Africa
Mozambique 76.0 198.0 Africa
Comoros 78.0 198.0 Africa
Burundi 82.0 198.0 Africa

  • 13 countries welcome all others without a visa requirement, while 3 countries allow no one to enter visa-free.

plt.figure(figsize=(4,4))
vc['population'].groupby(np.where(vc['visafree']>vc['visawelc'], 'In advantage', 'In disadvantage')).sum().plot(kind='pie', label='')
plt.title('Share of people globally by advantage and disadvantage in ratio of visa free destinations to welcoming visa free visitors');

[Figure: pie chart of global population split into 'In advantage' and 'In disadvantage']

  • The number of people living in countries that are at a disadvantage in these travel arrangements approaches a quarter of the global population.
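
The pie chart can be complemented with the underlying percentages. A minimal sketch with a made-up four-row frame (the populations and visa counts are illustrative, not taken from the dataset); with the notebook's data, `demo` would be `vc`.

```python
import numpy as np
import pandas as pd

# Illustrative values only
demo = pd.DataFrame({
    'visafree': [150, 40, 60, 30],
    'visawelc': [80, 190, 20, 100],
    'population': [100, 50, 250, 100],
}, index=['A', 'B', 'C', 'D'])

# Same grouping rule as in the pie chart above
group = np.where(demo['visafree'] > demo['visawelc'], 'In advantage', 'In disadvantage')

# Population per group as a percentage of the total
share = demo['population'].groupby(group).sum() / demo['population'].sum() * 100

print(share.round(1))
```

Note that under this rule a country with equal visafree and visawelc values is counted as 'In disadvantage'.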

fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 7))
# One scatter plot per country characteristic, each against visafree
for ax, column in zip(axes.flat, ['population', 'gini', 'borders', 'area']):
    vc.plot(ax=ax, x=column, y='visafree', kind='scatter')

[Figure: scatter plots of visafree against population, gini, borders and area]

The diagrams above look for patterns between the number of visa-free destinations and country characteristics.

  • There seems to be little connection between population size and visa-free destinations.

  • The Gini coefficient values are scattered, but two groups are visible: a Gini score between 25 and 40 is paired with a high number of visa-free destinations (between 140 and 160 countries), while a Gini score between 30 and 45 is paired with a lower number (between 40 and 70 countries).

  • Whatever the number of borders, the number of visa-free destinations spans the full spectrum from very low to very high values.

  • There seems to be some pattern between area and visa-free destinations. Ignoring extreme area values, the number of visa-free destinations for larger countries tends to cluster around 30 to 70.


fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(12, 7))
# One scatter plot per country characteristic, each against visawelc
for ax, column in zip(axes.flat, ['population', 'gini', 'borders', 'area']):
    vc.plot(ax=ax, x=column, y='visawelc', kind='scatter')

[Figure: scatter plots of visawelc against population, gini, borders and area]

The diagrams above look for patterns between the number of countries welcomed visa-free and country characteristics.

  • Population again shows no strong pattern against the number of countries welcomed visa-free.

  • A low Gini coefficient (up to 30) goes with a visibly lower number of countries welcomed visa-free (around 80). Only for Gini scores above 30 can we observe the full spectrum of visawelc values.

  • It is difficult to see a strong pattern between visawelc and the number of borders. Interestingly, though, high visawelc values are more common among countries with up to 2 shared borders; for more borders only single countries reach such values.

  • Countries with a larger area tend to welcome citizens of up to 100 countries visa-free.
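
The visual impressions from both sets of scatter plots could be backed by a rank correlation, which is not distorted by the extreme area and population values the way a linear axis is. This is a sketch using `DataFrame.corr(method='spearman')` on made-up numbers, not a result from the notebook; with the real data `demo` would be `vc`.

```python
import numpy as np
import pandas as pd

# Made-up values; NaN mimics the missing Gini coefficients
demo = pd.DataFrame({
    'population': [10, 200, 35, 1400, 5, 60],
    'gini': [27.0, np.nan, 33.0, 42.0, np.nan, 30.0],
    'borders': [2, 5, 0, 14, 1, 3],
    'area': [50, 900, 300, 9600, 2, 400],
    'visafree': [155, 60, 150, 45, 120, 140],
    'visawelc': [84, 20, 90, 10, 198, 70],
})

features = ['population', 'gini', 'borders', 'area']

# Spearman correlation of every characteristic with both visa columns;
# missing values are excluded pair by pair
rho = demo.corr(method='spearman').loc[features, ['visafree', 'visawelc']]

print(rho.round(2))
```

Values near 0 would support the "no strong pattern" readings above, while values closer to -1 or 1 would point to a monotonic relationship worth a closer look.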

Ipynb file with full tables can be found here.
