Stata.Professor : Your Partner in Research

Getting Started with Data Visualization in Python Pandas

DOWNLOAD DATASETS

To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

TED Talk Dataset Exercises

In [5]:
# Change directory
In [6]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment
In [7]:
import pandas as pd
In [8]:
ted = pd.read_csv('ted.csv')

1. Explore the data attributes

In [11]:
ted.dtypes
Out[11]:
comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object
In [12]:
ted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [13]:
ted.shape
Out[13]:
(2550, 17)
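
The info() output above reports 2544 non-null values for speaker_occupation against 2550 rows, so six entries are missing. A minimal sketch of how such gaps can be counted directly, using a small made-up frame (the `toy` data below is hypothetical, not the TED file):

```python
import pandas as pd

# Hypothetical stand-in for ted.csv, with one missing occupation
toy = pd.DataFrame({
    'main_speaker': ['A', 'B', 'C'],
    'speaker_occupation': ['Author', None, 'Physicist'],
    'comments': [10, 20, 30],
})

# Count missing values in each column
missing = toy.isna().sum()
print(missing)

# Show only the rows where the occupation is missing
print(toy[toy['speaker_occupation'].isna()])
```

On the real data, ted.isna().sum() should report the same gap that info() implies.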

2. Which talk has the most comments?

In [77]:
ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
Out[77]:
      comments  duration       main_speaker
1787      2673      1117     David Chalmers
201       2877      1099  Jill Bolte Taylor
644       3356      1386         Sam Harris
0         4553      1164       Ken Robinson
96        6404      1750    Richard Dawkins
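
Because sort_values sorts in ascending order by default, the most-commented talk is the last row of the output above. As an alternative sketch, nlargest returns the top rows directly in descending order, and idxmax gives the index label of the single maximum. The frame below is made-up data for illustration only:

```python
import pandas as pd

# Hypothetical data, not taken from ted.csv
toy = pd.DataFrame({
    'comments': [120, 450, 300, 640, 95],
    'main_speaker': ['A', 'B', 'C', 'D', 'E'],
})

# Top three rows by comment count, largest first
top3 = toy.nlargest(3, 'comments')
print(top3)

# Index label of the row with the most comments
top_label = toy['comments'].idxmax()
print(toy.loc[top_label, 'main_speaker'])
```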

3. Find the top 5 talks with the highest views-to-comments ratio

In [16]:
ted['view_to_comment'] = ted['views'] / ted['comments']
In [17]:
ted['view_to_comment'].tail()
Out[17]:
2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64
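
The cell above displays the last five ratios in file order; it does not rank them. One way to get the top five, sketched here on hypothetical data, is to sort by the new column in descending order and take the head:

```python
import pandas as pd

# Hypothetical data, not taken from ted.csv
toy = pd.DataFrame({
    'title': ['t1', 't2', 't3', 't4', 't5', 't6'],
    'views': [1000, 5000, 300, 9000, 1200, 400],
    'comments': [10, 25, 30, 15, 60, 8],
})

# Same ratio as in the tutorial: views divided by comments
toy['view_to_comment'] = toy['views'] / toy['comments']

# Rank by the ratio, highest first, and keep the first five rows
top5 = toy.sort_values('view_to_comment', ascending=False).head(5)
print(top5[['title', 'view_to_comment']])
```

If any talk had zero comments, the division would yield inf and that row would sort to the top; whether this occurs in ted.csv is not shown here.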

4. Create a histogram of comments

In [19]:
import matplotlib.pyplot as plt
ted['comments'].plot(kind = 'hist')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>

5. Create a histogram of comments for talks with fewer than 1000 comments

In [35]:
# Build a boolean mask that is True for rows with fewer than 1000 comments
index = ted['comments']<1000
In [38]:
# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']
In [39]:
# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>
In [40]:
# Once you are comfortable, you can do the above in one line
ted[ted['comments']<1000]['comments'].plot(kind='hist')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>
In [44]:
# How many rows were excluded from the above graph? (18 columns: the original 17 plus view_to_comment)
ted[ted['comments'] >=1000].shape
Out[44]:
(32, 18)

6. Do the same as in 5, but using the query method

In [68]:
# Filter the whole dataset where comments are less than 1000
ted1000 = ted.query('comments < 1000')
In [69]:
# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']
In [70]:
# Plot the filtered comments
comment1000.plot(kind = 'hist')
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>
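
Exercises 5 and 6 express the same filter in two ways. A quick sketch, on made-up numbers, that checks whether the boolean-mask route and the query route select identical rows:

```python
import pandas as pd

# Hypothetical comment counts, not taken from ted.csv
toy = pd.DataFrame({'comments': [50, 1500, 999, 1000, 20]})

# Route 1: boolean mask, selecting rows and column in one .loc call
by_mask = toy.loc[toy['comments'] < 1000, 'comments']

# Route 2: query string, then pick the column
by_query = toy.query('comments < 1000')['comments']

# True when both Series hold the same values at the same index labels
print(by_mask.equals(by_query))
```

Which form to use is largely a matter of taste; query tends to read more cleanly when several conditions are combined.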

7. How to add more bins to the histogram

In [71]:
comment1000.plot(kind = 'hist', bins = 20)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>
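
The bins argument sets how many equal-width intervals the data range is divided into. To inspect the counts behind the bars without drawing anything, a comparable binning can be sketched with numpy.histogram on hypothetical values:

```python
import numpy as np
import pandas as pd

# Hypothetical values, not taken from ted.csv
values = pd.Series([5, 12, 18, 25, 40, 41, 77, 90, 95, 100])

# Four wide bins versus ten narrow bins over the same range
counts4, edges4 = np.histogram(values, bins=4)
counts10, edges10 = np.histogram(values, bins=10)

print(counts4, edges4)
print(counts10, edges10)
```

More bins give a finer view of the distribution, at the cost of noisier bars when there are few observations per bin.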

8. Make a box plot and identify outliers

In [73]:
comment1000.plot(kind = 'box')
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>

The black dots beyond the whiskers are the outliers.
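
By default, matplotlib draws the whiskers out to at most 1.5 times the interquartile range (IQR) from the quartiles, and any point beyond that is plotted as an outlier. A minimal sketch of the same rule computed by hand, on made-up values:

```python
import pandas as pd

# Hypothetical values, not taken from ted.csv
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 95])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Points outside the fences are the candidates a box plot would flag
outliers = values[(values < lower) | (values > upper)]
print(outliers)
```

Applied to the filtered comments, the same steps would list the talks that appear as dots in the box plot.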
