DOWNLOAD DATASETS

To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

TED Talk Dataset Exercises

In [5]:
# Change directory
In [6]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment
In [7]:
import pandas as pd
In [8]:
ted = pd.read_csv('ted.csv')

1. Explore the data attributes

In [11]:
ted.dtypes
Out[11]:
comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object
In [12]:
ted.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB
In [13]:
ted.shape
Out[13]:
(2550, 17)
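
The `info()` output above reports 2544 non-null values for `speaker_occupation` against 2550 rows, so six entries are missing. One way to see such gaps per column is `isna().sum()`. The sketch below uses a small hypothetical frame (`demo` and its values are made up for illustration) rather than `ted.csv`:

```python
import pandas as pd

# Hypothetical stand-in for the TED data; one occupation is missing
demo = pd.DataFrame(
    {
        "main_speaker": ["A", "B", "C"],
        "speaker_occupation": ["Author", None, "Physicist"],
    }
)

# isna() marks missing cells as True; sum() counts them per column
missing = demo.isna().sum()
print(missing)
```

On the real data, `ted.isna().sum()` should be the equivalent call.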

2. Which talk has the most comments?

In [77]:
ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
Out[77]:
      comments  duration       main_speaker
1787      2673      1117     David Chalmers
201       2877      1099  Jill Bolte Taylor
644       3356      1386         Sam Harris
0         4553      1164       Ken Robinson
96        6404      1750    Richard Dawkins
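
Sorting and taking the tail works, but pandas can also look up the maximum directly. A minimal sketch, assuming a stand-in frame (`ted_demo` is a hypothetical name) that holds only the five rows shown in the output above:

```python
import pandas as pd

# Hypothetical stand-in for ted.csv; values copied from the output above
ted_demo = pd.DataFrame(
    {
        "comments": [4553, 6404, 2877, 3356, 2673],
        "duration": [1164, 1750, 1099, 1386, 1117],
        "main_speaker": [
            "Ken Robinson",
            "Richard Dawkins",
            "Jill Bolte Taylor",
            "Sam Harris",
            "David Chalmers",
        ],
    },
    index=[0, 96, 201, 644, 1787],
)

# idxmax returns the index label of the row with the largest value
top_label = ted_demo["comments"].idxmax()
top_talk = ted_demo.loc[top_label]

# nlargest returns the n rows with the largest values, largest first
top3 = ted_demo.nlargest(3, "comments")
print(top_talk["main_speaker"])
```

Unlike `sort_values(...).tail()`, `nlargest` lists the talk with the most comments first.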

3. Find the top 5 talks with the highest views-to-comments ratio

In [16]:
ted['view_to_comment'] = ted['views'] / ted['comments']
In [17]:
ted['view_to_comment'].tail()
Out[17]:
2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64
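
The `tail()` call above only shows the last five rows of the new column; it does not rank them. To get the five highest ratios, the column can be sorted in descending order first. A sketch on hypothetical toy data (`demo`, its titles and its numbers are made up):

```python
import pandas as pd

# Hypothetical toy data; the real values come from ted.csv
demo = pd.DataFrame(
    {
        "title": ["A", "B", "C", "D", "E", "F"],
        "views": [1000, 5000, 1200, 90000, 330, 4000],
        "comments": [10, 25, 100, 30, 3, 8],
    }
)

# Same ratio as in the cell above
demo["view_to_comment"] = demo["views"] / demo["comments"]

# Sort from highest to lowest ratio and keep the first five rows
top5 = demo.sort_values("view_to_comment", ascending=False).head(5)
print(top5[["title", "view_to_comment"]])
```

On the real data, the same pattern would be `ted.sort_values('view_to_comment', ascending=False).head(5)`.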

4. Create a histogram of comments

In [19]:
import matplotlib.pyplot as plt  # conventional alias; pandas draws its plots with matplotlib
ted['comments'].plot(kind = 'hist')
Out[19]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>

5. Create a histogram of comments for talks with fewer than 1000 comments

In [35]:
# Build a boolean mask that is True for rows with fewer than 1000 comments
index = ted['comments'] < 1000
In [38]:
# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']
In [39]:
# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
Out[39]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>
In [40]:
# Once you are comfortable, you can do all of the above in one line
ted[ted['comments']<1000]['comments'].plot(kind='hist')
Out[40]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>
In [44]:
# How many rows were excluded from the graph above?
# (18 columns: the original 17 plus view_to_comment)
ted[ted['comments'] >=1000].shape
Out[44]:
(32, 18)
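
Because the filter is a boolean Series, the excluded rows can also be counted by negating the mask and summing it, and `.loc` can select the rows and the column in a single step. A small sketch on hypothetical values (`demo` is not the TED data):

```python
import pandas as pd

# Hypothetical comment counts for illustration
demo = pd.DataFrame({"comments": [12, 999, 1000, 2500, 40]})

# True where the talk has fewer than 1000 comments
mask = demo["comments"] < 1000

# .loc takes the row mask and the column name together
under_1000 = demo.loc[mask, "comments"]

# ~mask flips the booleans; True counts as 1 when summed
n_excluded = (~mask).sum()
print(n_excluded)
```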

6. Do the same as in 5, but using the query method

In [68]:
# Filter the whole dataset where comments are less than 1000
ted1000 = ted.query('comments <1000')
In [69]:
# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']
In [70]:
# Plot the filtered comments
comment1000.plot(kind = 'hist')
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>
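
The query string and the boolean mask from exercise 5 should select the same rows. The sketch below checks this on hypothetical values and also shows that `query` can read a Python variable through the `@` prefix (`limit` is an illustrative name):

```python
import pandas as pd

# Hypothetical comment counts for illustration
demo = pd.DataFrame({"comments": [12, 999, 1000, 2500, 40]})

limit = 1000  # threshold kept in a variable instead of the query string

# @limit refers to the local Python variable named limit
by_query = demo.query("comments < @limit")

# The equivalent boolean-mask filter
by_mask = demo[demo["comments"] < limit]

same = by_query.equals(by_mask)
print(same)
```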

7. How to add more bins to the histogram

In [71]:
comment1000.plot(kind = 'hist', bins = 20)
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>
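
With an integer `bins` value, the range between the smallest and largest value is split into that many equal-width intervals. To inspect the counts without drawing anything, `numpy.histogram` applies the same kind of binning. A sketch on hypothetical values, using 4 bins to keep the output short:

```python
import numpy as np
import pandas as pd

# Hypothetical comment counts for illustration
comments = pd.Series([5, 40, 120, 260, 310, 480, 650, 720, 990])

# counts holds the number of values per bin; edges holds the bin boundaries
counts, edges = np.histogram(comments, bins=4)
print(counts)
print(edges)
```

There is always one more edge than there are bins, and the last bin includes its right edge.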

8. Make a box plot and identify outliers

In [73]:
comment1000.plot(kind = 'box')
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>

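The points plotted beyond the whiskers are the outliers. By matplotlib's default, these are values more than 1.5 times the interquartile range away from the box.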
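
To list the outliers rather than read them off the plot, the 1.5 × IQR rule that matplotlib's box plot uses by default can be applied by hand. A sketch on hypothetical values (the series below is made up; on the real data `comment1000` would take its place):

```python
import pandas as pd

# Hypothetical comment counts with one obviously large value
comments = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 500])

# First and third quartiles, and the interquartile range between them
q1 = comments.quantile(0.25)
q3 = comments.quantile(0.75)
iqr = q3 - q1

# Fences: anything outside them is treated as an outlier
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = comments[(comments < lower) | (comments > upper)]
print(outliers)
```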
