DOWNLOAD DATASETS¶

To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv

TED Talk Dataset Exercises¶

# Change directory

cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"

D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment

import pandas as pd

ted = pd.read_csv('ted.csv')

1. Explore the data attributes¶

ted.dtypes

comments               int64
description           object
duration               int64
event                 object
film_date              int64
languages              int64
main_speaker          object
name                  object
num_speaker            int64
published_date         int64
ratings               object
related_talks         object
speaker_occupation    object
tags                  object
title                 object
url                   object
views                  int64
dtype: object

ted.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2550 entries, 0 to 2549
Data columns (total 17 columns):
comments              2550 non-null int64
description           2550 non-null object
duration              2550 non-null int64
event                 2550 non-null object
film_date             2550 non-null int64
languages             2550 non-null int64
main_speaker          2550 non-null object
name                  2550 non-null object
num_speaker           2550 non-null int64
published_date        2550 non-null int64
ratings               2550 non-null object
related_talks         2550 non-null object
speaker_occupation    2544 non-null object
tags                  2550 non-null object
title                 2550 non-null object
url                   2550 non-null object
views                 2550 non-null int64
dtypes: int64(7), object(10)
memory usage: 338.8+ KB

ted.shape

(2550, 17)
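The info() output above lists 2544 non-null values for speaker_occupation out of 2550 rows, so 6 entries are missing, and it stores film_date and published_date as int64. The sketch below shows one way to count missing values per column and to convert an integer date column. The three-row frame is a made-up stand-in for ted, and treating film_date as Unix timestamps in seconds is an assumption, not something the output above confirms.

```python
import pandas as pd

# Hypothetical three-row stand-in for the TED data (values are made up)
sample = pd.DataFrame({
    "speaker_occupation": ["Author", None, "Neuroscientist"],
    "film_date": [1140825600, 1204070400, 1330473600],
})

# Count missing values in each column
missing = sample.isna().sum()
print(missing)

# Assumption: film_date holds Unix timestamps measured in seconds
sample["film_datetime"] = pd.to_datetime(sample["film_date"], unit="s")
print(sample[["film_date", "film_datetime"]])
```

On the real data, the same two calls would be ted.isna().sum() and pd.to_datetime(ted['film_date'], unit='s'), provided the timestamp assumption holds.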

2. Which talk has the most comments?¶

ted.sort_values('comments')[['comments', 'duration', 'main_speaker']].tail()
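Sorting the whole frame and taking the tail works, but pandas also offers nlargest and idxmax for this kind of question. The following is a minimal sketch on a made-up four-row frame; the speaker names and numbers are placeholders, not values from ted.csv.

```python
import pandas as pd

# Hypothetical stand-in for the TED data (names and numbers are made up)
sample = pd.DataFrame({
    "comments": [120, 45, 980, 310],
    "duration": [600, 750, 1100, 900],
    "main_speaker": ["Speaker A", "Speaker B", "Speaker C", "Speaker D"],
})

# The two most-commented talks, largest first
top = sample.nlargest(2, "comments")[["comments", "duration", "main_speaker"]]
print(top)

# The single row with the most comments
most_commented = sample.loc[sample["comments"].idxmax()]
print(most_commented["main_speaker"])
```

Unlike sort_values(...).tail(), nlargest returns the rows in descending order, so the most-commented talk comes first.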

3. Find the top 5 talks with the highest views-to-comments ratio¶

ted['view_to_comment'] = ted['views'] / ted['comments']

ted['view_to_comment'].tail()

2545    26495.882353
2546    69578.333333
2547    37564.700000
2548    13103.406250
2549    48965.125000
Name: view_to_comment, dtype: float64
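The tail() call above only shows the last five rows of the new column; it does not rank them. One way to get the five highest ratios is sketched below on a made-up seven-row frame (titles and counts are placeholders).

```python
import pandas as pd

# Hypothetical stand-in for the TED data (titles and counts are made up)
sample = pd.DataFrame({
    "title": [f"Talk {i}" for i in range(1, 8)],
    "views": [1000, 5000, 2000, 9000, 360, 4000, 8000],
    "comments": [10, 20, 40, 30, 3, 80, 16],
})

# Same ratio as in the exercise
sample["view_to_comment"] = sample["views"] / sample["comments"]

# Five rows with the highest ratio, largest first
top5 = sample.nlargest(5, "view_to_comment")
print(top5[["title", "view_to_comment"]])
```

On the real data, the equivalent call would be ted.nlargest(5, 'view_to_comment'), or ted.sort_values('view_to_comment', ascending=False).head().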

4. Create a histogram of comments¶

# plt is the conventional alias for matplotlib.pyplot
import matplotlib.pyplot as plt
ted['comments'].plot(kind='hist')

<matplotlib.axes._subplots.AxesSubplot at 0x14a233e3978>

5. Create histogram of comments where comments are less than 1000¶

# Build a boolean mask that is True for rows with fewer than 1000 comments
index = ted['comments'] < 1000

# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']

# Make a plot of these filtered comments
com1000.plot(kind = 'hist')

<matplotlib.axes._subplots.AxesSubplot at 0x14a236dc7b8>

# Once you are comfortable with this, you can do the above in one line
ted[ted['comments'] < 1000]['comments'].plot(kind='hist')

<matplotlib.axes._subplots.AxesSubplot at 0x14a2375cac8>

# How many rows were excluded from the above graph?
ted[ted['comments'] >= 1000].shape

(32, 18)
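The filtering in this exercise selects rows with a boolean mask and then picks a column in a second step. As a hedged alternative, .loc can do both in one indexing operation, and negating the mask counts the excluded rows directly. The five values below are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the comments column (values are made up)
sample = pd.DataFrame({"comments": [50, 1200, 999, 1000, 300]})

# Boolean mask: True where a row has fewer than 1000 comments
mask = sample["comments"] < 1000

# Select rows and column in a single .loc call
below = sample.loc[mask, "comments"]
print(below.tolist())

# Number of rows the mask excludes (True counts as 1 when summed)
excluded = (~mask).sum()
print(excluded)
```

Note that the boundary value 1000 is excluded by the strict < comparison, which matches the >= 1000 filter used above to count the dropped rows.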

6. Do the same as in 5, but using the query method¶

# Filter the whole dataset to rows where comments are less than 1000
ted1000 = ted.query('comments < 1000')

# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']

# Plot the filtered comments
comment1000.plot(kind = 'hist')

<matplotlib.axes._subplots.AxesSubplot at 0x14a238fb630>
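The query string above hard-codes the cutoff. DataFrame.query can also reference a Python variable by prefixing its name with @, which makes the threshold easy to change. A minimal sketch on made-up values; the name threshold is chosen here for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the TED data (values are made up)
sample = pd.DataFrame({"comments": [50, 1200, 999, 1000, 300]})

# Hypothetical variable holding the cutoff
threshold = 1000

# @threshold refers to the Python variable defined above
below = sample.query("comments < @threshold")["comments"]
print(below.tolist())
```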

7. How to add more bins to the histogram¶

comment1000.plot(kind = 'hist', bins = 20)

<matplotlib.axes._subplots.AxesSubplot at 0x14a23953278>
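The bins argument sets how many equal-width intervals the data range is split into; pandas uses 10 by default. To see the numbers behind a 20-bin histogram without drawing it, numpy.histogram returns the counts and the bin edges. The ten values below are made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical comment counts (values are made up)
values = pd.Series([5, 12, 47, 88, 150, 230, 410, 640, 870, 990])

# 20 equal-width bins between the minimum and maximum value
counts, edges = np.histogram(values, bins=20)

print(len(counts))   # number of bins
print(len(edges))    # one more edge than bins
print(counts.sum())  # every value falls into exactly one bin
```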

8. Make a box plot and identify outliers¶

comment1000.plot(kind = 'box')

<matplotlib.axes._subplots.AxesSubplot at 0x14a23a4ba20>

The points plotted beyond the whiskers are the outliers.
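By default, matplotlib draws box-plot whiskers out to 1.5 times the interquartile range (IQR) from the box, and anything further away is drawn as an individual point. The sketch below applies the same rule by hand to list the outlying values; the eight numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical comment counts with one obvious outlier (values are made up)
values = pd.Series([10, 12, 14, 15, 16, 18, 20, 95])

# Quartiles and interquartile range
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are flagged as outliers
outliers = values[(values < lower) | (values > upper)]
print(outliers.tolist())
```

On the real data, the same steps applied to comment1000 would list the talks that appear as separate points in the box plot.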

Output of the sort_values call in exercise 2 (the five most-commented talks):

      comments  duration       main_speaker
1787      2673      1117     David Chalmers
201       2877      1099  Jill Bolte Taylor
644       3356      1386         Sam Harris
0         4553      1164       Ken Robinson
96        6404      1750    Richard Dawkins