DOWNLOAD DATASETS¶
To download the datasets used in this tutorial, please see the following links:
1. gapminder.tsv
2. pew.csv
3. billboard.csv
4. ebola.csv
5. tips.csv
TED Talk Dataset Exercises¶
In [5]:
# Change directory
In [6]:
cd "D:\Dropbox\CLASSES\Data Science for Finance\Python\Lecture 1 - Assignment"
In [7]:
import pandas as pd
In [8]:
ted = pd.read_csv('ted.csv')
1. Explore the data attributes¶
In [11]:
ted.dtypes
Out[11]:
In [12]:
ted.info()
In [13]:
ted.shape
Out[13]:
2. Which talk has the most comments?¶
In [77]:
ted.sort_values('comments')[['comments', 'duration','main_speaker']].tail()
Out[77]:
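As an alternative to sorting the whole frame and reading the last row of tail(), the sketch below uses nlargest and idxmax. The talks frame is a made-up stand-in for ted.csv, so the speaker names and numbers are purely illustrative.

```python
import pandas as pd

# Hypothetical stand-in for ted.csv; all values are invented for illustration.
talks = pd.DataFrame({
    "main_speaker": ["A", "B", "C", "D"],
    "comments": [120, 6400, 45, 980],
    "duration": [900, 1200, 600, 1500],
})

# nlargest returns the n rows with the largest values, sorted in descending order.
top = talks.nlargest(2, "comments")[["comments", "duration", "main_speaker"]]

# idxmax returns the index label of the single largest value.
most_commented = talks.loc[talks["comments"].idxmax(), "main_speaker"]
print(top)
print(most_commented)
```

With nlargest the most-commented talk appears in the first row rather than the last.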
3. Find the top 5 talks with the highest views-to-comments ratio¶
In [16]:
ted['view_to_comment'] = ted['views'] / ted['comments']
In [17]:
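# Sort by the ratio so that tail() returns the 5 highest values, not just the last 5 rows
ted.sort_values('view_to_comment')[['view_to_comment', 'main_speaker']].tail()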
Out[17]:
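One detail worth checking when ranking by this ratio: a talk with zero comments produces an infinite ratio. The sketch below, on a made-up frame (the titles and numbers are hypothetical), treats such ratios as missing before picking the top 5. Whether ted.csv actually contains zero-comment talks is not something this notebook shows, so take the replacement step as an assumption.

```python
import numpy as np
import pandas as pd

# Hypothetical data; talk_2 has zero comments on purpose.
talks = pd.DataFrame({
    "title": [f"talk_{i}" for i in range(7)],
    "views": [1000, 50000, 200, 80000, 3000, 12000, 500],
    "comments": [10, 25, 0, 20, 300, 8, 25],
})

ratio = talks["views"] / talks["comments"]

# Division by zero yields inf; replace it with NaN so it cannot win the ranking.
talks["view_to_comment"] = ratio.replace([np.inf, -np.inf], np.nan)

# The five rows with the largest ratio, highest first.
top5 = talks.nlargest(5, "view_to_comment")
print(top5[["title", "view_to_comment"]])
```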
4. Create a histogram of comments¶
In [19]:
import matplotlib.pyplot as plt
ted['comments'].plot(kind = 'hist')
Out[19]:
5. Create a histogram of comments for talks with fewer than 1000 comments¶
In [35]:
# Build a boolean mask that is True for rows with fewer than 1000 comments
index = ted['comments']<1000
In [38]:
# Get only the comments column from these filtered rows
com1000 = ted[index]['comments']
In [39]:
# Make a plot of these filtered comments
com1000.plot(kind = 'hist')
Out[39]:
In [40]:
# Once you are comfortable with this, you can do the above in a single line
ted[ted['comments']<1000]['comments'].plot(kind='hist')
Out[40]:
In [44]:
# How many rows were excluded from the above graph? (the first value of the shape tuple)
ted[ted['comments'] >=1000].shape
Out[44]:
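The filtering steps above rely on a boolean mask. As a small sketch on made-up numbers (the talks frame is hypothetical, not ted.csv), the same mask can select the column through .loc and also count the kept and excluded rows directly, since True counts as 1 when summed.

```python
import pandas as pd

# Hypothetical comment counts for illustration.
talks = pd.DataFrame({"comments": [12, 250, 999, 1000, 4300, 87]})

# Boolean Series: True where the talk has fewer than 1000 comments.
mask = talks["comments"] < 1000

# Select rows and column in one step with .loc.
kept = talks.loc[mask, "comments"]

# Summing a boolean Series counts the True values; ~ inverts the mask.
n_kept = int(mask.sum())
n_excluded = int((~mask).sum())
print(n_kept, n_excluded)
```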
6. Do the same as in 5, but using the query method¶
In [68]:
# Filter the whole dataset where comments are less than 1000
ted1000 = ted.query('comments < 1000')
In [69]:
# Get only the comments column from the reduced dataset
comment1000 = ted1000['comments']
In [70]:
# Plot the filtered comments
comment1000.plot(kind = 'hist')
Out[70]:
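A query string can also refer to a Python variable by prefixing it with @, which helps when the threshold is not hard-coded. The sketch below uses a hypothetical helper, filter_below, and made-up data to show that query and a boolean mask select the same rows.

```python
import pandas as pd


def filter_below(df, limit):
    """Return the rows of df whose 'comments' value is below limit.

    filter_below is a hypothetical helper written for this example.
    """
    # @limit refers to the local Python variable named limit.
    return df.query("comments < @limit")


# Hypothetical data for illustration.
talks = pd.DataFrame({
    "comments": [12, 250, 999, 1000, 4300, 87],
    "views": [900, 15000, 42000, 88000, 310000, 5000],
})

by_query = filter_below(talks, 1000)
by_mask = talks[talks["comments"] < 1000]

# Both approaches return the same rows in the same order.
print(by_query.equals(by_mask))
```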
7. How to add more bins to the histogram¶
In [71]:
comment1000.plot(kind = 'hist', bins = 20)
Out[71]:
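To see what the bins argument changes, it can help to compute the bin edges and counts without drawing anything. The sketch below uses numpy.histogram on made-up, evenly spaced values (not the TED data). With bins=20 it returns 20 counts and 21 edges, compared with the 10 bins that the pandas histogram plot uses by default.

```python
import numpy as np
import pandas as pd

# 200 hypothetical values: 0, 5, 10, ..., 995.
comments = pd.Series(range(0, 1000, 5))

# Split the value range into 20 equal-width bins.
counts, edges = np.histogram(comments, bins=20)

print(len(counts))   # number of bins
print(len(edges))    # one more edge than there are bins
print(counts.sum())  # every value falls into exactly one bin
```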
8. Make a box plot and identify outliers¶
In [73]:
comment1000.plot(kind = 'box')
Out[73]:
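The individual points beyond the whiskers are the outliers. By default, these are values more than 1.5 times the interquartile range (IQR) below the first quartile or above the third quartile.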
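The points a box plot marks as outliers can also be listed numerically. The sketch below applies the common 1.5 × IQR rule, which matches matplotlib's default whisker setting, to a made-up series. The values are hypothetical and chosen so that exactly one point stands out.

```python
import pandas as pd

# Hypothetical comment counts with one obviously extreme value.
comments = pd.Series([10, 20, 30, 40, 50, 60, 70, 80, 90, 900])

q1 = comments.quantile(0.25)   # first quartile
q3 = comments.quantile(0.75)   # third quartile
iqr = q3 - q1                  # interquartile range

# Fences: anything outside [lower, upper] counts as an outlier.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

outliers = comments[(comments < lower) | (comments > upper)]
print(outliers.tolist())
```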