Tuesday, February 16, 2016

Introduction to Pandas

Introduction

One of the biggest problem's today is how we can quickly analyze data. For example, let's say we wanted to give a specific class in school a survey asking each student their favorite food and favorite color and we wanted to know the most popular color and food for the theme of the next dance. Initially, it would be practical by counting. However, what would happen if we increased the survey population to include even more classes, or even more schools? Counting would just be too tedious and eventually be impossible because it would take too long.This is the same problem in the real world.

The solution to this problem is being able to store the data in an efficient manner. In order to store information in an organized manner, programmers use dataframes. A data frame is a table (or a two-dimensional array-like structure) which stores data. Each column contains measurements on one variable and each row contains one case. Using this data frame, we can easily access the information for analysis.

In this post, we will study data frames in Python using the library called Pandas

Data Frame Basics in Pandas

In this next section, we will start pandas, read in a csv file to store in a data frame, and then explore its different attributes.

In [3]:
import pandas as pd # start pandas
df = pd.read_csv("./data/2012/weather-2012-01-01.csv") # read in csv file from path and store in data frame, df
print df.shape # (number of rows, number of columns)
(23, 14)
In [4]:
print df.columns # names of columns
Index([u'TimeEST', u'TemperatureF', u'Dew PointF', u'Humidity',
       u'Sea Level PressureIn', u'VisibilityMPH', u'Wind Direction',
       u'Wind SpeedMPH', u'Gust SpeedMPH', u'PrecipitationIn', u'Events',
       u'Conditions', u'WindDirDegrees', u'DateUTC'],
      dtype='object')
In [5]:
print df.index # names of rows
Int64Index([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
            17, 18, 19, 20, 21, 22],
           dtype='int64')
In [8]:
print df.dtypes # data types of columns
TimeEST                  object
TemperatureF            float64
Dew PointF              float64
Humidity                  int64
Sea Level PressureIn    float64
VisibilityMPH           float64
Wind Direction           object
Wind SpeedMPH            object
Gust SpeedMPH            object
PrecipitationIn         float64
Events                  float64
Conditions               object
WindDirDegrees            int64
DateUTC                  object
dtype: object
In [9]:
print df.describe() # shows 1-var statistics of numeric columns of data frame (mean, std, min ...)
       TemperatureF  Dew PointF   Humidity  Sea Level PressureIn  \
count     23.000000   23.000000  23.000000             23.000000   
mean      52.613043   39.247826  63.782609             30.032174   
std       12.038699    6.323180  19.376547              0.117162   
min       35.100000   32.000000  34.000000             29.870000   
25%       39.100000   33.800000  46.000000             29.915000   
50%       57.900000   37.900000  58.000000             30.040000   
75%       63.500000   44.550000  81.500000             30.150000   
max       66.900000   52.000000  92.000000             30.160000   

       VisibilityMPH  PrecipitationIn  Events  WindDirDegrees  
count             23             1.00       0       23.000000  
mean              10             0.01     NaN      131.304348  
std                0              NaN     NaN      121.292447  
min               10             0.01     NaN        0.000000  
25%               10             0.01     NaN        0.000000  
50%               10             0.01     NaN      200.000000  
75%               10             0.01     NaN      210.000000  
max               10             0.01     NaN      310.000000  
In [10]:
print df.head(5) # shows first five records of a dataframe.
    TimeEST  TemperatureF  Dew PointF  Humidity  Sea Level PressureIn  \
0  12:51 AM          39.2        33.8        81                 30.15   
1   1:51 AM          39.2        33.8        81                 30.16   
2   2:51 AM          39.0        34.0        82                 30.15   
3   3:51 AM          36.0        33.1        89                 30.15   
4   4:51 AM          35.1        33.1        92                 30.15   

   VisibilityMPH Wind Direction Wind SpeedMPH Gust SpeedMPH  PrecipitationIn  \
0             10           Calm          Calm             -              NaN   
1             10           Calm          Calm             -              NaN   
2             10           Calm          Calm             -              NaN   
3             10           Calm          Calm             -              NaN   
4             10           Calm          Calm             -              NaN   

   Events Conditions  WindDirDegrees              DateUTC  
0     NaN      Clear               0  2012-01-01 05:51:00  
1     NaN      Clear               0  2012-01-01 06:51:00  
2     NaN      Clear               0  2012-01-01 07:51:00  
3     NaN      Clear               0  2012-01-01 08:51:00  
4     NaN      Clear               0  2012-01-01 09:51:00  

Extracting Data

In this section, we will slice information from our data frame.

In [12]:
print df.loc[:,['TemperatureF', 'Humidity']] # shows all records for the two columns: Temperature and Humidity
    TemperatureF  Humidity
0           39.2        81
1           39.2        81
2           39.0        82
3           36.0        89
4           35.1        92
5           37.0        86
6           37.9        83
7           37.9        86
8           46.9        71
9           53.6        58
10          60.1        44
11          63.0        38
12          66.9        34
13          66.0        38
14          66.9        44
15          64.9        48
16          64.0        54
17          64.0        56
18          60.1        75
19          59.0        78
20          60.1        57
21          57.9        51
22          55.4        41
In [14]:
df.loc[df.TemperatureF > 40,['TemperatureF', 'Humidity']]
#boolean indexing: Shows temperature and humidity for all records that have a temperature greater than 40
Out[14]:
TemperatureF Humidity
8 46.9 71
9 53.6 58
10 60.1 44
11 63.0 38
12 66.9 34
13 66.0 38
14 66.9 44
15 64.9 48
16 64.0 54
17 64.0 56
18 60.1 75
19 59.0 78
20 60.1 57
21 57.9 51
22 55.4 41

In this next experiment, we read all the csv files from a specific directory and concatenate all the data frames together.

In [15]:
import glob
path = "./data/2012" # path for csv files
allFiles = glob.glob(path + "/*.csv") # store all the csv files into "allFiles"
df2012 = None
first_time = True
for myfile in allFiles: # run through each csv file
    df5 = pd.read_csv(myfile)
    s = df5.columns
    scopy = [s[i] for i in range(0, len(s))]
    scopy[0] = 'TimeEST'
    df5.columns = scopy
    if first_time:# check if this is the first data frame.
        df2012 = df5
        first_time = False
    else:
        df2012 = df2012.append(df5, ignore_index=True) #concatenate the data frames

print df2012.shape
print df2012.describe()
(9617, 14)
       TemperatureF   Dew PointF     Humidity  Sea Level PressureIn  \
count   9617.000000  9617.000000  9616.000000           9617.000000   
mean      62.349756    50.201497    70.014247             30.040644   
std       15.410849   103.758176    19.626160              0.197613   
min       19.400000 -9999.000000    16.000000             29.290000   
25%       50.000000    37.900000    54.000000             29.910000   
50%       64.000000    54.000000    73.000000             30.040000   
75%       73.900000    64.900000    88.000000             30.170000   
max      105.100000    75.900000   100.000000             30.690000   

       VisibilityMPH  PrecipitationIn  WindDirDegrees  
count    9617.000000      1169.000000     9617.000000  
mean        9.062036         0.053336      118.418426  
std         2.323571         0.118034      111.996815  
min         0.100000         0.000000        0.000000  
25%        10.000000         0.000000        0.000000  
50%        10.000000         0.010000       90.000000  
75%        10.000000         0.050000      230.000000  
max        10.000000         1.420000      360.000000  

Plotting

Visual Data is much more easily understandable. To make data more visual, we will now plot the dataframes in differenty types of charts.

Univariate numeric plotting

The first type of plot is a univariate numeric plot. We analyze different numeric columns by using different graphs like the histogram, the density plot, and the box plot.

In [16]:
%matplotlib inline
df2012[['Humidity']].hist(alpha=0.5, bins=5) # plot histogram of humidity
Out[16]:
array([[<matplotlib.axes._subplots.AxesSubplot object at 0x0000000008BE76D8>]], dtype=object)
In [17]:
df2012[['TemperatureF']].plot(kind="kde") # plot density plot of temperature
Out[17]:
<matplotlib.axes._subplots.AxesSubplot at 0x9524e10>
In [18]:
df2012['Humidity'].plot(kind="box") # plot box plot of humidity
Out[18]:
<matplotlib.axes._subplots.AxesSubplot at 0x95245f8>

Univariate Categorial plotting

The second type of plot is a univariate categorical plot. We analyze different categorical columns by using different graphs like the barplot and the pie chart.

In [27]:
df2012['Conditions'].value_counts().plot(kind='bar') # plot bar plot of conditions
Out[27]:
<matplotlib.axes._subplots.AxesSubplot at 0xca44ef0>
In [29]:
df2012['Conditions'].value_counts().plot(kind='pie', figsize=(7,7)) # plot pie chart
Out[29]:
<matplotlib.axes._subplots.AxesSubplot at 0xcf12a90>

Bivariate numeric plotting

The third type of plot is a bivariate numeric plot. We analyze two different numeric columns and see if there is correlation between the two variables by using the scatterplot.

In [30]:
df2012.plot(kind="scatter", x='Humidity', y='TemperatureF', s=2) # plot scatterplot of Humidity against Temperature.
Out[30]:
<matplotlib.axes._subplots.AxesSubplot at 0xd6687b8>

Bivariate: categorical and numeric plotting

The fourth type of plot is a bivariate categorical and numeric plot. We analyze one categorical and one numeric column and see if there is correlation between the two variables by using either multi boxplot or multi density plot.

In [33]:
import matplotlib.pyplot as plt
df2012.boxplot(column='TemperatureF', by='Conditions')# plot multiboxplot of Temperature and Conditions
Out[33]:
<matplotlib.axes._subplots.AxesSubplot at 0xe0c59e8>
In [35]:
def plotdensity(df_var):
    df_var['TemperatureF'].plot(kind='kde')
    
df2012.loc[0:700,['TemperatureF', 'Conditions']].groupby(['Conditions']).apply(plotdensity)
# Plot multi density plot of Temperature against Conditions
Out[35]: