Introduction¶
One of the biggest problems today is how to analyze data quickly. For example, suppose we gave one class in a school a survey asking each student for their favorite food and favorite color, so that we could pick the most popular color and food as the theme of the next dance. At first, counting the answers by hand would be practical. But what would happen if we extended the survey to more classes, or even more schools? Counting by hand would become tedious and eventually impractical because it would take too long. Real-world data analysis faces the same problem.
The solution to this problem starts with storing the data in an organized, efficient manner. To do this, programmers use data frames. A data frame is a table (a two-dimensional, array-like structure) that stores data: each column contains the measurements of one variable, and each row contains one case. Once the data is in a data frame, we can easily access it for analysis.
In this post, we will study data frames in Python using the Pandas library.
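To make the survey example concrete, here is a minimal sketch of such a table built by hand. The student names and answers are made up for illustration; `pd.DataFrame`, `value_counts`, and `idxmax` are standard Pandas calls.

```python
import pandas as pd

# Hypothetical survey answers: one row per student, one column per question.
survey = pd.DataFrame({
    "student": ["Ana", "Ben", "Cho", "Dev", "Eli"],
    "favorite_food": ["pizza", "tacos", "pizza", "sushi", "pizza"],
    "favorite_color": ["blue", "red", "blue", "green", "blue"],
})

# value_counts() tallies each answer; idxmax() returns the most frequent one.
top_food = survey["favorite_food"].value_counts().idxmax()
top_color = survey["favorite_color"].value_counts().idxmax()
print(top_food, top_color)
```

The same two lines of counting work whether the table holds five students or five thousand, which is the point of putting the answers in a data frame first.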
Data Frame Basics in Pandas¶
In this section, we will import Pandas, read a CSV file into a data frame, and then explore its attributes.
import pandas as pd # import the pandas library
df = pd.read_csv("./data/2012/weather-2012-01-01.csv") # read the csv file at this path into a data frame, df
print(df.shape) # (number of rows, number of columns)
print(df.columns) # names of columns
print(df.index) # row labels (the index)
print(df.dtypes) # data types of columns
print(df.describe()) # summary statistics of the numeric columns (count, mean, std, min ...)
print(df.head(5)) # first five records of the data frame
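If the weather file is not at hand, the same attributes can be tried on a small in-memory CSV. This is only a sketch: the rows below are invented, and the real file may well have more columns than the four used here.

```python
import io
import pandas as pd

# Made-up CSV text standing in for a weather file (values are illustrative only).
csv_text = """TimeEST,TemperatureF,Humidity,Conditions
12:51 AM,37.9,62,Clear
1:51 AM,37.0,65,Clear
2:51 AM,36.0,70,Overcast
"""

# read_csv accepts a file-like object as well as a path
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.shape)          # (number of rows, number of columns)
print(list(sample.columns))  # column names as a plain list
print(sample.dtypes)         # one data type per column
```

Here the decimal temperatures are read as floating-point numbers and the whole-number humidity values as integers, which is why checking `dtypes` right after loading is a useful habit.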
Extracting Data¶
In this section, we will slice information from our data frame.
print(df.loc[:, ['TemperatureF', 'Humidity']]) # all records for two columns: TemperatureF and Humidity
# boolean indexing: temperature and humidity for all records with a temperature greater than 40
print(df.loc[df.TemperatureF > 40, ['TemperatureF', 'Humidity']])
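Boolean indexing is easier to follow when the mask is looked at on its own. The sketch below uses a small invented table (the name `readings` and its values are assumptions) to show that the comparison produces one True/False value per row, and that `loc` keeps only the rows marked True.

```python
import pandas as pd

# Invented readings, reusing the column names from the weather data.
readings = pd.DataFrame({
    "TemperatureF": [35.1, 41.0, 44.6, 39.9],
    "Humidity": [80, 62, 55, 71],
    "Conditions": ["Overcast", "Clear", "Clear", "Overcast"],
})

mask = readings["TemperatureF"] > 40  # boolean Series, one entry per row
warm = readings.loc[mask, ["TemperatureF", "Humidity"]]

# Conditions can be combined with & (and) or | (or); each needs its own parentheses.
warm_and_dry = readings.loc[(readings["TemperatureF"] > 40) & (readings["Humidity"] < 60)]

print(mask.tolist())
print(warm)
print(warm_and_dry)
```

Note that the selected rows keep their original index labels, so `warm` is indexed by 1 and 2 rather than renumbered from 0.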
In this next experiment, we read all the csv files from a specific directory and concatenate all the data frames together.
import glob
path = "./data/2012" # path for csv files
allFiles = glob.glob(path + "/*.csv") # list every csv file in the directory
frames = [] # one data frame per file
for myfile in allFiles: # run through each csv file
    df5 = pd.read_csv(myfile)
    df5.columns = ['TimeEST'] + list(df5.columns[1:]) # rename the first column to 'TimeEST' in every file
    frames.append(df5)
df2012 = pd.concat(frames, ignore_index=True) # concatenate all the data frames into one
print(df2012.shape)
print(df2012.describe())
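The effect of renaming the first column before concatenating can be seen on two tiny frames. This is a sketch under an assumption: the post does not say why the first column is renamed, so the `TimeEDT` name below is only a guess at the kind of mismatch the loop guards against.

```python
import pandas as pd

# Two invented frames standing in for two daily files.
# The differing first column name (TimeEDT) is a hypothetical mismatch.
day1 = pd.DataFrame({"TimeEST": ["12:51 AM", "1:51 AM"], "TemperatureF": [37.9, 37.0]})
day2 = pd.DataFrame({"TimeEDT": ["12:51 AM", "1:51 AM"], "TemperatureF": [55.0, 54.0]})

# Without renaming, concat keeps both time columns and fills the gaps with NaN.
unaligned = pd.concat([day1, day2], ignore_index=True)

# With renaming, the columns line up and the rows simply stack.
frames = []
for frame in (day1, day2):
    frames.append(frame.rename(columns={frame.columns[0]: "TimeEST"}))
combined = pd.concat(frames, ignore_index=True)

print(unaligned.shape)
print(combined.shape)
print(combined.index.tolist())
```

Passing `ignore_index=True` renumbers the rows from 0, so the combined frame does not carry repeated index labels from the individual files.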
Plotting¶
Visual data is much easier to understand. To make the data more visual, we will now plot the data frame in different types of charts.
Univariate numeric plotting¶
The first type of plot is a univariate numeric plot. We analyze a single numeric column at a time using graphs such as the histogram, the density plot, and the box plot.
%matplotlib inline
df2012[['Humidity']].hist(alpha=0.5, bins=5) # plot histogram of humidity
df2012[['TemperatureF']].plot(kind="kde") # plot density plot of temperature
df2012['Humidity'].plot(kind="box") # plot box plot of humidity
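It can help to compute the numbers a box plot is drawn from. The sketch below uses five invented humidity values; the box spans the first to the third quartile with a line at the median, and with Matplotlib's default settings the whiskers reach the farthest data points within 1.5 times the interquartile range (IQR).

```python
import pandas as pd

# Invented humidity readings, chosen so the quartiles are easy to check by eye.
humidity = pd.Series([40, 50, 60, 70, 80], name="Humidity")

q1, median, q3 = humidity.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1                 # height of the box
lower_fence = q1 - 1.5 * iqr  # points below this would be drawn as outliers
upper_fence = q3 + 1.5 * iqr  # points above this would be drawn as outliers

print(q1, median, q3, iqr)
print(lower_fence, upper_fence)
```

All five values fall between the two fences, so a box plot of this series would show no outlier markers.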
Univariate categorical plotting¶
The second type of plot is a univariate categorical plot. We analyze a single categorical column at a time using graphs such as the bar plot and the pie chart.
df2012['Conditions'].value_counts().plot(kind='bar') # plot bar plot of conditions
df2012['Conditions'].value_counts().plot(kind='pie', figsize=(7,7)) # plot pie chart
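Both charts are drawn from the table that `value_counts()` returns, so it is worth looking at that table directly. A minimal sketch with invented condition labels: the raw counts are what the bar heights show, and `normalize=True` gives the shares that correspond to the pie slices.

```python
import pandas as pd

# Invented weather conditions, one entry per observation.
conditions = pd.Series(
    ["Clear", "Overcast", "Clear", "Rain", "Clear", "Overcast"],
    name="Conditions",
)

counts = conditions.value_counts()                # occurrences per category, most frequent first
shares = conditions.value_counts(normalize=True)  # same table as fractions that sum to 1

print(counts)
print(shares.round(2))
```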
Bivariate numeric plotting¶
The third type of plot is a bivariate numeric plot. We take two numeric columns and use a scatter plot to see whether there is a correlation between the two variables.
df2012.plot(kind="scatter", x='Humidity', y='TemperatureF', s=2) # plot scatter plot of TemperatureF (y) against Humidity (x)
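A scatter plot shows a relationship visually; a correlation coefficient puts a number on it. The sketch below computes the Pearson correlation with `Series.corr` on five invented pairs of values. The data is made up so that temperature falls as humidity rises, and says nothing about the real 2012 weather files.

```python
import pandas as pd

# Invented readings in which temperature drops as humidity rises.
weather = pd.DataFrame({
    "Humidity": [90, 80, 70, 60, 50],
    "TemperatureF": [31, 34, 41, 44, 50],
})

# Pearson correlation: +1 is a perfect rising line, -1 a perfect falling line, 0 no linear relation.
r = weather["Humidity"].corr(weather["TemperatureF"])
print(round(r, 3))
```

For this toy data the coefficient is strongly negative, matching the downward trend a scatter plot of the same points would show.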
Bivariate: categorical and numeric plotting¶
The fourth type of plot is a bivariate categorical and numeric plot. We take one categorical and one numeric column and use either a multi box plot or a multi density plot to see whether the two variables are related.
import matplotlib.pyplot as plt
df2012.boxplot(column='TemperatureF', by='Conditions') # plot multi box plot of Temperature by Conditions
def plotdensity(df_var):
    df_var['TemperatureF'].plot(kind='kde')
# plot multi density plot of Temperature, one curve per Conditions group
df2012.loc[0:700, ['TemperatureF', 'Conditions']].groupby(['Conditions']).apply(plotdensity)
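The grouped plots above have a numeric counterpart: the same `groupby` can return a summary table per category. This is a minimal sketch on invented data (the frame name `weather` and its values are assumptions), using `agg` to compute a few statistics of temperature for each condition.

```python
import pandas as pd

# Invented observations: one categorical and one numeric column.
weather = pd.DataFrame({
    "Conditions": ["Clear", "Clear", "Overcast", "Overcast", "Rain"],
    "TemperatureF": [40.0, 44.0, 36.0, 38.0, 35.0],
})

# One row per condition, one column per statistic.
summary = weather.groupby("Conditions")["TemperatureF"].agg(["count", "mean", "min", "max"])
print(summary)
```

The `count` column is worth checking before reading the plots: a category with only one or two observations has too little data for its box or density curve to be meaningful.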