Data exploration is one of the crucial steps of working with data. To perform more effective data exploration, we need to know what are the inputs and the desired outputs. This prerequisite makes data exploration one of the most crucial steps in Machine Learning and Data Science. Perhaps one of the most powerful Python packages that facilitate this aim is Pandas.

Pandas package makes our life easier! To clean and prepare data, automate common tasks, analyze data and handle large datasets, and whatever queries that help the data processing, Pandas is there to help us. Pandas is ofter referred to as the most important tool in data science. So, I can tell you that you should take it seriously! If you want to stand up and progress in Machine Learning and Data Science, your job is pretty hard without Pandas!

In this tutorial, you will learn:

How to read and load the data
The steps of initial checking of the dataset
Getting a sense of data in a short amount of time
Finding and getting rid of missing and outlier values

Subscribe to Download the Full Source Code!

The Dataset

Boston house prices dataset is our targeted dataset. This dataset contains information accumulated in 70’s by the U.S Census Service regarding housing in the area of Boston, Massachusetts, USA. You can check the details below:

Characteristics

  • Number of Instances: 506
  • The first 13 features are numeric/categorical predictive features.
  • The last one (attribute 14): Median Value is the target variable.

Attributes

  1. CRIM per capita crime rate by town
  2. ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  3. INDUS proportion of non-retail business acres per town
  4. CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  5. NOX nitric oxides concentration (parts per 10 million)
  6. RM average number of rooms per dwelling
  7. AGE proportion of owner-occupied units built prior to 1940
  8. DIS weighted distances to five Boston employment centers
  9. RAD index of accessibility to radial highways
  10. TAX full-value property-tax rate per $10,000
  11. PTRATIO pupil-teacher ratio by town
  12. B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  13. LSTAT % lower status of the population
  14. MEDV Median value of owner-occupied homes in $1000’s [target attribute]

The First Step of Data Exploration: Reading the Data

The first step of data exploration is to read the data. Pandas make life easy for us in this task. One of the easiest approaches to read the data is to use the read_csv() method. This method is in essence defined to read separated (ex: comma-separated) values (CSV) file into Pandas DataFrame. However, our data is formatted as `housing.data`, but since it has the same structure, read_csv() is one of the best ways to utilize for reading it.

Let’s explore its usage with an example:

# Might be necessary for ssl-based urls
import ssl
ssl._create_default_https_context = ssl._create_unverified_context

# Import Pandas packages
import pandas as pd

# URL of the dataset
url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'

# List of column names to use.
column_names = ['CRIM','ZN','INDUS','CHAS','NOX',
                'RM', 'AGE', 'DIS','RAD','TAX','PTRATION', 'B', 'LSTAT', 'MEDV']

# Number of rows of file to read.               
nrows = 500   

# Delimiter to use: Clarify the separator character in the data. The comma is the default.
# But the data values in this dataset are separated with space.
separator=" "

# Skip spaces after delimiter
skipinitialspace = True

# Load data and create the data frame
dataset = pd.read_csv(filepath_or_buffer = url, names=column_names, sep=separator, skipinitialspace=skipinitialspace)

Let’s explain the above code step-by-step:

  • Lines 2-3: It might be necessary to use when you try to download a dataset that its URL has SSL. It depends on the framework that you are using too. For example, if you use Google Colaboratory, it is very likely you get an error if you do not use the SSL package.
  • Line 6: I just imported the Pandas package with the alias pd.
  • Line 9: The URL of the dataset file.
  • Lines 12-13: The desired column names in the final DataFrame. The column names are the attributes of the dataset as described above. NOTE: Sometimes the file may contain a header row. In this situation, you should explicitly pass header=0 to cancel the column names you used because the column names are already defined in the data. NOTE: If the header row is set and have the columns’ names, you should use the dataset.columns to observe the available columns in the dataset.
  • Line 16: Determines the number of desired data samples to read. NOTE: This is very useful if you are dealing with a HUGE dataset and you just want to (1) read a few lines to become familiar with the dataset or (2) want to process only a portion of data.
  • Line 20: It determines the delimiter (separator) to use. We should clarify how the elements in the dataset are separated. The comma is the default. However, in this particular dataset, space is the separator.
  • Line 23: When setting to TRUE, it skips spaces after delimiter. NOTE: It happens when there are different data points separated with some element but there are random spaces remaining. These spaces can ruin the data reading and preparation process. The skipinitialspace argument is a lifesaver here.
  • Line 26: Is the main function of the described arguments. It will read and create the desired data frame.

Depending on your data, there are other functions that you can use to read your data: pandas.read_pickle, pandas.read_table and pandas.read_excel, to name a few.

Now that we created the DataFrame, let’s continue by watching what is inside.

Checking the Data

The second stage of data exploration is to check and investigate the data. In this stage, we desire to see how our data look like to get some insight. Let’s investigate some functions as examples:

# This function returns last n rows from the object 
# based on position.
dataset.head(n=10)

The functions dataset.head(n=n_lines) and dataset.tail(n=n_lines) demonstrate n_lines number of data rows from the beginning and at the end, respectively. Let’s check the output of running above function:

We used head(n=10) which demonstrates the first 10 rows of the data.

A nice function that gives us a good understanding of the data is the describe() function. It gives a good brief summary statistics of the data by excluding the NaN values. Let’s use it and observe the results:

# Summary statistics
dataset.describe()

Descriptive statistics include those that summarize the central tendency, dispersion and shape of a dataset’s distribution, excluding NaN values.

Now, let’s move on and try to return some random samples from the data using the sample() function. Let’s check it out by returning 10 random samples from the dataset:

# Summary statistics
dataset.random(n=10)

Now let’s spend some time on one the most important issues in data processing.

Dealing With Missing Data

One of the most important parts of data exploration is to identify the missing values. Let’s dig more.

The Issue

Let’s look back to our dataset again. There are different rows in the data. Any of which represents a sample which has different features (columns represent features). Most of the time, there are missing values in the dataset. They are different types of missing values:

  1. Missing features (column value)
  2. Missing samples (complete row)

The availability of missing values might be due to the issues in dataset processing, data collection, or simply errors in the middle of the programming phase. Anyhow, the fact is, we CANNOT ignore missing values. How we are going to deal with them in Pandas?

Finding Missing Values

One great function that helps up find the missing values, is the isnull() function. This function, go through all elements and returns TRUE if any value is missing. Check the below example:

# Find missing data
dataset.isnull()

As you can see above, the naive utilization of the isnull() function is not very informative. Let’s use a more creative approach:

# Check and report the nulls in the data
n_missing = dataset.isnull().sum().sort_values(ascending=False)
n_nulls = dataset.isnull().sum()
n_samples = dataset.isnull().count()
percent = (n_nulls/n_samples).sort_values(ascending=False)
missing_data = pd.concat([n_missing, percent], axis=1, keys=['N_missing', 'Percentage'])
missing_data.head(20)

As can be observed, our dataset does NOT have any missing value! In the real-world application, this is rarely the case.

Removing Missing Values

Now that we found missing values, how should we remove them from the dataset? We can simply drop them from the dataset using the dropna() function:

# Remove rows with missing values
df = dataset.dropna(axis=0, how='any')

Let’s talk about the function arguments used above:

  • axis: Determine whether we remove rows or columns in which we have missing values. We chose axis=0, which means we are going to remove rows.
  • how: this argument can take one of the two possible values: ‘any’ or ‘all’. When we pick ‘any’, it means if any of the columns have a null value, we remove the rows. Choosing ‘all’ means we only remove a row if all the columns are nulls!!

Learning From Data

Individual Features

It would be informative if we can visualize the trend of individual features. Let’s investigate the median value of owner-occupied homes which is the MEDV column in the data.

import seaborn
seaborn.distplot(dataset['MEDV']);
data exploration

Numerical Variables Correlations

How about calculating the correlation between numerical variables? This gives us a very good insight into how data features are correlated. There are different approaches for measuring and visualizing this correlation.

Two Variables

Let’s compare two variables: RM and MEDV.

# Visual representation of training data
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
# With .pop() command, the associated columns are extracted.
x = dataset['RM']
y = dataset['MEDV']
ax.scatter(x, y, edgecolors=(0, 0, 0))
ax.set_xlabel('RM')
ax.set_ylabel('MEDV')
plt.show()
data exploration

Multiple Pairs of Variables

We can draw multiple plots alongside each other. Let’s draw the scatter plots!

#scatterplot
import seaborn as sns
sns.set(style="ticks", color_codes=True)
features = ['MEDV', 'TAX', 'RM']
seaborn.pairplot(dataset[features],kind='scatter')
data exploration

Now, let’s change the type from ‘scatter’ to ‘reg’. We basically changed the kind of the plot for the non-identity relationships.

#scatterplot
import seaborn as sns
sns.set(style="ticks", color_codes=True)
features = ['MEDV', 'TAX', 'RM']
seaborn.pairplot(dataset[features],kind='reg')
data exploration

Correlation matrix

So the next step is to extract the correlation matrix between different variables. The resulting heatmap gives us an overall understanding of the columns’ relationships.

#correlation matrix
corrmat = dataset.corr()
f, ax = plt.subplots(figsize=(14, 14))
sns.heatmap(corrmat, vmax=1, square=True);
data exploration

Sometimes:

  1. The number of our features (columns) is too much.
  2. We are only interested to find the correlation between some features and not all.

In the above cases, we should think about only picking some features and compare them with each other. Example: In our special case, we want to pick one column (‘MEDV’) and compare it with some other columns that have the highest correlation with ‘MEDV’. How do we do that? Check below code along with comments:

import numpy as np
#MEDV correlation matrix

# Number of columns for heatmap
n_features = 10 

# Get the rows of a DataFrame sorted by the n_features largest values of columns.
# Pick the columns that have the largest correlations
cols = corrmatrix.nlargest(n_features, 'MEDV')['MEDV'].index

# Use numpy to extract the values and return the transpose
cm = np.corrcoef(dataset[cols].values.T)

# Seaborn set with the scale of the font
sns.set(font_scale=1)

# Draw the heatmap
# Arguments:
#   cbar: colorbar
#   annot: annotation
#   square: Whether the return shape is square or not.
#   fmt: Determine the number of decimal points
#   annot_kws: Determine the size of the annotation
#   yticklabels, xticklabels: Labels
hm = sns.heatmap(cm, cbar=True, annot=True, square=False, fmt='.1f', annot_kws={'size': 10}, yticklabels=cols.values, xticklabels=cols.values)
plt.show()
data exploration

Outliers

We have to determine the outlier variables if possible. It is a very important concept in the data processing. But what are the outliers? Some values differ significantly from the majority of data points. As an example, let’s say we want to measure the heights of the elementary students and there are some students that 6 feet or higher tall!! Those are outliers! There are different ways to find and target outliers. It dependents on what characteristics are important for us.

What should we do with the outliers? It totally depends what is the reason of having outliers:

  • If the cause is data processing errors, we can just delete them.
  • If they are truly part of the population, then we should investigate how they are affecting our assessments.

Let’s just have a simple visualization approach for now. Let’s bring on of the plots we already demonstrated. We are going to visually watch it and draw red circles around outliers!

Train/Test Split

Usually training Machine Learning models requires splitting the dataset into training/testing sets. Doing so is very easy with Pandas:

# Split data into train/test
# p = training data portion
p=0.8
trainDataset = dataset.sample(frac=p,random_state=0)
testDataset = dataset.drop(trainDataset.index)

In the above code:

  1. We first randomly select a portion of the data as the train set.
  2. We dropped the training set from the data and the remainder is going to be our test set. This guaranty the generation of two disjoint sets.

Conclusion

Well, thank you for reading so far. Here, I talked about why you cannot escape from learning Pandas! If you desire to work in Data Science and Machine Learning, and basically anything that deals with data, you become a more effective person by simply knowing Pandas better! In this article, together we went through the process of data exploration, starting from reading the data. You learn how to get a sense of data by doing a few steps. An important aspect of this tutorial was to show you how to find missing data which is usually an irritating part! I hope you gained some knowledge by reading this post. Please feel free to comment and share your view and your questions.

Read More

To strengthen your knowledge, you may find the following resources useful.

0 0 vote
Article Rating
Subscribe
Notify of
guest
0 Comments
Inline Feedbacks
View all comments
0
Would love your thoughts, please comment.x
()
x
Scroll to Top

Join our journey...

Access to exclusive materials...

Tweet
Share
Pin
Share