Introduction To Data Visualization Using Seaborn

Data science has reshaped how businesses operate, and its impact is visible everywhere. It is the study of data: what the data represents, where it is collected from, and how to turn it into useful insight when formulating business and IT strategy. In today’s competitive world, nearly every industry treats data as one of its most significant assets.

Seaborn is a statistical plotting library built on top of Matplotlib. It has attractive default styles and is designed to work well with Pandas DataFrame objects. Let’s get started!

1.Installing Seaborn Library

There are two ways to install the Seaborn library on your machine: use conda if you are working in an Anaconda environment, or use the pip package installer for any other Python installation.

conda install seaborn
or
pip3 install seaborn

2.Getting Started 

We start by importing the Seaborn library under the alias sns, and add %matplotlib inline so that the visualizations are displayed inside the Jupyter Notebook:

import seaborn as sns
%matplotlib inline
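
%matplotlib inline is a Jupyter/IPython magic, so it is not valid in a plain Python script. A minimal sketch of a script equivalent, assuming you want to save the figure as an image rather than display it inline (the Agg backend, the toy numbers, and the in-memory buffer are choices made for this illustration only):

```python
import io

import matplotlib
matplotlib.use("Agg")  # assumption: no display available; omit for interactive use
import matplotlib.pyplot as plt
import seaborn as sns

# Draw a Seaborn plot onto an explicit Matplotlib Axes.
fig, ax = plt.subplots()
sns.scatterplot(x=[1, 2, 3, 4], y=[2, 4, 3, 5], ax=ax)

# Save the figure; pass a file name such as "plot.png" to write to disk instead.
buffer = io.BytesIO()
fig.savefig(buffer, format="png")
plt.close(fig)
print(f"PNG size: {len(buffer.getvalue())} bytes")
```

In an interactive session, calling plt.show() opens the figure in a window instead.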

3.Importing The Dataset

As mentioned earlier, Seaborn works well with Pandas, so we can import data using the Pandas read_csv function. Seaborn also comes with its own example datasets, and that is what we are going to use.

We load the data with sns.load_dataset(), passing in the name of the dataset we want to use. The function downloads the file from the Seaborn data repository on GitHub, so it needs an internet connection the first time. You can see all the datasets available in Seaborn on GitHub.

tips = sns.load_dataset("tips")
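
If you would rather load your own file, the read_csv route mentioned above could look like the sketch below. The rows are hypothetical values laid out like the tips columns, and io.StringIO stands in for a real file path so that the example runs on its own:

```python
import io

import pandas as pd

# Hypothetical rows laid out like the tips dataset (for illustration only).
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
"""

# read_csv accepts a path, a URL, or a file-like object such as StringIO.
my_tips = pd.read_csv(io.StringIO(csv_text))
print(my_tips.shape)   # (rows, columns)
print(my_tips.dtypes)  # numeric columns are parsed as numbers, the rest as text
```

The resulting DataFrame can be passed to the data argument of the Seaborn functions used in the rest of this article.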

Let’s look at the data we are working with by calling head(). Note that head() is a method of the Pandas DataFrame that load_dataset() returns, not a Seaborn function.

By default, it shows the first 5 rows; to see more, pass the number of rows between the parentheses.

tips.head()
An Overview of The Data

As you can see above, there are 7 columns. The data essentially describes people who had a meal and then left a tip. Feel free to read more about it on Kaggle.

4. Distribution Plots

Distribution plots are plots that focus on numerical data rather than categorical data. We will start with the dist plot.

4.1.Dist plot

Dist plot allows you to display the distribution of a univariate set of observations.

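We use the sns.distplot() function and pass in the column whose distribution we want to plot. Note that distplot() has been deprecated since Seaborn 0.11 in favor of histplot() and displot(), so newer versions print a warning when you call it: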

sns.distplot(tips["total_bill"])
The Dist Plot

The histogram above has a line drawn over it. This is the kernel density estimate (KDE). You can remove it by setting the kde argument to False:

sns.distplot(tips["total_bill"], kde=False)
The Dist Plot Without KDE

Now you have a plain histogram, which shows how the total_bill values are distributed. The Y-axis shows a count, and the bars along the X-axis are the bins.

This indicates that most of the total_bill values fall between $10 and $20.

4.2.JointPlot

Let’s now talk about the joint plot. This Seaborn plot lets you combine two distribution plots for bivariate data: it shows the relationship between two variables together with the distribution of each one.

We use the sns.jointplot() function, which takes three main arguments:

  • x: The column to use for the X-axis
  • y: The column to use for the Y-axis
  • data: The DataFrame the columns come from
sns.jointplot(x="total_bill", y="tip", data=tips)
The Joint Plot

As you can see above, the joint plot combines two kinds of plot: a scatter plot of the two variables in the center, and a histogram of each variable along the margins.

Note that we can add another argument, kind, which changes the scatter plot to another type of plot. For instance, to use a hexagonal-bin plot we pass in the value hex:

sns.jointplot(x="total_bill", y="tip", data=tips, kind="hex")
The Joint Plot With Hexagon Plot

In a hexagon plot, a hexagon gets darker the more points fall inside it and lighter the fewer points it contains. The accepted kind values include “scatter”, “reg”, “resid”, “kde” and “hex”.

4.3.PairPlot

Now we extend that idea with the pair plot. A pair plot shows pairwise relationships across a whole data frame, at least for the numerical columns. It also supports a hue argument for coloring by a categorical column.

We use the sns.pairplot() function and pass in the data as an argument:

sns.pairplot(tips)
The Pair Plot

Note that with a large dataset the pair plot may take some time to render, because it creates a plot for every pair of numerical columns. For instance, it creates a scatter plot for size and total_bill, for size and tip, and so on.

When both variables are the same, such as size and size, it draws a histogram instead of a scatter plot. This makes the pair plot an excellent way to quickly visualize your data.

The great thing about this plot is that you can pass in hue as a parameter. For example, let’s pass the sex column as the hue and see what we get:

sns.pairplot(tips, hue="sex")
The Pair Plot With a Hue Defined as Sex

Note that you should pass a categorical column here rather than a numerical one. As you can see from the graph above, the points are colored according to the column you passed in.

Nearly every Seaborn function accepts a palette argument, which lets you color the graph with a specific color palette. There are dozens of palettes available in Seaborn, too many to list here, so search for them online to see what works best for your visualization.

Let’s try one of these palettes and see what we get:

sns.pairplot(tips, hue="sex", palette="gist_stern")
gist_stern Palette on Pair Plot
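
If you want to check which colors a palette name will produce before using it, sns.color_palette() returns them as a list of RGB tuples. A small sketch, asking for two colors because the hue column above has two categories:

```python
import seaborn as sns

# Sample two colors from the gist_stern colormap.
colors = sns.color_palette("gist_stern", 2)

for rgb in colors:
    # Each color is a (red, green, blue) tuple with channels between 0 and 1.
    print(tuple(round(channel, 3) for channel in rgb))
```

In a Jupyter Notebook, sns.palplot(colors) draws the colors as a row of swatches.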

5.Categorical Plots

Categorical plots are concerned with displaying the distribution of a categorical column, such as sex, in relation to a numerical column or another categorical column.

5.1.BarPlot

The most basic categorical plot is the bar plot, which you can create with the sns.barplot() function. It is a general plot that aggregates a numerical column for each category using some function, which is the mean by default.

sns.barplot(x="sex", y="total_bill", data=tips)
The Bar Plot
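
The height of each bar corresponds to an aggregate you can compute yourself with Pandas. A sketch of the equivalent group-by, using made-up numbers rather than the real tips data:

```python
import pandas as pd

# Made-up bills for illustration.
sample = pd.DataFrame({
    "sex": ["Female", "Female", "Female", "Male", "Male"],
    "total_bill": [10.0, 20.0, 30.0, 12.0, 18.0],
})

# Mean total_bill per category: the quantity a default bar plot displays.
means = sample.groupby("sex")["total_bill"].mean()
print(means)
```

Here the Female bar would have height 20.0 and the Male bar height 15.0.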

We can add a fourth argument, estimator, which changes the aggregation function used by the bar plot (the mean by default). As an example, we will change it to the standard deviation. Feel free to use your own function.

import numpy as np
sns.barplot(x="sex", y="total_bill", data=tips, estimator=np.std)
Apply Standard Deviation On Bar Plot
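
One detail to keep in mind when comparing the bars with numbers from Pandas: np.std computes the population standard deviation by default (ddof=0), while the Pandas std() method computes the sample standard deviation (ddof=1). A short sketch with made-up values:

```python
import numpy as np
import pandas as pd

# Made-up values chosen so that the arithmetic is easy to follow.
bills = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

population_std = np.std(bills.to_numpy())      # divides by n (ddof=0)
sample_std = bills.std()                       # divides by n - 1 (ddof=1)
matching_std = np.std(bills.to_numpy(), ddof=1)

print(population_std)  # 2.0
print(sample_std)      # slightly larger than the population value
```

If you want the bar plot to use the sample version, you could pass your own function as the estimator, for example a lambda that calls np.std with ddof=1.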

5.2.CountPlot

Seaborn also has the sns.countplot() function. It is essentially the same as a bar plot, except that the estimator counts the number of occurrences of each category in a given column.

sns.countplot(x="sex", data=tips)
The Count Plot
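
The numbers behind a count plot are the same ones the Pandas value_counts() method returns. A sketch with a made-up column:

```python
import pandas as pd

# Made-up category values for illustration.
sample = pd.DataFrame({"sex": ["Female", "Male", "Male", "Male", "Female"]})

# Number of rows per category: the bar heights of a count plot.
counts = sample["sex"].value_counts()
print(counts)
```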

5.3.BoxPlot

The box plot displays the distribution of quantitative data across the levels of a categorical variable, in a form that makes it easy to compare between categories. Here is an example:

sns.boxplot(x="day", y="total_bill", data=tips)
The Box Plot
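
To read a box plot, it helps to know what it summarizes: the box spans the first to third quartile, the line inside is the median, and with the default settings points farther than 1.5 times the interquartile range from the box are drawn as outliers. A sketch of that arithmetic for a single made-up group of values:

```python
import pandas as pd

# Made-up bills for a single category, with one unusually large value.
bills = pd.Series([10, 12, 14, 16, 18, 20, 22, 60], dtype=float)

q1, median, q3 = bills.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Default whisker limits: 1.5 * IQR beyond the box edges.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
outliers = bills[(bills < lower_limit) | (bills > upper_limit)].tolist()

print(q1, median, q3)  # 13.5 17.0 20.5
print(outliers)        # [60.0]
```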

5.4.ViolinPlot

The violin plot plays a very similar role to the box plot. It also shows the distribution of the data across some category, and it takes exactly the same parameters as the box plot.

sns.violinplot(x="day", y="total_bill", data=tips)
The Violin Plot

The two are very similar, but the violin plot gives you more information because it shows the shape of the distribution as a kernel density estimate. There are some additional arguments you can pass:

  • hue: the column used for color encoding
  • split: either True or False; when True and hue has two levels, half a violin is drawn for each level so the two can be compared directly
sns.violinplot(x="day", y="total_bill", data=tips, hue="sex", split=True)
Apply The Hue On Violin Plot

5.5.StripPlot

The strip plot (or dot plot) draws a scatter plot where one variable is categorical. One problem with the strip plot is that you cannot tell how many points are stacked on top of each other.

sns.stripplot(x="day", y="total_bill", data=tips)
The Dot Plot

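We can address that problem with the jitter argument, which can be True or False (or a number giving the amount of jitter). It adds a little random noise to the point positions along the categorical axis so that they become more readable. Recent versions of Seaborn enable jitter by default.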

sns.stripplot(x="day", y="total_bill", data=tips, jitter=True)
The Dot Plot With Jitter Argument
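
Conceptually, jitter just shifts each point sideways by a small random amount while leaving its value untouched. The sketch below illustrates the idea with NumPy; it is not Seaborn's internal implementation, and the jitter width of 0.1 is an arbitrary choice for this example:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Positions on the categorical axis: three points in category 0, three in category 1.
category_positions = np.array([0, 0, 0, 1, 1, 1], dtype=float)
values = np.array([12.0, 12.0, 15.5, 20.0, 20.0, 21.0])

# Shift each point sideways by a random offset within +/- jitter_width.
jitter_width = 0.1
noise = rng.uniform(-jitter_width, jitter_width, size=category_positions.size)
jittered_positions = category_positions + noise

# Points with identical values no longer sit exactly on top of each other.
for x, y in zip(jittered_positions, values):
    print(f"x={x:.3f}, y={y}")
```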

5.6.SwarmPlot

The swarm plot is pretty similar to a strip plot (dot plot), but the points are adjusted so that they do not overlap. The result gives a view of the distribution, somewhat like a combination of a dot plot and a violin plot in one graph.

sns.swarmplot(x="day", y="total_bill", data=tips)
The Swarm Plot

You can combine the violin plot with the swarm plot by using the two functions in the same cell:

sns.violinplot(x="day", y="total_bill", data=tips)
sns.swarmplot(x="day", y="total_bill", data=tips, color="black")
A Combination Between Swarm Plot and Violin Plot
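
Outside a notebook cell, a more explicit way to overlay the two plots is to create one Matplotlib Axes and pass it to both functions through their ax argument. A sketch on made-up data, where inner=None and the light gray color are styling choices for this example:

```python
import matplotlib
matplotlib.use("Agg")  # assumption: headless run; omit inside Jupyter
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up bills for three days.
sample = pd.DataFrame({
    "day": ["Thur", "Thur", "Thur", "Fri", "Fri", "Fri", "Sat", "Sat", "Sat"],
    "total_bill": [10.5, 14.2, 19.8, 12.0, 16.4, 22.1, 15.3, 20.7, 28.9],
})

fig, ax = plt.subplots()

# Draw both plots on the same Axes; inner=None hides the box inside the violins.
violin_ax = sns.violinplot(x="day", y="total_bill", data=sample,
                           inner=None, color="lightgray", ax=ax)
swarm_ax = sns.swarmplot(x="day", y="total_bill", data=sample,
                         color="black", ax=ax)
```

Both calls return the Axes they drew on, so violin_ax and swarm_ax refer to the same object as ax.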

Conclusion

Seaborn is a powerful tool that makes data visualization easy: as you saw in this article, every plot we made took just one line of code.


Note: This is a guest post, and the opinion in this article is of the guest writer. If you have any issues with any of the articles posted at www.pythonlearning.org please contact at asif@marktechpost.com
