What is EDA?

3 Feb

by Madhura babar Blog

What is EDA?

Exploratory data analysis (EDA) is used by data scientists to analyze and investigate data sets and summarize their main characteristics, often employing data visualization methods.

EDA helps determine how best to manipulate data sources to get the answers you need, making it easier for data scientists to discover patterns, spot anomalies, test a hypothesis, or check assumptions.

EDA is primarily used to see what data can reveal beyond the formal modeling or hypothesis testing task, and it provides a better understanding of data set variables and the relationships between them. It can also help determine whether the statistical techniques you are considering for data analysis are appropriate. Originally developed by American mathematician John Tukey in the 1970s, EDA techniques continue to be widely used in the data discovery process today.

Why is EDA Important in Data Science?

The main purpose of EDA is to help look at data before making any assumptions. It can help identify obvious errors, better understand patterns within the data, detect outliers or anomalous events, and find interesting relations among the variables.

Data scientists can use exploratory analysis to ensure the results they produce are valid and applicable to any desired business outcomes and goals. EDA also helps stakeholders by confirming they are asking the right questions. EDA can help answer questions about standard deviations, categorical variables, and confidence intervals. Once EDA is complete and insights are drawn, its features can then be used for more sophisticated data analysis or modeling, including machine learning.

EDA Tools

Specific statistical functions and techniques you can perform with EDA tools include:

Clustering and dimension reduction techniques, which help create graphical displays of high-dimensional data containing many variables.
Univariate visualization of each field in the raw dataset, with summary statistics.
Bivariate visualizations and summary statistics that allow you to assess the relationship between each variable in the dataset and the target variable you’re looking at.
Multivariate visualizations, for mapping and understanding interactions between different fields in the data.
K-means clustering, which is a clustering method in unsupervised learning where data points are assigned into K groups, i.e. the number of clusters, based on the distance from each group’s centroid. The data points closest to a particular centroid will be clustered under the same category. K-means clustering is commonly used in market segmentation, pattern recognition, and image compression.
Predictive models, such as linear regression, which use statistics and data to predict outcomes.
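The K-means description above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production implementation: the function name kmeans, the sample points, and the choice to seed the centroids with the first k points are all assumptions made for this example.

```python
import math


def kmeans(points, k, iterations=20):
    """Plain K-means: assign points to the nearest centroid, then recompute centroids."""
    # Assumption: the first k points seed the centroids (real tools use smarter seeding).
    centroids = [points[i] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: index of the closest centroid by Euclidean distance.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members))
                )
            else:
                new_centroids.append(centroids[c])  # keep the centroid of an empty cluster
        if new_centroids == centroids:
            break  # centroids stopped moving, so the assignment is stable
        centroids = new_centroids
    return labels, centroids


# Hypothetical 2-D data with two visually obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.5, 2.0), (8.0, 8.5), (1.0, 0.5), (9.5, 8.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

Each label is the index of the centroid a point ended up closest to, which matches the "K groups based on distance from each group's centroid" idea in the list above.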

EDA Techniques

Some of the common techniques and methods used in Exploratory Data Analysis include the following:

Data Visualization

Data visualization involves generating visual representations of the data using graphs, charts, and other graphical techniques. Data visualization enables a quick and easy understanding of patterns and relationships within data. Visualization techniques include scatter plots, histograms, heatmaps, and box plots.

Correlation Analysis

Using correlation analysis, one can analyze the relationships between pairs of variables to identify any correlations or dependencies between them. Correlation analysis helps in feature selection and in building predictive models. Common correlation techniques include Pearson’s correlation coefficient, Spearman’s rank correlation coefficient and Kendall’s tau correlation coefficient.
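To make the difference between two of these coefficients concrete, here is a small standard-library sketch of Pearson's and Spearman's correlation. The helper names (pearson, ranks, spearman) and the sample values are assumptions for illustration; in practice a data analysis library would normally provide these calculations.

```python
import math


def pearson(x, y):
    """Pearson's r: covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for t in range(i, j + 1):
            result[order[t]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: Pearson's r computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: y grows with x, but not in a straight line (y = x squared).
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
pearson_r = pearson(x, y)
spearman_rho = spearman(x, y)
print(round(pearson_r, 3), round(spearman_rho, 3))
```

Because the relationship is perfectly monotonic but not linear, Spearman's coefficient reaches 1 while Pearson's stays slightly below it.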

Dimensionality Reduction

In dimensionality reduction, techniques like principal component analysis (PCA) and linear discriminant analysis (LDA) are used to decrease the number of variables in the data while keeping as many details as possible.
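As a rough sketch of the idea behind PCA, the example below reduces two variables to one by projecting the points onto the direction of greatest variance. It is limited to two dimensions so the direction can be computed in closed form; the function name pca_2d and the data are assumptions for illustration, and general-purpose PCA in a library handles any number of variables.

```python
import math


def pca_2d(points):
    """Return the first principal direction, the 1-D scores, and the variance explained."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    centered = [(x - mean_x, y - mean_y) for x, y in points]

    # Entries of the 2x2 sample covariance matrix.
    sxx = sum(x * x for x, _ in centered) / (n - 1)
    syy = sum(y * y for _, y in centered) / (n - 1)
    sxy = sum(x * y for x, y in centered) / (n - 1)

    # Angle of the major axis of a symmetric 2x2 matrix.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    direction = (math.cos(theta), math.sin(theta))

    # Project each centered point onto that direction: two variables become one.
    scores = [x * direction[0] + y * direction[1] for x, y in centered]
    explained = (sum(s * s for s in scores) / (n - 1)) / (sxx + syy)
    return direction, scores, explained


# Hypothetical data: two variables that move together almost perfectly.
points = [(2.0, 1.9), (4.0, 4.2), (6.0, 5.8), (8.0, 8.1)]
direction, scores, explained = pca_2d(points)
print(direction, round(explained, 4))
```

The explained value is the share of total variance kept by the single new variable, which is one way to quantify "keeping as many details as possible".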

Descriptive Statistics

Descriptive statistics involves calculating summary statistics such as mean, median, mode, standard deviation, and variance to gain insights into the distribution of data. The mean is the average value of the data set and provides an idea of the central tendency of the data. The median is the middle value in a sorted list of values and provides another measure of central tendency. The mode is the most common value in the data set.
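These summary statistics can be computed with Python's built-in statistics module. The sketch below uses made-up values and the population versions of variance and standard deviation; statistics.variance and statistics.stdev give the sample versions instead.

```python
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical sample data

summary = {
    "mean": statistics.mean(values),          # average value
    "median": statistics.median(values),      # middle of the sorted values
    "mode": statistics.mode(values),          # most common value
    "variance": statistics.pvariance(values), # population variance
    "std_dev": statistics.pstdev(values),     # population standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value}")
```

Here the mean (5) and median (4.5) differ slightly because the larger values 7 and 9 pull the mean upward.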

Clustering

Clustering techniques such as K-means clustering, hierarchical clustering, and DBSCAN clustering help identify patterns and relationships within a dataset by grouping similar data points together based on their characteristics.

Outlier Detection

Outliers are data points that deviate significantly from the rest of the data and can have a crucial impact on the accuracy of models. Identifying and removing outliers using methods such as the Z-score, the interquartile range (IQR), and box plots can help improve data quality and model accuracy.
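The Z-score and IQR rules can be sketched with the standard library as follows. The function names, the sample readings, and both cutoffs (a Z-score threshold of 2.0 and the common 1.5 x IQR fence) are assumptions chosen for this example, not fixed rules.

```python
import statistics


def zscore_outliers(values, threshold=2.0):
    """Flag values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.mean(values)
    std_dev = statistics.pstdev(values)
    return [v for v in values if abs(v - mean) / std_dev > threshold]


def iqr_outliers(values, k=1.5):
    """Flag values lying more than k * IQR below the first or above the third quartile."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]


# Hypothetical readings with one suspicious value.
readings = [10, 12, 11, 13, 12, 11, 10, 12, 95]
z_flagged = zscore_outliers(readings)
iqr_flagged = iqr_outliers(readings)
print(z_flagged, iqr_flagged)
```

A threshold of 3 is often quoted for the Z-score rule, but with only nine values no point can reach a Z-score of 3, so a lower cutoff is used here. The IQR rule is less affected by this because quartiles are not pulled toward the extreme value the way the mean and standard deviation are.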

Types Of EDA

Univariate non-graphical

This is the simplest form of data analysis, where the data being analyzed consists of just one variable. Since it’s a single variable, it doesn’t deal with causes or relationships. The main purpose of univariate analysis is to describe the data and find patterns that exist within it.

Univariate graphical

Non-graphical methods don’t provide a full picture of the data. Graphical methods are therefore required. Common types of univariate graphics include:

Stem-and-leaf plots, which show all data values and the shape of the distribution.
Histograms, which are bar plots in which each bar represents the frequency (count) or proportion (count/total count) of cases for a range of values.
Box plots, which graphically depict the five-number summary of minimum, first quartile, median, third quartile, and maximum.
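The numbers behind a histogram and a box plot can be computed without any plotting library. The sketch below, using made-up ages and a hypothetical histogram helper, counts values per bin, prints a text histogram with proportions, and derives the five-number summary.

```python
import statistics


def histogram(values, bin_width):
    """Count values per bin; a bin starting at `left` covers [left, left + bin_width)."""
    start = min(values) // bin_width * bin_width
    counts = {}
    for v in values:
        left = start + (v - start) // bin_width * bin_width
        counts[left] = counts.get(left, 0) + 1
    return dict(sorted(counts.items()))


ages = [21, 22, 25, 31, 33, 34, 35, 38, 42, 47, 58]  # hypothetical data

# Histogram: frequency (count) and proportion (count / total count) per bin.
counts = histogram(ages, bin_width=10)
for left, count in counts.items():
    print(f"{left}-{left + 9}: {'#' * count} ({count / len(ages):.0%})")

# Box plot ingredients: minimum, first quartile, median, third quartile, maximum.
q1, median, q3 = statistics.quantiles(ages, n=4)
five_number = (min(ages), q1, median, q3, max(ages))
print(five_number)
```

Quartile values depend on the interpolation method; statistics.quantiles uses its default "exclusive" method here, and other tools may report slightly different quartiles for the same data.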

Multivariate non-graphical

Multivariate data arises from more than one variable. Multivariate non-graphical EDA techniques generally show the relationship between two or more variables of the data through cross-tabulation or statistics.
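A cross-tabulation simply counts how often each combination of two categorical variables occurs. Here is a minimal sketch using collections.Counter; the region and answer values are invented for illustration.

```python
from collections import Counter

# Hypothetical survey records: (region, answered "yes" or "no").
records = [
    ("north", "yes"), ("north", "no"), ("south", "yes"),
    ("south", "yes"), ("north", "yes"), ("south", "no"),
]

# Each key is a (region, answer) pair; each value is how often it occurs.
crosstab = Counter(records)

regions = sorted({region for region, _ in records})
answers = sorted({answer for _, answer in records})

# Print the counts as a small table: one row per region, one column per answer.
print("region".ljust(8) + "".join(answer.rjust(5) for answer in answers))
for region in regions:
    row = "".join(str(crosstab[(region, answer)]).rjust(5) for answer in answers)
    print(region.ljust(8) + row)
```

Reading across a row shows how one variable is distributed within a single level of the other, which is the relationship a cross-tabulation is meant to expose.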

Multivariate graphical

Multivariate graphical EDA uses graphics to display relationships between two or more sets of data. The most commonly used graphic is a grouped bar plot or bar chart, with each group representing one level of one of the variables and each bar within a group representing the levels of the other variable.

Other common types of multivariate graphics include:

Scatter plot, which is used to plot data points on a horizontal and a vertical axis to show how much one variable is affected by another.
Multivariate chart, which is a graphical representation of the relationships between factors and a response.
Run chart, which is a line graph of data plotted over time.
Bubble chart, which is a data visualization that displays multiple circles (bubbles) in a two-dimensional plot.
Heat map, which is a graphical representation of data where values are depicted by color.

Exploratory Data Analysis Languages

Some of the most common data science programming languages used to create an EDA include:

Python:

An interpreted, object-oriented programming language with dynamic semantics. Its high-level, built-in data structures, combined with dynamic typing and dynamic binding, make it very attractive for rapid application development, as well as for use as a scripting or glue language to connect existing components together. Python and EDA can be used together to identify missing values in a data set, which is important so you can decide how to handle missing values for machine learning.
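As a small illustration of checking for missing values in Python, the sketch below counts missing entries per column in a list of records. The column names, the data, and the rule for what counts as missing (None, an empty string, or NaN) are assumptions for this example; with pandas, the usual equivalent is calling isna().sum() on a DataFrame.

```python
import math

# Hypothetical records; None, "" and NaN all stand for a missing value here.
rows = [
    {"age": 34, "income": 52000, "city": "Pune"},
    {"age": None, "income": 61000, "city": "Mumbai"},
    {"age": 29, "income": None, "city": ""},
    {"age": 41, "income": float("nan"), "city": "Nagpur"},
]


def is_missing(value):
    """Treat None, empty strings, and NaN as missing (an assumption for this sketch)."""
    if value is None or value == "":
        return True
    return isinstance(value, float) and math.isnan(value)


# Count missing entries per column, using the first row's keys as the column list.
missing = {
    column: sum(is_missing(row.get(column)) for row in rows)
    for column in rows[0]
}
print(missing)
```

Counts like these help decide, column by column, whether to drop rows, fill in values, or leave the gaps for the model to handle.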

R:

An open-source programming language and free software environment for statistical computing and graphics, supported by the R Foundation for Statistical Computing. The R language is widely used among statisticians in data science for developing statistical observations and performing data analysis.
