Normal Distribution

The Normal Distribution, also called the Gaussian Distribution, is a probability distribution that describes how data cluster around the mean. A bell curve such as the one below shows concisely how data samples become more frequent the closer they are to the mean. Below, MatDeck creates three normal distribution curves using the normaldens function; users enter their own parameters and the curve is generated automatically.

Normal Distribution Plots

In short, the farther a data entry lies from the mean, in either direction, the less frequently it occurs, as shown by the bell curve in the normal distribution example above.

Normal Distribution Formula

Here, the normal distribution function is shown in a MatDeck grid canvas. Users can assign their own values to the variables represented by the letters.
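For readers without the canvas in front of them, the following is a minimal Python sketch of the standard textbook density, f(x) = exp(-(x - mean)² / (2 · sd²)) / (sd · √(2π)). The function name normal_pdf and its argument names are illustrative assumptions and are not part of MatDeck.

```python
import math

def normal_pdf(x, mean=0.0, std_dev=1.0):
    """Density of the normal distribution at x (textbook formula)."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    z = (x - mean) / std_dev
    return math.exp(-0.5 * z * z) / (std_dev * math.sqrt(2.0 * math.pi))

peak = normal_pdf(0.0)        # height of the standard normal curve at its mean
one_sigma = normal_pdf(1.0)   # height one standard deviation above the mean
```

The curve is highest at the mean and symmetric around it, so normal_pdf(1.0) and normal_pdf(-1.0) return the same value.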

Drawing a Normal Distribution

Drawing a Normal Distribution curve (Bell Curve) by hand is extremely difficult, and the only property you can illustrate accurately is the Y-intercept. For users who want a curve that is accurate down to the individual data point, MatDeck offers a solution: the normaldens function can be used in MatDeck canvases to create a Normal Distribution curve with editable parameters.

Normal Distribution Plots on MatDeck

In the above example the first normaldens function is implemented with the parameters X, 0, 1 respectively. The three parameters the normaldens function requires are:

  • X – Real parameter (the point on the X-axis at which the curve is evaluated)
  • 0 – Mean value of your data set otherwise known as the average
  • 1 – Standard deviation of your data set

By entering these simple parameters, a function for your normal distribution curve is generated instantly. You will have noticed that the normaldens functions are also wrapped in the curve2d function, another useful MatDeck function. The curve2d function takes the generated curve together with a minimum and maximum X value and a customisable number of data points, and creates a 2D graph of the normal distribution, here drawn from 100 data points.
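As a rough Python analogue of the sampling described above, the sketch below evaluates the density at a fixed number of evenly spaced X values between a minimum and a maximum. The helper sample_curve is hypothetical and is not MatDeck's curve2d; statistics.NormalDist comes from the Python standard library.

```python
from statistics import NormalDist

def sample_curve(mean, std_dev, x_min, x_max, points=100):
    """Return (x, density) pairs at evenly spaced x values, endpoints included."""
    if points < 2:
        raise ValueError("points must be at least 2")
    dist = NormalDist(mean, std_dev)
    step = (x_max - x_min) / (points - 1)
    return [(x_min + i * step, dist.pdf(x_min + i * step)) for i in range(points)]

curve = sample_curve(mean=0, std_dev=1, x_min=-4, x_max=4, points=100)
highest_x, highest_y = max(curve, key=lambda pair: pair[1])
```

The resulting pairs can be handed to any plotting tool; more points give a smoother curve at the cost of more evaluations.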

Below is an example of how NormalM1 would be plotted on MatDeck.

Standardized Normal Distribution

Keen readers will also notice some small differences between the three functions in the canvas. Each change to the mean is reflected in the Normal Distribution in real time.

Right Shifted Standardized Normal Distribution

Here the effect of the mean value of your data set is reflected on the 2d graphs that have been plotted. As explained before, the centre of a normal distribution is the mean value. By increasing or decreasing the mean value, the distribution will shift completely along the X-axis.
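The shift can be checked numerically with a short sketch using Python's standard library. The mean values 2 and -2 are chosen purely for illustration and are not taken from the MatDeck canvas.

```python
from statistics import NormalDist

standard = NormalDist(mu=0, sigma=1)
shifted_right = NormalDist(mu=2, sigma=1)   # mean value chosen for illustration
shifted_left = NormalDist(mu=-2, sigma=1)

# Moving the mean slides the curve along the X-axis without changing its shape:
# the height at (mean + d) is the same for every curve.
offsets = [-1.5, 0.0, 0.5, 3.0]
same_shape = all(
    abs(standard.pdf(d) - shifted_right.pdf(2 + d)) < 1e-12
    and abs(standard.pdf(d) - shifted_left.pdf(-2 + d)) < 1e-12
    for d in offsets
)
```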

Left Shifted Standardized Normal Distribution

Standard Deviation in Normal Distribution

A Normal Distribution is determined by two values: the mean of the given data set and the standard deviation of that same data set. For our purposes, the standard deviation is a measure of how widely the data are dispersed around their mean; it is the square root of the variance.

The Standard Deviation Formula

Here, the standard deviation is defined in a MatDeck grid canvas, where users can interact with its variables and parameters in real time for instant calculations.
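As a plain-Python illustration of the calculation, the sketch below uses the population form of the standard deviation, which divides by N. This is an assumption: the sample form divides by N - 1 instead, and the article does not say which one the canvas uses. The data values are made up.

```python
import math
import statistics

def population_std_dev(values):
    """Square root of the mean squared deviation from the mean (divides by N)."""
    n = len(values)
    if n == 0:
        raise ValueError("values must not be empty")
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / n)

data = [2, 4, 4, 4, 5, 5, 7, 9]    # made-up sample values
sigma = population_std_dev(data)   # 2.0 for this data

# Cross-check against the standard library's population standard deviation.
matches_stdlib = math.isclose(sigma, statistics.pstdev(data))
```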

Because it measures dispersion relative to the mean, the standard deviation strongly influences the shape of the distribution. If the mean (centre point) stays the same but the standard deviation increases, the distribution becomes wider and flatter, because the data entries are spread further apart. Conversely, if the standard deviation decreases while the mean stays the same, the distribution becomes narrower and taller.
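A quick way to see this numerically is to compare the peak height of curves that share a mean but differ in standard deviation. The sketch below uses Python's standard library with illustrative values; the peak height of a normal curve is 1 / (sd · √(2π)), so it falls as the standard deviation grows.

```python
import math
from statistics import NormalDist

mean = 0
for sigma in (0.5, 1, 2):          # illustrative values
    dist = NormalDist(mean, sigma)
    peak = dist.pdf(mean)          # equals 1 / (sigma * sqrt(2 * pi))
    print(f"sigma={sigma}: peak height {peak:.4f}")

narrow, wide = NormalDist(mean, 0.5), NormalDist(mean, 2)
```

The narrow curve is taller at the mean, while the wide curve is higher far from the mean, for example at X = 3.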

The graphs below were drawn in MatDeck and show the effect that a changing standard deviation has on the normal distribution.

The effect of the Standard Deviation on a Normal Distribution

Users can just as easily change their distributions in MatDeck by entering new values. In contrast to the standard deviation, changing the mean shifts the whole distribution along the X-axis, as shown below.

The effect of the Mean on a Normal Distribution

Normal Distribution and the Empirical Rule – 68-95-99.7

The Empirical Rule is another widely used aspect of normal distributions. It is also commonly known as the 68-95-99.7 rule, after its three parts.

The Empirical Rule states three facts about the distribution, each an approximate percentage:

  • 68 – 68% of data entries placed on the normal distribution will lie within one standard deviation of the mean
  • 95 – 95% of data entries placed on the normal distribution will lie within two standard deviations of the mean
  • 99.7 – 99.7% of data entries placed on the normal distribution will lie within three standard deviations of the mean

While these may seem like straightforward, static facts about your normal distribution, each of the three figures can be used to estimate ranges for specific data.
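The three percentages can be reproduced from the cumulative distribution function, and the same idea gives a range estimate for a concrete data set. This is a sketch using Python's standard library; the helper share_within and the example mean of 100 with standard deviation 15 are hypothetical.

```python
from statistics import NormalDist

def share_within(k, mean=0.0, std_dev=1.0):
    """Probability that a normal value lies within k standard deviations of the mean."""
    dist = NormalDist(mean, std_dev)
    return dist.cdf(mean + k * std_dev) - dist.cdf(mean - k * std_dev)

shares = {k: share_within(k) for k in (1, 2, 3)}

# Hypothetical example: values with mean 100 and standard deviation 15.
# About 95% are expected between mean - 2*sd and mean + 2*sd.
low, high = 100 - 2 * 15, 100 + 2 * 15
```

The shares do not depend on the particular mean and standard deviation, which is why the rule applies to every normal distribution.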

Expected properties of Normal Distributions

While Normal Distributions vary in position and spread, with means and standard deviations specific to the data, several properties hold for every Normal Distribution.

  • Mean, Mode, Median are all equal
  • The standard normal distribution has a mean value of 0 and a standard deviation value of 1
  • Empirical rule applies to all normal distributions and as shown above can be used to determine the number of data points/samples that fall within certain ranges
  • Unimodal – all normal distributions take the shape of symmetric bell curves with one distinct peak
  • Mean is the centre of the distinct peak – exactly half of the data points are below the mean and the other half of data points are above the mean
  • Normal distributions have zero skew – a skewed distribution is not Gaussian
  • Normal distributions have a kurtosis of 3 and fall under the mesokurtic bracket of “tailedness”. A greater kurtosis indicates heavier tails than a normal distribution, whereas a lower kurtosis indicates lighter tails
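Several of these properties can be checked numerically. The sketch below approximates the standardized third and fourth moments (skewness and kurtosis) with a simple midpoint sum; the helper standardized_moment and the parameters 5 and 2 are illustrative assumptions, and the integration range of ten standard deviations either side of the mean is an arbitrary but generous choice.

```python
from statistics import NormalDist

def standardized_moment(dist, order, steps=20000, span=10.0):
    """Approximate E[((X - mean) / sd) ** order] with a midpoint Riemann sum."""
    lo = dist.mean - span * dist.stdev
    width = 2 * span * dist.stdev / steps
    total = 0.0
    for i in range(steps):
        x = lo + (i + 0.5) * width
        z = (x - dist.mean) / dist.stdev
        total += z ** order * dist.pdf(x) * width
    return total

dist = NormalDist(mu=5, sigma=2)          # arbitrary illustrative parameters
skewness = standardized_moment(dist, 3)   # close to 0
kurtosis = standardized_moment(dist, 4)   # close to 3
median = dist.inv_cdf(0.5)                # equals the mean for a normal distribution
```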


Standard Score/Z-Score of Normal Distributions

Standard Score is another term used extensively when working with Normal Distributions. It refers to the number of standard deviations between a selected data entry and the mean. The Standard Score goes by more than one name: it is also called the Z-score, and it is often quoted as a number of sigmas.

A Standard Score of 2 means that the data point is 2 standard deviations above the mean; a negative score means the data point lies below the mean.
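In code, the calculation is a single subtraction and division. The function name z_score and the example values are hypothetical.

```python
def z_score(value, mean, std_dev):
    """Number of standard deviations between value and mean (signed)."""
    if std_dev <= 0:
        raise ValueError("std_dev must be positive")
    return (value - mean) / std_dev

# Hypothetical data set with mean 50 and standard deviation 5.
above = z_score(60, mean=50, std_dev=5)   # 2.0: two standard deviations above the mean
below = z_score(45, mean=50, std_dev=5)   # -1.0: one standard deviation below the mean
```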

Normal Distribution Python Applications in MatDeck

A Normal Distribution can be sketched roughly by hand to show the main features of the bell curve, but software can represent it far more accurately. One example is the MD functions used above, which draw Normal Distributions accurate to the data entry. Outside of MD functions, users can also produce accurate distributions by coding their own solutions in languages such as Python and C.

Python is an increasingly popular language used both for educational purposes and for complex applications requiring detail and expertise. Python applications for Normal Distributions in MatDeck are also available for use as an alternative to using the MD dedicated distributions functions.

Users can directly embed their Python code solution into the same MD documents where MD distribution functions can be used. The choice of coding a solution in Python or using dedicated functions is completely down to the user’s preference.

Python code to plot a Normal Distribution

Above is an example of a Python solution for drawing a normal distribution coded in MatDeck. Users write standard Python code for their normal distribution applications in their most natural way.

While it requires more lines of code than a dedicated function, a coded solution gives users another way to interact with their distributions through Python. Experienced Python users will recognise the merit of this approach, and novice Python users can use it as a stepping stone to sharpen their coding ability and deepen their understanding.

Normal Distribution created with the Python IDE

Above is the figure that the Python code shown earlier generates when executed in MatDeck. As with the dedicated function, a bell curve is produced. Unlike the dedicated function, the Python solution also plots the actual data values, so users can see how the normal distribution compares with their data entries.
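The listing in the article is an image and is not reproduced here. As a rough stand-in, and not the article's own code, the sketch below uses only Python's standard library to bin a synthetic data set and compare each bar with a normal curve fitted to the same data. It prints the comparison rather than drawing it; the seed, sample size, and bin count are arbitrary assumptions.

```python
from statistics import NormalDist

# Synthetic data standing in for the user's data set (seeded for repeatability).
data = NormalDist(mu=10, sigma=2).samples(1000, seed=42)

fitted = NormalDist.from_samples(data)   # estimates mean and standard deviation
bins = 10
lo, hi = min(data), max(data)
width = (hi - lo) / bins

counts = [0] * bins
for value in data:
    index = min(int((value - lo) / width), bins - 1)   # put the maximum in the last bin
    counts[index] += 1

# Compare each bar (as a density) with the fitted curve at the bin centre.
for i, count in enumerate(counts):
    centre = lo + (i + 0.5) * width
    bar = count / (len(data) * width)
    print(f"{centre:7.2f}  data={bar:.3f}  curve={fitted.pdf(centre):.3f}")
```

With a plotting library available, the bar heights and the curve values could be drawn on the same axes to reproduce a figure of this kind.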

Normal Distribution in Machine Learning

The Normal Distribution is one of the most widely used statistical distributions and underpins many Machine Learning algorithms. Because normal distributions are fundamental to data science, they appear in almost any machine learning application used for data analysis or statistics.

Normal Distributions can be used to represent naturally occurring data collections that a machine learning application needs to model. Linear regression, for example, is commonly analysed under the assumption that the model's errors (residuals) are normally distributed. Logistic regression does not rely on that assumption.

Normal Distribution and Central Limit Theorem

The Central Limit Theorem is closely tied to the normal distribution and is the basis for much reasoning with distributions. In simple terms, it states that as the size of each sample increases, the distribution of the sample means converges to a normal distribution, regardless of the shape of the distribution the data are drawn from (provided its variance is finite).

The Central Limit Theorem is closely related to what statisticians call the Law of Large Numbers. The Law of Large Numbers is a result in probability theory that describes how the average of repeated trials behaves as the number of trials grows. In the case of distributions, the more samples that are added to our data set, the closer the sample mean will be to the mean of the overall population.
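A small simulation illustrates both ideas. The sketch below draws repeated samples from a uniform distribution, which is flat rather than bell-shaped, and looks at how the sample means are distributed. The sample size, number of trials, and seed are arbitrary assumptions.

```python
import random
import statistics

rng = random.Random(0)            # seeded so the run is repeatable
sample_size, trials = 30, 2000    # illustrative values

# Draw from a uniform distribution on [0, 1], which is flat, not bell-shaped.
sample_means = [
    statistics.fmean(rng.random() for _ in range(sample_size))
    for _ in range(trials)
]

centre = statistics.fmean(sample_means)
spread = statistics.stdev(sample_means)
expected_spread = (1 / 12) ** 0.5 / sample_size ** 0.5   # sigma / sqrt(n)

# Share of sample means within one standard deviation of their centre.
within_one = sum(abs(m - centre) <= spread for m in sample_means) / trials
```

The sample means gather around 0.5, the mean of the uniform distribution, their spread is close to sigma / √n, and roughly 68% of them fall within one standard deviation of the centre, as a normal distribution would predict.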

Normal Distributions in Residual Plots

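Normal Distributions also have applications in residual plots. A residual plot shows the residuals, that is, the differences between the actual recorded data points and the values given by the prediction equation. When the model fits well, the residuals are scattered around zero and are often approximately normally distributed.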
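To make the idea concrete, the sketch below fits a straight line by ordinary least squares and computes the residuals that such a plot would display. The helper fit_line and the observations are made up for illustration, and the sketch is not MatDeck's implementation.

```python
def fit_line(xs, ys):
    """Ordinary least-squares slope and intercept for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

xs = [1, 2, 3, 4, 5, 6]                    # made-up observations
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 11.9]
slope, intercept = fit_line(xs, ys)

# Residual = observed value minus the value predicted by the fitted line.
residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
```

Plotting the residuals against xs gives the residual plot; for a least-squares line with an intercept, the residuals always sum to zero up to rounding.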

Residual Plots in MatDeck

Applications in Industry

The normal distribution is a cornerstone of statistics and is used repeatedly across a wide variety of applications. Because of this, a function that instantly generates distributions from your parameters and data entries is essential in industries that rely on statistics.

Some industries that heavily use Normal Distribution include:

  • Finance and stock applications
  • Manufacturing
  • Marketing
  • Large scale logistics and inventory
  • Accounting
  • Numerous engineering applications

The partial list above is a testament to the importance and popularity of normal distributions in industry. Essential to the study of statistics, normal distributions are just as essential to industries of every scale and to academia.

Normal Distribution in MatDeck

In MatDeck, the normaldist function is used to calculate the cumulative distribution function (CDF) of the normal distribution. The function requires only 3 arguments and can handle several different data types. We can see the function in use below, where it is called inside the curve2d function.

Normal probability distribution

The first argument is the value or values for which you would like to calculate the normal CDF. It can be a single number or a vector containing multiple values, in which case the function is applied to each value. The second argument is the mean (average) of the population, which can be obtained directly from data using the average function. The final argument is the standard deviation of the data; like the mean, it can be calculated directly from the data.
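The same three inputs can be mirrored in plain Python with the standard library, as a sketch of the idea rather than of normaldist itself. The measurements are made up, and the use of the sample standard deviation is an assumption, since the article does not say which form MatDeck applies.

```python
from statistics import NormalDist, fmean, stdev

data = [4.8, 5.1, 5.0, 4.9, 5.3, 5.2, 4.7, 5.0]   # made-up measurements
mean = fmean(data)
std_dev = stdev(data)          # sample standard deviation (divides by n - 1)

dist = NormalDist(mean, std_dev)
values = [4.8, 5.0, 5.2]
cdf_values = [dist.cdf(v) for v in values]   # P(X <= v) for each v
```

Each entry of cdf_values is the probability that a value drawn from the fitted distribution is less than or equal to the corresponding entry of values, so the CDF at the mean is 0.5.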

Here is an in-depth look at all the Normal Probability Distribution functions:

Inverse Normal Distribution

The inverse normal distribution is the mathematical operation that calculates the value corresponding to a specified probability under a given normal distribution. It provides the inverse mapping of the cumulative distribution function (CDF) of the normal distribution.
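A short sketch with Python's standard library shows the inverse mapping and its round trip through the CDF. The parameters 100 and 15 and the probability 0.975 are illustrative; NormalDist.inv_cdf is the standard-library method, not a MatDeck function.

```python
from statistics import NormalDist

dist = NormalDist(mu=100, sigma=15)    # illustrative parameters

p = 0.975
x = dist.inv_cdf(p)        # value below which 97.5% of the distribution lies
round_trip = dist.cdf(x)   # applying the CDF recovers p
```

Here x lies about 1.96 standard deviations above the mean, which is the familiar cut-off for a two-sided 95% interval.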

Example of using and plotting the Normal Distribution in the MatDeck document

Normal Distribution in Python

The MD Python library brings all of MatDeck’s mathematical and statistical functions to Python, allowing the user to call MatDeck functions natively. With its simple syntax and easy-to-understand functions, MatDeck is a good fit alongside Python, especially given the speed boost it provides. All MatDeck statistics functions are written in C++, allowing Python users to achieve speed similar to C++ without losing any of Python’s famous simplicity.

A look at the normaldist function via MatDeck’s Function Help Centre