Demystifying Artificial Neural Networks

Greetings! I'm thrilled that your amazing brain led you here to read this blog. You may be wondering, "Why am I specifically addressing your brain?" Well, that's because today, it's the star of our show. That's right, we're taking a closer look at the magnificent human brain!


The human brain is an incredibly complex and remarkable organ that processes an astonishing 11 million bits of information every second! 

Today, we'll be taking a deep dive into the intricacies of this amazing organ and exploring ways to replicate its remarkable abilities in our machines. It may sound like something out of science fiction, but that's exactly what we do here.

You may have asked yourself many a time: how does our brain work? How are we able to think? What are thoughts? Where do new ideas come from? Curious minds have an infinite number of questions. Don't worry, you are not alone; for centuries humans have been captivated by the workings of the brain, and this fascination has led to the creation of Artificial Neural Networks (ANNs), one of today's most popular algorithms.

So, let's explore the world of ANN and see how it works. As I like to say, 

"ANYONE CAN LEARN ANN"

It's story time… To learn more about ANNs, we will be joining YUE on an exciting journey as she delves into the secrets of the human brain to bring its capabilities to our machines and learns about artificial neural networks. This will be a great opportunity for us to explore the world of ANNs together.

Once upon a time, in the land of technology, there lived a young woman named YUE. She was a curious and adventurous soul who was fascinated by the magic of machine learning. One day, while she was busy experimenting with linear regression, she discovered that she could use it to grow ripe red tomatoes!

But YUE wasn't content to simply sit in her garden and admire her handiwork. No, she wanted to go on exciting adventures and explore the wonders of the world. So, she packed her bags, and set off on a new journey…

Along the way, YUE encountered all sorts of strange and wondrous creatures. She saw cheetahs that could run incredibly fast, zebras with black and white stripes that help them blend into their surroundings, and ants smaller than her eyelashes. She marveled at the unique abilities of each and every animal she encountered.

But no matter how far she roamed, YUE always found herself drawn back to the sky and its calmness. The sight of those billowing clouds and flitting birds never ceased to take her breath away. And as she gazed up at the sky, she realized that the one thing that set humans apart from all other creatures was our imagination, our ability to dream and wonder.

Yue Gazing at the sky 

Yue became fascinated by the way the human brain worked, its complexity and its abilities, and started wondering: is it possible to create this capability in our computer systems? Wouldn't it be amazing if computers could also learn, think, and make decisions just like a human?

One day, while Yue was researching this very question, she stumbled upon a revolutionary idea: Artificial Neural Networks, or ANNs. ANNs were modeled after the structure and function of the human brain, with tiny parts called neurons that processed information and made decisions. She was so intrigued by this discovery that she wanted to learn more about it, so she decided to visit the secret land of IntroToArtificialintelligence.

Upon arriving at the secret land, YUE began her inquiry about ANNs, and in the process, she met the same boy from her previous encounter. Remembering how he had explained linear regression during their last meeting, she approached him to express her gratitude and introduced herself. YUE shared how his explanation had helped her learn linear regression and cultivate the world's best tomatoes. The boy, who introduced himself as Yedhant, was happy to hear that his knowledge had proven useful to her. He expressed his interest in tasting her home-grown tomatoes someday and was glad to meet her again. After some conversation, YUE explained her reason for coming to this place and her fascination with ANNs. Intrigued by her curiosity, Yedhant offered his assistance. Excited by the opportunity to learn, YUE accepted his offer and looked forward to embarking on this new adventure.

Yedhant took hold of Yue's hand and led her to a different realm. Intrigued, Yue asked, "Where are we?" Yedhant replied, "We're in a secret chamber of ANN. Let me take you on a journey into the world of ANN." With a snap of his fingers, a vivid blue human brain appeared before them.

"Before we delve into ANN, let's first familiarize ourselves with the workings of the human "

Human Brain 

image credit

He clicked his fingers again and a network of cells appeared, representing the neurons in the human brain. "These are neurons," explained Yedhant as he zoomed in on one of them. "Neurons are the fundamental building blocks of the human nervous system. They play a vital role in receiving and transmitting impulses or information to various parts of the nervous system, allowing our brain to process and respond to stimuli from the environment.

Neurons of human brain

image credit

"Neurons work in complex networks consisting of millions and millions of cells, each with its own unique function. Some neurons, for example, are responsible for processing sensory information such as touch or sound, while others control movement or help regulate our bodily functions like breathing and digestion.

To better understand the importance of this complex network, consider the scenario of driving a car. When we're behind the wheel, our brain must process a vast amount of information in order to keep us safe and on track. We must monitor our speed, adjust our steering, react to the movements of other vehicles, and make quick decisions based on the ever-changing conditions around us.

This level of processing is made possible by the intricate network of neurons working together in our nervous system. Neurons send and receive messages rapidly, allowing us to react quickly and effectively in a variety of situations. Without this complex network, driving a car would be much more difficult, and our ability to navigate the road safely would be greatly compromised.


Then Yedhant zoomed in on the neuron: "Here we can see many neuronal elements."

Neuron and its different component

image credit 

Dendrites and axons are important components of neurons. Dendrites receive information from other neurons and transmit it to the cell body. They are like the "branches" of a neuron, allowing it to receive signals from many different sources. In contrast, axons transmit signals from the cell body to other neurons or cells. They are like the "trunks" of a neuron, allowing it to send signals over long distances. Together, dendrites and axons allow neurons to communicate with each other and form complex networks, enabling the nervous system to process and respond to information from the environment. 

Neural Network

Inspired by the structure of neurons in the human brain, as shown above, and their complex networks, scientists have developed artificial neural networks. Yedhant snaps his fingers again: "Check this out, it's a simple neural network unit composed of an output y, a set of inputs x: x1, ..., xN, input weights w: w1, ..., wN, a bias b, and an activation function f."

Example of neural networks neural unit

image credit

In simple terms, imagine you have a robot that needs to do a task, like sorting objects. The robot has a bunch of sensors that can detect things about the objects, like their color or size. These sensors are like the inputs, which we call x: x1, x2, ..., xN.

Now, the robot needs to use these inputs to decide what to do with the objects. It can't just add up all the inputs and make a decision based on that - it needs to give some inputs more importance than others. So, it assigns weights to each input, which we call w: w1, w2, ..., wN. Think of these weights as how important each sensor is for the task.

But even with weights, the robot still needs to make a decision based on all the inputs. That's where the bias comes in. The bias is like a baseline value that the robot starts with before looking at the inputs. It's like saying, "I know a little bit about the objects already, so I'm going to take that into account too." The bias is like a weight for a fixed input that is always on, which the robot has learned from its experience.

Finally, the robot needs to decide what to do with all this information. It can't just give a yes or no answer - it needs to give a more nuanced response. So, it uses an activation function f to convert the weighted sum of inputs and bias into an output y. The activation function is like the robot's decision-making algorithm - it determines how the robot should respond based on the inputs and weights.

So, to summarize: a simple neural network is like a robot that uses inputs (sensors) with weights (importance) and a bias (baseline value) to make a decision using an activation function (decision-making algorithm).
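
To make this concrete, here is a minimal sketch in Python (using NumPy) of the single neural unit described above. The sensor values, weights, bias, and the choice of a sigmoid activation are all illustrative assumptions, not values from the story.

```python
import numpy as np

def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, w, b, f=sigmoid):
    """One neural unit: weighted sum of inputs plus bias, passed through activation f."""
    z = np.dot(w, x) + b      # w1*x1 + ... + wN*xN + b
    return f(z)               # y = f(z)

# Hypothetical robot sensors: [redness, size, softness]
x = np.array([0.9, 0.7, 0.2])
w = np.array([0.8, 0.1, -0.5])   # how important each sensor is for the task
b = -0.3                         # baseline value the robot starts with
print(neural_unit(x, w, b))      # output y, a number between 0 and 1
```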

History of ANN

To gain knowledge about a subject, it is often recommended to study its history and understand the foundational seeds that have contributed to the growth of the magnificent tree. So let's turn back the clock and learn about each part of a neural network. The history of Artificial Neural Networks (ANNs) can be traced back to the 1940s and 1950s, when researchers in the fields of psychology, mathematics, and engineering started exploring the idea of creating machine models of the human brain.

Threshold Logic Unit

Threshold Logic Unit (TLU)

One of the earliest contributions to the field was made by Warren McCulloch and Walter Pitts, who proposed a simple mathematical model of a neuron in 1943. This model, called a threshold logic unit (TLU), formed the basis for the design of early artificial neural networks. A TLU is a simple binary classifier that maps its inputs to outputs based on a fixed threshold value. In a TLU, each input is assigned a weight, and the weighted sum of the inputs is compared to a fixed threshold.

output = 1 if weighted sum >= threshold

output = 0 if weighted sum < threshold

TLUs are simple, fast, and effective for many binary classification problems, but their limitation is that they can only model linear separations and are sensitive to the choice of threshold. They are used to implement binary decisions and are widely used in many applications, including digital circuits, computer vision, machine learning, ANNs, and more.
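
As a rough illustration, here is what a TLU might look like in Python; the weights and threshold below are hand-picked, made-up values (which is exactly the TLU's limitation: nothing adjusts them automatically).

```python
import numpy as np

def tlu(x, w, threshold):
    """Threshold Logic Unit: output 1 if the weighted sum reaches the threshold, else 0."""
    weighted_sum = np.dot(w, x)
    return 1 if weighted_sum >= threshold else 0

# With hand-picked weights and threshold, this TLU behaves like an AND gate
w = np.array([1.0, 1.0])
threshold = 1.5
for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(x, "->", tlu(np.array(x, dtype=float), w, threshold))
```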

Perceptron

In the 1950s and 1960s, researchers continued to build on this work and developed early prototypes of ANNs, such as the Perceptron, developed by Frank Rosenblatt at Cornell Aeronautical Laboratory. It was one of the earliest models of ANNs, inspired by the structure and function of the human brain.

A perceptron consists of artificial neurons that receive inputs from one or more data points and produce a single output; these outputs are then combined to produce a final prediction. Each input is assigned a weight, which represents the importance of that input in determining the output. The perceptron can be trained to make accurate predictions by adjusting these weights.

Yue asked Yedhant if he could explain it with examples, since she loved his examples. This made Yedhant blush, and to hide his shyness he muttered "sure" and cleared his throat before starting to explain.

Yedhant: YUE, a perceptron is like a TLU, but it can do a bit more. It takes in multiple inputs, just like the TLU, and produces a single output. However, unlike a TLU, a perceptron can learn from the inputs it receives and adjust the threshold value accordingly. This means that it can improve its accuracy over time.

Let's say you want to train a perceptron to identify ripe tomatoes in your garden. You would start by feeding the perceptron a bunch of tomatoes and indicating which ones are ripe and which ones are not. The perceptron would then adjust its threshold value to produce the correct output for each tomato.

For example, if you fed the perceptron a ripe tomato with a large size, a red color, and a soft firmness, it would produce an output indicating that the tomato is ripe. On the other hand, if you fed the perceptron an unripe tomato with a small size, a green color, and a hard firmness, it would produce an output indicating that the tomato is not ripe.

Over time, as you continue to train the perceptron with more tomatoes, it would learn to adjust its threshold value to become more accurate at distinguishing between ripe and unripe tomatoes."

Yue: "Hey Yedhant, I'm a bit confused. Could you clarify whether the tlu and perceptron are essentially the same, since they both take multiple inputs, add them together, and make a determination about whether tomatoes are ripe?"


Yedhant: Yue, you're right, there is some similarity between threshold logic units (TLUs) and perceptrons, in that they both take multiple inputs and produce a single output based on a threshold value. However, perceptrons have an additional capability that TLUs do not have: the ability to learn from data and improve their accuracy over time.

In the context of identifying ripe tomatoes in your garden, a TLU and a perceptron would both take in multiple inputs such as size, color, and firmness, and produce an output indicating whether the tomato is ripe or not. However, if you were using a TLU to identify ripe tomatoes, you would need to manually adjust the threshold value to achieve the desired accuracy. On the other hand, a perceptron can automatically adjust its threshold value based on the training data it receives, which allows it to learn and improve its accuracy over time.

So while there is some similarity between TLUs and perceptrons, the ability to learn and improve is what sets perceptrons apart and makes them a powerful tool in the field of artificial neural networks.

 Yue: "Wow, that's pretty cool! So, just to clarify, are you saying that perceptrons can learn on their own? If so, could you explain to me how they actually learn?"

Perceptron Learning Rule

Yedhant: Sure. The perceptron learning rule is a mathematical algorithm used to adjust the weights assigned to each input feature in a perceptron. These weights determine how much influence each input has on the perceptron's decision to classify a tomato as ripe or not.

Let's say that we have a perceptron with three input features: size, color, and firmness. The perceptron's weights for each input feature are initially set to random values. We then feed the perceptron training data consisting of tomatoes that we have already identified as ripe or not. For each tomato, the perceptron produces an output indicating whether the tomato is ripe or not.

If the perceptron misclassifies a tomato, the weights are adjusted using the perceptron learning rule. Specifically, the weights are increased for input features that are associated with ripe tomatoes and decreased for input features that are associated with unripe tomatoes. This adjustment process continues with more training data until the perceptron's accuracy is acceptable.

In mathematical terms, the perceptron learning rule works in the following way: for each training example, the prediction y_hat is compared with the true label y, and each weight is updated as w_i := w_i + learning_rate * (y - y_hat) * x_i; the bias is updated in the same way with a constant input of 1.

The Perceptron learning rule is an example of supervised learning, as the algorithm uses labeled training data to learn the relationship between the inputs and the outputs. The Perceptron learning rule can be applied to binary classification problems, where the goal is to predict one of two possible outcomes.

Once the Perceptron has been trained on the training data, it can be used to make predictions for new, unseen data. The accuracy of the predictions depends on the quality of the training data, the choice of the learning rate, and the convergence of the algorithm.
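
Here is a minimal sketch of the perceptron learning rule applied to the tomato example; the feature values, labels, learning rate, and number of epochs below are invented purely for illustration.

```python
import numpy as np

def train_perceptron(X, y, lr=0.1, epochs=20):
    """Perceptron learning rule: nudge the weights whenever a prediction is wrong."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, target in zip(X, y):
            pred = 1 if np.dot(w, xi) + b >= 0 else 0
            error = target - pred        # 0 if correct, +1 or -1 if wrong
            w += lr * error * xi         # w_i := w_i + lr * (y - y_hat) * x_i
            b += lr * error              # bias updated with a constant input of 1
    return w, b

# Hypothetical tomatoes: [size, redness, softness]; label 1 = ripe, 0 = unripe
X = np.array([[0.9, 0.8, 0.7],
              [0.2, 0.1, 0.3],
              [0.8, 0.9, 0.6],
              [0.3, 0.2, 0.1]])
y = np.array([1, 0, 1, 0])
w, b = train_perceptron(X, y)
print("learned weights:", w, "bias:", b)
```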

The Perceptron was initially proposed as a solution for solving binary classification problems, where the goal is to predict one of two possible outcomes (e.g. yes/no, true/false, 0/1). Despite its simplicity, the Perceptron has been widely used in a variety of applications, such as image and speech recognition, natural language processing, and even finance.

However, it was later discovered that the Perceptron could not solve problems that involved non-linearly separable data, meaning that the data could not be separated into different classes by a simple line. 

Yue: "Oh no, this is terrible! So, what do we do about non-linearly separable data? I mean, practically most of the data is non-linear, right?"

Yedhant: Well, let's not lose hope just yet, my dear friend! The perceptron is limited to solving problems where the data can be separated into different classes by a simple line. However, there are ways to overcome this limitation using more advanced techniques like backpropagation.

Backpropagation

In the 1980s and 1990s, advances in computer technology and the development of new training algorithms, such as backpropagation, revitalized interest in ANNs. Researchers began exploring deeper and more complex networks, leading to the development of new architectures, such as Convolutional Neural Networks and Recurrent Neural Networks.

Back-propagation is a widely used algorithm for training artificial neural networks, especially multi-layer feedforward neural networks. It is an extension of the Perceptron learning rule, which was only able to learn linear relationships between inputs and outputs.

Back-propagation is used to adjust the weights of the artificial neurons in a neural network so that the network is able to make accurate predictions for a given set of training data. The key idea behind back-propagation is to propagate the error back through the network and use it to update the weights in a way that minimizes the error.

In backpropagation, training is carried out using a loss function, a mathematical function that measures how well the neural network is performing on a given set of training examples. The goal of back-propagation is to minimize the value of this loss function, which means improving the performance of the neural network on the training data.

When the neural network makes a prediction on a training example, the output is compared to the actual target output. The difference between the predicted output and the actual target output is called the error, or the loss. The loss function is a mathematical function that measures the difference between the predicted output and the actual target output.

The backpropagation algorithm works by computing the gradient of the loss function with respect to the weights of the neural network. The gradient tells us how much the loss function will change if we make a small change to the weights of the neural network.

Once we have the gradient, we can update the weights using an optimization algorithm like gradient descent. The gradient descent algorithm works by taking small steps in the direction of the negative gradient, which gradually reduces the value of the loss function and improves the performance of the neural network.

Before the invention of back-propagation, it was difficult to train multi-layer neural networks because there was no efficient way to compute the gradients of the loss function with respect to the weights. Without the gradient information, it was not possible to use gradient descent to update the weights of the neural network.

In the 1980s, the backpropagation algorithm was invented, which made it possible to efficiently compute the gradients of the loss function with respect to the weights of the neural network. This allowed researchers to train multi-layer neural networks and achieve state-of-the-art performance on many tasks, including image recognition and natural language processing.


Yue: Can you also give some examples of perceptron learning and back-propagation?

Yedhant: Sure. Perceptron learning and back-propagation are two different algorithms that are commonly used in neural networks. We can also say that back-propagation is an extension of the Perceptron learning rule, which was only able to learn linear relationships between inputs and outputs.

Yue: That sounds interesting. Can you give me an example of how they work?

Yedhant: Sure! Let's say we want to teach a computer to recognize handwritten digits, like the numbers 0-9. We could use perceptron learning to train the computer to recognize the digit 0. We would give the computer examples of the digit 0 and tell it whether its classification was correct or not. If it classified the digit 0 correctly, we would give it positive feedback, and if it classified it incorrectly, we would give it negative feedback. Over time, the computer would adjust its classification model until it could correctly classify the digit 0.

Yue: I see. So, perceptron learning is like trial-and-error, where the computer learns by getting feedback on its performance.

Yedhant: Yes, that's right. Now, let's say we want to train the computer to recognize all the digits from 0-9. We could use backpropagation to do this. We would give the computer a set of training examples that include all the digits, and we would tell it the correct classification for each example. The computer would start with a random classification model and make predictions based on that model. We would then calculate the error between the predicted classification and the correct classification, and use this error to update the model. We would repeat this process, adjusting the model each time, until the error was minimized and the computer could correctly classify all the digits.

Yue: Ah, I see. So, backpropagation is more complex than perceptron learning because it adjusts the model based on the error between the predicted classification and the correct classification.

Yedhant: Yes, that's right. Backpropagation is a more powerful algorithm that can be used to train more complex models, like neural networks with multiple layers. Let's see how back-propagation works in detail.

Here's how back-propagation works in detail:
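
As a concrete illustration, here is a minimal back-propagation loop for a tiny one-hidden-layer network with sigmoid activations and a squared-error loss, trained on XOR (a classic example of data that is not linearly separable). The architecture, learning rate, and number of epochs are illustrative assumptions, not a recipe.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# XOR-style data: not separable by a single straight line
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

rng = np.random.default_rng(0)
W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)   # input -> hidden layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)   # hidden layer -> output
lr = 1.0

for epoch in range(10000):
    # Forward pass: compute the network's predictions
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass: propagate the error back through each layer
    error = out - y                          # gradient of squared error (up to a constant)
    d_out = error * out * (1 - out)          # error at the output layer
    d_h = (d_out @ W2.T) * h * (1 - h)       # error pushed back to the hidden layer

    # Gradient descent: take small steps against the gradient
    W2 -= lr * h.T @ d_out;  b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;    b1 -= lr * d_h.sum(axis=0)

print(out.round(3))   # predictions should move toward [0, 1, 1, 0]
```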

Backpropagation is a powerful algorithm that allows neural networks to learn non-linear relationships between inputs and outputs. It has been used in many applications, including image and speech recognition, natural language processing, and even robotics. However, it can be computationally intensive, especially for large neural networks, and requires careful tuning of the hyperparameters, such as the learning rate and the number of hidden layers.

Activation Function

One of the key features of a neural network that helps in capturing non-linearity is the activation function.

Yedhant: Hey Yue, do you know what an activation function is in neural networks?

Yue: No, what is it?

Yedhant: Well, imagine you have a robot that needs to learn how to recognize different objects, like a ball or a book. The robot has a camera that takes pictures of the objects and sends the pictures to its brain, which is like a neural network. The brain needs to figure out what object is in the picture based on the patterns and features it sees.

Yue: Okay, I think I understand.

Yedhant: Good. Now, imagine that the robot's brain is made up of lots of little neurons that each look at different parts of the picture. Each neuron takes in some information about the picture and decides whether or not it should "fire" or activate. But just firing alone isn't enough to tell the robot what object it's seeing. We need to use an activation function to help the neurons work together and figure out what the object is.

Yue: What does the activation function do?

Yedhant: The activation function is like a filter that takes in the input from the neuron and applies some math to it. This math helps the neuron decide whether it should fire or not based on the patterns and features it sees. For example, if the neuron sees a bright red ball, the activation function might say "yes, fire!" because it recognizes that pattern.

Yue: I think I get it. So the activation function helps the neurons work together to figure out what object is in the picture.

Yedhant: Exactly! Without the activation function, the neurons would just be firing randomly and wouldn't be able to recognize patterns or make sense of the input. The activation function is what helps the neural network learn and model more complex relationships between the input and output.

So an activation function is a mathematical function that is applied to the input of a neuron in a neural network to produce an output or activation signal. The purpose of the activation function is to introduce non-linearity into the output of the neuron, which allows the neural network to learn and model more complex relationships between the input and output.

Without an activation function, the output of a neuron in a neural network would be a simple linear combination of its inputs, which can be limiting in terms of the types of patterns and relationships that can be modeled. By applying an activation function, the output of a neuron can be transformed in a non-linear way, which enables the neural network to learn and model more complex relationships and patterns in the data.

There are many different types of activation functions, each with its own mathematical form and properties. Some common activation functions include the sigmoid function, ReLU (Rectified Linear Unit) function, tanh (hyperbolic tangent) function, and softmax function. The choice of activation function depends on the specific requirements and characteristics of the neural network and the problem being solved. A few of the well-known activation functions are as follows.

Sigmoid Function:

The sigmoid function is a mathematical function that is commonly used as an activation function in artificial neural networks. It takes any input value and returns a value between 0 and 1.

The sigmoid function has an S-shaped curve that approaches 0 as the input goes toward negative infinity and rises toward 1 as the input approaches positive infinity. The midpoint of the curve is at an input of 0, which means that when the input is 0, the output of the sigmoid function is 0.5.

The sigmoid function is often used in neural networks to introduce non-linearity into the output of neurons. By applying the sigmoid function to the output of a neuron, the output is transformed in a non-linear way, which allows the neural network to model more complex relationships between the input and output.

The sigmoid function can also be used to represent probabilities. If the input to the sigmoid function represents the log-odds of a binary event (e.g., whether a picture shows a dog or not), the output of the sigmoid function represents the probability of that event occurring.

For example, if the input to the sigmoid function is 2 (i.e., the log-odds of the event), the output of the sigmoid function is approximately 0.88, which represents a probability of 0.88 that the event will occur.

Therefore, it is commonly used in binary classification problems where the output is a probability between 0 and 1. The sigmoid function is easy to compute, but it can suffer from the vanishing gradient problem when the input is very large or very small, making training slow to converge.
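
In formula form, sigmoid(x) = 1 / (1 + e^(-x)). Here is a quick sketch with a few illustrative inputs:

```python
import numpy as np

def sigmoid(x):
    """sigmoid(x) = 1 / (1 + e^(-x)); output is always between 0 and 1."""
    return 1.0 / (1.0 + np.exp(-x))

print(sigmoid(0))    # 0.5, the midpoint of the S-curve
print(sigmoid(2))    # ~0.88, matching the log-odds example above
print(sigmoid(-10))  # ~0.000045, very close to 0 for strongly negative inputs

# The gradient sigmoid(x) * (1 - sigmoid(x)) shrinks toward 0 for large |x|,
# which is the vanishing gradient issue mentioned above.
print(sigmoid(10) * (1 - sigmoid(10)))
```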

ReLU Function:  

ReLU (Rectified Linear Unit) is a commonly used activation function in artificial neural networks. It takes any input value and returns the input value if it is positive, or 0 if it is negative.

In other words, the ReLU function "turns on" for positive input values and "turns off" for negative input values. This simple activation function has become very popular in deep learning because it can help neural networks learn faster and avoid the vanishing gradient problem.

The ReLU function has a linear output for positive values, which means that it preserves the magnitude of positive inputs. However, for negative inputs, the output (and therefore the gradient) is always 0, and the function is not differentiable at exactly 0. This can cause some problems in training neural networks, but there are workarounds, such as using a variant of ReLU called leaky ReLU, which allows a small, non-zero output (a small fraction of the input) for negative input values.

ReLU is often used in the hidden layers of neural networks, where it can help to introduce non-linearity and sparsity in the output of the neurons. By "turning off" some of the neurons in the hidden layers, ReLU can help to simplify the neural network's computation and prevent overfitting.

In summary, the ReLU function is a simple activation function that can help neural networks learn faster and avoid the vanishing gradient problem. It "turns on" for positive input values and "turns off" for negative input values, making it a popular choice in deep learning due to its simplicity and speed of computation. ReLU does not suffer from the vanishing gradient problem, but it can lead to the "dying ReLU" problem where some neurons become inactive and stop learning.
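
A small sketch of ReLU and its leaky variant; the sample inputs and the leak factor alpha are illustrative choices.

```python
import numpy as np

def relu(x):
    """ReLU: pass positive values through unchanged, turn negative values into 0."""
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    """Leaky ReLU: keep a small slope for negative inputs so neurons don't 'die'."""
    return np.where(x > 0, x, alpha * x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
print(relu(x))         # [0.  0.  0.  0.5 3. ]
print(leaky_relu(x))   # [-0.03  -0.005  0.     0.5    3.   ]
```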

tanh (hyperbolic tangent) function: 

The tanh (hyperbolic tangent) function is another commonly used activation function in artificial neural networks. It is a non-linear function that takes any input value and returns a value between -1 and 1.

Like the sigmoid function, the tanh function has an S-shaped curve that approaches -1 as the input goes toward negative infinity and rises toward 1 as the input approaches positive infinity. The midpoint of the curve is at an input of 0, which means that when the input is 0, the output of the tanh function is 0.

The tanh function is similar to the sigmoid function in that it can introduce non-linearity into the output of neurons. However, the tanh function has some advantages over the sigmoid function. For example, the tanh function is symmetric around the origin (i.e., 0), which means that it can model negative inputs as well as positive inputs. Additionally, the output of the tanh function is centered around 0, which can help to reduce the impact of vanishing gradients during training.

The tanh function is often used in the hidden layers of neural networks, where it can help to introduce non-linearity and capture more complex patterns in the data. It is also sometimes used as an activation function in the output layer of neural networks when the output values are expected to lie between -1 and 1.

In summary, the tanh function is a non-linear activation function that takes any input value and returns a value between -1 and 1. It is similar to the sigmoid function but has some advantages, such as being symmetric around the origin and centered around 0. The tanh function is commonly used in the hidden layers of neural networks to introduce non-linearity and capture more complex patterns in the data. Tanh can be useful in multi-class classification problems or when the output values have a wider range than 0 to 1. However, tanh can also suffer from the vanishing gradient problem, though to a lesser extent than the sigmoid function.
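
A brief sketch comparing tanh and sigmoid on the same (illustrative) inputs, showing that tanh outputs are centered around 0 while sigmoid outputs are centered around 0.5:

```python
import numpy as np

x = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
print(np.tanh(x))                # outputs lie between -1 and 1, centered at 0
print(1.0 / (1.0 + np.exp(-x)))  # sigmoid of the same inputs, between 0 and 1
# The two are closely related: tanh(x) = 2 * sigmoid(2x) - 1
```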

Softmax function:

The softmax function is an activation function commonly used in the output layer of neural networks for classification problems. It takes a vector of arbitrary real-valued inputs and normalizes them to a probability distribution over several possible output classes.

In other words, the softmax function takes the input values and converts them into a set of probabilities that sum up to 1, where each probability represents the likelihood of the input belonging to a particular output class.

The formula for the softmax function is as follows:

softmax(x_i) = exp(x_i) / sum_j exp(x_j)

where x_i is the i-th element of the input vector x, exp is the exponential function, and the sum in the denominator is taken over all elements of the input vector.


The softmax function has a few important properties. First, it ensures that the output probabilities are always positive and sum up to 1, which is necessary for classification tasks. Second, it is differentiable, which allows for gradient-based optimization during training. Finally, it can handle multiple classes at once, making it useful for multi-class classification problems.

The softmax function is often used in conjunction with the cross-entropy loss function, which measures the difference between the predicted probability distribution and the true probability distribution. During training, the goal is to minimize the cross-entropy loss by adjusting the weights and biases of the neural network using a technique like backpropagation.

In summary, the softmax function is an activation function used in the output layer of neural networks for classification problems. It normalizes the input values to a probability distribution over multiple output classes. The softmax function has several important properties, such as ensuring that the output probabilities sum up to 1 and being differentiable, making it useful for gradient-based optimization during training.
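
A minimal sketch of the softmax function; the class scores are made-up, and subtracting the maximum score is a common numerical-stability trick rather than part of the formula itself.

```python
import numpy as np

def softmax(x):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    shifted = x - np.max(x)      # subtract the max for numerical stability
    exps = np.exp(shifted)
    return exps / exps.sum()

scores = np.array([2.0, 1.0, 0.1])   # hypothetical scores for 3 classes
probs = softmax(scores)
print(probs)        # roughly [0.66, 0.24, 0.10]
print(probs.sum())  # 1.0
```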

Sigmoid vs Softmax

The sigmoid function is a type of logistic function that maps a real-valued input to an output in the range [0,1]. It is often used in the hidden layers of neural networks to introduce nonlinearity into the model, and when the network needs to output a single probability, as in binary classification, the sigmoid is a good choice for the output activation.

On the other hand, the softmax function is typically used in the output layer of neural networks when the output is categorical. It is also a type of logistic function, but it maps a real-valued input to a probability distribution over a set of discrete outcomes. The output of the softmax function is a vector of probabilities that sum to one. Therefore, it is suitable for multi-class classification tasks.

In summary, the main difference between the sigmoid and softmax functions is that the sigmoid is applied to each value independently and outputs a single probability between 0 and 1, which suits binary classification, whereas the softmax is applied to a whole vector of scores and outputs a probability distribution over several classes that sums to 1, which suits multi-class classification.

Summary of activation functions

Choosing the right activation function depends on the specific task and the characteristics of the data. Sigmoid and tanh can be useful in binary or multi-class classification problems, while ReLU can be used in deep learning architectures where speed is a concern. Softmax is useful in multi-class classification problems where the output values represent probabilities. However, each function has its own limitations and can lead to problems during training if not used carefully.

Challenges And Limitations

Yue: That's interesting. Are there any issues that we may face while training neural networks?

Yedhant: That's a good question. There are two key issues one needs to watch out for while training a neural network: overfitting and the vanishing gradient. Have you heard of those?

Yue: No, what's that?

Overfitting

Yedhant: Overfitting is when a neural network becomes too good at memorizing the training data, but doesn't generalize well to new data. It's like when you memorize all the answers for a test, but you don't really understand the concepts, so you can't apply them to new problems.

Yue: Oh, I see. So how can we prevent overfitting in neural networks?

Yedhant: One way is to use regularization techniques, like L1 or L2 regularization. It's like when you're trying to learn to play a new song on the piano, but you focus too much on playing every note exactly right. You end up playing the song perfectly, but it doesn't sound very musical. With regularization, you add a penalty for complexity, which encourages the network to learn simpler and more generalizable patterns.
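
A minimal sketch of how an L2 penalty can be added to a loss; the penalty strength lam and the toy numbers are illustrative assumptions.

```python
import numpy as np

def loss_with_l2(predictions, targets, weights, lam=0.01):
    """Mean squared error plus an L2 penalty that 'charges' the model for large weights."""
    mse = np.mean((predictions - targets) ** 2)
    l2_penalty = lam * np.sum(weights ** 2)   # the penalty grows with the weight sizes
    return mse + l2_penalty

# Two models with the same prediction error: the one with smaller weights gets a lower loss
preds, targets = np.array([0.9, 0.1]), np.array([1.0, 0.0])
print(loss_with_l2(preds, targets, np.array([0.2, -0.1])))  # small weights
print(loss_with_l2(preds, targets, np.array([5.0, -4.0])))  # large weights, larger loss
```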

Vanishing Gradient

Yue: That makes sense. What about the vanishing gradient problem?

Yedhant: The vanishing gradient problem is when the gradients in the network become too small to be useful for updating the weights. It's like when you're trying to fill up a water tank with a small straw, but the straw keeps getting clogged and the water can't flow through. Choosing activation functions like ReLU helps keep gradients from shrinking away, and a related technique, gradient clipping, sets a maximum size for the gradients so they don't explode and destabilize training.
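
A small sketch of norm-based gradient clipping; the max_norm value and the toy gradient are illustrative.

```python
import numpy as np

def clip_gradient(grad, max_norm=1.0):
    """Rescale the gradient if its norm exceeds max_norm; otherwise leave it unchanged."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

g = np.array([3.0, 4.0])   # norm 5.0, larger than max_norm
print(clip_gradient(g))    # rescaled to norm 1.0 -> [0.6 0.8]
```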

Yue: Got it. Are there any other techniques we can use?

Yedhant: Yes, another technique is called batch normalization. It's like when you're making a big batch of cookies, but you don't want some of them to be burnt and some of them to be undercooked. So you make sure the temperature in the oven is consistent and each cookie is baked for the same amount of time. Batch normalization ensures that the input to each layer of the network has a similar range, which can help prevent the vanishing gradient problem.

Yue: I see. So there are lots of different techniques we can use to train neural networks effectively.

Yedhant: That's right! And the more techniques we have in our toolbox, the better we can customize our approach to each specific problem we're trying to solve. Another challenge of ANNs is the need for large amounts of labeled data, which can be time-consuming and expensive to collect.

Yue: That makes sense. Are there any other challenges we should be aware of?

Accuracy vs Interpretability

Yedhant: Yes, another challenge is the tradeoff between accuracy and interpretability. ANNs can be seen as a "black box" with little insight into the decision-making process. For example, if we use an ANN to diagnose medical images, we might not be able to understand why the model made a certain diagnosis. To address this issue, there are techniques we can use to gain insight into the decision-making process, such as feature visualization or attribution methods.

Yue: Can you elaborate on feature visualization and attribution methods?

Yedhant: Sure, I can explain feature visualization and attribution methods.

Let's say you have a computer program that can tell you what kind of animal is in a picture. This program looks at the different parts of the picture to figure out what animal it is. But sometimes it's hard for us to understand how the program is figuring it out.

That's where feature visualization and attribution methods come in!

Feature visualization helps us understand what parts of the picture the program is looking at. We can use this to create cool pictures that the program likes to look at.

A fun example of feature visualization is the "DeepDream" algorithm, which uses feature visualization to generate trippy and surreal images. The algorithm works by starting with a random image and then iteratively modifying it to maximally activate specific neurons in the model. This can result in images that resemble psychedelic patterns or hallucinations.

Attribution methods help us understand which parts of the picture are the most important for the program to figure out what animal it is. This helps us understand why the program is making the decision it's making.

An example of attribution methods is the "Grad-CAM" algorithm, which stands for Gradient-weighted Class Activation Mapping. This algorithm highlights the regions of an image that are most important for a specific classification task. For example, we can use Grad-CAM to highlight the regions of an image that a model is using to recognize a certain breed of dog. This can help us understand which parts of the image are most important for that classification task.

So basically, feature visualization and attribution methods help us understand how a computer program makes decisions about pictures.

Computation and Memory Requirements

Yue: I see. And what about computation and memory requirements?

Yedhant: Yes, ANNs can be computationally and memory-intensive, especially for large-scale applications. For example, training a deep neural network for image recognition might require powerful hardware such as GPUs or TPUs, which can be expensive. It's important to consider these requirements when developing machine learning solutions. 

Deep Dream Example

image credit

ANN Real-Life Example

Yue: Can we use this ANN to plant tomatoes in my garden?

Yedhant: Why not! Let's say we want to train an artificial neural network (ANN) to help us plant tomatoes in our garden. We can use the ANN to make recommendations about when and where to plant the tomatoes based on factors like soil quality, temperature, and sunlight.

To start, we need to collect some data on our garden and the conditions that are favorable for growing tomatoes. We can measure things like the pH level of the soil, the amount of sunlight the garden receives, and the average temperature during the growing season.

Once we have this data, we can use it to train the ANN to make predictions about when and where to plant the tomatoes. We can structure the ANN as a multi-layered neural network, with an input layer that takes in data about the garden conditions, one or more hidden layers that perform computations and transformations on the input data, and an output layer that gives us the recommendation for where and when to plant the tomatoes.

To train the ANN, we need a set of labeled training examples that tell us the correct recommendations for different sets of garden conditions. We can use this data to adjust the weights and biases of the neural network using the backpropagation algorithm.

For example, let's say we have a training example that tells us the optimal time to plant tomatoes is when the soil pH level is between 6.0 and 6.5, the average temperature is between 70 and 80 degrees Fahrenheit, and the garden receives at least 6 hours of direct sunlight per day. We would feed this data into the input layer of the neural network, and the network would make a prediction about where and when to plant the tomatoes.

We would then compare the predicted output with the correct output (in this case, the recommendation for where and when to plant the tomatoes), and use the backpropagation algorithm to adjust the weights and biases of the neural network to reduce the error between the predicted output and the correct output.

Over time, as we feed more training examples into the neural network, it will learn to make more accurate recommendations about when and where to plant tomatoes in our garden based on the conditions that we provide.
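
As a hedged sketch of what this could look like in code, here is a tiny multi-layer network built with scikit-learn's MLPClassifier (which is trained with back-propagation internally). The garden measurements, labels, and network size are invented for illustration; a real model would need far more, and real, data.

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier

# Hypothetical garden measurements: [soil_pH, avg_temp_F, sunlight_hours_per_day]
X = np.array([
    [6.2, 75, 7],   # conditions under which tomatoes did well
    [6.4, 72, 8],
    [5.1, 60, 4],   # conditions under which they did poorly
    [7.5, 90, 3],
])
y = np.array([1, 1, 0, 0])   # 1 = good time/place to plant, 0 = not

# Scale the inputs, then train a small one-hidden-layer network
model = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(8,), activation='relu',
                  max_iter=2000, random_state=0),
)
model.fit(X, y)

new_conditions = np.array([[6.3, 78, 6]])
print(model.predict(new_conditions))         # recommendation: plant (1) or not (0)
print(model.predict_proba(new_conditions))   # how confident the network is
```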

Yue: That's interesting. How is an ANN better than linear regression for this problem?

Yedhant: Both artificial neural networks (ANNs) and linear regression can be used to make predictions about when and where to plant tomatoes in a garden based on factors like soil quality, temperature, and sunlight. However, there are some important differences between these two approaches.

Linear regression is a simple statistical technique that tries to find a linear relationship between the input variables (soil quality, temperature, sunlight) and the output variable (recommendation for when and where to plant tomatoes). The goal is to find a straight line that best fits the data, and use that line to make predictions for new input values.

On the other hand, ANNs are much more flexible and can capture much more complex relationships between the input and output variables. ANNs use multiple layers of interconnected neurons to perform nonlinear transformations on the input data, which allows them to learn much more intricate patterns and relationships.

In the case of our tomato planting example, ANNs may be better than linear regression because the conditions that are favorable for growing tomatoes are likely to be more complex and interdependent than a simple linear relationship. For example, soil pH levels may have a nonlinear effect on tomato growth, where pH levels between 6.0 and 6.5 are optimal, but levels outside of that range can be detrimental to growth. ANNs can capture these more complex relationships and make more accurate recommendations for when and where to plant tomatoes based on a wider range of input variables.

Additionally, ANNs have the ability to learn and adapt over time, whereas linear regression models are static and cannot improve their performance once they have been trained on a fixed set of data. ANNs can continue to learn and refine their predictions as more data becomes available or as the environmental conditions change over time.

Yue: Thanks, that's really helpful. You don't know how much you've helped me in learning all these concepts.

Summary

Yedhant: It was my pleasure, I really enjoyed spending time with you. Time really flew by, we didn't even realize it.

Yue: Yes, it's late. I should also be heading back.

Yedhant: Yes, ANN is a wide topic, and there is a lot to cover, from different ways to overcome overfitting and vanishing gradient to other activation functions. Also, there are different types of ANN. Let's keep that for our next meeting.

Yue: That sounds like a plan. Let's meet on next Monday.

Yedhant: Amazing! Looking forward to our next meeting! But before we depart, why don't you summarize ANNs broadly?

Yue: Ok. To summarize, we can say that artificial neural networks (ANNs) are computer programs that work like the human brain. They are made up of tiny parts called neurons, just like our brain is made up of tiny cells called neurons. ANNs can be trained to recognize patterns and make decisions, just like a human can.

ANNs are like a big maze of roads. When information comes in, it travels through the maze, stops at different neurons, and then gets sent to the next neuron. Each neuron decides whether the information should be sent on or not based on how important it is. The more important the information, the more likely it is to be sent on.

Yedhant: That's an interesting way to summarize. I may use it to explain it to others.

Yue: Sure... Now I must leave, I have to travel all the way back, and it's getting late. See you soon!

Yedhant: Ok, bye...Hasta la vista.