Artificial Intelligence (AI) relies heavily on data to function effectively. The type and quality of data fed into AI systems determine their intelligence and capabilities. Based on the nature of the data, AI can be categorized into three main domains:
Data Science involves analyzing data to derive insights and make informed decisions. In the context of Artificial Intelligence, data analysis is crucial for training machines to perform tasks autonomously. Here are some prominent applications of Data Science in various fields:
1. Fraud and Risk Detection in Finance:
2. Genetics and Genomics:
3. Internet Search:
4. Targeted Advertising:
5. Website Recommendations:
6. Airline Route Planning:
Data Science involves the use of Python along with mathematical concepts such as statistics and probability. These concepts are fundamental for analyzing data effectively in Python and can also be applied in the development of artificial intelligence (AI) applications.
Before delving deeper into data analysis, it's important to revisit how Data Sciences can be utilized to address pressing issues. Let's explore the AI project cycle framework in the context of Data Sciences through an example.
Remember the AI Project Cycle?
The Scenario
Humans are inherently social beings, which is why we often find ourselves organizing or taking part in various social gatherings. One of the activities we enjoy the most is dining out with friends and family. This love for eating out has led to the proliferation of restaurants everywhere, many of which offer buffets to provide customers with a wide range of food options.
Restaurants typically prepare food in large quantities, anticipating a good number of customers. However, at the end of the day, a significant amount of food often remains unsold. Restaurants are reluctant to serve stale food the next day, so this leftover food becomes unusable.
Every day, restaurants cook in bulk based on their expectations of customer turnout. When these expectations are not met, it results in a considerable amount of food waste, leading to financial losses for the establishment. They are faced with the dilemma of either throwing away the excess food or giving it away for free to those in need. Over the course of a year, these daily losses add up to a substantial amount.
In this section, we will explore the problem in detail by filling out the 4Ws problem canvas.
Who Canvas - Who is experiencing the problem?
What Canvas - What is the nature of their problem?
Where Canvas - Where does the problem occur?
Why? - Why is this problem worth solving?
Now that we have identified all the factors related to our problem, let's move on to filling out the problem statement template.
Now that we have finalized the goal of our project, let's focus on the various data features that impact the problem in one way or another. Any AI-based project requires data for testing and training, so it's essential to understand what kind of data needs to be collected to work towards our goal. In our case, several factors influence the quantity of food to be prepared for the next day's consumption in buffets:
Now, let's explore how these factors are related to our problem statement. To do this, we can use the System Maps tool to understand the relationship of these elements with the project's goal. The System Map illustrates the connections between each factor and our objective.
In the System Map, positive arrows indicate a direct relationship between elements, while negative arrows show an inverse relationship.
After examining the factors affecting our problem statement, it's time to look at the data that needs to be acquired for our goal. For this project, we need a dataset that includes all the mentioned elements for each dish prepared by the restaurant over a 30-day period. This data is collected offline through regular surveys, as it is a personalized dataset created for the specific needs of one restaurant.
The collected data falls under the following categories:
Introduction
Required Data
To achieve this, we need specific information:
Data Extraction and Cleaning
After preparing the dataset, we proceed to train our model on it. In this instance, we opt for a regression model: the dataset is passed in as a dataframe and the model is trained on it. Regression falls under the category of Supervised Learning and works on continuous data collected over a period of time. Since our data consists of continuous values recorded over 30 days, a regression model is suitable for predicting future values in the same manner.
The 30-day dataset is split into a 2:1 ratio for training and testing. Initially, the model is trained on the first 20 days of data and subsequently evaluated on the remaining 10 days.
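As a rough illustration of these two steps, here is a minimal sketch in Python using Pandas and scikit-learn. The file name, feature columns, and target column are hypothetical placeholders, since the actual survey dataset is specific to one restaurant.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Load the 30-day dataset collected through the restaurant surveys
# (file name and column names are placeholders for this sketch)
data = pd.read_csv("restaurant_data.csv")

features = ["quantity_prepared", "quantity_unsold", "price", "number_of_customers"]
target = "quantity_to_cook_next_day"

# 2:1 split: first 20 days for training, remaining 10 days for testing
train, test = data.iloc[:20], data.iloc[20:]

# Train a regression model on the training portion
model = LinearRegression()
model.fit(train[features], train[target])
```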
After training the model on the initial 20-day dataset, it's crucial to evaluate its performance and accuracy in predicting food quantities. Here's how the evaluation process works:
Step 1: Inputting Data
Step 2: Historical Data Input
Step 3: Model Processing
Step 4: Making Predictions
Step 5: Comparison with Testing Dataset
Step 6: Testing the Model
Step 7: Evaluating Prediction Accuracy
Step 8: Assessing Model Accuracy
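Continuing the earlier sketch, the comparison and accuracy steps above might translate into something like the following. Mean absolute error is used here only as one possible measure of prediction accuracy; the exact metric and acceptance threshold are project choices, not prescribed by the scenario.

```python
from sklearn.metrics import mean_absolute_error

# Predict food quantities for the 10 testing days and compare with what
# the restaurant actually needed on those days
predictions = model.predict(test[features])
actual = test[target]

mae = mean_absolute_error(actual, predictions)
print(f"On average, predictions are off by {mae:.2f} units per dish")

# If the error is acceptably low, the model is considered accurate enough;
# otherwise it is retrained or the chosen features are revisited.
```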
Once the model achieves optimal accuracy, it is ready for deployment in the restaurant for real-time food quantity predictions.
Data collection is an age-old practice that has long been part of society. Even in the past, when people had limited understanding of calculations, records were kept in some form or another to track important information. Data collection itself does not require any technological expertise; the challenge for humans lies in analyzing the data, as it involves dealing with numbers and alphanumeric information. This is where Data Science becomes valuable.
Data Science not only provides a clearer understanding of the dataset but also enhances it by offering deeper and more precise analyses. With the incorporation of AI, machines can make predictions and suggestions based on the data.
After discussing an example of a Data Science project, we have a better understanding of the type of data that can be used for such projects. Data used in data domain-based projects is primarily in numerical or alphanumeric format and is organized in the form of tables. These datasets are commonly found in institutions for record-keeping and other purposes.
Banks
ATMs
Movie Theatres
Now, let’s explore some sources of data and discuss whether these datasets should be accessible to all.
Classroom
School
City
The data mentioned above is typically organized in tables containing numeric or alphanumeric information.
Data can be collected from various sources, and the process of data collection can be divided into two categories: Offline and Online.
When accessing data from any source, it is important to keep the following points in mind:
In Data Science, data is typically collected in tabular form, and these datasets can be stored in various formats. Here are some commonly used formats:
There are many other database formats available as well, which you can explore further online!
To utilize the collected data for programming, it's essential to know how to access it within Python code. Fortunately, there are several Python packages designed to facilitate access to structured data (in tabular format) seamlessly. Let's explore some of these packages:
To better understand arrays in NumPy, it's helpful to compare them with lists:
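The snippet below is a small illustration of that comparison, assuming NumPy is installed; the marks values are made up for the example.

```python
import numpy as np

marks_list = [78, 92, 54, 85]          # a plain Python list
marks_array = np.array(marks_list)     # a NumPy array holding the same values

# Lists need a loop (or comprehension) for element-wise arithmetic
bonus_list = [m + 5 for m in marks_list]

# NumPy arrays apply the operation to every element at once
bonus_array = marks_array + 5

# Arrays are homogeneous (one data type) and come with fast numeric methods
print(marks_array.dtype)   # a single integer type for all elements
print(marks_array.mean())  # 77.25
```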
Pandas is a library in Python designed for data manipulation and analysis, especially for numerical tables and time series. The name "Pandas" comes from "panel data," which refers to data sets with observations over time for the same subjects.
It is suitable for various types of data, including:
The main data structures in Pandas are the Series (for 1-dimensional data) and DataFrame (for 2-dimensional data), which cover most use cases in finance, statistics, social science, and engineering. Pandas is built on NumPy and works well with other scientific computing libraries.
Some key features of Pandas include:
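Here is a brief, hypothetical example of the two structures, using made-up restaurant figures, to show how a Series and a DataFrame are created, queried, and saved.

```python
import pandas as pd

# A Series: 1-dimensional labelled data
unsold = pd.Series([4, 0, 7], index=["Idli", "Dosa", "Rice"])

# A DataFrame: 2-dimensional table, like the restaurant's daily record
df = pd.DataFrame({
    "dish": ["Idli", "Dosa", "Rice"],
    "quantity_prepared": [50, 40, 30],
    "quantity_unsold": [4, 0, 7],
})

print(df.head())                             # quick look at the first rows
print(df["quantity_unsold"].mean())          # average unsold quantity
df.to_csv("daily_record.csv", index=False)   # save the table in CSV format
```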
Matplotlib is a powerful library in Python used for creating 2D plots from arrays. It is built on top of NumPy arrays and is compatible with multiple platforms. One of the main advantages of data visualization is that it enables us to see large amounts of data in a more understandable way through visual representations.
Matplotlib offers a wide range of plots that help in understanding trends, patterns, and making correlations in quantitative information. Some examples of graphs that can be created using this package include:
In addition to plotting, Matplotlib allows users to customize and modify their plots according to their preferences. Users can style their plots and make them more descriptive and communicative. This package, along with others, helps in accessing and exploring datasets to gain a better understanding of them.
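As a simple illustration (with invented numbers), the following sketch plots a line chart of unsold food over five days and customizes its title, axis labels, and legend.

```python
import matplotlib.pyplot as plt

days = [1, 2, 3, 4, 5]
unsold = [12, 9, 15, 7, 10]

plt.plot(days, unsold, marker="o", color="teal", label="Unsold dishes")
plt.title("Unsold food over five days")
plt.xlabel("Day")
plt.ylabel("Quantity unsold")
plt.legend()
plt.show()
```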
Data Science and Statistics
Importance of Python Packages
Key Statistical Concepts
Exploring Python for Statistics
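The exact concepts covered under these headings are not spelled out here, but a typical starting point is the measures of central tendency and spread. The sketch below uses Python's built-in statistics module on made-up values.

```python
import statistics as st

unsold = [12, 9, 15, 7, 10, 9, 14]   # hypothetical daily figures

print("Mean:", st.mean(unsold))          # average value
print("Median:", st.median(unsold))      # middle value when sorted
print("Mode:", st.mode(unsold))          # most frequent value
print("Std dev:", st.stdev(unsold))      # spread around the mean
print("Variance:", st.variance(unsold))  # square of the standard deviation
```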
Data quality issues:
Importance of data visualisation:
Scatter plots are used to plot discontinuous data, that is, data that does not flow continuously and contains gaps. A 2D scatter plot can display up to four parameters at once.
In such a scatter plot, the X and Y axes represent two different parameters, while the colour and size of each point represent two more. Thus, a single point on the graph can convey four different parameters at once.
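A minimal Matplotlib sketch of such a four-parameter scatter plot, with invented values, might look like this: the axes carry two parameters, while colour and marker size carry the other two.

```python
import matplotlib.pyplot as plt

# Hypothetical data: two parameters on the axes, two more as colour and size
x = [1, 2, 3, 4, 5]
y = [10, 14, 8, 16, 12]
third_param = [3, 7, 1, 9, 5]            # encoded as colour
fourth_param = [100, 300, 50, 400, 200]  # encoded as marker size

plt.scatter(x, y, c=third_param, s=fourth_param, cmap="viridis")
plt.colorbar(label="Third parameter")
plt.xlabel("First parameter")
plt.ylabel("Second parameter")
plt.title("Scatter plot showing four parameters at once")
plt.show()
```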
It is one of the most commonly used graphical methods. From students to scientists, everyone uses bar charts in some way or the other. It is easy to draw yet informative. Various versions of the bar chart exist, such as the single bar chart and the double bar chart.
This is an example of a double bar chart. The two axes depict two different parameters, while bars of different colours represent different entities (in this case, women and men). Bar charts also work on discontinuous data and are drawn at uniform intervals.
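A possible way to draw such a double bar chart in Matplotlib, with made-up counts for the two groups, is sketched below.

```python
import matplotlib.pyplot as plt
import numpy as np

categories = ["2019", "2020", "2021"]
women = [30, 35, 42]
men = [28, 33, 40]

x = np.arange(len(categories))   # positions of the category groups
width = 0.35                     # width of each individual bar

plt.bar(x - width / 2, women, width, label="Women")
plt.bar(x + width / 2, men, width, label="Men")
plt.xticks(x, categories)
plt.ylabel("Count")
plt.title("Double bar chart")
plt.legend()
plt.show()
```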
Histograms are an accurate representation of continuous data. When the variation in just one entity over a period of time needs to be plotted, histograms come into the picture. A histogram represents the frequency of the variable at different points with the help of bins.
In the given example, the histogram shows the variation in the frequency of the entity on the XY plane. On the left, the frequency of the element is plotted as a frequency map, with colours showing the transition from low to high values. On the right, a continuous dataset is plotted, which may not represent the frequency of occurrence of the element.
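As an illustration (using randomly generated values rather than the dataset from the figure), a histogram can be drawn in Matplotlib like this:

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical continuous data, e.g. 200 daily readings of some quantity
values = np.random.normal(loc=25, scale=3, size=200)

plt.hist(values, bins=15, color="steelblue", edgecolor="black")
plt.xlabel("Value")
plt.ylabel("Frequency")
plt.title("Histogram with 15 bins")
plt.show()
```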
When the data is split according to its percentile across the range, box plots come in handy. Box plots, also known as box-and-whiskers plots, conveniently display the distribution of data throughout the range with the help of four quartiles.
Quartile 3: The range from the 50th percentile to the 75th percentile is also plotted within the box, since these values lie close to the centre of the data. Quartiles 2 and 3, which together span the 25th to the 75th percentiles, form the Interquartile Range (IQR). The length of the box varies with the spread of the data, just as the whiskers do.
Quartile 4: This quartile represents the range from the 75th percentile to the 100th percentile and is drawn as the upper whisker, covering the top 25 percent of the data.
Outliers: Box plots have the advantage of clearly indicating outliers in a data distribution. Points that fall outside the whisker range are plotted as individual dots or circles and are considered outliers because they do not conform to the rest of the data. Outliers are not errors; hence, they are included in the graph for visualisation purposes.
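The sketch below, using randomly generated data with two extreme values added, shows how a box plot makes the IQR, whiskers, and outliers visible.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data with a couple of extreme values to show outliers
data = list(np.random.normal(loc=50, scale=10, size=100)) + [110, 5]

plt.boxplot(data, vert=False)
plt.xlabel("Value")
plt.title("Box plot: box = IQR (25th to 75th percentile), whiskers and outliers")
plt.show()
```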
Now, let's proceed to explore data visualisation using Jupyter Notebook. The Matplotlib library will assist us in plotting various types of graphs, while Numpy and Pandas will aid in data analysis.
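For example, a typical first cell in such a notebook might load a CSV file with Pandas, summarise it, and plot one column with Matplotlib; the file and column names below are placeholders for whatever dataset the notebook actually uses.

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical dataset; the actual file used in the notebook may differ
df = pd.read_csv("restaurant_data.csv")

print(df.describe())   # quick statistical summary of every numeric column

# Plot the distribution of one (placeholder) column as a histogram
df["quantity_unsold"].plot(kind="hist", bins=10, edgecolor="black")
plt.xlabel("Quantity unsold")
plt.title("Distribution of unsold food across 30 days")
plt.show()
```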
In this section, we will explore a classification model used in Data Science. But before diving into the technical aspects of the code, let's start with a fun game.
Step 1: Look at the map carefully. The arrows on the map indicate different qualities. Here are the qualities described by the axes:
K-Nearest Neighbours (KNN) is a straightforward and easy-to-use supervised machine learning algorithm suitable for both classification and regression tasks. The core idea behind KNN is that similar items are usually found close to each other, much like the saying, "Birds of a feather flock together." Here are some key features of KNN:
In a previous activity about personality prediction using KNN, we attempted to guess the animal for four students based on the animals closest to their data points. This is a simplified explanation of how KNN works. The 'K' in KNN represents the number of neighbours considered during the prediction and can be any integer value starting from 1.
Let's explore another example to clarify how KNN functions.
Imagine we want to predict the sweetness of a fruit based on data from similar fruits. We have three maps to make this prediction.
In the first graph, with K set to 1, the algorithm considers only the closest neighbour to the point X. Since the nearest point is blue (not sweet), the 1-nearest neighbour algorithm predicts that the fruit is not sweet.
In the second graph, where K is 2, the algorithm looks at the two nearest points to X. One is sweet (green), and the other is not sweet (blue), making it difficult for the algorithm to make a clear prediction.
In the third graph, with K set to 3, the algorithm considers the three nearest points to X. Two of these points are sweet (green), and one is not sweet (blue). Based on this information, the model predicts that the fruit is sweet.
KNN works by predicting an unknown value based on known values. The algorithm calculates the distance between the unknown point and all known points, selecting K points with the smallest distances to make a prediction.
The choice of K is crucial:
In classification problems where majority voting is used (such as picking the most common label), K is often set to an odd number to avoid ties.
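As a small illustration of these ideas (not the exact activity data), the sketch below uses scikit-learn's KNeighborsClassifier on made-up fruit measurements and compares the prediction for K = 1 and K = 3.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical fruit data: [size, colour_score]; labels: 1 = sweet, 0 = not sweet
X = [[7, 8], [6, 7], [8, 9], [3, 2], [4, 3], [2, 1]]
y = [1, 1, 1, 0, 0, 0]

unknown_fruit = [[5, 6]]   # the fruit whose sweetness we want to predict

for k in (1, 3):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X, y)
    label = knn.predict(unknown_fruit)[0]
    print(f"K={k}: predicted {'sweet' if label == 1 else 'not sweet'}")
```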
1. What is Data Science and why is it important?
2. What are the key steps in the AI Project Cycle related to Data Sciences?
3. How do you acquire data for a Data Science project?
4. What techniques are used for exploring data in Data Science?
5. How is model evaluation conducted in Data Science?