Page 1
In this chapter
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame
Operations
» Handling Missing
Values
» Import and Export
of Data between
Pandas and MySQL
3.1 Introduct Ion As discussed in the previous chapter, Pandas
is a well established Python Library used for
manipulation, processing and analysis of
data. We have already discussed the basic
operations on Series and DataFrame like
creating them and then accessing data from
them. Pandas provides more powerful and
useful functions for data analysis.
In this chapter, we will be working with
more advanced features of DataFrame like
sorting data, answering analytical questions
using the data, cleaning data and applying
different useful functions on the data. Below
is the example data on which we will be
applying the advanced features of Pandas.
“We owe a lot to the Indians, who
taught us how to count, without
which no worthwhile scientific
discovery could have been made.”
— Albert Einstein
Chapter
3
Data Handling using
Pandas - II
Chapter 3.indd 63 11/26/2020 12:46:03 PM
2024-25
Page 2
In this chapter
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame
Operations
» Handling Missing
Values
» Import and Export
of Data between
Pandas and MySQL
3.1 Introduct Ion As discussed in the previous chapter, Pandas
is a well established Python Library used for
manipulation, processing and analysis of
data. We have already discussed the basic
operations on Series and DataFrame like
creating them and then accessing data from
them. Pandas provides more powerful and
useful functions for data analysis.
In this chapter, we will be working with
more advanced features of DataFrame like
sorting data, answering analytical questions
using the data, cleaning data and applying
different useful functions on the data. Below
is the example data on which we will be
applying the advanced features of Pandas.
“We owe a lot to the Indians, who
taught us how to count, without
which no worthwhile scientific
discovery could have been made.”
— Albert Einstein
Chapter
3
Data Handling using
Pandas - II
Chapter 3.indd 63 11/26/2020 12:46:03 PM
2024-25
Informat Ics Pract Ices 64
Case Study
Let us consider the data of marks scored in unit tests
held in school. For each unit test, the marks scored by
all students of the class is recorded. Maximum marks
are 25 in each subject. The subjects are Maths, Science.
Social Studies (S.St.), Hindi, and English. For simplicity,
we assume there are 4 students in the class and the
table below shows their marks in Unit Test 1, Unit Test
2 and Unit Test 3. Table 3.1 shows this data.
Table 3.1 Case Study
Result
Name/
Subjects
Unit
Test
Maths Science S.St. Hindi Eng
Raman 1 22 21 18 20 21
Raman 2 21 20 17 22 24
Raman 3 14 19 15 24 23
Zuhaire 1 20 17 22 24 19
Zuhaire 2 23 15 21 25 15
Zuhaire 3 22 18 19 23 13
Aashravy 1 23 19 20 15 22
Aashravy 2 24 22 24 17 21
Aashravy 3 12 25 19 21 23
Mishti 1 15 22 25 22 22
Mishti 2 18 21 25 24 23
Mishti 3 17 18 20 25
20
Let us store the data in a DataFrame, as shown in
Program 3.1:
Program 3-1 Store the Result data in a DataFrame called marksUT.
>>> import pandas as pd
>>> marksUT= {'Name':['Raman','Raman','Raman','Zuhaire','Zuhaire','Zu
haire', 'Ashravy','Ashravy','Ashravy','Mishti','Mishti','Mishti'],
'UT':[1,2,3,1,2,3,1,2,3,1,2,3],
'Maths':[22,21,14,20,23,22,23,24,12,15,18,17],
'Science':[21,20,19,17,15,18,19,22,25,22,21,18],
'S.St':[18,17,15,22,21,19,20,24,19,25,25,20],
'Hindi':[20,22,24,24,25,23,15,17,21,22,24,25],
'Eng':[21,24,23,19,15,13,22,21,23,22,23,20]
}
>>> df=pd.DataFrame(marksUT)
>>> print(df)
Chapter 3.indd 64 11/26/2020 12:46:03 PM
2024-25
Page 3
In this chapter
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame
Operations
» Handling Missing
Values
» Import and Export
of Data between
Pandas and MySQL
3.1 Introduct Ion As discussed in the previous chapter, Pandas
is a well established Python Library used for
manipulation, processing and analysis of
data. We have already discussed the basic
operations on Series and DataFrame like
creating them and then accessing data from
them. Pandas provides more powerful and
useful functions for data analysis.
In this chapter, we will be working with
more advanced features of DataFrame like
sorting data, answering analytical questions
using the data, cleaning data and applying
different useful functions on the data. Below
is the example data on which we will be
applying the advanced features of Pandas.
“We owe a lot to the Indians, who
taught us how to count, without
which no worthwhile scientific
discovery could have been made.”
— Albert Einstein
Chapter
3
Data Handling using
Pandas - II
Chapter 3.indd 63 11/26/2020 12:46:03 PM
2024-25
Informat Ics Pract Ices 64
Case Study
Let us consider the data of marks scored in unit tests
held in school. For each unit test, the marks scored by
all students of the class is recorded. Maximum marks
are 25 in each subject. The subjects are Maths, Science.
Social Studies (S.St.), Hindi, and English. For simplicity,
we assume there are 4 students in the class and the
table below shows their marks in Unit Test 1, Unit Test
2 and Unit Test 3. Table 3.1 shows this data.
Table 3.1 Case Study
Result
Name/
Subjects
Unit
Test
Maths Science S.St. Hindi Eng
Raman 1 22 21 18 20 21
Raman 2 21 20 17 22 24
Raman 3 14 19 15 24 23
Zuhaire 1 20 17 22 24 19
Zuhaire 2 23 15 21 25 15
Zuhaire 3 22 18 19 23 13
Aashravy 1 23 19 20 15 22
Aashravy 2 24 22 24 17 21
Aashravy 3 12 25 19 21 23
Mishti 1 15 22 25 22 22
Mishti 2 18 21 25 24 23
Mishti 3 17 18 20 25
20
Let us store the data in a DataFrame, as shown in
Program 3.1:
Program 3-1 Store the Result data in a DataFrame called marksUT.
>>> import pandas as pd
>>> marksUT= {'Name':['Raman','Raman','Raman','Zuhaire','Zuhaire','Zu
haire', 'Ashravy','Ashravy','Ashravy','Mishti','Mishti','Mishti'],
'UT':[1,2,3,1,2,3,1,2,3,1,2,3],
'Maths':[22,21,14,20,23,22,23,24,12,15,18,17],
'Science':[21,20,19,17,15,18,19,22,25,22,21,18],
'S.St':[18,17,15,22,21,19,20,24,19,25,25,20],
'Hindi':[20,22,24,24,25,23,15,17,21,22,24,25],
'Eng':[21,24,23,19,15,13,22,21,23,22,23,20]
}
>>> df=pd.DataFrame(marksUT)
>>> print(df)
Chapter 3.indd 64 11/26/2020 12:46:03 PM
2024-25
Data Han Dling using Pan Das - ii 65
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Raman 2 21 20 17 22 24
2 Raman 3 14 19 15 24 23
3 Zuhaire 1 20 17 22 24 19
4 Zuhaire 2 23 15 21 25 15
5 Zuhaire 3 22 18 19 23 13
6 Ashravy 1 23 19 20 15 22
7 Ashravy 2 24 22 24 17 21
8 Ashravy 3 12 25 19 21 23
9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20
3.2 d escr Ipt Ive s tat Ist Ics Descriptive Statistics are used to summarise the given
data. In other words, they refer to the methods which
are used to get some basic idea about the data.
In this section, we will be discussing descriptive
statistical methods that can be applied to a DataFrame.
These are max, min, count, sum, mean, median, mode,
quartiles, variance. In each case, we will consider the
above created DataFrame df.
3.2.1 Calculating Maximum Values
DataFrame.max() is used to calculate the maximum
values from the DataFrame, regardless of its data types.
The following statement outputs the maximum value of
each column of the DataFrame:
>>> print(df.max())
Name Zuhaire #Maximum value in name column
#(alphabetically)
UT 3 #Maximum value in column UT
Maths 24 #Maximum value in column Maths
Science 25 #Maximum value in column Science
S.St 25 #Maximum value in column S.St
Hindi 25 #Maximum value in column Hindi
Eng 24 #Maximum value in column Eng
dtype: object
If we want to output maximum value for the columns
having only numeric values, then we can set the
parameter numeric_only=True in the max() method, as
shown below:
Chapter 3.indd 65 11/26/2020 12:46:04 PM
2024-25
Page 4
In this chapter
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame
Operations
» Handling Missing
Values
» Import and Export
of Data between
Pandas and MySQL
3.1 Introduct Ion As discussed in the previous chapter, Pandas
is a well established Python Library used for
manipulation, processing and analysis of
data. We have already discussed the basic
operations on Series and DataFrame like
creating them and then accessing data from
them. Pandas provides more powerful and
useful functions for data analysis.
In this chapter, we will be working with
more advanced features of DataFrame like
sorting data, answering analytical questions
using the data, cleaning data and applying
different useful functions on the data. Below
is the example data on which we will be
applying the advanced features of Pandas.
“We owe a lot to the Indians, who
taught us how to count, without
which no worthwhile scientific
discovery could have been made.”
— Albert Einstein
Chapter
3
Data Handling using
Pandas - II
Chapter 3.indd 63 11/26/2020 12:46:03 PM
2024-25
Informat Ics Pract Ices 64
Case Study
Let us consider the data of marks scored in unit tests
held in school. For each unit test, the marks scored by
all students of the class is recorded. Maximum marks
are 25 in each subject. The subjects are Maths, Science.
Social Studies (S.St.), Hindi, and English. For simplicity,
we assume there are 4 students in the class and the
table below shows their marks in Unit Test 1, Unit Test
2 and Unit Test 3. Table 3.1 shows this data.
Table 3.1 Case Study
Result
Name/
Subjects
Unit
Test
Maths Science S.St. Hindi Eng
Raman 1 22 21 18 20 21
Raman 2 21 20 17 22 24
Raman 3 14 19 15 24 23
Zuhaire 1 20 17 22 24 19
Zuhaire 2 23 15 21 25 15
Zuhaire 3 22 18 19 23 13
Aashravy 1 23 19 20 15 22
Aashravy 2 24 22 24 17 21
Aashravy 3 12 25 19 21 23
Mishti 1 15 22 25 22 22
Mishti 2 18 21 25 24 23
Mishti 3 17 18 20 25
20
Let us store the data in a DataFrame, as shown in
Program 3.1:
Program 3-1 Store the Result data in a DataFrame called marksUT.
>>> import pandas as pd
>>> marksUT= {'Name':['Raman','Raman','Raman','Zuhaire','Zuhaire','Zu
haire', 'Ashravy','Ashravy','Ashravy','Mishti','Mishti','Mishti'],
'UT':[1,2,3,1,2,3,1,2,3,1,2,3],
'Maths':[22,21,14,20,23,22,23,24,12,15,18,17],
'Science':[21,20,19,17,15,18,19,22,25,22,21,18],
'S.St':[18,17,15,22,21,19,20,24,19,25,25,20],
'Hindi':[20,22,24,24,25,23,15,17,21,22,24,25],
'Eng':[21,24,23,19,15,13,22,21,23,22,23,20]
}
>>> df=pd.DataFrame(marksUT)
>>> print(df)
Chapter 3.indd 64 11/26/2020 12:46:03 PM
2024-25
Data Han Dling using Pan Das - ii 65
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Raman 2 21 20 17 22 24
2 Raman 3 14 19 15 24 23
3 Zuhaire 1 20 17 22 24 19
4 Zuhaire 2 23 15 21 25 15
5 Zuhaire 3 22 18 19 23 13
6 Ashravy 1 23 19 20 15 22
7 Ashravy 2 24 22 24 17 21
8 Ashravy 3 12 25 19 21 23
9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20
3.2 d escr Ipt Ive s tat Ist Ics Descriptive Statistics are used to summarise the given
data. In other words, they refer to the methods which
are used to get some basic idea about the data.
In this section, we will be discussing descriptive
statistical methods that can be applied to a DataFrame.
These are max, min, count, sum, mean, median, mode,
quartiles, variance. In each case, we will consider the
above created DataFrame df.
3.2.1 Calculating Maximum Values
DataFrame.max() is used to calculate the maximum
values from the DataFrame, regardless of its data types.
The following statement outputs the maximum value of
each column of the DataFrame:
>>> print(df.max())
Name Zuhaire #Maximum value in name column
#(alphabetically)
UT 3 #Maximum value in column UT
Maths 24 #Maximum value in column Maths
Science 25 #Maximum value in column Science
S.St 25 #Maximum value in column S.St
Hindi 25 #Maximum value in column Hindi
Eng 24 #Maximum value in column Eng
dtype: object
If we want to output maximum value for the columns
having only numeric values, then we can set the
parameter numeric_only=True in the max() method, as
shown below:
Chapter 3.indd 65 11/26/2020 12:46:04 PM
2024-25
Informat Ics Pract Ices 66
>>> print(df.max(numeric_only=True))
UT 3
Maths 24
Science 25
S.St 25
Hindi 25
Eng 24
dtype: int64
Program 3-2 Write the statements to output the
maximum marks obtained in each subject
in Unit Test 2.
>>> dfUT2 = df[df.UT == 2]
>>> print('\nResult of Unit Test 2:
\n\n',dfUT2)
Result of Unit Test 2:
Name UT Maths Science S.St Hindi Eng
1 Raman 2 21 20 17 22 24
4 Zuhaire 2 23 15 21 25 15
7 Ashravy 2 24 22 24 17 21
10 Mishti 2 18 21 25 24 23
>>> print('\nMaximum Mark obtained in
Each Subject in Unit Test 2: \n\n',dfUT2.
max(numeric_only=True))
Maximum Mark obtained in Each Subject in Unit
Test 2:
UT 2
Maths 24
Science 22
S.St 25
Hindi 25
Eng 24
dtype: int64
By default, the max() method finds the maximum
value of each column (which means, axis=0). However,
to find the maximum value of each row, we have to
specify axis = 1 as its argument.
#maximum marks for each student in each unit
test among all the subjects
The output of Program
3.2 can also be
achieved using the
following statements
>>> dfUT2=df[df
['UT']==2].max
(numeric_only=True)
>>> print(dfUT2)
Chapter 3.indd 66 11/26/2020 12:46:04 PM
2024-25
Page 5
In this chapter
» Introduction
» Descriptive Statistics
» Data Aggregations
» Sorting a DataFrame
» GROUP BY Functions
» Altering the Index
» Other DataFrame
Operations
» Handling Missing
Values
» Import and Export
of Data between
Pandas and MySQL
3.1 Introduct Ion As discussed in the previous chapter, Pandas
is a well established Python Library used for
manipulation, processing and analysis of
data. We have already discussed the basic
operations on Series and DataFrame like
creating them and then accessing data from
them. Pandas provides more powerful and
useful functions for data analysis.
In this chapter, we will be working with
more advanced features of DataFrame like
sorting data, answering analytical questions
using the data, cleaning data and applying
different useful functions on the data. Below
is the example data on which we will be
applying the advanced features of Pandas.
“We owe a lot to the Indians, who
taught us how to count, without
which no worthwhile scientific
discovery could have been made.”
— Albert Einstein
Chapter
3
Data Handling using
Pandas - II
Chapter 3.indd 63 11/26/2020 12:46:03 PM
2024-25
Informat Ics Pract Ices 64
Case Study
Let us consider the data of marks scored in unit tests
held in school. For each unit test, the marks scored by
all students of the class is recorded. Maximum marks
are 25 in each subject. The subjects are Maths, Science.
Social Studies (S.St.), Hindi, and English. For simplicity,
we assume there are 4 students in the class and the
table below shows their marks in Unit Test 1, Unit Test
2 and Unit Test 3. Table 3.1 shows this data.
Table 3.1 Case Study
Result
Name/
Subjects
Unit
Test
Maths Science S.St. Hindi Eng
Raman 1 22 21 18 20 21
Raman 2 21 20 17 22 24
Raman 3 14 19 15 24 23
Zuhaire 1 20 17 22 24 19
Zuhaire 2 23 15 21 25 15
Zuhaire 3 22 18 19 23 13
Aashravy 1 23 19 20 15 22
Aashravy 2 24 22 24 17 21
Aashravy 3 12 25 19 21 23
Mishti 1 15 22 25 22 22
Mishti 2 18 21 25 24 23
Mishti 3 17 18 20 25
20
Let us store the data in a DataFrame, as shown in
Program 3.1:
Program 3-1 Store the Result data in a DataFrame called marksUT.
>>> import pandas as pd
>>> marksUT= {'Name':['Raman','Raman','Raman','Zuhaire','Zuhaire','Zu
haire', 'Ashravy','Ashravy','Ashravy','Mishti','Mishti','Mishti'],
'UT':[1,2,3,1,2,3,1,2,3,1,2,3],
'Maths':[22,21,14,20,23,22,23,24,12,15,18,17],
'Science':[21,20,19,17,15,18,19,22,25,22,21,18],
'S.St':[18,17,15,22,21,19,20,24,19,25,25,20],
'Hindi':[20,22,24,24,25,23,15,17,21,22,24,25],
'Eng':[21,24,23,19,15,13,22,21,23,22,23,20]
}
>>> df=pd.DataFrame(marksUT)
>>> print(df)
Chapter 3.indd 64 11/26/2020 12:46:03 PM
2024-25
Data Han Dling using Pan Das - ii 65
Name UT Maths Science S.St Hindi Eng
0 Raman 1 22 21 18 20 21
1 Raman 2 21 20 17 22 24
2 Raman 3 14 19 15 24 23
3 Zuhaire 1 20 17 22 24 19
4 Zuhaire 2 23 15 21 25 15
5 Zuhaire 3 22 18 19 23 13
6 Ashravy 1 23 19 20 15 22
7 Ashravy 2 24 22 24 17 21
8 Ashravy 3 12 25 19 21 23
9 Mishti 1 15 22 25 22 22
10 Mishti 2 18 21 25 24 23
11 Mishti 3 17 18 20 25 20
3.2 d escr Ipt Ive s tat Ist Ics Descriptive Statistics are used to summarise the given
data. In other words, they refer to the methods which
are used to get some basic idea about the data.
In this section, we will be discussing descriptive
statistical methods that can be applied to a DataFrame.
These are max, min, count, sum, mean, median, mode,
quartiles, variance. In each case, we will consider the
above created DataFrame df.
3.2.1 Calculating Maximum Values
DataFrame.max() is used to calculate the maximum
values from the DataFrame, regardless of its data types.
The following statement outputs the maximum value of
each column of the DataFrame:
>>> print(df.max())
Name Zuhaire #Maximum value in name column
#(alphabetically)
UT 3 #Maximum value in column UT
Maths 24 #Maximum value in column Maths
Science 25 #Maximum value in column Science
S.St 25 #Maximum value in column S.St
Hindi 25 #Maximum value in column Hindi
Eng 24 #Maximum value in column Eng
dtype: object
If we want to output maximum value for the columns
having only numeric values, then we can set the
parameter numeric_only=True in the max() method, as
shown below:
Chapter 3.indd 65 11/26/2020 12:46:04 PM
2024-25
Informat Ics Pract Ices 66
>>> print(df.max(numeric_only=True))
UT 3
Maths 24
Science 25
S.St 25
Hindi 25
Eng 24
dtype: int64
Program 3-2 Write the statements to output the
maximum marks obtained in each subject
in Unit Test 2.
>>> dfUT2 = df[df.UT == 2]
>>> print('\nResult of Unit Test 2:
\n\n',dfUT2)
Result of Unit Test 2:
Name UT Maths Science S.St Hindi Eng
1 Raman 2 21 20 17 22 24
4 Zuhaire 2 23 15 21 25 15
7 Ashravy 2 24 22 24 17 21
10 Mishti 2 18 21 25 24 23
>>> print('\nMaximum Mark obtained in
Each Subject in Unit Test 2: \n\n',dfUT2.
max(numeric_only=True))
Maximum Mark obtained in Each Subject in Unit
Test 2:
UT 2
Maths 24
Science 22
S.St 25
Hindi 25
Eng 24
dtype: int64
By default, the max() method finds the maximum
value of each column (which means, axis=0). However,
to find the maximum value of each row, we have to
specify axis = 1 as its argument.
#maximum marks for each student in each unit
test among all the subjects
The output of Program
3.2 can also be
achieved using the
following statements
>>> dfUT2=df[df
['UT']==2].max
(numeric_only=True)
>>> print(dfUT2)
Chapter 3.indd 66 11/26/2020 12:46:04 PM
2024-25
Data Han Dling using Pan Das - ii 67
>>> df.max(axis=1)
0 22
1 24
2 24
3 24
4 25
5 23
6 23
7 24
8 25
9 25
10 25
11 25
dtype: int64
Note: In most of the python function calls, axis = 0 refers
to row wise operations and axis = 1 refers to column wise
operations. But in the call of max(), axis = 1 gives row wise
output and axis = 0 (default case) gives column-wise output.
Similar is the case with all statistical operations discussed
in this chapter.
3.2.2 Calculating Minimum Values
DataFrame.min() is used to display the minimum values
from the DataFrame, regardless of the data types. That
is, it shows the minimum value of each column or row.
The following line of code output the minimum value of
each column of the DataFrame:
>>> print(df.min())
Name Ashravy
UT 1
Maths 12
Science 15
S.St 15
Hindi 15
Eng 13
dtype: object
Program 3-3 Write the statements to display the
minimum marks obtained by a particular
student ‘Mishti’ in all the unit tests for
each subject.
>>> dfMishti = df.loc[df.Name == 'Mishti']
n otes Chapter 3.indd 67 11/26/2020 12:46:04 PM
2024-25
Read More