Humanities/Arts Exam  >  Humanities/Arts Notes  >  Informatics Practices for Class 12  >  Chapter Notes: Data Handling using Pandas - II

Data Handling using Pandas - II Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts PDF Download

Chapter Notes - Data Handling using Pandas - II

Introduction

  • Pandas is a well-established Python library used for manipulation, processing, and analysis of data.
  • Basic operations on Series and DataFrame, such as creating and accessing data, were discussed in the previous chapter.
  • Pandas provides powerful and useful functions for advanced data analysis.
  • This chapter focuses on advanced DataFrame features, including sorting data, answering analytical questions, cleaning data, and applying various functions.
  • The chapter uses example data of marks scored by students in unit tests to demonstrate these features.
  • Topics covered include descriptive statistics, data aggregations, sorting, GROUP BY functions, altering the index, other DataFrame operations, handling missing values, and importing/exporting data between Pandas and MySQL.

Descriptive Statistics

  • Descriptive statistics are used to quantitatively summarize data, providing a basic understanding of the dataset.
  • Statistical methods applicable to a DataFrame include max, min, count, sum, mean, median, mode, quartiles, and variance.
  • These methods are applied to the example DataFrame containing student marks.

Calculating Maximum Values

  • The DataFrame.max() function calculates the maximum values from the DataFrame, regardless of data types.
  • It returns the maximum value for each column by default.
  • To focus on numeric columns only, set the parameter numeric_only=True in the max() method.
  • For row-wise maximum values, use max(axis=1), which returns the maximum value for each row across specified columns.
  • Note that in max(), axis=0 (default) operates column-wise, while axis=1 operates row-wise, unlike most Python functions where axis=0 is row-wise.

Calculating Minimum Values

  • The DataFrame.min() function displays the minimum values from the DataFrame, regardless of data types.
  • It returns the minimum value for each column by default.
  • To calculate minimum values for specific rows or students, filter the DataFrame using conditions (e.g., df.loc[df.Name == 'Mishti']) and apply min() to selected columns.
  • Like max(), min() uses axis=0 for column-wise operations and axis=1 for row-wise operations.

Calculating Sum of Values

  • The DataFrame.sum() function calculates the sum of values in the DataFrame, regardless of data type.
  • It returns the sum for each column by default, but summing text values (e.g., names) may not be meaningful.
  • To sum values for a specific column, specify the column name (e.g., df['Maths'].sum()).
  • To calculate the total marks for a specific student, filter the DataFrame by student name and apply sum() to relevant columns.
  • For row-wise sums (e.g., total marks per unit test for a student), use sum(axis=1).

Calculating Number of Value

  • The DataFrame.count() function returns the total number of non-null values for each column or row.
  • By default, it counts values column-wise (axis=0).
  • To count values row-wise, use count(axis=1).
  • This function is useful for understanding the completeness of data in each column or row.

Data Aggregations

  • Data aggregation involves applying functions like sum, mean, or count to grouped data to summarize it.
  • Pandas supports aggregation through functions like groupby() combined with aggregate functions.
  • Aggregation is useful for deriving insights, such as average marks per student or total marks per subject.

Sorting a DataFrame

  • Sorting arranges data in a specified order, either ascending or descending.
  • The DataFrame.sort_values() function is used to sort a DataFrame by one or more columns.
  • Sorting can be applied to numeric or categorical columns, with options to specify ascending or descending order.

GROUP BY Functions

  • The groupby() function groups data based on one or more columns, allowing aggregation operations on the grouped data.
  • It is used to perform operations like sum, mean, or count on groups (e.g., average marks per student or per unit test).
  • Grouping is a powerful tool for summarizing data and answering analytical questions.

Altering the Index

  • Altering the index refers to changing the labels of rows or columns in a DataFrame.
  • The reset_index() function resets the index to default integer values, moving the current index to a column.
  • The set_index() function sets a specified column as the new index.
  • These functions are useful for reorganizing data or preparing it for specific analyses.

Other DataFrame Operations

  • Pandas provides additional operations like filtering, merging, joining, and concatenating DataFrames.
  • Filtering allows selecting rows based on conditions (e.g., df[df['Name'] == 'Raman']).
  • Merging and joining combine multiple DataFrames based on common columns or indices.
  • Concatenation stacks DataFrames vertically or horizontally.

Handling Missing Values

  • Missing values (represented as NaN in Pandas) are a common issue in data analysis and must be handled properly.
  • Two primary strategies for handling missing values are:
    • Dropping the rows or columns containing missing values.
    • Filling or estimating missing values with appropriate values.
  • Missing values can arise in datasets, such as when a student misses a test (e.g., Raman’s missing marks in Unit Test 4).

Checking Missing Values

  • The isnull() function checks for missing values in a DataFrame, returning True for missing values and False otherwise.
  • It can be applied to the entire DataFrame or specific columns (e.g., df['Science'].isnull()).
  • The any() function, used with isnull(), checks if any missing values exist in a column or the entire DataFrame.
  • The isnull().sum() function counts the number of missing values per column.
  • The isnull().sum().sum() function calculates the total number of missing values in the DataFrame.

Dropping Missing Values

  • The dropna() function removes rows or columns containing missing values.
  • By default, it drops rows with any missing values (how='any').
  • Setting inplace=True modifies the original DataFrame; otherwise, a new DataFrame is returned.
  • Dropping is suitable when missing values are few, as it reduces the dataset size.
  • Example: Dropping Raman’s Unit Test 4 row removes the row with missing values, affecting percentage calculations.

Estimating Missing Values

  • Estimating missing values involves replacing them with approximations, such as the previous value, next value, mean, or a constant (e.g., 0).
  • The fillna(num) function replaces missing values with a specified value (e.g., fillna(0) replaces NaN with 0).
  • The fillna(method='pad') function replaces missing values with the previous value in the column.
  • The fillna(method='bfill') function replaces missing values with the next value in the column.
  • Estimating missing values alters the dataset and affects analysis results, providing an approximation rather than exact values.

Import and Export of Data between Pandas and MySQL

  • In real-world scenarios, data is often stored in files (e.g., CSV) or databases, requiring import to Pandas DataFrames or export from DataFrames to databases.
  • Pandas supports importing data from MySQL databases and exporting DataFrames to MySQL tables.
  • A connection to MySQL is established using the pymysql driver and sqlalchemy library.
  • Install pymysql using pip install pymysql and sqlalchemy using pip install sqlalchemy.
  • The create_engine() function establishes a connection to MySQL using a connection string with parameters: driver, username, password, host, port, and database name.
  • Syntax: engine = create_engine('mysql+pymysql://username:password@host:port/database').
  • Example connection string: mysql+pymysql://root:smsm@localhost:3306/CARSHOWROOM.

Importing Data from MySQL to Pandas

Importing data involves reading a MySQL table or query result into a Pandas DataFrame.
Three functions are used for importing:

  • pandas.read_sql_query(query, sql_conn): Reads an SQL query into a DataFrame using the connection identifier.
  • pandas.read_sql_table(table_name, sql_conn): Reads an entire SQL table into a DataFrame.
  • pandas.read_sql(sql, sql_conn): Reads either an SQL query or table into a DataFrame.

Example: df = pd.read_sql_query('SELECT * FROM INVENTORY', engine) loads the INVENTORY table into a DataFrame.

Exporting Data from Pandas to MySQL

Exporting data involves writing a Pandas DataFrame to a MySQL table.
The DataFrame.to_sql() function is used: df.to_sql(table, sql_conn, if_exists='fail', index=False).
Parameters:

  • table: Name of the MySQL table to write to.
  • sql_conn: Connection identifier from create_engine().
  • if_exists: Specifies behavior if the table exists:
    • 'fail': Raises a ValueError if the table exists (default).
    • 'replace': Replaces the table’s contents with the DataFrame.
    • 'append': Appends the DataFrame to the existing table (column names must match).
  • index: If True, includes the DataFrame index as a column; if False, ignores the index.

Example: df.to_sql('showroom_info', engine, if_exists='replace', index=False) creates or replaces the showroom_info table with the DataFrame’s contents.

The document Data Handling using Pandas - II Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts is a part of the Humanities/Arts Course Informatics Practices for Class 12.
All you need of Humanities/Arts at this link: Humanities/Arts
14 docs

FAQs on Data Handling using Pandas - II Chapter Notes - Informatics Practices for Class 12 - Humanities/Arts

1. What is the process to calculate the number of values in a Pandas DataFrame?
Ans. To calculate the number of values in a Pandas DataFrame, you can use the `.count()` method, which returns the count of non-null entries for each column. If you want to get the total number of entries in the DataFrame, you can use the `.shape` attribute, which provides a tuple of the number of rows and columns.
2. How can I calculate the mean of a column in a Pandas DataFrame?
Ans. You can calculate the mean of a specific column in a Pandas DataFrame using the `.mean()` method. For example, if your DataFrame is named `df` and you want to calculate the mean of a column named `column_name`, you would use `df['column_name'].mean()`.
3. What are quartiles and how do I calculate them using Pandas?
Ans. Quartiles are values that divide a dataset into four equal parts. You can calculate quartiles in Pandas using the `.quantile()` method. For example, to find the first quartile (25th percentile), you would use `df['column_name'].quantile(0.25)`.
4. What is the method to calculate the standard deviation of a DataFrame in Pandas?
Ans. You can calculate the standard deviation of a DataFrame using the `.std()` method. For example, to calculate the standard deviation of a column named `column_name`, you would use `df['column_name'].std()`. This method computes the standard deviation of the values, excluding any missing values.
5. How can I check for missing values in a Pandas DataFrame?
Ans. To check for missing values in a Pandas DataFrame, you can use the `.isnull()` method combined with `.sum()`. For example, `df.isnull().sum()` will return the count of missing values for each column in the DataFrame. This allows you to identify which columns have missing data.
Related Searches

pdf

,

MCQs

,

Extra Questions

,

Data Handling using Pandas - II Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts

,

video lectures

,

Previous Year Questions with Solutions

,

Data Handling using Pandas - II Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts

,

past year papers

,

Important questions

,

study material

,

Data Handling using Pandas - II Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts

,

Sample Paper

,

mock tests for examination

,

Semester Notes

,

Viva Questions

,

Objective type Questions

,

ppt

,

Exam

,

Summary

,

practice quizzes

,

Free

,

shortcuts and tricks

;