Data Handling using Pandas - I Chapter Notes | Informatics Practices for Class 12 - Humanities/Arts

Chapter Notes - Data Handling Using Pandas - I

Introduction to Python Libraries

  • Python libraries are collections of built-in modules that enable various actions without writing detailed programs.
  • Each library contains numerous modules that can be imported and used for specific functionalities.
  • NumPy, Pandas, and Matplotlib are well-established Python libraries designed for scientific and analytical purposes.
  • These libraries facilitate easy and efficient manipulation, transformation, and visualization of data.
  • NumPy (Numerical Python) is a package for numerical data analysis and scientific computing, utilizing multidimensional array objects and providing tools for array operations.
  • Pandas (PANel DAta) is a high-level data manipulation tool built on NumPy and Matplotlib, offering a convenient platform for data analysis and visualization.
  • Pandas was designed around three data structures: Series, DataFrame, and Panel. Panel has been removed from recent versions of Pandas, so Series and DataFrame are the two in use today.
  • Matplotlib is a Python library for plotting graphs and visualizations, capable of generating publication-quality plots, histograms, bar charts, and scatterplots with minimal code.
  • Matplotlib is also built on NumPy and integrates well with NumPy and Pandas.
  • Differences between Pandas and NumPy:
    • NumPy arrays require homogeneous data, whereas Pandas DataFrames support heterogeneous data types (float, int, string, datetime, etc.).
    • Pandas provides a simpler interface for operations like file loading, plotting, selection, joining, and GROUP BY, which are useful in data-processing applications.
    • Pandas DataFrames use column names, making data tracking easier.
    • Pandas is ideal for tabular data, while NumPy is suited for numeric array-based data manipulation.
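
The first difference can be seen in a short sketch. The sample names and numbers below are hypothetical, chosen only for illustration:

```python
import numpy as np
import pandas as pd

# A NumPy array holds one data type: mixing ints and a float
# converts every element to float.
arr = np.array([1, 2.5, 3])
print(arr.dtype)    # float64

# A DataFrame keeps a separate data type for each named column.
df = pd.DataFrame({
    "name": ["Asha", "Ravi"],    # hypothetical sample data
    "age": [17, 18],
    "score": [91.5, 88.0],
})
print(df.dtypes)
```

The array ends up with a single dtype, while each DataFrame column reports its own.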

Installing Pandas

  • Pandas installation is similar to NumPy installation and requires Python to be pre-installed on the system.
  • The command to install Pandas via the command line is:

    pip install pandas

  • Both NumPy and Pandas, along with other Python libraries, depend on an existing Python installation.

Data Structure in Pandas

  • A data structure is a collection of data values and operations that enable efficient storage, retrieval, and modification of data.
  • An example is the NumPy ndarray, which facilitates easy storage, access, and updates of data.
  • Pandas provides two commonly used data structures:
    • Series: A one-dimensional array with labeled indices.
    • DataFrame: A two-dimensional labeled data structure similar to a spreadsheet.

Series

  • A Series is a one-dimensional array containing a sequence of values of any data type (int, float, list, string, etc.) with default numeric indices starting from zero.
  • Each value in a Series is associated with a data label called an index, which can also be user-defined (e.g., strings or other data types).
  • A Pandas Series can be visualized as a single column in a spreadsheet.
  • Example of a Series with student names:

    Index  Value
    0      Arnab
    1      Sanridhi
    2      Ramit
    3      Divyam
    4      Kritika

Creation of Series


To create or use a Series, the Pandas library must be imported.
A Series can be created in multiple ways:
From Scalar Values:
  • A Series can be created using scalar values, with default indices ranging from 0 to N-1 if not specified.
  • Example:

    import pandas as pd
    series1 = pd.Series([10, 20, 30])
    print(series1)
    # Output:
    # 0 10
    # 1 20
    # 2 30
    # dtype: int64

  • User-defined indices can be assigned, such as numbers or strings:

    series2 = pd.Series(["Kavi", "Shyam", "Ravi"], index=[3, 5, 1])
    print(series2)
    # Output:
    # 3 Kavi
    # 5 Shyam
    # 1 Ravi
    # dtype: object

  • String indices example:

    series2 = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])
    print(series2)
    # Output:
    # Feb 2
    # Mar 3
    # Apr 4
    # dtype: int64

From NumPy Arrays:
  • A Series can be created from a one-dimensional NumPy array.
  • Example:

    import numpy as np
    import pandas as pd
    array1 = np.array([1, 2, 3, 4])
    series3 = pd.Series(array1)
    print(series3)
    # Output:
    # 0 1
    # 1 2
    # 2 3
    # 3 4
    # dtype: int32

  • Custom indices can be used with NumPy arrays:

    series4 = pd.Series(array1, index=["Jan", "Feb", "Mar", "Apr"])
    print(series4)
    # Output:
    # Jan 1
    # Feb 2
    # Mar 3
    # Apr 4
    # dtype: int32

  • The length of the index and array must match, or a ValueError is raised:

    series5 = pd.Series(array1, index=["Jan", "Feb", "Mar"])
    # Output: ValueError: Length of passed values is 4, index implies 3

From Dictionary:
  • A Series can be created from a Python dictionary, where dictionary keys become the indices.
  • Example:

    dict1 = {'India': 'NewDelhi', 'UK': 'London', 'Japan': 'Tokyo'}
    series8 = pd.Series(dict1)
    print(series8)
    # Output:
    # India NewDelhi
    # UK London
    # Japan Tokyo
    # dtype: object

Accessing Elements of a Series


Elements of a Series can be accessed using two common methods: Indexing and Slicing.
Indexing: Indexing in Series is similar to NumPy arrays and can be positional or labeled.
  • Positional index uses integer values starting from 0 to access elements:

    seriesNum = pd.Series([10, 20, 30])
    seriesNum[2]
    # Output: 30

  • Labeled index uses user-defined labels to access elements:

    seriesMonths = pd.Series([2, 3, 4], index=["Feb", "Mar", "Apr"])
    seriesMonths["Mar"]
    # Output: 3

  • Example with country capitals:

    seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                               index=['India', 'USA', 'UK', 'France'])
    seriesCapCntry['India']
    # Output: 'NewDelhi'
    seriesCapCntry.iloc[1]  # by position; plain seriesCapCntry[1] on a labeled index is deprecated in recent pandas
    # Output: 'WashingtonDC'

  • Multiple elements can be accessed using a list of positional or labeled indices:

    seriesCapCntry.iloc[[3, 2]]  # list of positions
    # Output:
    # France Paris
    # UK London
    # dtype: object
    seriesCapCntry[['UK', 'USA']]
    # Output:
    # UK London
    # USA WashingtonDC
    # dtype: object

  • Index values can be altered by assigning new indices:

    seriesCapCntry.index = [10, 20, 30, 40]
    seriesCapCntry
    # Output:
    # 10 NewDelhi
    # 20 WashingtonDC
    # 30 London
    # 40 Paris
    # dtype: object

Slicing: Slicing extracts a portion of a Series using start and end parameters [start:end].
  • With positional indices, the value at the end index is excluded:

    seriesCapCntry = pd.Series(['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
                               index=['India', 'USA', 'UK', 'France'])
    seriesCapCntry[1:3]
    # Output:
    # USA WashingtonDC
    # UK London
    # dtype: object

  • With labeled indices, the value at the end index is included:

    seriesCapCntry['USA':'France']
    # Output:
    # USA WashingtonDC
    # UK London
    # France Paris
    # dtype: object

  • A Series can be displayed in reverse order using slicing:

    seriesCapCntry[::-1]
    # Output:
    # France Paris
    # UK London
    # USA WashingtonDC
    # India NewDelhi
    # dtype: object

  • Slicing can modify Series values, excluding the end index for positional indices:

    import numpy as np
    seriesAlph = pd.Series(np.arange(10, 16, 1),
                           index=['a', 'b', 'c', 'd', 'e', 'f'])
    seriesAlph[1:3] = 50
    seriesAlph
    # Output:
    # a 10
    # b 50
    # c 50
    # d 13
    # e 14
    # f 15
    # dtype: int32

  • With labeled indices, the end index is included when modifying values:

    seriesAlph['c':'e'] = 500
    seriesAlph
    # Output:
    # a 10
    # b 50
    # c 500
    # d 500
    # e 500
    # f 15
    # dtype: int32

Attributes of Series

Series attributes are properties accessed using the Series name.
Common attributes include:

  • name: Assigns a name to the Series.
  • index.name: Assigns a name to the index of the Series.
  • values: Returns the values in the Series as an array (a NumPy ndarray for numeric data), without the index.
  • size: Returns the number of values in the Series.
  • empty: Returns True if the Series is empty, False otherwise.

Example Series:

    seriesCapCntry
    # Output:
    # India NewDelhi
    # USA WashingtonDC
    # UK London
    # France Paris
    # dtype: object
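
A minimal sketch of these attributes applied to the example Series follows. The names 'Capitals' and 'Country' are assumptions chosen for illustration:

```python
import pandas as pd

seriesCapCntry = pd.Series(
    ['NewDelhi', 'WashingtonDC', 'London', 'Paris'],
    index=['India', 'USA', 'UK', 'France'],
)

seriesCapCntry.name = 'Capitals'        # hypothetical name for the Series
seriesCapCntry.index.name = 'Country'   # hypothetical name for the index

print(seriesCapCntry.values)   # the underlying values, without the index
print(seriesCapCntry.size)     # 4
print(seriesCapCntry.empty)    # False

emptySeries = pd.Series(dtype='float64')
print(emptySeries.empty)       # True
```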

Methods of Series


Series methods perform operations on Series data.
Common methods include:
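  • head(n): Returns the first n members of the Series; defaults to 5 if n is not specified.

    # Assumed definition, inferred from the outputs shown in this section
    seriesTenTwenty = pd.Series(np.arange(10, 101, 10))
    seriesTenTwenty.head(2)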
    # Output:
    # 0 10
    # 1 20
    # dtype: int32
    seriesTenTwenty.head()
    # Output:
    # 0 10
    # 1 20
    # 2 30
    # 3 40
    # 4 50
    # dtype: int32

  • count(): Returns the number of non-NaN values in the Series.

    seriesTenTwenty.count()
    # Output: 10

  • tail(n): Returns the last n members of the Series; defaults to 5 if n is not specified.

    seriesTenTwenty.tail(2)
    # Output:
    # 8 90
    # 9 100
    # dtype: int32
    seriesTenTwenty.tail()
    # Output:
    # 5 60
    # 6 70
    # 7 80
    # 8 90
    # 9 100
    # dtype: int32

Mathematical Operations on Series

  • Mathematical operations (addition, subtraction, multiplication, division, etc.) on Series are performed element-wise, similar to NumPy arrays.
  • Operations automatically align data based on indices, and unmatched indices result in NaN values.
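
The alignment behaviour can be sketched as follows. The Series names and values are made up for illustration; the add() method with fill_value is one way to avoid NaN for unmatched labels:

```python
import pandas as pd

seriesA = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
seriesB = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Values are paired by index label, not by position.
total = seriesA + seriesB
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64

# add() with fill_value treats a missing label as 0 instead of producing NaN.
filled = seriesA.add(seriesB, fill_value=0)
print(filled)
```

Only the labels 'b' and 'c' appear in both Series, so only those two positions hold a sum in the plain addition.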

DataFrame

  • A DataFrame is a two-dimensional labeled data structure, resembling a spreadsheet, with rows and columns.
  • Each column in a DataFrame is a Series, and all columns share the same row indices.
  • DataFrames can be created from dictionaries, lists, or other data structures.
  • Example DataFrame of student marks:

    ResultDF
    # Output:
    #          Arnab  Ramit  Sanridhi  Riya  Milika
    # Maths       90     92        89    81      94
    # Science     91     81        91    71      95
    # Hindi       97     96        88    67      99
    # English     95     86        95    80      95

Creating a DataFrame

DataFrames can be created from dictionaries where keys become column labels and values are lists or Series.
Example:

    data = {
        'Arnab': [90, 91, 97, 95],
        'Ramit': [92, 81, 96, 86],
        'Sanridhi': [89, 91, 88, 95],
        'Riya': [81, 71, 67, 80],
        'Milika': [94, 95, 99, 95]
    }
    ResultDF = pd.DataFrame(data, index=['Maths', 'Science', 'Hindi', 'English'])
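
A DataFrame can also be built from a list, as mentioned earlier. The sketch below uses a list of dictionaries with made-up keys and values; each dictionary becomes one row, and a key missing from a row is filled with NaN:

```python
import pandas as pd

# Each dictionary is one row; the keys become column labels.
rows = [
    {'a': 10, 'b': 20},
    {'a': 5, 'b': 10, 'c': 20},
]
dFrameListDict = pd.DataFrame(rows)
print(dFrameListDict)
#     a   b     c
# 0  10  20   NaN
# 1   5  10  20.0
```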

Modifying a DataFrame

Adding Rows or Columns:
  • Rows can be added using the loc[] indexer; assigning to a label that already exists (such as 'Maths' here) overwrites that row instead:

    ResultDF.loc['Maths'] = [90, 92, 89, 81, 94]

  • Columns can be added by assigning a list or Series:

    ResultDF['Preeti'] = [89, 78, 76, 99]

  • Adding a row or column with mismatched lengths results in a ValueError.
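
The sketch below adds a new row and a new column to a small DataFrame, then shows the ValueError raised for a column of the wrong length. The DataFrame name marks and its reduced contents are assumptions made to keep the example short:

```python
import pandas as pd

marks = pd.DataFrame(
    {'Arnab': [90, 91, 97], 'Ramit': [92, 81, 96]},
    index=['Maths', 'Science', 'Hindi'],
)

marks.loc['English'] = [95, 86]       # new row: one value per column
marks['Riya'] = [81, 71, 67, 80]      # new column: one value per row

raised = False
try:
    marks['Preeti'] = [89, 78]        # only 2 values for 4 rows
except ValueError as err:
    raised = True
    print('ValueError:', err)
```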
Updating DataFrame Values:
  • Specific rows can be updated:

    ResultDF.loc['Maths'] = 0
    # Output:
    #          Arnab  Ramit  Sanridhi  Riya  Milika  Preeti
    # Maths        0      0         0     0       0       0
    # Science     91     81        91    71      95      78
    # Hindi       97     96        88    67      99      76
    # English     95     86        95    80      95      99

  • All values can be set to a specific value:

    ResultDF[:] = 0

Deleting Rows or Columns:
  • The drop() method is used to delete rows (axis=0) or columns (axis=1):

    ResultDF = ResultDF.drop('Science', axis=0)
    # Output:
    #          Arnab  Ramit  Sanridhi  Riya  Milika
    # Maths       90     92        89    81      94
    # Hindi       97     96        88    67      99
    # English     95     86        95    80      95

  • Multiple columns can be dropped:

    ResultDF = ResultDF.drop(['Sanridhi', 'Ramit', 'Riya'], axis=1)
    # Output:
    #          Arnab  Milika
    # Maths       90      94
    # Hindi       97      99
    # English     95      95

  • If several rows share the same label, drop() removes all of them:

    ResultDF = ResultDF.drop('Hindi', axis=0)

Renaming Rows or Columns:
The rename() method is used to change row or column labels. It returns a new DataFrame, so the result is assigned back:

    ResultDF = ResultDF.rename(columns={'Arnab': 'Student1', 'Ramit': 'Student2', 'Sanridhi': 'Student3', 'Milika': 'Student4'})
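
As a small illustration, rename() accepts both a columns mapping and an index mapping, and leaves the original DataFrame untouched. The label 'Sub1' and the reduced DataFrame are assumptions made for this sketch:

```python
import pandas as pd

marksDF = pd.DataFrame(
    {'Arnab': [90, 91], 'Ramit': [92, 81]},
    index=['Maths', 'Science'],
)

# Labels not listed in a mapping are kept as they are.
renamed = marksDF.rename(columns={'Arnab': 'Student1'}, index={'Maths': 'Sub1'})

print(list(renamed.columns))   # ['Student1', 'Ramit']
print(list(renamed.index))     # ['Sub1', 'Science']
print(list(marksDF.columns))   # ['Arnab', 'Ramit'], original unchanged
```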

Accessing DataFrame Elements through Indexing

DataFrame elements can be accessed using label-based or boolean indexing.
Label-Based Indexing:

  • The loc[] method is used for label-based indexing.
  • A single row label returns a Series:

    ResultDF.loc['Science']
    # Output:
    # Arnab 91
    # Ramit 81
    # Sanridhi 91
    # Riya 71
    # Milika 95
    # Name: Science, dtype: int64

  • A single column label returns a Series:

    ResultDF.loc[:, 'Arnab']
    # Output:
    # Maths 90
    # Science 91
    # Hindi 97
    # English 95
    # Name: Arnab, dtype: int64

  • Multiple rows can be accessed:

    ResultDF.loc[['Science', 'Hindi']]
    # Output:
    #          Arnab  Ramit  Sanridhi  Riya  Milika
    # Science     91     81        91    71      95
    # Hindi       97     96        88    67      99

Boolean Indexing:
  • Boolean indexing filters data based on conditions, returning True or False:

    ResultDF.loc['Maths'] > 90
    # Output:
    # Arnab False
    # Ramit True
    # Sanridhi False
    # Riya False
    # Milika True
    # Name: Maths, dtype: bool

  • Example for a specific column:

    ResultDF.loc[:, 'Arnab'] > 90
    # Output:
    # Maths False
    # Science True
    # Hindi True
    # English True
    # Name: Arnab, dtype: bool
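
The True/False results above can be used as a mask to keep only the matching rows or columns. The following is a sketch using the marks data from this chapter; the variable names mask, highRows, and highCols are assumptions:

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {
        'Arnab': [90, 91, 97, 95],
        'Ramit': [92, 81, 96, 86],
        'Sanridhi': [89, 91, 88, 95],
        'Riya': [81, 71, 67, 80],
        'Milika': [94, 95, 99, 95],
    },
    index=['Maths', 'Science', 'Hindi', 'English'],
)

# Keep only the subjects (rows) in which Arnab scored more than 90.
mask = ResultDF['Arnab'] > 90
highRows = ResultDF[mask]
print(highRows)

# Keep only the students (columns) who scored more than 90 in Maths.
highCols = ResultDF.loc[:, ResultDF.loc['Maths'] > 90]
print(highCols)
```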

Accessing DataFrame Elements through Slicing

Slicing selects subsets of rows and/or columns using row and column labels. With labels, the end of the slice is included:

    ResultDF.loc['Maths':'Hindi']
    # Output:
    #          Arnab  Ramit  Sanridhi  Riya  Milika
    # Maths       90     92        89    81      94
    # Science     91     81        91    71      95
    # Hindi       97     96        88    67      99
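
Rows and columns can be sliced together. The sketch below compares a label-based slice using loc[] with the equivalent position-based slice using iloc[]; the variable names by_label and by_position are assumptions:

```python
import pandas as pd

ResultDF = pd.DataFrame(
    {
        'Arnab': [90, 91, 97, 95],
        'Ramit': [92, 81, 96, 86],
        'Sanridhi': [89, 91, 88, 95],
        'Riya': [81, 71, 67, 80],
        'Milika': [94, 95, 99, 95],
    },
    index=['Maths', 'Science', 'Hindi', 'English'],
)

# Label slices include both end points.
by_label = ResultDF.loc['Maths':'Science', 'Arnab':'Sanridhi']

# Position slices exclude the end position, as in Python lists.
by_position = ResultDF.iloc[0:2, 0:3]

print(by_label)
print(by_label.equals(by_position))   # True
```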

Joining, Merging and Concatenation of DataFrames

Joining:
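  • The append() method merges two DataFrames by appending rows of the second DataFrame to the first. Note that DataFrame.append() was deprecated in Pandas 1.4 and removed in Pandas 2.0; on current versions, pd.concat() is the replacement.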
  • Columns not present in the first DataFrame are added as new columns.
  • Example:

    dFrame1 = pd.DataFrame([[1, 2, 3], [4, 5], [6]], columns=['C1', 'C2', 'C3'], index=['R1', 'R2', 'R3'])
    dFrame2 = pd.DataFrame([[10, 20], [30], [40, 50]], columns=['C2', 'C5'], index=['R4', 'R2', 'R5'])
    dFrame1 = dFrame1.append(dFrame2)
    # Output:
    #      C1    C2   C3    C5
    # R1  1.0   2.0  3.0   NaN
    # R2  4.0   5.0  NaN   NaN
    # R3  6.0   NaN  NaN   NaN
    # R4  NaN  10.0  NaN  20.0
    # R2  NaN  30.0  NaN   NaN
    # R5  NaN  40.0  NaN  50.0

  • The sort parameter can be set to True for sorted column labels or False for unsorted.
  • The verify_integrity parameter, when True, raises an error for duplicate row labels.
  • The ignore_index parameter, when True, ignores row index labels.
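
On Pandas 2.0 and later, where append() is no longer available, the same row-wise joining can be sketched with pd.concat(), which accepts the same sort, verify_integrity, and ignore_index parameters. The variable names combined, renumbered, and duplicate_found are assumptions:

```python
import pandas as pd

dFrame1 = pd.DataFrame([[1, 2, 3], [4, 5], [6]],
                       columns=['C1', 'C2', 'C3'], index=['R1', 'R2', 'R3'])
dFrame2 = pd.DataFrame([[10, 20], [30], [40, 50]],
                       columns=['C2', 'C5'], index=['R4', 'R2', 'R5'])

# Rows of dFrame2 are placed below dFrame1; missing cells become NaN.
combined = pd.concat([dFrame1, dFrame2])
print(combined)

# ignore_index=True discards the row labels and numbers the rows 0..5.
renumbered = pd.concat([dFrame1, dFrame2], ignore_index=True)

# verify_integrity=True raises an error because the label 'R2' occurs twice.
duplicate_found = False
try:
    pd.concat([dFrame1, dFrame2], verify_integrity=True)
except ValueError as err:
    duplicate_found = True
    print('ValueError:', err)
```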

Attributes of DataFrames

  • DataFrame attributes provide information about the DataFrame’s properties.
  • Example DataFrame (ForestAreaDF):

    ForestArea = {
        'Assam': pd.Series([78438, 2797, 10192, 15116], index=['GeoArea', 'VeryDense', 'ModeratelyDense', 'OpenForest']),
        'Kerala': pd.Series([38852, 1663, 9407, 9251], index=['GeoArea', 'VeryDense', 'ModeratelyDense', 'OpenForest']),
        'Delhi': pd.Series([1483, 6.72, 56.24, 129.45], index=['GeoArea', 'VeryDense', 'ModeratelyDense', 'OpenForest'])
    }
    ForestAreaDF = pd.DataFrame(ForestArea)
    # Output:
    #                  Assam  Kerala    Delhi
    # GeoArea          78438   38852  1483.00
    # VeryDense         2797    1663     6.72
    # ModeratelyDense  10192    9407    56.24
    # OpenForest       15116    9251   129.45

  • Common attributes include:
    • DataFrame.index: Displays row labels.
    • DataFrame.columns: Displays column labels.
    • DataFrame.dtypes: Displays the data type of each column.
    • DataFrame.values: Returns a NumPy array of all values without axes labels.
    • DataFrame.shape: Returns a tuple representing the DataFrame’s dimensions.
    • DataFrame.size: Returns the total number of elements in the DataFrame.
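
A short sketch of these attributes on ForestAreaDF follows. The helper variable labels is an assumption, introduced only to avoid repeating the row labels:

```python
import pandas as pd

labels = ['GeoArea', 'VeryDense', 'ModeratelyDense', 'OpenForest']
ForestAreaDF = pd.DataFrame({
    'Assam': pd.Series([78438, 2797, 10192, 15116], index=labels),
    'Kerala': pd.Series([38852, 1663, 9407, 9251], index=labels),
    'Delhi': pd.Series([1483, 6.72, 56.24, 129.45], index=labels),
})

print(ForestAreaDF.index)     # row labels
print(ForestAreaDF.columns)   # column labels
print(ForestAreaDF.dtypes)    # integer for Assam and Kerala, float64 for Delhi
print(ForestAreaDF.values)    # 2-D array of values without labels
print(ForestAreaDF.shape)     # (4, 3): 4 rows, 3 columns
print(ForestAreaDF.size)      # 12 = 4 * 3
```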

Importing and Exporting Data between CSV Files and DataFrames

Pandas provides functions to import data from and export data to CSV files.

Importing a CSV file into a DataFrame

  • The read_csv() function loads data from a CSV file into a DataFrame:

    marks = pd.read_csv("C:/NCERT/ResultData.csv", sep=",", header=0)
    # Output:
    #    RollNo     Name  Eco  Maths
    # 0       1    Arnab   18     57
    # 1       2  Kritika   23     45
    # 2       3   Divyam   51     37
    # 3       4   Vivaan   40     60
    # 4       5  Aaroosh   18     27

  • Parameters of read_csv():
    • filename: Specifies the file path and name.
    • sep: Defines the separator (e.g., comma, semicolon, tab); defaults to a comma.
    • header: Specifies the row number for column names; header=0 uses the first row.
    • names: Allows specifying custom column names:

      marks1 = pd.read_csv("C:/NCERT/ResultData.csv", sep=",", names=['RNo', 'Student Name', 'Sub1', 'Sub2'])
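
The sketch below runs without a file on disk by reading the CSV text from an in-memory buffer (io.StringIO), which is an assumption made to keep the example self-contained. It also passes header=0 together with names, so that the file's own header row is replaced rather than read as a data row:

```python
import io
import pandas as pd

# Stand-in for the contents of a CSV file (first three records only).
csv_text = (
    "RollNo,Name,Eco,Maths\n"
    "1,Arnab,18,57\n"
    "2,Kritika,23,45\n"
    "3,Divyam,51,37\n"
)

# header=0: the first line supplies the column names.
marks = pd.read_csv(io.StringIO(csv_text), sep=",", header=0)
print(marks)

# names with header=0: custom names replace the names in the first line.
marks1 = pd.read_csv(
    io.StringIO(csv_text), sep=",", header=0,
    names=['RNo', 'Student Name', 'Sub1', 'Sub2'],
)
print(marks1)
```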

Exporting a DataFrame to a CSV file

  • The to_csv() function saves a DataFrame to a CSV file:

    ResultDF.to_csv('C:/NCERT/resultout.csv')

  • Parameters of to_csv():
    • header: When False, excludes column names from the output.
    • index: When False, excludes row labels from the output.
    • sep: Specifies the separator for the output file.
  • Example excluding headers and indices:

    ResultDF.to_csv('C:/NCERT/resultonly.txt', sep='@', header=False, index=False)
    # Output in resultonly.txt:
    # 90@92@89@81@94
    # 91@81@91@71@95
    # 97@96@88@67@99
    # 95@86@95@80@95
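
The effect of these parameters can be checked without writing a file: when no path is given, to_csv() returns the CSV text as a string. The reduced DataFrame smallDF below is an assumption made to keep the sketch short:

```python
import pandas as pd

smallDF = pd.DataFrame(
    {'Arnab': [90, 91], 'Ramit': [92, 81]},
    index=['Maths', 'Science'],
)

# Default: column names and row labels are both written.
with_labels = smallDF.to_csv()
print(with_labels)

# Values only, separated by '@'.
values_only = smallDF.to_csv(sep='@', header=False, index=False)
print(values_only)
```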

Pandas Series Vs NumPy ndarray

  • Pandas Series support non-unique index values, raising an exception only for operations that do not support duplicates.
  • Operations between Series automatically align data based on labels, allowing computations without explicit alignment.
  • Unmatched labels in operations result in NaN values, enhancing flexibility in data analysis.
  • Differences between Pandas Series and NumPy Arrays:
    • Series allow custom labeled indices (numbers or letters), while NumPy arrays use integer positions only.
    • Series can have indices in descending order, whereas NumPy array indices are fixed and start from zero.
    • Unmatched indices in Series operations produce NaN, while NumPy arrays have no labels to align on: operations are purely positional, and arrays whose shapes do not match raise an error.
    • Series require more memory compared to NumPy arrays, which are more memory-efficient.
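
A few of these differences can be sketched as follows. All names and values are made up for illustration:

```python
import numpy as np
import pandas as pd

# A Series may repeat an index label; selecting it returns every match.
dupSeries = pd.Series([10, 20, 30], index=['x', 'y', 'x'])
print(dupSeries['x'])

# Index labels may be in descending order; access is by label, not position.
descSeries = pd.Series([1, 2, 3], index=[3, 2, 1])
print(descSeries[3])    # 1, the value labeled 3

# NumPy arrays combine by position only.
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])
print(a + b)            # [11 22 33]

# Arrays of different lengths cannot be combined element-wise.
shape_error = False
try:
    a + np.array([1, 2])
except ValueError as err:
    shape_error = True
    print('ValueError:', err)
```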

FAQs on Data Handling using Pandas - I Chapter Notes - Informatics Practices for Class 12 - Humanities/Arts

1. What is a Pandas Series and how do you create one?
Ans. A Pandas Series is a one-dimensional array-like object that can hold data of any type, such as integers, floats, or strings. It has an associated array of data labels, known as the index. You can create a Series by passing a list or array-like object to the `pd.Series()` constructor. For example:

    import pandas as pd
    data = [1, 2, 3, 4]
    series = pd.Series(data)

2. How do you create a DataFrame in Pandas?
Ans. A DataFrame is a two-dimensional, size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns). You can create a DataFrame by passing a dictionary of lists or arrays to the `pd.DataFrame()` constructor. For example:

    import pandas as pd
    data = {
        'Name': ['Alice', 'Bob', 'Charlie'],
        'Age': [25, 30, 35]
    }
    df = pd.DataFrame(data)

3. What operations can be performed on rows and columns in DataFrames?
Ans. In Pandas, you can perform various operations on rows and columns in DataFrames such as selecting, adding, removing, or modifying them. You can select columns using `df['column_name']` or by using `df.loc[]` for more complex selections. To add a new column, you can assign a list or Series to a new column name, like `df['new_column'] = [value1, value2, ...]`. To drop a column, use `df.drop('column_name', axis=1)`.
4. How can you access DataFrame elements through slicing?
Ans. You can access DataFrame elements through slicing using the `.loc[]` and `.iloc[]` indexers. The `.loc[]` indexer is label-based and allows you to access rows and columns by their labels. The `.iloc[]` indexer is integer-location based and allows you to access rows and columns by their integer index. For example:

    # Using .loc[]
    row = df.loc[0]          # Accesses the first row
    column = df['Name']      # Accesses the 'Name' column
    # Using .iloc[]
    row = df.iloc[0]         # Accesses the first row
    column = df.iloc[:, 0]   # Accesses the first column

5. How do you export a DataFrame to a CSV file?
Ans. You can export a DataFrame to a CSV file using the `to_csv()` method. You need to specify the filename and the index parameter if you want to include or exclude the row index in the CSV file. For example:

    df.to_csv('output.csv', index=False)   # Exports DataFrame to 'output.csv' without row index