Problem Scoping
Whenever we begin a new project, we encounter a number of challenges. These may be minor or major; sometimes we overlook them, and other times they require immediate attention. Problem scoping is the process of understanding a problem, identifying the aspects that affect it, and clearly defining the project’s goal. Good problem scoping helps ensure that an AI project addresses the right need and is feasible within constraints such as time, data and resources.
How to Identify Problem Scoping in an AI Project
Follow these steps to identify the scope of a problem in an AI project:
- Understand why the project was started – identify the business or social need, the trigger, and the expected benefit.
- Define the project’s primary objectives – be precise about what success looks like (for example: reduce processing time by 30%, increase accuracy to 90%, or classify images with 95% precision).
- Outline the project’s work statement – a short description of the work to be done and the expected outputs or deliverables.
- Determine the most important goals – prioritise outcomes and measurable targets.
- Choose important milestones – break the project into phases with checkpoints (data collection, model prototype, validation, deployment).
- Determine the major constraints – list limitations such as available data, compute resources, time, privacy rules and budget.
- Make a list of scope exclusions – explicitly record what will not be done in the project to avoid scope creep.
Acquiring Data
Collecting correct and dependable data is the foundation of any AI project. Data may be in the form of text, images, video, audio or numerical records. It can be gathered from sources such as websites, research journals, newspapers, sensors and databases. The process of gathering and preparing this material is called data acquisition.
4 W’s Problem Canvas
The 4 W’s of problem scoping are Who, What, Where and Why. They help to identify and understand the problem in a clear, structured way.
- Who – Who is directly or indirectly affected by the problem? Who are the stakeholders (users, customers, managers, maintenance teams)?
- What – What exactly is the problem? What evidence shows that the problem exists? What are the desired outcomes?
- Where – In which context or environment does the problem occur? Where will the solution run or be used?
- Why – Why must we address this problem? What are the advantages to stakeholders after solving it?
Statement of the Problem Template
After completing the 4 W’s, summarise your findings in a concise problem statement. This template collects the essential points (stakeholders, specific issue, context, measurable goals and constraints) so that anyone reviewing the project can quickly understand the scope and intent.
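As a sketch, the 4 W's canvas and the resulting problem statement can be captured in a few lines of Python. The project details below (a library book-return problem) are purely hypothetical examples used to fill the template:

```python
# A minimal sketch of a problem-statement template filled from the 4 W's.
# The project details (library book returns) are hypothetical examples.
canvas = {
    "who":   "school librarians and students",
    "what":  "around 20% of borrowed books are returned late",
    "where": "the school library's circulation desk",
    "why":   "late returns reduce book availability for other students",
}

statement = (
    f"Our {canvas['who']} face the problem that {canvas['what']} "
    f"at {canvas['where']}. Solving it matters because {canvas['why']}."
)
print(statement)
```

Filling the same four slots for any project produces a one-sentence scope summary that reviewers can check against the measurable goals and constraints.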
Data Acquisition: Core Concepts
What is Data?
Data is a representation of facts or instructions about an entity that can be processed or conveyed by a human or a machine. Examples include numbers, text, images, audio clips and videos. Data can describe states, events or behaviours and is the raw material for AI systems.
Types of Data
There are two broad types of data:
- Structured data
- Unstructured data
Structured data is stored in a fixed format or schema (for example, tables with rows and columns). It has a consistent order and is readily accessible by programs and people - for example, spreadsheet records, numeric sensor logs and database entries.
Unstructured data does not follow a fixed data model. It includes photographs, video clips, audio files, text documents and logs. Unstructured data is rich in information but usually requires preprocessing (for example, feature extraction or annotation) before it can be used by machine learning models.
Dataset
A dataset is a collection of related data items organised for analysis - often represented in a tabular form where rows are records and columns are attributes. For example, a dataset of students’ test scores would include student identifiers, subject-wise marks and related attributes.
For supervised AI tasks, the dataset is usually divided into three parts:
- Training dataset – the portion used to teach the AI model how to perform a task (commonly around 60–80% of the data).
- Validation dataset – a separate portion used during model development to tune hyperparameters and choose the best model.
- Test dataset – a separate subset used once at the end to estimate the model’s real-world performance (commonly around 10–30% of the data).
Other important dataset considerations include labelling (assigning the correct outputs for supervised tasks), class balance (whether all categories have enough examples), and data augmentation (creating additional training examples by modifying existing ones for images, audio or text).
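The train/validation/test split described above can be sketched in plain Python. The 70/15/15 ratio used here is just one common choice within the ranges mentioned in the text:

```python
import random

def split_dataset(records, train_frac=0.7, val_frac=0.15, seed=42):
    """Shuffle records and split them into train/validation/test portions.

    The 70/15/15 ratio is one common choice; the text notes that training
    sets are often around 60-80% of the data.
    """
    rng = random.Random(seed)   # fixed seed so the split is reproducible
    shuffled = records[:]       # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * train_frac)
    n_val = int(n * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))   # 70 15 15
```

Shuffling before splitting matters: if the records are sorted (for example, by class), an unshuffled split would give the model an unrepresentative training set.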
Acquiring Data from Reliable Sources
Common, reliable ways to gather data for AI projects include:
- Surveys – collecting structured responses from a chosen sample to gather information and insights.
- Cameras – capturing visual data (images and video) for tasks such as object detection and image classification.
- Web scraping – extracting structured information from websites for tasks like news monitoring, market research and price tracking. Follow website terms of use and legal guidelines.
- Observation – gathering information through systematic monitoring and field study.
- Sensors – measuring physical properties (temperature, motion, biometric signals) using devices that report numerical data.
- Application Programming Interfaces (APIs) – using software interfaces that allow secure access to data from other applications or services.
Also consider ethical and privacy issues when collecting data. Obtain consent where needed, anonymise personal information, and follow applicable laws and institutional guidelines.
System Mapping
How to create a System Map (example: Water Cycle)
- System maps show components of a system and the cause–effect relationships among them using arrows.
- Arrowheads indicate the direction of influence; a plus sign (+) near an arrow indicates a direct relationship (as X increases, Y increases), while a minus sign (−) indicates an inverse relationship (as X increases, Y decreases).
- In the Water Cycle system map, elements such as evaporation, condensation, precipitation and collection are connected to show how a change in one part affects the others.
- Building a system map helps you visualise the problem space; you can then identify where interventions or data collection will be most effective.
Now, try building your own system map for a problem you wish to solve. Identify key elements, draw directional links and mark whether each link is a positive (+) or negative (−) influence.
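A system map can also be recorded in code as a set of signed links, which makes it easy to look up how one element influences another. The water-cycle links below follow the example in the text; the representation itself is just one possible sketch:

```python
# A system map as a dictionary of signed links.
# Each (cause, effect) pair maps to +1 (direct) or -1 (inverse influence).
links = {
    ("evaporation", "condensation"): +1,    # more evaporation -> more condensation
    ("condensation", "precipitation"): +1,  # more condensation -> more precipitation
    ("precipitation", "collection"): +1,    # more precipitation -> more collection
    ("collection", "evaporation"): +1,      # more collected water -> more evaporation
}

def effect_direction(cause, effect):
    """Return '+' for a direct link, '-' for an inverse one, None if unlinked."""
    sign = links.get((cause, effect))
    if sign is None:
        return None
    return "+" if sign > 0 else "-"

print(effect_direction("evaporation", "condensation"))  # +
```

For your own problem, add a link for every arrow in your drawing and mark its sign; elements with many incoming or outgoing links are often good places to intervene or collect data.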
Data Exploration
Data exploration helps us understand the dataset’s size, distribution, quality and key patterns before modelling. Analysts use both statistical summaries and graphical visualisations to reveal trends, anomalies, missing values and relationships among features.
Why Data Exploration?
Exploration provides a clear idea of which features matter and how data should be preprocessed. It speeds up later stages such as feature selection, model choice and evaluation by uncovering patterns and potential issues early.
About Data Visualisation Charts
Data visualisation converts information into graphical form to make patterns easier to see and interpret. Common chart types used in data exploration include:
- Column chart – uses vertical bars to show comparisons across categories or changes over time; easy to compare heights of columns.
- Bar chart – uses horizontal or vertical bars to compare different categories; useful for categorical comparisons.
- Line chart – shows data trends over time.
- Pie chart – shows relative proportions of a whole (useful for a small number of categories).
- Histogram – shows distribution of numerical data and helps identify skewness and outliers.
Other useful plots are scatter plots to check relationships between two numeric variables and box plots to detect outliers and understand spread.
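Before drawing any charts, a quick statistical summary already reveals missing values, spread and skew. The sketch below uses a hypothetical list of test scores (including one missing value) and prints a simple text histogram in place of a plotted one:

```python
import statistics
from collections import Counter

# Hypothetical test scores used only to illustrate exploration steps.
scores = [35, 42, 55, 55, 61, 67, 72, 72, 72, 88, 95, None, 40]

# 1. Check for missing values before computing statistics.
clean = [s for s in scores if s is not None]
missing = len(scores) - len(clean)

# 2. Statistical summary: centre and spread of the data.
summary = {
    "count": len(clean),
    "mean": round(statistics.mean(clean), 1),
    "median": statistics.median(clean),
    "min": min(clean),
    "max": max(clean),
}

# 3. A text histogram: bucket scores into ranges of 20 marks.
buckets = Counter((s // 20) * 20 for s in clean)
for low in sorted(buckets):
    print(f"{low:3}-{low + 19:3} | {'#' * buckets[low]}")

print("missing values:", missing)
print(summary)
```

The same numbers would normally be drawn as a histogram or box plot with a plotting library; the bucketed counts here are exactly what a histogram's bars encode.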
Modelling
AI, ML & DL (Venn overview)
- Artificial Intelligence (AI) – the simulation of human intelligence in machines. AI systems can perform tasks that normally require human reasoning, such as planning, understanding language, and making decisions.
- Machine Learning (ML) – a subset of AI in which machines learn from data to make predictions or decisions without explicit programming for each example.
- Deep Learning (DL) – a subset of ML that uses multi-layered neural networks to learn representations of data. Deep learning typically requires large amounts of labelled data and powerful compute resources.
Rule-Based Modelling
Rule-based approaches require a developer to define explicit rules or logical relationships. The machine follows these rules to make decisions. These systems are transparent but can be time-consuming to author and hard to scale for complex patterns.
What is an AI Model?
An AI model is a program trained to recognise patterns from data and produce intelligent outputs. Modelling is the process of designing, training and validating these algorithms so they can perform tasks such as classification, regression and detection.
Rule-Based AI Model (Decision Tree)
A decision tree is a rule-based model presented as a tree of decisions. Each internal node represents a test on a feature, each branch represents the outcome of the test, and each leaf node represents a final decision or prediction. Decision trees are intuitive and easy to interpret; they are useful for classification and regression tasks when rules can be clearly defined.
Learning-Based Approach
The learning-based approach does not require the developer to define explicit rules. Instead, the system receives data (labelled or unlabelled) and automatically discovers patterns and relationships. Supervised learning uses labelled data to learn mappings from inputs to outputs; unsupervised learning finds structure in unlabelled data; reinforcement learning learns by trial and error through interaction with an environment.
Decision Tree in AI
The idea of a decision tree is similar to making a sequence of if–then choices. To design a decision tree you:
- Study the dataset and identify which features most strongly relate to the target output.
- Pick a feature to split the data based on criteria that increase homogeneity of each branch (in formal ML this is often information gain or Gini impurity).
- Repeat splitting until leaves are pure or a stopping rule (like maximum depth) is reached.
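The split criterion mentioned above can be sketched directly. The function below computes Gini impurity and the weighted impurity after a candidate split; the tiny pass/fail dataset is a hypothetical example where splitting at 3 hours of study separates the classes perfectly:

```python
def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = {}
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

def split_impurity(rows, labels, feature_index, threshold):
    """Weighted Gini impurity of the two branches after a threshold split."""
    left = [lab for r, lab in zip(rows, labels) if r[feature_index] <= threshold]
    right = [lab for r, lab in zip(rows, labels) if r[feature_index] > threshold]
    n = len(labels)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical data: each row is [hours_studied]; labels are pass/fail.
rows = [[1], [2], [3], [6], [7], [8]]
labels = ["fail", "fail", "fail", "pass", "pass", "pass"]

# Splitting at 3 hours separates the classes perfectly, so impurity drops to 0.
print(gini(labels))                        # 0.5
print(split_impurity(rows, labels, 0, 3))  # 0.0
```

A decision-tree learner tries many candidate thresholds like this and picks the split that lowers impurity the most, then repeats on each branch.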
Points to Remember when Creating Decision Trees
- Carefully examine the dataset and identify the pattern that determines the output at each leaf.
- Remove redundant or irrelevant features; focus on parameters that directly affect the output.
- There may be multiple decision trees that correctly predict outputs; prefer the simplest tree that performs well (Occam’s principle).
- Decision trees can overfit; pruning or limiting tree depth helps generalisation.
Evaluation
Once a model - for example, a decision tree - has been built and trained, it must be evaluated to assess its performance and efficiency. Evaluation ensures the model makes accurate predictions on new, unseen data and meets the project’s objectives.
Common evaluation steps and metrics include:
- Use the test dataset (kept separate from training and validation) to estimate real-world performance.
- For classification tasks, compute metrics such as accuracy, precision, recall and F1-score.
- For regression tasks, compute metrics such as mean absolute error (MAE) and root mean squared error (RMSE).
- Use cross-validation to reduce variance in performance estimates.
- Analyse failure cases to understand when and why the model makes mistakes and whether more data, different features or a different model type are needed.
- Check computational efficiency and resource usage if the model will be deployed to constrained environments.
Key Evaluation Metrics (brief definitions)
- Accuracy – fraction of total correct predictions. Example formula: Accuracy = (TP + TN) ÷ (TP + TN + FP + FN).
- Precision – fraction of predicted positives that are actually positive. Example formula: Precision = TP ÷ (TP + FP).
- Recall (Sensitivity) – fraction of actual positives that are correctly predicted. Example formula: Recall = TP ÷ (TP + FN).
- F1-score – harmonic mean of precision and recall. Example formula: F1 = 2 × Precision × Recall ÷ (Precision + Recall).
- MAE (Mean Absolute Error) – average absolute difference between predicted and true values.
- RMSE (Root Mean Squared Error) – square root of the average squared differences between predicted and true values; gives larger weight to larger errors.
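The formulas above translate directly into code. The confusion-matrix counts and the prediction lists below are hypothetical numbers chosen only to exercise the functions:

```python
import math

def classification_metrics(tp, tn, fp, fn):
    """Accuracy, precision, recall and F1 from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

def regression_errors(predicted, actual):
    """MAE and RMSE for paired predicted and true values."""
    diffs = [p - a for p, a in zip(predicted, actual)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse

# Hypothetical counts: 40 TP, 45 TN, 5 FP, 10 FN.
acc, prec, rec, f1 = classification_metrics(tp=40, tn=45, fp=5, fn=10)
print(round(acc, 2), round(prec, 2), round(rec, 2), round(f1, 2))

# Hypothetical regression predictions versus true values.
mae, rmse = regression_errors([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])
print(round(mae, 2), round(rmse, 2))
```

Note that RMSE is never smaller than MAE on the same data, because squaring gives the larger errors extra weight.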
Other Evaluation Considerations
- Cross-validation – split the training data into several folds, train on different combinations and average performance to reduce variability of the estimate.
- Confusion matrix – a table showing counts of true positives, true negatives, false positives and false negatives; useful to compute the above metrics and to see types of errors.
- Overfitting – when a model learns the training data too well and performs poorly on new data. Symptoms include very high training accuracy and much lower test accuracy.
- Underfitting – when a model is too simple to capture the underlying pattern and performs poorly on both training and test data.
- Regularisation – techniques (for example, limiting model complexity or adding penalties) to reduce overfitting and improve generalisation.
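The fold-splitting step of cross-validation can be sketched as index bookkeeping: each fold serves once as the held-out test set while the rest form the training set. This is a minimal sketch; ML libraries provide more robust versions with shuffling and stratification:

```python
def k_fold_indices(n_samples, k):
    """Split sample indices into k folds; each fold is held out once."""
    indices = list(range(n_samples))
    fold_size = n_samples // k
    folds = []
    for i in range(k):
        start = i * fold_size
        # The last fold absorbs any leftover samples.
        end = (i + 1) * fold_size if i < k - 1 else n_samples
        test_idx = indices[start:end]
        train_idx = indices[:start] + indices[end:]
        folds.append((train_idx, test_idx))
    return folds

folds = k_fold_indices(n_samples=10, k=5)
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

Training and evaluating the model once per fold, then averaging the k scores, gives a performance estimate that depends far less on any single lucky or unlucky split.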
The process of problem scoping, careful data acquisition, exploration, appropriate modelling (rule-based or learning-based) and careful evaluation together form a practical workflow for building reliable AI solutions suitable for classroom projects and real-world applications.