This document outlines the project’s core components:
- The customer definition.
- Functional and non-functional requirements.
- Detailed capacity estimation to guide the system’s architecture.
1. Customer and Business Definition
Based on the initial analysis, we can define the customer and the core business problem the project aims to solve.
- Customer: The primary customer is a government entity responsible for smart-city and public-utility services. This entity manages and improves the efficiency, reliability, and appeal of public transportation in a major urban center, such as Da Nang or Ho Chi Minh City.
- Business Problem: The public bus system is caught in a cycle of inefficiency. Passengers face unpredictable wait times and long journeys due to traffic and rigid schedules, making the bus an unreliable option for daily commutes. Simultaneously, bus operators are forced to service every stop regardless of demand, leading to wasted fuel, increased operational costs, and wear and tear on vehicles, especially during off-peak hours. This inefficiency further contributes to the poor service that deters ridership.
2. Requirements and Scope
The system is designed to bridge the information gap between passenger demand and bus operations.
Functional Requirements
The system will provide the following key functions:
- Real-time Bus Tracking: Passengers can view the live location of buses on a map via a mobile application.
- Dynamic ETA Calculation: The system will provide passengers with a continuously updated Estimated Time of Arrival (ETA) that accounts for traffic and the bus’s progress.
- Demand-Responsive Service (Digital Hail): Passengers can signal their intent to board a bus from a specific stop using the mobile app or a physical button at the bus stop.
- In-Cabin Decision Support: Bus drivers receive real-time notifications on a simple in-cabin interface, indicating which upcoming stops have waiting passengers, allowing them to safely bypass empty ones.
- Alternative Route Suggestions: In case of significant delays, the system will proactively suggest alternative routes or connections to passengers.
- Administrative Dashboard: A web-based portal for operators to monitor system health, manage routes, and analyze operational data.
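The in-cabin decision support described above can be sketched as a simple filter over the stops ahead of the bus. This is a minimal illustration under stated assumptions, not the project's implementation: the function name `plan_upcoming_stops`, the stop identifiers, and the `"STOP"`/`"BYPASS"` labels are hypothetical, and only boarding requests (hails) are modeled; passengers who need to alight are not considered here.

```python
def plan_upcoming_stops(
    upcoming_stops: list[str], hailed_stops: set[str]
) -> list[tuple[str, str]]:
    """Label each upcoming stop, in route order, as STOP or BYPASS.

    A stop is served only if at least one passenger has hailed from it
    (via the mobile app or the physical button at the stop).
    """
    return [
        (stop_id, "STOP" if stop_id in hailed_stops else "BYPASS")
        for stop_id in upcoming_stops
    ]


# Hypothetical stop IDs, for illustration only.
plan = plan_upcoming_stops(["S12", "S13", "S14"], hailed_stops={"S13"})
for stop_id, action in plan:
    print(stop_id, action)
```

A real driver interface would likely also need hail expiry and drop-off requests; those are left out to keep the sketch focused on the bypass decision.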
Non-Functional Requirements
For the system (if needed)
| Category | Requirement | Priority | Justification |
|---|---|---|---|
| Performance | ETA should be accurate to within ±1 minute 95% of the time. | High | Core to building passenger trust and system reliability. |
| | Location and status updates must be reflected in the app in under 5 seconds. | High | Ensures a “real-time” user experience. |
| Reliability | The system must maintain >98% uptime during operational hours. | High | Essential for a public utility that users depend on daily. |
| | The system must automatically recover from network disruptions within 30 seconds. | Medium | Ensures service continuity and data integrity. |
| Connectivity | Stable 4G/LTE connectivity must be maintained in at least 95% of the operational area. | High | The backbone of the real-time data exchange. |
| Durability | Physical buttons at bus stops must be weather-resistant (IP56 rated) and withstand at least 300,000 presses. | High | Ensures long-term reliability of hardware in public spaces. |
| Safety & Compliance | All hardware must meet electrical safety standards (IEC 60950-1) and transportation regulations (ISO 26262). | High | Non-negotiable for public deployment and user safety. |
| Scalability | The architecture must support scaling to 500+ buses without a significant drop in performance. | Medium | Prepares the system for future expansion to a city-wide scale. |
For the actual AWS Architecture (Cloud-Centric)
Pillar: Performance Efficiency
Focuses on using resources efficiently to deliver the best performance.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| API Read Latency | The GET /bus/eta API endpoint must maintain a 99th percentile (p99) latency of < 300ms. | With 5,000 concurrent users all potentially requesting ETAs, this stringent latency target ensures the application feels responsive even under peak load. |
| Data Ingestion Throughput | The system must sustain ingestion of 15 location updates per second (1 update/10s for each of 150 buses) with an end-to-end processing latency of < 2 seconds. | This requirement is directly derived from the number of IoT devices (buses) and their update frequency. It ensures the data backend can handle the constant stream without backlogs. |
| Elasticity for Peak Events | The architecture must automatically scale to handle 8,000 concurrent users, a 60% increase over the normal peak load of 5,000. | This provides a realistic buffer for common urban events (e.g., festivals, football matches) that cause predictable traffic spikes, rather than an arbitrary 3x multiplier. This is a form of dynamic scaling. |
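The ingestion and elasticity figures in the table above follow directly from the capacity assumptions. The short sketch below reproduces that arithmetic; the constant and function names are illustrative, while the numbers are the ones stated in this document.

```python
# Figures taken from the Assumptions section of this document.
BUSES = 150
UPDATE_INTERVAL_S = 10
NORMAL_PEAK_USERS = 5_000
EVENT_PEAK_USERS = 8_000


def ingestion_rate(buses: int, interval_s: float) -> float:
    """Location updates per second across the whole fleet."""
    return buses / interval_s


def headroom(event_users: int, normal_users: int) -> float:
    """Fractional increase of the event peak over the normal peak."""
    return event_users / normal_users - 1


print(f"ingestion: {ingestion_rate(BUSES, UPDATE_INTERVAL_S):.0f} updates/s")
print(f"event headroom: {headroom(EVENT_PEAK_USERS, NORMAL_PEAK_USERS):.0%}")
```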
Pillar: Reliability
Ensures the system can recover from failures and meet commitments.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| Service Availability | Achieve 99.9% (“three nines”) monthly uptime during the 16 daily operating hours. | Critical for a public utility serving 20,000 daily users. This availability target ensures the service is dependable for daily commuters. |
| Fault Tolerance | The system must be deployed across a minimum of two Availability Zones (AZs), with automated failover. The failure of one AZ must not impact service availability. | This is a foundational best practice to ensure the workload can withstand the failure of a single data center and continue serving its thousands of users. |
| Disaster Recovery | RTO = 15 minutes / RPO = 5 minutes. | For a system processing constant real-time updates from 150 buses, this ensures that in a disaster, the service is restored quickly with minimal data loss. |
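To make the availability target concrete, it helps to convert 99.9% into a monthly downtime budget over the 16 daily operating hours. The sketch below does that and compares the budget with the 15-minute RTO. The 30-day month is an assumption made here for illustration, as are the names.

```python
OPERATING_HOURS_PER_DAY = 16
DAYS_PER_MONTH = 30          # assumption: a 30-day month
AVAILABILITY_TARGET = 0.999  # "three nines"
RTO_MINUTES = 15


def downtime_budget_minutes(hours_per_day: float, days: int, availability: float) -> float:
    """Allowed downtime per month, counted only over operating hours."""
    operating_minutes = hours_per_day * days * 60
    return operating_minutes * (1 - availability)


budget = downtime_budget_minutes(
    OPERATING_HOURS_PER_DAY, DAYS_PER_MONTH, AVAILABILITY_TARGET
)
print(f"monthly downtime budget: {budget:.1f} minutes")
print(f"one recovery at full RTO uses {RTO_MINUTES / budget:.0%} of the budget")
```

Under these assumptions the budget is a little under half an hour per month, so a single incident that takes the full RTO to recover would consume roughly half of it.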
Pillar: Security
Focuses on protecting information and systems.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| User Data Protection | 100% of data for the 20,000+ users (and future growth) must be encrypted at rest (AES-256) and in transit (TLS 1.2+). | Essential for protecting Personally Identifiable Information (PII) and building user trust at scale. |
| Device Authentication | The 150 bus devices and 500 bus stop devices must authenticate using unique, revocable X.509 certificates. | Prevents unauthorized devices from injecting malicious data into the system, a critical threat vector in any IoT network. |
| Network Isolation | Application and database tiers must be located in private subnets, isolated from direct public internet access. | This fundamental security posture reduces the attack surface for a system that is, by its nature, publicly exposed via its IoT endpoints and user application. |
Pillar: Cost Optimization
Focuses on avoiding unnecessary costs.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| Off-Peak Scaling | Compute and database resources must automatically scale down during off-peak hours (e.g., 10 PM to 6 AM), reducing costs by at least 50% compared to peak-hour expenditure. | Our model assumes significantly lower traffic outside the 16 operating hours. This NFR ensures the architecture leverages that pattern to optimize cost. |
| Data Lifecycle Management | Raw IoT data (approx. 215 MB/day) must be automatically transitioned from hot storage (e.g., S3 Standard) to archival storage (e.g., S3 Glacier) after 90 days. | Balances the need for recent data for analysis against the cost of keeping an ever-growing volume of historical data in hot storage long-term. |
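One way the 90-day transition could be expressed is as an S3 lifecycle configuration. The sketch below builds the rule as a plain Python dictionary in the shape accepted by boto3's `put_bucket_lifecycle_configuration`. The bucket name, the `raw/` prefix, and the rule ID are hypothetical placeholders, not values from this project.

```python
import json

# Hypothetical names: the bucket and prefix are placeholders for illustration.
BUCKET_NAME = "example-bus-iot-raw"
RAW_DATA_PREFIX = "raw/"

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-raw-iot-after-90-days",
            "Status": "Enabled",
            "Filter": {"Prefix": RAW_DATA_PREFIX},
            # Move objects to S3 Glacier 90 days after creation.
            "Transitions": [{"Days": 90, "StorageClass": "GLACIER"}],
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))

# Applying it would look roughly like this (requires boto3 and AWS credentials):
#   import boto3
#   boto3.client("s3").put_bucket_lifecycle_configuration(
#       Bucket=BUCKET_NAME, LifecycleConfiguration=lifecycle_configuration)
```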
Pillar: Operational Excellence
Focuses on running and monitoring systems to deliver business value.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| Centralized Monitoring | All 150 buses and 500 bus stops, plus all backend services, must send metrics and logs to a centralized system (e.g., Amazon CloudWatch). | With hundreds of distributed components, centralized monitoring is the only feasible way to get a holistic view of system health and troubleshoot issues effectively. |
| Automated Alerting | Automated alarms must trigger when key performance indicators (e.g., API p99 latency > 300ms, ingestion failure rate > 1%) are breached for more than 5 minutes. | For a system of this scale, manual monitoring is impractical. Automated alerts enable a small operations team to manage a large, distributed infrastructure proactively. |
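The "breached for more than 5 minutes" condition can be illustrated with a small evaluation function over timestamped metric samples. This is a simplified sketch of the rule as worded above, not a description of how Amazon CloudWatch evaluates alarms. The `Sample` type, the function name, and the one-minute sampling interval are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp_s: int  # seconds since an arbitrary start
    value: float      # e.g., API p99 latency in milliseconds


def alarm_fires(samples: list[Sample], threshold: float, min_breach_s: int = 300) -> bool:
    """Return True if the metric stays above `threshold` for more than `min_breach_s`."""
    breach_start = None
    for sample in sorted(samples, key=lambda s: s.timestamp_s):
        if sample.value > threshold:
            if breach_start is None:
                breach_start = sample.timestamp_s
            if sample.timestamp_s - breach_start > min_breach_s:
                return True
        else:
            # Any recovery resets the breach window.
            breach_start = None
    return False


# One-minute samples: a sustained breach versus a short spike.
sustained = [Sample(t * 60, 350.0) for t in range(7)]
spike = [Sample(t * 60, 350.0 if t < 3 else 120.0) for t in range(7)]

print(alarm_fires(sustained, threshold=300.0))
print(alarm_fires(spike, threshold=300.0))
```

Requiring a sustained breach rather than a single bad sample keeps short latency spikes from paging the operations team.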
Pillar: Sustainability (Optional)
Focuses on minimizing the environmental impacts of running cloud workloads.
| Requirement | Metric | Linked Capacity Assumption & Justification |
|---|---|---|
| Workload Scheduling | Non-critical data processing tasks (e.g., generating weekly analytics reports) must be scheduled to run during periods of low energy grid carbon intensity. | Even though the user base is regional, the cloud resources are not. This ensures that the environmental impact of the backend processing is minimized. |
| Efficient Hardware Selection | When selecting instance types, prioritize ARM-based AWS Graviton processors where workloads are compatible. | Graviton processors offer better performance per watt, reducing the overall energy consumption required to serve the 20,000 daily users. |
Out of Scope
To ensure focus and feasibility, the following features are explicitly out of scope for the initial version:
- Fare Collection and Payment Processing: The system will not handle ticketing or payments.
- Real-time Traffic Light Integration: While a future goal, the initial system will not directly interface with municipal traffic control systems.
- B2B Fleet Management Features: The system is designed for public use and will not include features for managing private fleets, such as pre-registered passenger lists.
3. Capacity Estimation
This section outlines the estimated load and data requirements for an initial deployment in a medium-sized city like Da Nang, which has approximately 15 major bus routes.
Assumptions
Note
We deliberately choose fairly large numbers for every aspect of system usage so that the resulting architecture is comprehensive.
- Number of Buses: 150 buses operating across all routes.
- Bus Stops: 500 bus stops equipped with hailing buttons.
- Operating Hours: 16 hours per day (6 AM to 10 PM).
- Peak Hours: 4 hours per day (7-9 AM and 5-7 PM).
- Active Users (Passengers): 20,000 daily active users (DAU).
- Peak Concurrent Users: 5,000 users during peak hours.
Estimation Tables (temporary)
IoT Data Generation
| Data Source | Frequency | Data per Message | Daily Data per Unit | Total Daily Data |
|---|---|---|---|---|
| Bus GPS Location | 1 update / 10 sec | 256 bytes | 1.4 MB | 210 MB |
| Bus Stop “Hail” | 50 hails / day (avg) | 128 bytes | 6.4 KB | 3.2 MB |
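The daily volumes in the table above can be reproduced from the stated assumptions. The sketch below shows the arithmetic; the variable names are illustrative. Note that the GPS figures in the table (1.4 MB and 210 MB) match the results when expressed in binary megabytes (MiB), while the hail figures match decimal units.

```python
OPERATING_HOURS = 16
BUSES = 150
STOPS = 500
GPS_INTERVAL_S = 10
GPS_MESSAGE_BYTES = 256
HAILS_PER_STOP_PER_DAY = 50
HAIL_MESSAGE_BYTES = 128

MIB = 1024 ** 2

operating_seconds = OPERATING_HOURS * 3600
gps_messages_per_bus = operating_seconds // GPS_INTERVAL_S

gps_bytes_per_bus = gps_messages_per_bus * GPS_MESSAGE_BYTES
gps_bytes_fleet = gps_bytes_per_bus * BUSES

hail_bytes_per_stop = HAILS_PER_STOP_PER_DAY * HAIL_MESSAGE_BYTES
hail_bytes_all_stops = hail_bytes_per_stop * STOPS

total_bytes = gps_bytes_fleet + hail_bytes_all_stops

print(f"GPS per bus:   {gps_bytes_per_bus / MIB:.2f} MiB/day")
print(f"GPS fleet:     {gps_bytes_fleet / MIB:.1f} MiB/day")
print(f"Hail per stop: {hail_bytes_per_stop / 1000:.1f} KB/day")
print(f"Hail total:    {hail_bytes_all_stops / 1e6:.1f} MB/day")
print(f"Total raw IoT: {total_bytes / MIB:.0f} MiB/day")
```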
API Traffic Estimation
| API Endpoint | Calls per Second (Peak) | Calls per Second (Off-Peak) |
|---|---|---|
| POST /bus/location (from buses) | 15 calls/sec | 15 calls/sec |
| POST /stop/hail (from stops) | ~5 calls/sec | ~1 call/sec |
| GET /bus/eta (from apps) | 500 calls/sec | 50 calls/sec |
| Total | ~520 calls/sec | ~66 calls/sec |
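As a quick sanity check, the totals in the table above can be recomputed from the per-endpoint rates, and two rates can be related back to the assumptions. The interpretation of the ETA rate as a per-user polling interval is an inference made here, not something the source states. The dictionary layout and names are illustrative.

```python
# Per-endpoint call rates (calls/sec) as listed in the table: (peak, off-peak).
RATES = {
    "POST /bus/location": (15, 15),
    "POST /stop/hail": (5, 1),
    "GET /bus/eta": (500, 50),
}

peak_total = sum(peak for peak, _ in RATES.values())
off_peak_total = sum(off_peak for _, off_peak in RATES.values())

# Average hail rate implied by 500 stops x 50 hails/day over 16 operating hours.
average_hail_rate = 500 * 50 / (16 * 3600)

# If all 5,000 peak concurrent users poll for ETAs, 500 calls/sec implies
# one request per user every N seconds.
implied_poll_interval_s = 5_000 / RATES["GET /bus/eta"][0]

print(f"peak total: {peak_total} calls/sec, off-peak total: {off_peak_total} calls/sec")
print(f"average hail rate: {average_hail_rate:.2f} calls/sec")
print(f"implied ETA polling interval: {implied_poll_interval_s:.0f} s per user")
```

The average hail rate comes out well below the ~5 calls/sec peak figure, which is consistent with the document's stated choice of deliberately generous estimates.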
Storage Requirements
| Data Type | Daily Volume | Retention Policy | Estimated Yearly Storage |
|---|---|---|---|
| Raw GPS & Hail Data | ~215 MB | 90 days | ~75 GB (Hot Storage) |
| Aggregated Daily Metrics | ~20 MB | 3 years | ~22 GB (Cold Storage) |
| User & Route Data | Negligible | Indefinite | ~5 GB |
| Total (First Year) | | | ~102 GB |
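The storage figures can be cross-checked against the daily volumes and retention periods. The sketch below is a rough recomputation under the stated assumptions (decimal units, 365-day year); the names are illustrative. Its results suggest that the table's ~75 GB is close to one full year of raw data, that only about 90 days of it would sit in hot storage at any time, and that ~22 GB corresponds to aggregated metrics at the full three-year retention.

```python
RAW_MB_PER_DAY = 215
AGGREGATED_MB_PER_DAY = 20
HOT_RETENTION_DAYS = 90
AGGREGATE_RETENTION_YEARS = 3
DAYS_PER_YEAR = 365


def mb_to_gb(megabytes: float) -> float:
    """Convert decimal megabytes to decimal gigabytes."""
    return megabytes / 1000


raw_per_year_gb = mb_to_gb(RAW_MB_PER_DAY * DAYS_PER_YEAR)
raw_hot_window_gb = mb_to_gb(RAW_MB_PER_DAY * HOT_RETENTION_DAYS)
aggregated_per_year_gb = mb_to_gb(AGGREGATED_MB_PER_DAY * DAYS_PER_YEAR)
aggregated_full_retention_gb = aggregated_per_year_gb * AGGREGATE_RETENTION_YEARS

print(f"raw data generated per year:       {raw_per_year_gb:.1f} GB")
print(f"raw data within 90-day hot window: {raw_hot_window_gb:.1f} GB")
print(f"aggregated metrics per year:       {aggregated_per_year_gb:.1f} GB")
print(f"aggregated metrics after 3 years:  {aggregated_full_retention_gb:.1f} GB")
```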
Bandwidth Requirements
| Traffic Direction | Peak Rate | Sustained Rate |
|---|---|---|
| Ingress (from IoT devices) | ~5 Mbps | ~5 Mbps |
| Egress (to user apps) | ~150 Mbps | ~15 Mbps |
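The source does not state per-request payload sizes for the bandwidth table, but they can be backed out from the rates given elsewhere in this section. The sketch below is an inference under that assumption: it derives the response size implied by the egress figures and the raw payload-only ingress rate from the IoT message sizes. Protocol overhead (TLS, HTTP headers, retries) is not modeled, and the names are illustrative.

```python
# Figures from the bandwidth and API traffic tables.
EGRESS_PEAK_MBPS = 150
EGRESS_SUSTAINED_MBPS = 15
ETA_CALLS_PEAK = 500
ETA_CALLS_OFF_PEAK = 50

# Figures from the IoT data generation and API traffic tables.
LOCATION_CALLS_PER_S = 15
LOCATION_MESSAGE_BYTES = 256
HAIL_CALLS_PEAK_PER_S = 5
HAIL_MESSAGE_BYTES = 128


def implied_response_bytes(mbps: float, calls_per_s: float) -> float:
    """Average bytes per response implied by a bandwidth figure and a call rate."""
    return mbps * 1_000_000 / 8 / calls_per_s


peak_response = implied_response_bytes(EGRESS_PEAK_MBPS, ETA_CALLS_PEAK)
sustained_response = implied_response_bytes(EGRESS_SUSTAINED_MBPS, ETA_CALLS_OFF_PEAK)

raw_ingress_bps = 8 * (
    LOCATION_CALLS_PER_S * LOCATION_MESSAGE_BYTES
    + HAIL_CALLS_PEAK_PER_S * HAIL_MESSAGE_BYTES
)

print(f"implied response size (peak):      {peak_response / 1000:.1f} KB")
print(f"implied response size (sustained): {sustained_response / 1000:.1f} KB")
print(f"payload-only ingress at peak:      {raw_ingress_bps / 1000:.1f} kbps")
```

Both egress figures imply the same average response size, so the peak and sustained rows are internally consistent. The payload-only ingress is far below ~5 Mbps, so that row appears to include substantial headroom for overhead and growth.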