JoSAA Rank Prediction Engine: A Technical Walkthrough

This post walks through the development of a system that predicts college admission closing ranks for the Joint Seat Allocation Authority (JoSAA) counselling process. It combines historical rank data with machine learning to give students a data-driven look at potential admission outcomes.

Disclaimer: This JoSAA Rank Predictor is an experimental project, primarily for educational and illustrative purposes. Its predictions come from statistical models trained on historical data, so they aren’t guaranteed to be accurate. Always cross-reference with official JoSAA resources or consult advisors before making decisions based on these predictions.

The Core Technical Goal: Build a model that predicts the closing rank for a given college, program, year, and counselling round.

Access the Code: The complete codebase for this project is open-source and available on GitHub: https://github.com/pponnada/josaa-rank-predictor

Table of Contents

  1. System Architecture and Pipeline
    1. Data Acquisition
    2. Data Storage and Structuring
    3. Initial Feature Engineering
    4. Data Preprocessing
    5. Advanced Feature Engineering
    6. Model Training and Evaluation
    7. Prediction Pipeline
  2. Conclusion

System Architecture and Pipeline

The system follows a pretty standard machine learning pipeline, taking raw data and transforming it into predictions through several key stages:

1. Data Acquisition

The foundation is historical opening and closing ranks from the official JoSAA admissions portal: JoSAA Opening Closing Rank Archive. We used the josaa-scrapper Chrome extension to automate the extraction of this data, pulling rank details across different years, rounds, institutes, and programs. The raw data is scraped into delimited text files with a .psv extension, one for each year and round. Although the extension suggests pipe-separated values, the sample below uses # as the field delimiter. Here’s an example of what 2020/round1.psv might look like:

Institute#Academic Program Name#Quota#Seat Type#Gender#Opening Rank#Closing Rank
Indian Institute of Technology Bhubaneswar#Civil Engineering (4 Years, Bachelor of Technology)#AI#OPEN#Gender-Neutral#6836#8816
Indian Institute of Technology Bhubaneswar#Civil Engineering (4 Years, Bachelor of Technology)#AI#OPEN#Female-only (including Supernumerary)#13184#14366
Indian Institute of Technology Bhubaneswar#Civil Engineering and M. Tech. in Structural Engineering (5 Years, Bachelor and Master of Technology (Dual Degree))#AI#OPEN#Gender-Neutral#9317#9751
Indian Institute of Technology Bhubaneswar#Civil Engineering and M. Tech. in Structural Engineering (5 Years, Bachelor and Master of Technology (Dual Degree))#AI#OPEN#Female-only (including Supernumerary)#15881#15881
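
To make the file format concrete, here is a minimal sketch of how such a file could be parsed with Python’s standard csv module. The function name read_rank_file is hypothetical and is not taken from the project. The # delimiter is inferred from the sample above, and the sketch assumes every rank is a plain integer.

```python
import csv
import io

# Two data rows copied from the sample above; a real run would open
# a file such as 2020/round1.psv instead of an in-memory string.
SAMPLE = """Institute#Academic Program Name#Quota#Seat Type#Gender#Opening Rank#Closing Rank
Indian Institute of Technology Bhubaneswar#Civil Engineering (4 Years, Bachelor of Technology)#AI#OPEN#Gender-Neutral#6836#8816
Indian Institute of Technology Bhubaneswar#Civil Engineering (4 Years, Bachelor of Technology)#AI#OPEN#Female-only (including Supernumerary)#13184#14366
"""


def read_rank_file(handle, delimiter="#"):
    """Parse one scraped file into a list of dicts keyed by the header names."""
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = []
    for row in reader:
        # Rank columns arrive as text; convert them to integers.
        # Assumption: every rank is a plain integer string.
        row["Opening Rank"] = int(row["Opening Rank"])
        row["Closing Rank"] = int(row["Closing Rank"])
        rows.append(row)
    return rows


rows = read_rank_file(io.StringIO(SAMPLE))
print(rows[0]["Institute"], rows[0]["Closing Rank"])
```

If some rank values turn out not to be plain integers, the int() conversion would need extra handling.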

2. Data Storage and Structuring

Raw data scraped from JoSAA, stored in the delimited text files described above, is consolidated into a SQLite database by the build_db.py script. This gives us a structured, queryable repository for all historical data, with the schema defined in schema.sql. The build_db.py script handles a few key tasks:

  • Schema Definition: First, it sets up the database structure (tables, columns, data types, relationships) by building the DDL commands dynamically from the file headers and running them.
  • Data Ingestion: Then, it ingests the raw files generated by josaa-scrapper.
  • Data Transformation and Loading: The script parses this data, performs necessary cleaning, and loads it into the corresponding SQLite tables. This step centralizes all historical rank data, keeps its formatting consistent, and makes it easy to query for the feature engineering and model training that follow.
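
The three tasks above can be sketched with Python’s built-in sqlite3 module. This is an illustration under stated assumptions, not the actual build_db.py: the table name ranks, the helper names, and the rule that any column ending in "rank" is stored as INTEGER are all hypothetical.

```python
import sqlite3


def create_table_from_header(conn, table, header):
    """Build and run a CREATE TABLE statement from a list of header names."""
    cols = []
    for name in header:
        # Normalise "Closing Rank" -> closing_rank.
        col = name.strip().lower().replace(" ", "_")
        # Assumption: rank columns are integers, everything else is text.
        sql_type = "INTEGER" if col.endswith("rank") else "TEXT"
        cols.append((col, sql_type))
    ddl = ", ".join(f'"{c}" {t}' for c, t in cols)
    conn.execute(
        f'CREATE TABLE IF NOT EXISTS "{table}" (year INTEGER, round INTEGER, {ddl})'
    )
    return [c for c, _ in cols]


def load_rows(conn, table, columns, year, round_no, rows):
    """Insert parsed rows, tagging each with the year and round of its source file."""
    placeholders = ", ".join("?" for _ in range(len(columns) + 2))
    conn.executemany(
        f'INSERT INTO "{table}" VALUES ({placeholders})',
        [(year, round_no, *row) for row in rows],
    )


header = ["Institute", "Academic Program Name", "Quota", "Seat Type",
          "Gender", "Opening Rank", "Closing Rank"]
rows = [
    ("Indian Institute of Technology Bhubaneswar",
     "Civil Engineering (4 Years, Bachelor of Technology)",
     "AI", "OPEN", "Gender-Neutral", 6836, 8816),
]

conn = sqlite3.connect(":memory:")  # a real run would use a file path instead
columns = create_table_from_header(conn, "ranks", header)
load_rows(conn, "ranks", columns, 2020, 1, rows)
conn.commit()

result = conn.execute("SELECT year, round, closing_rank FROM ranks").fetchall()
print(result)
```

Tagging each row with its year and round at load time matters because the scraped files carry that information only in their path, not in their columns.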

3. Initial Feature Engineering

Next, a fairly comprehensive SQL query (query.sql) runs against the SQLite database. Its job is to pull relevant data and create some initial features. We focused this query on ‘OPEN’ seat types, ‘Gender-Neutral’ categories, and ‘AI’/‘OS’ quotas, specifically for non-IIT institutes, as these represent a common use case. The key features engineered here aim to capture trends over time and relative rank positions. These include:

  • prev_year_closing_rank: The closing rank for the same college/branch combination from the previous year.
    • Why: This provides a strong baseline. Admission trends often show inertia, and the previous year’s rank is a significant indicator of the current year’s rank.
  • delta_closing_rank_1yr: Year-over-year change in closing rank.
    • Why: Captures the immediate trend (e.g., is this program becoming more or less competitive?). A large positive or negative delta can signal shifting demand.
  • delta_closing_rank_2yr_avg: Change in closing rank compared to the average of the last two years.
    • Why: Smooths out single-year anomalies and provides a more stable measure of the recent trend direction and magnitude.
  • is_final_round: A binary indicator (0 or 1) to denote if a particular entry is from the final counselling round for that year. Some years have 5 rounds while others have 6.
    • Why: Closing ranks can vary significantly between rounds. This feature helps the model distinguish between intermediate round data and the final admission cutoffs, which are typically the target for prediction.
  • round_relative_rank_diff: Difference in closing rank compared to the previous round in the same year.
    • Why: Indicates the volatility or stability of ranks within a single admission cycle. Large differences might suggest significant seat filling or withdrawals between rounds.
  • closing_rank_percent_change_from_round1: Percentage change in closing rank from Round 1 for a given year.
    • Why: Normalizes the rank changes across different programs and years, providing a relative measure of how much the rank has shifted from the initial round.
  • mean_closing_rank_last_2yrs: Simple moving average of closing ranks over the past two years.
    • Why: Offers a smoothed historical perspective, reducing the impact of noise from a single year’s data.
  • weighted_moving_avg: A weighted average of past years’ closing ranks, giving more importance to recent years.
    • Why: Assumes that more recent data is more relevant for predicting future ranks. This feature allows the model to weigh recent trends more heavily than older ones.

The resulting dataset, enriched with these features, is exported to historical_data.csv. Here’s a small snippet of what historical_data.csv looks like:

year,round,opening_rank,closing_rank,prev_year_closing_rank,delta_closing_rank_1yr,delta_closing_rank_2yr_avg,is_final_round,round_relative_rank_diff,closing_rank_percent_change_from_round1,mean_closing_rank_last_2yrs,weighted_moving_avg,college_name,academic_program_name
2020,1,40952.0,48265.0,,,48265.0,0,,0,56929.0,,"Assam University, Silchar","Agricultural Engineering (4 Years, Bachelor of Technology)"
2020,2,46576.0,51803.0,,,51803.0,0,2987.0,6.11889544411668,56929.0,,"Assam University, Silchar","Agricultural Engineering (4 Years, Bachelor of Technology)"
2020,3,46576.0,56464.0,,,56464.0,0,6129.0,15.6669944280564,56929.0,,"Assam University, Silchar","Agricultural Engineering (4 Years, Bachelor of Technology)"
2020,4,46576.0,58773.0,,,58773.0,0,4801.0,20.3970009832842,56929.0,,"Assam University, Silchar","Agricultural Engineering (4 Years, Bachelor of Technology)"
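
Several of these features map naturally onto SQL window functions. The sketch below is not the project’s query.sql. It runs a simplified query against an in-memory SQLite table with invented toy values, and it assumes that "previous year" means the same round of the preceding year in the data, that years are consecutive, and that the bundled SQLite supports window functions (version 3.25 or later). The table name ranks is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE ranks (
    year INTEGER, round INTEGER, college_name TEXT,
    academic_program_name TEXT, closing_rank INTEGER)""")
# Toy values, invented for illustration only.
conn.executemany("INSERT INTO ranks VALUES (?, ?, ?, ?, ?)", [
    (2020, 1, "College A", "Program X", 1000),
    (2020, 2, "College A", "Program X", 1100),
    (2021, 1, "College A", "Program X", 900),
    (2021, 2, "College A", "Program X", 990),
])

FEATURE_SQL = """
SELECT
    year,
    round,
    closing_rank,
    LAG(closing_rank) OVER same_round_by_year AS prev_year_closing_rank,
    closing_rank - LAG(closing_rank) OVER same_round_by_year
        AS delta_closing_rank_1yr,
    (round = MAX(round) OVER (PARTITION BY year)) AS is_final_round,
    closing_rank - LAG(closing_rank) OVER rounds_in_year
        AS round_relative_rank_diff,
    100.0 * (closing_rank - FIRST_VALUE(closing_rank) OVER rounds_in_year)
          / FIRST_VALUE(closing_rank) OVER rounds_in_year
        AS closing_rank_percent_change_from_round1
FROM ranks
WINDOW
    same_round_by_year AS (
        PARTITION BY college_name, academic_program_name, round ORDER BY year),
    rounds_in_year AS (
        PARTITION BY college_name, academic_program_name, year ORDER BY round)
ORDER BY year, round
"""

rows = conn.execute(FEATURE_SQL).fetchall()
for row in rows:
    print(row)
```

Rows without a predecessor (the first year, or round 1 within a year) get NULL for the lag-based features, which matches the empty cells in the snippet above and is why imputation is needed in the next stage.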

4. Data Preprocessing

The historical_data.csv (output from the SQL query) then goes through several preprocessing steps in preprocess.py – these are pretty standard but crucial for getting the data ready for an ML model:

  • Missing Value Imputation: Null values, particularly for historical rank features (e.g., prev_year_closing_rank for the earliest year in the dataset), are handled using appropriate imputation strategies.
  • Categorical Encoding: Textual features such as college_name and academic_program_name are transformed into a numerical representation using One-Hot Encoding. The fitted OneHotEncoder object is persisted as encoder.pkl for later use in decoding predictions.
  • Feature Scaling: Numerical features (e.g., opening_rank, closing_rank, and engineered rank-based features) are standardized using StandardScaler. This process scales features to have zero mean and unit variance, which benefits many learning algorithms by preventing features with larger magnitudes from dominating the learning process. The fitted scaler object is also saved as scaler.pkl.

The preprocessed dataset is then saved as preprocessed_data.csv.
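
As a rough illustration of these three steps, here is a small scikit-learn sketch on an invented toy frame. It is not the project’s preprocess.py. Median imputation is an assumption (the post does not name the strategy), and the objects are pickled to bytes here rather than written to encoder.pkl and scaler.pkl.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with the same kinds of columns as historical_data.csv (values invented).
df = pd.DataFrame({
    "college_name": ["College A", "College A", "College B"],
    "academic_program_name": ["Program X", "Program Y", "Program X"],
    "prev_year_closing_rank": [np.nan, 1200.0, 3000.0],
    "closing_rank": [1000.0, 1500.0, 2600.0],
})
categorical = ["college_name", "academic_program_name"]
numerical = ["prev_year_closing_rank", "closing_rank"]

# 1. Impute missing numeric values.
#    Assumption: median imputation; the real strategy may differ.
imputer = SimpleImputer(strategy="median")
df[numerical] = imputer.fit_transform(df[numerical])

# 2. One-hot encode the text columns. With handle_unknown="ignore",
#    categories unseen during fitting are encoded as all zeros.
encoder = OneHotEncoder(handle_unknown="ignore")
encoded = encoder.fit_transform(df[categorical]).toarray()

# 3. Standardize numeric columns to zero mean and unit variance.
scaler = StandardScaler()
scaled = scaler.fit_transform(df[numerical])

# Final feature matrix: scaled numeric columns followed by one-hot columns.
features = np.hstack([scaled, encoded])
print(features.shape)

# Serialize the fitted transformers so the prediction step can reuse them.
encoder_bytes = pickle.dumps(encoder)
scaler_bytes = pickle.dumps(scaler)
```

One design point to keep in mind: fitting the scaler on the full dataset before a train/test split lets test-set statistics leak into training, so a stricter setup would fit it on the training years only.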

5. Advanced Feature Engineering

The feature_engineering.py script ingests preprocessed_data.csv to perform further feature transformations or selection. This stage is intended to derive more sophisticated features that might capture more complex relationships in the data, potentially improving model performance. The final feature set for model training is exported as feature_engineered_data.csv.

6. Model Training and Evaluation

The feature_engineered_data.csv serves as input to the train_model.py script for model development:

  • Data Splitting: The dataset is partitioned into training and testing subsets. This allows the model to be trained on one portion of the data and evaluated on unseen data (the test set) to provide an unbiased assessment of its generalization capabilities. Specifically, data from earlier years is used for training, while data from more recent years is reserved for testing. This chronological split is crucial for time-series data as it simulates a real-world scenario where the model predicts future outcomes based on past trends.
  • Model Selection & Training: A RandomForestRegressor is employed for this regression task. Random Forests are ensemble learning methods that construct multiple decision trees during training and output the mean prediction of the individual trees, which generally improves predictive accuracy and controls over-fitting. The model is trained on the training subset to learn the mapping from input features to the target variable (closing_rank).
  • Persistence: The trained model object is serialized using pickle and saved to josaa_model.pkl. This allows the trained model to be reloaded and used for predictions without retraining.
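
The split, training, and persistence steps can be sketched as follows. The data is synthetic and the cutoff year, feature set, and hyperparameters are hypothetical, so treat this as an outline of the approach rather than the contents of train_model.py.

```python
import pickle

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)

# Synthetic stand-in for feature_engineered_data.csv:
# 50 rows per year, one rank feature, and a noisy target.
years = np.repeat(np.arange(2018, 2024), 50)
prev_rank = rng.uniform(1_000, 50_000, size=years.size)
closing_rank = prev_rank * 1.05 + rng.normal(0, 500, size=years.size)
X = np.column_stack([years, prev_rank])
y = closing_rank

# Chronological split: train on earlier years, test on the most recent one.
cutoff_year = 2023  # hypothetical cutoff
train = years < cutoff_year
test = ~train

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[train], y[train])

mae = mean_absolute_error(y[test], model.predict(X[test]))
print(f"MAE on held-out year: {mae:.0f}")

# Serialize the fitted model and restore it, as the prediction step would.
# The project writes the pickle to josaa_model.pkl instead of keeping it in memory.
restored = pickle.loads(pickle.dumps(model))
```

A random shuffle would mix future rows into the training set and make the error look better than it would be in practice, which is why the mask is built from the year column.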

7. Prediction Pipeline

The predict.py script orchestrates the generation of closing rank predictions for a target year and round (e.g., 2025, Round 6):

  • Artifact Loading: It loads the persisted model (josaa_model.pkl), scaler (scaler.pkl), and one-hot encoder (encoder.pkl) that were saved during the preprocessing and training phases.
  • Input Data Preparation: To predict for a future year, the script takes the latest available historical data for each unique college-branch combination as a template. It then updates time-dependent features (like ‘year’ and ‘round’) to reflect the target prediction period. Other features are derived or carried over based on the logic established in the feature engineering steps.
  • Prediction: The prepared feature matrix for the target year is fed into the loaded RandomForestRegressor model, which outputs scaled predictions for the closing_rank.
  • Inverse Transformation: The model’s predictions are initially in a scaled format, and categorical identifiers are one-hot encoded. To make them human-readable:
    • The scaler object’s inverse_transform method is used to convert the scaled closing_rank predictions back to their original rank values. This requires constructing a dummy DataFrame with the same structure as the data used to fit the scaler, placing the predicted scaled values in the correct column, and then applying the inverse transformation.
    • The encoder object’s inverse_transform method is used to convert the one-hot encoded college and program features back to their original string representations.
  • Reporting: The final, human-readable predictions, along with the last known historical final ranks for comparative analysis, are compiled into a pandas DataFrame. This DataFrame is then sorted by the predicted_closing_rank in ascending order and saved to a CSV file named prediction_report.csv. Here’s a glimpse of what prediction_report.csv might contain:

college_name,academic_program_name,year,round,historical_final_closing_rank,predicted_closing_rank
"School of Planning & Architecture, New Delhi","Architecture (5 Years, Bachelor of Architecture)",2025,6,238,222
"National Institute of Technology, Tiruchirappalli","Architecture (5 Years, Bachelor of Architecture)",2025,6,404,249
"School of Planning & Architecture, New Delhi","Planning (4 Years, Bachelor of Planning)",2025,6,435,337
National Institute of Technology Calicut,"Architecture (5 Years, Bachelor of Architecture)",2025,6,466,379
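
The inverse transformation step described above is the least obvious part of the pipeline, so here is a small sketch of the placeholder-matrix technique. The helper unscale_column, the column order, and all values are hypothetical. The sketch uses a NumPy array rather than a DataFrame, and it assumes the scaler was fitted on several numeric columns with closing_rank among them.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Fit a scaler on three numeric columns, as preprocessing would (values invented).
columns = ["opening_rank", "closing_rank", "prev_year_closing_rank"]
history = np.array([
    [900.0, 1000.0, 1100.0],
    [1400.0, 1500.0, 1450.0],
    [2500.0, 2600.0, 3000.0],
])
scaler = StandardScaler().fit(history)


def unscale_column(scaler, scaled_values, column_index, n_columns):
    """Invert scaling for one column of a scaler fitted on several columns."""
    # Build a placeholder matrix with the fitted width; other columns stay zero.
    dummy = np.zeros((len(scaled_values), n_columns))
    dummy[:, column_index] = scaled_values
    # StandardScaler inverts each column independently, so the filler
    # zeros in the other columns do not affect the column we read back.
    return scaler.inverse_transform(dummy)[:, column_index]


target_idx = columns.index("closing_rank")
# Stand-in for model output: the scaled closing ranks of the fitting data.
scaled_predictions = scaler.transform(history)[:, target_idx]
ranks = unscale_column(scaler, scaled_predictions, target_idx, len(columns))
print(np.round(ranks))
```

The same result can be computed directly as scaled_predictions * scaler.scale_[target_idx] + scaler.mean_[target_idx], which avoids building the placeholder matrix.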

Conclusion

This project demonstrates a systematic, data-driven approach to building a predictive model for JoSAA closing ranks. It covers key stages of a typical machine learning workflow, from data acquisition and preprocessing to feature engineering, model training, and prediction.

While this system serves as an educational tool and a practical example of applying ML to real-world data, the accuracy of its predictions is subject to the quality and representativeness of the historical data and the inherent complexities of the admissions process.

The codebase is available on GitHub for review, further development, and contributions. Potential future enhancements could include exploring alternative modeling techniques (e.g., Gradient Boosting, Neural Networks), incorporating more diverse data sources, or refining feature engineering strategies to capture more nuanced trends.