Assignar Forms Library

December 15, 2020 · 6 min read

Zarmina Muhammad

Data Analyst

I would like to share the story of my project: the challenges I faced and what I learned while revamping Assignar’s forms into readily available templates. These templates are meant to support form autofill and to quickly produce cleaner, more refined data for future ML solutions, adding value to forms automation.

I had limited exposure to the corporate world in Australia, and the Assignar Forms Library was one of the first challenges I was assigned in my early days as a Data Analyst at Assignar. Customers were generating thousands of forms within the Assignar product every week. At that scale, with self-generated titles and randomly assigned questions, we needed to identify the unique forms with clear titles, and this was a daunting task.

From a quantitative perspective, Assignar’s customers generated and submitted approximately 4 million forms between 2014 and 2020, and 2020 alone contributed approximately 1.5 million of that total. This number is increasing exponentially every year, as shown in the figure below.

Forms Trend (2014-2020)

As we formally began the project, we made sure we understood the client’s needs and requirements so that we could identify the correct scope and deliverables. We quickly realised that the aim was to transform existing practices, i.e. to build readily available templates that would help customers to:

- Fill in forms easily and quickly.
- Access customisable form templates, adding or removing questions as required.
- Access and analyse their own form usage in each form category, so that they can make informed decisions to maintain or reduce their dependency on form creation.
- Generate cleaner, more refined data to assist future ML solutions, including tasks such as forms automation.
- Assess and analyse form usage within the Assignar product(s) more efficiently and effectively, both at an aggregated level and from an individual customer’s perspective.

We clearly identified that the client required form templates with specific categories and standard questions, and that these templates had to fulfil the following criteria:

Expected Outcomes

- Categorisation: The form templates were to be offered as a new feature of the Assignar product, enabling users to readily access user-friendly, customisable templates by selecting the desired form category.
- Easier Onboarding: Improve customers’ success with forms by making the forms easily accessible and iterative.
- Data Standardisation: Transform the data from its simple form into a standardised form, helping customers identify the specific categories in which forms were being generated in abundance, and the reasons behind it.
- Benchmarking: Introduce benchmarking to give better insight into how frequently forms were being generated, with each form correctly identified under its category, for example:
  - Which particular category or domain?
  - What measures and protocols should have been followed to reduce this number?

We had to quickly brainstorm ideas on how to move forward, and through a pragmatic and structured approach we drafted a comprehensive project plan. Tasks and deliverables were clearly defined and allocated to each team member. Following the practices of the ‘Agile’ project management framework was instrumental in tracking progress towards our deliverables and milestones. Project communication was also critical to our success: during execution and delivery we set up communication channels (Zoom, Slack, email and WhatsApp, with daily follow-up meetings) to stay in continuous contact with the client, supervisor and team.

The 3 main phases of the ‘Forms’ library project, from an ‘Infrastructure’ perspective, were:

├── Exploratory_Data_Analysis
├── Data Source
│
├── Form Categorisation: Identifying top 5 categories
│   │
│   ├── Analysis_1: Identifying top 5 occurring titles of Forms Data (as top 5 categories)
│   ├── Analysis_2: Defining Category variable
│   ├── Analysis_3: Defining top 5 categories from the results of topic_modeling (GSDMM)
│
├── Identifying sub-categories for all categories
│   │
│   ├── Analysis_4: Defining sub-categories using topic modelling (GSDMM)
│
├── Form templates
│   │
│   ├── Embedding form template data
│   ├── Cluster analysis
│   ├── Clustering post processing
│
├── Feedback Model

Form Categorisation: To identify the top form categories and their main title types. This includes the following steps:
- Individual title as a category: Titles with a higher question count were treated as categories.
- Defining the Category variable: Identifying the categories from the titles via an SQL query.
- Traditional ML: The objective was to use Dirichlet multinomial mixture clustering (GSDMM) to identify the top 5 topics in the form titles, and then to repeat the same process to identify subcategories/titles within the top 5 categories.
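
To make the topic-modelling step more concrete, here is a minimal sketch of GSDMM (Gibbs sampling for a Dirichlet multinomial mixture) in plain Python. It is not the code used in the project: the sample titles, the function name `gsdmm`, and the values of `K`, `alpha`, `beta` and `n_iters` are all assumptions chosen for illustration.

```python
import math
import random
from collections import Counter


def gsdmm(docs, K=8, alpha=0.1, beta=0.1, n_iters=15, seed=0):
    """Cluster tokenised short texts; returns one cluster label per document."""
    rng = random.Random(seed)
    vocab_size = len({w for d in docs for w in d})
    D = len(docs)
    m = [0] * K                          # number of documents in each cluster
    n = [0] * K                          # number of words in each cluster
    nw = [Counter() for _ in range(K)]   # per-cluster word counts
    labels = []

    # Start from a random assignment of documents to clusters.
    for d in docs:
        z = rng.randrange(K)
        labels.append(z)
        m[z] += 1
        n[z] += len(d)
        nw[z].update(d)

    def log_score(d, z):
        # Log of the (unnormalised) probability that document d joins cluster z.
        s = math.log(m[z] + alpha) - math.log(D - 1 + K * alpha)
        for w, c in Counter(d).items():
            for j in range(c):
                s += math.log(nw[z][w] + beta + j)
        for i in range(len(d)):
            s -= math.log(n[z] + vocab_size * beta + i)
        return s

    for _ in range(n_iters):
        for i, d in enumerate(docs):
            # Take the document out of its current cluster...
            z = labels[i]
            m[z] -= 1
            n[z] -= len(d)
            nw[z].subtract(d)
            # ...and sample a new cluster in proportion to the scores.
            logs = [log_score(d, k) for k in range(K)]
            top = max(logs)
            weights = [math.exp(x - top) for x in logs]
            z = rng.choices(range(K), weights=weights)[0]
            labels[i] = z
            m[z] += 1
            n[z] += len(d)
            nw[z].update(d)
    return labels


# Hypothetical form titles, for illustration only.
titles = [
    "daily plant prestart", "plant prestart checklist",
    "site safety inspection", "safety inspection report",
    "daily timesheet", "weekly timesheet",
]
docs = [t.split() for t in titles]
labels = gsdmm(docs, K=4)
for title, label in zip(titles, labels):
    print(label, title)
```

Each title is assigned to exactly one cluster, which is why this family of models suits short texts such as form titles. `K` is only an upper bound: some clusters may end up empty, so the number of topics actually found can be smaller.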

Form Templates: Form templates were generated for each title obtained from form categorisation.
Categorisation produced 5 form categories, each with at least 3 form titles/subcategories. In this step, a form template is created for each individual title. This includes:
- Embedding_form_template_data: Computing sentence embeddings with BERT, which converts the questions in each form to numeric vectors.
- Cluster analysis: Creating 20 clusters from the sentence embeddings to capture sentence context.
- Clustering post-processing: Merging the question data with the options present in ffa_form, and then selecting the questions for each form category.
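
The embed-then-cluster idea can be sketched in a few lines of plain Python. Two substitutions are made so that the example runs without any model download, and both are assumptions rather than the project's method: a normalised bag-of-words vector stands in for the BERT sentence embedding, and k-means is used as the clustering algorithm (the write-up does not name one). The sample questions and `k=3` are also illustrative; the project used 20 clusters.

```python
import math
import random
from collections import Counter


def embed(question, vocab):
    """Stand-in for a BERT sentence embedding: L2-normalised word counts."""
    counts = Counter(question.lower().split())
    vec = [float(counts[w]) for w in vocab]
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def kmeans(vectors, k, n_iters=20, seed=0):
    """Plain k-means; returns the cluster index of each vector."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    labels = [0] * len(vectors)
    for _ in range(n_iters):
        # Assignment step: attach each vector to its nearest centroid.
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical form questions, for illustration only.
questions = [
    "Is the operator licensed?",
    "is the operator licensed?",
    "Is the site fenced?",
    "Are the brakes working?",
    "Are the lights working?",
    "Is the site signposted?",
]
vocab = sorted({w for q in questions for w in q.lower().split()})
vectors = [embed(q, vocab) for q in questions]
labels = kmeans(vectors, k=3)
for q, lab in zip(questions, labels):
    print(lab, q)
```

With real sentence embeddings, questions that are worded differently but mean the same thing would also tend to land in the same cluster, which a bag-of-words stand-in cannot do.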

Feedback Model: To modify and update the existing templates based on customers’ responses.
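
As a rough illustration of what such a feedback loop could look like, the sketch below revises a template from the questions customers actually kept or added when they used it. The post does not describe the model's internals, so everything here is hypothetical: the function `update_template`, the `add_at` and `drop_at` thresholds, and the sample data.

```python
from collections import Counter


def update_template(template, submissions, add_at=0.6, drop_at=0.25):
    """Return a revised question list based on how often each question is used.

    template    -- ordered list of questions in the current template
    submissions -- one set of questions per submitted form
    add_at      -- assumed share of forms above which a new question is added
    drop_at     -- assumed share of forms below which a question is dropped
    """
    total = len(submissions)
    if total == 0:
        return list(template)
    usage = Counter(q for s in submissions for q in set(s))
    # Keep template questions that customers still use often enough.
    kept = [q for q in template if usage[q] / total > drop_at]
    # Add questions that many customers added themselves.
    added = sorted(
        q for q in usage
        if q not in template and usage[q] / total >= add_at
    )
    return kept + added


# Hypothetical usage data, for illustration only.
template = ["Site name", "Weather", "Operator licence number"]
submissions = [
    {"Site name", "Operator licence number", "Plant ID"},
    {"Site name", "Plant ID"},
    {"Site name", "Operator licence number", "Plant ID"},
    {"Site name", "Operator licence number", "Plant ID"},
    {"Site name", "Weather"},
]
revised = update_template(template, submissions)
print(revised)
```

Here "Weather" appears in only one of five forms and is dropped, while "Plant ID" appears in four of five and is added.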

Infrastructure Setup

This was a highly exciting and interesting journey, even as we worked through various challenges during ‘Topic Modelling’, i.e. selecting the most suitable model for short-text topic modelling. Despite a large dataset of almost a million titles, some categories had only 500-1000 distinct titles. Performing NLP at that scale was a daunting task, so alongside the results from GSDMM we did some manual tweaking to properly define the subcategories/titles. Once we had settled on the titles, embedding and clustering (DL) approaches were used to generate templates from the questions’ data. A feedback model was designed to detect behavioural patterns in users’ form usage and automatically apply the corresponding changes to the template.

On a final note, my entire journey, from planning and scoping through to project delivery and conclusion, gave me tremendous opportunities to go through all project phases. Learning from and engaging with clients, mentors and industry professionals was also a profound experience. This project was a great success story in which I was involved in some critical phases of project delivery, with product roadmap design and data workflows at the top of the list of deliverables.