DATA PREPARATION AND TABULATION OF THIRD CENSUS DATA ON SSI UNITS
PROJECT EXECUTION METHODOLOGY

Design, Printing, Packing and Delivery of ICR forms
Scanning
Data Capture Using Intelligent Character Recognition (ICR)
Data Validation and Tabulation
Deliverables


    The project, which covered processing of the filled-in formats and data preparation for the Third Census of Small Scale Industries, was unique in nature and needed a special execution methodology. The primary issues addressed in this project were:

    1. Designing and printing a large volume of ICR forms
    2. Packing and delivering the forms in small lots as per requirements
    3. Automated high volume form processing using ICR technology in limited time
    4. Technology to scan duplex documents with advanced features such as image enhancement, provision to take dynamic data only, provision for different recognition & trainable engines, and to handle different styles of handwriting.
    5. Software development for automating all activities such as scanning, data extraction, data validation and tabulation
    6. Data validation checks and tabulation
    7. Generation of hard copies and soft copies of the data of desired quality
    8. Foolproof, built-in security mechanisms to ensure the integrity of the data and prevent data leakage

  1. M/s CS Software Enterprise Limited (CSEL), along with its partners, executed the project as per the scope of work prescribed in the written contract with the company and the guidelines given on a day-to-day basis by the Census Cell of the Office of the Development Commissioner (Small Scale Industries), led by Shri M.V.S. Ranganadham, Director (Census). The procedures adopted during the execution of the project were as follows.

  1. Design, Printing, Packing and Delivery of ICR forms

  1. The project required printing approximately thirty lakh (3 million) ICR forms in three formats to record data on small-scale industrial units in both the registered and unregistered segments. The first milestone of the project was to design the ICR form formats to capture all the required information. At that stage it was important to anticipate the variations in the size and complexity of every data item, so that the field survey team would not face problems in entering complete information. Designing foolproof, user-friendly forms needed a coordinated effort between the design team and the officers of the DC (SSI), who provided guidance. Issues such as the requirements of the ICR technology and the thickness of the paper were also addressed at this stage.


  2. Once the form design was finalised, sample ICR forms were printed. The forms were printed by a printing press experienced in this kind of work. The nature and quality of the printing played a vital role in the project: the entire ICR process hinged on reading the form, and any deviation in printing could have caused serious problems during data capture. Sample copies of the printed forms were filled in with sample data, and the data extraction procedures and print quality were comprehensively tested before the bulk printing job was taken up.


  3. Once the forms were printed, they were checked for inadvertent mistakes, such as black patches, blots, skewing of the printed matter, etc., before they were packed and delivered to the Directorates of Industries located at all the State Capitals and Union Territories across the country.


  4. Once the printing of the forms was completed, the packing assignment was taken up. Fifty forms were packed in one bundle, and each bundle was wrapped in a waterproof polythene packet. Care was taken to pack the different types of forms (Format-I, Format-II and Format-III) in exact quantities as per the guidelines of the DC (SSI). This spared the survey team avoidable difficulties in the field.

  1. Scanning

  1. After the survey, the filled-in forms were received in packets of fifty at the SISI office, Okhla, Delhi, directly from the respective District Industries Centres in different batches. The concerned staff at the SISI office, Okhla, handed these over to CSEL under acknowledgement.


  2. After the filled-in forms were received from the SISI, the document packets containing approximately 50 forms were bundled and labelled with a Batch number and a Job number. The Job number was the date on which the filled-in packets were received, and the Batch number was the serial number of the packet among those received on that date.


  3. The data capture from the forms was a two-stage process. First, the forms were scanned with pre-set DPI settings, and the images were stored in an indexed directory created using the Batch numbers and Job numbers. The data was then extracted from the stored images using ICR.
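
The labelling and directory scheme described above can be sketched in a few lines of Python. This is only an illustration: the report does not give the exact label format, so the `YYYYMMDD` date format, the `JOB_`/`BATCH_` prefixes, the `/scans` root and the sample date are all assumptions.

```python
from datetime import date
from pathlib import PurePosixPath


def job_number(received_on: date) -> str:
    """Job number derived from the receipt date (YYYYMMDD format is an assumption)."""
    return received_on.strftime("%Y%m%d")


def batch_directory(root: str, received_on: date, batch_serial: int) -> PurePosixPath:
    """Build the directory for one packet's images.

    batch_serial is the serial number of the packet among those received
    on that date. The JOB_/BATCH_ prefixes and zero padding are illustrative.
    """
    if batch_serial < 1:
        raise ValueError("batch serial numbers start at 1")
    return (
        PurePosixPath(root)
        / f"JOB_{job_number(received_on)}"
        / f"BATCH_{batch_serial:04d}"
    )


# Arbitrary example: the seventh packet received on a made-up date.
print(batch_directory("/scans", date(2003, 1, 15), 7))
# prints /scans/JOB_20030115/BATCH_0007
```

Keying the directory on both numbers means a bundle of paper and its images can be traced to each other from the label alone.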

  1. The forms were scanned using high-speed scanners such as the Fujitsu 4099D, Kodak DS 2500 and Kodak i260 to ensure that at least 1,25,000 forms were scanned on any given day. The scanning was done using custom-developed software, which ensured that all forms were scanned at 150 dpi resolution and compressed to an image size of less than 60 KB per document.
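
The two figures above give a rough upper bound on the daily image volume. The short calculation below works it out; the 700 MB CD capacity is an assumption (the report does not state the media size), and 1 MB is taken as 1000 KB for simplicity.

```python
import math

FORMS_PER_DAY = 125_000   # "1,25,000" in Indian digit grouping, from the text
MAX_KB_PER_FORM = 60      # ceiling on compressed image size, from the text
CD_CAPACITY_MB = 700      # assumption: a typical CD-R, not stated in the text


def daily_storage_mb(forms: int = FORMS_PER_DAY,
                     kb_per_form: int = MAX_KB_PER_FORM) -> float:
    """Upper bound on one day's image volume in MB (1 MB taken as 1000 KB)."""
    return forms * kb_per_form / 1000


def cds_needed(storage_mb: float, cd_capacity_mb: int = CD_CAPACITY_MB) -> int:
    """Number of CDs needed to hold storage_mb, rounding up to whole discs."""
    return math.ceil(storage_mb / cd_capacity_mb)


print(daily_storage_mb())              # 7500.0
print(cds_needed(daily_storage_mb()))  # 11
```

At the stated minimum throughput and size ceiling, a day's scans come to at most about 7.5 GB, which under the assumed disc size is roughly eleven CDs.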

  1. The custom-built software application checked the quality of each image, automatically bound page one and page two of a document into a single image file, and checked for missing pages, if any. The scanned images were stored in directories on the hard disk using their Batch number and Job number. At the time of scanning, the application compared the number of scanned images with the count entered when the Job and Batch numbers were created. Wherever there was a mismatch, all the scanned images of that bundle were deleted automatically and the operator rescanned the bundle. At the end of each day, a backup of all the scanned images was taken on CDs.
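
The count check and the delete-and-rescan rule can be illustrated with a minimal sketch. The `Batch` class, its field names and the file names are hypothetical; the actual scanning software is not described in this much detail.

```python
from dataclasses import dataclass, field


@dataclass
class Batch:
    job_number: str
    batch_number: int
    declared_forms: int                              # count entered at batch creation
    images: list[str] = field(default_factory=list)  # one file name per scanned form


def accept_or_reset(batch: Batch) -> bool:
    """Return True if the scanned count matches the declared count.

    On a mismatch, every image of the batch is discarded so that the
    operator can rescan the whole bundle.
    """
    if len(batch.images) == batch.declared_forms:
        return True
    batch.images.clear()
    return False


# A bundle declared as 50 forms, of which only 49 were scanned.
short_batch = Batch("20030115", 7, declared_forms=50,
                    images=[f"form_{i:03d}.tif" for i in range(49)])
print(accept_or_reset(short_batch), len(short_batch.images))  # False 0
```

Discarding the whole batch rather than patching in the missing form keeps the rule simple: a stored batch is always complete.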

  1. Data Capture Using Intelligent Character Recognition (ICR)

  1. Traditionally, data capture is done through manual data entry. This age-old process is not only tedious and time-consuming but also prone to errors. In recent years, new technologies have been developed to capture data from handwritten forms and printed documents. The most significant among them are Intelligent Character Recognition (ICR) for handwritten documents and Optical Character Recognition (OCR) for data capture from printed documents.


  2. Automated data capture and forms processing, whether paper-based or electronic, is rapidly becoming an integral and necessary component in the government, insurance and financial sectors. It results in savings of 50 to 75 percent in direct costs and a significant increase in productivity in comparison with manually processed forms.


  3. In order to expedite the data capture and to improve the accuracy of the data from over 26 lakh filled-in forms, ICR technology was used. Cardiff TELEform 8.0 software was used to extract the data from the scanned images.


  4. Cardiff TELEform interprets handprint, machine print, check boxes and bar codes from scanned images. After automatically processing each form, TELEform highlights illegible and invalid entries for operator attention. Because TELEform processes the majority of the information, entry operators spend seconds verifying questionable data rather than minutes manually keying entire forms. Scanned forms are automatically identified, eliminating the need for manual sorting. After identifying a form, Cardiff Software's Tri-CR® recognition technology interprets the form's hand print (ICR), machine print (OCR), bar code and check box (OMR) fields.


  5. TELEform Reader runs in unattended mode enabling forms to be processed continuously. TELEform's automatic form identification process handles multi-page forms, identifying out of sequence and missing pages. The software also improves the quality of scanned forms by performing despeckle, half-tone removal, character smoothing and line thickening procedures.


  6. Tri-CR® leverages the strengths of multiple recognition engines to produce unprecedented accuracy for hand print and machine print data. The software engines examine the characters. Tri-CR then analyzes the results, balances the strengths of the individual engines and determines the correct interpretation of data.
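
The general idea of combining several recognition engines can be shown with a simple per-character majority vote. This is not Tri-CR's actual method, which the text does not describe; it is a generic sketch, and the function names and the `?` marker for unresolved characters are assumptions.

```python
from collections import Counter
from typing import Optional, Sequence


def vote_character(candidates: Sequence[str]) -> Optional[str]:
    """Return the character a strict majority of engines agree on, else None."""
    if not candidates:
        return None
    char, count = Counter(candidates).most_common(1)[0]
    return char if count * 2 > len(candidates) else None


def vote_field(engine_outputs: Sequence[str], unsure: str = "?") -> str:
    """Combine equal-length readings of one field, character by character.

    Positions without a majority are replaced by `unsure`, so that they
    can be routed to an operator for verification.
    """
    if len({len(s) for s in engine_outputs}) != 1:
        raise ValueError("engine outputs must have the same length")
    return "".join(vote_character(chars) or unsure
                   for chars in zip(*engine_outputs))


# Three hypothetical engines read the same numeric field.
print(vote_field(["1204", "1Z04", "12O4"]))  # 1204
```

Each engine here makes a different mistake, and the vote recovers the value because no two engines err at the same position.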


  7. TELEform Verifier highlights questionable data entries. Three verification modes displaying characters, fields and forms enable quick data correction. To work efficiently, Verifier offers functions that ensure only accurate and complete data makes it to the database. These functions include:

    • Database validations
    • User-defined dictionary look ups
    • Numeric range tests
    • Date, currency and character-specific formatting
    • "Always review" and "Entry required" field designations
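
Three of the checks listed above (entry required, numeric range and dictionary look-up) can be sketched as small composable functions. The sketch is in Python rather than in TELEform's own scripting language, and the field names, allowed codes and ranges are hypothetical.

```python
from typing import Callable, Optional

# A check returns an error message, or None if the value passes.
Check = Callable[[str], Optional[str]]


def required() -> Check:
    return lambda v: None if v.strip() else "entry required"


def in_range(lo: int, hi: int) -> Check:
    def check(v: str) -> Optional[str]:
        if not v.strip():
            return None                  # emptiness is handled by required()
        try:
            number = int(v)
        except ValueError:
            return "not a whole number"
        return None if lo <= number <= hi else f"outside {lo}-{hi}"
    return check


def in_dictionary(allowed: set[str]) -> Check:
    return lambda v: (None if not v.strip() or v.strip().upper() in allowed
                      else "not in dictionary")


def validate(record: dict[str, str],
             rules: dict[str, list[Check]]) -> dict[str, list[str]]:
    """Return {field: [messages]} for every field failing at least one check."""
    problems: dict[str, list[str]] = {}
    for name, checks in rules.items():
        messages = [m for c in checks
                    if (m := c(record.get(name, ""))) is not None]
        if messages:
            problems[name] = messages
    return problems


# Hypothetical rules and a record that fails all three.
RULES = {
    "unit_name": [required()],
    "state_code": [required(), in_dictionary({"DL", "MH"})],
    "employees": [in_range(0, 9999)],
}
print(validate({"unit_name": "", "state_code": "XX", "employees": "12000"}, RULES))
```

Fields that come back with messages are the ones an operator would be asked to review against the image.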

  8. TELEform includes a fully integrated Visual Basic programming language called BasicScript™ that allows validation requirements to be customized. Using BasicScript, arithmetic comparisons, financial calculations, calls to external applications, skip and fill logic, and other business logic routines can be incorporated.


  9. The data capture from the scanned images was carried out using Cardiff TELEform 8.0. In addition to the features described above, the software has a built-in facility to recognize the form type and interpret the data accordingly. This feature helped in getting correct output even when filled-in forms of all three formats were mixed up for various reasons. The output generated by the ICR engine was in comma-separated values (CSV) format. The data thus obtained was ported into an Oracle database for carrying out the necessary validations.
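
The step of porting CSV output into a database can be illustrated as below. The sketch uses Python's built-in `sqlite3` module purely as a stand-in for the Oracle database used in the project; the column names, the `staging` table and the sample rows are invented for the example.

```python
import csv
import io
import sqlite3

# Made-up sample of ICR output; the real column layout is not given in the text.
SAMPLE_CSV = """form_type,job_number,batch_number,unit_name,employees
1,20030115,7,ABC Engineering,12
2,20030115,7,XYZ Textiles,4
"""


def load_csv(conn: sqlite3.Connection, csv_text: str) -> int:
    """Insert every CSV row into a staging table and return the row count."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS staging ("
        "form_type TEXT, job_number TEXT, batch_number INTEGER, "
        "unit_name TEXT, employees INTEGER)"
    )
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    conn.executemany(
        "INSERT INTO staging VALUES "
        "(:form_type, :job_number, :batch_number, :unit_name, :employees)",
        rows,
    )
    conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
print(load_csv(conn, SAMPLE_CSV))  # 2
print(conn.execute(
    "SELECT form_type, COUNT(*) FROM staging GROUP BY form_type"
).fetchall())
```

Carrying the form type as a column means records from mixed-up bundles can still be separated by format once they are in the database.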

  1. The data captured through the ICR process was verified by data entry operators to ensure its correctness. The operators checked every field against the image. This process ensured that the data captured by the ICR process matched the data on the document.

  1. Data Validation and Tabulation

  1. Once the data was captured, the defined validations were applied through the software application. Nonconforming data was retrieved, checked against the corresponding image file using the custom-developed software, and corrected. The corrected data was then ported back into the database.

  1. The verified and validated data was analysed, and multipliers were generated for the various formats under the guidance of the Director (Census), DC (SSI), New Delhi. The multipliers, together with the validated data, were used to generate the required tables.
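
How multipliers feed into a table can be shown with a small weighted tabulation. The report does not say how the multipliers were derived or applied, so this sketch assumes the simplest case: one multiplier per format, with each record's value scaled by the multiplier of its format before summing. All field names and numbers are made up.

```python
from collections import defaultdict
from typing import Iterable, Mapping


def weighted_totals(records: Iterable[Mapping],
                    multipliers: Mapping[str, float],
                    group_key: str,
                    value_key: str) -> dict:
    """Sum value * multiplier per group.

    Each record is expected to carry a 'format' key naming its form type;
    the multiplier for that format scales the record's value.
    """
    totals = defaultdict(float)
    for rec in records:
        totals[rec[group_key]] += rec[value_key] * multipliers[rec["format"]]
    return dict(totals)


# Illustrative data only: a multiplier of 1.0 for a fully enumerated format
# and 2.5 for a format where each surveyed unit stands for 2.5 units.
RECORDS = [
    {"format": "I", "state": "DL", "employees": 10},
    {"format": "II", "state": "DL", "employees": 4},
    {"format": "II", "state": "MH", "employees": 6},
]
MULTIPLIERS = {"I": 1.0, "II": 2.5}

print(weighted_totals(RECORDS, MULTIPLIERS, "state", "employees"))
# prints {'DL': 20.0, 'MH': 15.0}
```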


  1. Deliverables

  1. Once the database was created, the necessary tables were generated using the custom-built software. Where specified, hard copies of these reports were printed and delivered to the DC (SSI). Soft copies of the reports and the database were also copied onto suitable media, such as CDs and DAT tapes, and handed over to the DC (SSI) for archival purposes.


