This Python-based COBOL copybook parser command accepts stdin or a filename, it reads in the copybook text file and returns CSV to stdout in the following format:
- Field Name (concatenated names based on level hierarchy)
- Data Type (‘Integer’, ‘Float’, ‘Double’ or ‘BCD’)
- Field Length
- Implied decimal position (unit-based, i.e 1st char. = 1)
Handles repeating nested field groups (layered OCCURS) and variable length records (OCCURS … DEPENDING ON). Looped fields are indented with ‘\t’ character(s), reference example output.
Keeping the copybook parser loosely coupled from the data record parser allows for post-processing of the file layout, such as piping to Unix commands for text manipulation, manual edits such as adding ‘*’ in front of field names to identify the key fields used when importing data into a relational database, etc.
Copybook Parser Source Code
Note: This code has been superceded by the pyCOBOL package, updated code is available at Sourceforge.
import copy import re import sys from optparse import * from pic import PIC """ Accepts: COPYBOOK Filename or STDIN Repeating blocks (OCCURS) & variable length records (DEPENDING ON) Returns Comma Separated Values (CSV) via STDOUT: 1. Flattened data, concatenated fields names built from level hierarchy. 2. Returns Field Name, Data Type, Length & Implied Decimal Position 3. Pythonic indentation/syntax for repeating blocks 'OCCURS x TIMES:' where x is an integer indicating number to times to repeat or ... where 'x' is a string/field name that the number of times depends on. """ class Field: def __init__(self): # 2-digit level, name, OCCURS, DEPENDING ON, PIC, COMP FIELD_PATTERN = r'^(?P<level>\d{2})\s+(?P<name>[\S-]+)' PIC_PATTERN = r'\s+PIC\s+(?P<pic>[-/+$,(0-9):A-Z*]+)' OCCURS_DEPENDS_PATTERN = r'\s+OCCURS.*DEPENDING ON (?P<occurs>[\S-]+)' OCCURS_PATTERN = r'\s+OCCURS (?P<occurs>\d+) TIMES' COMP_PATTERN = r'\s+COMP-(?P[123])' FIELD_PATTERNS = [ '%s%s%s.' % (FIELD_PATTERN, PIC_PATTERN, COMP_PATTERN), '%s%s.' % (FIELD_PATTERN, OCCURS_DEPENDS_PATTERN), '%s%s%s%s.' % (FIELD_PATTERN, OCCURS_PATTERN, PIC_PATTERN, COMP_PATTERN), '%s%s%s.' % (FIELD_PATTERN, OCCURS_PATTERN, PIC_PATTERN), '%s%s.' % (FIELD_PATTERN, OCCURS_PATTERN), '%s%s.' % (FIELD_PATTERN, PIC_PATTERN), '%s.' % (FIELD_PATTERN), ] self.FIELD_PATTERNS = [ re.compile(i) for i in FIELD_PATTERNS ] self.FIELDS = [ 'occurs', 'level', 'name', 'type', 'length', 'decimal_pos', 'pic', 'comp' ] self.picture = PIC() def parse(self, line): fields = { 'name': '', 'level': '0', 'occurs': '1', 'comp': '0' } line_num, line = line num_patterns = len(self.FIELD_PATTERNS) pattern_num = 0 match = False while (not match) and (pattern_num < num_patterns): match = self.FIELD_PATTERNS[pattern_num].match(line) if match: for key, value in match.groupdict().items(): fields[key] = value pattern_num += 1 if fields['occurs'].isdigit(): fields['occurs'] = int(fields['occurs']) for i in ['level', 'comp']: fields[i] = int(fields[i]) result = [ fields[i] for i in self.FIELDS[:3] ] if fields.has_key('pic'): result.extend(self.picture.parse(fields['pic'], fields['comp'])) return result class COPYBOOK: def __str__(self): return '\n'.join(','.join([i for i in self.fields ])) def parse_fields(self, lines): COMMENT_CHAR = '*' TERMINATE_CHAR = '.' lines = [ i.strip() for i in lines ] lines = [ i for i in lines if i ] lines = [ i for i in lines if i[0] != COMMENT_CHAR ] lines = [ i for i in enumerate(lines) ] field = Field() self.fields = [ field.parse(i) for i in lines ] def nested_field_names(self): LEVEL, NAME = 1, 2 names={} fields = [] for field in self.fields: # Levels 66 & 88 not supported if field[LEVEL] not in (66, 88): names[field[LEVEL]] = field[NAME] s = sorted(names.items()) name = [ j for i, j in s if field[LEVEL] >= i ] field[NAME] = '_'.join(name) fields.append(field) self.fields = fields def occurs_n_times(self): f = [] OCCURS, LEVEL, FIELDS = 0, 1, 1 levels = [0] for field in self.fields: if field[LEVEL] <= levels[-1]: levels.pop() if str(field[OCCURS]) != '1': print '%sOCCURS %r TIMES:' % ('\t' * (len(levels) - 1), field[OCCURS]) levels.append(field[LEVEL]) if len(field) > 3: print '%s%s' % ('\t' * (len(levels) - 1), ', '.join([ str(i) for i in field[2:] ])) def load_copybook(lines): copybook = COPYBOOK() copybook.parse_fields(lines) copybook.nested_field_names() copybook.occurs_n_times() return copybook if __name__ == '__main__': ver = "COPYBOOK Parser 0.8\nWritten by Brian Peterson." usage = "usage: copybook.py [FILE]" parser = OptionParser(usage, version=ver) (options, args) = parser.parse_args() if args: # Read data from FILE lines = open(args[0]).readlines() else: # No FILE arg found, read data from STDIN lines = sys.stdin.readlines() copybook = load_copybook(lines)
Sample Copybook File
Because of the sensitive nature of the actual data I am working with, I have chosen to substitute the following copybook example:
01 STUDENT. 20 ID PIC 9(8). 20 FIRST_NAME PIC X(32). * YYYYMMDD 20 DATE_OF_BIRTH PIC S9(8) COMP. 20 NUMOF_COURSES PIC 9(4) COMP. 20 NUMOF_BOOKS PIC 9(4) COMP. 25 COURSES OCCURS 8 TIMES DEPENDING ON NUMOF_COURSES. 30 COURSE_ID PIC 9(8). 30 COURSE_TITLE PIC X(48). 30 INSTRUCTOR_ID PIC 9(8). 30 NUMOF_ASSIGNMENTS PIC 9(4) COMP. 30 ASSIGNMENTS OCCURS 4 TIMES DEPENDING ON NUMOF_ASSIGNMENTS. 40 ASSIGNMENT_TYPE PIC x(12). 40 ASSIGNMENT_TITLE PIC X(48). * YYYYMMDD 40 DUE_DATE PIC S9(8) COMP. 40 GRADE PIC S9V9. 20 BOOKS. 25 BOOK OCCURS 1 TO 5 TIMES DEPENDING ON NUMOF_BOOKS. 30 ISBN PIC X(10). * YYYMMDD. 30 RETURN_DATE PIC 9(8) COMP.
Sample Output
STUDENT_ID, Integer, 8, 0 STUDENT_FIRST_NAME, Char, 32, 0 STUDENT_DATE_OF_BIRTH, Integer, 9, 0 STUDENT_NUMOF_COURSES, Integer, 4, 0 STUDENT_NUMOF_BOOKS, Integer, 4, 0 OCCURS 'NUMOF_COURSES' TIMES: STUDENT_NUMOF_BOOKS_COURSES_COURSE_ID, Integer, 8, 0 STUDENT_NUMOF_BOOKS_COURSES_COURSE_TITLE, Char, 48, 0 STUDENT_NUMOF_BOOKS_COURSES_INSTRUCTOR_ID, Integer, 8, 0 STUDENT_NUMOF_BOOKS_COURSES_NUMOF_ASSIGNMENTS, Integer, 4, 0 OCCURS 'NUMOF_ASSIGNMENTS' TIMES: STUDENT_NUMOF_BOOKS_COURSES_ASSIGNMENTS_ASSIGNMENT_TITLE, Char, 48, 0 STUDENT_NUMOF_BOOKS_COURSES_ASSIGNMENTS_DUE_DATE, Integer, 9, 0 STUDENT_NUMOF_BOOKS_COURSES_ASSIGNMENTS_GRADE, Float, 4, 3 OCCURS 'NUMOF_BOOKS' TIMES: STUDENT_BOOKS_BOOK_ISBN, Char, 10, 0 STUDENT_BOOKS_BOOK_RETURN_DATE, Integer, 8, 0
Leave a comment