Automated docx to text conversions

Today I received a truckload of AS/400 COBOL COPYBOOK files for a large data conversion project. Strangely, they arrived in Microsoft Office docx format. 😕

There were way too many to convert manually, so I located docx2txt at http://sourceforge.net/projects/docx2txt/, installed it, and then ran:

ls *.docx|xargs -ti ../docx2txt-1.0/docx2txt.pl {}

… Voilà!  But I’ve never used this script before, so I’d better check the results and make sure there’s no funny stuff going on, e.g. weird control characters.

hexdump -c <filename>.txt

Clean as a whistle… it sure is nice when others share their code making our lives easier… thanks for docx2txt.pl Sandeep Kumar 🙂
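
Eyeballing hexdump output works for one file, but with a whole directory of conversions a quick automated pass may be handier. The following is only a rough sketch and not part of docx2txt: count_odd_bytes is a made-up helper name, and it assumes the converted text should be plain ASCII plus tab, LF, and CR, so legitimate UTF-8 characters would also be flagged.

```shell
#!/bin/bash
# Hypothetical helper: count the bytes in a file that are neither
# printable ASCII (octal 40-176) nor tab (11), LF (12), or CR (15).
count_odd_bytes() {
    LC_ALL=C tr -d '\11\12\15\40-\176' < "$1" | wc -c | tr -d ' '
}

# Report every .txt file in the current directory that contains such bytes.
for f in *.txt; do
    [ -e "$f" ] || continue          # skip the literal pattern if nothing matches
    n=$(count_odd_bytes "$f")
    if [ "$n" -gt 0 ]; then
        echo "$f: $n suspicious byte(s)"
    fi
done
```

Any file it reports is a candidate for a closer look with hexdump -c.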

Note: Use the Perl script docx2txt.pl and not the Bash script docx2txt.sh; both are packaged in the download.  The Bash script does not handle spaces in file names well and has other limitations; the Perl script works like a champ. 

New to xargs?  Read on …

The xargs command is one of the most underutilized *nix commands. It looks a bit cryptic at first, but it is not mysterious at all…

ls *.docx|xargs -ti ../docx2txt-1.0/docx2txt.pl {}

Let’s break it down…

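ls *.docx returns the list of filenames, one per line when its output is piped.
The -i parameter in xargs runs the command once per input line, replacing {} with that filename. GNU xargs marks -i as deprecated in favor of -I {}.
The -t parameter prints each command to standard error before it is executed.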
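
For file names containing quotes or other awkward characters, a NUL-delimited pipeline is a common alternative to parsing ls output. The sketch below is an assumption-laden illustration rather than the command used above: for_each_docx is a hypothetical helper name, and it relies on the -print0, -maxdepth, and -0 extensions found in GNU and BSD find/xargs. It uses echo as a stand-in so it can be tried safely.

```shell
#!/bin/bash
# Hypothetical helper: run a command once per .docx file in a directory.
# Names are passed NUL-delimited, so spaces, quotes, and newlines survive.
for_each_docx() {
    dir=$1
    shift
    find "$dir" -maxdepth 1 -name '*.docx' -print0 | xargs -0 -I {} "$@" {}
}

# Dry run: print each file that would be converted.
for_each_docx . echo "would convert:"
```

Swapping echo "would convert:" for ../docx2txt-1.0/docx2txt.pl would run the converter on each file instead.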

xargs by example
Wikipedia – xargs
The joys of xargs


3 Responses to “Automated docx to text conversions”

  1. n00ti Says:

    Oh thank heavens! You are a saint. I have had people writing articles for me and despite all of my pleas for plain text they insist on sending me docx format!

    Even worse, all of the file names have spaces. I luckily was able to find docx2txt, but I have hundreds of files to process! The thought of doing it line by line was nauseating. I was looking for a batch process, thankfully you have supplied it! This makes me love Linux more and more. Never will I go back to Windows or join the fanatic CampMac.

    This bash script lowercases file names and replaces all of the pesky spaces with underscores:

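    #!/bin/bash
    # Lowercase file names and replace spaces with underscores,
    # recursing into subdirectories. Optional argument: directory to process.
    if [ -n "$1" ]
    then
        if [ -d "$1" ]
        then
            cd "$1" || exit 1
        else
            echo "invalid directory" >&2
            exit 1
        fi
    fi

    for i in *
    do
        OLDNAME="$i"
        NEWNAME=$(echo "$i" | tr ' ' '_' | tr 'A-Z' 'a-z' | sed 's/_-_/-/g')
        if [ "$NEWNAME" != "$OLDNAME" ]
        then
            # Two-step rename, presumably so case-only changes also work
            # on case-insensitive filesystems.
            TMPNAME="${i}_TMP"
            echo ""
            mv -v -- "$OLDNAME" "$TMPNAME"
            mv -v -- "$TMPNAME" "$NEWNAME"
        fi
        if [ -d "$NEWNAME" ]
        then
            echo "Recursing lowercase for directory $NEWNAME"
            # After the cd above, $0 must be an absolute path or on PATH.
            "$0" "$NEWNAME"
        fi
    done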

    ————–

    Cheers mate! Mucho gracias!

  2. Alejandro Amo Says:

    nice tool. I just integrated it with a second bash script to do a mass conversion of docx files inside a folder and that saved my life 🙂
