Today I had a truckload of AS/400 COBOL COPYBOOK files for a large data conversion project. Strangely, I received the COPYBOOK files in Microsoft Office docx format.
There were way too many to convert manually, so I located docx2txt at http://sourceforge.net/projects/docx2txt/, installed it, and then ran:
ls *.docx|xargs -ti ../docx2txt-1.0/docx2txt.pl {}
… Voilà! But I’ve never used this script before, so I’d better check the results and make sure there’s no funny stuff going on, i.e. weird control chars, etc.
hexdump -c <filename>.txt
Clean as a whistle… it sure is nice when others share their code and make our lives easier… thanks for docx2txt.pl, Sandeep Kumar.
Note: Use the Perl script docx2txt.pl and not the Bash script docx2txt.sh; both are packaged in the download. The Bash script does not handle spaces in file names well and has other limitations; the Perl script works like a champ.
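Eyeballing hexdump output gets old quickly when there are many files, so the same check can be automated. The following is only a sketch: has_odd_bytes is a made-up helper name, the sample copybook lines are invented, and the test is deliberately strict, so it would also flag any non-ASCII bytes that a converted file legitimately contains.

```shell
#!/bin/bash
# has_odd_bytes FILE (hypothetical helper): succeeds if FILE contains any
# byte that is neither printable ASCII nor ordinary whitespace.
# LC_ALL=C makes the character classes byte-oriented.
has_odd_bytes() {
    LC_ALL=C grep -q '[^[:print:][:space:]]' "$1"
}

# Invented sample content, just so the sketch runs on its own.
sample=$(mktemp)
printf '01 CUSTOMER-REC.\n   05 CUST-ID PIC 9(5).\n' > "$sample"

if has_odd_bytes "$sample"; then
    echo "suspicious bytes found in $sample"
else
    echo "clean"
fi
rm -f "$sample"
```

To scan a whole directory of converted files, the function could be called in a loop such as for f in *.txt; do has_odd_bytes "$f" && echo "$f"; done.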
New to xargs? Read on …
The xargs command is one of the most underutilized *nix commands. It looks a bit cryptic at first, but it is not mysterious at all…
ls *.docx|xargs -ti ../docx2txt-1.0/docx2txt.pl {}
Let’s break it down…
ls *.docx obviously returns a list of filenames.
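The -i parameter tells xargs to run the command once per input line, replacing {} with that line (here, a filename). Current GNU xargs documents -i as deprecated in favor of the equivalent -I {}.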
The -t parameter shows each command before it is executed.
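One caveat with piping ls into xargs: xargs gives quotes and backslashes in its input special treatment, so unusual filenames can trip it up. A more defensive variant, sketched below, passes NUL-terminated names instead. It assumes a find that supports -print0 and an xargs that supports -0 (both are available in the GNU and BSD tools). The echo command is only a stand-in for ../docx2txt-1.0/docx2txt.pl, and the sample filenames are invented.

```shell
#!/bin/bash
# Sketch only: "echo converting" is a stand-in for ../docx2txt-1.0/docx2txt.pl,
# and the two sample file names are made up for the demonstration.
workdir=$(mktemp -d)
touch "$workdir/plain.docx" "$workdir/with space.docx"

# find -print0 and xargs -0 pass NUL-terminated names, so spaces, quotes
# and backslashes in file names reach the command untouched.
# -I {} replaces {} with each name; -t echoes each command to stderr first.
converted=$(find "$workdir" -name '*.docx' -print0 |
    xargs -0 -t -I {} echo converting {})

printf '%s\n' "$converted"
rm -rf "$workdir"
```

Swapping the echo stand-in for the real script path gives the same batch conversion as the one-liner above, one file per invocation.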
xargs by example
Wikipedia – xargs
The joys of xargs
June 23, 2010 at 06:00 |
Oh thank heavens! You are a saint. I have had people writing articles for me and despite all of my pleas for plain text they insist on sending me docx format!
Even worse, all of the file names have spaces. Luckily I was able to find docx2txt, but I have hundreds of files to process! The thought of doing it line by line was nauseating. I was looking for a batch process, and thankfully you have supplied it! This makes me love Linux more and more. Never will I go back to Windows or join the fanatic CampMac.
This Bash script nicely removes the pesky caps and spaces from file names, lowercasing the letters and replacing the spaces with underscores:
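#!/bin/bash
# Lowercase file names and replace spaces with underscores,
# recursing into subdirectories.
# Usage: pass a directory as $1, or run with no argument for the current one.

# Remember the absolute path of this script before any cd,
# so the recursive call below still finds it.
SCRIPT="$(cd "$(dirname "$0")" && pwd)/$(basename "$0")"

if [ -n "$1" ]
then
    if [ -d "$1" ]
    then
        cd "$1" || exit 1
    else
        echo "invalid directory" >&2
        exit 1
    fi
fi
for i in *
do
    OLDNAME="$i"
    # spaces -> underscores, upper -> lower case, then "_-_" -> "-"
    NEWNAME=$(printf '%s\n' "$i" | tr ' ' '_' | tr 'A-Z' 'a-z' | sed 's/_-_/-/g')
    if [ "$NEWNAME" != "$OLDNAME" ]
    then
        # Rename in two steps through a temporary name.
        TMPNAME="${i}_TMP"
        echo ""
        mv -v -- "$OLDNAME" "$TMPNAME"
        mv -v -- "$TMPNAME" "$NEWNAME"
    fi
    if [ -d "$NEWNAME" ]
    then
        echo "Recursing lowercase for directory $NEWNAME"
        "$SCRIPT" "$NEWNAME"
    fi
done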
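Before letting a renaming script loose on hundreds of files, it can help to preview what the name transformation does. The sketch below pulls the same tr and sed pipeline into a function. The function name clean_name and the sample filenames are made up for illustration, and nothing is renamed on disk.

```shell
#!/bin/bash
# clean_name NAME (hypothetical helper): print NAME with spaces turned into
# underscores, letters lowercased, and "_-_" collapsed to "-".
# This only prints the result; it does not touch any files.
clean_name() {
    printf '%s\n' "$1" | tr ' ' '_' | tr 'A-Z' 'a-z' | sed 's/_-_/-/g'
}

# Invented sample names for the preview.
for name in "My File - Draft.DOCX" "COPYBOOK 01.docx"; do
    printf '%s -> %s\n' "$name" "$(clean_name "$name")"
done
```

For example, "My File - Draft.DOCX" comes out as my_file-draft.docx.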
--------------
Cheers mate! Muchas gracias!
June 23, 2010 at 08:20 |
I am happy to hear that you found this post useful, and thank you for posting your Bash code for the filename cleanup script.