Removing special characters,spaces,nulls in multiple columns

Post questions here relative to DataStage Enterprise/PX Edition for such areas as Parallel job design, Parallel datasets, BuildOps, Wrappers, etc.

Moderators: chulett, rschirm, roy

developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Removing special characters,spaces,nulls in multiple columns

Post by developeretl »

Hi,

I have flat files , transformed and moved to staging.
As output files are moved from datastage server to staging folder. when we are trying to load the data we are seeing special characters.
For this we have used pad char environment variable and trim d but we have atleast 10 columns .
My question is that is there any simple way to remove the special characters in datastage using sed or awk or shell script.
kpsita
Participant
Posts: 99
Joined: Tue Jul 21, 2009 11:43 pm

Post by kpsita »

You can remove special characters in datastage using Convert function.
KPSITA
developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Post by developeretl »

Can you please tell me whether I can use it for mutiple columns. I have atleast 20 coloumns and it is time consuming for me to change in each job. Is there any way to make it global or any unix command to be used.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

There are really many ways to do this, but I am not certain that convert would be the best method. You could -
  • Write a C routine to remove unprintable characters
    Write a sed/awk command that will do the same and invoke via a sequencer or during sourcing when using a sequential file
    You could import your sequential file into one column by using a delimiter not present in any row and then operate on that one column and then write it out using the correct metadata.
I have used all of these methods and it really depends on what you require. Is it just 1 or 2 occurrences of an unprintable character or is it random and you need to check for all unprintable characters.
developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Post by developeretl »

All the columns have special characters.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Right, but is it 1 or 2 special characters that keep showing up or are they random?
developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Post by developeretl »

Can you please provide me any sed or awk command?
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Maybe something like -

Code: Select all

 sed "s/[^a-z|0-9]//g;" file1 > file2
or the tr command will work, awk as well, but you will likely have to call a sub process

Again, there are a lot of ways to do this - this is just one of many.
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What makes them "special"? There are NO special characters. Every character is a character.
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Post by developeretl »

For me the special characters are # ,@,!,%,&,* . I'm seeing this data in the file
ray.wurlod
Participant
Posts: 54607
Joined: Wed Oct 23, 2002 10:52 pm
Location: Sydney, Australia
Contact:

Post by ray.wurlod »

What's special about them? Nothing. You use @ every time you use an email address!
IBM Software Services Group
Any contribution to this forum is my own opinion and does not necessarily reflect any position that IBM may hold.
developeretl
Participant
Posts: 89
Joined: Sat Jul 24, 2010 11:33 pm

Post by developeretl »

Hi Ray,
Iam not experienced like to discuss more on that. Iam sorry if the terms used by me caused any concerns or my question is confusing. Iam new to the datawarehousing so i do not know what terms to be used in this forum so that it wont raises more questions while answering it. Please let me know if you have any list so that I should not follow in this forum.
mhester
Participant
Posts: 622
Joined: Tue Mar 04, 2003 5:26 am
Location: Phoenix, AZ
Contact:

Post by mhester »

Well, characters like that have been called "special" since I started working in IT back in 1984 so I know what you mean and I doubt there is anyone who frequents this forum that did not understand what you meant.

You can still use the sed command I gave above and that will work with some tweaks. If you do a google search on removing/replacing "special" or unprintable characters there will be pages of examples which remove printable as well as unprintable.

The characters you want to remove are special

:)
chulett
Charter Member
Charter Member
Posts: 43085
Joined: Tue Nov 12, 2002 4:34 pm
Location: Denver, CO

Post by chulett »

I think that falls into the category of "pet peeve". :wink:
-craig

"You can never have too many knives" -- Logan Nine Fingers
DSguru2B
Charter Member
Charter Member
Posts: 6854
Joined: Wed Feb 09, 2005 3:44 pm
Location: Houston, TX

Post by DSguru2B »

mhester wrote:Well, characters like that have been called "special" since I started working in IT back in 1984
1984 huh. I was still carrying around a sippi cup.

So many special character, that too in every column is weird. I havent been in IT for as long as Mike but even I can say that you need to investigate the origin of this data.
Nevertheless, you can use the CONVERT() function to remove these characters.
In some fields these characters maybe ok, like an email address or comment fields so use a scalpel approach rather than a machete.
Creativity is allowing yourself to make mistakes. Art is knowing which ones to keep.
Post Reply