Help with scripting...no, not TF2

Created 29th May 2010 @ 14:59


Grem

rEJ
TG

I’ve got a bit of a data-related problem. Hear me out. For my final year project in uni, I have to align strings and strings of DNA sequences in a piece of software (Mesquite). These DNA sequences are publicly available (GenBank) and can be copied and pasted into a text file. I have been doing this for the last three weeks for ferns (that is what my project is on). Basically I focused on one order of ferns and a few families within it. I ended up with 150+ text files for different genera, with a lot of GATC characters. Here is an example for one genus and one gene region:

http://dl.dropbox.com/u/366160/Genbank/Dryopteridaceae/Dryopteris/atpB.txt

Now here is my problem. For Mesquite to open these text files, they need to be formatted correctly. So basically what I am looking for is something (script, program etc.) to parse all that text and convert it into something Mesquite will accept. I want the text file I linked above to look like this (I’ve done the first one as an example):

http://pastebin.com/hGFKWrnU

As you can see, I have gotten rid of some of the text, removed the space after ‘chloroplast’, and all the A, G, C, Ts are one continuous string. I want something to do this en masse so that I don’t have to sift through text files and do this manually. Anyone know anything that can do it? I can post exactly which bits I want taken out, or if anyone is super adventurous I can give you the folder that houses all my text files. Post here or add me on Steam, and we can talk further. Will pay if the help warrants it :)
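For the "en masse" part, something along these lines might do as a starting point. It is only a rough Python sketch, not anything posted in the thread: it assumes every record starts with a line beginning with `>` (the usual GenBank FASTA layout), and the names `split_records`, `join_sequence` and `convert_folder` are made up for illustration. It only joins each sequence into one continuous string and writes the result to a separate folder, so the originals are left untouched; header clean-up would be plugged in through the `transform` argument.

```python
from pathlib import Path


def split_records(text):
    """Split file text into records; a new record starts at each '>' line (assumed)."""
    records, current = [], []
    for line in text.splitlines():
        if line.startswith(">") and current:
            records.append(current)
            current = []
        if line.strip():  # skip blank lines between records
            current.append(line)
    if current:
        records.append(current)
    return records


def join_sequence(record):
    """Keep the header line, join the sequence lines and drop all whitespace."""
    header, *seq_lines = record
    sequence = "".join("".join(seq_lines).split())
    return header.rstrip() + "\n" + sequence


def convert_folder(src, dst, transform=join_sequence):
    """Convert every .txt file under src into dst, mirroring the folder layout."""
    src, dst = Path(src), Path(dst)
    written = []
    for path in sorted(src.rglob("*.txt")):
        records = split_records(path.read_text())
        out = dst / path.relative_to(src)
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text("\n".join(transform(r) for r in records) + "\n")
        written.append(out)
    return written
```

Calling `convert_folder("Genbank", "Genbank_mesquite")` would then walk the whole tree in one go (both folder names are placeholders). Keep the output folder outside the input folder so converted files are not picked up again on a second run.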

Archy

guru
G-Yoda

I did something like that at uni, way back when we were playing around with C. Basically the assignment was to make programs that would, for example, open a .txt file, read it and print out some information about it, or change something in it, or make a new file with the changes saved in it without touching the original file. So if you get no response, think of me as your last resort: add me on Steam if needed and I’ll see what I can do.

I could try using my C skills (non-existent) to create something like that.
Could you make a before-after text file, and a list of what you have modified?

octochris

(0v0)

Could possibly do it with my lacking regex skills if you give me a before and after.

Quoted from octochris

Could possibly do it with my lacking regex skills if you give me a before and after.

Back in the line, bitch.

octochris

(0v0)

Quoted from dotfloat™

[…]

Back in the line, bitch.

:<

Skyride

DUCS

What OS are you on grem?

If you’re using linux, that just made it a lot easier. :)


Last edited by Skyride,

Grem

rEJ
TG

Before and after, eh. Well, as you can see from my Dropbox link, that’s the before version: what I copied directly from GenBank, unformatted, with all the details. The pastebin link is the first sequence in the Dropbox link, edited and formatted, so that is the after.

The changes I have made are:
-taken off the ‘gi’ number
-taken off the ‘|’
-shortened the Latin name to ‘D aemula’ from ‘Dryopteris aemula’
-removed ‘ATP synthase beta chain’
-removed ‘gene, partial cds; chloroplast’
-removed the 1-character space after ‘chloroplast’
-added a line break (return key) after ‘chloroplast’, followed by the actual sequence with all spaces removed so that it is continuous.
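The list above maps fairly directly onto a handful of string operations. Here is a minimal Python sketch of that mapping. It is not the code anyone in the thread wrote, and it assumes a header shaped like `>gi|<number>|gb|<accession>| Genus species ...`. The sample header in the usage note uses made-up placeholder numbers, and `clean_header` and `clean_record` are illustrative names. Whether the accession should stay in the header is also an assumption; the sketch keeps it.

```python
import re

# The two phrases listed for removal; other gene regions will need their own.
DROP_PHRASES = ("ATP synthase beta chain", "gene, partial cds; chloroplast")


def clean_header(header, drop_phrases=DROP_PHRASES):
    """Apply the listed header edits to one '>' line."""
    text = header.lstrip(">").strip()
    # Take off the 'gi' number (assumed to look like 'gi|<digits>|').
    text = re.sub(r"^gi\|\d+\|", "", text)
    # Take off every remaining '|'.
    text = text.replace("|", " ")
    # Remove the descriptive phrases.
    for phrase in drop_phrases:
        text = text.replace(phrase, "")
    # Shorten 'Genus species' to 'G species' (first match only).
    text = re.sub(r"\b([A-Z])[a-z]+ ([a-z]+)", r"\1 \2", text, count=1)
    # Collapse leftover runs of spaces, including the one at the end.
    return ">" + " ".join(text.split())


def clean_record(record):
    """Return the cleaned header, a line break, then the sequence with no spaces."""
    header, *seq_lines = record.strip().splitlines()
    sequence = "".join("".join(seq_lines).split())
    return clean_header(header) + "\n" + sequence
```

With a placeholder record such as `">gi|0000000|gb|XX000000.1| Dryopteris aemula ATP synthase beta chain (atpB) gene, partial cds; chloroplast \nGATC GATC\nTTAA"`, `clean_record` returns `>gb XX000000.1 D aemula (atpB)` on the first line and `GATCGATCTTAA` on the second.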

Gimme a shout if you need any more info!


Last edited by Grem,

Grem

rEJ
TG

Quoted from Skyride

What OS are you on grem?

If you’re using linux, that just made it a lot easier. :)

win7/xp/mac os x

EDIT: nvm


Last edited by dotfloat™,

I’m nearly done with some horrible, broken code, but first, a match!

http://xkcd.com/208/

o

Excal
UC#3

This what you’re after?

I have a quick hacky Java solution, which should work on any of your aforementioned platforms ;) Out of respect for dotfloat, I’ll hold off posting it for now, given that he called it first.


Last edited by o,

Quoted from o

This what you’re after?

I have a quick hacky Java solution, which should work on any of your aforementioned platforms ;) Out of respect for dotfloat, I’ll hold off posting it for now, given that he called it first.

Just post, it was a joke. :P

Grem

rEJ
TG

Quoted from o

This what you’re after?

I have a quick hacky Java solution, which should work on any of your aforementioned platforms ;) Out of respect for dotfloat, I’ll hold off posting it for now, given that he called it first.

That looks really good mate! Pretty much what I want. Will that work with this text file:

http://dl.dropbox.com/u/366160/Genbank/Dryopteridaceae/Elaphoglossum/rbcL.txt

It’s a different plant and a different gene region, so some of the text strings and lengths might be different.
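One way to cope with the description text changing from gene to gene is to pull out only the parts to keep, rather than deleting fixed phrases. The Python sketch below does that for the shortened name. It is a guess at an approach, not the Java solution mentioned above, and it assumes the Latin name is the first "Capitalised lowercase" word pair after the optional `gi|...|` and database fields. `HEADER` and `short_name` are illustrative names.

```python
import re

# Assumed header shape: '>' + optional 'gi|<digits>|' + optional '<db>|<accession>|'
# + 'Genus species' + anything else (the gene description, which is ignored).
HEADER = re.compile(
    r"^>(?:gi\|\d+\|)?"                  # optional 'gi' number
    r"(?:\w+\|(?P<acc>[^|]+)\|)?"        # optional database tag and accession
    r"\s*(?P<genus>[A-Z][a-z]+)\s+(?P<species>[a-z-]+)"
)


def short_name(header):
    """Return 'G species' for a header line, or None if it does not fit."""
    match = HEADER.match(header)
    if match is None:
        return None  # caller can flag this header for a manual look
    return f"{match['genus'][0]} {match['species']}"
```

Headers that do not fit the assumed shape come back as `None`, so odd records get flagged for a manual check instead of being silently mangled.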
