ADDITIONAL RESOURCES
SWC Python lessons online: http://swcarpentry.github.io/python-novice-inflammation/
SWC Shell lessons online: http://swcarpentry.github.io/shell-novice/
SWC Git lessons online: http://swcarpentry.github.io/git-novice/
Becca's semester-long programming course:
http://ccbbatut.github.io/Biocomputing_Spring2016/
Amazing book for beginning to program: http://practicalcomputing.org/
- Currently this book is not available at the UCSF Library but I will put in a purchase order for it! - Ariel
Other programming resources: http://ccbbatut.github.io/Biocomputing_Spring2016/resources/
Becca says: One of the best things about writing scripts for your analyses is that if your data file changes (maybe there was an error during input), it takes almost no time to rerun your analyses. This is also great if a reviewer asks you to transform your data before running your analyses: once you change your input file, redoing everything is quick.
-------------------------------
Potential workflow:
DATA ---(Excel/bash)---> datafile ---(python/R script)---> cleaned data ---(python/R script)---> analyzed data
all of these can be run in a bash script ("the glue")
example bash script
# run concatenation and cleaning script on data files, write output to a csv file
python concat_and_clean_files.py data*.csv > clean_data.csv
# run analysis on clean data file, write output to new text file
python analyze_data.py clean_data.csv > analyzed_data.txt
more complicated bash script:
# run concatenation and cleaning script on data files, write output to a csv file
- python concat_and_clean_files.py data*.csv > clean_data.csv
- # cycle through 3 param1 options
- for param1 in 4 5 10
- do
- # cycle through 3 param2 options
- for param2 in 1 100 10000
- do
- # run analysis using sys.argv to read parameter permutations into python
- # write the output to files with names that reflect each parameter permutation
- # braces are needed: without them bash reads $param1_ as a variable named param1_
- python analyze_data.py clean_data.csv $param1 $param2 > analyzed_data_${param1}_${param2}.txt
- done
- done
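The bash loop above hands the data file and two parameters to analyze_data.py on the command line. A minimal sketch of how such a script might read them is below. The function name parse_args and the conversion to float are assumptions made for illustration, not the workshop's actual script.

```python
import sys  # sys.argv holds the command-line arguments of a running script

def parse_args(argv):
    """Split a command line such as
    ['analyze_data.py', 'clean_data.csv', '4', '100']
    into (datafile, param1, param2)."""
    if len(argv) != 4:
        raise ValueError("usage: analyze_data.py <datafile> <param1> <param2>")
    datafile = argv[1]          # argv[0] is the script name itself
    param1 = float(argv[2])     # command-line arguments always arrive as strings
    param2 = float(argv[3])
    return datafile, param1, param2

# In a real script you would call parse_args(sys.argv); a fixed list is
# used here so the example runs on its own.
example = ["analyze_data.py", "clean_data.csv", "4", "100"]
print(parse_args(example))
```

Keeping the parsing in its own function makes it easy to try out in a notebook before running the whole bash loop.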
- Welcome to Software Carpentry
- This is the pad for the 2016-04-29 Software Carpentry Workshop at UCSF.
- The website for the workshop can be found at <https://bsmith89.github.io/2016-04-29-ucsf>.
- We will use this Etherpad to share links and snippets of code, take notes, ask and answer questions, and whatever else comes to mind.
- The page displays a screen with three major parts:
- * The left side holds today's notes: please edit these as we go along.
- * The top right side shows the names of users who are logged in: please add your name and pick the color that best reflects your mood and personality.
- * The bottom right is a real time chat window for asking questions of the instructor and your fellow learners.
- To start, please add yourself to the attendee list below:
- - *Byron Smith (microbial ecology, University of Michigan)
- - *Becca Tarvin (Evolutionary Biology, University of Texas at Austin)
- - Alex Williams (Gladstone / UCSF, SF)
- - Manasi Mayekar (UCSF)
- -Juhi Ojha(UCSF)
- - Ariel Deardorff (UCSF Library)
- - Diego Castaneda, (Psychiatry, UCSF) : )!
- Harpreet Zoglauer ( Social Sciences,UCB)
- - Beverly Piggott (Neuroscience, UCSF)
- - Kathleen Cho (Neuroscience, UCSF)
- - Louisa Holmes (Center for Tobacco Control, UCSF)
- -Jessica Nielson (Neurosurgery, UCSF)
- -An Nguyen (Developmental Biology, UCSF)
- - Regina Lutz (Cell Biology, UCSF)
- - Victoria Wang (UCSF)
- Ajit Shah (UCSF)
- Youjin Lee (Medicine,UCSF)
- Selim Boudoukha (Biochemistry, UCSF)
- - Marin Vujic (Dermatology, UCSF)
- Zhongsheng Yu (Biochem&Biophy, UCSF)
- Daniel Linnen (UCSF, PhD Student)
- Elena Minones-Moyano (Neurology, UCSF)
- Serah Choi (Radiation Oncology, UCSF)
- Hsin Chen (immunology, UCSF)
- Dang Dao (Ophthalmology, UCSF)
- Sierra Niblett (Neurology, UCSF)
- Phillip Dumesic (Biochemistry, UCSF)
- Stella Tran (Diabetes Center, UCSF)
- Swetha Mohan (Neurology, UCSF)
- Vivi Tolani (Surgery, UCSF)
- Carolina Alquézar (UCSF, Neurology)
- Simon Wang (Human Genetics, UCSF)
- Sasha Skinner (Neurology, UCSF)
- Nirupama Krishnamurthi (Medicine, UCSF)
- Rami JAAFAR (Diabetes Center)
- Vijay Natarajan
- kai Zhao(lab medicine,UCSF)
- Johannes Thrul (Center for Tobacco Control, UCSF)
- Tracy Chow (Biochemistry)
- En Cai (Pathology)
- Chenling Xiong(BTS, UCSF)
- Shyam Srivats (Medicine, UCSF)
- (discipline, institution)
- (* denotes an instructor)
- Users are expected to follow our code of conduct: http://software-carpentry.org/conduct.html
- All content is publicly available under the Creative Commons Attribution License: https://creativecommons.org/licenses/by/4.0/
- Installation Questions:
-
- Hi Guys! I tried to run the test scripts here: http://bsmith89.github.io/2016-04-29-ucsf/setup/index.html but when I downloaded the files and typed the command into my terminal it said "no such file or directory." Here is the whole message: "LIB-142FVH6-LT:~ arieldeardorff$ python swc-installation-test-1.py python: can't open file 'swc-installation-test-1.py': [Errno 2] No such file or directory." Does the file need to be saved somewhere in particular in order to be run? Thanks!
- Hi Ariel - are you in the folder with the python scripts when you typed the command? If not, you need to cd (e.g. cd ~/Downloads/) into it first or type the full file path (python ~/Downloads/swc-installation-test-1.py). Not sure if this is the error or not, let me know if you get it to work. -B
- That was it! Thanks Becca! I forgot about cd'ing my way into the correct directory :)
- Log into etherpad: http://pad.software-carpentry.org/2016-04-29-ucsf-room1
- Shell (Day 1)
- Please open up a tab with <https://b.socrative.com/login/student/> and type in "SMITHSWC" where it asks for the Room Name.
- Download the files here: http://swcarpentry.github.io/shell-novice/shell-novice-data.zip and unpack them on your Desktop
- whoami program that prints username
- PS1="$ "
- pwd = print working directory; shows the full path of the current directory, starting from the root
- cd = change directory; with no argument it goes to the home directory, otherwise it goes to the specified directory
- ls = lists everything in the current directory (ls -F marks which entries are directories)
- ls /Users/us/Desktop look somewhere else
- command is a single word no space
- space tells computer end of the command
- / used to separate directories
- control L clears the screen (the prompt moves to the top)
- cd /Users/us/Desktop/data-shell is absolute path
- cd Desktop/data-shell is a relative path (no forward slash)
- ls -F: Identifies directories with a / at the end
- cd .. moves up one level, to the parent directory
- cd ../.. moves up two levels
- . is the current directory (so cd . stays where you are)
- ls -F -a lists all files, including hidden ones
- cd ~ goes back to the home directory
- cd ~/directory goes to a directory inside the home directory (no space after the ~)
- tab will autocomplete (short-cut)
- mkdir name will create a new directory
- nano draft.txt will open a text editor
- Control O will save text file
- Enter
- Control X to exit the text editor
- cat to look at content of file
- less to look at a file one screen at a time
- Q to exit less view
- head -n 5 aldrin.pdb - will show first 5 lines of aldrin.pdb
- rm draft.txt to remove file (this is irreversible)
- rmdir thesis to remove directory but only if the directory is empty
- rm -r thesis to remove a directory even if it is not empty (-r means recursive; rmdir itself has no -r option)
- control c cancels the current command (kind of like the "Escape" key on Windows / Mac)
- cp quotes/shakespeare.txt . (copy file to current directory)
- cp file1 file2 directory/
- mv shakespeare.txt quotes/thebard.txt (move to directory quotes and rename)
- mv file directory moves file to directory
- mv can also be used to rename a file, e.g., mv quotes/shakespeare.txt quotes/thebard.txt will rename the txt file
- Usually, if there is no output, then the command worked.
- If there is an error, the error will display.
- Ctrl+c at the prompt discards the line you have typed without running it.
- With a spanish keyboard and the Spanish (ISO) layout I get the ~ character with alt + ñ.
- less aldrin.pdb (show what will fit on the screen) Q to exit
- head -n 5 aldrin.pdb (allows one to see top of the few lines of the file)
- wc aldrin.pdb (word count: lines, words, characters)
- wc -l will just show number of lines
- * is a wildcard: wc -l *.pdb shows the number of lines in each .pdb file in the directory
- wc -l *.pdb (show line count for all .pdb files)
- man command (manual)
- Q or control C to get out
- wc -l *.pdb > lines.txt (write the line counts to lines.txt instead of the screen)
- wc --help -this will print the help info for the wc command. can replace wc with another command to get that command's help info.
- wc -l *.pdb > lines.txt -this will print the number of lines for all the files in a text file named "lines.txt" instead of printing in the window. If lines.txt exists already, this will overwrite that file. Helpful to check with tab to see if it exists
- sort -n lines.txt "n" flag indicates numerical ordering
- sort -n lines.txt > sorted_lines.txt write to a sorted file
- wc --help works for git-bash instead of man wc
- have a hanging line error (no $ prompt and no response from shell)? hit Ctrl+c to reset
- wc -l *.pdb | sort -n | head -n 1 pipes the commands together: they run one after the other, and each command uses the output of the previous one as its input. The -n 1 flag on head lists 1 line (can change to 2, 3, etc.). tail is the opposite of head: it gives you the end of the file.
- piping
- wc -l *.pdb | sort -n | tail -n 2 |head -n 1
- sort salmon.txt | uniq > unique_salmon.txt
- uniq salmon.txt // uniq works by lines - collapses adjacent lines if they are identical
- sort salmon.txt | uniq // if you sort first alphabetically it will collapse entire categories
- sort salmon.txt | uniq > unique_salmon.txt
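Why sorting first matters: uniq only compares each line with the one directly before it. A rough Python sketch of that behaviour is below. The salmon names are made up and stand in for the real salmon.txt, and the Python function uniq is only an imitation of the shell command.

```python
from itertools import groupby

def uniq(lines):
    """Collapse runs of identical adjacent lines, like the shell's uniq."""
    return [line for line, _run in groupby(lines)]

# made-up data standing in for salmon.txt
lines = ["coho", "coho", "steelhead", "coho", "steelhead", "steelhead"]

print(uniq(lines))           # duplicates that are not adjacent survive
print(uniq(sorted(lines)))   # sorting first puts duplicates next to each other
```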
- loop
- for filename in basilisk.dat unicorn.dat
- > do
- > head -n 3 $filename
- > done
- for filename in *.dat // will loop over all .dat files in directory
- > do
- > head -n 3 $filename
- > done
Things you learned (maybe):
- LS
- What is up with "ls -F"? (this flags the items that are folders with a /)
- How about "ls -a?" (shows hidden files)
- Q: why does "ls" work but "LS" doesn't? Answer: the shell is case-sensitive (and ornery and annoying).
- If you had a computer in like 1993: it's the same as "dir" on old DOS.
- PWD
- "print where we are" ("print working directory"? "present working directory?")
- CD
- "change directory" — move where we are.
- MKDIR
- RM
- Note: please don't accidentally remove all your files! (don't use -r as this means recursive)
- MV ("move a file")
- CAT ("catastrophically print an entire text file onto my terminal, causing it to scroll many pages probably." Actually short for "concatenate")
- wc
- sort
- Note: sort doesn't normally understand numbers! 2 comes after 111.
- head (and its opposite, "tail")
- less (this is like a really basic text viewer. Press 'q' to exit it)
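The note about sort not understanding numbers can be reproduced in Python, where sorting strings and sorting numbers behave the same two ways. This is only an analogy to sort versus sort -n, using made-up values:

```python
counts = ["2", "111", "30"]

print(sorted(counts))            # text order, like plain sort: "111" comes before "2"
print(sorted(counts, key=int))   # numeric order, like sort -n
```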
- Python (Day 1 afternoon)
- Find materials for today's lesson here https://github.com/rdtarvin/swc_UCSF_python
- Download the directory with data and scripts to your computer. Here's how:
- Way #1 (requires that GIT be installed):
- OR Way #2:
unzip file
go to directory with swc_UCSF_python-master
ipython notebook RDT_notebook.ipynb // opens notebook in browser
Shift-enter
In notebook
import numpy
numpy.loadtxt(fname='data/inflammation-01.csv', delimiter=',') // imports dataset
variable types
number=5.5 # float
number2=4 # integer
word="AGTC" # string
word[0] # gives first letter; Python starts counting at zero
word[0:2] # gives first two letters (letter 0 to 2, excluding 2)
word[-1] # gives last item
len(word) # gives length of word
indexing
Bracket notation
word[1] # Gives you second letter of word
# Python starts counting at 0
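A short runnable recap of the indexing and slicing rules above, using the same word variable from the notes:

```python
word = "AGTC"

print(word[0])      # first letter
print(word[1])      # second letter
print(word[0:2])    # letters at positions 0 and 1 (position 2 is excluded)
print(word[-1])     # last letter
print(len(word))    # number of letters
```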
data[row,column]
data.shape --> number of rows and columns
data.mean() --> mean of the whole data set
data[1].max() --> max value of row 1 (a single index selects a row)
data[:,20] --> all data in column 20
data[:,20].mean() --> mean of all numbers in column 20
axis=0 --> axis running vertically downward across rows
axis=1 --> axis running horizontally across columns
data.mean(axis=0) --> mean of each column, meaning we average down the rows
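The axis argument is easiest to see on a tiny array. The sketch below uses a made-up 2 x 3 array in place of the inflammation data, so the numbers can be checked by hand:

```python
import numpy

# a made-up 2 x 3 array standing in for the inflammation data
data = numpy.array([[1.0, 2.0, 3.0],
                    [5.0, 6.0, 7.0]])

print(data.shape)           # (rows, columns)
print(data.mean())          # one number for the whole array
print(data.mean(axis=0))    # one mean per column (averaging down the rows)
print(data.mean(axis=1))    # one mean per row (averaging across the columns)
print(data[:, 1].max())     # largest value in column 1
```

With the real data, axis=0 gives one value per day (column) and axis=1 gives one value per patient (row), assuming the usual layout of the inflammation files where each row is a patient.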
plotting
%matplotlib inline
import matplotlib.pyplot
image=matplotlib.pyplot.imshow(data)
ave_inflammation = data.mean(axis=0)
ave_plot = matplotlib.pyplot.plot(ave_inflammation)
max_plot = matplotlib.pyplot.plot(data.max(axis=0))
min_plot = matplotlib.pyplot.plot(data.min(axis=0))
import numpy
import matplotlib.pyplot
data = numpy.loadtxt(fname='data/inflammation-01.csv',delimiter=',')
# set up graphing space
fig = matplotlib.pyplot.figure(figsize=(10.0,3.0))
# position separate graphs
axes1 = fig.add_subplot(1,3,1)
axes2 = fig.add_subplot(1,3,2)
axes3 = fig.add_subplot(1,3,3)
axes1.set_ylabel('average')
axes1.plot(data.mean(axis=0))
axes2.set_ylabel('max')
axes2.plot(data.max(axis=0))
axes3.set_ylabel('min')
axes3.plot(data.min(axis=0))
fig.tight_layout()
matplotlib.pyplot.show()
each plot goes on its own axes (axes1, axes2, axes3); if all three use axes1, everything is drawn on the first graph
Loops in Python
Python 2: print does not need parentheses (print letter)
Python 3: print needs parentheses (print(letter))
word = "AGTC"
superword=word*10
for letter in superword:
    print(letter)
for letter in word:
    print(letter)
for letter in word:
    print letter   # Python 2 only
[] will print the letter at that position
for letter in range(len(word)):
    print(word[letter])
for letter in word:
    print(letter)
# both of the two loops above will produce the same results
for letter in range(len(word)):
    print("The index is", letter)
    print("The letter at this index is", word[letter])
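When both the position and the letter are needed, Python's built-in enumerate gives them together, which avoids the range(len(word)) pattern. It was not covered in these notes and is shown here only as an alternative:

```python
word = "AGTC"

# enumerate hands back (index, letter) pairs
for index, letter in enumerate(word):
    print("The index is", index, "and the letter at this index is", letter)

pairs = list(enumerate(word))
print(pairs)
```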
Lists in Python
nonsense=[]
nonsense.append('This is my first list addition')
# notice the difference in brackets: [] creates a list, () is used when you call append
nonsense.append(5)
nonsense.append(sequences)
nonsense[-1] # index list
nonsense[-1][0] # index list within a list
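The variable sequences is not defined anywhere in these notes. Assuming it was a list of DNA strings, the list example can be run end to end like this (the two sequences are made up):

```python
# hypothetical stand-in for the `sequences` variable used above
sequences = ["AGTC", "GGCA"]

nonsense = []                      # square brackets create an empty list
nonsense.append("This is my first list addition")
nonsense.append(5)                 # a list can mix strings and numbers
nonsense.append(sequences)         # ... and can even hold another list

print(len(nonsense))               # the inner list counts as a single item
print(nonsense[-1])                # last item: the whole inner list
print(nonsense[-1][0])             # first item of the inner list
```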
Loops exercise
wholeword=''
for item in newword:
    print("wholeword before addition:", wholeword)
    wholeword=wholeword+item
    print("wholeword after addition:", wholeword)
wholeword=''
for item in range(0,len(newword),2):
    print(newword[item])   # every other letter
a list can be changed; a string cannot be changed
***the body of a loop must be indented, otherwise the loop does not work
wholeword=''
for item in range(len(newword)-1,-1,-1):
    wholeword=wholeword+newword[item]
print(wholeword)
newword[::-1]   # slicing with a step of -1 also reverses the string
Let's reverse a string! (Without checking stackoverflow):
Try #1:
for some letter in a word:
- somehow print it in reverse order??
Try #2
- Ok, let's break the word up first. Actually, we don't have to, because we can access each element number "n" as word[ n ]
- So:
- word = "Newton"
- reversed = "" # ok, start it with nothing
- n = length of word
- for (number i goes from (n-1) down to 0):
- reversed = reversed + word[ i ]
- print("Ok, the reversed thing is: " + reversed)
- Probably that will work! But you have to figure out how to do the for loop!
Try #3: # working solution
- word1="Newton"
- word2=""
- for item in range(len(word1)-1,-1,-1):
-     word2=word2+word1[item]
- print(word2)
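For comparison, here is the loop from Try #3 wrapped in a function, next to two shorter built-in ways of reversing a string. The function name reverse_with_loop is made up for this sketch.

```python
def reverse_with_loop(word):
    """Build the reversed string one letter at a time, as in Try #3."""
    result = ""
    for index in range(len(word) - 1, -1, -1):   # last index down to 0
        result = result + word[index]
    return result

word = "Newton"
print(reverse_with_loop(word))
print(word[::-1])                 # slice with a step of -1
print("".join(reversed(word)))    # reversed() yields the letters back to front
```

All three give the same result; the loop version is the one that shows what is actually happening.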
Decision rules in Python
if ... :
- ## Remember that the indent (4 spaces) is important. This denotes a "block" of code, which we execute *if* the boolean is True.
elif ... :
##
else:   # else takes no condition
##
# the if/elif/else chain stops at the first condition that is true; later branches are skipped
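A small runnable illustration of the decision rules above. The function describe and its thresholds are invented for this example.

```python
def describe(number):
    """Return a label for number; only the first matching branch runs."""
    if number > 100:
        return "big"
    elif number > 10:
        return "medium"    # 500 is also > 10, but it never gets this far
    else:
        return "small"     # else has no condition: it catches everything left

for value in [500, 50, 5]:
    print(value, "is", describe(value))
```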
Python Day 2 (morning)
Notebook for today can be found at: https://github.com/rdtarvin/swc_UCSF_python/blob/master/SWC_Day2_notebook.ipynb
You can also download the notebook from yesterday by clicking
I'll be working in python 3 now!
Functions!
def <function_name>(<arguments>, <default_value_names>=<default_values>):
    <code_to_execute>
    return <what_to_return>
def is_animal_a_snake(animal):
    if animal == "snake":
        return "yep it totally is"
    else:
Python 2 vs 3 "fun" fact #1:
Are you annoyed that "2 / 3" is 0 (instead of 0.66667) in python 2? Fix that by putting this in the top of your python 2 file:
- from __future__ import division
- This was added to python in 2001
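In Python 3 (which these notes use from Day 2 on) the two kinds of division have separate operators, so no import is needed. A quick check:

```python
print(2 / 3)     # true division: always gives a float
print(2 // 3)    # floor division: rounds down to a whole number
print(7 // 2)
print(7 % 2)     # remainder left over after floor division
```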
Python 2 vs 3 "fun" fact #2:
- Why are there both python 2 and python 3? Isn't one an upgrade to the other?
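- Python 3 is a cleanup of the language that deliberately broke backwards compatibility (print became a function, "/" always does true division, strings are Unicode by default), so Python 2 code often does not run unchanged. The side effect is that nerds can fight a holy war over which is better.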
# Add function
def add(a,b):
    result = a+b
    return result
# need return statement to be able to save output value into variable
# return will stop function, if you have another line with print below return, it will not print value
# Default arguments (the function returns a tuple)
def display(a=1,b=2,c=3):
    print("a:", a, "b:", b, "c:", c)
    return (a,b,c)
# display(x) replaces the default value of a with x; b and c keep their defaults
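The different ways of calling a function with default values can be tried side by side. The specific values 10 and 30 are arbitrary:

```python
def display(a=1, b=2, c=3):
    print("a:", a, "b:", b, "c:", c)
    return (a, b, c)

display()                     # no arguments: all defaults are used
display(10)                   # positional argument: replaces a
display(c=30)                 # keyword argument: replaces only c
result = display(10, c=30)    # both styles can be mixed
print(result)
```

Keyword arguments are the way to change a later default (such as c) without having to spell out the earlier ones.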
### CHALLENGE:
write a function called fence that takes 2 arguments called original and wrapper and returns a new string with the wrapper at the beginning and end of the original
example call: fence('name','*') returns '*name*'
#solution
def fence(original, wrapper):
    answer=wrapper+original+wrapper
    print(answer)
    return answer
## Script from yesterday
%matplotlib inline
import numpy
import matplotlib.pyplot
import glob
filenames = glob.glob('data/inflammation*.csv')
bad_maxes = 0
bad_mins = 0
ok_graphs = 0
for my_file in filenames:
    print("File being analyzed is", my_file)
    data = numpy.loadtxt(fname=my_file,delimiter=',')
    ## check for suspicious data
    if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
        print("suspicious looking maxima")
        bad_maxes += 1
    elif data.min(axis=0).sum() == 0:
        print("Minima add up to zero!")
        bad_mins += 1
    else:
        print("seems ok!")
        ok_graphs += 1
    # set up graphing space
    fig = matplotlib.pyplot.figure(figsize=(10.0,3.0))
    # position separate graphs
    axes1 = fig.add_subplot(1,3,1)
    axes2 = fig.add_subplot(1,3,2)
    axes3 = fig.add_subplot(1,3,3)
    # set y axis labels and tell it what to plot
    axes1.set_ylabel('average')
    axes1.plot(data.mean(axis=0))
    axes2.set_ylabel('max')
    axes2.plot(data.max(axis=0))
    axes3.set_ylabel('min')
    axes3.plot(data.min(axis=0))
    fig.tight_layout()
Challenge #2
def fence(a,b):
    result = b+a+b
    print(result)
    return result
fence("name","*")
## Break up scripts into separate functions
%matplotlib inline
import numpy
import matplotlib.pyplot
import glob
def analyze(filename):
    data = numpy.loadtxt(fname=filename,delimiter=',')
    # set up graphing space
    fig = matplotlib.pyplot.figure(figsize=(10.0,3.0))
    # position separate graphs
    axes1 = fig.add_subplot(1,3,1)
    axes2 = fig.add_subplot(1,3,2)
    axes3 = fig.add_subplot(1,3,3)
    # set y axis labels and tell it what to plot
    axes1.set_ylabel('average')
    axes1.plot(data.mean(axis=0))
    axes2.set_ylabel('max')
    axes2.plot(data.max(axis=0))
    axes3.set_ylabel('min')
    axes3.plot(data.min(axis=0))
    fig.tight_layout()
    matplotlib.pyplot.show()
def detect_problems(filename):
    # counters start at 0 on every call, so the returned tuple describes this one file only
    bad_maxes=0
    bad_mins=0
    ok_graphs=0
    data = numpy.loadtxt(fname=filename,delimiter=',')
    ## check for suspicious data
    if data.max(axis=0)[0] == 0 and data.max(axis=0)[20] == 20:
        print("suspicious looking maxima")
        bad_maxes+=1
    elif data.min(axis=0).sum() == 0:
        print("Minima add up to zero!")
        bad_mins+=1
    else:
        print("seems ok!")
        ok_graphs+=1
    return (bad_maxes, bad_mins, ok_graphs)
def main():
    filenames=glob.glob('data/inflammation*.csv')
    for my_file in filenames:
        print("file being analyzed is:", my_file)
        analyze(my_file)
        detect_problems(my_file)
main()
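Because detect_problems sets its counters to 0 inside the function, each call can only report on one file. One way to get totals across files is to have the function return a label and count the labels afterwards. The sketch below is an assumption about how that could look, not the workshop's code: the names classify and tally are hypothetical, and small made-up lists replace the inflammation CSV files so that it runs without numpy.

```python
from collections import Counter

def classify(data):
    """Label one data set (a list of rows) using the same checks as
    detect_problems, written for plain lists."""
    columns = list(zip(*data))                  # turn rows into columns
    col_max = [max(col) for col in columns]
    col_min = [min(col) for col in columns]
    if col_max[0] == 0 and col_max[20] == 20:
        return "bad_max"
    elif sum(col_min) == 0:
        return "bad_min"
    else:
        return "ok"

# made-up data sets with 21 columns each
suspicious = [list(range(21)), list(range(21))]   # maxima equal the column number
flat = [[0] * 21, [0] * 21]                       # every minimum is zero
fine = [[1] * 21, [2] * 21]

# count how many data sets received each label
tally = Counter(classify(d) for d in [suspicious, flat, fine, fine])
print(tally["bad_max"], tally["bad_min"], tally["ok"])
```

Returning one label per file keeps the function simple, and the counting happens in one place where all files are visible.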
## Build in assertion errors
numbers=[3,5,6,-1,-7,0,10]
total = 0
for n in numbers:
    print(n)
    assert n>0, "data should only contain positive values"
    total += n
print('total is:', total)
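The loop above stops with an AssertionError as soon as it reaches -1. A small sketch of the same check wrapped in a function (the name total_positive is made up) shows what the caller sees in the good case and in the failing case:

```python
def total_positive(numbers):
    """Add up numbers, refusing any value that is not positive."""
    total = 0
    for n in numbers:
        assert n > 0, "data should only contain positive values"
        total += n
    return total

print(total_positive([3, 5, 6]))

try:
    total_positive([3, 5, 6, -1, -7, 0, 10])
except AssertionError as error:
    # the message given after the comma in the assert statement ends up here
    print("assertion failed:", error)
```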
## Assertion statements
def normalize_rectangle(rect):
    '''
    Normalizes a rectangle so that it is at the origin and 1.0 units long on its longest axis.
    '''
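The notes stop after the docstring. One possible body is sketched below, assuming rect is a tuple (x0, y0, x1, y1) holding the lower-left and upper-right corners. That layout and the specific assertions are assumptions for illustration, not taken from these notes.

```python
def normalize_rectangle(rect):
    '''
    Normalizes a rectangle so that it is at the origin and 1.0 units long
    on its longest axis.
    '''
    # preconditions: check the input before doing any work
    assert len(rect) == 4, "rectangle must have 4 coordinates"
    x0, y0, x1, y1 = rect
    assert x0 < x1, "invalid x coordinates"
    assert y0 < y1, "invalid y coordinates"

    dx = x1 - x0
    dy = y1 - y0
    if dx > dy:
        upper_x, upper_y = 1.0, dy / dx    # wider than tall
    else:
        upper_x, upper_y = dx / dy, 1.0    # taller than wide, or square

    # postconditions: check the result before handing it back
    assert 0 < upper_x <= 1.0, "calculated upper x coordinate invalid"
    assert 0 < upper_y <= 1.0, "calculated upper y coordinate invalid"
    return (0, 0, upper_x, upper_y)

print(normalize_rectangle((0.0, 0.0, 4.0, 2.0)))
print(normalize_rectangle((1.0, 1.0, 2.0, 5.0)))
```

The assertions at the top catch bad input early, and the ones at the bottom catch mistakes in the calculation itself.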
## Git
$ git config --global user.name ""
$ git config --global user.email ""
$ git config --global color.ui "auto"
$ git config --global core.editor "nano -W"
$ git status
nano instructions.txt # add some lines to the file
git add instructions.txt # add to be tracked by git
git commit # add a message saying file was created
nano instructions.txt # change some lines in the file
git add instructions.txt # add again so git stages the new changes
git commit -m "message describing change"
git diff <filename> will show the changes made to a file that have not been staged (added) yet
## if you've already committed changes
$ git diff HEAD~1 instructions.txt # compare the file with its version from one commit before the latest
$ git checkout HEAD~1 ingredients.txt # restore the file's version from one commit before the latest