Bioinformatics with Python and Hadoop Streaming

A while back I wrote a short paper about using the Python programming language in conjunction with Apache Hadoop. The purpose of doing so was to use several to many heavy duty machines to process problems in the field of Bioinformatics. At the time, I did not come across many resources so I've decided to post some of my work on the web. The intended audience would be students in an introductory Bioinformatics course or anyone with a will. It's really simple stuff!

The Paper/Tutorial:
http://donniedemuth.s3.amazonaws.com/bioinf_with_python_and_hadoop.pdf

Relevant Slides:
I also gave a presentation on functional programming, Python, and Hadoop. The slides relate heavily to the paper above.
http://donniedemuth.s3.amazonaws.com/LCS_python_hadoop_streaming.ppt

Note:
I haven't edited the material yet so there may be grammatical errors. And eventually I plan on moving the contents onto the blog.

CSCI E-190 Bioinformatics Algorithms

CSCI E-290 is a fairly new course in the Harvard Extension School curriculum related to three different programs: biology, computer science, and biotechnology. I'm taking it as an elective in my pursuit of a Master's in Liberal Arts (ALM-IT) with a concentration in Software Engineering; though after a little bit of studying I'm curious why I never pursued the biotechnology path.

Note: When I was younger, I was terrible in the sciences (biology, chemistry, you name it). Actually, if it wasn't computer related it couldn't hold my attention.

The following are my impressions after the first lecture.

How was the lecturer?
Dr. Jeff Parker's lecture opened with the following YouTube clip:



I think it was a telltale sign that he has a sense of humor and that the class will not be dry. Throughout the lecture I was certainly entertained. Surprisingly, the pace of the lecture was very fair and of the 100 slides he had in his PowerPoint, we only covered fifty or so.

One thing that I noticed about the lecturer was that he seems very patient. A few students blurted out ridiculous questions, in my opinion not appropriate for a first lecture, and he took a deep breath and tried to answer them fairly. Personally, I don't know if I could ever be that calm. I understand that everyone learns in different ways but I wish people would respect those trying to educate them.

Overall Impression
It is hard for me to judge this class for the following reasons:

  • I love the Python programming language
  • This class uses Python
  • I am fascinated by what I'm learning

Thus, I am currently geeked out and favorable towards the class. If you want a fair look at the class, you may have to stay tuned for a later blog post. I plan to review some of the required course books (I bought many) and workload.

CSCI E-250 Abstraction and Design [First Impressions]

CSCI E-250 is the Harvard Extension School's offering of Harvard College's CS-51: Abstraction and Design. There are two reasons I signed up for the class: it has a really good teaching fellow, and I am always interested in becoming a better programmer.

The following are my impressions after the first lecture.

How was the lecturer?
The course is taught by Professor Greg Morrisett. He is not dull by any means, speaking with authority and passion (through his highly-validated opinions). For example, if you're a C++ programmer you could be offended by some of his opening remarks -- apparently, teaching C++ would go against everything he believes in. Luckily, I avoid C and C++ like the plague so I think I like this guy already.

One statement made me very interested in the class. He states:
"[The class will] change the way you think about programming in a deep way."

A few more times he mentions that there will be an emphasis on stretching of the mind. The stretching is probably related to learning the functional programming language OCAML. And well, this is all exciting. I think I ought to look at programming in a different light.

Why OCAML?
In undergrad I studied Haskell, an older functional programming language, and I thought it was the best language ever. However, after that particular class I never used it again. I don't remember a thing about it except I was probably the only person that got an A in the class. My peers highly despised the language. It should be fun learning the concepts again.

In the Spring '09 edition of this course we'll be using OCAML and a few given reasons are:

  • Functional programming languages are better
  • FPLs are used on Wall Street and for huge problems, see Map-Reduce 
  • New FPLs are on the rise: F# and scala
  • Languages such as C++ are error prone no matter how skilled the programmer is
  • It uses one of the highest-performance compilers out there
  • It is mostly pure; few side-effects

After our first look at OCAML
The syntax of OCAML does not scare me a whole lot. I recently took a Ruby course so I have a little experience with the concept of a map function. However, a few simple things stood out:

  • Two semicolons (;;) are used to end a statement or function
  • The underscore symbol is a wildcard that matches any value
  • The in keyword can be viewed as a way to define multiple let statements within a block of code

Around that time the professor posed the question of why nested comments were important. If you ever programmed with Java you would realize that you can't really nest comments. What occurs in some languages is that the inner comment-close syntax will also close an outer comment.

Update: I found a decent tutorial written for the layman.
Update-2: Here's another site I've never seen before. The code-codex has comparable code to many different languages and it seems like it could be a decent learning tool.


Overall Impression
Simply put: it was pretty darn cool. Even this link was thrown within the slides:





It's a game that asks you to determine, by profile picture, whether a person is a serial killer or a programming language inventor.


Not a lot of Extension students are taking this course. I DO NOT know why. It seems like this class is meant for them (me). I leave you with this statement from the lecturer: The class is intended for practicing software people and will help you "kick ass" in the work place.

CSCI E-168 Web-based Software with Ruby and Ruby on Rails [REVIEW]

Update: Since I've posted this, I received several e-mails about the course. One included a course evaluation that may be relevant. I only included the comments of this eval and I am posting it anonymously for the person.



A fellow classmate reminded me to review the CSCI E-168 course. Here is my review!

Lectures
The lecturer, John Norman, is articulate and an incredible speaker. You can immediately tell he comes from an English-Literature background. His lessons are pleasant to listen to and you can literally hear the pride that he puts into his work. I believe most people would enjoy his lectures.

The Ruby portion of the course was amazing. I felt that this was a language I could work with and the Ruby-only assignments were really fun.

Sections
Unfortunately, I had a scheduling conflict and could not stay for section most of the time. Since the class had over 70 students, there were two or three different sections and YMMV.

Update 1-16-10:
Okay so there seems to be some misunderstanding whether or not I liked section. I was questioned about it in several private e-mails. In my past posts, I kind of gushed over class and section.

Of the three classes I was taking, it was the least useful of the group. However, it is worth mentioning that I did read ahead and finish many assignments early. Did I need to go to section? Probably not. But I thought it was fun, and funny at times. For the most part I sat back and kept my mouth shut.

Workload
I thought the assignments were moderately difficult. None were easy and none were outrageous. If you devote a good ten hours or so, you will be able to finish any assignment. However, a good portion of the assignment grades are based on how rubyish your code is. You will get slapped on the wrist for not doing something the ruby way. Keep that in mind and be sure to comment and document everything.

One thing that made me sour was trying to figure out what to do for the assignment. The documentation is wild and it's not clear what to do. I really don't have any tips for this class except for re-iterating that you have to do things the ruby-and-rails way. The following link is my final assignment submission for Fall 2009: E-168 Final Submission

My project was an attempt at a discussion board system. You may be able to review the code to determine what you can or should not do.




Final Thought
* Warning: Any negativity below is primarily Rails (the framework) related and not directly course related

CSCI E-168 reminded me why I am not a Ruby-on-Rails guy. In a nutshell, the course was a lot like going to Church. It's all interesting but I really don't know if I can buy into this... 


Basically, there were a lot of cool things to learn but I'm not sure I would add it to my programming utility belt. As a career introvert the Rails community hits me in the wrong spot. They are far too happy-go-lucky and bubble-gummy. It's the pop-music of web frameworks. A guy like me wouldn't fit in. What's wrong with that?!

This is an exaggeration but it also feels like a cult... An exclusive cult. Not like a Unixy-nerdy cult. It's a trendy iPody cult.

One gripe I had about the class is the guest panel at the end of the semester. Some incredible Ruby developers showed up: Dan Chak, author Dan Croak, and the owner of ravelry.com. They were invited to share their experience with the class. That sounds exciting right?

But from my point of view, I felt that the developers were not happy to be speaking in front of our class. Their mannerisms showed that (a) they were not getting paid and (b) why am I here? Moreover, the panel discussion was dominated by a previous Teaching Assistant who had very little to add, except for cutesy sarcastic jokes. When this person spoke, I found entertainment by watching the expressions on the two Dans' faces.

I would recommend this class to someone that wants to work with Rails. Throughout the semester, I received many Rails-related calls from recruiters. You may experience the same interest from the industry.

Additionally, this class changed my opinions about Ruby. That language is pretty darn cool and the one-liner assignment is one of my favorite assignments of all time. In 2010, I heard that Harvard Extension will be offering a Programming Ruby course and I would put my personal stamp of approval on it!

CSCI-E 207 Formal Systems and Introduction to Computational Theory [REVIEW]

This course made me experience Stockholm Syndrome. It held me captive and consumed every waking moment of my life. For that, it became everything to me.

Lectures
It was difficult to find the time to watch or attend the lectures. When I took this course I was: simultaneously enrolled in three classes total, searching for full-time employment, and juggling a consulting gig. BAD IDEA!

The homework was so difficult that by the time you finish one... A brand new and much harder one is already posted online. Sure, lecture was important to understanding the material but it only covered 1% of what you'll have to learn in order to complete the Problem Sets. Well, it just felt that way!

Unfortunately, the lecturer could only cover so much in an hour and a half. Lecture provided an overview but section provided what you needed to know to get started on your homework.

Sections
Watch or attend section, or else. If you don't, you'll do poorly. I guarantee it! Brian was our Teaching Fellow and he was the best TF you could ever ask for. He never gave us the answers -- but I am certain he could teach chimps poetry. If you have a TF half as good as him you're in good shape. Often I felt section covered the material very clearly, much more than the book or lecture.

Workload

You'll have problem sets almost every week and they'll take about 20 hours or more to complete. This class seemed to be a "re-take" for others but this was the first time I covered this material. If you're taking this as a refresher, then sure, I guess it won't be that hard. But for me, it made me cry myself to sleep.

Besides the weekly homework, you'll have a midterm and a final. I studied diligently for them so I did pretty darn well. Both were difficult -- BUT not as difficult as I thought they would be.



You will also need to spend a week or two learning LaTeX and its markup language. LaTeX is intimidating at first and it does take some time to install. It took me over three hours to get my MacBook set up correctly. Most of that time went to downloading a large, gigabyte-sized file. I believe there's some dmg you can download to set LaTeX all up for you. The markup language isn't too difficult but it does take a while to get used to as it's not intuitive (at first). On a MacBook, I found TeXShop to be my editor of choice.



I recommend anyone taking this class to download a tool named JFlap to help you with 2 or 3 of the Problem Sets. Trust me, JFlap is necessary and will help you test your answers. Drawing automata or constructing grammars is a piece of cake. Testing is just as easy, as you can run multiple input strings against your object (as shown above).

The assigned textbook is Introduction to the Theory of Computation (Sipser) and it's good but terse. I used the first edition because it was a whole lot cheaper and I didn't have any problems. For the homework, you'll probably need more help if you can't figure things out on your own. A few books that helped me were:




How to Prove It (Velleman): This book definitely helps for the first Problem Set and will help you write proofs for the rest of the semester. I came back to this book a few times to remind myself how to solve later problems in the course. It costs about $20.

Link to How to Prove it on Amazon



Introduction to Automata Theory, Languages, and Computation (Hopcroft, et al): I don't know why this book doesn't have a higher rating on Amazon. Out of all of the books I've used and looked at, this was by far the best for this class. I didn't find this until the end of the class and read it from cover to cover. If I had it early, many of the Problem Sets would have made so much more sense... SAD!

Link to Automata Theory on Amazon


Final Thoughts
The class was a roller coaster ride: I loved it, hated it, loved it, hated it, and so on. I ended up getting an A-minus but I know I earned it. If you want the same, work hard and treat this class like a full-time job. I'd take it again just to get that A.

CSCI-E 131b Communication Protocols and Internet Architectures [REVIEW]

Now that I have completed the class I ought to tell you all about what I thought.

Lectures
As many people will tell you, Leonard Evenchik is an excellent lecturer. I enjoyed his talks and I never felt "lost" even when I didn't understand the material. He's a great guy and although I did not have much one-on-one time with him I'm sure he would be fun to talk to.

If you can't attend class you can watch his recorded lectures online on the course website. I was pleasantly surprised to find a quality recorded product. The lectures will stream at various sizes and include the power point slides he is covering. If you're lucky enough to take his class when it's held at 1 Story Street, you'll have the better product to watch (many more camera angles).

This class is recorded and held live in the Fall and the Summer. The videos are replayed for those taking it in the Spring. Apparently the Summer class is a little experimental where the staff tries to introduce new technology.

Sections
When I took the course "Joe" was my TA and only 5-7 students showed up to section each week. Joe is a really good guy who had a sense of humor and encouraged everyone to talk and share our newbie questions (the stuff covered in the class does seem silly for us not to know). I spoke a lot and I was surprised that the other students had the same questions that I had. An added benefit was Joe really helped us understand the homework problems and past lectures.

Workload
Okay, so if you're thinking of taking this class I bet you just want to know how hard is it? For me it was not difficult. But I did find the material interesting and the homework fun. Each homework might take 5 to 10 hours to complete and there are 5 of them.

You'll spend a few hours reading each week and most of it will be online material and relate to the homework. The assigned textbook Computer Networks, A Systems Approach (Peterson and Davie) is good but seems to lose importance during the middle chunk of the course.

I found that the book TCP/IP Illustrated, Volume 1: The Protocols (Stevens) covered that section better and is a better supplement for many of the homework problems. The problem with the latter book is that it does not cover many of the modern technologies related to VOIP. Hint: Professor Evenchik is a VOIP expert.



However, you can buy that book on Amazon for $50 new, used for $20.
http://www.amazon.com/TCP-IP-Illustrated-1-Protocols/dp/0201633469

Final Thoughts

I enjoyed this class and I would recommend it to anyone interested in the IP stack and networking. It is ridiculous how much better I understand the Internet. If you're already comfortable with TCP/UDP/IP and all those other three-letter-acronyms then this class could be extremely easy. But for a guy like me, with no in-depth networking experience, I found this to be the most useful class I have ever taken.

CSCI-E 131b Final Review

Update 1/2/10
The Final didn't seem as difficult as I thought it would be -- I did study a whole lot and attend every section though! If you're taking this course in the future, I recommend having a good grasp on ALL concepts. Easier said than done right? But guess what, your brain will have a lot of extra room if you don't memorize the structure of specific packets and other things of that detail.

~~~*
~~~*~~~*~~~*~~~*~~~*~~~*~~~*~~~*~~~*~~~*~~~

Note: I formed this while I was studying for the course’s final and I cannot vouch that it is correct in any way. Much of it comes from our course’s textbook, Wikipedia, lecture notes, and my own random knowledge.

Terms

These are only the terms that I think are important.


802.3: The IEEE Ethernet standard used today.


802.11: The Wireless networking standard used today.


ACK: An abbreviation for acknowledgment. ACKs are usually sent to tell the sender the data or packet transfer was successful.


AES: Advanced Encryption Standard. A Shared-Key cipher that supersedes DES.


ARP: Address Resolution Protocol. Used to translate IP addresses into MAC addresses.


Bandwidth: The amount of data per unit of time that can be transmitted over a connection.


Broadcast: A way of sending packets to every host on a network.


CA: Certificate Authority. This is an entity that verifies and signs certificates to ensure the validity of a public key and name (domain).


Certificate: A digitally signed (hashed) document used to distribute public keys.


Checksum: A computation over some data that can take place before and after transmission. If the checksum matches, the before and after, then the data was transmitted in whole. Used for error detection.
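To make the before-and-after comparison concrete, here is a minimal Python sketch. It assumes the 16-bit ones' complement sum described in RFC 1071 (the style used by IP, TCP, and UDP); the function names are my own and purely illustrative.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit ones' complement of the ones' complement sum of 16-bit words."""
    if len(data) % 2:
        data += b"\x00"                           # pad odd-length input
    total = 0
    for i in range(0, len(data), 2):
        total += (data[i] << 8) | data[i + 1]     # next 16-bit word
        total = (total & 0xFFFF) + (total >> 16)  # fold the carry back in
    return ~total & 0xFFFF


def arrived_intact(data: bytes, checksum: int) -> bool:
    """Receiver side: recompute and compare with what the sender attached."""
    return internet_checksum(data) == checksum


message = b"hello, network"
sent_checksum = internet_checksum(message)             # computed before sending
print(arrived_intact(message, sent_checksum))          # True: data unchanged
print(arrived_intact(b"hellO, network", sent_checksum))  # False: one byte differs
```

Note that a match only says no error was detected; a checksum this simple can miss some combinations of changes.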


CIDR: Classless Inter-Domain Routing. Used for sub-netting, allowing many more sizes of networks compared to typical Class A, B, C subnets.
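A quick way to play with this is Python's standard ipaddress module. The sketch below takes a Class C sized block and cuts it into four /26 networks, a size the classful scheme never offered. The addresses are arbitrary examples.

```python
import ipaddress

block = ipaddress.ip_network("192.168.0.0/24")   # a Class C sized block
print(block.num_addresses)                       # 256

# CIDR lets the prefix fall anywhere, so the block can be split four ways.
quarters = list(block.subnets(new_prefix=26))
for net in quarters:
    print(net, net.num_addresses)

# Find which /26 a particular host belongs to.
host = ipaddress.ip_address("192.168.0.130")
owner = next(net for net in quarters if host in net)
print(owner)                                     # 192.168.0.128/26
```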


Congestion: The condition in which packets are discarded because too many contend for a single resource.


Congestion Control: The avoidance of congestion in a network. We’ve discussed the Slow Start algorithm of TCP in class.


Connection-Oriented: A protocol where some initialization must occur between the sender and receiver before data may be transferred.


Connectionless: A protocol where data can be sent without any prior connection. Also known as a Datagram service.


CRC: A strong checksum that exists in many packet headers.
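As a small illustration, Python's zlib module exposes a CRC-32 function (the same polynomial Ethernet uses). This sketch only shows the detect-a-flipped-bit idea; it is not how any particular protocol lays out its CRC field.

```python
import zlib

frame_body = b"payload carried in a frame"
crc = zlib.crc32(frame_body)              # 32-bit CRC as an unsigned integer

# Flip a single bit in the first byte to simulate a transmission error.
corrupted = bytes([frame_body[0] ^ 0x01]) + frame_body[1:]

print(crc == zlib.crc32(frame_body))      # True: nothing changed
print(crc == zlib.crc32(corrupted))       # False: the error is detected
```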


CSMA/CD: Carrier Sense Multiple Access with Collision Detection. This is a feature of Ethernet. Because multiple nodes can be attached to the network, a node can sense when data is being sent over it. In addition, it can detect when more than one node transmits at the same time.


Datagram: Analogous to connectionless. This is a transmission unit that contains the necessary information to deliver to its destination.


Demultiplexing: The counterpart to multiplexing, where many different “things” can share another “thing.” As in the case of protocols, IP uses the Protocol Number field to determine whether it is using TCP or UDP. And TCP/UDP uses the Port Number to allow many Layer-5 protocols to use it.
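Here is a rough Python sketch of that two-step lookup. The protocol numbers 6 (TCP) and 17 (UDP) are the real IP values; the tiny port tables and the demultiplex function are just for illustration, and DNS on UDP port 53 is my own addition to the example.

```python
IP_PROTOCOLS = {6: "TCP", 17: "UDP"}   # values of the IP Protocol Number field
TCP_PORTS = {25: "SMTP", 80: "HTTP"}   # a few well-known TCP ports
UDP_PORTS = {53: "DNS"}                # assumption: added for the example


def demultiplex(protocol_number: int, dest_port: int) -> str:
    """Step 1: IP picks the transport. Step 2: the transport picks the application."""
    transport = IP_PROTOCOLS.get(protocol_number)
    if transport is None:
        return "unknown transport"
    ports = TCP_PORTS if transport == "TCP" else UDP_PORTS
    application = ports.get(dest_port, "unknown application")
    return f"{transport} -> {application}"


print(demultiplex(6, 80))    # TCP -> HTTP
print(demultiplex(17, 53))   # UDP -> DNS
```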


DES: Data Encryption Standard. This is a Shared-Key algorithm that uses a 64-bit shared key, of which 56 bits are effective (the other 8 are parity bits).


DHCP: Dynamic Host Configuration Protocol. A protocol used by a host to determine its own IP on a network.


DNS: Domain Name System. The naming system used by the Internet to resolve hostnames, implemented through a hierarchy of name-servers. Common DNS records include:

• A -specifies 32 bit IPv4 address

• AAAA –IPv6 address record

• MX -mail exchange record

• NS -specifies authoritative name server for a domain

• CNAME -canonical name, provides alias functionality

• HINFO -specifies limited host information

• SRV –identifies a specific service

• NAPTR


Encapsulation: The process of taking a higher level protocol and placing it within the payload of a lower level protocol.
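A toy Python sketch of the nesting. The bracketed headers are made-up placeholders rather than real header formats; the point is only that each layer treats everything from the layer above as opaque payload.

```python
def encapsulate(header: bytes, payload: bytes) -> bytes:
    """Place the higher-level data, untouched, behind this layer's header."""
    return header + payload


http_message = b"GET / HTTP/1.1\r\n\r\n"            # application data
tcp_segment = encapsulate(b"[TCP]", http_message)   # placeholder TCP header
ip_packet = encapsulate(b"[IP]", tcp_segment)       # placeholder IP header
frame = encapsulate(b"[ETH]", ip_packet)            # placeholder link header

print(frame)
```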


Ethernet: A data link layer protocol that uses CSMA/CD. The original Ethernet had various hosts that used vampire clamps to connect to a large wire, “the ether.”


Firewall: A router that follows some security policy to filter packets.


Flow Control: Used to prevent a sender from overloading a receiver. We most commonly see Sliding Window as a mechanism in HDLC and TCP.


Forwarding: Routers operate on the store-and-forward principle. Packets are first stored in a buffer and then sent to their destination.


Forwarding Table: Maintained by routers to help decide where to forward packets.


Fragmentation/Assembly: Packets may be split into small sizes by a router if they are too large for a network.
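The splitting and reassembly can be sketched in a few lines of Python. I'm assuming IPv4-style rules here: a 20-byte header and data sizes in multiples of 8 bytes, since IPv4 fragment offsets count 8-byte units. The tuple layout is my own simplification, not a real header.

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20):
    """Split payload into (offset, more_fragments, chunk) tuples that fit the MTU."""
    max_data = (mtu - header_len) // 8 * 8    # round down to a multiple of 8
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for offset in range(0, len(payload), max_data):
        chunk = payload[offset:offset + max_data]
        more = offset + max_data < len(payload)   # False only on the last piece
        fragments.append((offset, more, chunk))
    return fragments


def reassemble(fragments):
    """Order the pieces by offset and join them, as the receiving host would."""
    return b"".join(chunk for _, _, chunk in sorted(fragments))


payload = bytes(range(256)) * 16              # 4096 bytes of sample data
pieces = fragment(payload, mtu=1500)
print([(offset, more, len(chunk)) for offset, more, chunk in pieces])
print(reassemble(reversed(pieces)) == payload)   # True even if they arrive out of order
```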


Frame: A (link-layer) name for a packet sent between two links.


FTP: File Transfer Protocol. A standard TCP based protocol used for transferring files.


H.323: A protocol used for Internet telephony. Assumes that the end devices are “simple” and provides more control and configuration than SIP.


ICMP: Internet Control Message Protocol. Allows reporting based on the IP datagram, as IP is connectionless and occasionally some response can be helpful to certain applications/protocols.


IMAP: Internet Message Access Protocol. Allows a user to access their mail without downloading it to their machine first.


IPSEC: IP Security. This is an architecture used to provide authentication and security to the IP layer of the Internet. Transport (encrypt only the data) and Tunnel (encrypt all, assigning a new header) modes are provided.


Jitter: Timing variations in network latency.


MAC: Media Access Control. We see this as a way for devices to share a common network medium.


MD5: Message Digest version 5. This is a Digital Signature/Hashing algorithm.


MIME: Multipurpose Internet Mail Extensions. Email was originally text-based and MIME provides a way to convert and specify binary data to text.


MTU: Maximum Transmission Unit. The largest sized packet that can be sent on a given network.


Multicast: A special form of a broadcast to send data to a specified group of nodes.


Multiplexing: A way to share a single resource. Both UDP and TCP use the IP protocol to send data over the net. Examples:

FDM – Frequency Division Multiplexing – A different frequency for each user.

TDM – Time Division Multiplexing – Time intervals for each user or Statistical by a queue.


NAT: Network Address Translation. Typically implemented by routers to assign outgoing traffic from a local address some known public address so it may access the Internet. NAT may use port numbers where multiple local hosts share a single public address.
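The port-sharing case is easier to see in code. Below is a minimal Python sketch of a translation table; the PortNat class, its method names, and the starting port 40000 are all invented for illustration, and 203.0.113.5 is just a documentation address standing in for the public IP.

```python
class PortNat:
    """Toy NAT: many local (ip, port) pairs share one public IP via port numbers."""

    def __init__(self, public_ip: str, first_port: int = 40000):
        self.public_ip = public_ip
        self.next_port = first_port
        self.outbound = {}   # (local_ip, local_port) -> public_port
        self.inbound = {}    # public_port -> (local_ip, local_port)

    def translate_out(self, local_ip: str, local_port: int):
        """Rewrite the source of an outgoing packet, allocating a port if needed."""
        key = (local_ip, local_port)
        if key not in self.outbound:
            self.outbound[key] = self.next_port
            self.inbound[self.next_port] = key
            self.next_port += 1
        return self.public_ip, self.outbound[key]

    def translate_in(self, public_port: int):
        """Map a reply back to the local host, or None if no mapping exists."""
        return self.inbound.get(public_port)


nat = PortNat("203.0.113.5")
print(nat.translate_out("192.168.0.10", 5000))   # ('203.0.113.5', 40000)
print(nat.translate_out("192.168.0.11", 5000))   # ('203.0.113.5', 40001)
print(nat.translate_in(40001))                   # ('192.168.0.11', 5000)
```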


NFS: Network File System. A protocol to make file access over a network appear transparent.


OSPF: Open Shortest Path First. Used by routers to construct a network topology and be aware of changes noticed by other routers.


Packet Switching: Is the term used to describe how data is sent through the network. It uses store-and-forward and implies statistical multiplexing.


Proxy: An intermediate machine between a sender and receiver which can intercept messages and provide some service.


Public Key Encryption: An encryption algorithm where users have a private and public key used to encrypt and decrypt messages, versus some shared key. The private key can encrypt a message that only the public key can decrypt.


QoS: Quality of Service. An implementation of QoS can allow a network to make guarantees on packet delivery. Certain packets may be marked for expedited delivery.


RIP: Routing Information Protocol. Each router is only aware of its own networks and forwards this information to other connected routers.


RSA: A public-key encryption algorithm.


RTCP: Real-Time Transport Control Protocol. RTCP provides out-of-band statistics and control information for an RTP flow.


RTP: Real-Time Transport Protocol. RTP is an end-to-end protocol used to send data with real-time constraints. This is unreliable but sequenced.


RTT: Round-Trip Time. This is simply the latency to reach the destination and back.


SDP: Session Description Protocol. This is a format for describing streaming media initialization parameters in an ASCII string. SDP is intended for describing multimedia communication sessions for the purposes of session announcement, session invitation, and parameter negotiation. SDP does not deliver media itself but is used for negotiation between end points of media type, format, and all associated properties. SDP is designed to be extensible to support new media types and formats.


SIP: Session Initiation Protocol. This is an application layer protocol used in multimedia applications. It determines the correct device with which to communicate to reach a user, determines if the user is willing or able to partake in communication, determines the choice of media and coding scheme to use, and establishes the session.


Sliding Window: Sliding Window Protocols are a feature of packet-based data transmission protocols. They are used in the data link layer as well as in TCP. They are used to keep a record of the frame sequences sent, and their respective acknowledgements received, by both the users. Their additional feature over a simpler protocol is that they can allow multiple packets to be "in transmission" simultaneously, rather than waiting for each packet to be acknowledged before sending the next.
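A simplified Python sketch of the sender side, assuming a lossless link where every ACK arrives (so no retransmission logic). A window size of 1 behaves like the simpler wait-for-each-ACK protocol. The function name is my own.

```python
def send_with_window(num_packets: int, window_size: int):
    """Return the sequence numbers that are in flight during each round trip."""
    rounds = []
    base = 0                          # oldest packet not yet acknowledged
    while base < num_packets:
        in_flight = list(range(base, min(base + window_size, num_packets)))
        rounds.append(in_flight)
        base = in_flight[-1] + 1      # all ACKs arrive, so the window slides
    return rounds


print(send_with_window(7, 3))         # [[0, 1, 2], [3, 4, 5], [6]]
print(len(send_with_window(7, 1)))    # 7 round trips when waiting on every ACK
```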


Slow Start: Slow-start is part of the congestion control strategy used by TCP, the data transmission protocol used by many Internet applications. Slow-start is used in conjunction with other algorithms to avoid sending more data than the network is capable of transmitting, that is, network congestion.
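Roughly, the congestion window doubles every round trip until it reaches a threshold. The Python sketch below assumes no packet loss, measures the window in whole segments, and hands off to a one-segment-per-round-trip increase (congestion avoidance) past the threshold; the names cwnd and ssthresh follow common TCP usage but the function itself is only illustrative.

```python
def slow_start(rounds: int, ssthresh: int, initial_cwnd: int = 1):
    """Return the congestion window (in segments) at the start of each round trip."""
    cwnd = initial_cwnd
    history = []
    for _ in range(rounds):
        history.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)   # slow start: exponential growth
        else:
            cwnd += 1                        # assumption: linear growth afterwards
    return history


print(slow_start(rounds=8, ssthresh=16))     # [1, 2, 4, 8, 16, 17, 18, 19]
```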


SMTP: Simple Mail Transfer Protocol. This is an Internet standard for electronic mail transmission across networks. For receiving messages, client applications usually use either the Post Office Protocol (POP) or the Internet Message Access Protocol (IMAP) to access their mail accounts on a mail server.


Sub-netting:

• Class A, networks 1 -126, /8 prefix

• Class B, networks 128 -191, /16 prefix

• Class C, networks 192 -223, /24 prefix

Private IP Addresses

• 10/8 10.0.0.0 to 10.255.255.255

• 172.16/12 172.16.0.0 to 172.31.255.255

• 169.254/16 169.254.0.0 to 169.254.255.255 (link-local, self-assigned addresses)

• 192.168/16 192.168.0.0 to 192.168.255.255
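
The ranges above are easy to double-check with Python's ipaddress module. This sketch just tests membership in the four blocks as listed; the listed_range helper is my own.

```python
import ipaddress

LISTED_RANGES = {
    "10/8": ipaddress.ip_network("10.0.0.0/8"),
    "172.16/12": ipaddress.ip_network("172.16.0.0/12"),
    "169.254/16": ipaddress.ip_network("169.254.0.0/16"),
    "192.168/16": ipaddress.ip_network("192.168.0.0/16"),
}


def listed_range(address: str):
    """Return the label of the listed block containing the address, or None."""
    ip = ipaddress.ip_address(address)
    for label, network in LISTED_RANGES.items():
        if ip in network:
            return label
    return None


print(listed_range("172.31.255.255"))   # 172.16/12 (last address in the block)
print(listed_range("172.32.0.1"))       # None (just outside it)
print(listed_range("8.8.8.8"))          # None
```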


TCP: Transmission Control Protocol. This is a connection-oriented and sequenced protocol that ensures the delivery of data. Some well known TCP port numbers include:

20,21 FTP

22 SSH

23 Telnet

25 SMTP

80 HTTP

110 POP3

1720 H.323

5060 SIP


UDP: User Datagram Protocol. This is a connection-less and un-sequenced protocol.


Virtual Circuit: Provided by connection-oriented networks where a connection is initialized, a virtual circuit is formed, and then data is sent.


VPN: Virtual Private Network. Provides some network tunneling between nodes and forms a virtual circuit. It has two modes, Transport and Tunnel. Transport – only the data/payload is encrypted. Tunnel – the whole IP packet (data and header) is encrypted, into a new IP packet with a new header.


SP3

SP3 is a framework for describing Protocols that we use solely in class. In this section, I’ll cover some of the Protocols I think are important using SP3.


SP3: Guidelines

Service – What service is provided by this technology? For example: is data reliable, sequenced, or unreliable (connectionless), and what combinations of these features exist?

Purpose – What does this technology attempt to solve? For example: addressing, multiplexing, sequencing, error detection/correction, flow control, security, fragmentation and assembly.

Packets – Describe the (header) fields of the packet.

Procedures – What are the procedures to use this technology? For example: connection establishment, capability agreement, and data transfer.


PPP: Point to Point Protocol

Much of this is from lecture notes:

Service – PPP provides a connection-oriented service and, like HDLC, gives the physical layer the appearance of being an error-free link.

Purpose – To deliver the promised level of service, PPP is capable of encapsulating multiple-protocol datagrams, using a link-control-protocol for establishing, configuring, and testing the data-link connections, and using a family of Network Control Protocols (NCPs) for establishing and configuring different network-layer protocols. This provides framing, encapsulation, authentication, among others.

Packet – PPP frames look similar to the ISO HDLC standard. The fields contained are: flag, address, control, protocol, information, FCS, and flag. Each frame begins and ends with a flag field set to 0x7E. The address field is always set to 0xFF and the control field is set to 0x03. The protocol field declares the type of data/payload that is in the information field. The FCS is the frame check sequence used to detect errors in the frame.
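
To see the field layout, here is a minimal Python sketch that builds and parses a frame in that order. It assumes a 2-byte protocol field and a 2-byte FCS, skips byte stuffing, and does not compute the FCS (the value passed in is an arbitrary placeholder). The function names are my own; 0x0021 is the PPP protocol value for IPv4.

```python
import struct

FLAG, ADDRESS, CONTROL = 0x7E, 0xFF, 0x03


def build_ppp_frame(protocol: int, information: bytes, fcs: int) -> bytes:
    """flag, address, control, protocol, information, FCS, flag."""
    header = struct.pack("!BBBH", FLAG, ADDRESS, CONTROL, protocol)
    trailer = struct.pack("!HB", fcs, FLAG)
    return header + information + trailer


def parse_ppp_frame(frame: bytes) -> dict:
    """Check the fixed fields and pull out protocol, information, and FCS."""
    if len(frame) < 8 or frame[0] != FLAG or frame[-1] != FLAG:
        raise ValueError("frame must start and end with the 0x7E flag")
    address, control, protocol = struct.unpack("!BBH", frame[1:5])
    if address != ADDRESS or control != CONTROL:
        raise ValueError("unexpected address or control field")
    (fcs,) = struct.unpack("!H", frame[-3:-1])
    return {"protocol": protocol, "information": frame[5:-3], "fcs": fcs}


frame = build_ppp_frame(0x0021, b"an IP datagram", fcs=0xBEEF)  # placeholder FCS
print(frame.hex())
print(parse_ppp_frame(frame))
```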

Procedures – PPP, a reliable link layer protocol implements the following procedures: link initialization, link data transfer, link termination, and error handling.


HDLC

Service – HDLC provides a reliable Data Link layer service and as such, it gives the physical layer the appearance of being an error-free link.

Purpose – To deliver the promised level of service, a reliable Data Link level protocol such as HDLC must handle the following problems: Synchronization and framing, data transparency, data transfer, addressing, flow control, error detection, and error correction.

Packet

Procedures – HDLC, a reliable link layer protocol implements the following procedures: link initialization, link data transfer, link disconnect, and link error handling.


802.3

Much of this is from lecture notes:

Service - 802.3 is an unreliable data link layer local protocol, where each device on a network may transmit data at its own discretion. 802.3 uses a logical bus configuration, and is well suited to a network with a light to medium load.

Purpose – 802.3 provides an unreliable level service to the Network layer with no acknowledgements or traffic prioritization. Error detection but not correction is provided with a checksum mechanism.

Packet – Preamble (7), Start of Frame Delimiter (1), Destination Address (6), Source Address (6), Length Field (2), Data (0-1500), Pad (0-46), Checksum/CRC (4)

Procedures – When a node wants to transmit data in 802.3, it listens to the physical cable. If the cable is busy, it waits until it is available and then tries transmitting again. If there is a collision during transmission, both nodes which were sending data immediately stop transmitting and wait a random amount of time before attempting to retransmit.
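
The "random amount of time" can be sketched in Python. The lecture notes don't say how the wait is chosen; the version below assumes the truncated binary exponential backoff that classic Ethernet uses, where the range of possible waits doubles with each collision in a row and stops growing after ten. The function name is my own.

```python
import random


def backoff_slots(collision_count: int, rng: random.Random) -> int:
    """Pick a random wait, in slot times, after the nth collision in a row."""
    exponent = min(collision_count, 10)     # the range stops growing after 10
    return rng.randrange(2 ** exponent)     # a value from 0 to 2**exponent - 1


rng = random.Random(42)                     # seeded only so runs are repeatable
for n in range(1, 5):
    print(n, backoff_slots(n, rng))
```

Two nodes that collide are unlikely to pick the same wait, and the more collisions pile up, the wider the range they choose from.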


Ethernet

The Ethernet is very similar to its successor, 802.3, but differs in the packet definition. The Protocol Type of Ethernet was replaced with the Length Field. The Protocol Type can still exist in 802.3 as it is commonly the first bytes in the body.

Packet – Preamble (7), Start of Frame Delimiter (1), Destination Address (6), Source Address (6), Protocol Type (2), Data (0-1500), Pad (0-46), Checksum/CRC (4)
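
Since the two frame formats differ only in the two bytes after the source address, a receiver can tell them apart by value. The sketch below assumes the common convention that values up to 1500 are an 802.3 length and values of 0x0600 and above are an Ethernet protocol type. It also assumes the preamble and start of frame delimiter were already stripped by the hardware. The function names are illustrative.

```python
import struct


def format_mac(raw: bytes) -> str:
    """Render a 6-byte address as aa:bb:cc:dd:ee:ff."""
    return ":".join(f"{b:02x}" for b in raw)


def parse_frame_header(frame: bytes) -> dict:
    """Parse destination, source, and the length/type field of a frame."""
    dst, src, value = struct.unpack("!6s6sH", frame[:14])
    if value <= 1500:
        kind = "length"      # 802.3: number of data bytes that follow
    elif value >= 0x0600:
        kind = "type"        # Ethernet: protocol type, e.g. 0x0800 for IP
    else:
        kind = "undefined"   # values in between are not assigned
    return {"destination": format_mac(dst), "source": format_mac(src),
            "kind": kind, "value": value}
```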


Frame Relay protocol

The following is from lecture notes:

Service - Frame Relay is a simple data link level protocol that provides a method to transfer data (frames) very quickly from one network point to another network point(s). It provides an unreliable service and it is used in networks where the physical layer communications lines are reliable and fast.

Purpose - The Frame Relay protocol is unreliable: it provides error detection but not error correction. It has minimal overhead, provides the address functionality that is required to deliver a frame via the use of a circuit ID called a DLCI, and it provides no flow control. There is very limited congestion control. When problems arise because such techniques are not implemented, frames that cannot be delivered are discarded. As a result of this lack of reliability, upper layer protocols must provide any necessary reliability.

Packet- A Frame Relay packet begins and ends with a flag character (7E hex). After the Beginning Flag character, the next bytes contain Addressing information used to transfer the packet across the link. Specifically, these bytes indicate which virtual circuit (DLCI) to use to route the packet and if the packet is eligible for being discarded (DE bit).

There are also bits to indicate whether or not the network is becoming congested (the Forward Explicit Congestion Notification bit, FECN, and the Backward Explicit Congestion Notification bit, BECN). The congestion bits are provided for the benefit of the application (i.e., so it may take actions to prevent congestion problems from occurring). The last two bytes of the packet (prior to the Ending Flag character) contain a Cyclic Redundancy Check (CRC). The remainder of the packet consists of the payload data.
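
As a rough illustration of the addressing bytes, the sketch below decodes the common two-byte address field. It assumes the default layout: the first byte carries the upper six DLCI bits, and the second byte carries the lower four DLCI bits followed by FECN, BECN, DE, and a final extension bit. Longer address formats exist and are not handled here.

```python
def parse_frame_relay_address(first: int, second: int) -> dict:
    """Decode the two address bytes that follow the opening flag."""
    dlci = ((first >> 2) << 4) | (second >> 4)  # 6 high bits + 4 low bits
    return {
        "dlci": dlci,
        "fecn": bool(second & 0x08),  # congestion seen in the forward direction
        "becn": bool(second & 0x04),  # congestion seen in the backward direction
        "de": bool(second & 0x02),    # frame is eligible for discard
    }


# DLCI 100 with the FECN bit set
print(parse_frame_relay_address(0x18, 0x49))
```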

Procedures - There are very few procedural aspects to the Frame Relay protocol. Packets are simply routed in the network via the use of DLCIs (Data Link Connection Identifiers), with each DLCI being configured to reference a specific destination system. Procedures are defined for congestion notification via the use of BECN and FECN bits. Any packet delivery problems have to be dealt with by upper layer protocols or user applications (which are implemented in the Customer Premises Equipment - CPE.)


Internet Protocol

Service – IP is a connectionless, unreliable Network/Internet layer protocol.

Purpose – IP provides an unreliable service to the Transport layer with no acknowledgements or guarantee of delivery. It does so by using the connectionless datagram service. It relies on the Transport layer (UDP/TCP) to define the reliability of the data traffic, sequencing, and any error correction. IP may lose packets and deliver them out of order. It performs no sequencing or flow control, and its header checksum protects only the IP header, not the data. IP does, however, have options, addressing, and the capability for fragmentation and reassembly.

Packet – Version, IHL, TOS/IP Precedence, Total Length, Identification, Flags, Fragment Offset, Time to Live, Protocol, Header Checksum, Source Address, Destination Address, Options.
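
A short Python sketch of how these header fields sit in the first 20 bytes of an IPv4 packet. The function name and dictionary keys are my own. The header checksum is read but not verified.

```python
import ipaddress
import struct


def parse_ipv4_header(packet: bytes) -> dict:
    """Unpack the fixed 20-byte IPv4 header plus any options."""
    (ver_ihl, tos, total_length, ident, flags_frag,
     ttl, protocol, checksum, src, dst) = struct.unpack("!BBHHHBBH4s4s", packet[:20])
    ihl = ver_ihl & 0x0F  # header length in 32-bit words
    return {
        "version": ver_ihl >> 4,
        "ihl": ihl,
        "tos": tos,
        "total_length": total_length,
        "identification": ident,
        "flags": flags_frag >> 13,             # 3 bits: reserved, DF, MF
        "fragment_offset": flags_frag & 0x1FFF,  # in units of 8 bytes
        "ttl": ttl,
        "protocol": protocol,                  # 6 = TCP, 17 = UDP, 1 = ICMP
        "header_checksum": checksum,
        "source": str(ipaddress.IPv4Address(src)),
        "destination": str(ipaddress.IPv4Address(dst)),
        "options": packet[20:ihl * 4],
    }
```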

Procedures – IP can fragment packets that are too large for the underlying network.
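
To see what fragmentation does to the numbers, here is a small sketch that computes the fragments for a payload. It assumes a 20-byte header with no options and relies on the fact that fragment offsets are counted in units of 8 bytes, so every fragment except the last must carry a multiple of 8 data bytes. The function name is illustrative.

```python
def fragment(payload_len: int, mtu: int, header_len: int = 20) -> list:
    """Return (offset in 8-byte units, data length, more-fragments flag) tuples."""
    per_fragment = (mtu - header_len) // 8 * 8  # round down to a multiple of 8
    if per_fragment <= 0:
        raise ValueError("MTU is too small to carry any data")
    fragments = []
    start = 0
    while start < payload_len:
        length = min(per_fragment, payload_len - start)
        more = start + length < payload_len
        fragments.append((start // 8, length, more))
        start += length
    return fragments


# 4000 data bytes over a link with a 1500-byte MTU
print(fragment(4000, 1500))  # [(0, 1480, True), (185, 1480, True), (370, 1040, False)]
```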


IPSEC

Authenticates and encrypts each IP packet of the data stream. Protects data flow between a pair of hosts.

Packet – The Authentication Header (AH) operates on top of IP using IP protocol number 51. Its fields are: Next Header, Payload Length, RESERVED, Security Parameters Index, Sequence Number, and Authentication Data (the data necessary to authenticate the packet). The Encapsulating Security Payload (ESP) is also a member of the suite. Its fields are: Security Parameters Index, Sequence Number, Payload Data, Padding, and Authentication Data.

Procedures -

Internet Key Exchange (IKE) – sets up a security association by handling negotiation of protocols and algorithms, and generates the encryption and authentication keys to be used

Authentication Header (AH) – provides connectionless integrity and data origin authentication for IP datagrams, and provides protection against replay attacks

Encapsulating Security Payload (ESP) – provides confidentiality, data origin authentication, and connectionless integrity


ARP: Address Resolution Protocol

Service – ARP provides automatic mapping from IP address to MAC address.

Purpose – Due to routing, it becomes necessary to find the physical interface address when given an IP address. This is because physical addresses only have relevance within local networks and an IP allows packets to be sent across networks. ARP is simply a means of asking for ownership of an IP address.

Packet – The ARP packet has the following fields: Ethernet destination address, Ethernet source address, frame type, hardware type, protocol type, hardware size, protocol size, operation type, sender Ethernet address, sender IP address, target Ethernet address, and target IP address. Notably, the operation type describes whether the packet is an ARP request, ARP reply, RARP request, or RARP reply. The Ethernet destination address, in an ARP request, is the broadcast address.
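
The field list maps directly onto a 42-byte frame: a 14-byte Ethernet header followed by a 28-byte ARP body. The sketch below builds an ARP request for IPv4 over Ethernet. The function name and the example addresses are made up for illustration.

```python
import ipaddress
import struct

BROADCAST = b"\xff" * 6


def build_arp_request(sender_mac: bytes, sender_ip: str, target_ip: str) -> bytes:
    """Build an Ethernet frame carrying an ARP request ("who has target_ip?")."""
    ethernet = struct.pack("!6s6sH", BROADCAST, sender_mac, 0x0806)  # frame type = ARP
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                                      # hardware type: Ethernet
        0x0800,                                 # protocol type: IP
        6,                                      # hardware size
        4,                                      # protocol size
        1,                                      # operation: 1 = ARP request
        sender_mac,
        ipaddress.IPv4Address(sender_ip).packed,
        b"\x00" * 6,                            # target Ethernet address is unknown
        ipaddress.IPv4Address(target_ip).packed,
    )
    return ethernet + arp


frame = build_arp_request(bytes.fromhex("001122334455"), "192.168.0.10", "192.168.0.1")
print(len(frame))  # 42
```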

Procedures – To translate a Network Layer IP address to a Link Layer MAC address, ARP will first look at its ARP cache to determine if a translation already exists. Of course, this only takes place if the destination address belongs to the network of the current device. The ARP cache entries typically have some expiration time and thus if an entry is not found, ARP will broadcast an ARP request. This request asks “if you have this IP address, please respond.” When the owner of the IP address receives the ARP request, it will respond with an ARP reply. Upon receiving the ARP reply, the data can then be added to the ARP cache and a Link Layer frame can be built with the correct physical destination address.


ICMP: Internet Control Messaging Protocol

Service – ICMP supports IP at the Network layer. It communicates error and informational messages, which IP, being relatively simple in nature, does not.

Purpose – Since IP is unreliable, connectionless, and unacknowledged, ICMP was created to provide error reporting, diagnostics, and testing. ICMP packets can, however, be lost or discarded themselves.

Packet – ICMP messages are transmitted within an IP datagram with the following fields: type, code, checksum, and the contents. Type determines what kind of ICMP message it is and code helps specify the type even further. The checksum is calculated from the ICMP header and data.

Procedures – When a packet is inspected at the Network layer, some condition may cause an ICMP message to be generated. This ICMP packet will be sent back to the sender. In a typical case, when a router receives an IP packet with a Time-to-live at zero, it will drop the packet and send back an ICMP message reporting “Time Exceeded.”
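
As an example of the type, code, and checksum fields, the sketch below builds an ICMP echo request (the message ping sends, type 8 and code 0). The checksum is the standard 16-bit one's complement Internet checksum over the ICMP header and data. The identifier and sequence fields belong to the echo message's contents. The function names are my own.

```python
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"  # pad to a whole number of 16-bit words
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold the carries back in
    return ~total & 0xFFFF


def build_echo_request(identifier: int, sequence: int, payload: bytes) -> bytes:
    """Build an ICMP echo request with a correct checksum."""
    header = struct.pack("!BBHHH", 8, 0, 0, identifier, sequence)  # checksum = 0 for now
    checksum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", 8, 0, checksum, identifier, sequence) + payload


packet = build_echo_request(0x1234, 1, b"ping")
print(internet_checksum(packet))  # 0, which is how a receiver verifies the message
```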


UDP: User Datagram Protocol

Service – UDP is an unreliable Transport layer protocol, much like the connectionless, unreliable IP protocol beneath it. It allows applications to access IP with no bells and whistles.

Purpose – UDP provides a datagram-oriented Transport layer protocol. UDP provides no reliability, like IP. There is no guarantee that the datagrams will reach the destination. Thus, it is connectionless, data can be lost, and transmission is unreliable. Also, there is no flow control, congestion control, or segmentation. That said, UDP provides the capability for multiplexing and de-multiplexing through the use of port numbers.

Packet – Source port number, destination port number, UDP length, UDP checksum, and data. The port numbers are used for multiplexing and de-multiplexing; allowing many applications (and same applications) to use IP for network communication. For the checksum, a pseudo-header is generated with extra information for the calculation. These miscellaneous fields are source IP address, destination IP address, IP protocol field, and UDP length.
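
Here is a minimal sketch of the checksum calculation with the pseudo-header, for IPv4. The pseudo-header is only used for the calculation and is never transmitted. The function names and example addresses are assumptions for illustration.

```python
import ipaddress
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack(f"!{len(data) // 2}H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    """Source IP, destination IP, a zero byte, the IP protocol field (17), UDP length."""
    return (ipaddress.IPv4Address(src_ip).packed
            + ipaddress.IPv4Address(dst_ip).packed
            + struct.pack("!BBH", 0, 17, udp_length))


def build_udp_segment(src_ip, dst_ip, src_port, dst_port, data: bytes) -> bytes:
    """Build a UDP header plus data with the checksum filled in."""
    length = 8 + len(data)  # the header is always 8 bytes
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    checksum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + data)
    checksum = checksum or 0xFFFF  # 0 on the wire means "no checksum computed"
    return struct.pack("!HHHH", src_port, dst_port, length, checksum) + data
```

A receiver repeats the sum over the pseudo-header and the received segment. A result of zero means no error was detected.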

Procedures – Like IP, no prior connection is needed. Packets are just sent out with no need for acknowledgement. Thus no connection needs to be initiated or disconnected. No error handling is used either.


TCP: Transmission Control Protocol

Service – TCP is a reliable Transport layer protocol and provides transport-layer addressing to allow multiple software applications to simultaneously use a single IP address. It allows a pair of devices to establish a virtual connection and then pass data bi-directionally.

Purpose – TCP provides a reliable, connection-oriented service to the application layer. Like UDP, TCP supports multiplexing and de-multiplexing, identified through the use of port numbers. The checksum provides a means of error detection. The sequence number identifies each byte, which lets the receiver detect missing or duplicated data and so provides data reliability. Flow control uses the sliding-window algorithm to limit how much data the sender may transmit before waiting for acknowledgments. There is also congestion control that uses the Slow Start algorithm to prevent a device from overloading the network links.
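
A toy model of how Slow Start grows the congestion window may help. The sketch assumes the window is counted in whole segments, doubles once per round trip until it reaches a threshold, and then grows by one segment per round trip (congestion avoidance, which goes slightly beyond these notes). Loss handling is left out entirely and the function name is illustrative.

```python
def congestion_window_growth(round_trips: int, ssthresh: int) -> list:
    """Return the congestion window (in segments) at the start of each round trip."""
    cwnd = 1
    history = []
    for _ in range(round_trips):
        history.append(cwnd)
        if cwnd < ssthresh:
            cwnd = min(cwnd * 2, ssthresh)  # slow start: exponential growth
        else:
            cwnd += 1                       # congestion avoidance: linear growth
    return history


print(congestion_window_growth(6, ssthresh=8))  # [1, 2, 4, 8, 9, 10]
```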

Packet – Source Port, Destination Port, Sequence Number, Acknowledgement, (Offset/Reserved/ECN/ControlBits), Window, Checksum, Urgent Pointer, Options, Payload.
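
The fixed part of the TCP header is 20 bytes. A quick sketch of pulling the fields apart, including the control bits, is below. Options are returned as raw bytes and the checksum is not verified. The names are my own.

```python
import struct

# Control bits, listed from the least significant bit upward
FLAG_NAMES = ["FIN", "SYN", "RST", "PSH", "ACK", "URG", "ECE", "CWR"]


def parse_tcp_header(segment: bytes) -> dict:
    """Unpack the fixed 20-byte TCP header plus any options."""
    (src_port, dst_port, seq, ack, offset_flags,
     window, checksum, urgent) = struct.unpack("!HHIIHHHH", segment[:20])
    data_offset = offset_flags >> 12  # header length in 32-bit words
    flags = [name for bit, name in enumerate(FLAG_NAMES) if offset_flags & (1 << bit)]
    return {
        "source_port": src_port,
        "destination_port": dst_port,
        "sequence_number": seq,
        "acknowledgement": ack,
        "data_offset": data_offset,
        "flags": flags,
        "window": window,
        "checksum": checksum,
        "urgent_pointer": urgent,
        "options": segment[20:data_offset * 4],
        "payload": segment[data_offset * 4:],
    }
```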

Procedures – TCP protocol operations may be divided into three phases. Connections must be properly established in a multi-step handshake process (connection establishment) before entering the data transfer phase. After data transmission is completed, the connection termination closes established virtual circuits and releases all allocated resources.


RTP: Real-time Transport Protocol

Service – RTP provides an unreliable but sequenced service to transmit data. It is unreliable for timeliness and sequenced to make sure data arrives in order. Data that is not in order is dropped.

Purpose – UDP and TCP do not meet the demands of Real-Time data. Data needs to arrive in order as fast as possible.

Packet – Version, Padding, Extension, CSRC Count, Marker, Payload Type (Type of Audio/Video and encryption), Sequence Number, Timestamp, SSRC (Synchronization Source), CSRC (Contribution Source).
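
These fields fit into a 12-byte fixed header, followed by one 4-byte CSRC entry per contributing source. A small sketch of decoding it is below. Header extensions are ignored and the function name is illustrative.

```python
import struct


def parse_rtp_header(packet: bytes) -> dict:
    """Unpack the 12-byte fixed RTP header and the CSRC list."""
    first, second, sequence, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    csrc_count = first & 0x0F
    csrcs = struct.unpack(f"!{csrc_count}I", packet[12:12 + 4 * csrc_count])
    return {
        "version": first >> 6,
        "padding": bool(first & 0x20),
        "extension": bool(first & 0x10),
        "csrc_count": csrc_count,
        "marker": bool(second & 0x80),
        "payload_type": second & 0x7F,
        "sequence_number": sequence,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrc": list(csrcs),
    }
```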

Procedures – RTP provides end-to-end network transport functions suitable for applications transmitting real-time data, such as audio, video or simulation data, over multicast or unicast network services. RTP does not address resource reservation and does not guarantee quality-of-service for real-time services. The data transport is augmented by a control protocol (RTCP) to allow monitoring of the data delivery in a manner scalable to large multicast networks, and to provide minimal control and identification functionality. RTP and RTCP are designed to be independent of the underlying transport and network layers.


DNS: Domain Name System

Service – A service that uses a hierarchy of Name Servers to determine the IP Address for a human-readable domain name.

Purpose – IP addresses are necessary to send data over the Internet. However it is more common for humans to remember readable names. Thus DNS provides a way to convert these names into IP addresses.

Packet – Identification, QR, Opcode, (Many other single bit fields), Total Questions, Total Answer RRs (Resource Records), Total Authority RRs, Total Additional RRs, Questions, Answer RRs, Authority RRs, Additional RRs.

Procedures – Every machine connected to the Internet should have a local DNS server. Whenever someone attempts to hit some public domain, the request first heads to the local DNS. If the record is not cached, then it goes through a process of questioning the Root Name Servers, then a TLD Name Server, and eventually the Name Server that contains the record which is being looked up.
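
To show how the header counts and the question section fit together, here is a sketch that builds a query for an A record (an IPv4 address). It sets only the "recursion desired" flag, asks a single question, and leaves the three RR counts at zero. The function names are my own and nothing is sent over the network.

```python
import struct


def encode_name(name: str) -> bytes:
    """Encode a domain name as length-prefixed labels ending in a zero byte."""
    encoded = b""
    for label in name.rstrip(".").split("."):
        encoded += bytes([len(label)]) + label.encode("ascii")
    return encoded + b"\x00"


def build_dns_query(name: str, identification: int) -> bytes:
    """Build a DNS query message asking for the A record of a name."""
    header = struct.pack(
        "!HHHHHH",
        identification,
        0x0100,   # QR = 0 (query), Opcode = 0, recursion desired = 1
        1,        # Total Questions
        0, 0, 0,  # Answer, Authority, and Additional RR counts
    )
    question = encode_name(name) + struct.pack("!HH", 1, 1)  # type A, class IN
    return header + question


print(build_dns_query("example.com", 0x1234).hex())
```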


QOS: Quality of Service

Service - To provide some guarantee of network performance for some given application.

Purpose – With Real-Time data, it may be necessary to allot some portion of the network to a particular application. Issues that occur in networks relate to: bandwidth, delay, jitter, error rate, etc.

Packet – Using the differentiated services code point (DSCP) markings in IP, DiffServ can indicate:

• Codepoint = 000000 Best effort (Standard Packet)

• Codepoint = 101110 Expedited Forwarding (EF) – strict low latency queue
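
The codepoint occupies the upper six bits of the old TOS byte in the IP header, so marking a packet amounts to a shift by two bits. The sketch below shows the arithmetic, plus a helper that asks the operating system to mark a socket's outgoing packets. The helper assumes a platform that exposes the IP_TOS socket option (Linux does; others may differ), and both function names are my own.

```python
import socket

BEST_EFFORT = 0b000000
EXPEDITED_FORWARDING = 0b101110


def dscp_to_tos(dscp: int) -> int:
    """Place a 6-bit codepoint in the upper bits of the 8-bit TOS byte."""
    if not 0 <= dscp < 64:
        raise ValueError("DSCP is a 6-bit value")
    return dscp << 2


def mark_socket(sock: socket.socket, dscp: int) -> None:
    """Ask the OS to mark outgoing packets; support for IP_TOS varies by platform."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, dscp_to_tos(dscp))


print(hex(dscp_to_tos(EXPEDITED_FORWARDING)))  # 0xb8
```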

Procedure – A few things can be done to provide QoS. One method includes increasing bandwidth. However, it is common to mark the packet using some specific criteria (DiffServ). Then each router will examine the packet to determine how to handle it. In this case, all routers in a network with QoS must be using DiffServ for this to work.


SMTP: Simple Mail Transfer Protocol

Service – Provides a text-based way to send electronic mail.

Purpose – SMTP is a relatively simple, text-based protocol, in which a mail sender communicates with a mail receiver by issuing simple command strings and supplying necessary data over a reliable ordered data stream channel, typically a Transmission Control Protocol (TCP) connection.

Packet – SMTP uses a series of commands. HELO, MAIL FROM, RCPT TO, DATA (headers and body), QUIT.

Procedure – After the message sender (SMTP client) establishes a reliable communications channel to the message receiver (SMTP server), the session is opened with a greeting by the server, usually containing its fully qualified domain name, for example smtp.example.com. The client initiates its dialog by responding with a HELO command identifying itself in the command's parameter. With the rest of the commands, the sender can construct an e-mail message to store on the recipient's mail server.
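
The client's side of such a session can be sketched as a plain list of text lines. The example below only builds the lines in order. It does not open a connection, wait for the server's replies, or escape body lines that start with a dot, all of which a real client (such as Python's smtplib) has to handle. The names and addresses are made up.

```python
def smtp_client_lines(client_name, sender, recipient, headers, body):
    """Return the lines an SMTP client sends, in order, for one message."""
    lines = [
        f"HELO {client_name}",
        f"MAIL FROM:<{sender}>",
        f"RCPT TO:<{recipient}>",
        "DATA",
    ]
    lines += [f"{name}: {value}" for name, value in headers.items()]
    lines.append("")              # a blank line separates headers from body
    lines += body.splitlines()
    lines += [".", "QUIT"]        # a lone dot ends the message data
    return lines


for line in smtp_client_lines("client.example.org", "alice@example.org",
                              "bob@example.com", {"Subject": "Hi"}, "Hello Bob"):
    print(line)
```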


SIP: Session Initiation Protocol

Service - SIP is primarily used in setting up and tearing down voice or video calls. It has also found applications in messaging applications, such as instant messaging, and event subscription and notification.

Purpose - A motivating goal for SIP was to provide a signaling and call setup protocol for IP-based communications that can support a superset of the call processing functions and features present in the public switched telephone network (PSTN). SIP by itself does not define these features; rather, its focus is call-setup and signaling. However, it was designed to enable the construction of functionalities of network elements designated proxy servers and user agents. These are features that permit familiar telephone-like operations: dialing a number, causing a phone to ring, hearing ringback tones or a busy signal.

Packet - It is a text-based protocol, incorporating many elements of the Hypertext Transfer Protocol (HTTP) and the Simple Mail Transfer Protocol (SMTP), allowing for direct inspection by administrators. Commands include:

• REGISTER: Notify current IP address and the URLs to receive calls.

• INVITE: Used to establish a media session between user agents.

• ACK: Confirms reliable message exchanges.

• CANCEL: Terminates a pending request.

• BYE: Terminates a session between two users in a conference.

• OPTIONS: Requests information about the capabilities of a caller

Procedure - SIP employs design elements similar to the HTTP request/response transaction model. Each transaction consists of a client request that invokes a particular method or function on the server and at least one response. SIP reuses most of the header fields, encoding rules and status codes of HTTP, providing a readable text-based format.

SIP typically relies on a Proxy server to help establish a connection with a remote user. A proxy server "is an intermediary entity that acts as both a server and a client for the purpose of making requests on behalf of other clients. A proxy server primarily plays the role of routing, which means its job is to ensure that a request is sent to another entity "closer" to the targeted user. Proxies are also useful for enforcing policy (for example, making sure a user is allowed to make a call). A proxy interprets, and, if necessary, rewrites specific parts of a request message before forwarding it." "A registrar is a server that accepts REGISTER requests and places the information it receives in those requests into the location service for the domain it handles." "A redirect server is a user agent server that generates 3xx responses to requests it receives, directing the client to contact an alternate set of URIs. The redirect server allows SIP Proxy Servers to direct SIP session invitations to external domains."


Questions from Review

T/F The IETF runs the Internet and its networks.

This is not quite a T/F question. Yes and No. The IETF produces technical documents that influence how people design, use, and manage the Internet. It does not, however, run the Internet. Many parties share the responsibility of managing the Internet.


Describe the 7-Layer OSI Model:

Layer 1 is the Physical Layer. At this layer, data is physically moved across a network encoded as electronic signals. Here the specifications for the hardware, encoding/decoding, signaling, and transmission/reception are defined.


Layer 2 is the Data Link Layer that is responsible for data that is transmitted between local devices. Error detection and error handling, logical link control (LLC), media access control (MAC), and addressing are important here. LLC allows this layer to abstract the defining physical network below it. MAC provides the capability for multiple machines to share a single resource. Additionally, MAC addresses are assigned as globally unique 48-bit numbers.


Layer 3 is the Network Layer which defines network boundaries and how they can be interconnected. The key protocol at this layer is the Internet Protocol (IP), commonly referred to as the backbone of the Internet. Important services at this layer are IP addressing, fragmentation and reassembly, error handling, and routing. The IP address differs from the MAC address and is independent of hardware. However, it must be unique at the network level and has two important parts: the network id and the host id. Fragmentation and reassembly allows this layer to split up packets that are too large for the link layer. Also, routing, determining where and how to send incoming packets, occurs at this level by inspecting the IP address.


Layer 4 is the Transport Layer. TCP and UDP are the main protocols that operate at this layer. Connection-oriented and connectionless services are offered in addition to keeping track of the connections software programs are using through ports. Like the network layer, data can be fragmented here through the process of segmentation. Moreover, important features include flow control, congestion control, and multiplexing and de-multiplexing.


Layer 5 is the Session Layer. Its purpose is to establish and control sessions between software.


Layer 6, the Presentation Layer, provides the capability to translate, compress, and encrypt software data.


Layer 7, the Application Layer, makes use of all layers below it and provides the capabilities that a user or system needs on the network. There are many protocols that exist at this layer (FTP, HTTP, DHCP, NNTP, IRC, etc).


Describe the 5-Layer TCP/IP Model:

Similar to the OSI model, we have the Physical, Link (network interface), Network (Internet), Transport, and Application layers. From the bottom up:


The Physical Layer is responsible for transmitting the data over the network encoded as electronic signals.

The Link Layer handles the communication of data among local networks. Ethernet and the 802 protocols are commonly used at this layer.


At the Network Layer we have the IP protocol as well as ICMP, among others. This layer is responsible for routing and defining network boundaries.


The Transport Layer helps manage data communication across networks. It can do so with the TCP (reliable) and UDP (unreliable) protocols.


The Application Layer includes many application protocols that allow users and systems to use the network as a resource.


Describe how routers manage the water sprinklers at Fenway Park.

Personally, I do not know much about the sprinkler systems at stadiums around the US. In this day and age though, I could imagine there being some central system that controls many of the day-to-day operations. One such program may be in charge of running the sprinkler system for a set period of time.

For the sake of the question, let this system be placed in some control room that’s off limits to most employees. There were concerns that the machine running the software could be tampered with so management wanted it to be locked away safely in some server room.


However, the job of the groundskeeper is to occasionally access this software to ensure that the field is in exceptional playing condition. The network administrator allows him to access the software through a remote and water-proof laptop. This laptop can connect wirelessly to a private wireless connection within the stadium.


To access the program, the groundskeeper can use an Internet Browser to open up the link that displays the controls of the sprinkler system (with the correct credentials of course). Thus, when doing so, the laptop is communicating with some wireless router which then itself communicates to the network in which the server resides.


Describe how video traffic is carried on the Internet.

Like voice traffic, video places an emphasis on timeliness over reliability -- as a reliable service can introduce delay. Real-Time Transport Protocol (RTP), defined at the Application Layer, was introduced to help stream media over networks. Since reliability is not of the utmost importance, a UDP/IP datagram is used with RTP. UDP, unlike TCP, is unreliable and is not subjected to Flow and Congestion Control.


Thus, when video is streamed out to users, we assume the following: that the sender is capturing and compressing the data, and generating the RTP packets. The software will determine how many frames, at which rate, and what the size of the transfer will be. Larger frames will span several packets, whereas smaller frames can be squeezed into a single RTP packet.


The client’s application will receive the data in some buffer, with the capability of reordering packets that arrive out of order. Depending on the application, some algorithm may be used to delay playback until a reliable stream can be viewed. Based on the amount of packet loss and jitter (packets arriving at differing intervals), a steady and clear playback experience may be possible after waiting for enough data to arrive.
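
A receive buffer of this kind can be sketched with a small priority queue keyed on the RTP sequence number. The version below holds back a fixed number of packets and releases the oldest one each time the buffer overflows. It ignores sequence number wraparound, packet loss, and timing, and the class name and depth parameter are assumptions for illustration.

```python
import heapq


class ReorderBuffer:
    """Hold a few packets so that late arrivals can be put back in order."""

    def __init__(self, depth: int):
        self.depth = depth  # how many packets to hold before releasing any
        self._heap = []

    def push(self, sequence: int, payload: bytes) -> list:
        """Add a packet and return any packets that are ready for playback."""
        heapq.heappush(self._heap, (sequence, payload))
        released = []
        while len(self._heap) > self.depth:
            released.append(heapq.heappop(self._heap))  # lowest sequence first
        return released

    def drain(self) -> list:
        """Release everything that is left, in sequence order."""
        remaining = []
        while self._heap:
            remaining.append(heapq.heappop(self._heap))
        return remaining


buffer = ReorderBuffer(depth=2)
for seq in (1, 3, 2, 4):  # packet 3 overtakes packet 2 on the network
    for ready_seq, _ in buffer.push(seq, b"frame"):
        print("play", ready_seq)
```

A deeper buffer tolerates more reordering and jitter at the cost of a longer delay before playback starts.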

How are routers involved with Harvard’s parking meters?


With the new pay-station meters around Harvard, it is very likely that the meters are connected to the Internet or some private network. I would assume that the meters within close proximity to each other, perhaps all in Cambridge, share a common local network. Let’s assume that they are connected in a very simple manner using Ethernet switches.


Now there are two main reasons I see for these meters to be connected to the Internet. One is for credit card validation and the other is for remote administration and reporting. Let us also assume that the meters run some stripped down Operating System that allows them to do this. Hence, the local network that is home to the meters must contain some Network Router. The OS running on the meters must also be aware of this router, and the router interface that connects to this network will be considered the gateway. This router will then be connected to some ISP (Internet Service Provider) through another interface to allow data to travel remotely.


Thus, when a user uses the pay-station to purchase a 2-hour parking receipt, the meter will read the user’s credit card information and validate it with some on-line service. To reach the on-line service, the meter sends traffic through its local router, to the ISP’s router, and through some interworking until it reaches the destination. The service will respond and perhaps ask for some credentials or the credit information. In turn, the meter can send the data over and eventually expect some verification.

Additionally, it would be useful for the county/police station to be able to monitor the meters. Take for instance a case where a reckless driver crashes into a meter. The meter may be able to send out a message that travels to the local police station’s command center. Without routing and being able to connect to the Internet, it would take much longer for such an incident to be noticed.


What are some differences between H.323 and SIP?

See http://www.packetizer.com/ipmc/h323_vs_sip/

The biggest difference that I can tell is that SIP is better suited for the Internet and Internet developers. H.323 is better suited for Telephony Companies where more control is necessary.
