ONUG Spring 2015 | Jason Edelman's Blog

I’ll continue to update this throughout the next two days. Feel free to issue a pull request if you’re also here at the conference and want to add to this post.

onug-logo

General

Location: Open Networking User Group (ONUG) at Columbia University

ONUG currently has 6 working groups:

NSV
SD-WAN
Virtual Network Overlays
Common Management tools across network, compute, and storage
Network State Collection, correlation, and analytics
Traffic Monitoring and Visibility

It is interesting and awesome to see that half of the working groups are all about Day 2 operations and management of networks. This is exactly what’s needed in the industry.

Sessions

Creating Business Value with Cloud Infrastructure

Speaker: Adrian Cockcroft

Developers don’t need any of that referring to NSV/NFV.
2009 developed the Cloudicorn, took knowledge gained to Battery
Docker wasn’t on anyone’s roadmap for 2014. It’s on everyone’s roadmap for 2015
2014 was the year that Enterpises finally embraced cloud and DevOps
Optimizing from IT cost to delivery and speed - Nordstrom - ended up yielding lower costs
Product IT reports into the business
Director is the highest Corp IT title
Immutable microservice deployments scales
If your QA team is saying there are too many bugs in a release, you aren’t releasing often enough - don’t you remember “CHANGE ONE THING AT TIME”. If you are doing 6 features in 6 weeks, try 2 features in 2 weeks!
If you haven’t read the Phoenix Project, get it now and read it.
Onto “how is that network managed” referring to a large global network with 1000s of devices?
- SDN? Nope!
- It’s with DNS!
  - Built A tool to do an automated DNS management across multiple vendors
  - Big flat HTTP network
  - Same for public and private API access
State of the art in networking (SDN) is VPC - this is the real competitor for Enterprise network vendors
Adrian is interested in building on top of the public clouds and extending them (including from the perspective of network)
Also recommends reading LEAN ENTERPRISE and Building Microservices
Go, go, go - 80% of the projects Adrian has seen in the past 6 months have been written in go

SD-WAN Working Group Test Results Presentation

Companies: Talari, Glue, Viptela, Cisco, Riverbed, Silver Peak, VeloCloud

Ixia heavily involved in testing each solution
- Co-authored test plan
- Certified the results
- Tests were run in vendor labs - Ixia observed remotely
What was the testing about?
- SD-WAN architecture diagram shown: MPLS and INTERNET into each site (multi-site topology)
- All vendors used same test plan, design, and simulated the same types of traffic (http, voice, etc.)

Cisco

Tested with ISR-4451 and CSR1000V
Passed all tests
- Three items are noted in yellow
- Cisco does not provide app/perf stats natively- they use a 3rd party partner LiveAction to do this
- Northbound API is on the roadmap - but LiveAction has one that can be leveraged
- FIPS validation is coming soon, not there yet.

It will be interesting to see the dynamic between Cisco and GLUE going forward as GLUE has been pretty much married to Cisco providing a similar solution to what Cisco is now trying to do on their own.

Viptela

Key to understand how this integrates into an existing network
Can integrate and place Viptela vEdge devices north of the existing L2 switch or L3 device without modifying core L2/L3 functionality
Offers virtual and physical appliances

Glue Networks

Tested with ISR-4451 and CSR 1000V
Cisco’s IWAN architecture that Glue automates
Glue automates best of breed architectures
Operate at higher level in the stack compared to the other vendors that ran tests
Northbound API is still pre-release
< 2 min for full SDWAN/IWAN feature set using 3 ZTP methods

Talari Networks

FIPS coming this year (nor next)
Open northbound API coming soon
Declined to test Zero touch deployment at branch
Has been in this market since 2007 (before SD-WAN was a thing)

VeloCloud

Tested physical and virtual appliances
Tested dual internet and MPLS/INTERNET into physical and also MPLS/INTERNET into virtual
As other vendors, FIPS and Northbound API were yellow
Tested re-routing voice flows during brownout
- No reset to the application
One-click business policy
Ensure compliance, security, and app performance

Silver Peak

Focused on WAN optimization
Accomplished this by building an overlay
VMs are first class citizens!
Over half of their deployments are running as VMs
Example: Specific policy by application such as for Office 365 vs. Facebook
FIPS is yellow - future?
Northbound API is coming as with nearly every other vendor
Uses IPsec for transport
Hitless failover
TCP sequence number stays during failover over internet/mpls links

Riverbed

Tested with Steelhead devices
Steelhead controller is where policy is defined and pushed out from there
FIPS is certified!!!
Northbound API, L2/L3 interop with directly connected switch/router, CPE in p/v form factors on commodity h/w, and dynamic traffic engineering are all yellow
Offers a Python based scripting environment - SDK on GitHub

Blows my mind so many vendors don’t have the northbound API in their shipping product.

Nuage Networks Tech Field Day Event

Started in DC networking, extended to the branch, announcing something new today!!
It’s the velocity of IT that really matters right now
- Starting with Business Requirements and going through Dev/Test, Deployments, Ops, and Customer Feedback
Dev/Test and Deployments covered in previous sessions, today the announcement is geared to Ops including Overlay and underlay correlation, service navigation, and fault management
Nuage is talking about the Open Network Approach for assessing service health compared to the two other options: “hardware centric” and “service centric” views
So what is the solution?
- The route monitor - peers with the physical network AND the Correlation Engine (aggregates p + v)
- Listens to all routing protocols updates, at least the LAYER 3 topology
- Peers via OSPF and BGP
- Uses SNMP to understand port status MAC table, etc.
- Analyze faults
- Many of the components were pulled from existing ALU SP products
Example: Nuage agent on hypervisors listening to libvirt events

Top notch announcement by Nuage today. Have to say the product that was presented and shown could literally be a standalone product regardless of what is deployed in the virtual and physical networks. Great job, Nuage. Kudos to their UI designers!

The Great Debate

Lew Tucker is representing Open Source and Charlie Giancarlo is representing Proprietary Software

Do you trust open source?
- Charlie: answering the question as it pertains to security. It depends on the number of “investigators.” Expectations need to be set properly.
- Lew: Agrees, but adds that software gets better overtime. Should not expect open source to be free.
Charlie makes a great in that the biggest consumers of open source are companies selling closed source products.
Discussions around open source governance: vendor-led open source projects vs. community-led
One of the differences is the origin of the software - was it initially developed by Yahoo? Or was it initially a blueprint (no code) such as OpenStack?
Thoughts of orgs that don’t have a governance model, but have a benevolent dictatorship such as OVS - “anything could work”
Role of ONUG: proprietary user groups have been the norm, but it should help in getting users more options and more choice
Missed a lot here, but the general consensus is, it all depends, who cares, and will a company provide support needed for open source?
Would have been nice if this was a true debate
Open source is a success if you are using it

NEC Tech Field Day Event

Here we are back at Day 2 at the Open Networking User Group. Getting started with NEC in a Tech Field Day event!

NEC has a long history in the SDN market starting with working with the Stanford team back in the 2008 time frame
They have a $3,000 SDN starter kit that includes HA licensing for their controllers and a license for up to 5 switches. Switches purchased separately from NEC or 3rd parties (including virtual switches). OF support required.
90-day free trial
NEC participated in the ONUG testing that validated product support for Network Service Chaining
- Focus on orchestration using third party virtual network functions
Policy can be global or local
Device can be located anywhere in the network - service chaining using flow filters will ensure the traffic gets to where it needs to be
Full set of REST APIs
NEC declined to perform two tests in the 10 that were provided as part of the official test plan since those specific functions are provided by partners
Partners include Palo Alto, F5, Riverbed, Radware (DDOS)
Ixia is now presenting - they were brought in to help perform and execute the test plan for the NSV working group
All tests were service agnostic - the point of the working group and tests were to validate the orchestration of the services, i.e. ensure the L2/L3/L4 of the network was able to redirect traffic as necessary to the appliances (physical + virtual)
NEC went into a demo to show the NSV functionality as well as some failover functionality
One part that is really slick is that they can show a hop by hop view of any particular traffic flow, i.e. show you in-port and out-port from every switch
They also provide a health monitor to assist in the failover of network appliances that do not provide their own functionality (or given a design that needs it)

Facebook

Generates over 3+ Billion logs resulting in only .99% of alerts
99% of the alerts are handled automatically
Quote “Engineers build robots, robots manage networks”
Correlation Engine is key - ~1200 unique master alarms
Use Case: memory leak
- Solving manually: lots of humans + coffee
- Automatically: Alarms generated, does the device have redundancy (can it be taken out of prod), if so, call drainer, take all traffic off device, active CPU is reloaded, standby CPU takes over, redundancy recovers (active is reloaded), redundancy is restored. Humans start looking at why it occurred?
Engineers are focused on writing alert handlers
Software Limitations
- Lack of programmability across control, config, and monitoring
Topology of fabric dictates config for a given device - pushed dynamically
Heat map: important to know what normal is - if problems, device is drained
Najam went into an overview of the OCP Networking Project
Facebook:
- System Integration shop turned into an Engineering shop that is actually creating new things
- FB is on a journey to convert network engineering team to software engineering team
- Hiring more software engineers - they are now attracting more software engineers
- Everyone is writing some level of code

Town Hall Meeting: Will the DevOps Model Deliver in the Enterprise?

Moderator: Graeme Hay Panelists: Najam Ahmad (Facebook), Tim Gerla (Ansible), Marc Wooldward (vArmour), Mike Dvorkin (Cisco), Dimitri Stilladis (Nuage)

Why Now?

There is a name for it. It can now be called DevOps - it inherently is more tangible, but at the same time, there is now a multitude of APIs that can enable this. - Marc

Is it more about attitude than tools?

Scale, coupled with execution, allows for no other way to manage and operate networks (Facebook). - Najam

Speed is important, but not everyone is Facebook. For Enterprises, they need controlled speed because there are massively critical apps running over the network - it’s just Facebook. - Dimitri

Najam keeps referring to DevOps as FB building their own software features and capabilities that sit on the switch - probably a little different than the traditional DevOps conversation. This was further clarified by the panel too after the fact.

Networks have become extremely complex, but complexity must be reduced before software development and DevOps approaches are integrated with the networks.

Rather than build teams across the technology, build the team around a solution (or problem looking to be solved). Platform team that has multiple skill sets. Assign the right people to a problem and great things can happen.

Looking at the problem holistically could help you come to a solution faster and be more cost effective. Facebook built out a standalone network fabric for a back-end cluster and needed a faster sample rate for NetFlow than the switch hardware could support. The network team typically wants to trade-in their hardware and get more expensive gear to get features needed. FB looked at the problem and realized they could put an agent in the kernel of every host to do line rate sampling across the data center without ever requiring NetFlow functionality in the network. This could not have happened with silos internal to FB.