I recently co-organized a one-day workshop to bring together faculty who teach in Information Technology (IT) degree programs. The workshop was a pre-conference event for the annual ACM SIGITE conference held in Calgary. It was a small gathering made possible by some remaining funds from the NSF-supported HumIT project. We had a great group of faculty, shown in the photo below, who brought their enthusiasm and expertise to create a day of interesting discussion and exploration of possibilities for student participation in humanitarian FOSS projects.
The attendees were all experienced IT faculty and they had the usual dual reaction to the idea of students participating in humanitarian FOSS projects. On the one hand, they were intrigued by the excellent educational opportunity and potential motivational boost for students. On the other hand, they were cautious about the learning curve, complications, and overhead for instructors who try to integrate student humanitarian FOSS participation into their classes.
One particularly interesting part of the discussion was whether student participation was easier or more difficult for IT students compared to students in other computing majors. One of the workshop participants, Jake Miller of Pennsylvania College of Technology, suggested that it might be more difficult for IT students to participate because they would be providing services for the FOSS project rather than making contributions to the product. Heidi Ellis, co-investigator for the HumIT project, and I have raised similar conjectures about IT student participation. But what is really the issue?
The fundamental issue here seems to be whether it’s easier for a new FOSS participant to contribute to the FOSS product or to services related to the product. As an instructor trying to plan student participation, there’s great appeal to finding possible contributions to the FOSS project that are manageable within the constraints of a course. This seems to translate into factors including the following:
- Size – Something big enough to be interesting, but not so big that students can’t grasp the task and complete it within the time frame of a course
- Dependence – Something that is clearly part of the project, but not strongly interdependent with other tasks or parts of the system
- Schedule – Something that is needed as part of the project’s forward progress, but not on the critical path of project activities
It’s easy to think of tasks related to the product that have the desirable characteristics for each of these factors. Examples include fixing non-critical bugs or developing a plug-in or other largely free-standing module. On the services side, the concern seems to be that more of the IT services tasks have less desirable characteristics for student work. Size seems relatively manageable. The concerns seem to center around the other two factors above. Dependence may be more of an issue because many services tasks seem to have a greater need for knowledge of context. It also seems that more of the examples people think of for IT are items that have time pressure associated with them (e.g., providing support services to users).
The bottom line is that there definitely seems to be an additional layer of hesitation among IT instructors about whether student participation in FOSS is manageable, and I think there is merit to that opinion. But I also think that there is still plenty of room for IT students to engage in humanitarian FOSS work. We’ve had some initial success in this area, and I think broader opportunities exist. But clearly we need to provide additional demonstration of success to help IT instructors understand the opportunities.
“Big Data” is getting a lot of attention lately as a key computing area for the coming years. Even the White House has gotten involved with this year’s announcement of a Federal Big Data initiative. But exactly how big is “big data”? It’s a moving target of course, shifting with our growing ability to generate, store, and process ever larger volumes of data.
The IBM 2314 disks, introduced between 1965 and 1970, were a technical wonder in their day. But it took a whole row of large, appliance-sized units to crack 200 MB, and the “big data” of that day was mostly stored on tapes and accessible only via slow sequential processing of carts full of tape reels. Megabytes clearly qualified as big data.
Today I can beat that string of 2314 disks by an order of magnitude with a USB stick for under $20. Clearly the economics are radically different. But where does that leave the qualifying level for “big data”?
Wikipedia, that font of modern knowledge, provides an interesting perspective. A quick browse of the entries for gigabyte, terabyte, petabyte, and exabyte provides all the scale we need without even worrying about a yottabyte. The system and storage examples in the Wikipedia entries are informative:
- Megabytes clearly don’t make a blip on the Big Data horizon. The Big Data of yesteryear is a routine unit for the size of individual files today.
- Gigabytes can be covered with examples of modest amounts of image, audio, or video data that most computer users deal with routinely. A few music CDs or the video on a DVD break into gigabyte territory. There’s not much here that will impress as Big Data.
- Terabytes are just one step up the scale, but things start to get much more interesting. The examples deal with data capacities and system sizes from the last 10 to 15 years. They include the first one terabyte disk drive in 2007, about six terabytes of data for a complete dump of Wikipedia in 2010, and 45 terabytes to store 20 years of observations from the Hubble telescope. Clearly, at this point we start to enter “big data” territory.
- Petabytes start to move beyond the range of single systems. Netflix stores one petabyte of video to stream. World of Warcraft has 1.3 petabytes of game storage. Hadron Collider experiments are producing 15 petabytes of data per year. IBM and Cray are pushing the boundary of storage arrays with systems in the 100 – 500 petabyte range.
- Exabyte examples start to leave systems behind and mostly describe global scale capacities. Global Internet traffic was 21 exabytes per month in 2010. Worldwide information storage capacity was estimated at 295 exabytes of compressed information in 2007. On the other hand, astronomers are expecting 10 exabytes of data per hour from the Square Kilometre Array (SKA) telescope, although full operation is not projected until 2024.
So this scale would seem to clearly put gigabytes in the yawn category and probably below the threshold of Big Data. Terabytes clearly qualify, and probably will account for much of the Big Data efforts at the moment. Petabytes cover the really impressive data collections for today and really seem to contain the upper boundary of what even the most ambitious Big Data projects will be able to handle. Exabytes are rushing at us in the future, but mostly beyond what anyone will be able to address in the next few years.
So the bottom line: Big Data today has moved beyond Gigabytes. It is squarely in Terabytes and edging up into Petabytes. Exabytes and beyond is in the future. And we still don’t need to try to comprehend what a yottabyte is.
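For readers who prefer to see the scale as arithmetic, here is a minimal Python sketch that converts the examples quoted above into bytes and sorts them into rough tiers. The names (`to_bytes`, `humanize`, `tier`) and the tier labels are made up for this illustration, the cut-offs are just one reading of the bottom line above, and decimal (powers of 1000) units are assumed.

```python
# Decimal (SI) byte units: each step up the list is a factor of 1000.
UNITS = ["B", "kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a quantity such as (45, "TB") to a number of bytes."""
    return value * 1000 ** UNITS.index(unit)


def humanize(n_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    value = float(n_bytes)
    idx = 0
    while value >= 1000 and idx < len(UNITS) - 1:
        value /= 1000
        idx += 1
    return f"{value:g} {UNITS[idx]}"


def tier(n_bytes):
    """Rough tiers; the labels and cut-offs are assumptions for illustration."""
    if n_bytes < 10**12:
        return "below the threshold"   # megabytes and gigabytes
    if n_bytes < 10**15:
        return "squarely Big Data"     # terabytes
    if n_bytes < 10**18:
        return "upper edge"            # petabytes
    return "future"                    # exabytes and beyond


# Figures as quoted in the post; nothing here is newly measured.
EXAMPLES = {
    "Row of IBM 2314 disks": to_bytes(200, "MB"),
    "Wikipedia dump (2010)": to_bytes(6, "TB"),
    "Hubble, 20 years of observations": to_bytes(45, "TB"),
    "Netflix streaming video": to_bytes(1, "PB"),
    "Hadron Collider, per year": to_bytes(15, "PB"),
    "Global Internet traffic, per month (2010)": to_bytes(21, "EB"),
}

if __name__ == "__main__":
    for name, size in EXAMPLES.items():
        print(f"{name}: {humanize(size)} -> {tier(size)}")
```

Running the script prints one line per example with its size and tier. Moving the cut-offs up a unit every so often is probably the only maintenance this sketch would ever need.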
2314 disk – Scott Gerstenberger. Wikimedia Commons.
750 Gigabyte disk – Christian Jansky. Wikimedia Commons.
Software engineering will always be an uncomfortable fit with the traditional engineering disciplines. One of the key issues is the fact that all the other engineering disciplines create physical artifacts but software engineering does not. This difference means that the basis in physics and chemistry shared by all the other engineering disciplines is simply not relevant to software engineering.
This week I had a graphic reminder of this gap when attending the annual conference for the American Society for Engineering Education, where I presented several papers related to software engineering education. The exhibit hall was filled with vendors selling engineering education products, many of which involved equipment or scale models of large artifacts like bridges. Reflective of the relatively minor presence of software engineering at the conference, there were no vendors in the large exhibit hall who were positioned to support software engineering education.
This minor representation for software engineering reflects a national problem. Federal projections indicate that we should be graduating about five to seven times the number of computing majors that we are now graduating. Software engineering majors should be a key part of that group. “Software engineer” has even topped the list repeatedly in recent years as the best career opportunity available. And yet the number of undergraduate programs in software engineering nationally is in the low 30s and most of those programs have small numbers of students.
The lack of software engineering majors is a looming national economic problem. It’s a problem for the other engineering disciplines too. While browsing the exhibit hall at ASEE, I couldn’t help but note the extensive integration of software with all those displays of engineering equipment. With almost every exhibit, like the one shown to the right, there was a laptop or tablet that was used to provide controls or models or processing. In a profession where concern for attributes like reliability and performance is typical, the software engineer in me was inclined to guess that all that software was likely to be a weak link in many of these products. Until we start to take the challenges of software engineering more seriously, software will remain a weak link in engineering artifacts and beyond.
Data science, data analytics, and big data are all topics that have had a rising buzz in the last few years. As with many “new” tech topics, much of what these terms encompass is not new at all. There clearly are ties to existing activity in areas like data mining, decision support systems, business intelligence, visualization, etc. So what’s new and why the new terms and growing buzz?
One key to the shift in discussion clearly is the data itself. There are several categories of data that are simply exploding in size and importance. In trying to get your head around data science, it seems useful to categorize the types of data involved. My current mental model is that there are three broad categories of data that seem relevant to the discussions of data science. They are:
Human Generated Data
The volume of data published on the Web by individuals is truly one of the amazing features of our time. And the publication rate and variety of this data continue to accelerate. For anyone interested in what people are doing and thinking, this is a total game changer. Some examples of data in this category are:
- Clickstreams and navigation histories of Web activity
- Tweets – person to person message interactions
- Facebook, LinkedIn, and semi-public records of people’s lives and interactions
- Citizen science – data gathering in support of science by interested non-scientists
Device Generated Data
There have been devices that generate massive amounts of data for decades, with areas like medicine, lab science, and aerospace providing ready examples. But the number and type of devices that create large data streams accessible via the Web is rising sharply. Projecting forward to fully instrumented intelligent infrastructure implies that the history of device generated data is barely a trickle compared to the future. Some examples of data in this category are:
- Scientific devices – e.g., medical and molecular imaging
- Sensors – intelligent infrastructure
- Video and audio capture – traffic cams; security cams
Newly Accessible Data
As more and more of the world’s data shifts online, there are legacy data sources that take on new meaning. Much of this is data that was previously paper or computerized but off the Net. It includes data that may have been previously available, but that was prohibitively expensive and time consuming to access and aggregate. Examples of data in this category are:
- Real estate transactions
- Legal filings
- Price data
The iSchool at Drexel has active research efforts that address a variety of topics related to data science. Our degree programs increasingly address these topics too. And clearly the development of education for data science has just begun.
The recent announcement by Blackboard (Bb) that it was acquiring two Moodle service providers was quite interesting to anyone who follows open source in higher education. Over the years, Blackboard has emerged as a market leader in the Learning Management System (LMS) arena, through both product development and acquisition. At the same time, Blackboard has attracted considerable heat and a large dose of scorn for a patent the company filed and tried to enforce. That patent was viewed by many to be an attempt to corner the LMS market and to claim invention of many LMS features in use well before Blackboard’s supposed date of invention. Coverage of the long story and eventual Blackboard loss in the courts can be found here. Particularly for fans of open source, this sort of behavior does not make Blackboard an admired company, and acquisitions in the Moodle niche are much more likely to raise eyebrows than cheers.
It’s interesting however to see how Blackboard explains this latest move. It’s also important to note that Blackboard recently returned to being a private company after trading publicly for some years. That switch may have provided increased flexibility in strategy formation.
Blackboard’s strategy already includes multiple learning platforms due to acquisitions. The company has also broadened its scope beyond the LMS niche to address a range of educational institution application needs, including a push into areas like student services. Finally, Blackboard also grows by providing services, not just software. Taken together this means that accommodating the open source world makes sense for Blackboard in two ways:
- Enterprise sales – In the push to cover the education enterprise, Bb will sometimes be sole provider for an institution across the whole Bb product line. But much more often, like any enterprise vendor, Bb will sell some applications and need to co-exist with products from other vendors in other applications. Open source is just another flavor with which to co-exist.
- Services – To the extent that Bb is a service provider, large open source projects like Moodle and Sakai create a business opportunity. Blackboard clearly is moving to be a service player for both of these open source communities.
So, in spite of the history that seems to make Blackboard an unlikely candidate for good citizenship in open source communities, it’s not hard to see a business case for moving in that direction. And this step in the evolution of Blackboard makes an interesting case study for the continuing evolution of open source as a significant, not to be ignored, part of the software industry. Of course, the case study is still being written. And open source advocates who have followed Blackboard over the years will be excused if they want to wait to see how this plays out!
Over the few years that I’ve been exploring the open source world, I’ve come to realize that there is quite a bit about open source that most people, including most technical people, don’t understand. Since I’m a faculty type, I got beyond some of this early on by looking at research literature. As with many technical topics, the growth of open source means that it has attracted a good bit of researcher interest. See for example, Deek and McHugh or FLOSShub. Most people don’t have much tolerance for wading through research papers though, so many of the things known about open source are not widely known.
One of the misconceptions has to do with the number of developers on most projects. People seem to expect that projects have lots of developers, when just the opposite is true for most projects. Research studies show that the average number of developers across the broad sweep of FOSS projects is one per project. That’s right, most projects have a single developer!
The community team at SourceForge recently blogged about this and published a nice graph showing the distribution of developers by project. The steep drop-off in that curve tells the tale. The SourceForge “About” page currently indicates that there are 324,000 projects on the site. Of those, 269,000 have only one developer. Yes, the large and popular projects mostly have quite a few developers, but only 21 have over 100, and of those 21, only 7 are over 200.
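To put those counts in proportion, here is a small Python sketch that does the arithmetic. The variable names are made up for illustration, and the figures are simply the ones quoted above, not fresh data from the site.

```python
# Counts as quoted in the post; variable names are hypothetical.
total_projects = 324_000
single_developer = 269_000
over_100_developers = 21
over_200_developers = 7

# Projects with two or more developers.
multi_developer = total_projects - single_developer

# Share of all projects in each group, as a fraction of the total.
share_single = single_developer / total_projects
share_multi = multi_developer / total_projects
share_over_100 = over_100_developers / total_projects

if __name__ == "__main__":
    print(f"Single-developer projects: {share_single:.1%}")
    print(f"Multi-developer projects:  {multi_developer:,} ({share_multi:.1%})")
    print(f"Projects with over 100 developers: {share_over_100:.4%}")
```

Since well over half of the projects have exactly one developer, the median team size is one, assuming every project has at least one developer. That still leaves roughly 55,000 projects with some kind of team.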
This preponderance of single developer projects and overwhelming majority of projects with no more than a small development team presents a very different picture of the FOSS ecosystem. Clearly, one reason for this picture is that forges contain many projects that have been started but never really gone anywhere. But, in terms of student participation in open source, it implies lots of opportunity. Given the large number of projects, there clearly are going to be quite a few that could use some additional developers.
Finding the sweet spot on that curve of development team size is one of the challenges for getting students involved in FOSS. I don’t think there is a magic team size, but rather that team size is one of the factors that should be considered in project selection. We’ve been working on a framework to help faculty with this problem of selecting projects for student participation. This will need additional development, but we recently presented initial ideas at the ACM SIGCSE annual symposium. The paper is:
Ellis, Heidi J. C., Michelle Purcell, Gregory W. Hislop. “An Approach for Evaluating FOSS Projects for Student Participation.” Proceedings, ACM SIGCSE Symposium. March, 2012.
I’ve been increasingly involved in the world of Free and Open Source Software (FOSS) in recent years, and that involvement has made me re-think the role of blogs. Blogs always seemed like an interesting development in the evolution of the Web, but didn’t have much appeal to me personally. As I came to understand the FOSS world, however, I had to re-consider blogging. If you follow FOSS, it becomes clear fairly quickly that blogs are a key communication vehicle, and also a key mechanism to establish presence and credibility in the FOSS community. So I decided to blog as part of joining the FOSS world.
That was almost a year and a half ago. As you can see, my initial blogging effort consisted of exactly one post. I’m sure that there are many blogs that are started with a single post and stop right there, so this isn’t a surprising result. But it is interesting to consider why this might be so. In particular, it seems that professionally oriented blogs are an uneasy fit (at best) with professional life.
In my case, the profession is being a faculty member at a research university. Writing is part of the job, but not the sort of writing that appears in a blog. Academic culture is very much more about publication of polished, finished products. And publication also includes a filtering process (and stamp of approval) provided by the peer review and editing process typical of academic publications. Publications that have not been through that filtering and approval process are not valued much, and faculty have little incentive (or have actual disincentive) to spend time on other writing, like blogging.
While academic culture is relevant to my failure to blog, I’m also struck that similar cultural biases exist in the commercial world. In academia, the writing issue is primarily related to reputation of an individual. In the commercial world, the concern is much more about reputation, intellectual property, and liability of the organization that employs the blogger. But the effect is much the same in creating no incentive and some disincentive to blog.
So yes, on one level I was just “too busy” to blog. But I managed to get to a whole bunch of other things during this time of being “too busy”. Blogging never got the priority in part because the openness that a professionally oriented blog implies just doesn’t fit the culture that surrounds me. It seems that this issue applies to all attempts to marry openness principles with existing organizational cultures and personal work habits. That doesn’t seem insurmountable, but it’s an issue to remember when encouraging openness in the workplace and among students.