April. 27, 2012 | SPEAKER: Afsin Akdogan (University of Southern California) Voronoi-based Geospatial Query Processing with MapReduce |
| Details |
| Date and Time | April. 27, 2012 3 pm | | Location | DBH 3011 |
| Speaker | Afsin Akdogan (University of Southern California) | | Title | Voronoi-based Geospatial Query Processing with MapReduce | | Abstract |
Geospatial queries (GQ) have been used in a wide variety of applications such as decision support systems, profile-based marketing, bioinformatics and GIS. Most of the existing query-answering approaches assume non-parallel processing on a single machine although GQs are intrinsically parallelizable. There are some approaches that have been designed for parallel databases and cluster systems; however, these only apply to systems with limited parallel processing capability, far from that of cloud-based platforms. In this study, I present the problem of parallel geospatial query processing with the MapReduce programming model. Our approach creates a spatial index, a Voronoi diagram, for given data points in 2D space and enables efficient processing of GQs. We evaluated the performance of our proposed techniques and compared them with their closest related work while varying the number of employed nodes.
| | Speaker Bio | Afsin Akdogan received his master’s degree in computer science from Cornell University in 2009. He received a best paper award at the IEEE Cloud Computing Technology and Science conference in 2010. He has also interned at Yahoo. He is currently working towards his Ph.D. degree in computer science at the University of Southern California, and his research focuses on cloud computing, parallel data processing languages and geo-spatial databases. |
|
April. 20, 2012 | SPEAKER: Leila Jalali (Ph.D. student in ISG) A Reflective Approach to Synchronization for Consistent Multisimulations |
| Details |
| Date and Time | April. 20, 2012 3 pm | | Location | DBH 3011 |
| Speaker | Leila Jalali (Ph.D. student in ISG) | | Title | A Reflective Approach to Synchronization for Consistent Multisimulations | | Abstract | In this talk, I consider the challenge of designing a framework that supports the integration of multiple existing autonomous simulation models into an integrated simulation environment (multisimulation). In particular, I focus on solutions for the synchronization problem in multisimulation to orchestrate consistent information flow through multiple simulators: (1) a transaction-based approach to modeling the synchronization problem in multisimulations by mapping it to a problem similar to multidatabase concurrency; we express multisimulation synchronization as a scheduling problem where the goal is to generate “correct schedules” for time advancement and data exchange across simulators that meet the dependencies without loss of concurrency, (2) a hybrid scheduling strategy which adapts itself to the “right” level of pessimism/optimism based on the state of the execution and underlying dependencies, and (3) a relaxation model for dependencies which guarantees bounded violation of consistency to support higher levels of concurrency. We also develop two key optimizations: (a) efficient checkpointing/rollback techniques, and (b) a relaxation model for dependencies which guarantees bounded violation of consistency to support higher levels of concurrency. We evaluate our proposed techniques via a detailed case study from the emergency response domain by integrating three disparate simulators – a fire simulator (CFAST), an evacuation simulator (Drillsim) and a communication simulator (LTEsim).
|
|
April. 13, 2012 (Special Time\Place) | SPEAKER: Jennifer Widom (Stanford) Data-Centric Human Computation + From 100 Students to 100,000 |
| Details |
| Date and Time | April. 13, 2012 (Special Time\Place) 11 am | | Location | DBH 6011 |
| Speaker | Jennifer Widom (Stanford) | | Title | Data-Centric Human Computation + From 100 Students to 100,000 | | Abstract | This talk will have two completely independent parts -- one related to research and the other to education.
In the first part of the talk, I'll describe our ongoing research in leveraging human computation for tasks related to data. Human computation ("crowdsourcing") augments traditional computation with the use of human abilities to solve sub-problems that are difficult for computers, e.g., object or image comparisons, information extraction, relevance judgements, and data gathering. We are addressing two different types of data-centric human computation: (1) Fundamental algorithms, such as sorting, clustering, and data cleaning, in which the basic operations (e.g., compare, filter) are performed by humans. (2) A database-system like platform in which declarative queries are posed by users, and the system orchestrates a combination of stored and crowdsourced data to answer them. Common to both areas is the need to formalize and optimize new tradeoffs among latency (humans are much slower than computers), cost (humans require real money to perform tasks), and quality (humans are inaccurate and inconsistent).
In the second part of the talk, I'll describe my recent experience teaching introductory databases to 60,000 students. Admittedly only 25,000 of them submitted their homework, and a mere 6500 achieved a strong final score. But even with 6500 students, I more than quadrupled the total number of students I've taught in my entire 18-year academic career. I began by "flipping" the way I teach my Stanford course and, as a side-effect, making all components of the course freely available online. But the big inflection point came when I offered the online course in a structured fashion with a schedule, automatically-graded assignments and exams, and most importantly a worldwide community of students. I'll cover a variety of topics related to the massive online course, both logistical and social, while avoiding speculation on the future of higher education.
| | Speaker Bio | Jennifer Widom is the Fletcher Jones Professor and Chair of the Computer Science Department at Stanford University. She received her Bachelor's degree from the Indiana University School of Music in 1982 and her Computer Science Ph.D. from Cornell University in 1987. She was a Research Staff Member at the IBM Almaden Research Center before joining the Stanford faculty in 1993. Her research interests span many aspects of nontraditional data management. She is an ACM Fellow and a member of the National Academy of Engineering and the American Academy of Arts and Sciences; she received the ACM SIGMOD Edgar F. Codd Innovations Award in 2007 and was a Guggenheim Fellow in 2000; she has served on a variety of program committees, advisory boards, and editorial boards. |
|
March. 16, 2012 (Special Time\Place) | SPEAKER: Cyrus Shahabi (USC) TransDec: A Data-Driven Framework for Decision-Making in Transportation Systems |
| Details |
| Date and Time | March. 16, 2012 (Special Time\Place) 11 am | | Location | DBH 6011 |
| Speaker | Cyrus Shahabi (USC) | | Title | TransDec: A Data-Driven Framework for Decision-Making in Transportation Systems | | Abstract | The vast amounts of transportation datasets (traffic flow, incidents, etc.) collected by various federal and
state agencies are extremely valuable in 1) real-time decision-making, planning, and management of the
transportation systems, and 2) conducting research to develop new policies to enhance the efficacy of the
transportation systems. In this talk, I will present our data-driven framework, dubbed TransDec (short
for Transportation Decision-Making), which enables real-time integration, visualization, querying, and
analysis of dynamic and archived transportation data. I will show that considering the large size of the
transportation data, variety of the data (different modalities and resolutions), and frequent changes of the
data, implementation of such a scalable system that allows for effective querying and analysis of both
archived and real-time data is an intrinsically challenging data management task. Subsequently, I will
focus on a route-planning problem where the weights on the road-network edges vary as a function of
time due to the variability of traffic congestion. I will show that naïve approaches to address this problem
are either inaccurate or slow, motivating the need for new solutions. Consequently, I will discuss our
initial approach to this problem and demonstrate its implementation within the TransDec framework. | | Speaker Bio | Cyrus Shahabi is a Professor and the Director of the Information Laboratory (InfoLAB) at the Computer
Science Department and also the Director of the NSF's Integrated Media
Systems Center (IMSC) at the University of Southern California. He is also the
CTO and co-founder of a USC spin-off, Geosemble Technologies. He
received his B.S. in Computer Engineering from Sharif University of
Technology in 1989 and then his M.S. and Ph.D. degrees in Computer
Science from the University of Southern California in May 1993 and
August 1996, respectively. He authored two books and more than one hundred fifty
research papers in the areas of databases, GIS and multimedia. Dr. Shahabi has received funding from several agencies such as NIJ, NSF, NASA, NIH, DARPA, AFRL,
and DHS as well as several industries such as Google, Microsoft, NCR, NGC, and Chevron. He was an
Associate Editor of IEEE Transactions on Parallel and Distributed Systems (TPDS) from 2004 to 2009.
He is currently on the editorial board of the VLDB Journal, IEEE Transactions on Knowledge and Data
Engineering (TKDE), ACM Computers in Entertainment and Journal of Spatial Information Science. He
is the founding chair of IEEE NetDB workshop and also the general co-chair of ACM GIS 2007, 2008
and 2009. He chaired the nomination committee of ACM SIGSPATIAL for the 2011-2014 terms. He
regularly serves on the program committee of major conferences such as VLDB, ACM SIGMOD, IEEE
ICDE, ACM SIGKDD, and ACM Multimedia. Dr. Shahabi is a recipient of the ACM Distinguished
Scientist award in 2009, the 2003 U.S. Presidential Early Career Awards for Scientists and Engineers
(PECASE), the NSF CAREER award in 2002, and the 2001 Okawa Foundation Research Grant for
Information and Telecommunications. He was the recipient of US Vietnam Education Foundation (VEF)
faculty fellowship award in 2011, an organizer of the 2011 National Academy of Engineering “Japan-
America Frontiers of Engineering” program, an invited speaker in the 2010 National Research Council
(of the National Academies) Committee on New Research Directions for the National Geospatial-
Intelligence Agency, and a participant in the 2005 National Academy of Engineering “Frontiers of
Engineering” program. |
|
March. 9, 2012 | SPEAKER: Nga Dang (Ph.D. student in ISG) QuARES: A Quality-Aware Renewable Energy-driven Sensing Framework |
| Details |
| Date and Time | March. 9, 2012 3 pm | | Location | DBH 3011 |
| Speaker | Nga Dang (Ph.D. student in ISG) | | Title | QuARES: A Quality-Aware Renewable Energy-driven Sensing Framework | | Abstract | Mobile devices, such as smartphones and tablets, are getting increasingly popular, and continue to generate record-high amounts of mobile data traffic. For example, a recent Cisco report indicates that mobile data traffic will increase 39 times by 2015, while 66% of such boost is due to video traffic. The network capacity issue may be partially addressed by deploying more cellular base stations, installing dedicated broadcast networks, or upgrading the cellular base stations to support 4G. However, these approaches all result in additional costs on new network infrastructure, and might not be fully compatible with existing mobile devices. Also, according to the report, the network capacity provided by cellular network providers is predicted to increase only 10 times by 2015, which implies that the above methods still do not meet the demand of increasing mobile traffic. A better way is moving data to other networks to reduce heavy traffic in cellular networks. In our research, we study motivations and methods to offload part of mobile traffic from cellular networks to other networks such as WiFi or Ad Hoc, which are available in most modern smartphones. These methods are cheap, practical, and easily implemented. |
|
March. 1, 2012 (Special Time\Place) | SPEAKER: Archan Misra (Singapore Management University) Real-time Mobile Sensing/Analytics and the LiveLabs Experimentation Platform |
| Details |
| Date and Time | March. 1, 2012 (Special Time\Place) 11 am | | Location | DBH 4011 |
| Speaker | Archan Misra (Singapore Management University) | | Title | Real-time Mobile Sensing/Analytics and the LiveLabs Experimentation Platform | | Abstract |
This talk explores the ongoing transformation of the mobile device into a combined “sensing and
analytics” platform, distinguished by two key features: a) efficient localized processing of sensor data
streams and b) localized coordination and distributed computation among a set of proximal mobile
nodes. I will first introduce the LiveLabs Experimentation Platform, a unique “urban behavioral testbed”
that combines innovations in wireless networks, mobile sensing and App deployment to enable an
ecosystem of industry partners to test next-generation context-based applications on approx. 30,000 real-
life users in urban environments, such as the SMU campus, 2 major shopping malls and a resort theme
park. I will then describe ongoing research on offline and near-real time energy-efficient, continuous
smartphone-based human context estimation or “activity mining”, with a special focus on how such
analytics can utilize proximity-driven social interactions. I will then briefly cover two ongoing projects
that exploit such context-sensing to: a) optimize the delivery of mobile advertising and b) perform real-
time adaptation of femtocellular indoor networks.
|
|
Feb. 17, 2012 | SPEAKER: Russell Sears (Yahoo! Research) A general purpose Log Structured Merge Tree |
| Details |
| Date and Time | Feb. 17, 2012 3 pm | | Location | DBH 3011 |
| Speaker | Russell Sears (Yahoo! Research) | | Title | A general purpose Log Structured Merge Tree | | Abstract |
Data management workloads are increasingly write-intensive and subject
to strict latency SLAs. This presents a dilemma: Traditional update
in place systems have unmatched latency properties but poor write
throughput. In contrast, existing log structured techniques
significantly improve write throughput but generally sacrifice read
performance and exhibit unacceptable latency spikes.
We begin by presenting a new performance metric: read fanout, and
argue that, along with read amplification and write amplification, it
better characterizes the real-world performance of index algorithms than
existing approaches such as asymptotic analysis and price/performance.
We then present a Log Structured Merge (LSM) tree implementation that
combines the best properties of B-Trees and log structured approaches:
(1) Unlike existing log structured trees, our implementation has
near-optimal read and scan performance, and (2) we present merge
algorithms that bound write latencies without impacting write
throughput or allowing merges to block application writes for extended
periods of time. We do this by introducing a new “spring and gear”
scheduler that ensures merges at each level of the tree make steady
progress. This allows us to avoid blocking application writes without
resorting to techniques that degrade read performance.
We use Bloom filters to improve index performance, and find that a
number of subtleties arise. First, it is important to ensure that
reads can safely stop after finding the first version of a record.
Otherwise, frequently written items will incur multiple disk
lookups. Second, many applications and data management architectures
check for preexisting values at insertion time. Avoiding the disk
seek performed by the check is crucial for such applications.
This work will appear in SIGMOD 2012.
|
|
Feb. 10, 2012 (Special Time\Place) | SPEAKER: AnHai Doan (U. Wisconsin and Walmart Labs - ex Kosmix) Social Media, Data Integration, and Human Computation |
| Details |
| Date and Time | Feb. 10, 2012 (Special Time\Place) 11 am | | Location | DBH 6011 |
| Speaker | Anhai Doan (U. Wisconsin and Walmart Labs - ex Kosmix) | | Title | Social Media, Data Integration, and Human Computation | | Abstract | Social media has emerged as a major frontier on the World-Wide Web, with applications ranging from helping teenagers track Justin Bieber to e-commerce to fostering revolutions. In this talk I will discuss our work in this area, as carried out at Wisconsin, Kosmix, and @WalmartLabs. I describe how we integrate data from 'traditional' Web sources to build a global taxonomy, greatly expand it with social-media data, then leverage it to build consumer-facing applications. Example applications include building topic pages, detecting Twitter events, and monitoring these events. I discuss the critical role of data integration and human computation in processing social media. Finally, I discuss how all of these can help the emerging area of social commerce, and why Walmart recently acquired Kosmix to make inroads into this new and exciting area. | | Speaker Bio | AnHai Doan is an Associate Professor at the University of Wisconsin-Madison. His interests cover databases, AI, and Web, with a current focus on data integration, large-scale knowledge bases, social media, crowdsourcing, human computation, and information extraction. He received the ACM Doctoral Dissertation Award in 2003, a CAREER Award in 2004, and a Sloan Fellowship in 2007. AnHai was Chief Scientist of Kosmix, a social media startup acquired by Walmart in 2011. Currently he also works as Chief Scientist of @WalmartLabs, a research and development lab devoted to integrating social and mobile data for e-commerce. |
|
Feb. 3, 2012 (Special Time\Place) | SPEAKER: Yannis Papakonstantinou (UCSD) Declarative, optimizable data-driven specifications of web and mobile applications |
| Details |
| Date and Time | Feb. 3, 2012 (Special Time\Place) 11 am | | Location | DBH 6011 |
| Speaker | Yannis Papakonstantinou (UCSD) | | Title | Declarative, optimizable data-driven specifications of web and mobile applications | | Abstract | Developers of web and mobile applications write too much low level "plumbing" code to efficiently access, integrate and coordinate application state that resides on multiple sub-systems of the architecture, and is accessed using different languages: SQL at the database server; HTML and Javascript at the browser, which in HTML5 includes its own database state; Java or other programming languages at the application server.
The FORWARD project replaces such low level code with declarative specifications. Its cornerstones are
(i) the unified application state virtual database, which enables modeling and manipulating the entire application state in an extension of SQL, named SQL++, and
(ii) specification of Ajax pages as essentially rendered views over the unified application state.
Consequently the following three problems are resolved by appropriate reduction to data management problems, where prior database research literature is leveraged and extended.
1. The partial change of Ajax pages, in response to application state changes, is reduced to an incremental view maintenance problem. Id's that retain the provenance of the page data play an instrumental efficiency role.
2. Efficient data access is reduced to semistructured query processing over an integrated view that involves large database(s) and small main memory-based sources.
3. The inherent location transparency of the specifications is exploited in order to perform computation at the appropriate location (browser vs server). More broadly, the talk discusses ongoing and future work in utilizing the increased abilities of HTML5 clients towards achieving low latency mobile web applications, while location transparency of the specifications is retained. | | Speaker Bio |
Yannis Papakonstantinou is a Professor of Computer Science and Engineering at the University of California, San Diego. His research is in the intersection of data management technologies and the web, where he has published over eighty research articles. He has given multiple tutorials and invited talks, has served on journal editorial boards and has chaired and participated in program committees for many international conferences and workshops.
Yannis was the CEO and Chief Scientist of Enosys Software, which built and commercialized an early XML-based Enterprise Information Integration platform. Enosys Software was acquired in 2003 by BEA Systems. He was the CEO and is the Chief Scientist of app2you, which has commercialized UCSD R&D on rapid development of web applications for data-driven analytics and business process management. He is the Chief Computer Scientist of a pharmaceutical spin-off startup in the area of data analytics for the pharmaceutical industry. He has been on the technical advisory boards of multiple startups, currently including Brightscope Inc.
Yannis holds a Diploma of Electrical Engineering from the National Technical University of Athens, M.S. and Ph.D. degrees in Computer Science from Stanford University (1997) and an NSF CAREER award for his work on data integration. |
|
Jan. 27, 2012 | SPEAKER: Kurt Brown (EMC/Greenplum) The Future of Big Data Analytics |
| Details |
| Date and Time | Jan. 27, 2012 3 pm | | Location | DBH 3011 |
| Speaker | Kurt Brown (EMC/Greenplum) | | Title | The Future of Big Data Analytics | | Abstract | "Big Data" and analytics have both existed in some form for as long as computing itself, but only now has technology advanced to the point that, together, they are starting to qualitatively change the way organizations and individuals perceive, understand, and predict the world around them. In this talk, I'll set Big Data Analytics in a historical context to help sort out which aspects of current technologies (hardware, software, and programming models) are simply transient artifacts and which are long-term trends, and to project where Big Data Analytics is possibly headed (from the perspective of Greenplum and EMC). | | Speaker Bio | Kurt Brown is currently Director of Advanced R&D at Greenplum/EMC.
Prior to EMC, he co-directed Intel's Berkeley Research Lab, spent 13 years with IBM in operating systems
and database R&D on the East and West coasts, and co-founded three startups in database middleware,
small business marketing services, and residential energy management.
He received his PhD in 1995 from the University of Wisconsin for work in automated database performance tuning.
|
|
Jan. 13, 2012 | SPEAKER: Thomas Bodner The Stratosphere Parallel Analysis Framework, Present and Future |
| Details |
| Date and Time | Jan. 13, 2012 3:00 pm | | Location | DBH 3011 |
| Speaker | Thomas Bodner | | Title | The Stratosphere Parallel Analysis Framework, Present and Future | | Abstract | Data-intensive computing is a much investigated topic in current research.
Next to parallel databases, new flavors of data processors have established themselves - most prominently the MapReduce programming and execution model.
The new systems provide key features that current parallel databases lack, such as flexibility in the data models, the ability to
parallelize custom functions, and fault tolerance that enables them to scale out to thousands of machines.
This talk presents the current state of the Stratosphere system, a cloud data and query processor that was released as open source in spring 2011.
The system consists of the parallel data programming model PACT, an extension of the MapReduce programming model for the specification of complex data-intensive tasks in the cloud,
and the elastic, massively parallel execution engine Nephele, a Dryad-like parallel data processor. Furthermore, I give a demo of the most recent Stratosphere release.
And finally, I report on future enhancements for Stratosphere, particularly, for the compilation, optimization and parallel execution of data-intensive operations in the system.
| | Speaker Bio | Since October 2010, Thomas Bodner has been a Master's student at the department for Database Systems and Information Management (DIMA) at the Technical University of Berlin.
Between 2007 and 2010, Thomas Bodner completed the Applied Computer Science program at the University of Cooperative Education, Stuttgart, jointly with IBM Germany as partner.
In the course of his undergraduate studies, he studied abroad for one semester at the Royal Melbourne Institute of Technology, Australia and worked as an intern at the IBM Almaden Research Center,
California, USA and the IBM Böblingen Laboratory in Germany, exploring query optimization and in-memory technologies for database management systems. His research interests include architectures
for information management, query processing and optimization, benchmarking and machine learning.
|
|
Dec. 9, 2011 (Special Time\Place) | SPEAKER: Pat Helland (Microsoft) If You Have Too Much Data, then "Good Enough" Is Good Enough |
| Details |
| Date and Time | Dec. 9, 2011 (Special Time\Place) 11 am | | Location | DBH 6011 |
| Speaker | Pat Helland (Microsoft) | | Title | If You Have Too Much Data, then "Good Enough" Is Good Enough | | Abstract | Classic database systems offer crisp answers for a relatively small amount of data. These systems hold their data in one or a relatively small number of computers. With a tightly defined schema and transactional consistency, the results returned from queries are crisp and accurate.
New systems have humongous amounts of data content, change rates, and querying rates and take lots of computers to hold and process. The data quality and meaning are fuzzy. The schema, if present, is likely to vary across the data. The origin of the data may be suspect, and its staleness may vary.
Today's data systems coalesce data from many sources. The Internet, B2B, and enterprise application integration (EAI) combine data from different places. No computer is an island. This large amount of interconnectivity and interdependency has led to a relaxation of many database principles.
In this talk, we consider some of the ways in which today's answers differ from what we used to expect.
| | Speaker Bio |
Pat Helland has been working in distributed systems, transaction processing, databases, and similar areas since 1978.
For most of the 1980s, he was the chief architect of Tandem Computers' TMF (Transaction Monitoring Facility), which provided distributed transactions for the NonStop System.
With the exception of a two-year stint at Amazon, Helland has worked at Microsoft Corporation since 1994 where he was the architect for Microsoft Transaction Server and SQL Service Broker.
Until September, 2011, he was working on Cosmos, a distributed computation and storage system that provides back-end support for Bing.
Pat recently relocated to San Francisco with his wife to be close to the grandchildren and to explore new opportunities in "Big Data" and/or "Cloud Computing".
|
|
Nov. 18, 2011 | SPEAKER: Yi Pan and Masood Mortazavi (Yahoo!) Scalability and Programming Model in Serving Storage Systems |
| Details |
| Date and Time | Nov. 18, 2011 3pm | | Location | DBH 3011 |
| Speaker | Yi Pan and Masood Mortazavi (Yahoo!) | | Title | Scalability and Programming Model in Serving Storage Systems | | Abstract | We will review some of the storage technologies Yahoo applications use in Yahoo's cloud platform. These serving storage systems can scale to extremely large numbers of records. After discussing the overall architecture of these scalable storage systems, we will focus on Sherpa (PNUTS). Sherpa is a multi-tenant, distributed, highly elastic key-value store with well-defined transaction semantics that serves data for 100s of Yahoo applications. To exemplify the type of scalability challenges we face, we will describe how we're evolving Sherpa along various dimensions. We will then focus on the programmability dimension and explain how we have implemented a highly scalable, eventually consistent indexing system for Sherpa. Design decisions we have made to balance concerns related to consistency and availability will be discussed,
and we hope to elucidate the basic questions that come up, repeatedly, when evolving such massively scalable systems while they are in operation. | | Speaker Bio | Dr. Masood Mortazavi works as a senior principal architect at Yahoo's serving storage systems group. His interests include distributed systems, scalability, multi-tenancy and cloud serving systems. Masood has also worked for Huawei Technologies, Sun Microsystems, Teknowledge and Hughes Aircraft. Masood's LinkedIn profile can be found here: http://www.linkedin.com/in/mortazavi . . . At Yahoo, he helps advance cloud platform and storage technologies.
Dr. Yi Pan graduated with a Ph.D. degree in computer science from the University of California at Irvine. He received his B.S. and M.S. degrees from Fudan University in Shanghai, China. His main interests span many areas in large scale distributed computer networks and applications. Currently, he works as a principal software engineer in Yahoo!’s Cloud Platform Group. His main goal is to push forward Yahoo!’s state-of-the-art cloud storage systems with innovative features. |
|
Nov. 4, 2011 | SPEAKER: Thomas Bodner Myriad - Parallel Data Generation on Shared-Nothing Architectures |
| Details |
| Date and Time | Nov. 4, 2011 3:30 pm | | Location | DBH 3011 |
| Speaker | Thomas Bodner | | Title | Myriad - Parallel Data Generation on Shared-Nothing Architectures | | Abstract |
The need for efficient data generation for the purposes of testing and benchmarking newly developed data-intensive computing systems has increased with the emergence of
big data problems. As synthetic data model specifications evolve over time, the data generator programs implementing these models have to be continuously adapted –
a task that might become complex as the set of model constraints grows. This talk presents Myriad - a new parallel data generation toolkit. Data generators created
with the toolkit can produce very large datasets by exploiting a completely parallel execution model, while at the same time maintaining cross-partition dependencies, correlations and distributions in the generated data.
In addition, I report on our efforts towards a benchmark suite for large-scale parallel analysis systems that uses Myriad for the generation of large social network graphs and OLAP-style relational datasets.
| | Speaker Bio | Since October 2010, Thomas Bodner has been a Master's student at the department for Database Systems and Information Management (DIMA) at the Technical University of Berlin.
Between 2007 and 2010, Thomas Bodner completed the Applied Computer Science program at the University of Cooperative Education, Stuttgart, jointly with IBM Germany as partner.
In the course of his undergraduate studies, he studied abroad for one semester at the Royal Melbourne Institute of Technology, Australia and worked as an intern at the IBM Almaden Research Center,
California, USA and the IBM Böblingen Laboratory in Germany, exploring query optimization and in-memory technologies for database management systems. His research interests include architectures
for information management, query processing and optimization, benchmarking and machine learning.
|
|
Oct. 21, 2011 | SPEAKER: David Lomet (Microsoft Research) Deuteronomy: Transaction Support for Cloud Data |
| Details |
| Date and Time | Oct. 21, 2011 3pm | | Location | DBH 3011 |
| Speaker | David Lomet (Microsoft Research) | | Title | Deuteronomy: Transaction Support for Cloud Data | | Abstract | The Deuteronomy system supports efficient and scalable ACID transactions in the cloud by decomposing the storage engine
into: (a) a transactional component (TC) that manages transactions and their "logical" concurrency control and undo/redo recovery,
and (b) a data component (DC) that knows about the access methods and supports a record-oriented interface with atomic operations,
but knows nothing about transactions. The Deuteronomy TC can be applied to data anywhere, in the cloud, local, etc. with a variety
of deployments for both the TC and DC components. In this talk, we first describe the architecture of our TC, and the considerations
that led to it. We next describe the contract between TC and DC, and how we changed the operation protocol to simplify it and make it more efficient.
We have implemented both TC and multiple DCs, and will describe our TC implementation in detail.
We will end with a few words about observed performance and scalability. | | Speaker Bio |
David Lomet is a principal researcher managing the Microsoft Research Database Group. Earlier, he worked at Digital, IBM Research, and Wang Institute.
He has a CS Ph.D. from the University of Pennsylvania. He is the author of over 100 papers (two SIGMOD "best papers") and has 45 patents.
He has served on program committees (SIGMOD, PODS, VLDB, ICDE...), was ICDE'2000 PC co-chair and VLDB'2006 PC core chair, and is on the ICDE Steering Committee
and the VLDB Board. He is TCDE Chair and has been an editor for TODS, VLDBJ, and JDPD. He is the Data Engineering Bulletin EIC, for which he received the SIGMOD Contributions Award.
He received IEEE Golden Core, Outstanding, and Meritorious Service Awards and is a Fellow of IEEE, ACM, and AAAS.
|
|
Oct. 21, 2011 (Special Time/Place) | SPEAKER: Danny Sullivan (Editor In Chief, Search Engine Land) From Search 1.0 to Search 4.0 |
| Details |
| Date and Time | Oct. 21, 2011 (Special Time/Place) 11am | | Location | DBH 6011 |
| Speaker | Danny Sullivan (Editor In Chief, Search Engine Land) | | Title | From Search 1.0 to Search 4.0 | | Abstract | When search engines first began, they focused on crawling web pages
and "words on the page" ranking analysis. That system quickly failed,
being far too easy to game. Search 2.0 gave us ranking where links
were used as votes; Search 3.0, a third generational system,
introduced blending vertical search results with web matches.
Currently underway, the fourth generational trend of Search 4.0 taps
into human signals, from social networks and personalization, to
refine search results. The "how and why" of this evolution has
unfolded. | | Speaker Bio | Widely considered a leading "search engine guru," Danny Sullivan has
been helping webmasters, marketers and everyday web users understand
how search engines work for over a decade. Danny's expertise about
search engines is often sought by the media, and he has been quoted in
places like The Wall St. Journal, USA Today, The Los Angeles Times,
Forbes, The New Yorker, Newsweek and ABC's Nightline. Danny began
covering search engines in late 1995, when he undertook a study of how
they indexed web pages. The results were published online as "A
Webmaster's Guide To Search Engines," a pioneering effort to answer
the many questions site designers and Internet publicists had about
search engines. Danny currently heads up Search Engine Land as
editor-in-chief, which covers all aspects of search marketing and
search engine news. Danny also serves as chief content officer of
Third Door Media, which owns Search Engine Land and the SMX: Search
Marketing Expo conference series. Danny also maintains a personal blog
called Daggle and microblogs on Twitter: @dannysullivan. |
|
Oct. 14, 2011 | SPEAKER: Tyson Condie (Yahoo! Research) Scal(a)ing up Machine Learning and Graph-based Analytics |
| Details |
| Date and Time | Oct. 14, 2011 3pm | | Location | DBH 3011 |
| Speaker | Tyson Condie (Yahoo! Research) | | Title | Scal(a)ing up Machine Learning and Graph-based Analytics | | Abstract |
Machine learning practitioners are increasingly interested in applying their algorithms to Big Data. Unfortunately, current high-level
languages for data analytics (e.g., Hive, Pig, Sawzall, Scope) do not fully cover this domain. One key missing ingredient is the means to
efficiently support iteration over the data. Zaharia et al. were the first to answer this call from a systems perspective with Spark.
Spark adds the notion of a working set to data-parallel workflows and has published speed-ups of 30x over Hadoop MapReduce for many machine learning and graph algorithms.
Unfortunately, Spark does not cover the whole pipeline of Big Data analytics; at Yahoo!, it is common to compose Pig, MPI and direct MapReduce program modules into workflows.
This fragmentation of individual processing steps can be a major pain, e.g., for optimization, debugging, and code readability. Our prescription for this dilemma is a new DSL
for data analytics called ScalOps. Like Pig, ScalOps combines the declarative style of SQL and the low-level procedural style of MapReduce. Like Spark, ScalOps can optimize
its runtime—the Hyracks parallel-database engine—for repeated access to data collections. ScalOps is part of a broader research agenda to explore new abstractions
for machine learning and graph-based analytics. In this talk, I will present example workflows from the machine learning domain expressed in ScalOps and their translation to Hyracks recursive query plans. |
|
Sept. 30, 2011 | SPEAKER: Grad. students System Demo |
| Details |
| Date and Time | Sept. 30, 2011 3pm | | Location | DBH 3011 |
| Speaker | Grad. students | | Title | System Demo |
|
Sept. 23, 2011 | SPEAKER: ISG members ISG Gathering |
| Details |
| Date and Time | Sept. 23, 2011 3pm | | Location | DBH 3011 |
| Speaker | ISG members | | Title | ISG Gathering |
|
June 3, 2011 | SPEAKER: Donald Kossmann Predictable Performance for Unpredictable Workloads |
| Details |
| Date and Time | June 3, 2011 2pm | | Location | DBH 3011 |
| Speaker | Donald Kossmann | | Title | Predictable Performance for Unpredictable Workloads | | Abstract | This talk presents the design of SwissBox.
SwissBox is a database appliance designed to process thousands of concurrent queries and updates with bounded
query response times and strict data freshness guarantees. The system was designed to aggressively share operations
between concurrent queries and updates. This talk shows the design of the storage manager (called Crescando)
and the design of the query processor (called SharedDB). Furthermore, the talk presents the results of
performance experiments with workloads from an airline reservation system. | | Speaker Bio | Donald Kossmann is a professor of Computer Science at ETH Zurich (Switzerland). He received his MS from the University of Karlsruhe and completed his PhD at the University of Aachen. After that, he held positions at the University of Maryland, the IBM Almaden Research Center, the University of Passau, the University of Munich, and the University of Heidelberg. He is an ACM Fellow, a member of the board of trustees of the VLDB Endowment, and was the program committee chair of the ACM SIGMOD Conf., 2009. He is a co-founder of i-TV-T (1998), XQRL Inc.
(acquired by BEA in 2002), and 28msec Inc. (2007).
His research interests lie in the area of databases and information systems. |
|
May 20, 2011 | SPEAKER: Ronen Vaisenberg Scheduling and Actuating Camera Networks to Maximize Event Detection |
| Details |
| Date and Time | May 20, 2011 2pm | | Location | DBH 3011 |
| Speaker | Ronen Vaisenberg | | Title | Scheduling and Actuating Camera Networks to Maximize Event Detection | | Abstract | A distributed camera network allows for many compelling applications, such as large-scale tracking, face recognition, occupancy monitoring or event detection.
In most practical systems, resources are either constrained or mutually exclusive. Constraints arise from network bandwidth restrictions, I/O and disk usage from writing images,
and CPU usage needed to extract features from the images. Detecting events in real time requires dynamically choosing a subset of the available sensors for processing at any given time.
Furthermore, certain camera configurations are not feasible. For example, a camera cannot zoom into two different regions in its field of view.
Zooming into a specific area in the field of view of a camera generates a high-resolution image of the region at the expense of a wider field of view.
Thus, the field of view needs to be changed dynamically to get higher-resolution images of certain regions of the space at the expense of other regions.
In order to illustrate the complexity of this problem, consider a face recognition application, which is only interested in high-resolution (by means of optical zoom)
facial images. If we always zoom into a region to look for a high-resolution face, we might miss the presence of a person in a different region and hence the opportunity to zoom in later to capture the face in the next time step.
In this talk we examine the problem of scheduling sensors for data collection and actuating them in real time to maximize some user-specified objective - e.g.,
detecting as much motion as possible or collecting as many high-resolution facial images as possible.
The main idea behind our approach is the use of sensor semantics to guide the scheduling process. We learn a dynamic probabilistic model of motion correlations
between cameras, and use the model to guide resource allocation for our sensor network.
Although previous work has leveraged probabilistic models for sensor-scheduling, our work is distinct in its focus on real-time building-monitoring using a camera network.
We validate our approach using a sensor network of a dozen cameras spread throughout a university building, recording measurements of unscripted human activity over a two week period.
We automatically learn a semantic model of typical behaviors, and show that one can significantly improve efficiency of resource allocation and actuation by exploiting this model.
|
|
May 13, 2011 (Special) | SPEAKER: Prof. John Ousterhout (Stanford) RAMCloud: Scalable High-Performance Storage Entirely in DRAM |
| Details |
| Date and Time | May 13, 2011 (Special) 11am | | Location | DBH 6011 |
| Speaker | Prof. John Ousterhout (Stanford) | | Title | RAMCloud: Scalable High-Performance Storage Entirely in DRAM | | Abstract | Disk-oriented approaches to online storage are becoming increasingly problematic: they do not scale gracefully to meet the needs of new large-scale Web applications, and improvements in disk capacity have out-stripped improvements in access speed. In this talk I will describe a new approach to datacenter storage called RAMCloud, where information is kept entirely in DRAM and large-scale systems are created by aggregating the main memories of thousands of commodity servers. A RAMCloud can provide durable and available storage with 100-1000x the throughput of disk-based systems and 100-1000x lower access latency. By combining low latency and large scale, RAMClouds will enable a new class of applications that manipulate large datasets more intensively than has ever been possible.
| | Speaker Bio | John Ousterhout is Professor (Research) of Computer Science at Stanford University. His current research focuses on infrastructure
for Web applications and cloud computing. Ousterhout's prior positions include 14 years in industry where he founded two companies (Scriptics and Electric Cloud), preceded by 14 years as Professor of Computer Science at U.C. Berkeley. He is the creator of the Tcl scripting language and is also well known for his work in distributed operating systems and file systems. Ousterhout received a BS degree in Physics from Yale University and a PhD in Computer Science from Carnegie Mellon University. He is a member of the National Academy of Engineering and has received numerous awards, including the ACM Software System Award, the ACM Grace Murray Hopper Award, the National Science Foundation Presidential Young Investigator Award, and the U.C. Berkeley Distinguished Teaching Award.
|
|
May 9, 2011 (Special) | Scaling Up to Large (Really Large) Systems |
| Details |
| Date and Time | May 9, 2011 (Special) 11am | | Location | DBH 3011 |
| Speaker | Prof. Barton P. Miller | | Title | Scaling Up to Large (Really Large) Systems | | Abstract | I will discuss the problem of developing tools and middleware for large scale
parallel environments. We are especially interested in systems, both leadership
class parallel computers and clusters that have 100,000's or even millions
of processors. The infrastructure that we have developed to address this
problem is called MRNet, the Multicast/Reduction Network. MRNet's approach
to scale is to structure control and data flow in a tree-based overlay
network (TBON) that allows for efficient request distribution and flexible
data reductions.
I will then present an overview of the MRNet design, architecture, and
computational model and then discuss several of the applications of MRNet.
The applications include scalable automated performance analysis, a vision
clustering application and, most recently, an effort to develop our first
petascale debugging tool, STAT, a scalable stack trace analyzer running
currently on 100,000's of processors on both the Cray XT and IBM BlueGene. | | Speaker Bio | Prof. Barton Miller is a Professor of Computer Sciences at the University of Wisconsin.
Bart is a product of the UC System: he received his BA degree from UC San Diego in 1977
and his MS and PhD in Computer Science from UC Berkeley in 1980 and 1984, respectively.
His research interests include distributed and parallel program performance and tools,
binary code analysis and instrumentation, computer security, scalable systems, operating
systems, and software testing. Bart is a Fellow of the ACM. |
|
May 6, 2011 | SPEAKER: Matthias Nicola, IBM A Matter of Time: Temporal Data Management in DB2 for z/OS |
| Details |
| Date and Time | May 6, 2011 2pm | | Location | DBH 3011 |
| Speaker | Matthias Nicola, IBM | | Title | A Matter of Time: Temporal Data Management in DB2 for z/OS | | Abstract | Time is a critical dimension in data management. For many enterprises
it is useful or even required to have the ability to go back in time and look at a
past state of the database. Many applications also need to manage time in their
business records, such as contract start and end dates, expiration dates, or
"effective dates" to indicate that information is valid for a certain period in the past, present, or future. This presentation
describes typical use cases for temporal data management and describes
the temporal capabilities in DB2, including system time, business time, and bitemporal support.
| | Speaker Bio | Matthias Nicola is a senior software engineer at IBM's Silicon Valley Lab, in
San Jose, CA, USA. He focuses on DB2 performance and benchmarking, XML, temporal data
management, in-database analytics, and other emerging technologies. Matthias also works
closely with customers and business partners to help them design, optimize and implement
DB2 solutions. Previously Matthias worked on data warehouse performance at Informix Software.
Matthias received his PhD in computer science from the Technical University of Aachen, Germany.
|
|
April 25, 2011 (Special) | Mining Billion-node Graphs |
| Details |
| Date and Time | April 25, 2011 (Special) 11am | | Location | DBH 6011 |
| Speaker | Prof. Christos Faloutsos, CMU | | Title | Mining Billion-node Graphs | | Abstract | What do graphs look like? How do they evolve over time? How to handle a
graph with a billion nodes? We present a comprehensive list of static
and temporal laws, and some recent observations on real graphs (like,
e.g., "eigenSpokes"). We present tools, and specifically "oddBall"
for discovering anomalies and patterns, as well as fast algorithms for
immunization. Finally, we present an overview of the PEGASUS system
which is designed to handle billion-node graphs, running on top of the
"hadoop" system. | | Speaker Bio | Christos Faloutsos is a Professor at Carnegie Mellon University. He has
received the Presidential Young Investigator Award from the National
Science Foundation (1989), the Research Contributions Award in ICDM
2006, the SIGKDD Innovations Award (2010), seventeen "best paper"
awards (including two "test of time"), and four teaching awards. He
has served as a member of the executive committee of SIGKDD; he is an
ACM Fellow; he has published over 200 refereed articles, 11 book
chapters and one monograph. He holds five patents and he has given over
30 tutorials and over 10 invited distinguished lectures. His research
interests include data mining for graphs and streams, fractals, database
performance, and indexing for multimedia and bio-informatics data.
|
|
April 22, 2011 | Algebraic Comprehensions (Database Optimization for Web 2.0 Queries) |
| Details |
| Date and Time | April 22, 2011 2pm | | Location | DBH 3011 |
| Speaker | Jerome Simeon, IBM Research T.J. Watson | | Title | Algebraic Comprehensions (Database Optimization for Web 2.0 Queries) | | Abstract | Direct support for querying is becoming a "must have" for programming languages targeting Web 2.0 and Cloud development. Most of those languages (Microsoft's Linq, University of Edinburgh's Links, EPFL's Scala, Yahoo!'s Pig Latin, IBM's Thorn, etc.) rely on the classic notion of comprehensions over collections. At the language level, comprehensions are a perfect choice, being well understood programming constructs, and capturing the expressive power of SQL iterators. At the compiler level, however, they are at odds with database optimizers which mostly rely on relational (or nested-relational) algebras. That mismatch was clearly on display during the design of XQuery, whose semantics is based on comprehensions, and for which most implementations target relational backends. We propose an alternative functional semantic formulation of XQuery to the one proposed by W3C, which is also based on comprehensions but has the benefit of corresponding precisely to compilation into a typed algebra that supports traditional database optimizations. First, this provides a formal foundation for XQuery implementations that want to ensure semantic integrity with the standard, along with modern database optimization techniques. Also, it provides key insights into the nature of database compilers that we believe are essential for the integration of database and programming languages technology. We notably discover that type systems for database algebras require an original solution to the old problem of subtyping with record concatenation, and that such a type system can eliminate the need for complex side conditions used in query language optimization.
| | Speaker Bio | Jerome Simeon is a Researcher for the Scalable XML Infrastructure Group at IBM T.J. Watson. He holds a degree in Engineering from Ecole Polytechnique, and a Ph.D. from Universite d'Orsay. Previously, Jerome worked at INRIA from 1995 to 1999, and Bell Laboratories from 1999 to 2004. His research interests include databases, programming languages, compilers, and semantics, with a focus on Web development. He has put his work into practice in areas ranging from telecommunication infrastructure, to music. He is a co-editor for five of the W3C XML Query specifications, and has published more than 50 papers in scientific journals and international conferences. He is also a project lead for the Galax open-source XQuery implementation, and a co-author of "XQuery from the Experts" (Addison Wesley, 2004).
|
|
April 15, 2011 | SPEAKER: Tyson Condie, Yahoo! Research RubySky: Exploring Big Data with Transparency and Adjustability |
| Details |
| Date and Time | April 15, 2011 2pm | | Location | DBH 3011 |
| Speaker | Tyson Condie, Yahoo! Research | | Title | RubySky: Exploring Big Data with Transparency and Adjustability | | Abstract | In this talk, I will introduce a new scripting language for ad-hoc exploration of large data sets, called RubySky.
As with several prior efforts, RubySky scripts execute either in a local environment or in the cloud (Hadoop).
Typically, cloud-based execution is highly opaque and hands-off, rendering debugging and iterative code development
very difficult. RubySky, on the other hand, aims for a more transparent and adjustable paradigm.
It includes the ability to "peek into" intermediate cloud execution pathways, integrated as a first-class language construct.
Also integrated into the language is a way for the user to make last-minute code revisions, at any point at which troublesome
data is encountered in the cloud.
Combined, these features aim to improve usability for users who develop and run single-use scripts that
explore new data sets. This is joint work with Christopher Olston at Yahoo! Research.
|
|
April 1, 2011 | Replicated Data Consistency Explained through Baseball |
| Details |
| Date and Time | April 1, 2011 2pm | | Location | DBH 3011 |
| Speaker | Doug Terry, Microsoft Research | | Title | Replicated Data Consistency Explained through Baseball | | Abstract | A variety of relaxed consistency models for replicated data have
been proposed and studied as an alternative to one-copy serializability, and
some of these are being used in cloud storage systems. The designers of
such systems particularly avoid two-phase commit for updates to
geo-replicated data that spans multiple data centers on different
continents. Instead, many cloud services, including systems from Amazon,
Yahoo, and Microsoft, have adopted techniques that provide eventual
consistency. This talk explores the hows and whys of different consistency
models. The discussion will be driven by a simple example: maintaining the
score of a baseball game. We'll see that people with various roles in the
game can tolerate and benefit from different types of consistency when
accessing the score.
| | Speaker Bio | Doug Terry is a Principal Researcher in the Microsoft Research Silicon
Valley lab. His research focuses on the design and implementation of novel
distributed systems including mobile and cloud services. He currently is
serving as Chair of ACM's Special Interest Group on Operating Systems
(SIGOPS) and as a member of the ACM Council. Prior to joining Microsoft,
Doug was the co-founder and CTO of a start-up company named Cogenia, Chief
Scientist of the Computer Science Laboratory at Xerox PARC, and an Adjunct
Professor in the Computer Science Division at U. C. Berkeley, where he still
occasionally teaches a graduate course on distributed systems. Doug has a
Ph.D. in Computer Science from U.C. Berkeley and is an ACM Fellow. |
|
Mar 31, 2011 | Cimbiosys: Content-based Replication for Mobile Devices and the Cloud |
| Details |
| Date and Time | Mar 31, 2011 11am | | Location | DBH 6011 |
| Speaker | Doug Terry, Microsoft Research | | Title | Cimbiosys: Content-based Replication for Mobile Devices and the Cloud | | Abstract | As people increasingly use mobile devices and cloud services to
share large data collections, exploiting communication proximity and
selectively replicating content is essential. Cimbiosys is a replicated
storage platform that permits each device to define its own content-based
filtering criteria and to exchange data directly with other devices. This
talk focuses on the key challenge of ensuring eventual consistency in the
face of fluid network connectivity, redefinable content filters, and
arbitrary updates. Notably, Cimbiosys guarantees that each device
eventually stores precisely those items whose latest version matches its
custom filter and represents its replication-specific metadata in a compact
form, resulting in low data synchronization overhead. This permits ad hoc
replication between newly encountered devices and frequent synchronization
between established partners, even over low bandwidth wireless networks or
across geo-distributed data centers. (This talk will be a Ted and Janice Smith
Distinguished lecture, and not at the normal time or place for ISG Seminars.)
| | Speaker Bio | Doug Terry is a Principal Researcher in the Microsoft Research Silicon
Valley lab. His research focuses on the design and implementation of novel
distributed systems including mobile and cloud services. He currently is
serving as Chair of ACM's Special Interest Group on Operating Systems
(SIGOPS) and as a member of the ACM Council. Prior to joining Microsoft,
Doug was the co-founder and CTO of a start-up company named Cogenia, Chief
Scientist of the Computer Science Laboratory at Xerox PARC, and an Adjunct
Professor in the Computer Science Division at U. C. Berkeley, where he still
occasionally teaches a graduate course on distributed systems. Doug has a
Ph.D. in Computer Science from U.C. Berkeley and is an ACM Fellow.
|
|
Mar 25, 2011 | Answering Approximate String Queries on Large Data Sets Using External Memory |
| Details |
| Date and Time | Mar 25, 2011 2pm | | Location | DBH 3011 |
| Speaker | Alexander Behm, UCI PhD student | | Title | Answering Approximate String Queries on Large Data Sets Using External Memory | | Abstract | An approximate string query finds, from a
collection of strings, those that are similar to a given query string.
Answering such queries is important in many applications such
as data cleaning and record linkage, where errors could occur
in queries as well as the data. Many existing algorithms have
focused on in-memory indexes. In this paper we investigate how
to efficiently answer such queries in a disk-based setting, by
systematically studying the effects of storing data and indexes
on disk. We devise a novel physical layout for an inverted
index to answer queries and we study how to construct it with
limited buffer space. To answer queries, we develop a cost-based,
adaptive algorithm that balances the I/O costs of retrieving
candidate matches and accessing inverted lists. Experiments
on large, real datasets verify that simply adapting existing
algorithms to a disk-based setting does not work well and that our
new techniques answer queries efficiently. Further, our solutions
significantly outperform a recent tree-based index, BED-tree.
This talk is an ICDE practice talk.
|
|
Mar 18, 2011 | SPEAKER: Pinaki Sinha Summarization of Personal Photo Collections |
| Details |
| Date and Time | Mar 18, 2011 2pm | | Location | DBH 3011 |
| Speaker | Pinaki Sinha | | Title | Summarization of Personal Photo Collections | | Abstract | The volume of personal photos hosted on photo archives and social sharing platforms
has been increasing exponentially. According to recent estimates, 6 billion photos are uploaded
to Facebook per month. It is difficult to get an overview of a large collection of personal
photos without browsing through the entire database manually. In this talk, I will discuss a
framework to generate representative subset summaries from photo collections present on personal
archives or social networks. I will define salient properties of an effective photo summary
and model summarization as an optimization of these properties, given the size constraints.
Computer vision and IR-based techniques will be used to generate summaries that "look good" and
are informative. I will also introduce information-theory-based metrics for evaluating photo
summaries based on their information content and the ability to satisfy users' information needs.
I will also discuss the manual evaluation experiments that were done to evaluate summaries.
|
|
Mar 11, 2011 | Entity resolution |
| Details |
| Date and Time | Mar 11, 2011 2pm | | Location | DBH 3011 |
|
Mar 4, 2011 | SPEAKER: Rares Vernica Efficient Processing of Set-Similarity Joins on Large Clusters |
| Details |
| Date and Time | Mar 4, 2011 2pm | | Location | DBH 3011 |
| Speaker | Rares Vernica | | Title | Efficient Processing of Set-Similarity Joins on Large Clusters |
|
Feb 23, 2011 | SPEAKER: Dr. Terence Sim Getting More From Fisher |
| Details |
| Date and Time | Feb 23, 2011 3pm | | Location | DBH 3011 |
| Speaker | Dr. Terence Sim | | Title | Getting More From Fisher | | Abstract | The Fisher Linear Discriminant (FLD) is commonly used
in classification to find a subspace that maximally separates
class patterns according to the Fisher Criterion. It was
previously proven that a pre-whitening step can be used to
truly optimize the Fisher Criterion. In this talk, we show that
more insight and more applications may be derived from this
classical technique.
First, we explore the subspaces induced by this whitened FLD.
In particular, we show how the Identity Space and Variation
Space are useful for decomposing and representing data.
We give sufficient conditions for these spaces to exist. Through
experiments we also show how these spaces may
be used for classification and image synthesis.
Second, we further extend classical Fisher to handle data exhibiting
multiple factors (modes), e.g. face images that exhibit personal
identity, illumination, and pose. We call our method Multimodal
Discriminant Analysis (MMDA), which is useful for decomposing
a dataset into independent modes. For face images, MMDA
effectively separates identity, illumination and pose into mutually
orthogonal subspaces. MMDA is based on maximizing the
Fisher Criterion on all modes simultaneously, and is therefore
well-suited for multimodal and mode-invariant pattern recognition.
We also show that MMDA may be used for dimension reduction,
and for synthesizing face images under novel illumination, and
even novel personal identity. | | Speaker Bio | Terence Sim is an Asst. Prof. at the School of Computing, National University of Singapore. He teaches an undergraduate course in computer vision, as well as a graduate course in multimedia fundamentals. For research, he works primarily in these areas: face recognition, biometrics, and computational photography. He is also interested in computer vision problems in general, such as shape-from-shading, photometric stereo, object recognition. On the side, he dabbles with some aspects of music processing, such as polyphonic music transcription. Dr. Sim serves as Vice-Chairman of the Biometrics Technical Committee (BTC), Singapore, and Chairman of the Cross-Jurisdictional and Societal Aspects Working Group (WG6) within the BTC. The interesting issues here are the legal and privacy aspects of using biometrics. He also serves as Vice-President of the Pattern Recognition and Machine Intelligence Association (PREMIA), a national professional body for pattern recognition. Dr. Sim obtained his PhD from Carnegie Mellon in 2002, his MSc from Stanford University in 1991, and his SB from MIT in 1990. |
|
Feb 16, 2011 | New Principles for Information Integration |
| Details |
| Date and Time | Feb 16, 2011 11am | | Location | DBH 4011 |
| Speaker | Laura Haas (IBM) | | Title | New Principles for Information Integration | | Abstract | Ten years ago, Clio introduced nonprocedural schema mappings to describe the relationship between data in heterogeneous schemas. This enabled powerful tools for mapping discovery and integration code generation, greatly simplifying the integration process. However, further progress is needed. We see an opportunity to raise the level of abstraction further, and propose two new principles that the next generation of integration systems should embody. Holistic information integration supports iteration across the various integration tasks, leveraging information about both schema and data to improve the integrated result. Integration independence allows applications to be independent of how, when, and where information integration takes place, making materialization and the timing of transformations an optimization decision that is transparent to applications. This talk introduces these principles and describes some promising recent work in these directions. | | Speaker Bio | Laura Haas is an IBM Fellow and has been director of computer science at IBM Almaden Research Center since 2005. Previously, Dr. Haas was responsible for Information Integration Solutions (IIS) architecture in IBM's Software Group after leading the IIS development team through its first two years. She joined the development team in 2001 as manager of DB2 UDB Query Compiler development. Before that, Dr. Haas was a research staff member and manager at the Almaden lab for nearly twenty years. In IBM Research, she worked on and managed a number of exploratory projects in distributed database systems. Dr. Haas is best known for her work on the Starburst query processor (from which DB2 UDB was developed); on Garlic, a system which allowed federation of heterogeneous data sources; and on Clio, the first semi-automatic tool for heterogeneous schema mapping. 
Garlic technology, married with DB2 UDB query processing, is the basis for the IBM WebSphere Information Server's federation capabilities, while Clio capabilities are a core differentiator in IBM’s Rational Data Architect. Dr. Haas has received several IBM awards for Outstanding Technical Achievement and Outstanding Innovation, and an IBM Corporate Award for her work on federated database technology. In 2010 she was recognized with the Anita Borg Institute Technical Leadership Award. She is a member of the National Academy of Engineering and the IBM Academy of Technology, an ACM Fellow, and Vice Chair of the board of the Computing Research Association. Dr. Haas received her PhD from the University of Texas at Austin, and her bachelor's degree from Harvard University. |
|
Feb 4, 2011 (POSTPONED) | Homebrew Databases |
| Details |
| Date and Time | Feb 4, 2011 (POSTPONED) 2pm | | Location | DBH 3011 |
|