Classic Papers on Big Data Systems

Recommended by: @JerryLead

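Note: This list favors papers on systems that are already widely used in industry. Many excellent papers are not included; they can be found in the proceedings of recent conferences such as SOSP, OSDI, EuroSys, USENIX ATC, SIGMOD, VLDB, NIPS, ICML, and KDD.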

Distributed Data-Parallel Processing Frameworks and Programming Models

[Google MapReduce] Jeffrey Dean, Sanjay Ghemawat:
MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004: 137-150

[Microsoft Dryad] Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, Dennis Fetterly:
Dryad: distributed data-parallel programs from sequential building blocks. EuroSys 2007: 59-72

[Microsoft DryadLINQ] Yuan Yu, Michael Isard, Dennis Fetterly, Mihai Budiu, Úlfar Erlingsson, Pradeep Kumar Gunda, Jon Currey:
DryadLINQ: A System for General-Purpose Distributed Data-Parallel Computing Using a High-Level Language. OSDI 2008: 1-14

[Google FlumeJava] Craig Chambers, Ashish Raniwala, Frances Perry, Stephen Adams, Robert R. Henry, Robert Bradshaw, Nathan Weizenbaum:
FlumeJava: easy, efficient data-parallel pipelines. PLDI 2010: 363-375

[Apache Spark Core] Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauly, Michael J. Franklin, Scott Shenker, Ion Stoica:
Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. NSDI 2012: 15-28

[Google Cloud Dataflow] Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, Sam Whittle:
The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing. PVLDB 8(12): 1792-1803 (2015)

[Apache Tez] Bikas Saha, Hitesh Shah, Siddharth Seth, Gopal Vijayaraghavan, Arun C. Murthy, Carlo Curino:
Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications. SIGMOD Conference 2015: 1357-1369

[Apache Flink] Paris Carbone, Asterios Katsifodimos, Stephan Ewen, Volker Markl, Seif Haridi, Kostas Tzoumas:
Apache Flink™: Stream and Batch Processing in a Single Engine. IEEE Data Eng. Bull. 38(4): 28-38 (2015)

SQL on Big Data

[Google Sawzall] Rob Pike, Sean Dorward, Robert Griesemer, Sean Quinlan:
Interpreting the data: Parallel analysis with Sawzall. Scientific Programming 13(4): 277-298 (2005)

[Apache Pig] Christopher Olston, Benjamin Reed, Utkarsh Srivastava, Ravi Kumar, Andrew Tomkins:
Pig latin: a not-so-foreign language for data processing. SIGMOD Conference 2008: 1099-1110

[Apache Hive] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, Raghotham Murthy:
Hive - A Warehousing Solution Over a Map-Reduce Framework. PVLDB 2(2): 1626-1629 (2009)

[Berkeley Spark Shark] Reynold S. Xin, Josh Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica:
Shark: SQL and rich analytics at scale. SIGMOD Conference 2013: 13-24

[Apache Spark SQL] Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, Matei Zaharia:
Spark SQL: Relational Data Processing in Spark. SIGMOD Conference 2015: 1383-1394

[Google Tenzing] Biswapesh Chattopadhyay, Liang Lin, Weiran Liu, Sagar Mittal, Prathyusha Aragonda, Vera Lychagina, Younghee Kwon, Michael Wong:
Tenzing A SQL Implementation On The MapReduce Framework. PVLDB 4(12): 1318-1327 (2011)

Large-Scale Graph Processing

[Google Pregel] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski:
Pregel: a system for large-scale graph processing. SIGMOD Conference 2010: 135-146

[CMU GraphLab] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, Joseph M. Hellerstein:
Distributed GraphLab: A Framework for Machine Learning in the Cloud. PVLDB 5(8): 716-727 (2012)

[CMU PowerGraph] Joseph E. Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, Carlos Guestrin:
PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. OSDI 2012: 17-30

[CMU GraphChi] Aapo Kyrola, Guy E. Blelloch, Carlos Guestrin:
GraphChi: Large-Scale Graph Computation on Just a PC. OSDI 2012: 31-46

[Apache Spark GraphX] Joseph E. Gonzalez, Reynold S. Xin, Ankur Dave, Daniel Crankshaw, Michael J. Franklin, Ion Stoica:
GraphX: Graph Processing in a Distributed Dataflow Framework. OSDI 2014: 599-613

Distributed Machine Learning

[Google DistBelief] Jeffrey Dean, Greg Corrado, Rajat Monga, Kai Chen, Matthieu Devin, Quoc V. Le, Mark Z. Mao, Marc'Aurelio Ranzato, Andrew W. Senior, Paul A. Tucker, Ke Yang, Andrew Y. Ng:
Large Scale Distributed Deep Networks. NIPS 2012: 1232-1240

[CMU Parameter Server] Mu Li, David G. Andersen, Jun Woo Park, Alexander J. Smola, Amr Ahmed, Vanja Josifovski, James Long, Eugene J. Shekita, Bor-Yiing Su:
Scaling Distributed Machine Learning with the Parameter Server. OSDI 2014: 583-598

[CMU Petuum] Eric P. Xing, Qirong Ho, Wei Dai, Jin Kyu Kim, Jinliang Wei, Seunghak Lee, Xun Zheng, Pengtao Xie, Abhimanu Kumar, Yaoliang Yu:
Petuum: A New Platform for Distributed Machine Learning on Big Data. KDD 2015: 1335-1344

[Google TensorFlow] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Gregory S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian J. Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Józefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dan Mané, Rajat Monga, Sherry Moore, Derek Gordon Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul A. Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda B. Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, Xiaoqiang Zheng:
TensorFlow: Large-Scale Machine Learning on Heterogeneous Distributed Systems. CoRR abs/1603.04467 (2016)

[Open-source MXNet] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, Zheng Zhang:
MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. CoRR abs/1512.01274 (2015)

[Apache Spark MLlib] Xiangrui Meng, Joseph K. Bradley, Burak Yavuz, Evan R. Sparks, Shivaram Venkataraman, Davies Liu, Jeremy Freeman, D. B. Tsai, Manish Amde, Sean Owen, Doris Xin, Reynold Xin, Michael J. Franklin, Reza Zadeh, Matei Zaharia, Ameet Talwalkar:
MLlib: Machine Learning in Apache Spark. CoRR abs/1505.06807 (2015)

[CMU SSP Protocol] Henggang Cui, James Cipar, Qirong Ho, Jin Kyu Kim, Seunghak Lee, Abhimanu Kumar, Jinliang Wei, Wei Dai, Gregory R. Ganger, Phillip B. Gibbons, Garth A. Gibson, Eric P. Xing:
Exploiting Bounded Staleness to Speed Up Big Data Analytics. USENIX Annual Technical Conference 2014: 37-48

Stream Processing

[Apache Spark Streaming] Matei Zaharia, Tathagata Das, Haoyuan Li, Timothy Hunter, Scott Shenker, Ion Stoica:
Discretized streams: fault-tolerant streaming computation at scale. SOSP 2013: 423-438

[Google MillWheel] Tyler Akidau, Alex Balikov, Kaya Bekiroglu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, Sam Whittle:
MillWheel: Fault-Tolerant Stream Processing at Internet Scale. PVLDB 6(11): 1033-1044 (2013)

[Microsoft TimeStream] Zhengping Qian, Yong He, Chunzhi Su, Zhuojie Wu, Hongyu Zhu, Taizhi Zhang, Lidong Zhou, Yuan Yu, Zheng Zhang:
TimeStream: reliable stream computation in the cloud. EuroSys 2013: 1-14

Resource Management and Task Scheduling

[Apache Hadoop YARN] Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, Eric Baldeschwieler:
Apache Hadoop YARN: yet another resource negotiator. SoCC 2013: 5:1-5:16

[Apache Mesos] Benjamin Hindman, Andy Konwinski, Matei Zaharia, Ali Ghodsi, Anthony D. Joseph, Randy H. Katz, Scott Shenker, Ion Stoica:
Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center. NSDI 2011

[Google Borg] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, John Wilkes:
Large-scale cluster management at Google with Borg. EuroSys 2015: 18:1-18:17
