Learning Spark

Subtitle: None

Author: (USA) Karau et al.

Classification number:

ISBN: 9787564159214

Summary

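Data in every field keeps getting bigger. How can you work with it efficiently? This book introduces Apache Spark, an open source cluster computing system that makes data analytics fast to run. With Spark, you can quickly tackle large datasets through simple APIs in Python, Java, and Scala.

Learning Spark (reprint edition, in English), with Karau as lead author, is written by the developers of Spark and is meant to get data scientists and engineers up and running quickly. You will learn how to express parallel jobs in just a few lines of code, and the book covers applications ranging from simple batch jobs to stream processing and machine learning.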

Table of Contents

Foreword
Preface
1. Introduction to Data Analysis with Spark
  What Is Apache Spark?
  A Unified Stack
    Spark Core
    Spark SQL
    Spark Streaming
    MLlib
    GraphX
    Cluster Managers
  Who Uses Spark, and for What?
    Data Science Tasks
    Data Processing Applications
  A Brief History of Spark
  Spark Versions and Releases
  Storage Layers for Spark
2. Downloading Spark and Getting Started
  Downloading Spark
  Introduction to Spark's Python and Scala Shells
  Introduction to Core Spark Concepts
  Standalone Applications
    Initializing a SparkContext
    Building Standalone Applications
  Conclusion
3. Programming with RDDs
  RDD Basics
  Creating RDDs
  RDD Operations
    Transformations
    Actions
    Lazy Evaluation
  Passing Functions to Spark
    Python
    Scala
    Java
  Common Transformations and Actions
    Basic RDDs
    Converting Between RDD Types
  Persistence (Caching)
  Conclusion
4. Working with Key/Value Pairs
  Motivation
  Creating Pair RDDs
  Transformations on Pair RDDs
    Aggregations
    Grouping Data
    Joins
    Sorting Data
  Actions Available on Pair RDDs
  Data Partitioning (Advanced)
    Determining an RDD's Partitioner
    Operations That Benefit from Partitioning
    Operations That Affect Partitioning
    Example: PageRank
    Custom Partitioners
  Conclusion
5. Loading and Saving Your Data
  Motivation
  File Formats
    Text Files
    JSON
    Comma-Separated Values and Tab-Separated Values
    SequenceFiles
    Object Files
    Hadoop Input and Output Formats
    File Compression
  Filesystems
    Local/"Regular" FS
    Amazon S3
    HDFS
  Structured Data with Spark SQL
    Apache Hive
    JSON
  Databases
    Java Database Connectivity
    Cassandra
    HBase
    Elasticsearch
  Conclusion
6. Advanced Spark Programming
  Introduction
  Accumulators
    Accumulators and Fault Tolerance
    Custom Accumulators
  Broadcast Variables
    Optimizing Broadcasts
  Working on a Per-Partition Basis
  Piping to External Programs
  Numeric RDD Operations
  Conclusion
7. Running on a Cluster
  Introduction
  Spark Runtime Architecture
    The Driver
    Executors
    Cluster Manager
    Launching a Program
    Summary
  Deploying Applications with spark-submit
  Packaging Your Code and Dependencies
    A Java Spark Application Built with Maven
    A Scala Spark Application Built with sbt
    Dependency Conflicts
  Scheduling Within and Between Spark Applications
  Cluster Managers
    Standalone Cluster Manager
    Hadoop YARN
    Apache Mesos
    Amazon EC2
  Which Cluster Manager to Use?
  Conclusion
8. Tuning and Debugging Spark
  Configuring Spark with SparkConf
  Components of Execution: Jobs, Tasks, and Stages
  Finding Information
    Spark Web UI
    Driver and Executor Logs
  Key Performance Considerations
    Level of Parallelism
    Serialization Format
    Memory Management
    Hardware Provisioning
  Conclusion
9. Spark SQL
  Linking with Spark SQL
  Using Spark SQL in Applications
    Initializing Spark SQL
    Basic Query Example
    SchemaRDDs
    Caching
  Loading and Saving Data
    Apache Hive
    Parquet
    JSON
    From RDDs
  JDBC/ODBC Server
    Working with Beeline
    Long-Lived Tables and Queries
  User-Defined Functions
    Spark SQL UDFs
    Hive UDFs
  Spark SQL Performance
    Performance Tuning Options
  Conclusion
10. Spark Streaming
  A Simple Example
  Architecture and Abstraction
  Transformations
    Stateless Transformations
    Stateful Transformations
  Output Operations
  Input Sources
    Core Sources
    Additional Sources
    Multiple Sources and Cluster Sizing
  24/7 Operation
    Checkpointing
    Driver Fault Tolerance
    Worker Fault Tolerance
    Receiver Fault Tolerance
    Processing Guarantees
  Streaming UI
  Performance Considerations
    Batch and Window Sizes
    Level of Parallelism
    Garbage Collection and Memory Usage
  Conclusion
11. Machine Learning with MLlib
  Overview
  System Requirements
  Machine Learning Basics
    Example: Spam Classification
  Data Types
    Working with Vectors
  Algorithms
    Feature Extraction
    Statistics
    Classification and Regression
    Clustering
    Collaborative Filtering and Recommendation
    Dimensionality Reduction
    Model Evaluation
  Tips and Performance Considerations
    Preparing Features
    Configuring Algorithms
    Caching RDDs to Reuse
    Recognizing Sparsity
    Level of Parallelism
  Pipeline API
  Conclusion
Index
